Instruction Fetch Exception seen on TMS320C6678 board

B D86135

Other Parts Discussed in Thread: SYSBIOS

We encountered a problem where one of the DSP cores running RLDSP functionality goes into the "weeds". Our app does not yet have exception hooks installed, but by attaching CCS after the failure we found that an "instruction fetch exception" had occurred.
Analysis:

The failure is very sensitive to changes in stack utilization. This is the default/global stack, so it is shared with BIOS.
We verified that there was no corruption in the .text segment.
It does not appear to be a stack size issue. The failure happens very early during startup, before the DSP app is fully "operational". Stack usage is minimal.
The failure happens while in ISR context (in this case IPC). The IPC HWI is configured by BIOS to allow nesting; for all HWIs that we attach ourselves, we do not allow nesting. There were other interrupts, possibly overlapping with the IPC ISR. We recompiled the IPC code to disallow interrupt nesting, but the failure still occurred.
Declaring a function-local variable as volatile (this is in the ISR callback we register with IPC) makes the failure go away.
Calling different HWI_ APIs during application startup is enough for the failure not to show up.
Using a flag in the config file, we recompiled BIOS to remove asserts/diags and reduce its size, which had grown when the IPC module was brought in (this was based on direction from TI). This was necessary early on to fit in available memory. When we tried the default BIOS library provided with the tools, we were still able to fit in memory, and with it the failure does not show up.

Is BIOS without asserts/diags fully qualified by TI, to the same level as the "stock" BIOS which is installed with the tools?

over 13 years ago

0 Karl Wechsler over 13 years ago

TI__Mastermind 20805 points

BIOS without asserts/diags is fully qualified and it should work.

Which version of SYS/BIOS are you using? Are you able to use ROV? If so, you should be able to look in the Exception module and get register information. The 'NRP' register contains the nmi return pointer. NRP contains the address at/near the code that caused the exception. Put 'NRP' address in disassembly window. This should give a good clue as to what went wrong.

-Karl-

0 B D86135 over 13 years ago in reply to Karl Wechsler

Intellectual 275 points

> BIOS without asserts/diags is fully qualified and it should work.

thanks.

> Which version of SYS/BIOS are you using?

bios_6_32_05_54

>Are you able to use ROV? If so, you should be able to look in the Exception module and get register information. The 'NRP' register contains the nmi return pointer. NRP contains the address at/near the code that caused the exception. Put 'NRP' address in disassembly window. This should give a good clue as to what went wrong.

NRP is set to NULL (0x00000000) and the Exception is "Instruction Fetch Exception"

0 judahvang over 13 years ago in reply to B D86135

TI__Mastermind 32475 points

Hi,

NRP = 0x0 means that your code branched to address 0x0 and tried to execute there....thus giving you the Instruction Fetch Exception.

Are you getting any printouts from the Exception? If so, can you paste that information here? The value of B3 when the Exception occurs can tell you a lot, since this is the return address. It should give you a clue as to where your code was expecting to return. Once you have that, you should be able to determine how the code was branching to address 0x0 [most likely branching through another register that is set to 0x0].

Have you used ROV to check your stacks to make sure they are not overflowing?

Judah

0 B D86135 over 13 years ago in reply to judahvang

Intellectual 275 points

Hi Judah,

We have not registered for the Exception, so the only output is what we see post-mortem using CCS. Using ROV we see:

Exception Address: 0x00000000

Exception Decoded: Internal: Instruction fetch exception;

$addr: 0x0087cca0 (inside the stack, Stack-Origin:00872e00, Stack-size:0x0000a000 )

B3: inside MessageQ_put

And this matches what we've seen through other debug info, namely DSP-Core-0 receives its first "Notify" message from Core-1, invokes the callback in our code-space, and after we complete and give the thread back to the OS, it dies.

Regarding the stack, and as you can see above, it is sized at a relatively large 40K and at the time of failure we see it at only a few hundred bytes deep.

Looking around, there are quite a few registers set to zero. Any suggestions on which to focus and is there a key to determine what their current use is?

Etc ...

The slightest modification to the listener function's stack, e.g. adding a local variable, causes the failure to disappear.

Also while the error is reproducible the overwhelming majority of the time, every now and then, with no build change and just a power cycle, it works.

-0-

0 judahvang over 13 years ago in reply to B D86135

TI__Mastermind 32475 points

Hi,

You don't need to register for Exceptions. They should be enabled by default. You should be getting some printout in the console window if using CCS so I'm not sure why you are not seeing those prints when an Exception occurs.

What are you doing in the callback since you say that when it gives the thread back to the OS, it dies?

Your stack is large so that doesn't seem to be the problem.

Just to clarify, you need to know what the registers are at the time of the Exception. Typically BIOS prints this information to the console window. It's also stored as part of the Exception Module state through the pointer named 'excContext'. Assuming the B3 you mentioned above is at the time of the Exception, you should look at the code in the addresses right before the B3. At some point there should be a Branch instruction to one of the bad registers that contain 0. Basically you are trying to unwind what happened. That's about the best I can tell you in decoding an Exception.

Judah

0 B D86135 over 13 years ago in reply to judahvang

Intellectual 275 points

> What are you doing in the callback since you say that when it gives the thread back to the OS, it dies?

I've even stripped it down to just a simple global variable increment and it still dies.

Here's the info, let me know if you have any hints.

Up to now we were connecting CCS only post-mortem, hence we were not seeing the prints. After some hacks, we finally got it in that mode. Here's the output.

[C66xx_0] A0=0x1 A1=0x0
A2=0x0 A3=0x0
A4=0x14 A5=0x0
A6=0xc009888 A7=0x0
A8=0x1844018 A9=0x1
A10=0x866838 A11=0xffff
A12=0x866418 A13=0x0
A14=0x100 A15=0x1b4
A16=0xc3f1b88 A17=0x8
A18=0xc06f698 A19=0xc122d10
A20=0x0 A21=0x0
A22=0x51c A23=0x3ffffff
A24=0xff00ffff A25=0xc0864e0
A26=0xc073910 A27=0xc1ee120
A28=0x4 A29=0xff0012
A30=0x8 A31=0x0
B0=0x0 B1=0x0
B2=0x1 B3=0xc15e134
B4=0xc3f1d80 B5=0x15000103
B6=0x2000 B7=0xc005b20
B8=0x1000 B9=0x0
B10=0xc007424 B11=0x0
B12=0x0 B13=0xc007424
B14=0x820000 B15=0x87c978
B16=0xc06d716 B17=0x0
B18=0xc0ede10 B19=0xc074d78
B20=0xc0ed610 B21=0xffffffc0
B22=0x20f B23=0x0
B24=0x1 B25=0x1
B26=0x80 B27=0xc06e160
B28=0xff0000 B29=0x40
B30=0x0 B31=0x3fffffff
NTSR=0x1020f
ITSR=0x20f
IRP=0xc1617f4
SSR=0x0
AMR=0x0
RILC=0x0
ILC=0x0
Exception at 0x0
EFR=0x2 NRP=0x0
Internal exception: IERR=0x1
Instruction fetch exception
ti.sysbios.family.c64p.Exception: line 248: E_exceptionMin: pc = 0x0c1617f4, sp = 0x0087c978.
xdc.runtime.Error.raise: terminating execution

The Code Around pc=0x0c1617f4

0c1617d8: 01888162 ADDKPC.S2 $C$RL406 (PC+32 = 0x0c1617e0),B3,4
0c1617dc: E7000000 .fphead n, l, W, BU, nobr, nosat, 0111000b
$C$DW$L$ti_sdo_ipc_gates_GateHWSem_enter__E$5$B, $C$DW$L$ti_sdo_ipc_gates_GateHWSem_enter__E$4$E, $C$RL406:
0c1617e0: 02286264 LDW.D1T1 *+A10[3],A4
0c1617e4: 6C6E NOP 4
0c1617e6: 003C LDW.D1T1 *A4[0],A3
0c1617e8: 00006000 NOP 4
0c1617ec: 018C6264 LDW.D1T1 *+A3[3],A3
0c1617f0: 00006000 NOP 4
0c1617f4: 000C1362 B.S2X A3
0c1617f8: 01888162 ADDKPC.S2 $C$RL407 (PC+32 = 0x0c161800),B3,4
0c1617fc: E0400000 .fphead n, l, W, BU, nobr, nosat, 0000010b
$C$DW$L$ti_sdo_ipc_gates_GateHWSem_enter__E$6$B, $C$DW$L$ti_sdo_ipc_gates_Gate

Register B3 (0x0c15e134), presumably the return pointer, points into the function MessageQ_put. Here's a snippet around that value:

0c15e100: 01907C40 ADDAW.D1 A4,A3,A3
0c15e104: 7330 ADD.L1X A3,B6,A3
0c15e106: 0C6E NOP 1
0c15e108: D00C0264 [!A0] LDW.D1T1 *+A3[0],A0
0c15e10c: 00006000 NOP 4
0c15e110: 01800264 LDW.D1T1 *+A0[0],A3
0c15e114: 8046 MV.L1 A0,A4
0c15e116: 4C6E NOP 3
0c15e118: 018C8264 LDW.D1T1 *+A3[4],A3
0c15e11c: E4400000 .fphead n, l, W, BU, nobr, nosat, 0100010b
0c15e120: 00006000 NOP 4
0c15e124: 000C1362 B.S2X A3
0c15e128: 01838162 ADDKPC.S2 $C$RL78 (PC+12 = 0x0c15e12c),B3,4
$C$RL78:
0c15e12c: 01BC92E6 LDW.D2T2 *++B15[4],B3
0c15e130: 0246 MV.L1 A4,A0
0c15e132: 0626 MVK.L1 0,A4
0c15e134: D27CA358 [!A0] MVK.L1 -1,A4
0c15e138: 0C6E NOP 1
$C$L89:
0c15e13a: A1EF BNOP.S2 B3,5
0c15e13c: EA000000 .fphead n, l, W, BU, nobr, nosat, 1010000b
__c6xabi_divlli:
0c15e140: 36F7 STW.D2T2 B13,*B15--[2]
0c15e142: A5C7 || MV.L2 B3,B13

0 judahvang over 13 years ago in reply to B D86135

TI__Mastermind 32475 points

Is the code snippet starting at 0xc15e100 in MessageQ_put?

It is sort of odd that B3=0x0c15e134. I would have expected B3=0x0c15e12c, since this is what line 0x0c15e128 is doing. Nevertheless, it looks like the bad branch is on line 0x0c15e124. A3=0x0, so that would explain why NRP=0x0. The question then becomes, why is A3=0x0? A3 is loaded from A0[0], but A0=1, so A0 looks like it's bad too. Unfortunately, we lost the original value of A3, which is where A0 is derived from.

Next question is, does this always fail the very first time you get to this code?

I think in the thread you said you are running something on core0 and something on core1. Are these the same executables or different? If different have you made sure that your memory maps for the programs don't collide? Can you attach the two *.maps here?

Judah

0 B D86135 over 13 years ago in reply to judahvang

Intellectual 275 points

>Is the code snippet starting at 0xc15e100 in MessageQ_put?

that's correct.

>Next question is, does this always fail the very first time you get to this code?

Correct, it's always during receipt of the first "Notify" message sent from Core1 to Core0

>I think in the thread you said you are running something on core0 and something on core1. Are these the same executables or different? If different have you made sure that your memory maps for the programs don't collide?

They're different executables.

>Can you attach the two *.maps here?

The only shared section between the two cores should be BIOS.ipc.Shared...

Here are the overall "Memory Configuration" portions, let me know if you need more.

Core0.map

MEMORY CONFIGURATION

name                   origin   length    used     unused   attr fill
---------------------- -------- --------- -------- -------- ---- --------
L2SRAM_IBL_UNINIT      00800000 00020000  00020000 00000000 RW
L2SRAM                 00820000 00052e00  00052c70 00000190 RW X
L2SRAM_RBL_UNINIT      00872e00 0000d200  0000a000 00003200 RW
CORE0_DATA_MAJOR_TILE0 0c000000 00100000  000ff959 000006a7 RW X
CORE0_CODE_MAJOR       0c100000 00080000  0006d120 00012ee0 R X
CORE1_DATA_MAJOR_TILE1 0c180000 00040000  00000000 00040000 RW X
CORE0_CODE_MINOR       0c1c0000 00020000  00000000 00020000 R X
CORE0_DATA_MINOR_TILE0 0c1e0000 00010000  0000e268 00001d98 RW X
CORE0_DATA_MINOR_TILE1 0c1f0000 00010000  00000000 00010000 RW X
CORE0_DATA_MAJOR_TILE1 0c200000 00100000  00000000 00100000 RW X
CORE1_CODE_MAJOR       0c300000 00080000  00000000 00080000 R X
CORE1_DATA_MAJOR_TILE0 0c380000 00040000  00000000 00040000 RW X
CORE1_CODE_MINOR       0c3c0000 00020000  00000000 00020000 R X
CORE1_DATA_MINOR_TILE0 0c3e0000 00008000  00000000 00008000 RW X
CORE1_DATA_MINOR_TILE1 0c3e8000 00008000  00000000 00008000 RW X
BIOS.ipc.SharedRegion  0c3f0000 00010000  00010000 00000000 RW X

Core1.map

MEMORY CONFIGURATION

name                   origin   length    used     unused   attr fill
---------------------- -------- --------- -------- -------- ---- --------
L2SRAM_IBL_UNINIT      00800000 00020000  00020000 00000000 RW
L2SRAM                 00820000 00052e00  00052ce2 0000011e RW X
L2SRAM_RBL_UNINIT      00872e00 0000d200  0000a000 00003200 RW
CORE0_DATA_MAJOR_TILE0 0c000000 00100000  00000000 00100000 RW X
CORE0_CODE_MAJOR       0c100000 00080000  00000000 00080000 R X
CORE1_DATA_MAJOR_TILE1 0c180000 00040000  00000000 00040000 RW X
CORE0_CODE_MINOR       0c1c0000 00020000  00000000 00020000 R X
CORE0_DATA_MINOR_TILE0 0c1e0000 00010000  00000000 00010000 RW X
CORE0_DATA_MINOR_TILE1 0c1f0000 00010000  00000000 00010000 RW X
CORE0_DATA_MAJOR_TILE1 0c200000 00100000  00000000 00100000 RW X
CORE1_CODE_MAJOR       0c300000 00080000  0006ddc0 00012240 R X
CORE1_DATA_MAJOR_TILE0 0c380000 00040000  0003fcc8 00000338 RW X
CORE1_CODE_MINOR       0c3c0000 00020000  00000000 00020000 R X
CORE1_DATA_MINOR_TILE0 0c3e0000 00008000  00007470 00000b90 RW X
CORE1_DATA_MINOR_TILE1 0c3e8000 00008000  00000000 00008000 RW X
BIOS.ipc.SharedRegion  0c3f0000 00010000  00010000 00000000 RW X

0 judahvang over 13 years ago in reply to B D86135

TI__Mastermind 32475 points

Since this is failing the very first time you get here, can you put a breakpoint in MessageQ_put() and try to catch it before it fails?

What version of IPC are you using?

Correct me if I'm wrong but core1 is sending the message to core0 right? So a MessageQ_alloc must have been called somewhere on core1. Do you know from what heapMP the MessageQ was alloc'ed? Just wanting to make sure the message being sent to core0 is from the SharedRegion.

0 B D86135 over 13 years ago in reply to judahvang

Intellectual 275 points

>Since this is failing the very first time you get here, can you put a breakpoint in MessageQ_put() and try to catch it before it fails?

I am unable to get the breakpoints to trip. I tested on the DSK/Eval board and it works, but not on the real target.

The CCS Version I'm using is N201105110900.

Any issues with this CCS version on a C6678 board?

>What version of IPC are you using?

ipc_1_24_00_16

>Correct me if I'm wrong but core1 is sending the message to core0 right? So a MessageQ_alloc must have been called somewhere on core1. Do you know from what heapMP the MessageQ was alloc'ed? Just wanting to make sure the message being sent to core0 is from the SharedRegion.

Actually it's a "Notify" message. For the MessageQ messages, we do perform an explicit mem-allocation from that shared region. Perhaps for the Notify the allocation is performed internally?

0 judahvang over 13 years ago in reply to B D86135

TI__Mastermind 32475 points

So now I'm confused. It's a "Notify" message, but the exception is happening in MessageQ_put(). That doesn't make sense. MessageQ uses Notify, but if it's a standalone Notify message, that should not be going through MessageQ.

There aren't really "messages" associated with Notify. Notify does not pass a message between cores; it simply sends an interrupt to the other side. Notify does support a 32-bit payload, so you could pass a pointer to a message.

Judah

0 B D86135 over 13 years ago in reply to judahvang

Intellectual 275 points

Correct, Core1 sends a Notify message to Core0. Upon receipt and during processing of the Notification, Core0 dies during MessageQ_put().

I've not been able to track down why Core0, which is the receiver, is even invoking MessageQ_put. It is very unlikely that it was caused by an application-level MessageQ send, as the system is dying before it finishes processing the received Notify event.

0 judahvang over 13 years ago in reply to B D86135

TI__Mastermind 32475 points

A Notify_send() should not invoke a MessageQ_put() on the receiving core. If it's doing this, then some configuration or something is not right.

Have you used the ROV tool? It's a good debugging tool for BIOS programs.

Judah

0 B D86135 over 13 years ago in reply to judahvang

Intellectual 275 points

I've inspected it and have not found anything untoward. Perhaps you can give us some hints during today's meeting.

Additionally is there some low level event-logging mechanism where we can access the sequence of events (stack, interrupts, etc) that preceded the failure?

I was reading about a "trace" module in one of TI's documents, would this be useful for this type of debugging?

0 judahvang over 13 years ago in reply to B D86135

TI__Mastermind 32475 points

I think "trace" could help. I haven't used it very much myself but we can try today if we can't make any sense of what's going on.

Judah

0 Philipp Schmidbauer over 12 years ago in reply to judahvang

Intellectual 330 points

Is this problem already solved? Facing the same problem, NRP is set to 0x0.... please post your solution!

0 Philipp Schmidbauer over 12 years ago in reply to Philipp Schmidbauer

Intellectual 330 points

Solved! My task-stack section was in shared memory... moved it to L2SRAM and everything works fine!
