Instruction Fetch Exception seen on TMS320C6678 board

Other Parts Discussed in Thread: SYSBIOS

We encountered a problem where one of the DSP cores running RLDSP functionality goes into the "weeds". Our app does not yet have exception hooks installed, but by connecting CCS after the failure we found that an "instruction fetch exception" had occurred.
Analysis:

  • The failure is very sensitive to changes in stack utilization. This is default/global stack, so it is shared with BIOS.
  • We verified that there were no corruptions in .text segment
  • It does not appear to be a stack-size issue. The failure happens very early during startup, before the DSP app has become fully "operational", and stack usage is minimal.
  • The failure is happening while in ISR context (in this case IPC). IPC HWI is configured by BIOS to allow nesting. For all HWIs that we attach, we do not allow nesting. There were other interrupts, possibly overlapping with IPC ISR. We recompiled IPC code to disallow interrupt nesting, but failure still occurred.
  • Declaring a function local variable as volatile (this is in ISR callback we register with IPC) makes failure go away.
  • Calling different HWI_ APIs during application startup is enough for failure not to show up.
  • Through a flag in the config file, we recompile BIOS to remove asserts/diags and reduce its size, which had increased when we brought in the IPC module (this is based on direction from TI). This was necessary early on to fit in available memory. When we tried the default BIOS lib provided with the tools, we were still able to fit in memory, and with it the failure does not show up.
  • Is BIOS without asserts/diags fully qualified by TI, to the same level as the "stock" BIOS which is installed with the tools?

  • BIOS without asserts/diags is fully qualified and it should work.

    Which version of SYS/BIOS are you using?   Are you able to use ROV?   If so, you should be able to look in the Exception module and get register information.  The 'NRP' register contains the nmi return pointer.   NRP contains the address at/near the code that caused the exception.   Put 'NRP' address in disassembly window.  This should give a good clue as to what went wrong.

    -Karl-

  • > BIOS without asserts/diags is fully qualified and it should work.

    thanks.  

    > Which version of SYS/BIOS are you using?  

    bios_6_32_05_54

    >Are you able to use ROV?   If so, you should be able to look in the Exception module and get register information.  The 'NRP' register contains the nmi return pointer.   NRP contains the address at/near the code that caused the exception.   Put 'NRP' address in disassembly window.  This should give a good clue as to what went wrong.

    NRP is set to NULL (0x00000000) and the Exception is "Instruction Fetch Exception"

  • Hi,

    NRP = 0x0 means that your code branched to address 0x0 and tried to execute there....thus giving you the Instruction Fetch Exception.

    Are you getting any print outs from the Exception?  If so can you paste that information here.  The value of B3 when the Exception occurs can tell you a lot since this is the return addr.  It should give you a clue as to where your code was expecting to return.  Once you have that, you should be able to determine how the code was branching to address 0x0 [most likely branching through another register that is set to 0x0].

    Have you used ROV to check your stacks to make sure they are not overflowing?

    Judah

  • Hi Judah,

       we have not registered for the Exception so the only output is what we see post-mortem using CCS and using ROV we see 

                  Exception Address: 0x00000000

                  Exception Decoded:  Internal: Instruction fetch exception; 

                  $addr: 0x0087cca0 (inside the stack, Stack-Origin:00872e00, Stack-size:0x0000a000  ) 

                  B3: inside MessageQ_put

     

    And this matches what we've seen through other debug info, mainly DSP-Core-0 receives its first "Notify" message from Core-1, invokes the callback in our code-space, and after we complete and give the thread back to the OS, it dies.

    Regarding the stack, and as you can see above, it is sized at a relatively large 40K and at the time of failure we see it at only a few hundred bytes deep.
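
As a rough cross-check of the "few hundred bytes" figure, the depth can be computed from the ROV values above. This is only a sketch: it assumes the stack grows downward from Stack-Origin + Stack-size, and that `$addr` is the lowest stack address in use. The function name `stack_depth` is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes in use on a downward-growing stack when `addr` is the lowest
 * address touched. Returns -1 if `addr` lies outside the stack.
 * Assumption: the stack occupies [origin, origin + size). */
long stack_depth(uint32_t origin, uint32_t size, uint32_t addr)
{
    uint32_t top = origin + size;       /* one past the highest stack byte */

    assert(size > 0 && top > origin);   /* reject empty or wrapping ranges */
    if (addr < origin || addr > top)
        return -1;
    return (long)(top - addr);
}
```

With Stack-Origin 0x00872e00, Stack-size 0x0000a000 and `$addr` 0x0087cca0, this gives 0x160, i.e. 352 bytes.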

    Looking around, there are quite a few registers set to zero.  Any suggestions on which to focus and is there a key to determine what their current use is?

    Etc ...

    The slightest modification to the listener function's stack , e.g. adding a local variable, causes the failure to disappear.

    Also while the error is reproducible the overwhelming majority of the time, every now and then, with no build change and just a power cycle, it works.

    -0-

  • Hi,

    You don't need to register for Exceptions.  They should be enabled by default.  You should be getting some printout in the console window if using CCS so I'm not sure why you are not seeing those prints when an Exception occurs.

    What are you doing in the callback since you say that when it gives the thread back to the OS, it dies?

    Your stack is large so that doesn't seem to be the problem.

    Just to clarify, you need to know what the registers are at the time of the Exception.  Typically BIOS prints this information to the console window.  It's also stored as part of the Exception Module state through the pointer named 'excContext'.  Assuming the B3 you mentioned above is at the time of the Exception, you should look at the code at the addresses right before the B3.  At some point there should be a Branch instruction to one of the bad registers that contain 0.  Basically you are trying to unwind what happened.  That's about the best I can tell you in decoding an Exception.

    Judah

  • > What are you doing in the callback since you say that when it gives the thread back to the OS, it dies?

    I've even stripped it down to just a simple global variable increment and it still dies.

    Here's the info, let me know if you have any hints.

    Up to now we were connecting CCS only post-mortem, hence we were not seeing the prints.  After some hacks, we finally got it in that mode.  Here's the output.

    [C66xx_0] A0=0x1 A1=0x0

    A2=0x0 A3=0x0
    A4=0x14 A5=0x0
    A6=0xc009888 A7=0x0
    A8=0x1844018 A9=0x1
    A10=0x866838 A11=0xffff
    A12=0x866418 A13=0x0
    A14=0x100 A15=0x1b4
    A16=0xc3f1b88 A17=0x8
    A18=0xc06f698 A19=0xc122d10
    A20=0x0 A21=0x0
    A22=0x51c A23=0x3ffffff
    A24=0xff00ffff A25=0xc0864e0
    A26=0xc073910 A27=0xc1ee120
    A28=0x4 A29=0xff0012
    A30=0x8 A31=0x0
    B0=0x0 B1=0x0
    B2=0x1 B3=0xc15e134
    B4=0xc3f1d80 B5=0x15000103
    B6=0x2000 B7=0xc005b20
    B8=0x1000 B9=0x0
    B10=0xc007424 B11=0x0
    B12=0x0 B13=0xc007424
    B14=0x820000 B15=0x87c978
    B16=0xc06d716 B17=0x0
    B18=0xc0ede10 B19=0xc074d78
    B20=0xc0ed610 B21=0xffffffc0
    B22=0x20f B23=0x0
    B24=0x1 B25=0x1
    B26=0x80 B27=0xc06e160
    B28=0xff0000 B29=0x40
    B30=0x0 B31=0x3fffffff
    NTSR=0x1020f
    ITSR=0x20f
    IRP=0xc1617f4
    SSR=0x0
    AMR=0x0
    RILC=0x0
    ILC=0x0
    Exception at 0x0
    EFR=0x2 NRP=0x0
    Internal exception: IERR=0x1
    Instruction fetch exception
    ti.sysbios.family.c64p.Exception: line 248: E_exceptionMin: pc = 0x0c1617f4, sp = 0x0087c978.
    xdc.runtime.Error.raise: terminating execution
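
For readers decoding a dump like this by hand, here is a small sketch of how EFR and IERR map to the text BIOS printed. Only two facts come from the printout itself: EFR=0x2 means "internal exception" and IERR=0x1 means "instruction fetch exception". The other bit positions are recalled from the C64x+/C66x CPU documentation and should be checked against the reference guide. The names `efr_source` and `ierr_name` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum exc_source { EXC_NONE, EXC_SOFTWARE, EXC_INTERNAL, EXC_EXTERNAL, EXC_NMI };

/* Classify the exception source from EFR. Only bit 1 (internal) is
 * confirmed by the dump in this thread; bits 0, 30 and 31 are assumed. */
enum exc_source efr_source(uint32_t efr)
{
    if (efr & (UINT32_C(1) << 1))  return EXC_INTERNAL;  /* EFR=0x2 in the dump */
    if (efr & (UINT32_C(1) << 0))  return EXC_SOFTWARE;
    if (efr & (UINT32_C(1) << 30)) return EXC_EXTERNAL;
    if (efr & (UINT32_C(1) << 31)) return EXC_NMI;
    return EXC_NONE;
}

/* Names for the low IERR bits. Bit 0 is confirmed by the dump (IERR=0x1);
 * bits 1 to 3 are assumed from the CPU documentation. */
static const char *const ierr_names[] = {
    "instruction fetch exception",   /* bit 0 */
    "fetch packet exception",        /* bit 1 */
    "execute packet exception",      /* bit 2 */
    "opcode exception",              /* bit 3 */
};

/* Return the name of the lowest set IERR bit covered by the table. */
const char *ierr_name(uint32_t ierr)
{
    size_t count = sizeof ierr_names / sizeof ierr_names[0];
    size_t bit;

    assert(count <= 32);             /* keeps the shifts below within 32 bits */
    for (bit = 0; bit < count; bit++)
        if (ierr & (UINT32_C(1) << bit))
            return ierr_names[bit];
    return "none or not in table";
}
```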

    The Code Around pc=0x0c1617f4

    0c1617d8: 01888162 ADDKPC.S2 $C$RL406 (PC+32 = 0x0c1617e0),B3,4
    0c1617dc: E7000000 .fphead n, l, W, BU, nobr, nosat, 0111000b
    $C$DW$L$ti_sdo_ipc_gates_GateHWSem_enter__E$5$B, $C$DW$L$ti_sdo_ipc_gates_GateHWSem_enter__E$4$E, $C$RL406:
    0c1617e0: 02286264 LDW.D1T1 *+A10[3],A4
    0c1617e4: 6C6E NOP 4
    0c1617e6: 003C LDW.D1T1 *A4[0],A3
    0c1617e8: 00006000 NOP 4
    0c1617ec: 018C6264 LDW.D1T1 *+A3[3],A3
    0c1617f0: 00006000 NOP 4
    0c1617f4: 000C1362 B.S2X A3
    0c1617f8: 01888162 ADDKPC.S2 $C$RL407 (PC+32 = 0x0c161800),B3,4
    0c1617fc: E0400000 .fphead n, l, W, BU, nobr, nosat, 0000010b
    $C$DW$L$ti_sdo_ipc_gates_GateHWSem_enter__E$6$B, $C$DW$L$ti_sdo_ipc_gates_Gate

    Register B3 (0x0c15e134), presumably the return pointer, points into the function MessageQ_put. Here's a snippet around that value:

    0c15e100: 01907C40 ADDAW.D1 A4,A3,A3
    0c15e104: 7330 ADD.L1X A3,B6,A3
    0c15e106: 0C6E NOP 1
    0c15e108: D00C0264 [!A0] LDW.D1T1 *+A3[0],A0
    0c15e10c: 00006000 NOP 4
    0c15e110: 01800264 LDW.D1T1 *+A0[0],A3
    0c15e114: 8046 MV.L1 A0,A4
    0c15e116: 4C6E NOP 3
    0c15e118: 018C8264 LDW.D1T1 *+A3[4],A3
    0c15e11c: E4400000 .fphead n, l, W, BU, nobr, nosat, 0100010b
    0c15e120: 00006000 NOP 4
    0c15e124: 000C1362 B.S2X A3
    0c15e128: 01838162 ADDKPC.S2 $C$RL78 (PC+12 = 0x0c15e12c),B3,4
    $C$RL78:
    0c15e12c: 01BC92E6 LDW.D2T2 *++B15[4],B3
    0c15e130: 0246 MV.L1 A4,A0
    0c15e132: 0626 MVK.L1 0,A4
    0c15e134: D27CA358 [!A0] MVK.L1 -1,A4
    0c15e138: 0C6E NOP 1
    $C$L89:
    0c15e13a: A1EF BNOP.S2 B3,5
    0c15e13c: EA000000 .fphead n, l, W, BU, nobr, nosat, 1010000b
    __c6xabi_divlli:
    0c15e140: 36F7 STW.D2T2 B13,*B15--[2]
    0c15e142: A5C7 || MV.L2 B3,B13

  • Is the code snippet starting at 0xc15e100 in MessageQ_put?

    It is sort of odd that B3=0x0c15e134; I would have expected B3=0x0c15e12c, since that is what the instruction at 0x0c15e128 sets up. Nevertheless, it looks like the bad branch is at 0x0c15e124.  A3=0x0, which would explain why NRP=0x0.  The question then becomes, why is A3=0x0?  A3 is loaded from A0[0], but A0=1, so A0 looks bad too.  Unfortunately, we lost the original value of A3, from which A0 is derived.
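
The load, load, branch sequence at 0x0c15e110 to 0x0c15e124 looks like what a compiler typically emits for an indirect call through a table of function pointers held in an object. The sketch below models that shape in plain C, with a guard that returns an error instead of branching to address 0. It is an interpretation of the disassembly, not the IPC source; the struct layout and the names `fxn_table`, `object`, `call_slot` and `sample_put` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SLOTS 8                  /* hypothetical table size */

typedef int (*slot_fn)(void *obj);

struct fxn_table { slot_fn slot[NUM_SLOTS]; };
struct object    { const struct fxn_table *fxns; };  /* table pointer is the first word */

/* Guarded form of obj->fxns->slot[i](obj): returns -1 rather than
 * jumping to address 0 when any link in the chain is NULL. */
int call_slot(struct object *obj, size_t i)
{
    assert(i < NUM_SLOTS);
    if (obj == NULL || obj->fxns == NULL || obj->fxns->slot[i] == NULL)
        return -1;
    return obj->fxns->slot[i](obj);
}

/* Stand-in for a real table entry; always reports success. */
int sample_put(void *obj)
{
    (void)obj;
    return 0;
}
```

An object whose table pointer or slot has been zeroed (for example by memory being overwritten) takes the -1 path here, whereas the unguarded sequence in the disassembly branches to whatever the slot contains.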

    Next question is, does this always fail the very first time you get to this code?

    I think in the thread you said you are running something on core0 and something on core1.  Are these the same executables or different?  If different have you made sure that your memory maps for the programs don't collide?  Can you attach the two *.maps here?

    Judah

  • >Is the code snippet starting at 0xc15e100 in MessageQ_put?

       that's correct.

    >Next question is, does this always fail the very first time you get to this code?

    Correct, it's always during receipt of the first "Notify" message sent from Core1 to Core0

    >I think in the thread you said you are running something on core0 and something on core1.  Are these the same executables or different?  If different have you made sure that your memory maps for the programs don't collide? 

    They're different executables.

    >Can you attach the two *.maps here?

    The only shared section between the two cores should be BIOS.ipc.Shared...

    Here are the overall "Memory Configuration" portions; let me know if you need more.

    Core0.map

    MEMORY CONFIGURATION

    name                   origin   length    used     unused   attr fill
    ---------------------- -------- --------- -------- -------- ---- --------
    L2SRAM_IBL_UNINIT      00800000 00020000  00020000 00000000 RW
    L2SRAM                 00820000 00052e00  00052c70 00000190 RW X
    L2SRAM_RBL_UNINIT      00872e00 0000d200  0000a000 00003200 RW
    CORE0_DATA_MAJOR_TILE0 0c000000 00100000  000ff959 000006a7 RW X
    CORE0_CODE_MAJOR       0c100000 00080000  0006d120 00012ee0 R X
    CORE1_DATA_MAJOR_TILE1 0c180000 00040000  00000000 00040000 RW X
    CORE0_CODE_MINOR       0c1c0000 00020000  00000000 00020000 R X
    CORE0_DATA_MINOR_TILE0 0c1e0000 00010000  0000e268 00001d98 RW X
    CORE0_DATA_MINOR_TILE1 0c1f0000 00010000  00000000 00010000 RW X
    CORE0_DATA_MAJOR_TILE1 0c200000 00100000  00000000 00100000 RW X
    CORE1_CODE_MAJOR       0c300000 00080000  00000000 00080000 R X
    CORE1_DATA_MAJOR_TILE0 0c380000 00040000  00000000 00040000 RW X
    CORE1_CODE_MINOR       0c3c0000 00020000  00000000 00020000 R X
    CORE1_DATA_MINOR_TILE0 0c3e0000 00008000  00000000 00008000 RW X
    CORE1_DATA_MINOR_TILE1 0c3e8000 00008000  00000000 00008000 RW X
    BIOS.ipc.SharedRegion  0c3f0000 00010000  00010000 00000000 RW X

    Core1.map

    MEMORY CONFIGURATION

    name                   origin   length    used     unused   attr fill
    ---------------------- -------- --------- -------- -------- ---- --------
    L2SRAM_IBL_UNINIT      00800000 00020000  00020000 00000000 RW
    L2SRAM                 00820000 00052e00  00052ce2 0000011e RW X
    L2SRAM_RBL_UNINIT      00872e00 0000d200  0000a000 00003200 RW
    CORE0_DATA_MAJOR_TILE0 0c000000 00100000  00000000 00100000 RW X
    CORE0_CODE_MAJOR       0c100000 00080000  00000000 00080000 R X
    CORE1_DATA_MAJOR_TILE1 0c180000 00040000  00000000 00040000 RW X
    CORE0_CODE_MINOR       0c1c0000 00020000  00000000 00020000 R X
    CORE0_DATA_MINOR_TILE0 0c1e0000 00010000  00000000 00010000 RW X
    CORE0_DATA_MINOR_TILE1 0c1f0000 00010000  00000000 00010000 RW X
    CORE0_DATA_MAJOR_TILE1 0c200000 00100000  00000000 00100000 RW X
    CORE1_CODE_MAJOR       0c300000 00080000  0006ddc0 00012240 R X
    CORE1_DATA_MAJOR_TILE0 0c380000 00040000  0003fcc8 00000338 RW X
    CORE1_CODE_MINOR       0c3c0000 00020000  00000000 00020000 R X
    CORE1_DATA_MINOR_TILE0 0c3e0000 00008000  00007470 00000b90 RW X
    CORE1_DATA_MINOR_TILE1 0c3e8000 00008000  00000000 00008000 RW X
    BIOS.ipc.SharedRegion  0c3f0000 00010000  00010000 00000000 RW X
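
The collision question can be checked mechanically from the two listings. The sketch below compares the 0x0cxxxxxx regions that each core actually uses (non-zero "used" column). It assumes the used bytes sit contiguously from each region's origin, which the map summary does not guarantee, and it leaves out the 0x008xxxxx L2 regions because those addresses are local to each core. The type and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range { uint32_t origin; uint32_t used; };

/* "origin" and "used" values copied from Core0.map above */
const struct range core0_used[] = {
    { 0x0c000000u, 0x000ff959u },   /* CORE0_DATA_MAJOR_TILE0 */
    { 0x0c100000u, 0x0006d120u },   /* CORE0_CODE_MAJOR       */
    { 0x0c1e0000u, 0x0000e268u },   /* CORE0_DATA_MINOR_TILE0 */
    { 0x0c3f0000u, 0x00010000u },   /* BIOS.ipc.SharedRegion  */
};

/* "origin" and "used" values copied from Core1.map above */
const struct range core1_used[] = {
    { 0x0c300000u, 0x0006ddc0u },   /* CORE1_CODE_MAJOR       */
    { 0x0c380000u, 0x0003fcc8u },   /* CORE1_DATA_MAJOR_TILE0 */
    { 0x0c3e0000u, 0x00007470u },   /* CORE1_DATA_MINOR_TILE0 */
    { 0x0c3f0000u, 0x00010000u },   /* BIOS.ipc.SharedRegion  */
};

/* Nonzero when two non-empty half-open ranges share at least one byte. */
int ranges_overlap(struct range a, struct range b)
{
    /* 64-bit end addresses avoid wrap-around near 0xffffffff */
    uint64_t a_end = (uint64_t)a.origin + a.used;
    uint64_t b_end = (uint64_t)b.origin + b.used;

    return a.used != 0 && b.used != 0 && a.origin < b_end && b.origin < a_end;
}

/* Count overlapping pairs; *first receives the origin (from list a) of the
 * first overlapping pair and is left untouched when nothing overlaps. */
size_t count_overlaps(const struct range *a, size_t na,
                      const struct range *b, size_t nb, uint32_t *first)
{
    size_t i, j, hits = 0;

    assert(a != NULL && b != NULL && first != NULL);
    for (i = 0; i < na; i++)
        for (j = 0; j < nb; j++)
            if (ranges_overlap(a[i], b[j]) && hits++ == 0)
                *first = a[i].origin;
    return hits;
}
```

With the values above, the only overlap is the region at 0x0c3f0000, which matches the statement that BIOS.ipc.SharedRegion is the only shared section.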

  • Since this is failing the very first time you get here, can you put a breakpoint in MessageQ_put() and try to catch it before it fails?

    What version of IPC are you using?

    Correct me if I'm wrong but core1 is sending the message to core0 right?  So a MessageQ_alloc must have been called somewhere on core1.  Do you know from what heapMP the MessageQ was alloc'ed?  Just wanting to make sure the message being sent to core0 is from the SharedRegion.

  • >Since this is failing the very first time you get here, can you put a breakpoint in MessageQ_put() and try to catch it before it fails?

       I am unable to get the breakpoints to trip.  I tested on the DSK/Eval board and it works, but not on the real target.

       The CCS  Version I'm using is N201105110900.

       Any issues with this CCS version on a C6678 board?

    >What version of IPC are you using?

    ipc_1_24_00_16

    >Correct me if I'm wrong but core1 is sending the message to core0 right?  So a MessageQ_alloc must have been called somewhere on core1.  Do you know from what heapMP the MessageQ was alloc'ed?  Just wanting to make sure the message being sent to core0 is from the SharedRegion.

    Actually it's a "Notify" message.  For the MessageQ messages, we do perform an explicit mem-allocation from that shared region.  Perhaps for the Notify the allocation is performed internally?

  • So now I'm confused.  It's a "Notify" message but the exception is happening in MessageQ_put().  That doesn't make sense.  MessageQ uses Notify, but if it's a standalone Notify message, that should not be going through MessageQ.

    There aren't really "messages" associated with Notify.  Notify does not pass a message between cores; it simply sends an interrupt to the other side.  Notify does support a 32-bit payload, so you could pass a pointer to a message.

    Judah

  • Correct, Core1 sends a Notify message to Core0.  Upon receipt and during processing of the Notification, Core0 dies during MessageQ_put().  

    I've not been able to track down why Core0, which is the receiver, is even invoking MessageQ_put.  It is very unlikely that it was caused by an application-based MessageQ send, as the system is dying before it completes processing of the received Notify notification.

  • A Notify_send() should not invoke a MessageQ_put() on the receiving core.  If it's doing this, then some configuration or something is not right.

    Have you used the ROV tool? It's a good debugging tool for BIOS programs.

    Judah

  • I've inspected it and have not found anything untoward.  Perhaps you can give us some hints during today's meeting.

    Additionally is there some low level event-logging mechanism where we can access the sequence of events (stack, interrupts, etc) that preceded the failure?  

    I was reading about a "trace" module in one of TI's documents, would this be useful for this type of debugging?

  • I think "trace" could help.  I haven't used it very much myself but we can try today if we can't make any sense of what's going on.

    Judah

  • Is this problem already solved? Facing the same problem, NRP is set to 0x0.... please post your solution!

  • Solved! My task-stack section was in shared memory... moved it to L2SRAM and everything works fine!
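
For anyone wanting to catch this kind of placement mistake early, here is a minimal sketch of a startup check that a stack lies entirely in core-local L2 rather than in shared memory. The address ranges are taken from the Core0/Core1 map listings earlier in this thread, so they are an assumption for any other memory layout, and the names `region_of` and `stack_is_core_local` are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum mem_region { REGION_OTHER, REGION_LOCAL_L2, REGION_SHARED_MSMC };

/* Ranges derived from the map listings in this thread (assumed layout) */
#define L2_BASE    0x00800000u
#define L2_END     0x00880000u   /* 00872e00 + 0000d200 */
#define MSMC_BASE  0x0c000000u
#define MSMC_END   0x0c400000u   /* 0c3f0000 + 00010000 */

/* Classify a single address against the two ranges above. */
enum mem_region region_of(uint32_t addr)
{
    if (addr >= L2_BASE && addr < L2_END)
        return REGION_LOCAL_L2;
    if (addr >= MSMC_BASE && addr < MSMC_END)
        return REGION_SHARED_MSMC;
    return REGION_OTHER;
}

/* Nonzero when the whole stack [base, base + size) is in core-local L2. */
int stack_is_core_local(uint32_t base, uint32_t size)
{
    assert(size > 0);
    return region_of(base) == REGION_LOCAL_L2 &&
           region_of(base + size - 1) == REGION_LOCAL_L2;
}
```

The stack from the original post (origin 0x00872e00, size 0x0000a000) passes this check, while a stack placed anywhere in the 0x0c000000 range would not.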