RTOS/AM5716: DSP opcode exception

Butters

Expert 2001 points

Part Number: AM5716
Other Parts Discussed in Thread: SYSBIOS, SYSCONFIG

Tool/software: TI-RTOS

Hello, I am having an intermittent (but repeatable) problem with my SYS/BIOS application. I'm hoping that someone can help me to determine how best to proceed in tracking down this unusual problem.

I am running SYS/BIOS on the C66x core of an AM5716 processor. I have a single interrupt in the system; it is triggered on the rising edge of a GPIO pin. I have found that the system occasionally becomes unresponsive, and when I break in with the XDS200 JTAG emulator, I always find the following error signature:

- The PC is at abort() in exit.c, meaning that SYS/BIOS has shut down.
- The Exception module in the ROV shows that an internal exception has occurred.
  - The decoded exception is: Internal: Instruction fetch exception; Opcode exception;
  - The address of the exception (the ERP register) is the address of ti_sysbios_family_c64p_Hwi_int15; this is the vector for the interrupt that I am using.
- The exception TSR snapshot register (ETSR) has the value 0x10204, indicating that an interrupt was being processed when the exception occurred.
- The stack pointer is in the valid range, and ROV does not show stack or heap overflows.
- The interrupt return pointer (IRP) is set to the address of a particular trampoline function in my firmware (let's call it $Tramp$S$$MyFunctionName). The target of the trampoline is the function MyFunctionName, which lives in external memory (DDR).
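
For readers who want to check a signature like the one above by hand, here is a minimal sketch of decoding the two status values. The bit positions are assumptions recalled from the C64x+/C66x CPU documentation and should be verified against the TI guide; the IERR value 0x9 is inferred from the ROV text ("Instruction fetch exception; Opcode exception") and does not appear in the post. The function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* ASSUMED bit positions (verify against the C66x CPU guide). */
#define IERR_IFX (1u << 0)  /* instruction fetch exception */
#define IERR_OPX (1u << 3)  /* opcode exception */
#define TSR_GIE  (1u << 0)  /* global interrupt enable */
#define TSR_GEE  (1u << 2)  /* global exception enable */
#define TSR_INT  (1u << 9)  /* interrupt processing active */

/* Returns 1 if both the fetch and opcode exception flags are set. */
int ierr_is_fetch_and_opcode(uint32_t ierr)
{
    const uint32_t both = IERR_IFX | IERR_OPX;
    return (ierr & both) == both;
}

/* Returns 1 if the TSR snapshot says an interrupt was being processed. */
int tsr_interrupt_active(uint32_t tsr)
{
    return (tsr & TSR_INT) != 0;
}

/* Prints a one-line summary of the flags this sketch knows about. */
void print_exception_summary(uint32_t ierr, uint32_t etsr)
{
    printf("IERR=0x%08x IFX=%d OPX=%d | ETSR=0x%08x GIE=%d GEE=%d INT=%d\n",
           (unsigned)ierr,
           (ierr & IERR_IFX) != 0, (ierr & IERR_OPX) != 0,
           (unsigned)etsr,
           (etsr & TSR_GIE) != 0, (etsr & TSR_GEE) != 0,
           (etsr & TSR_INT) != 0);
}
```

Under these assumptions, calling print_exception_summary(0x9, 0x10204) would report INT set and GIE clear, which matches an exception raised while an interrupt was being taken.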

The best clue that I have is the consistent interrupt return pointer when the exception occurs. Based on the exception return pointer, the exception occurs upon reaching ti_sysbios_family_c64p_Hwi_int15, not returning from it (i.e., at the very beginning of the interrupt handling, not the end). I have verified that there is nothing wrong with the opcode at ti_sysbios_family_c64p_Hwi_int15, but the decoded internal exception values are instruction fetch exception and opcode exception.

So it seems that the exception happens only when the interrupt occurs at the precise moment that the PC is at the address $Tramp$S$$MyFunctionName (the very beginning of the trampoline, not even after the branch)... but this doesn't make any sense to me! Thanks in advance for any guesses or ideas on what else I can do to debug this issue.

Best regards,
Dave

over 8 years ago

0 Biser Gatchev-XID over 8 years ago

TI__Guru**** 393215 points

The RTOS team have been notified. They will respond here.

0 Rahul Prabhu over 8 years ago

TI__Guru** 116020 points

Dave,

Is your GPIO interrupt hooked up to HWI15? Have you observed any change on the GPIO line when you connect to JTAG that could trigger the interrupt, or is that your hypothesis? Does this issue occur only when you halt the core with the debugger, or do you believe that it also occurs during run time? Can you indicate what is running on the ARM core while you are running the code on the DSP? Is there any timer/clock running in your SYSBIOS configuration?

Can you check to see if you are changing the suspend source for any timer that you may have used:
processors.wiki.ti.com/.../AM572x_GP_EVM_Hardware_Setup

Regards,
Rahul

0 Butters over 8 years ago in reply to Rahul Prabhu

Expert 2001 points

Hi Rahul,

Yes, the GPIO interrupt is hooked up to Hwi_int15. The interrupt occurs on a periodic basis, and the interrupt is correctly serviced many times before the error occurs (my interrupt service routine increments a counter, so I can see that it has been running for some time).

The issue is actually occurring during run time; I do not break in with the emulator until after the issue has occurred. I'm just using the emulator to interrogate the state of the system after the error takes place. I can tell when the error has occurred because the communication protocol that I use to talk to the system will stop responding.

I am not running any code on the ARM; it simply runs a bootloader when the device is powered up and then sits in a while(1).

I am not using any hardware timers; the GPIO interrupt provides the system heartbeat, and I read the value of the TSC (time stamp counter) register to track the passage of "real time" when necessary.
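
As an illustration of tracking elapsed time from the time stamp counter, here is a minimal sketch of the arithmetic. It assumes the 64-bit count is read as two 32-bit halves (on the target, the TI compiler exposes these as TSCL and TSCH via c6x.h, with TSCL read first) and that the counter ticks at the DSP core clock. The 600 MHz figure in the usage note is a placeholder, not a value from this thread, and the function names are hypothetical.

```c
#include <stdint.h>

/* Combine the low and high 32-bit halves into one 64-bit tick count. */
uint64_t tsc_combine(uint32_t tscl, uint32_t tsch)
{
    return ((uint64_t)tsch << 32) | (uint64_t)tscl;
}

/* Elapsed ticks between two readings; unsigned subtraction keeps the
 * result correct even if the counter wrapped once in between. */
uint64_t tsc_elapsed(uint64_t start, uint64_t now)
{
    return now - start;
}

/* Convert ticks to microseconds for a given core clock in Hz.
 * Split into whole seconds and a remainder so the multiplication
 * by 1000000 cannot overflow 64 bits. cpu_hz must be non-zero. */
uint64_t tsc_ticks_to_us(uint64_t ticks, uint32_t cpu_hz)
{
    uint64_t whole_seconds = ticks / cpu_hz;
    uint64_t remainder     = ticks % cpu_hz;
    return whole_seconds * 1000000u + (remainder * 1000000u) / cpu_hz;
}
```

For example, with an assumed 600 MHz core clock, tsc_ticks_to_us(600, 600000000u) evaluates to 1.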

Thanks,
Dave

0 Christopher Peters over 8 years ago in reply to Butters

Genius 3380 points

Not an answer, but a suggestion on debugging. When faced with a similar issue, I look at the traceback in the debugger from the abort() function. I then go back through the function calls as far as the debugger will show me and set a breakpoint at the oldest function. The next time the exception happens, I can see more of the back trace and sometimes get a better idea of what was really happening when abort() was called. Also, be sure to open up the ROV tool and check the SysMin log.

Also, BIOS can use interrupts without telling you about it. Starting with Processor SDK 1.0.6 the GPIO driver uses the Event Combiner instead of dedicated interrupt 15 as was previously used. See the file GPIO_soc.c. Maybe switching to the new processor SDK might solve your issue.

Also, when I used the older SDK, up to 1.0.5, I had to remap the EventId in the GPIO_soc.c file. For example, for GPIO2, they specify line1EventId = 56. However, 56 was also used by Notify, so I changed it to 76, which I believe is unused by anyone. I believe GPIO1, GPIO3, GPIO6, and GPIO7 all have EventId conflicts. GPIO8 uses EventId 0, which I believe is not mapped, so you need to pick a new one for that also. I also changed to interrupt 13 in my GPIO_soc.c file; I am not sure why, but it does seem to work this way for me with a lot of GPIO interrupts. These are likely not issues with the newer Processor SDK 1.0.6.

0 Christopher Peters over 8 years ago in reply to Butters

Genius 3380 points

Another question: what is your interrupt rate? I've seen an interrupt response time of about 5-8 usec before any user callback is called. This time can be much worse, however, if the GPIO block is not set to No Idle Mode. I am using GPIO8 and GPIO6, and I use the following code to force No Idle Mode and greatly reduce the GPIO latency.

// Manually set the GPIO_SYSCONFIG register to force no IDLE mode
// and vastly improve interrupt response time.
*(volatile UInt32*)(0x48053010) = 0x00000008; // GPIO8
*(volatile UInt32*)(0x4805D010) = 0x00000008; // GPIO6
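
A possible variation on the two writes above, sketched as host-testable helpers: compute the SYSCONFIG address from the GPIO base and change only the idle-mode field instead of overwriting the whole register. The base addresses and the 0x10 offset come from the addresses in the snippet above; the position of the IDLEMODE field (bits 4:3, with the value 1 meaning no-idle) is an assumption inferred from the value 0x8 and should be checked against the TRM. The helper names are hypothetical.

```c
#include <stdint.h>

/* Base addresses implied by the register addresses used in the post. */
#define GPIO6_BASE            0x4805D000u
#define GPIO8_BASE            0x48053000u
#define GPIO_SYSCONFIG_OFFSET 0x10u

/* ASSUMED field layout: IDLEMODE in bits 4:3, value 1 = no-idle. */
#define IDLEMODE_SHIFT   3
#define IDLEMODE_MASK    (0x3u << IDLEMODE_SHIFT)
#define IDLEMODE_NO_IDLE 0x1u

/* Address of GPIO_SYSCONFIG for a given GPIO module base. */
uint32_t gpio_sysconfig_addr(uint32_t base)
{
    return base + GPIO_SYSCONFIG_OFFSET;
}

/* New register value with IDLEMODE forced to no-idle and every
 * other bit of the current value preserved. */
uint32_t gpio_sysconfig_set_no_idle(uint32_t current)
{
    return (current & ~IDLEMODE_MASK) | (IDLEMODE_NO_IDLE << IDLEMODE_SHIFT);
}
```

On the target this would be used as a read-modify-write through a volatile pointer to gpio_sysconfig_addr(GPIO8_BASE). Unlike the direct write of 0x00000008, it leaves the other SYSCONFIG bits as they were, which may or may not be what you want.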

Here is the thread that discusses this:

e2e.ti.com/.../557865

Here is a fairly long and detailed discussion on GPIO interrupts that you might find interesting.

e2e.ti.com/.../561674

0 judahvang over 8 years ago

TI__Mastermind 32475 points

Dave,

The IRP is not that useful in decoding an exception. The IRP simply points to the last interrupt that was executed, which may, but most likely does not, have anything to do with the exception. What you want to look at are the NRP and B3 registers.

The NRP (this is like the IRP for exceptions) gives you an idea of where the exception occurred. Typically the faulting instruction is a few instruction cycles before the NRP, because when an exception occurs it takes a few cycles before the CPU jumps to the exception handler. B3 is the return pointer, and that can give you a clue as to what function it was expecting to return to.

By default, SYSBIOS should print out what these are. Are you getting these prints on the console window?

Judah

0 Butters over 8 years ago in reply to judahvang

Expert 2001 points

Christopher Peters, thank you very much for your thoughts. I will look over the GPIO information that you provided.

Judah, while I certainly agree that the IRP is not generally useful when decoding an exception, I am inclined to believe that it is a legitimate clue in this case because, over tens of occurrences of the exception across different builds of my software, the IRP was always pointing exactly to the address of $Tramp$S$$MyFunctionName (even when the address of that symbol changed across software builds). It just doesn't seem possible to me that this is a coincidence.

The NRP is always pointing to exactly the address of ti_sysbios_family_c64p_Hwi_int15(); this is another reason why I am thinking that the exception has something to do with my system interrupt. The Exception context data in the ROV shows that B3 is pointing to the instruction that follows the call to the trampoline, as shown in this piece of the disassembly window (my comments added in blue):

008295c0: C61002E5 [ A0] LDW.D2T1 *+B4[0],A12
008295c4: 102CBA13 || CALLP.S2 $Tramp$S$$MyFunctionName (PC+91600 = 0x0083fb90)xv(eA>;C? <-- This red text is weird...?
008295c8: 91C6 || MV.L1X B3,A4
008295ca: 2586 MV.L1 A11,A1 <-- B3 points here.

I wonder what the text that I highlighted in red is about... I would expect to see the text ",B3" there instead.

Thanks for the continued support,
Dave

0 judahvang over 8 years ago in reply to Butters

TI__Mastermind 32475 points

Dave,

Can you show me more of the code before and after the CALLP? Before is more important.

You said your function is in DDR. Is it being cached? Can you place the function in L2 memory? If not, can you try disabling the cache for the memory that the code is currently in? It sounds like a timing issue.

Judah

0 Butters over 8 years ago in reply to judahvang

Expert 2001 points

Judah, here is more of the disassembly window before and after the CALLP:

          BeginningOfFunction():
0082953c:   1FE67C10            CALLP.S1      __push_rts (PC-52256 = 0x0081c900),A3
00829540:   A246                MV.L1         A4,A5
00829542:   4647     ||         MV.L2         B4,B10
00829544:   07FFF052 ||         ADDK.S2       -32,B15
00829548:   00946324            LDNDW.D1T1    *+A5[3],A1:A0
0082954c:   01944324            LDNDW.D1T1    *+A5[2],A3:A2
00829550:   03942324            LDNDW.D1T1    *+A5[1],A7:A6
00829554:   04940324            LDNDW.D1T1    *+A5[0],A9:A8
00829558:   023C805A            ADD.L2        4,B15,B4
0082955c:   E0200003            .fphead       n, l, W, BU, nobr, nosat, 0000001b
00829560:   009063F4            STNDW.D2T1    A1:A0,*+B4[3]
00829564:   019043F4            STNDW.D2T1    A3:A2,*+B4[2]
00829568:   039023F4            STNDW.D2T1    A7:A6,*+B4[1]
0082956c:   049003F4            STNDW.D2T1    A9:A8,*+B4[0]
00829570:   DC4D                LDW.D2T2      *B15[2],B4
00829572:   B5C6                MV.L1X        B3,A13
00829574:   02BC202A            MVK.S2        0x7840,B5
00829578:   0280406A            MVKH.S2       0x800000,B5
0082957c:   E2000000            .fphead       n, l, W, BU, nobr, nosat, 0010000b
00829580:   018C876E            LDW.D2T2      *+B14[3207],B3
00829584:   00110BDA            CMPLTU.L2     0x8,B4,B0
00829588:   2052A121     [ B0]  BNOP.S1       $C$L84 (PC+328 = 0x008296c8),5
0082958c:   30148AE6 ||  [!B0]  LDW.D2T2      *+B5[B4],B0
00829590:   0080A362            BNOP.S2       B0,5
          $C$L80:
00829594:   0005022A            MVK.S2        0x0a04,B0
00829598:   030C0572            MPYLI.M2      B0,B3,B7:B6
0082959c:   12856E7E            ADDAW.D2      B14,1390,B5
008295a0:   058C59D8            CMPGTU.L1X    0x2,B3,A11
008295a4:   82C7                MV.L2         B5,B4
008295a6:   0586                MV.L1         A11,A0
008295a8:   0298A07A ||         ADD.L2        B5,B6,B5
008295ac:   C2107C43     [ A0]  ADDAW.D2      B4,B3,B4
008295b0:   02850052 ||         ADDK.S2       2560,B5
008295b4:   C20A0453     [ A0]  ADDK.S2       5128,B4
008295b8:   059402E6 ||         LDW.D2T2      *+B5[0],B11
008295bc:   E0400008            .fphead       n, l, W, BU, nobr, nosat, 0000010b
008295c0:   C61002E5     [ A0]  LDW.D2T1      *+B4[0],A12
008295c4:   102CBA13 ||         CALLP.S2      $Tramp$S$$MyFunctionName (PC+91600 = 0x0083fb90)S(ê>;ÒB
008295c8:   91C6     ||         MV.L1X        B3,A4
008295ca:   2586                MV.L1         A11,A1
008295cc:   9184A35A     [!A1]  MVK.L2        1,B3
008295d0:   00AC1FDA            MV.L2X        A11,B1
008295d4:   960C1FD8     [!A1]  MV.L1X        B3,A12
008295d8:   2606                MV.L1         A12,A1
008295da:   1607                MV.L2X        A12,B0
008295dc:   E8800000            .fphead       n, l, W, BU, nobr, nosat, 1000100b
008295e0:   5000A35B     [!B1]  MVK.L2        0,B0
008295e4:   9508A358 ||  [!A1]  MVK.L1        2,A10
008295e8:   2500A359     [ B0]  MVK.L1        0,A10
008295ec:   2427     ||         MVK.L2        1,B0
008295ee:   024E     ||         MV.S1         A4,A0
008295f0:   00AC49A2 ||         SHRU.S2       B11,0x2,B1
008295f4:   01840F9B            ANDN.L2       B0,B1,B3
008295f8:   C190A359 ||  [ A0]  MVK.L1        4,A3
008295fc:   E10000C0            .fphead       n, l, W, BU, nobr, nosat, 0001000b

Yes, the function that is the target of the trampoline is in DDR. Both DSP L1 and L2 cache are already disabled.

I could try moving the function, but from my experience, this will almost certainly temporarily fix the problem. Basically, if I make any change to my software at all, it is likely that this problem will go away for a week or so, until some other unrelated change causes it to come back again. That is why I am eager to conclusively determine the cause of the problem -- so I can be confident that it won't come back in a week, month, or year.

Thanks,
Dave

0 judahvang over 8 years ago in reply to Butters

TI__Mastermind 32475 points

Dave,

I took a look at the code snippet, but nothing jumps out at me, and it's compiled code, so I would generally expect it to be correct.

Your description of the exception sounds exactly like what's going on: code is executing the trampoline to your function, then an interrupt occurs and an exception is generated (it looks like this happens before your function actually executes).

Since the NRP is Hwi_int15, I assume the PC got there and the IFR got cleared for this interrupt (it would be interesting if it did not get cleared).

If you are able to hook up a debugger and reproduce the problem, I would set a breakpoint at Hwi1 (Exception handler) and look at the state of the registers/memory when you hit the breakpoint.

I have a feeling that your function in DDR is a variable in this.

I think the L1P cache should not be disabled (it is only for code). I don't have much more to offer than suggesting things to try out.

Judah

0 Butters over 8 years ago in reply to judahvang

Expert 2001 points

Judah,

I'm not sure if you are personally familiar with the internal architecture of the C66x DSP, but if you are (or know somebody at TI who is), perhaps the fact that I am getting two exceptions -- an instruction fetch exception and an opcode exception -- could provide a clue as to what is happening in the DSP at the hardware level. I'm wondering if you could take these resulting exceptions and work backward to derive a theory as to how they could be occurring in the first place. The problem is so repeatable that I'm confident a definitive cause can be found, but I don't have the internal DSP knowledge that would be necessary to work through it. Do you think this is an approach that we can take?

Thanks,
Dave

0 judahvang over 8 years ago in reply to Butters

TI__Mastermind 32475 points

Dave,

I am not familiar with the C66x DSP at a hardware level. I am a software guy and have worked many years with the C66x from a software perspective.

My gut feeling says that this sounds sort of like a hardware issue because of what you are seeing in the IRP, NRP, and B3.
I will attempt to get the attention of someone on the hardware side.

Judah

0 jian35385 over 8 years ago in reply to judahvang

TI__Mastermind 23125 points

Dave:
I will review the symptoms and the debugging done so far with a SYS/BIOS expert. Please give me 1-2 days to report back.
Jian

0 Butters over 8 years ago in reply to jian35385

Expert 2001 points

Hello Jian, thank you for your offer of assistance, but I have good news! We finally figured out the cause of the problem with additional help through our local TI FAE. I actually just confirmed the solution with additional testing yesterday. What we discovered is that the problem does not occur when I disable L2 EDC (error detection and correction). To confirm this, I went into my binary image, found the STW instruction that enables L2 EDC, and replaced it with a NOP. I then ran tests for several hours and the exception did not occur once.

We believe the cause of the problem is an error in our software where we are not correctly following all of the "L2 EDC setup sequence" steps from the EDC chapter of the C66x DSP CorePac User Guide. In the unlikely event that we fix our software and the problem reoccurs, I will create a new forum post at that time.

Thanks again to everyone who read this post and contributed their ideas, as well as to those at TI who supported us through our local FAE.

Best regards,
Dave

0 Christopher Peters over 8 years ago in reply to Butters

Genius 3380 points

Is L2 EDC enabled by default or was it something you turned on?

0 Butters over 8 years ago in reply to Christopher Peters

Expert 2001 points

Christopher, no, L2 EDC is disabled by default. We had turned it on by adapting some code from a GEL file and must not have looked at it closely. If you want to check your system, the L2EDSTAT register (location 0x01846004 in DSP memory) will have bit 0 set if EDC is enabled, or bit 2 set if EDC is disabled.
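
The check described above can be sketched as a small decoder. The register address and the meaning of bits 0 and 2 are taken from the post; the enum and function names are hypothetical, and any other bits in the register are deliberately ignored here.

```c
#include <stdint.h>

#define L2EDSTAT_ADDR 0x01846004u  /* address given in the post */
#define L2EDSTAT_EN   (1u << 0)    /* set when L2 EDC is enabled */
#define L2EDSTAT_DIS  (1u << 2)    /* set when L2 EDC is disabled */

typedef enum {
    L2EDC_ENABLED,
    L2EDC_DISABLED,
    L2EDC_UNKNOWN   /* neither bit set; not covered by the post */
} l2edc_state_t;

/* Classify a raw L2EDSTAT value using only bits 0 and 2. */
l2edc_state_t l2edc_state(uint32_t l2edstat)
{
    if (l2edstat & L2EDSTAT_EN) {
        return L2EDC_ENABLED;
    }
    if (l2edstat & L2EDSTAT_DIS) {
        return L2EDC_DISABLED;
    }
    return L2EDC_UNKNOWN;
}
```

On the DSP, the raw value would be read with something like *(volatile uint32_t *)L2EDSTAT_ADDR and passed to l2edc_state(); the same value can also be read from a memory window in the debugger.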

0 Christopher Peters over 8 years ago in reply to Butters

Genius 3380 points

Thank you.
