Tool/software: TI-RTOS
Hello, I am having an intermittent (but repeatable) problem with my SYS/BIOS application. I'm hoping that someone can help me to determine how best to proceed in tracking down this unusual problem.
I am running SYS/BIOS on the C66x core of an AM5716 processor. I have a single interrupt in the system; it is triggered on the rising edge of a GPIO pin. I have found that the system occasionally becomes unresponsive, and when I break in with the XDS200 JTAG emulator, I always find the following error signature:
The best clue that I have is the consistent interrupt return pointer when the exception occurs. Based on the exception return pointer, the exception occurs upon reaching ti_sysbios_family_c64p_Hwi_int15, not returning from it (i.e., at the very beginning of the interrupt handling, not the end). I have verified that there is nothing wrong with the opcode at ti_sysbios_family_c64p_Hwi_int15, but the decoded internal exception values are instruction fetch exception and opcode exception.
So it seems that the exception happens only when the interrupt occurs at the precise moment that the PC is at the address $Tramp$S$$MyFunctionName (the very beginning of the trampoline, not even after the branch)... but this doesn't make any sense to me! Thanks in advance for any guesses or ideas on what else I can do to debug this issue.
Best regards,
Dave
Hi Rahul,
Yes, the GPIO interrupt is hooked up to Hwi_int15. The interrupt occurs on a periodic basis, and the interrupt is correctly serviced many times before the error occurs (my interrupt service routine increments a counter, so I can see that it has been running for some time).
The issue is actually occurring during run time; I do not break in with the emulator until after the issue has occurred. I'm just using the emulator to interrogate the state of the system after the error takes place. I can tell when the error has occurred because the communication protocol that I use to talk to the system will stop responding.
I am not running any code on the ARM; it simply runs a bootloader when the device is powered up and then sits in a while(1).
I am not using any hardware timers; the GPIO interrupt provides the system heartbeat, and I read the value of the TSC (time stamp counter) register to track the passage of "real time" when necessary.
Thanks,
Dave
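For readers who want to track time from the time stamp counter in the same way, here is a minimal sketch of the arithmetic involved. It assumes the two 32-bit halves have already been read (on the target these would come from the TSCL and TSCH control registers, reading TSCL first), and the 750 MHz clock value and all function names are illustrative assumptions, not something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed DSP core clock in Hz; the thread does not state the real value. */
#define DSP_CLOCK_HZ 750000000ULL

/* Join the two 32-bit halves of the time stamp counter into one 64-bit count. */
static uint64_t tsc_join(uint32_t tscl, uint32_t tsch)
{
    return ((uint64_t)tsch << 32) | (uint64_t)tscl;
}

/* Cycles elapsed between two readings. Unsigned subtraction stays correct
 * even if the counter wrapped between the two samples. */
static uint64_t tsc_elapsed(uint64_t start, uint64_t now)
{
    return now - start;
}

/* Convert a cycle count to whole microseconds (truncating). */
static uint64_t tsc_cycles_to_us(uint64_t cycles)
{
    const uint64_t cycles_per_us = DSP_CLOCK_HZ / 1000000ULL;
    assert(cycles_per_us != 0u); /* clock must be at least 1 MHz */
    return cycles / cycles_per_us;
}
```

Keeping the conversion separate from the subtraction means the hot path (for example an interrupt handler) only needs to store raw counts, and the division can be done later.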
Dave,
The IRP is not that useful in decoding an exception. The IRP simply holds the return address of the last interrupt that was taken, which may, but most likely does not, have anything to do with the exception. What you want to look at is the NRP and B3 registers.
The NRP (this is like the IRP for exceptions) gives you an idea of where the exception occurred. Typically the faulting instruction is a few instruction cycles before the NRP, because when an exception occurs it takes a few cycles before the CPU jumps to the exception handler. B3 is the return pointer, and that can give you a clue as to what function the code was expecting to return to.
By default, SYS/BIOS should print out what these are. Are you getting these prints on the console window?
Judah
Christopher Peters, thank you very much for your thoughts. I will look over the GPIO information that you provided.
Judah, while I certainly agree that the IRP is not generally useful when decoding an exception, I am inclined to believe that it is a legitimate clue in this case because, over tens of occurrences of the exception across different builds of my software, the IRP was always pointing exactly to the address of $Tramp$S$$MyFunctionName (even when the address of that symbol changed across software builds). It just doesn't seem possible to me that this is a coincidence.
The NRP is always pointing to exactly the address of ti_sysbios_family_c64p_Hwi_int15(); this is another reason why I am thinking that the exception has something to do with my system interrupt. The Exception context data in the ROV shows that B3 is pointing to the instruction that follows the call to the trampoline, as shown in this piece of the disassembly window (my comments added in blue):
008295c0: C61002E5 [ A0] LDW.D2T1 *+B4[0],A12
008295c4: 102CBA13 || CALLP.S2 $Tramp$S$$MyFunctionName (PC+91600 = 0x0083fb90)xv(eA>;C? <-- This red text is weird...?
008295c8: 91C6 || MV.L1X B3,A4
008295ca: 2586 MV.L1 A11,A1 <-- B3 points here.
I wonder what the text that I highlighted in red is about... I would expect to see the text ",B3" there instead.
Thanks for the continued support,
Dave
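As a sanity check on the disassembly, the "PC+91600 = 0x0083fb90" annotation can be reproduced by hand. The sketch below assumes the displacement is applied to the 32-byte-aligned fetch packet that contains the branch rather than to the address of the instruction itself; that assumption is consistent with the numbers shown in this thread, but it is worth confirming against the CPU reference guide. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Fetch packets are assumed to be 32 bytes long and 32-byte aligned. */
#define FETCH_PACKET_BYTES 32u

/* Address of the fetch packet that contains the given instruction. */
static uint32_t fetch_packet_base(uint32_t insn_addr)
{
    /* The mask trick below only works for a power-of-two packet size. */
    assert((FETCH_PACKET_BYTES & (FETCH_PACKET_BYTES - 1u)) == 0u);
    return insn_addr & ~(FETCH_PACKET_BYTES - 1u);
}

/* Branch target = fetch packet base + signed byte displacement. */
static uint32_t branch_target(uint32_t insn_addr, int32_t displacement)
{
    return fetch_packet_base(insn_addr) + (uint32_t)displacement;
}
```

For the CALLP at 0x008295c4, the fetch packet base is 0x008295c0, and 0x008295c0 + 91600 gives 0x0083fb90, which matches the trampoline address in the disassembly. The CALLP return address in B3 is then the first instruction of the next execute packet, 0x008295ca, as observed.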
Judah, here is more of the disassembly window before and after the CALLP:
BeginningOfFunction():
0082953c: 1FE67C10 CALLP.S1 __push_rts (PC-52256 = 0x0081c900),A3
00829540: A246 MV.L1 A4,A5
00829542: 4647 || MV.L2 B4,B10
00829544: 07FFF052 || ADDK.S2 -32,B15
00829548: 00946324 LDNDW.D1T1 *+A5[3],A1:A0
0082954c: 01944324 LDNDW.D1T1 *+A5[2],A3:A2
00829550: 03942324 LDNDW.D1T1 *+A5[1],A7:A6
00829554: 04940324 LDNDW.D1T1 *+A5[0],A9:A8
00829558: 023C805A ADD.L2 4,B15,B4
0082955c: E0200003 .fphead n, l, W, BU, nobr, nosat, 0000001b
00829560: 009063F4 STNDW.D2T1 A1:A0,*+B4[3]
00829564: 019043F4 STNDW.D2T1 A3:A2,*+B4[2]
00829568: 039023F4 STNDW.D2T1 A7:A6,*+B4[1]
0082956c: 049003F4 STNDW.D2T1 A9:A8,*+B4[0]
00829570: DC4D LDW.D2T2 *B15[2],B4
00829572: B5C6 MV.L1X B3,A13
00829574: 02BC202A MVK.S2 0x7840,B5
00829578: 0280406A MVKH.S2 0x800000,B5
0082957c: E2000000 .fphead n, l, W, BU, nobr, nosat, 0010000b
00829580: 018C876E LDW.D2T2 *+B14[3207],B3
00829584: 00110BDA CMPLTU.L2 0x8,B4,B0
00829588: 2052A121 [ B0] BNOP.S1 $C$L84 (PC+328 = 0x008296c8),5
0082958c: 30148AE6 || [!B0] LDW.D2T2 *+B5[B4],B0
00829590: 0080A362 BNOP.S2 B0,5
$C$L80:
00829594: 0005022A MVK.S2 0x0a04,B0
00829598: 030C0572 MPYLI.M2 B0,B3,B7:B6
0082959c: 12856E7E ADDAW.D2 B14,1390,B5
008295a0: 058C59D8 CMPGTU.L1X 0x2,B3,A11
008295a4: 82C7 MV.L2 B5,B4
008295a6: 0586 MV.L1 A11,A0
008295a8: 0298A07A || ADD.L2 B5,B6,B5
008295ac: C2107C43 [ A0] ADDAW.D2 B4,B3,B4
008295b0: 02850052 || ADDK.S2 2560,B5
008295b4: C20A0453 [ A0] ADDK.S2 5128,B4
008295b8: 059402E6 || LDW.D2T2 *+B5[0],B11
008295bc: E0400008 .fphead n, l, W, BU, nobr, nosat, 0000010b
008295c0: C61002E5 [ A0] LDW.D2T1 *+B4[0],A12
008295c4: 102CBA13 || CALLP.S2 $Tramp$S$$MyFunctionName (PC+91600 = 0x0083fb90)S(ê>;ÒB
008295c8: 91C6 || MV.L1X B3,A4
008295ca: 2586 MV.L1 A11,A1
008295cc: 9184A35A [!A1] MVK.L2 1,B3
008295d0: 00AC1FDA MV.L2X A11,B1
008295d4: 960C1FD8 [!A1] MV.L1X B3,A12
008295d8: 2606 MV.L1 A12,A1
008295da: 1607 MV.L2X A12,B0
008295dc: E8800000 .fphead n, l, W, BU, nobr, nosat, 1000100b
008295e0: 5000A35B [!B1] MVK.L2 0,B0
008295e4: 9508A358 || [!A1] MVK.L1 2,A10
008295e8: 2500A359 [ B0] MVK.L1 0,A10
008295ec: 2427 || MVK.L2 1,B0
008295ee: 024E || MV.S1 A4,A0
008295f0: 00AC49A2 || SHRU.S2 B11,0x2,B1
008295f4: 01840F9B ANDN.L2 B0,B1,B3
008295f8: C190A359 || [ A0] MVK.L1 4,A3
008295fc: E10000C0 .fphead n, l, W, BU, nobr, nosat, 0001000b
Yes, the function that is the target of the trampoline is in DDR. Both DSP L1 and L2 cache are already disabled.
I could try moving the function, but from my experience, this will almost certainly temporarily fix the problem. Basically, if I make any change to my software at all, it is likely that this problem will go away for a week or so, until some other unrelated change causes it to come back again. That is why I am eager to conclusively determine the cause of the problem -- so I can be confident that it won't come back in a week, month, or year.
Thanks,
Dave
Dave,
I took a look at the code snippet, but nothing jumps out at me, and since it's compiled code I would generally expect it to be correct.
Your description of the exception sounds exactly like what's going on.
The code is executing the trampoline to your function, then an interrupt occurs and an exception is generated (apparently before your function actually executes).
Since the NRP is Hwi_int15, I assume the PC got there and the IFR bit for this interrupt got cleared (it would be interesting if it did not get cleared).
If you are able to hook up a debugger and reproduce the problem, I would set a breakpoint at Hwi1 (the exception handler) and look at the state of the registers and memory when you hit the breakpoint.
I have a feeling that your function being in DDR is a variable in this.
I think L1P cache should not be disabled (it is only for code). I don't have much more to offer than suggesting things to try out.
Judah
Judah,
I'm not sure if you are personally familiar with the internal architecture of the C66x DSP, but if you are (or know somebody at TI who is), perhaps the fact that I am getting two exceptions -- an instruction fetch exception and an opcode exception -- could provide a clue as to what is happening in the DSP at the hardware level. I'm wondering if you could take these resulting exceptions and work backward to derive a theory as to how they could be occurring in the first place. The problem is so repeatable that I'm confident a definitive cause can be found, but I don't have the internal DSP knowledge that would be necessary to work through it. Do you think this is an approach that we can take?
Thanks,
Dave
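For anyone decoding the same pair of internal exceptions from a raw register dump, here is a small sketch that turns an IERR value into flag names. The bit positions are my understanding of the IERR layout from the CPU and Instruction Set Reference Guide and should be treated as assumptions to verify there; the type and function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t    mask;
    const char *name;
} IerrFlag;

/* Assumed IERR bit layout; verify against the CPU reference guide. */
static const IerrFlag ierr_flags[] = {
    { 0x001u, "IFX: instruction fetch exception" },
    { 0x002u, "FPX: fetch packet exception" },
    { 0x004u, "EPX: execute packet exception" },
    { 0x008u, "OPX: opcode exception" },
    { 0x010u, "RCX: resource conflict exception" },
    { 0x020u, "RAX: resource access exception" },
    { 0x040u, "PRX: privilege exception" },
    { 0x080u, "LBX: loop buffer exception" },
    { 0x100u, "MSX: missed stall exception" },
};

/* Store the names of all flags set in ierr into names[] (up to max_names)
 * and return how many were stored. */
static size_t ierr_decode(uint32_t ierr, const char **names, size_t max_names)
{
    size_t count = 0;
    size_t i;

    assert(names != NULL || max_names == 0u);
    for (i = 0; i < sizeof ierr_flags / sizeof ierr_flags[0]; i++) {
        if ((ierr & ierr_flags[i].mask) != 0u && count < max_names) {
            names[count++] = ierr_flags[i].name;
        }
    }
    return count;
}
```

Under this assumed layout, the combination described in this thread (instruction fetch exception plus opcode exception) would correspond to an IERR value of 0x9.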
Dave,
I am not familiar with the C66x DSP at a hardware level. I am a software guy and have worked many years with the C66x from a software perspective.
My gut feeling says this sounds sort of like a hardware issue because of what you are seeing in the IRP, NRP, and B3.
I will attempt to get the attention of someone on the hardware side.
Judah
Hello Jian, thank you for your offer of assistance, but I have good news! We finally figured out the cause of the problem with additional help through our local TI FAE. I actually just confirmed the solution with additional testing yesterday. What we discovered is that the problem does not occur when I disable L2 EDC (error detection and correction). To confirm this, I went into my binary image, found the STW instruction that enables L2 EDC, and replaced it with a NOP. I then ran tests for several hours and the exception did not occur once.
We believe the cause of the problem is an error in our software where we are not correctly following all of the "L2 EDC setup sequence" steps from the EDC chapter of the C66x DSP CorePac User Guide. In the unlikely event that we fix our software and the problem reoccurs, I will create a new forum post at that time.
Thanks again to everyone who read this post and contributed their ideas, as well as to those at TI who supported us through our local FAE.
Best regards,
Dave
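The "replace the STW with a NOP" experiment can be done with a small host-side patch routine. The following is a sketch only: it assumes a little-endian image, assumes that an all-zero 32-bit word encodes a NOP (check this against the instruction set reference, and note that the parallel bit of the original instruction may need to be carried over), and uses hypothetical names throughout. Checking the expected old word before writing guards against patching the wrong offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encoding of a 32-bit NOP; verify before relying on it. */
#define C6X_NOP_WORD 0x00000000u

/* Read a 32-bit little-endian word from a byte buffer. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Write a 32-bit little-endian word to a byte buffer. */
static void write_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xFFu);
    p[1] = (uint8_t)((v >> 8) & 0xFFu);
    p[2] = (uint8_t)((v >> 16) & 0xFFu);
    p[3] = (uint8_t)((v >> 24) & 0xFFu);
}

/* Replace the word at offset with replacement, but only if it currently
 * equals expected. Returns 0 on success, -1 on a range or match failure. */
static int patch_word(uint8_t *image, size_t size, size_t offset,
                      uint32_t expected, uint32_t replacement)
{
    assert(image != NULL);
    if (offset > size || size - offset < 4u) {
        return -1; /* word would run past the end of the image */
    }
    if (read_le32(image + offset) != expected) {
        return -1; /* not the instruction we meant to replace */
    }
    write_le32(image + offset, replacement);
    return 0;
}
```

A patched image like this is only a diagnostic aid; the longer-term fix described above is to correct the L2 EDC setup sequence in the source.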