I am working on a single core of a custom TMS320C6474 board (rev 2.1 silicon). (The other two cores are being held suspended by the CCS v4 debugger.)
I am running DSP/BIOS 5 with two TSK objects and one SWI, plus the Ethernet stack and whatever it adds in the way of SWIs, HWIs and TSKs.
I am having problems with internal exceptions. These are mostly Resource Conflicts "RCX", but occasionally Opcode "OPX" or Instruction Fetch "IPX" Exceptions.
My application runs various DSP algorithms on data held in DDR2 RAM. One TSK contains the DSP code to do this, the other just initialises and monitors the SWI that copies dummy data into an input buffer. (The SWI would normally take real ADC data from SRIO, but enabling SRIO causes even more exceptions - so that is disabled currently).
I previously had an internal resource conflict exception "RCX" running the same code but on the C6748. This was due to bugs in the TI fast math libraries. http://e2e.ti.com/support/dsp/tms320c6000_high_performance_dsps/f/115/t/133991.aspx#481095
In this case I have removed the fastrts64x.lib library but my problems have not gone away.
I have a breakpoint on EXC_dispatch that catches the exceptions when they occur. In my current build, the exception seems to always be RCX with NRP/ERP pointing to the same address - a CALL instruction. The call is to a function in another C file - not to a TI library. It takes on the order of a minute for the exception to occur, so I am certain that the CALL instruction has run many times already (and hence I don't want to put a breakpoint on it). I think that something else must be happening to trigger the exception.
I have been working on this for a long time now, and with previous RCX exceptions I have found by trial and error that making very small changes to the code will cure the current exception and move the problem elsewhere.
Things I am not sure about:
Looking forward to some helpful suggestions - Regards Paul
This is another symptom I get - but only very occasionally.
(The code carries on running, but I lose my debug that is being output to a web browser.)
00052.358 Illegal reentrant call to llEnter()
00052.359 Illegal reentrant call to llEnter()
Service Status: Telnet : Disabled : : 000
00052.361 Illegal reentrant call to llEnter()
Service Status: HTTP : Disabled : : 000
Network Removed: If-1:10.0.0.2
00052.364 Illegal reentrant call to llEnter()
00052.365 Illegal reentrant call to llEnter()
00052.366 Illegal reentrant call to llEnter()
As you learned before, finding the exact cause is not easy, but the CPU Exception support does make it a lot easier.
Depending on the exact exception reported, there are different tables and figures in the CPU & Instruction Set Reference Guide to help you understand the information that gets presented in the exception registers. The NRP value may be a few instructions later than the instruction that had the exception. The explanations in CPU RG says that NRP will have the address of the first instruction packet that was annulled, so it might not be the actual instruction that failed.
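As a rough aid when reading the exception registers, the IERR value can be turned into flag names with a small helper like the one below. This is only a sketch: the bit positions are as I recall them from the C64x+ CPU reference guide and should be treated as an assumption to verify against the IERR table for your device. The names used are the reference guide's (IFX, FPX, ...); the "IPX" and "PFX" mentioned in this thread presumably correspond to IFX and FPX. `ierr_decode` is a hypothetical helper name, not a TI API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t    mask;
    const char *name;
} ierr_flag_t;

/* ASSUMPTION: bit positions recalled from the C64x+ CPU reference guide;
   check them against the IERR table for your silicon before relying on them. */
static const ierr_flag_t ierr_flags[] = {
    { 1u << 0, "IFX" },  /* instruction fetch exception  */
    { 1u << 1, "FPX" },  /* fetch packet exception       */
    { 1u << 2, "EPX" },  /* execute packet exception     */
    { 1u << 3, "OPX" },  /* opcode exception             */
    { 1u << 4, "RCX" },  /* resource conflict exception  */
    { 1u << 5, "RAX" },  /* resource access exception    */
    { 1u << 6, "PRX" },  /* privilege exception          */
    { 1u << 7, "LBX" },  /* SPLOOP buffer exception      */
    { 1u << 8, "MSX" },  /* missed stall exception       */
};

/* Writes the names of all set flags into out, joined by '+'.
   Returns the number of flag names written in full. */
int ierr_decode(uint32_t ierr, char *out, size_t out_len)
{
    size_t used = 0;
    size_t i;
    int count = 0;

    assert(out != NULL && out_len > 0);
    out[0] = '\0';
    for (i = 0; i < sizeof ierr_flags / sizeof ierr_flags[0]; i++) {
        if (ierr & ierr_flags[i].mask) {
            int n = snprintf(out + used, out_len - used, "%s%s",
                             count ? "+" : "", ierr_flags[i].name);
            if (n < 0 || (size_t)n >= out_len - used)
                break;  /* buffer too small: output is truncated here */
            used += (size_t)n;
            count++;
        }
    }
    return count;
}
```

Logging the decoded string alongside NRP each time EXC_dispatch is hit makes it easier to see whether the exception type and the address move together.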
Start by looking at where the NRP is pointing in the Disassembly Window and scroll up a few lines. See if anything looks unusual. Look up the address in your linker .map file to find out which object module you are executing in, then try to determine what might be unique about that area.
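If the exception address keeps moving, it can save time to script the .map lookup rather than doing it by eye. The sketch below assumes you have copied the relevant base/length/name rows out of the .map file into a table by hand. The entries in `example_map` are invented for illustration and are not taken from any real build, and `map_lookup` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t    base;  /* start address, copied from the .map file */
    uint32_t    size;  /* length in bytes                          */
    const char *name;  /* object module or section name            */
} map_entry_t;

/* HYPOTHETICAL entries: replace with rows from your own .map file. */
static const map_entry_t example_map[] = {
    { 0x84030000u, 0x00008000u, "dsp_algo.obj (.text)" },
    { 0x84100000u, 0x00100000u, "heap" },
};

/* Returns the name of the entry containing addr, or NULL if none does. */
const char *map_lookup(const map_entry_t *map, size_t n, uint32_t addr)
{
    size_t i;

    assert(map != NULL);
    for (i = 0; i < n; i++) {
        /* Subtraction form avoids overflow of base + size. */
        if (addr >= map[i].base && addr - map[i].base < map[i].size)
            return map[i].name;
    }
    return NULL;
}
```

An NRP that resolves to a data section or to nothing at all is a strong hint that the CPU has branched through a bad pointer or return address rather than hit a genuinely bad instruction.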
It is very rare to find this to be a problem with compiled code or libraries, in spite of the fact that you have already encountered this once. The more common cause in my experience has been memory corruption for one of the following reasons:
These are the most common causes that I can think of. You should be able to tell if the memory is corrupted or not by inspecting the area prior to the NRP address for incorrect code.
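One way to check for corrupted code without waiting for an exception is to record a checksum of the code region right after load and re-check it periodically (for example from a low-priority task). The following is a minimal sketch using a plain bitwise CRC-32; the function names are hypothetical, and where the region starts and ends would have to come from your own linker map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). Slow but table-free. */
uint32_t crc32_region(const uint8_t *p, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    size_t i;
    int bit;

    assert(p != NULL || len == 0);
    for (i = 0; i < len; i++) {
        crc ^= p[i];
        for (bit = 0; bit < 8; bit++)
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
    }
    return ~crc;
}

/* Returns 1 if the region still matches the CRC recorded right after load. */
int region_intact(const uint8_t *p, size_t len, uint32_t crc_at_load)
{
    return crc32_region(p, len) == crc_at_load;
}
```

Splitting the code into several smaller regions, each with its own CRC, narrows down where the damage is and keeps each check short. Note that reading code through the data cache may not show exactly what the program fetch path sees, so a mismatch is conclusive but a match is only a hint.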
You may also want to look for correlation in the addresses where errors occur. Is it always the same address, or always a similar address?
Thanks for the reply.
I am working through some analysis now, and will get back to you for some more input.
Regards - Paul
I have checked the L2RAM and cache are not overlapped, and just to make sure, I tried disabling all cache, and exceptions still occur but at a slightly different point in the code. (I think I mentioned before that if I change anything (like adding an extra ASSERT statement), the problem moves.)
I am very limited on what I can do with the hardware as it is buried inside several layers of metal in a fan cooled unit - hence I am unable to check for power supply noise.
However we have run an extensive DDR2 memory soak test and are happy that the DDR2 memory is behaving itself when we don't have a full application running.
I have spent many days now examining the registers after the exceptions occur.
The current symptom I am seeing is that the processor triggers an RCX + PFX exception. The ERP/NRP/B3 registers all point to this occurring in the middle of data on my heap, so it appears the code has jumped somewhere stupid.
Looking at the stack pointer, the location pointed to by the stack pointer contains the address I see in the ERP/NRP/B3 registers. This has led me to believe that the stack pointer has been corrupted, and then a return from a function call caused a jump to the data heap area.
Looking at the data stored on the stack on either side of the stack pointer, I can see other return addresses. At SP minus two words and at SP plus ten words. These return addresses are consistent, and if I continue to follow the stack to the bottom (increasing SP addresses) I can see the whole call chain with correct SP adjustments at each call.
It appears that during the second to last function call, the SP was decremented by 12 words. Subsequently the last function was called and B3 was placed at the SP. But, when that function came to return the SP was now 2 words greater than it should have been, and the value loaded into B3 from the SP actually points to the data heap and not the caller.
Currently the scenario is repeatable, the address in the heap is not always the same, and the SP is often (but not always) the same. My explanation of it being a corrupted SP feels consistent with the symptoms, but I can't work out why it is happening.
I suspect that it may be due to an interrupt going off during the last function call. On the stack above the current SP (decrementing address) there is a lot of data including the odd code address. This data does not match the call signature of any function that might have been called. I have seen that when I single step through the code the stack can "explode" during a single step - hence I know that interrupts can add lots of data to the top of the stack.
I have followed what appear to be a few red herrings. One such lead was that on making a very small change to the code, the exceptions began to occur in the ethernet tx interrupt. Hence I added counters to that interrupt to determine if it was going off whilst the code was executing. Initially it did look like this might be the culprit, but eventually I caught an occurrence where it was apparent that the ethernet interrupt had not gone off recently.
So I am a bit stuck.
I know that turning off the SRIO interrupts does not make the problem go away (it does however make it take a lot longer to happen).
I could try disabling the ethernet traffic, but that will leave me without any indication of whether the code is running. (If I use debug printfs, it adds very large delays to the code and I lose sync with the SRIO data.)
I have looked for possible causes of stack corruptions by the code that is running. For example, I am performing FFTs - I have found that the FFTs are running on data in the heap and not on the stack. But, this still would not explain the corruption of the SP itself.
Can you throw me any more ideas please?
If I understand you correctly, you have debugged a tremendous amount and have a lot of new information. The original question was about how to fix the particular exception you were getting, and now you have traced the problem much deeper and more specific, even if it does not sound like it in a "manager's summary".
It appears that the problem has been tracked back to executing from data. A corrupted stack is the easiest way for this to happen since a return address is often pulled off the stack and then the branch back/return occurs to that popped address.
I am a little surprised that you said ERP/NRP/B3 lead you to the data execution. ERP/NRP will point you to the approximate location where the exception occurred. But B3 has to be loaded with a value for a return address, and that would be a very strange thing to execute instructions in the heap that will save the current address into B3 and branch to it. Am I reading too much into this?
Since you mentioned DDR2, I would guess that this is where your heap resides. If it were in L2RAM then there are memory protection features available that could cause an exception trap as soon as you start executing in the heap area. In fact, since you are not using cores 1 & 2 right now, you might be able to setup those cores to be running a minimum amount of code, just enough to use the Memory Protection features and handle exceptions, then allow Core0 to put its heap in the L2 for one of the other cores. You could try just relocating the heap, if it fits, and see if the exception still occurs in that area before trying to implement the small footprint code to get a trap.
CCS supports hardware breakpoints that can cause a halt when you execute from an address or range of addresses. I have not used the C6474 for quite a while, so I do not remember the exact syntax or the menus to use. But through the CCS Breakpoint Manager, you can set the type and features of a breakpoint. This may be a good way to capture the first time an execution tries to happen in the data area.
What timeframes are you talking about before corruption occurs with the different interrupt configurations? Minutes, seconds, hours?
I do have a lot of pride and confidence in the libraries that TI supplies, including the run-time support libraries. We have found bugs in the past, but that is very rare. Any bugs like what you are seeing are rare. At this point, I would recommend that you try to inventory all the libraries and sources of code that could have any chance of causing the stack pointer corruption that you have observed. C-compiled functions should be 99% safe (really 100% but we can allow for any possibility), so any object libraries and assembly functions will be useful places to start.
Those are my ideas for now.
Hi Again, thanks for the quick response.
Yes lots of debugging, lots of single stepping through stuff I can only guess at - so deep in fact that I have a decompression chamber on standby in case of getting the Bends.
I had to keep the detail to a "managers" description else I would have been typing all day - I have weeks of notes now.
The original question was what might be causing the RCX, however since the mode of failure has changed yet again (it is now very obvious I am trying to execute data) the original question is now obsolete.
To make sure I don't miss any of your points, I have broken down your reply and answered it piecemeal below:
ERP/NRP/B3 all the same:
Thanks again - Paul
I am trying to configure a hardware breakpoint as suggested.
I can't find any useful documentation that tells me how to setup a hardware breakpoint that uses a range of addresses.
The document SPRS552H (TMS320C6474 data manual) says that the 6474 can have hardware breakpoints that work over a range of addresses:
7.20.1 Advanced Event Triggering (AET)
The C6474 device supports Advanced Event Triggering (AET). This capability can be used to debug complex problems as well as understand performance characteristics of user applications. AET provides the following capabilities:
• Hardware Program Breakpoints: specify addresses or address ranges that can generate events such as halting the processor or triggering the trace capture.
But when I follow the links to the other suggested documents, SPRA753 and SPRA387, these were written over ten years ago and the menus they refer to don't exist in CCS v4.
Does CCSv4 allow setting of hardware address range breakpoints - or just single address ones?
I was actually single stepping through the HWI_F_dispatch code when your reply arrived.
I think I have just seen that on entry the SP was 0x00898D60, but on exit (at the B IRP), it was 0x00898D68.
Is that a coincidence, or am I missing something?
I think this is evidence of skilled debug. Having the SP change by 8 should crash the system every time. Unless there is some detail about the way the interrupt processing works so that this is always the case, this must be an envelope around the actual cause.
If you are using SYS/BIOS, you can get the full source code by running a custom build. If that works smoothly, you can leave it that way, although builds tend to take a bit longer with all that BIOS source code getting included. But it could help if you could add some debug code into HWI_F_dispatch to help you figure out what is going wrong.
It would be good to know if this SP corruption occurs only on a particular HWI, and which one, and what it calls. Then you can insert debug points into your own code to catch whether the SP changes in the HWI_F_dispatch routine or in the called ISR. Just to toss out a random information point, you would know if anyone used the interrupt keyword mistakenly with a function that is called by the HWI Dispatcher, because that would cause bad results.
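A debug point of the kind described above could be as simple as recording the SP before and after each ISR call and logging any mismatch together with the HWI number. The sketch below only shows the bookkeeping. How the SP is actually read on the target is left out, and every name here (`sp_check`, `sp_log`, `SP_LOG_DEPTH`) is hypothetical rather than part of DSP/BIOS.

```c
#include <assert.h>
#include <stdint.h>

#define SP_LOG_DEPTH 8u

typedef struct {
    int      hwi_id;    /* which interrupt was being serviced */
    uint32_t entry_sp;  /* SP sampled before the ISR ran      */
    uint32_t exit_sp;   /* SP sampled after the ISR returned  */
} sp_fault_t;

static sp_fault_t sp_log[SP_LOG_DEPTH];  /* ring buffer of mismatches */
static unsigned   sp_fault_count;        /* total mismatches seen     */

/* Returns exit_sp - entry_sp (0 when balanced) and logs any mismatch. */
int32_t sp_check(int hwi_id, uint32_t entry_sp, uint32_t exit_sp)
{
    int32_t delta = (int32_t)(exit_sp - entry_sp);

    assert(hwi_id >= 0);
    if (delta != 0) {
        sp_fault_t *slot = &sp_log[sp_fault_count % SP_LOG_DEPTH];
        slot->hwi_id   = hwi_id;
        slot->entry_sp = entry_sp;
        slot->exit_sp  = exit_sp;
        sp_fault_count++;
    }
    return delta;
}
```

Putting a breakpoint on the `sp_fault_count++` line would halt the core at the first unbalanced ISR, while the stack contents are still fresh, instead of at the later exception.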
There are debug tools built into the C6474 that can help you figure out where the processor has been executing before reaching a breakpoint. The ETB module is mentioned in the datasheet in Section 7.20 along with a pair of application notes that may be helpful. Even if you do not have a 60-pin header on your board, there are ways to access the ETB through code or emulation. If you decide that will be helpful, please post a new thread on the E2E Code Composer Forum under Development Tools since the emulation experts live there. If you do that, please post links from this thread to that one and vice versa to help people to easily find your eventual solution and to let the experts understand your problem.
Previously I could not see how to create an address range breakpoint. Well, I finally figured out how to set a hardware "address range" breakpoint in CCS v4.2.5.00005:
Having set up a breakpoint for addresses in the heap, the code that was reliably failing all morning when it jumped to the heap, now fails at a completely different location.
This time however, I think the assembler code may be suspect. (There is a conditional branch just prior to this assembler so this code may not have executed before).
The ERP/NRP is 0x84035480, IERR = RCX, the SP appears to be correctly aligned to the current function's stack data. The exception pointer is nowhere near a call instruction as it has been in the past.
Would you please tell me if there is an issue with the following assembler:
506 *pdSnk++ = _itod(_cmpyr1(_hi(*pdSnk), _hi(*pdWin)),_cmpyr1(_lo(*pdSnk), _lo(*pdWin)));
C$DW$L$_function_name$31$B, C$L90:
0x84035468: 02005FEE    LDW.D2T2 *+B15,B4
0x8403546C: 02805EEC    LDW.D2T1 *+B15,A5
0x84035470: 0F805EEE    LDW.D2T2 *+B15,B31
0x84035474: 00004000    NOP 3
0x84035478: 01901FD9    OR.L1X 0,B4,A3
0x8403547C: 04140364 || LDDW.D1T1 *+A5,A9:A8
0x84035480: 02140FD9    OR.L1 0,A5,A4
0x84035484: 030C0365 || LDDW.D1T1 *+A3,A7:A6
0x84035488: 02100364 || LDDW.D1T1 *+A4,A5:A4
0x8403548C: 021003E6    LDDW.D2T2 *+B4,B5:B4
0x84035490: 00004000    NOP 3
0x84035494: 12188330    CMPYR1.M1 A4,A6,A4
0x84035498: 12953330    CMPYR1.M1X A9,B5,A5
0x8403549C: 02FD005A    ADD.L2 8,B31,B5
0x840354A0: 02805EFE    STW.D2T2 B5,*+B15
0x840354A4: 00000000    NOP
0x840354A8: 027C03C4    STDW.D2T1 A5:A4,*+B31
Thanks - Paul
I have just re-loaded the above code and single stepped through it.
The code in the previous post has the value 030C0365 at 0x84035484 which makes the next instruction parallel.
In the freshly re-loaded version, the value is 030C0364 with the LSBit set to zero, the next instruction is now NOT parallel and there is no exception.
(Manually modifying the value to 030C0365, I have verified that it does cause an exception.)
So in this case, I appear to have had a corruption of some sort.
A memory corruption like this could cause all sorts of other symptoms too.
I'll keep debugging....
Yes, there are problems with this code. The listing above implies that the C code with the _itod intrinsic compiled to generate the code shown. If this is the case, then it is a serious compiler bug and we would need a small example project from you that duplicates this code generation. Is this from a .lst file or from the disassembly window?
The two parallel instructions at 0x84035484 & 0x84035488 both use the same functional unit, and that will cause a resource conflict exception.
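To make the conflict concrete, here is a small sketch that groups opcode words into execute packets using the p-bit (bit 0 of each 32-bit instruction word: when set, the next instruction runs in parallel with this one) and flags a functional unit used twice in the same packet. The opcode words and unit names are copied from the listing in this thread; the unit is taken from the disassembly text rather than decoded from the opcode, and the sketch assumes plain 32-bit instructions only (no compact instructions) and ignores cross paths and other shared resources. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t    opcode;  /* 32-bit instruction word                 */
    const char *unit;    /* functional unit, as shown by the disassembler */
} insn_t;

#define LISTING_LEN 6

/* As captured after the exception: 0x030C0365 has its p-bit set. */
static const insn_t listing_corrupt[LISTING_LEN] = {
    { 0x01901FD9u, "L1" }, { 0x04140364u, "D1" },
    { 0x02140FD9u, "L1" }, { 0x030C0365u, "D1" },
    { 0x02100364u, "D1" }, { 0x021003E6u, "D2" },
};

/* As seen after a fresh reload: 0x030C0364 has its p-bit clear. */
static const insn_t listing_good[LISTING_LEN] = {
    { 0x01901FD9u, "L1" }, { 0x04140364u, "D1" },
    { 0x02140FD9u, "L1" }, { 0x030C0364u, "D1" },
    { 0x02100364u, "D1" }, { 0x021003E6u, "D2" },
};

/* p-bit: bit 0 set means the NEXT instruction is in the same execute packet. */
static int p_bit(uint32_t opcode) { return (int)(opcode & 1u); }

/* Returns the index of the first instruction whose unit is already used
   earlier in the same execute packet, or -1 if there is no such conflict. */
int find_unit_conflict(const insn_t *code, size_t n)
{
    size_t start = 0;  /* index of the first instruction of the current packet */
    size_t i, j;

    assert(code != NULL);
    for (i = 0; i < n; i++) {
        for (j = start; j < i; j++) {
            if (strcmp(code[i].unit, code[j].unit) == 0)
                return (int)i;
        }
        if (!p_bit(code[i].opcode))
            start = i + 1;  /* packet ends with this instruction */
    }
    return -1;
}
```

With the captured words, the packet starting at 0x84035480 contains .L1, .D1 and .D1, so the second LDDW is flagged. With the p-bit at 0x84035484 cleared, the two LDDW instructions fall into separate packets and nothing is flagged.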
And the move from A5 to A4 (OR 0) at 0x84035480 seems like an odd (unnecessary) instruction since A4 will be overwritten before that value is used.
More context is needed, or the example test case so this can be duplicated by our compiler team. And if this is the case, we will need to move this to the Compiler Forum for someone to address from there.
I think our last replies crossed over.
I have seen that the program memory is corrupt, and this leads to the illegal parallel instruction combination.
I currently do NOT believe this is a compiler or library issue.
I think I have now come full circle back to two of your original suggestions!
Thank you for all the help - it is very much appreciated.
Yes, my previous post was written before I saw you post a few minutes ahead of mine. Sorry. That makes it more difficult to figure out context.
A single-bit error like this is more likely to be a device or DDR-EMIF timing issue. An errant |= (OR-equals) instruction could do it, but the chance seems low to me.
Do you have other boards that have the same or similar issues?
It may help you to keep track of the address, value, and bit position of the words that you find are wrong. Some correlation or no correlation could help point you to the inside or outside of the SDRAM and the DSP.
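For that kind of bookkeeping, a sketch along the following lines could compare a known-good copy of a code region against what is currently in memory and record the address, both values, and the flipped bit position for each bad word. The names are hypothetical, and where the reference copy comes from (for example a re-read of the .out file through the debugger) is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t addr;          /* address of the bad word             */
    uint32_t expected;      /* value in the known-good image       */
    uint32_t observed;      /* value found in memory               */
    int      flipped_bits;  /* how many bits differ                */
    int      lowest_bit;    /* position of the lowest differing bit */
} corruption_t;

/* Returns the number of differing bits; *lowest_bit is -1 if none differ. */
int bit_diff(uint32_t expected, uint32_t observed, int *lowest_bit)
{
    uint32_t diff = expected ^ observed;
    int count = 0;
    int bit;

    assert(lowest_bit != NULL);
    *lowest_bit = -1;
    for (bit = 0; bit < 32; bit++) {
        if (diff & ((uint32_t)1 << bit)) {
            if (*lowest_bit < 0)
                *lowest_bit = bit;
            count++;
        }
    }
    return count;
}

/* Compares nwords words; fills up to log_cap records and returns the
   total number of mismatching words (which may exceed log_cap). */
size_t compare_image(const uint32_t *ref, const uint32_t *mem,
                     uint32_t base_addr, size_t nwords,
                     corruption_t *log_out, size_t log_cap)
{
    size_t i, found = 0;

    assert(ref != NULL && mem != NULL);
    for (i = 0; i < nwords; i++) {
        if (ref[i] != mem[i]) {
            if (log_out != NULL && found < log_cap) {
                corruption_t *c = &log_out[found];
                c->addr     = base_addr + (uint32_t)(i * 4u);
                c->expected = ref[i];
                c->observed = mem[i];
                c->flipped_bits = bit_diff(ref[i], mem[i], &c->lowest_bit);
            }
            found++;
        }
    }
    return found;
}
```

For the word reported earlier in this thread (0x030C0364 expected, 0x030C0365 observed at 0x84035484) this gives one flipped bit at position 0. Collecting a few dozen such records should show whether the same bit lane or the same address bits keep turning up.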
From an historical perspective, this debug process you are going through is much easier than it used to be. The C64x+ MegaModule added the Exception module to allow you to catch things like this. Before that, the problem would not show up until much farther down the code execution stream when the DSP is totally lost. That is the good news.
The other good news is that you seem to be getting close, and your management should be impressed with your skills.
Apologies for not providing any details related to my query.
What was the conclusion for the OPX+RCX exceptions?
By the way, I am facing a similar issue in Math Lib.
If you have a new problem, please start a new thread. You will want to supply a lot more information than you have told us above.
Since this thread was marked as answered, the conclusions are either within this thread or else the solution was found from following this thread.
In your new thread, you may want to say what "Math Lib" is. I am not familiar with this.