How do I resolve Internal Resource Conflict Exceptions?

Paul Newton

Other Parts Discussed in Thread: TMS320C6474

Hello,

I am working on a single core of a custom TMS320C6474 board (rev 2.1 silicon). (The other two cores are being held suspended by the CCS v4 debugger.)

I am running DSP/BIOS 5 with two TSK objects and one SWI. In addition I have the ethernet stack and whatever it adds in the way of SWIs, HWIs and TSKs.

I am having problems with internal exceptions. These are mostly Resource Conflicts "RCX", but occasionally Opcode "OPX" or Instruction Fetch "IPX" Exceptions.

My application runs various DSP algorithms on data held in DDR2 RAM. One TSK contains the DSP code to do this, the other just initialises and monitors the SWI that copies dummy data into an input buffer. (The SWI would normally take real ADC data from SRIO, but enabling SRIO causes even more exceptions - so that is disabled currently).

I previously had an internal resource conflict exception "RCX" running the same code but on the C6748. This was due to bugs in the TI fast math libraries. http://e2e.ti.com/support/dsp/tms320c6000_high_performance_dsps/f/115/t/133991.aspx#481095

In this case I have removed the fastrts64x.lib library but my problems have not gone away.

I have a breakpoint on EXC_dispatch that is catching the exceptions when they occur. In my current build, the exception seems to always be RCX with NRP/ERP pointing to the same address - a CALL instruction. The call is to a function in another C file - not to a TI library. It takes in the order of a minute for the exception to occur, so I am certain that the CALL instruction code has run many times already (and hence I don't want to put a breakpoint on it). I think that something else must be happening to trigger the exception.

I have been working on this for a long time now, and with previous RCX exceptions I have found by trial and error that making very small changes to the code will cure the current exception and move the problem elsewhere.

Things I am not sure about:

the NRP/ERP point to the instruction where the exception happened, and this is a CALL instruction - Has the CALL executed yet? i.e. did the exception occur before or after the function was called? (If it is after, then I need to consider the called function.)
what are the possible causes of the RCX exception? The problem with the TI library code was it used a register and did not allow enough cycles before returning - so I know that re-using registers can be an issue. Are there other causes - the documentation is not very clear about the exceptions.
could interrupts be causing the exceptions? Previously this code ran OK on the same silicon in a loopback mode where I did not have an SWI feeding in data. (Note that this loopback test used ethernet to send and receive data OK).
generally - how am I supposed to go about debugging this?

Looking forward to some helpful suggestions - Regards Paul

over 13 years ago

0 Paul Newton over 13 years ago

Intellectual 340 points

This is another symptom I get - but only very occasionally.

(The code carries on running, but I lose my debug that is being output to a web browser.)

Console output:

00052.358 Illegal reentrant call to llEnter()
00052.359 Illegal reentrant call to llEnter()
Service Status: Telnet : Disabled : : 000
00052.361 Illegal reentrant call to llEnter()
Service Status: HTTP : Disabled : : 000
Network Removed: If-1:10.0.0.2
00052.364 Illegal reentrant call to llEnter()
00052.365 Illegal reentrant call to llEnter()
00052.366 Illegal reentrant call to llEnter()

0 RandyP over 13 years ago in reply to Paul Newton

TI__Guru* 84110 points

Paul,

As you learned before, finding the exact cause is not easy, but the CPU Exception support does make it a lot easier.

Depending on the exact exception reported, there are different tables and figures in the CPU & Instruction Set Reference Guide to help you understand the information that gets presented in the exception registers. The NRP value may be a few instructions later than the instruction that had the exception. The explanation in the CPU Reference Guide says that NRP will hold the address of the first instruction packet that was annulled, so it might not be the actual instruction that failed.
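As a rough aid for reading the exception registers mentioned above, the following sketch turns an IERR value into a list of flag names. The bit positions and the names (IFX, FPX, EPX, OPX, RCX, RAX, PRX, LBX, MSX) are an assumption recalled from the C64x+ CPU reference guide, and they differ slightly from the abbreviations used earlier in this thread, so check them against the guide before relying on them. The function name ierr_decode is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed IERR bit layout (verify against the CPU reference guide). */
typedef struct {
    unsigned    bit;
    const char *name;
    const char *meaning;
} ierr_bit_t;

static const ierr_bit_t IERR_BITS[] = {
    {0, "IFX", "instruction fetch exception"},
    {1, "FPX", "fetch packet exception"},
    {2, "EPX", "execute packet exception"},
    {3, "OPX", "opcode exception"},
    {4, "RCX", "resource conflict exception"},
    {5, "RAX", "resource access exception"},
    {6, "PRX", "privilege exception"},
    {7, "LBX", "loop buffer exception"},
    {8, "MSX", "missed stall exception"},
};

/* Writes a '+'-separated list of the set flags into out.
 * Returns the number of flags that were set. */
int ierr_decode(uint32_t ierr, char *out, size_t out_len)
{
    int count = 0;
    size_t i;

    assert(out != NULL && out_len > 0);
    out[0] = '\0';
    for (i = 0; i < sizeof IERR_BITS / sizeof IERR_BITS[0]; i++) {
        if (ierr & (1u << IERR_BITS[i].bit)) {
            if (count > 0) {
                strncat(out, "+", out_len - strlen(out) - 1);
            }
            strncat(out, IERR_BITS[i].name, out_len - strlen(out) - 1);
            count++;
        }
    }
    return count;
}
```

Under the assumed layout, an IERR of 0x10 decodes to "RCX" and 0x18 to "OPX+RCX".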

Start by looking at where the NRP is pointing in the Disassembly Window and scroll up a few lines. See if anything looks unusual. Look up the address in your linker .map file to find out which object module you are executing in, then try to determine what might be unique about that area.

It is very rare to find this to be a problem with compiled code or libraries, in spite of the fact that you have already encountered this once. The more common cause in my experience has been memory corruption for one of the following reasons:

Overlap between cache and L2 SRAM. This happens when your linker command file allows the L2 SRAM space to be larger than the remainder after subtracting out the L2 cache space that is allocated. The cache space will corrupt program and data in SRAM, and vice versa.
Errant data pointers write to program space in the SRAM area.
Power supply noise can cause memory corruption. This could be due to external noise events or insufficient decoupling capacitors or under-driven power supplies.

These are the most common causes that I can think of. You should be able to tell if the memory is corrupted or not by inspecting the area prior to the NRP address for incorrect code.
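The first cause in the list (cache and L2 SRAM overlapping) comes down to simple address arithmetic, which can be sketched as below. This assumes the L2 cache is carved from the top of the L2 address range; the base address, sizes and the function name l2_overlap_bytes are illustrative assumptions, not values taken from this project's linker command file.

```c
#include <assert.h>
#include <stdint.h>

/* Returns how many bytes of a linker memory region fall inside the part
 * of L2 that is configured as cache (0 means no overlap).
 * Assumption: cache occupies the highest cache_size bytes of L2. */
uint32_t l2_overlap_bytes(uint32_t l2_base, uint32_t l2_size,
                          uint32_t cache_size,
                          uint32_t region_base, uint32_t region_len)
{
    uint32_t cache_start;  /* first address owned by the cache */
    uint32_t region_end;   /* one past the last byte of the region */
    uint32_t start;

    assert(cache_size <= l2_size);
    cache_start = l2_base + (l2_size - cache_size);
    region_end  = region_base + region_len;

    /* Overlap begins at whichever is higher: region start or cache start. */
    start = (region_base > cache_start) ? region_base : cache_start;
    return (region_end > start) ? (region_end - start) : 0u;
}
```

For example, with a hypothetical 1 MB L2 at 0x00800000 and 256 KB of cache, a linker region covering the full 1 MB overlaps the cache by 0x40000 bytes, while a region of 0xC0000 bytes does not overlap at all.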

You may also want to look for correlation in the addresses where errors occur. Is it always the same address, or always a similar address?

Regards,
RandyP

0 Paul Newton over 13 years ago in reply to RandyP

Intellectual 340 points

Hi Randy,

Thanks for the reply.

I am working through some analysis now, and will get back to you for some more input.

Regards - Paul

0 Paul Newton over 13 years ago in reply to Paul Newton

Intellectual 340 points

Hello Randy,

I have checked that the L2 RAM and cache do not overlap, and just to make sure, I tried disabling all cache; exceptions still occur, but at a slightly different point in the code. (I think I mentioned before that if I change anything, like adding an extra ASSERT statement, the problem moves.)

I am very limited on what I can do with the hardware as it is buried inside several layers of metal in a fan cooled unit - hence I am unable to check for power supply noise.

However we have run an extensive DDR2 memory soak test and are happy that the DDR2 memory is behaving itself when we don't have a full application running.

I have spent many days now examining the registers after the exceptions occur.

The current symptom I am seeing is that the processor triggers an RCX + PFX exception. The ERP/NRP/B3 registers all point to this occurring in the middle of data on my heap, so it appears the code has jumped somewhere stupid.

Looking at the stack pointer, the location pointed to by the stack pointer contains the address I see in the ERP/NRP/B3 registers. This has led me to believe that the stack pointer has been corrupted, and then a return from function call caused a jump to the data heap area.

Looking at the data stored on the stack on either side of the stack pointer, I can see other return addresses. At SP minus two words and at SP plus ten words. These return addresses are consistent, and if I continue to follow the stack to the bottom (increasing SP addresses) I can see the whole call chain with correct SP adjustments at each call.
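The manual search for return addresses described above can be partly automated on a memory dump. The sketch below scans a copy of the stack for words that fall inside the code address range. The function name find_code_words and the address range used in the example are hypothetical, and a hit is only a candidate: ordinary data can happen to look like a code address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scans a dump of stack words and records the index of every word whose
 * value lies in [code_lo, code_hi). Such words are candidate return
 * addresses. Stores at most max_out indices; returns the total hit count. */
size_t find_code_words(const uint32_t *stack, size_t n_words,
                       uint32_t code_lo, uint32_t code_hi,
                       size_t *idx_out, size_t max_out)
{
    size_t i;
    size_t found = 0;

    assert(stack != NULL && idx_out != NULL);
    for (i = 0; i < n_words; i++) {
        if (stack[i] >= code_lo && stack[i] < code_hi) {
            if (found < max_out) {
                idx_out[found] = i;
            }
            found++;
        }
    }
    return found;
}
```

Comparing the spacing between hits with the frame sizes the compiler reports for each function is one way to spot a frame whose stack pointer adjustment does not add up.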

It appears that during the second to last function call, the SP was decremented by 12 words. Subsequently the last function was called and B3 was placed at the SP. But, when that function came to return the SP was now 2 words greater than it should have been, and the value loaded into B3 from the SP actually points to the data heap and not the caller.

Currently the scenario is repeatable, the address in the heap is not always the same, and the SP is often (but not always) the same. My explanation of it being a corrupted SP feels consistent with the symptoms, but I can't work out why it is happening.

I suspect that it may be due to an interrupt going off during the last function call. On the stack above the current SP (decrementing address) there is a lot of data including the odd code address. This data does not match the call signature of any function that might have been called. I have seen that when I single step through the code the stack can "explode" during a single step - hence I know that interrupts can add lots of data to the top of the stack.

I have followed what appear to be a few red herrings. One such lead was that on making a very small change to the code, the exceptions began to occur in the ethernet tx interrupt. Hence I added counters to that interrupt to determine if it was going off whilst the code was executing. Initially it did look like this might be the culprit, but eventually I caught an occurrence where it was apparent that the ethernet interrupt had not gone off recently.

So I am a bit stuck.

I know that turning off the SRIO interrupts does not make the problem go away (it does however make it take a lot longer to happen).

I could try disabling the ethernet traffic, but that will leave me without any indication of whether the code is running. (If I use debug printfs, it adds very large delays to the code and I lose sync with the SRIO data.)

I have looked for possible causes of stack corruptions by the code that is running. For example, I am performing FFTs - I have found that the FFTs are running on data in the heap and not on the stack. But, this still would not explain the corruption of the SP itself.

Can you throw me any more ideas please?

Regards - Paul

0 RandyP over 13 years ago in reply to Paul Newton

TI__Guru* 84110 points

Paul,

If I understand you correctly, you have debugged a tremendous amount and have a lot of new information. The original question was about how to fix the particular exception you were getting, and now you have traced the problem much deeper and made it more specific, even if it does not sound like it in a "manager's summary".

It appears that the problem has been tracked back to executing from data. A corrupted stack is the easiest way for this to happen since a return address is often pulled off the stack and then the branch back/return occurs to that popped address.

I am a little surprised that you said ERP/NRP/B3 lead you to the data execution. ERP/NRP will point you to the approximate location where the exception occurred. But B3 has to be loaded with a value for a return address, and it would be very strange to execute instructions in the heap that save the current address into B3 and branch to it. Am I reading too much into this?

Since you mentioned DDR2, I would guess that this is where your heap resides. If it were in L2RAM then there are memory protection features available that could cause an exception trap as soon as you start executing in the heap area. In fact, since you are not using cores 1 & 2 right now, you might be able to set up those cores to be running a minimum amount of code, just enough to use the Memory Protection features and handle exceptions, then allow Core0 to put its heap in the L2 for one of the other cores. You could try just relocating the heap, if it fits, and see if the exception still occurs in that area before trying to implement the small footprint code to get a trap.

CCS supports hardware breakpoints that can cause a halt when you execute from an address or range of addresses. I have not used the C6474 for quite a while, so I do not remember the exact syntax or the menus to use. But through the CCS Breakpoint Manager, you can set the type and features of a breakpoint. This may be a good way to capture the first time an execution tries to happen in the data area.

What timeframes are you talking about before corruption occurs with the different interrupt configurations? Minutes, seconds, hours?

I do have a lot of pride and confidence in the libraries that TI supplies, including the run-time support libraries. We have found bugs in the past, but that is very rare. Any bugs like what you are seeing are rare. At this point, I would recommend that you try to inventory all the libraries and sources of code that could have any chance of causing the stack pointer corruption that you have observed. C-compiled functions should be 99% safe (really 100% but we can allow for any possibility), so any object libraries and assembly functions will be useful places to start.

Those are my ideas for now.

Regards,
RandyP

0 Paul Newton over 13 years ago in reply to RandyP

Intellectual 340 points

Hi Again, thanks for the quick response.

Yes lots of debugging, lots of single stepping through stuff I can only guess at - so deep in fact that I have a decompression chamber on standby in case of getting the Bends.

I had to keep the detail to a "manager's" description, or else I would have been typing all day - I have weeks of notes now.

The original question was what might be causing the RCX, however since the mode of failure has changed yet again (it is now very obvious I am trying to execute data) the original question is now obsolete.

To make sure I don't miss any of your points, I have broken down your reply and answered it piecemeal below:

ERP/NRP/B3 all the same:

I suspect that the stack pointer itself has been corrupted (offset by two words from where it should have been). Then a function has returned by loading B3 from the stack and jumping to it.
I was wondering if an interrupt might be storing the SP on the stack and the SP got corrupted whilst stored and was later read back into the SP corrupted.
I was actually single stepping through the HWI_F_dispatch code when your reply arrived.

I think I have just seen that on entry the SP was 0x00898D60, but on exit (at the B IRP), it was 0x00898D68.
Is that a coincidence, or am I missing something?

Heap:

Yes the heap is in DDR2 along with my user app code.
The stacks and most if not all the bios code are all in L2 SRAM.
There is no chance the heap will fit in L2 memory, my application uses a lot of very large buffers in addition to those used by UDP packets, etc.

Multi-core code:

I have never written any code for the other cores, previous projects have been on the C54xx and C55xx.
Writing code for the other cores was something I was hoping I would not have to do.
I have both the other cores suspended using the debugger.
I have manually linked in an IDLE instruction into the image for both of them just in case.

Hardware Breakpoints:

I will look into these, but I think it will just confirm that the first instruction executed is the one in the B3/ERP/NRP.

Time:

I run my app and then use a web page to make a configuration and set off the "real" code.
"Usually" I don't see a problem until after I make the selection on the web page.

Sometimes (1 in ten or twenty) the initialisation code that runs from c_int00 fails to start properly, other times the ethernet init code can hang in a loop.
Usually, once in this initialisation crashed state I have to power cycle everything - sometimes that involves the emulator and sometimes even the PC.

Having made the selection on the web page, it can take anything between 5 seconds and sometimes up to 25 minutes before the code encounters an exception.
I was taking 25 minutes to be the cut off for a good build, but then one day I ran such a build a second time and it failed in 5 mins.

Libraries:

Once I have inventoried the libraries - what should I do with the information?
Would you be able to spot any inter library compatibility issue given the details?

Thanks again - Paul

0 Paul Newton over 13 years ago in reply to Paul Newton

Intellectual 340 points

Hi Randy,

I am trying to configure a hardware breakpoint as suggested.

I can't find any useful documentation that tells me how to set up a hardware breakpoint that uses a range of addresses.

The document SPRS552H (TMS320C6474 data manual) says that the 6474 can have hardware breakpoints that work over a range of addresses:

7.20.1 Advanced Event Triggering (AET)
The C6474 device supports Advanced Event Triggering (AET). This capability can be used to debug
complex problems as well as understand performance characteristics of user applications. AET provides
the following capabilities:
• Hardware Program Breakpoints: specify addresses or address ranges that can generate events such
as halting the processor or triggering the trace capture.

But when I follow the links to the other suggested documents, SPRA753 and SPRA387, I find they were written over ten years ago and the menus they refer to don't exist in CCS v4.

Does CCSv4 allow setting of hardware address range breakpoints - or just single address ones?

Paul

0 RandyP over 13 years ago in reply to Paul Newton

TI__Guru* 84110 points

Paul,

Libraries:

What I was thinking was to figure out what code might have come from a suspect source. If someone wrote an assembly ISR for extra speed, they may have missed some branch path that causes a disturbance. I cannot imagine the TI C Compiler causing an error in the SP between entry and exit, so that conceit leads me to look elsewhere. Of course, I have to admit that bugs have been found in the compiler in the past, in BIOS functions, in RTS functions, and in most code that has been written by anyone. But when the same code is used by thousands of users, it gets tested pretty well.
There is no inter library compatibility issue that I can think of other than the obvious ones like endianness and processor version.

ERP/NRP/B3 all the same:

I missed the (obvious to you) point that B3 could have been popped off the stack to be the return address. The first sample .asm file I looked at did just that. I also noticed that it restored the SP by copying the FP register to SP directly, so that is another register that could get corrupted in a non-compiler-generated function; I have done that myself when writing assembly and not realizing how many registers were reserved for the compiler's use. The Compiler User's Guide has the details on which registers are used and reserved under different circumstances.

Paul Newton said:

I was actually single stepping through the HWI_F_dispatch code when your reply arrived.

I think I have just seen that on entry the SP was 0x00898D60, but on exit (at the B IRP), it was 0x00898D68.

Is that a coincidence, or am I missing something?

I think this is evidence of skilled debug. Having the SP change by 8 should crash the system every time. Unless there is some detail about the way the interrupt processing works so that this is always the case, this must be an envelope around the actual cause.

If you are using SYS/BIOS, you can get the full source code by running a custom build. If that works smoothly, you can leave it that way, although builds tend to take a bit longer with all that BIOS source code getting included. But it could help if you could add some debug code into HWI_F_dispatch to help you figure out what is going wrong.

It would be good to know if this SP corruption occurs only on a particular HWI, and which one, and what it calls. Then you can insert debug points into your own code to catch whether the SP changes in the HWI_F_dispatch routine or in the called ISR. Just to toss out a random information point, you would know if anyone used the interrupt keyword mistakenly with a function that is called by the HWI Dispatcher, because that would cause bad results.
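A debug point of the kind described above could be as simple as recording the SP at entry and comparing it at exit. The sketch below shows only the bookkeeping and takes the SP values as plain arguments; how to read B15 on the target is compiler-specific and is not shown. The names sp_guard_t, sp_guard_enter and sp_guard_exit are hypothetical, and the single saved slot assumes the guarded routine is not re-entered (no nesting).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bookkeeping for one guarded routine (not safe for nested entry). */
typedef struct {
    uint32_t entry_sp;    /* SP recorded at entry */
    uint32_t mismatches;  /* how many exits saw a different SP */
    int32_t  last_delta;  /* exit SP minus entry SP, in bytes, at last mismatch */
} sp_guard_t;

/* Call at the top of the routine with the current SP value. */
void sp_guard_enter(sp_guard_t *g, uint32_t sp)
{
    assert(g != NULL);
    g->entry_sp = sp;
}

/* Call just before returning. Returns the SP change in bytes; any
 * non-zero value means the stack was left unbalanced. */
int32_t sp_guard_exit(sp_guard_t *g, uint32_t sp)
{
    int32_t delta;

    assert(g != NULL);
    delta = (int32_t)(sp - g->entry_sp);
    if (delta != 0) {
        g->mismatches++;
        g->last_delta = delta;
    }
    return delta;
}
```

Assuming the two SP values quoted earlier were 0x00898D60 at entry and 0x00898D68 at exit, this reports a delta of 8 bytes, i.e. two 32-bit words. Placing one guard around the dispatcher and one around each ISR it calls would show which of them first sees the imbalance.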

There are debug tools built into the C6474 that can help you figure out where the processor has been executing before reaching a breakpoint. The ETB module is mentioned in the datasheet in Section 7.20 along with a pair of application notes that may be helpful. Even if you do not have a 60-pin header on your board, there are ways to access the ETB through code or emulation. If you decide that will be helpful, please post a new thread on the E2E Code Composer Forum under Development Tools since the emulation experts live there. If you do that, please post links from this thread to that one and vice versa to help people to easily find your eventual solution and to let the experts understand your problem.

Regards,
RandyP

0 Paul Newton over 13 years ago in reply to RandyP

Intellectual 340 points

Hi Randy,

Previously I could not see how to create an address range breakpoint. Well, I finally figured out how to set a hardware "address range" breakpoint in CCS v4.2.5.00005:

First, go to any piece of code in the source window and set a new hardware breakpoint with right click -> New Breakpoint -> Hardware Breakpoint. You will NOT get the option to set it as an address range at this stage.
Then, in the Breakpoints view (tab), right click the breakpoint that you just created and select Properties. If you click on the word "point" in the location type field, you can change it from point to range and then set the address range in the fields below.

Having set up a breakpoint for addresses in the heap, the code that was reliably failing all morning when it jumped to the heap, now fails at a completely different location.

This time however, I think the assembler code may be suspect. (There is a conditional branch just prior to this assembler so this code may not have executed before).

At 0x8403546C, A5 is loaded, but I am not sure if it will be ready by 0x8403547C when it is used.
Also, at 0x84035480, an operation is performed on A4 and A5, in parallel with a load into A4:A5.

The ERP/NRP is 0x84035480, IERR = RCX, and the SP appears to be correctly aligned to the current function's stack data. The exception pointer is nowhere near a call instruction as it has been in the past.

Would you please tell me if there is an issue with the following assembler:

506     *pdSnk++ = _itod(_cmpyr1(_hi(*pdSnk), _hi(*pdWin)),_cmpyr1(_lo(*pdSnk), _lo(*pdWin)));
            C$DW$L$_function_name$31$B, C$L90:
0x84035468:   02005FEE            LDW.D2T2      *+B15[95],B4
0x8403546C:   02805EEC            LDW.D2T1      *+B15[94],A5
0x84035470:   0F805EEE            LDW.D2T2      *+B15[94],B31
0x84035474:   00004000            NOP           3
0x84035478:   01901FD9            OR.L1X        0,B4,A3
0x8403547C:   04140364 ||         LDDW.D1T1     *+A5[0],A9:A8
0x84035480:   02140FD9            OR.L1         0,A5,A4
0x84035484:   030C0365 ||         LDDW.D1T1     *+A3[0],A7:A6
0x84035488:   02100364 ||         LDDW.D1T1     *+A4[0],A5:A4
0x8403548C:   021003E6            LDDW.D2T2     *+B4[0],B5:B4
0x84035490:   00004000            NOP           3
0x84035494:   12188330            CMPYR1.M1     A4,A6,A4
0x84035498:   12953330            CMPYR1.M1X    A9,B5,A5
0x8403549C:   02FD005A            ADD.L2        8,B31,B5
0x840354A0:   02805EFE            STW.D2T2      B5,*+B15[94]
0x840354A4:   00000000            NOP
0x840354A8:   027C03C4            STDW.D2T1     A5:A4,*+B31[0]

Thanks - Paul

0 Paul Newton over 13 years ago in reply to Paul Newton

Intellectual 340 points

I have just re-loaded the above code and single stepped through it.

The code in the previous post has the value 030C0365 at 0x84035484 which makes the next instruction parallel.

In the freshly re-loaded version, the value is 030C0364 with the LSBit set to zero, so the next instruction is now NOT parallel and there is no exception.

(Manually modifying the value to 030C0365, I have verified that it does cause an exception.)
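The effect of that single bit can be illustrated with a small sketch that groups instruction words into execute packets using bit 0 (the p-bit): when it is set, the next instruction executes in parallel with the current one. This is a simplification that ignores fetch packet boundaries and compact (16-bit) instruction formats, and the function name split_execute_packets is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Splits n 32-bit instruction words into execute packets by p-bit (bit 0).
 * packet_len[k] receives the instruction count of packet k (at most
 * max_packets entries are written). Returns the number of packets. */
size_t split_execute_packets(const uint32_t *words, size_t n,
                             size_t *packet_len, size_t max_packets)
{
    size_t i;
    size_t packets = 0;
    size_t len = 0;

    assert(words != NULL && packet_len != NULL);
    for (i = 0; i < n; i++) {
        len++;
        /* A clear p-bit (or the end of the input) closes the packet. */
        if ((words[i] & 1u) == 0u || i + 1 == n) {
            if (packets < max_packets) {
                packet_len[packets] = len;
            }
            packets++;
            len = 0;
        }
    }
    return packets;
}
```

Fed the three words at 0x84035480 to 0x84035488, the corrupted image (02140FD9, 030C0365, 02100364) forms one packet of three instructions, while the freshly loaded image (02140FD9, 030C0364, 02100364) forms a packet of two followed by a packet of one.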

So in this case, I appear to have had a corruption of some sort.

A memory corruption like this could cause all sorts of other symptoms too.

I'll keep debugging....

0 RandyP over 13 years ago in reply to Paul Newton

TI__Guru* 84110 points

Paul,

Yes, there are problems with this code. The listing above implies that the C code with the _itod intrinsic compiled to generate the code shown. If this is the case, then it is a serious compiler bug and we would need a small example project from you that duplicates this code generation. Is this from a .lst file or from the disassembly window?

The two parallel instructions at 0x84035484 & 0x84035488 both use the same functional unit, and that will cause a resource conflict exception.
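The rule being applied here, that no two instructions in one execute packet may use the same functional unit, can be sketched as a simple duplicate check. The unit names are passed in as strings read off the disassembly (for example "D1" for LDDW.D1T1); the function name first_unit_conflict is hypothetical, and a real checker would also need the other constraints (cross paths, register ports, and so on) that are not covered here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Given the functional unit used by each instruction in one execute
 * packet, returns the index of the first instruction that reuses a unit
 * already claimed earlier in the packet, or -1 if there is no reuse. */
int first_unit_conflict(const char *const units[], size_t n)
{
    size_t i;
    size_t j;

    assert(units != NULL);
    for (i = 1; i < n; i++) {
        for (j = 0; j < i; j++) {
            if (strcmp(units[i], units[j]) == 0) {
                return (int)i;
            }
        }
    }
    return -1;
}
```

For the packet in the listing, the units are L1, D1 and D1, so the third instruction (index 2) is flagged; a packet of only L1 and D1 passes.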

And the move from A5 to A4 (OR 0) at 0x84035480 seems like an odd (unnecessary) instruction since A4 will be overwritten before that value is used.

More context is needed, or the example test case so this can be duplicated by our compiler team. And if this is the case, we will need to move this to the Compiler Forum for someone to address from there.

Regards,
RandyP

0 Paul Newton over 13 years ago in reply to RandyP

Intellectual 340 points

Randy,

I think our last replies crossed over.

I have seen that the program memory is corrupt, and this leads to the illegal parallel instruction combination.

I currently do NOT believe this is a compiler or library issue.

I think I have now come full circle back to two of your original suggestions!

something in the code that has written to the program area and corrupted it
an issue with our hardware that is resulting in a bad read or write to program / data memory.

Thank you for all the help - it is very much appreciated.

Paul

0 RandyP over 13 years ago in reply to Paul Newton

TI__Guru* 84110 points

Paul,

Yes, my previous post was written before I saw your post from a few minutes ahead of mine. Sorry. That makes it more difficult to figure out context.

A single-bit error like this is more likely to be a device or DDR-EMIF timing issue. An errant |= (OR-equals) instruction could do it, but the chance seems low to me.

Do you have other boards that have the same or similar issues?

It may help you to keep track of the address, value, and bit position of the words that you find are wrong. Some correlation or no correlation could help point you to the inside or outside of the SDRAM and the DSP.
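One way to collect the address, value and bit position data suggested above is to compare a known-good copy of a code region against what is currently in memory. The sketch below works on two word arrays supplied by the caller (for example, a saved dump and a fresh dump); the names diff_t, diff_words and lowest_bit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One differing word: where it is, what it should be, what it is. */
typedef struct {
    uint32_t addr;      /* target address of the word */
    uint32_t expected;  /* value in the reference image */
    uint32_t actual;    /* value found in memory */
    uint32_t flipped;   /* expected XOR actual: one set bit per wrong bit */
} diff_t;

/* Compares n words of ref against cur. Records up to max_out differences
 * in out and returns the total number of differing words. */
size_t diff_words(uint32_t base_addr, const uint32_t *ref,
                  const uint32_t *cur, size_t n,
                  diff_t *out, size_t max_out)
{
    size_t i;
    size_t found = 0;

    assert(ref != NULL && cur != NULL && out != NULL);
    for (i = 0; i < n; i++) {
        if (ref[i] != cur[i]) {
            if (found < max_out) {
                out[found].addr     = base_addr + (uint32_t)(4u * i);
                out[found].expected = ref[i];
                out[found].actual   = cur[i];
                out[found].flipped  = ref[i] ^ cur[i];
            }
            found++;
        }
    }
    return found;
}

/* Returns the position of the lowest set bit in mask, or -1 if mask is 0. */
int lowest_bit(uint32_t mask)
{
    int pos;

    for (pos = 0; pos < 32; pos++) {
        if (mask & (1u << pos)) {
            return pos;
        }
    }
    return -1;
}
```

Applied to the three words starting at 0x84035480, this reports a single difference at 0x84035484 with bit 0 flipped. Logging such records over many failures shows whether the same address lines or data bits keep recurring.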

From an historical perspective, this debug process you are going through is much easier than it used to be. The C64x+ MegaModule added the Exception module to allow you to catch things like this. Before that, the problem would not show up until much farther down the code execution stream when the DSP is totally lost. That is the good news.

The other good news is that you seem to be getting close, and your management should be impressed with your skills.

Regards,
RandyP

0 Pavan D over 13 years ago in reply to RandyP

Prodigy 10 points

Hi Randy/Paul,

Apologies for not providing any details related to my query.

What is the conclusion for the OPX+RCX exceptions?

By the way, I am facing a similar issue in Math Lib.

Thanks .

0 RandyP over 13 years ago in reply to Pavan D

TI__Guru* 84110 points

Pavan,

If you have a new problem, please start a new thread. You will want to supply a lot more information than you have told us above.

Since this thread was marked as answered, the conclusions are either within this thread or else the solution was found from following this thread.

In your new thread, you may want to say what "Math Lib" is. I am not familiar with this.

Regards,
RandyP
