TMS320F28335 Vectors being fetched from 0x0000+offset instead of PIE vector space, ENPIE=1, VMAP=0

Geoffrey Mortimer

A quick question. I am using a TMS320F28335 with CCS 5.3.00090 and an XDS560v2. I am trying to debug a rather nasty problem in which the stack pointer seems to get corrupted, and then bad data appears in the registers. Examination of the stack suggests that an IRET is executing with the stack pointer pointing to the wrong place, corrupting the registers that are saved automatically during interrupt response. This results in AMODE=PAGE0=1 and VMAP=0, so a bit of a mess. Examination of the stack shows a storm of interrupts, the first of which I can't identify, but which jumps to 0000h (the stacked return address is 0x00000001). There it attempts to execute an addressing mode which, due to AMODE=PAGE0=1, causes an illegal opcode trap. The vector for this is fetched from 0x00000026, which, with ENPIE still set to 1 and the vectors intact at 0xD00, makes no sense to me. In the default state of the system, the address at 0x00000026 points to another illegal opcode, hence the storm of interrupts which overflows the stack. I have verified with hardware watchpoints that the vector is indeed being fetched from 0x00000026, using the debugger to load different addresses there.

Two questions:

1) My understanding was that if ENPIE is set, the vectors would automatically be fetched from the PIE vector table, regardless of the state of VMAP. This is demonstrably not happening.

2) Another puzzle is that the debugger appears to allow me to single step even though the DBGM bit remains set. I have tried this after a reset and it still works, single steps seem to be possible with DBGM set. Also, I managed to trigger a HW watchpoint with DBGM seemingly set.

Any help at all appreciated.

Cheers
Geoff

over 9 years ago

0 Lori Heustess over 9 years ago

TI__Guru* 91275 points

Geoffrey Mortimer said:
1) My understanding was that if ENPIE is set, the vectors would automatically be fetched from the PIE vector table, regardless of the state of VMAP. This is demonstrably not happening.

No. VMAP has to be 1 in order for the PIE to respond. If VMAP is 0, then ENPIE is ignored. Refer to table 109 "Interrupt vector table mapping" in www.ti.com/lit/SPRUFB0D
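As a rough illustration of that mapping rule, here is a minimal host-side sketch in C. The function and type names are hypothetical, and it assumes M0M1MAP=1 (the normal setting) and only the CPU-level vectors at offsets 0x00 to 0x3E. The base addresses are the M0 RAM vectors, the boot ROM vectors, and the PIE vector table.

```c
#include <assert.h>
#include <stdint.h>

/* Where the CPU fetches its interrupt vectors from (hypothetical names). */
typedef enum {
    VECTORS_M0_RAM,    /* 0x000000-0x00003F: VMAP = 0, ENPIE ignored */
    VECTORS_BOOT_ROM,  /* 0x3FFFC0-0x3FFFFF: VMAP = 1, ENPIE = 0     */
    VECTORS_PIE        /* 0x000D00-0x000DFF: VMAP = 1, ENPIE = 1     */
} vector_source_t;

/* VMAP is checked first; ENPIE only matters when VMAP is 1. */
vector_source_t vector_source(int vmap, int enpie)
{
    if (!vmap) {
        return VECTORS_M0_RAM;
    }
    return enpie ? VECTORS_PIE : VECTORS_BOOT_ROM;
}

/* Address a CPU-level vector is fetched from; offset 0x26 is ILLEGAL. */
uint32_t vector_fetch_address(int vmap, int enpie, uint32_t offset)
{
    /* Indexed by vector_source_t, in declaration order. */
    static const uint32_t base[] = { 0x000000u, 0x3FFFC0u, 0x000D00u };

    /* Each vector is 32 bits (two 16-bit words), so offsets are even. */
    assert(offset < 0x40u && (offset & 1u) == 0u);
    return base[vector_source(vmap, enpie)] + offset;
}
```

With VMAP=0 and ENPIE=1, vector_fetch_address(0, 1, 0x26) gives 0x000026, which is the fetch address observed with the watchpoint. With VMAP=1 the same call gives 0x000D26.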

Geoffrey Mortimer said:
2) Another puzzle is that the debugger appears to allow me to single step even though the DBGM bit remains set. I have tried this after a reset and it still works, single steps seem to be possible with DBGM set. Also, I managed to trigger a HW watchpoint with DBGM seemingly set.

I'll need to refresh my memory a bit and get back to you. It may be that you are in "rude realtime mode" in which case DBGM is being ignored.

I'll also re-read your whole post and see if I think of anything to try.

0 Lori Heustess over 9 years ago in reply to Lori Heustess

TI__Guru* 91275 points

Geoffrey - something to check is that the stack is not overflowing. You can set a watchpoint at the end of the stack to see if this happens. There is some information here:

processors.wiki.ti.com/.../Checking_for_Stack_Overflow
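A software complement to the watchpoint is to paint the stack with a known pattern and later scan for the high-water mark. The sketch below is only an illustration, not the method from the linked page: the names and the fill value are made up, and it assumes the stack grows upward from the start of the region, as it does on the C28x.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xBEEFu  /* arbitrary paint pattern (assumption) */

/* Paint the stack region with the pattern, typically once at startup. */
void stack_paint(uint16_t *stack, size_t words)
{
    size_t i;
    for (i = 0; i < words; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Words used so far: scan down from the top of the region until a word
 * no longer holds the paint pattern. Assumes upward stack growth. */
size_t stack_high_water(const uint16_t *stack, size_t words)
{
    size_t used = words;
    while (used > 0 && stack[used - 1] == STACK_FILL) {
        used--;
    }
    return used;
}

/* Nonzero when fewer than `margin` untouched words remain at the top. */
int stack_nearly_full(const uint16_t *stack, size_t words, size_t margin)
{
    assert(margin <= words);
    return (words - stack_high_water(stack, words)) < margin;
}
```

A word that legitimately holds the fill value at the very top of the used area makes the scan under-report by that word, so the result is a lower bound rather than an exact count.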

0 Lori Heustess over 9 years ago in reply to Lori Heustess

TI__Guru* 91275 points

Geoffrey - I did double check that DBGM is ignored when in rude realtime. That may be the answer to your 2nd question.

Another debug technique since the device is going all over the place is to fill the RAM with an ESTOP0 before executing the program. This is a software breakpoint and will halt the CPU. It may help narrow down where things go wrong before the CPU takes a run all over.

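In the CCS GEL file there is a function set up to do this (the example below is the F28069 version, so check the address ranges against the F28335 memory map):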

/********************************************************************/
/* The ESTOP0 fill functions are useful for debug. They fill the */
/* RAM with software breakpoints that will trap runaway code. */
/********************************************************************/
hotmenu Fill_F28069_RAM_with_ESTOP0()
{
    GEL_MemoryFill(0x000000,1,0x000800,0x7625); /* Fill M0/M1 */
    GEL_MemoryFill(0x008000,1,0x000800,0x7625); /* Fill L0 */
    GEL_MemoryFill(0x008800,1,0x000800,0x7625); /* Fill L1/L2 */
    GEL_MemoryFill(0x009000,1,0x003000,0x7625); /* Fill L3/L4 */
    GEL_MemoryFill(0x00C000,1,0x004000,0x7625); /* Fill L5/L6 */
    GEL_MemoryFill(0x010000,1,0x004000,0x7625); /* Fill L7/L8 */
    GEL_MemoryFill(0x040000,1,0x000800,0x7625); /* Fill USB RAM */
}

0 Geoffrey Mortimer over 9 years ago in reply to Lori Heustess

Prodigy 110 points

Hi Lori, many thanks for the reply. For 1) I need to read the datasheets better!

The main problem with checking for stack overflow in that way is that there is a re-entrant interrupt storm in progress, at the priority level of the illegal opcode trap - would the RTOS interrupt get a look in? I have used a hardware watchpoint in the debugger to trap stack overflow, and what I see is a continuous sequence of the same registers being stacked, with the same values, in the order one would expect on entry to an ISR, over and over again, although alternately with a gap of one DWORD between stackings. Apart from the first interrupt, the source of which I cannot determine, which always happens at (low water mark - 2) on the stack (0x0402; main() executes with SP=0x0404) and jumps to zero (stacked return address = 0x00001), I am seeing a repeated cycle of stacking 7 DWORDs, 8 DWORDs, 8 DWORDs, the 8 DWORDs being the stacked registers plus a hole containing one DWORD of old data at the start. The 7 DWORDs make sense as far as stack contents go - there are recognisable return addresses and local variable values, although scrambled between registers which would not be used by the compiler to hold them. I can only imagine that an IRET is executing with the stack pointer corrupted. One result is that ST1 is being loaded with, amongst other things, AMODE=1, PAGE0=1, VMAP=0.

The return address for all successive interrupts is 0x8245 (due to the code loader used by the debugger leaving 0x8244 at location 0x0026). This points to an illegal opcode (see below). If stack overflow is not trapped, this carries on with the stack growing all over memory and peripheral space until the SP overflows and wraps around to location 0x0000. The bug is triggered only sporadically, not after a fixed number of cycles. It happens once every 10 hours in my case, whereas my customer is able to make it happen more frequently, but still only about once every hour. We are using exactly the same compilers and our .out files are identical. The reason for the interrupt storm is that the illegal opcode trap is being invoked over and over again, since the vector at 0x0026, even after stack wrap-around, usually points to an illegal opcode or memory access itself (given AMODE=1 and PAGE0=1, most memory accesses trigger the trap; also, many locations contain 0x0000, ITRAP0), an illegal opcode trap being, of course, non-maskable.

What I would say is that the problem always seems to happen at a certain (but unfortunately large) subset of points in C code execution, and the first "bad" interrupt is always at the same place on the stack - I am able to use this fact to put a memory write-with-data watchpoint at 0x0408 to reliably trap the first occurrence.

Looking at return addresses on the stack together with the RPC, during normal operation, it appears as far as I have seen so far, that the bug happens only when one of a number of functions called by a certain function, say, B is executing, where B is called from main(). The pair of return addresses are, as expected after a bit of head scratching, adjacent. During abnormal execution, I see the same pair of return addresses, but stacked from registers, or part registers which could not have been used by the compiler originally to store addresses.

This is the only clue I have to date about the cause of the problem. Another clue is that traffic on the CAN line seems (but only subjectively) to increase the frequency of occurrences. This is more the case with my customer than on my board.

If I insert so much as a NOP anywhere in the code (so far as I've seen, it's a big project) the whole problem goes away "on its own" altogether.

I am looking at a typical-for-automotive (don't worry, only a racing car!) interim solution in which I load a vector at 0x0026 which points to code which resets the processor by writing zero to the watchdog control register in the event of an illegal opcode trap occurring when the vectors are mapped at 0x0000. I have written machine code to do this (using the debugger and RAM in order not to disturb the alignment of the compiled code) and it seems to work, without modifying the project.

I start the debugger, load the reset code at 0x007E0 (in RAM beyond my customer's defined top-of-stack and well beyond my observed high water mark) and then at 0x0026 load a pointer to it. This seems to work: I arrive at 0x007E0 OK, verified with the HW watchpoint, and if I hit the play button the processor seems to reset correctly. My worry, however, is that the stack pointer corruption may not be as predictable as I think, and of course zero is a magnet value for corrupted registers, the danger being that a crash slightly different from what I have seen will overwrite my RAM vector at 0x0026 (in the final version I will put the reset code in flash, so it should survive).

The PIE vector space remains uncorrupted if I trap the interrupt storm before stack overflow occurs.

Oh for a trace debugger!

Best regards, many thanks
Geoffrey

0 Lori Heustess over 9 years ago in reply to Geoffrey Mortimer

TI__Guru* 91275 points

Geoffrey Mortimer said:
The return address for all successive interrupts is 0x8245 (due to the code loader used by the debugger leaving 0x8244 at location 0x0026).

Is this location being used by the code? If it is, you might try reserving it so you can use it for debug. I would put an address there and have an ESTOP0 where it points to. That way you can stop the storm and maybe debug before it gets out of hand.

Another thing to check is that the prefetch is not going beyond valid memory. There is an erratum in www.ti.com/lit/SPRZ272 that describes this. The CPU will get an invalid opcode - the fact that adding a NOP fixes it makes me suspect this erratum as a potential cause.

-Lori

0 Lori Heustess over 9 years ago in reply to Geoffrey Mortimer

TI__Guru* 91275 points

Geoffrey Mortimer said:
I am looking at a typical-for-automotive (don't worry, only a racing car!) interim solution in which I load a vector at 0x0026 which points to code which resets the processor by writing zero to the watchdog control register in the event of an illegal opcode trap occurring when the vectors are mapped at 0x0000.

Oops, I missed this on the first read-through! You could try vector 0x26 pointing to an ESTOP0 to halt and possibly debug back one step to how the PC got there.

Geoffrey Mortimer said:
Oh for a trace debugger!

Yes. I know. I've debugged many issues where I wish we had trace on these devices.

-Lori

0 Geoffrey Mortimer over 9 years ago in reply to Lori Heustess

Prodigy 110 points

Yes it is. 0x008244 is in the middle of their flash EEPROM emulation API. But it's non-aligned, in the middle of a DWORD-long instruction, so it doesn't read like their program code. I like the ESTOP0 fill approach, you're ahead of me on this. I will need to think about it a bit to understand what it means to destroy the RAM signature of the device (which I can't help thinking might have something to do with this, given that the clocks look good and I need to explain why it happens so often to my client and so infrequently to me). I will check out the errata again; the only thing I came across last time I looked was a spurious interrupt warning. This code has that unadvised sequence, but it executes only once, during initialization. I haven't really figured out whether or not that's important. If I try the workaround, inserting a NOP, chances are I'll eliminate the problem for other reasons. Code alignment seems to be a factor in this, in one way or another - if I change the code at all, even in the most banal way - asm(" NOP"), for instance - it all goes away. I would use a binary chop search, modifying files based on linking order, but, as I said, the problem occurs only every 10 hours so it could be a long, drawn-out process. To get a 3 sigma confidence level of having discovered anything would require days of running.
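To put a rough number on "days of running", here is a back-of-envelope sketch in C. It assumes the crashes are Poisson events with a mean of one per 10 hours and that a code change does not alter that rate, both of which are assumptions rather than anything measured. It also takes "3 sigma" as a two-sided tail probability of 0.0027. All function names are hypothetical.

```c
#include <assert.h>

/* exp(x) for 0 <= x <= 50 by Taylor series, so the sketch needs no
 * math library. All terms are positive, so there is no cancellation. */
double exp_nonneg(double x)
{
    double term = 1.0;
    double sum = 1.0;
    int n;

    assert(x >= 0.0 && x <= 50.0);
    for (n = 1; n <= 200; n++) {
        term *= x / n;
        sum += term;
    }
    return sum;
}

/* Probability of a crash-free run of `hours` if the bug is still there,
 * assuming Poisson failures with the given mean time between failures. */
double p_clean_run_by_chance(double hours, double mtbf_hours)
{
    assert(hours >= 0.0 && mtbf_hours > 0.0);
    return 1.0 / exp_nonneg(hours / mtbf_hours);
}

/* Smallest whole number of hours for which a clean run would occur by
 * chance with probability below alpha. */
int hours_needed(double mtbf_hours, double alpha)
{
    int h = 0;

    assert(alpha > 0.0 && alpha < 1.0);
    while (p_clean_run_by_chance((double)h, mtbf_hours) >= alpha) {
        h++;
    }
    return h;
}
```

Under those assumptions hours_needed(10.0, 0.0027) comes out at 60 hours, i.e. about two and a half days per experiment, which is why a binary chop over many builds would be so slow.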

0 Lori Heustess over 9 years ago in reply to Geoffrey Mortimer

TI__Guru* 91275 points

Yup fetching in the middle of an opcode is not good. :(

Clarification Note: an embedded ESTOP0 will only be a debug mechanism for when the debugger is connected. This instruction behaves like a NOP when the debugger is not connected.

I will let you know if I think of anything else.

0 Geoffrey Mortimer over 9 years ago in reply to Lori Heustess

Prodigy 110 points

Lori, thank you very much, been on this 48 Hours now and need some sleep - are you around tomorrow? We'll get to the bottom of this, somehow!

0 Geoffrey Mortimer over 9 years ago in reply to Lori Heustess

Prodigy 110 points

I have the problem that if I insert any sort of code [asm (" NOP")], for example, the problem ceases to exist. It simply doesn't happen at all. Can you let me know about any code alignment issues?

0 Vivek Singh over 9 years ago in reply to Geoffrey Mortimer

TI__Guru** 113885 points

Hi Geoffrey,

Are you using nested interrupts in your code? If so, please make sure that in every ISR where interrupts are re-enabled with EINT to allow nesting, interrupts are disabled again (using DINT) at the end, before returning from the ISR. Failing to do that could cause issues like this.

Regards,
Vivek Singh

0 Geoffrey Mortimer over 9 years ago in reply to Vivek Singh

Prodigy 110 points

Goodness, Vivek, that is really quite evil! We do, indeed have code which re-enables interrupts:

/* CPU Timer1 ISR. */ //aghirotti
static interrupt void ISR_CPU_TIMER1(void){
    INTEN();
    IER &= (~ M_INT14);
    if(task_1ms_in_run == 0){
        task_1ms_in_run = 1;
        //resetdiag
        GFD_Reset_Diag(DG_DIAG_WATCHDOG_1MS);

        // callback invocation
        AppTask_1ms();
        task_1ms_in_run = 0;
    }
    else{
        //setdiag
        GFD_Set_Diag(DG_DIAG_WATCHDOG_1MS);
        SET_ALARM(NDiagStatusUnmasked,(WORD)DG_DIAG_WATCHDOG_1MS);
    }
    IER |= M_INT14;
}

INTEN() expands to asm(" SETC INTM") so nothing complicated. Do you think this code could cause problems?

Many thanks indeed
Geoff

0 Geoffrey Mortimer over 9 years ago in reply to Geoffrey Mortimer

Prodigy 110 points

Sorry, CLRC INTM, silly of me.

0 Geoffrey Mortimer over 9 years ago in reply to Geoffrey Mortimer

Prodigy 110 points

What an utterly evil bug! How on earth did you discover that, Vivek? Do you have any idea of the mechanism involved? I would have written that code myself, given that ST1 is unstacked automatically it seems to me completely illogical to have to set INTM before exiting!

0 Vivek Singh over 9 years ago in reply to Geoffrey Mortimer

TI__Guru** 113885 points

Hi Geoffrey,

Please refer to the restrictions part of the section "RPTB label, loc16 Repeat A Block of Code" (page 127) of this document.

Regards,

Vivek Singh

0 Geoffrey Mortimer over 9 years ago in reply to Vivek Singh

Prodigy 110 points

Hi Vivek

I have found two ISRs in the application which re-enable interrupts. In both cases the compiler inserts PUSH RB at the beginning and POP RB at the end. I have therefore added SETC INTM instructions at the end of the user code. However, having kept the assembler generated by the compiler, I cannot find any RPTB instructions in the application code.

1) Do library routines use RPTB blocks?

2) Is it necessary for an RPTB block to be interrupted in order for there to be a problem, or can executing POP RB before returning from an interrupt sometimes cause problems in any case if not preceded by SETC INTM?

Many thanks, best regards, Geoff

0 Vivek Singh over 9 years ago in reply to Geoffrey Mortimer

TI__Guru** 113885 points

Hi Geoff,

Sorry for late reply.

1) Do library routines use RPTB blocks?

I am not sure about this one.

Is it necessary for an RPTB block to be interrupted in order for there to be a problem, or can executing POP RB before returning from an interrupt sometimes cause problems in any case if not preceded by SETC INTM?

We have definitely seen the problem when the code has an RPTB instruction, but the issue happens when the POP RB instruction gets interrupted; hence our recommendation is to always disable interrupts before returning from an ISR.
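The recommended shape of a nesting ISR can be sketched as below. This is a host-runnable model, not target code: IER and INTM are plain variables standing in for the CPU state, and enable_nesting(), disable_nesting() and slow_task() are hypothetical names. It mirrors the order of operations in the timer ISR posted earlier in the thread and adds only the final masking step. On the target the two helpers would be asm(" CLRC INTM") and asm(" SETC INTM"), and the function would carry the interrupt keyword.

```c
#include <assert.h>
#include <stdint.h>

/* Host stand-ins for CPU state (assumptions, not real registers). */
volatile uint16_t IER = 0xFFFFu;
volatile int INTM = 1;            /* 1 = maskable interrupts disabled */
int work_done = 0;

#define M_INT14 0x2000u           /* IER bit 13: CPU Timer 1 (INT14) */

void enable_nesting(void)  { INTM = 0; }  /* target: asm(" CLRC INTM") */
void disable_nesting(void) { INTM = 1; }  /* target: asm(" SETC INTM") */

/* Placeholder for the slow work that is allowed to be interrupted. */
void slow_task(void)
{
    assert(INTM == 0);            /* nesting is open while this runs */
    work_done++;
}

void timer1_isr_sketch(void)
{
    enable_nesting();             /* allow other interrupts to nest  */
    IER &= (uint16_t)~M_INT14;    /* block re-entry from this source */

    slow_task();

    /* The added step: mask interrupts again before the compiler's
     * epilogue (which may include POP RB) and the return, so that the
     * epilogue cannot itself be interrupted. */
    disable_nesting();
    IER |= M_INT14;
}
```

Placing the mask as the first statement after the user code keeps everything that follows it, including the IER restore and the compiler-generated register pops, inside the protected window.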

Regards,
Vivek Singh

0 Geoffrey Mortimer over 9 years ago in reply to Vivek Singh

Prodigy 110 points

Hi Vivek

This seems to solve our problem - I have run with SETC INTM in the two ISRs which re-enable interrupts for 72 hours with no problem. One problem is that modifying the code often makes the problem go away on its own, and it is very intermittent. As a control, I have changed the instructions to CLRC INTM (which does nothing, but maintains the alignment of the code) and rerun the test, and the problem has returned.

We will now need to test this code on my customer's boards; if successful, I will mark the problem as solved!

Many thanks

Geoff

0 Vivek Singh over 9 years ago in reply to Geoffrey Mortimer

TI__Guru** 113885 points

Thanks for the update Geoff. Yes, it's good to wait for final testing to close this thread.

Regards,
Vivek Singh

0 Geoffrey Mortimer over 9 years ago in reply to Geoffrey Mortimer

Prodigy 110 points

Hi Vivek

It has been running for 150 hours without problems with your suggested mods. Then I substituted NOPs for SETC INTM to preserve the timing, and it crashed. So I can be pretty confident the problem is solved. Many thanks indeed.

I think it might be worth documenting this more prominently. Many systems without RTOS need to re-enable interrupts in the slower routines, these, in my estimation, account for about half the systems out there. It really would not hurt to make this problem more evident than hiding it in the FPU manual.

Anyway, I reiterate my thanks!

Best regards

Geoff

0 Vivek Singh over 9 years ago in reply to Geoffrey Mortimer

TI__Guru** 113885 points

Thanks for the update and valuable feedback Geoff.

We understand the concern and will look at the option of including this usage note in the device errata as well. We are also planning to fix this in the compiler by having it insert an instruction to disable interrupts.

Regards,
Vivek Singh
