Program Counter (PC) getting out of sync, executing wrong instructions

Ryan Johnson

Other Parts Discussed in Thread: MSP430F5418A, MSP430F5507

I'm seeing a very strange issue which I'm hoping someone can help me with. I have an application using FreeRTOS, and I'm able to create a repeatable crash. Here is the assembly code where my problem occurs (this happens to be in queue.c:prvUnlockQueue of FreeRTOS):

0x0CA68: C232 DINT

0x0CA6A: 4303 NOP

0x0CA6C: 5392 5BFE INC.W &usCriticalNesting

0x0CA70: 3C0B JMP (C$DW$L$prvUnlockQueue$13$E)

C$DW$L$prvUnlockQueue$11$B, C$L83:

0x0CA72: 0ACC MOVA R10,R12

0x0CA74: 00AC 0010 ADDA #0x00010,R12

0x0CA78: 13B0 CE0E CALLA #xTaskRemoveFromEventList

Based on the conditions of my program, I should never reach 0xCA72 (after the jump), and in fact I have verified that no C code executes which would lead it to that address. Curiously though, I will eventually end up there. Through some extensive debugging using the trace function and by moving the address of the usCriticalNesting variable, I am almost certain that for some reason the program is clobbering together the instructions at 0xCA6C and 0xCA70. The assembly view when looking at address 0xCA6E looks like the following:

0x0CA68: C232 DINT

0x0CA6A: 4303 NOP

0x0CA6C: 5392 5BFE INC.W &usCriticalNesting

0x0CA6E: 5BFE 3C0B ADD.B @R11+,0x3c0b(R14)

C$DW$L$prvUnlockQueue$11$B, C$L83:

0x0CA72: 0ACC MOVA R10,R12

0x0CA74: 00AC 0010 ADDA #0x00010,R12

0x0CA78: 13B0 CE0E CALLA #xTaskRemoveFromEventList

I believe this is essentially what is happening: Instead of using 0x5392 0x5BFE as an instruction (increment usCriticalNesting then jump), it is treating 0x5BFE 0x3C0B as an instruction (add bytes, no jump). When I do end up at 0xCA72, usCriticalNesting is 0, supporting the theory.

If I change the address of usCriticalNesting to 0x5A40, it changes the incorrect instruction to "ADD.B R10, PC", and my program ends up in no-mans land.
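To make the two readings of the same flash words concrete, here is a minimal C sketch that splits a 16-bit word into the fields of an MSP430 double-operand (format I) instruction. The names `Fmt1`, `decode_fmt1` and `fmt1_words` are made up for this illustration. It only covers the basic format I encoding, not the CPUX extension-word prefix.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an MSP430 double-operand (format I) instruction word. */
typedef struct {
    unsigned opcode; /* bits 15..12, 0x5 = ADD                    */
    unsigned src;    /* bits 11..8,  source register number       */
    unsigned ad;     /* bit 7,       destination addressing mode  */
    unsigned bw;     /* bit 6,       1 = byte operation (.B)      */
    unsigned as;     /* bits 5..4,   source addressing mode       */
    unsigned dst;    /* bits 3..0,   destination register number  */
} Fmt1;

/* Split one instruction word into its format I fields. */
Fmt1 decode_fmt1(uint16_t w)
{
    Fmt1 f;
    f.opcode = (w >> 12) & 0xFu;
    f.src    = (w >> 8) & 0xFu;
    f.ad     = (w >> 7) & 0x1u;
    f.bw     = (w >> 6) & 0x1u;
    f.as     = (w >> 4) & 0x3u;
    f.dst    = w & 0xFu;
    return f;
}

/* Number of 16-bit words the instruction occupies in memory. */
unsigned fmt1_words(Fmt1 f)
{
    unsigned n = 1;

    assert(f.opcode >= 0x4u);            /* format I opcodes are 0x4..0xF */

    /* As=01 needs an index/address word, except for R3,
       where the constant generator supplies #1. */
    if (f.as == 1u && f.src != 3u) n++;
    /* As=11 with PC as source is an immediate operand (#N). */
    if (f.as == 3u && f.src == 0u) n++;
    /* Ad=1 (indexed, symbolic or absolute) always needs a word. */
    if (f.ad == 1u) n++;

    return n;
}
```

Decoding 0x5392 gives opcode 5, source R3 with As=01 (constant #1) and destination R2 (SR) with Ad=1 (absolute), a two-word instruction. This is the INC.W, and 0x5BFE is its address word. Decoding 0x5BFE as an opcode gives a byte ADD with source @R11+ and destination x(R14), which then consumes 0x3C0B as its index word. Decoding 0x5A40 gives a one-word byte ADD from R10 into R0 (PC).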

Any thoughts as to how/why this could be occurring? For reference, this is with an MSP430F5418a, using CCS4 v.4.2.4.00033, using v.3.3.3 of the code generation tools, large data & large code memory models.

Thanks!

-- Ryan

over 12 years ago

0 Roberto Romano over 12 years ago

Guru 27285 points

Ryan Johnson said:

0x0CA6C: 5392 5BFE INC.W &usCriticalNesting

0x0CA6E: 5BFE 3C0B ADD.B @R11+,0x3c0b(R14)

C$DW$L$prvUnlockQueue$11$B, C$L83:

0x0CA72: 0ACC MOVA R10,R12

0x0CA74: 00AC 0010 ADDA #0x00010,R12

0x0CA78: 13B0 CE0E CALLA #xTaskRemoveFromEventList

I believe this is essentially what is happening: Instead o

Hi Ryan, from what you wrote, you disassembled the data part of an instruction, so the listing hides the jump, but 3C0B is still the jump. The two words I highlighted are the same word: in the first instance it is the argument of INC.W, and the second instance simply fooled the debugger.

Regards

0 Jens-Michael Gross over 12 years ago

Guru 227245 points

I'm just checking the device errata lists, and there are several which could explain something like that, but none of them actually applies to your code.

However,

Ryan Johnson said:
Instead of using 0x5392 0x5BFE as an instruction (increment usCriticalNesting then jump), it is treating 0x5BFE 0x3C0B as an instruction

This would only happen if the CPU somehow jumps to 0xca6c, not properly executing the bytes before.

Your INC.W &0x5bfe instruction actually is an
ADD 0(R3), 0x5bfe(SR) instruction, where 0(R3) returns 1 (constant generator) and does not have an additional data word, while for x(SR), SR equals 0, taking x as an absolute address.
I could imagine a bug in the debugger where the debugger would erroneously assume this to be a 3-word instruction while single-stepping, and not a 2-word one. But I'm pretty sure that such a common bug would have been eliminated before the very first release.
And in no case should it happen when not debugging.

One other possibility is that you have a problem with your stack, maybe inside an ISR. So the code jumps into the wild, eventually arriving at 0xca6e. But it must jump there directly somehow, since the two instructions before are one-word instructions, which would bring the processor back on the rails.

You should try setting an instruction fetch breakpoint on 0xca6e. You should never get an instruction fetch on this address (only read access). But if you do, the question is: how did you get there?

I'd be interested to know what is on 0xca88, the place where the JMP is intended to jump to.
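As a side note on where the JMP is meant to land, here is a small C sketch that computes the target of an MSP430 jump word from its address. The function names `is_jump` and `jump_target` are invented for this example. It assumes the standard jump encoding: top three bits 001, a 3-bit condition, and a signed 10-bit word offset relative to the address of the following instruction.

```c
#include <assert.h>
#include <stdint.h>

/* True if the word is a jump instruction (bits 15..13 == 001). */
int is_jump(uint16_t word)
{
    return (word & 0xE000u) == 0x2000u;
}

/* Target address of the jump instruction `word` located at `addr`. */
uint32_t jump_target(uint32_t addr, uint16_t word)
{
    int32_t offset = (int32_t)(word & 0x03FFu);  /* 10-bit word offset */

    assert(is_jump(word));

    if (offset & 0x0200) {
        offset -= 0x0400;                        /* sign-extend bit 9 */
    }
    /* PC already points past the jump word when the offset is added. */
    return (uint32_t)((int32_t)addr + 2 + 2 * offset);
}
```

For the word 0x3C0B at 0x0CA70 this yields 0x0CA88, and for 0x23EF at 0x0CA92 it yields 0x0CA72, which agrees with the labels in the listings in this thread.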

0 Ryan Johnson over 12 years ago in reply to Jens-Michael Gross

Intellectual 330 points

Thanks, I probably should elaborate on a few things. First, this doesn't usually happen. In my setup it won't happen until I enable a timer & SPI, and start exercising both. It seems to be some sort of a race condition.

I've thought about your idea about stack corruption inside an ISR, but the fact that it always seems tied to this segment of code -- regardless of where this code is in memory makes me think it's something else (I could well be wrong!)

For clarification, I don't hit this address while single-stepping. I run my test, and then when it crashes all signs lead me to believe it executed at 0x0CA6E. If I set a breakpoint on an instruction fetch at that address being on the memory address bus, I do trigger when the condition occurs. Oddly though my PC is set to 0x0CA6A (the NOP). Is that a clue?

Here is the full assembly listing for this block of code. Green is where code should only execute, red is where it never does, and orange is where I end up when this condition occurs.

C$L82:

0x0CA68: C232 DINT

0x0CA6A: 4303 NOP

0x0CA6C: 5392 5BFE INC.W &usCriticalNesting

0x0CA70: 3C0B JMP (C$DW$L$prvUnlockQueue$13$E)

C$DW$L$prvUnlockQueue$11$B, C$L83:

0x0CA72: 0ACC MOVA R10,R12

0x0CA74: 00AC 0010 ADDA #0x00010,R12

0x0CA78: 13B0 D05C CALLA #xTaskRemoveFromEventList

0x0CA7C: 930C TST.W R12

0x0CA7E: 2402 JEQ (C$DW$L$prvUnlockQueue$12$E)

C$DW$L$prvUnlockQueue$12$B, C$DW$L$prvUnlockQueue$11$E:

0x0CA80: 13B0 FC1E CALLA #vTaskMissedYield

C$DW$L$prvUnlockQueue$13$B, C$L84, C$DW$L$prvUnlockQueue$12$E:

0x0CA84: 839A 0036 DEC.W 0x0036(R10)

C$DW$L$prvUnlockQueue$14$B, C$L85, C$DW$L$prvUnlockQueue$13$E:

0x0CA88: 939A 0036 CMP.W #1,0x0036(R10)

0x0CA8C: 3803 JL (C$DW$L$prvUnlockQueue$15$E)

C$DW$L$prvUnlockQueue$15$B, C$DW$L$prvUnlockQueue$14$E:

0x0CA8E: 938A 0010 TST.W 0x0010(R10)

0x0CA92: 23EF JNE (C$L83)

C$L86, C$DW$L$prvUnlockQueue$15$E:

0x0CA94: 43BA 0036 MOV.W #-1,0x0036(R10)

0x0CA98: 9382 5BFE TST.W &usCriticalNesting

0x0CA9C: 2404 JEQ (C$L87)

0x0CA9E: 8392 5BFE DEC.W &usCriticalNesting

0x0CAA2: 2001 JNE (C$L87)

0x0CAA4: D232 EINT

Thanks,

-- Ryan

0 old_cow_yellow over 12 years ago

Guru 58965 points

Ryan,

That is very strange. You are saying that the code in the flash is actually:

0x0CA68: C232 DINT

0x0CA6A: 4303 NOP

0x0CA6C: 5392 5BFE INC.W &usCriticalNesting ; INC.W &0x5BFE

0x0CA70: 3C0B JMP (C$DW$L$prvUnlockQueue$13$E) ; JMP 0xCA88

C$DW$L$prvUnlockQueue$11$B, C$L83:

0x0CA72: 0ACC MOVA R10,R12

0x0CA74: 00AC 0010 ADDA #0x00010,R12

0x0CA78: 13B0 CE0E CALLA #xTaskRemoveFromEventList

But it is being executed by the CPU as if it was as follows:

0x0CA68: C232 DINT

0x0CA6A: 4303 NOP

0x0CA6C: 5392 5BFE INC.W &usCriticalNesting ; INC.W &0x5BFE

0x0CA6E: 5BFE 3C0B ADD.B @R11+,0x3c0b(R14) ; ADD.B @R11+,0x3C0B(R14)

C$DW$L$prvUnlockQueue$11$B, C$L83:

0x0CA72: 0ACC MOVA R10,R12

0x0CA74: 00AC 0010 ADDA #0x00010,R12

0x0CA78: 13B0 CE0E CALLA #xTaskRemoveFromEventList

To further verify your theory, I suggest that you try the following:

(1) Do not modify your code, but set up a BP at 0x0CA6A and start the CPU until PC reaches the BP at 0x0CA6A

(2) Inspect the contents of R11 and R14, modify them to 0x0CA6A and 0x05BFC respectively.

(3) Inspect the contents of 0x05BFC to 0x05BFF.

(4) Clear the BP at 0x0CA6A. Set two BPs. 0x0CA72 and 0x0CA88.

(5) Let the CPU continue until PC reaches either 0x0CA72 or 0x0CA88.

(6) If your theory is true, the PC will be at 0x0CA72. The contents of 0x05BFC will be increased by 3 and R11 will be increased by 1.

(7) If your theory is not true, and the CPU is normal, the PC will be at 0x0CA88. The contents of 0x05BFC and R11 will not change.

(8) In either case, the contents of 0x05BFE will be increased by 1.

-- OCY

0 old_cow_yellow over 12 years ago in reply to Ryan Johnson

Guru 58965 points

Ryan,

I replied to your earlier post and did not see your most recent one.

Now I suspect that a single NOP following DINT is not enough, and that an interrupt at that moment corrupted the PC. Could you insert another NOP there?

-- OCY

0 Ryan Johnson over 12 years ago in reply to old_cow_yellow

Intellectual 330 points

Thanks OCY, I think that did do the trick! Now, can you explain why? ;-)

In addition, I did try the steps you outlined in your prior post, and I ended up in step 7 (incorrect theory).

-- Ryan

0 Jeff Tenney over 12 years ago in reply to Ryan Johnson

Guru 12160 points

Hi Ryan,

Very interesting post. Sounds like the code works OK with two NOPs. Does it work with no NOPs? FYI, with no NOPs, don't put a breakpoint on the INC.W instruction and don't single step through the DINT instruction (errata).

Also, just for clarification, the problem also occurred when not debugging (JTAG tool disconnected), right? And the problem is intermittent, meaning that it doesn't happen every time through this code, right?

Jeff

0 Ryan Johnson over 12 years ago in reply to Jeff Tenney

Intellectual 330 points

I'll try it out with no NOPs and report back. FYI, the 'DINT; NOP;' sequence was generated using the _disable_interrupt() function.

Yes to both of your questions -- it will lock up predictably with the JTAG device disconnected, and no it doesn't happen every time through this path. In fact, until I enable my timer & SPI interrupts, I never come across it, so I think OCY's theory about an interrupt being able to sneak in could be correct (since there are lots of interrupts firing). After I do enable the timer/SPI interrupts, I can get it to crash within 10 seconds 100% of the time.

-- Ryan

0 Jeff Tenney over 12 years ago in reply to Ryan Johnson

Guru 12160 points

Do you have stack overflow? Are you using nested interrupts?

I noticed that the address of usCriticalNesting is at the very top of RAM. Is the stack using the bottom of RAM then?

It does appear that the CPU is beginning execution at 0xCA6E. I don't think the CPU would make that mistake no matter where an interrupt sneaks in. That would have made the errata sheet I think. Typically a corrupted PC is from stack overflow, stack corruption, or over-speed MCLK, etc. The consistent return address makes me think stack overflow, but who knows.

Jeff

0 Jens-Michael Gross over 12 years ago in reply to Jeff Tenney

Guru 227245 points

Jeff Tenney said:
I don't think the CPU would make that mistake no matter where an interrupt sneaks in. That would have made the errata sheet I think. Typically a corrupted PC is from stack overflow, stack corruption, or over-speed MCLK, etc.

I second that.

A documented erratum is that an interrupt may be granted during the GIE clear operation, which causes the interrupt to happen after the instruction following the GIE clear (because this instruction was already fetched due to pipelining). Hence the automatic NOP placed by the _disable_interrupt() intrinsic.

Jeff Tenney said:
The consistent return address makes me think stack overflow, but who knows.

If the stack happens to flow over the address of usCriticalNesting, incrementing usCriticalNesting would result in an increased return address from an interrupt.

Undetected stack overflow errors are one of the most annoying sources of software problems, followed by buffer overflows (the common entry-door for trojans) and second only to inexperienced coders :)
Well, maybe buggy or badly documented demo code is more annoying ;)
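One common way to make a stack overflow visible is to "paint" the stack region with a known pattern at startup and later check how much of the pattern survives. The following is a minimal host-side C sketch of that idea, not code from this project. The names `stack_paint` and `stack_unused` and the fill value are assumptions chosen for illustration, and on a real target the buffer would be the linker-defined stack region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STACK_FILL 0xA5u   /* arbitrary pattern, unlikely to occur by chance */

/* Fill the whole stack region with the pattern before it is used. */
void stack_paint(uint8_t *stack, size_t size)
{
    assert(stack != NULL);
    memset(stack, STACK_FILL, size);
}

/* The MSP430 stack grows downward, so the lowest addresses are
   touched last. Count how many bytes at the low end still hold
   the pattern. A result of 0 means the stack reached (or passed)
   the bottom of its region. */
size_t stack_unused(const uint8_t *stack, size_t size)
{
    size_t n = 0;

    assert(stack != NULL);
    while (n < size && stack[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

Calling `stack_unused()` periodically, or after a crash via the debugger, gives a rough lower bound on the remaining headroom. A variable such as usCriticalNesting placed directly below the stack would be overwritten exactly when this count reaches zero.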

0 Ryan Johnson over 12 years ago in reply to Jens-Michael Gross

Intellectual 330 points

Stack overflow was my initial thought, but it seemed unlikely because the PC was always corrupted in this section of code, regardless of where in memory the code was, or the location of usCriticalNesting. usCriticalNesting was forcibly located at the top of RAM just for testing. I also placed it in various other locations with similar effects. The stack is also located at the top of RAM.

Jens-Michael Gross said:

If the stack happens to flow over the address of usCriticalNesting, incrementing usCriticalNesting would result in an increased return address from an interrupt.

This is a good thought however, so I'll pursue testing this theory out more thoroughly. Per Jeff's comment, I'll also review our clocking to make sure everything is within spec (we're running at 30MHz, so just below the limit).

Thanks,

-- Ryan

0 Jeff Tenney over 12 years ago in reply to Ryan Johnson

Guru 12160 points

Ryan,

Just in case you didn't already know, the only clocks that are allowed to run faster than 25MHz are XT1CLK and XT2CLK. When higher than 25MHz, these clocks must be divided before used as MCLK, SMCLK, or ACLK.

Jeff

[Edit: Actually, I forgot about DCOCLK and DCOCLKDIV which are also allowed to run faster than 25MHz.]
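The rule above can be sketched as a small helper that picks the smallest divider bringing a clock source down to the 25MHz limit mentioned in this thread. This is only an illustration: the function name `mclk_divider` is made up, and the assumption that the available dividers are the powers of two from 1 to 32 should be checked against the family user's guide for the device in use.

```c
#include <assert.h>
#include <stdint.h>

#define MCLK_MAX_HZ 25000000UL   /* limit quoted in this thread */

/* Smallest power-of-two divider (1..32, assumed set) such that
   source_hz / divider does not exceed MCLK_MAX_HZ.
   Returns 0 if even the largest divider is not enough. */
unsigned mclk_divider(uint32_t source_hz)
{
    unsigned div;

    assert(source_hz > 0);
    for (div = 1; div <= 32; div *= 2) {
        /* Compare without dividing, so no rounding hides an overshoot. */
        if (source_hz <= (uint32_t)MCLK_MAX_HZ * div) {
            return div;
        }
    }
    return 0;
}
```

For a 30MHz source this returns 2, giving a 15MHz MCLK. A 25MHz source returns 1.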

0 Jeff Tenney over 12 years ago in reply to Jens-Michael Gross

Guru 12160 points

Jens-Michael Gross said:
A documented erratum is that an interrupt may be granted during teh GIE clear operation, whcih causes the interrup tto happen after the instruction followign the GIE clear

Hi JMG - Can you or anybody else find this erratum? I have wondered and even publicly opined on this very issue many times. But I have not been able to prove it either way. I have never once experienced an interrupt *after* the instruction following the DINT, even with code written specifically to induce the problem. Many times I have seen an interrupt immediately after the DINT, but none after the following instruction.

The only erratum I have seen about NOP after GIE refers to debugging (breakpoints and single stepping I believe).

Jeff

0 Ryan Johnson over 12 years ago in reply to Jeff Tenney

Intellectual 330 points

Doh, you're right. I misinterpreted the datasheet (I'm a bit green when it comes to MSP430's). I was indeed configuring MCLK to run at 30MHz using Init_FLL_Settle.

Thanks!!

-- Ryan

0 Jens-Michael Gross over 12 years ago in reply to Ryan Johnson

Guru 227245 points

Ryan Johnson said:
we're running at 30MHz, so just below the limit

30MHz on which MSP (you didn't ever mention the exact target processor)?
Or do you mean with a /2 divider for MCLK?

0 Jeff Tenney over 12 years ago in reply to Ryan Johnson

Guru 12160 points

Ryan,

Also if you haven't already set VCORE up to the maximum, you must also do that to run 25MHz.

Finally, to be truly safe and truly within spec, you may have to settle for something less than 25MHz if you are using the DCO. Typically the DCO modulator will alternate between two adjacent DCO speeds on a cycle-to-cycle basis, and to satisfy the FLL some of those cycles are faster than the target and some slower. If your target is 25MHz, then some cycles will be faster than 25MHz. Technically speaking, you shouldn't do it. You can read all about it or just target 23MHz and save yourself the trouble.

Jeff

0 Paul Schilling1 over 9 years ago in reply to Jens-Michael Gross

Prodigy 20 points

I have the same problem on an MSP430F5507 running at 25 MHz. The uptime between failures gets worse progressively at slower MCLK frequencies. 30 seconds at 25 MHz and 2 seconds at 20 MHz.

I am also using FreeRTOS version 8.0.1. I have also been able to reduce or even stop the PC corruption by changing the low power mode from LPM1 to LPM0 in the idle callback. With a bare FreeRTOS implementation I didn't see this problem, but as I added more functionality and more ISR code the PC corruption got worse and seemed to depend upon how much the system idled.

This possibly sounds like the CPU40 erratum, possibly in the context switch code or the ISR low power mode exit code.

It's not clear from the posts that this problem was truly solved.

0 Jens-Michael Gross over 9 years ago in reply to Paul Schilling1

Guru 227245 points

The slower the clock speed, the higher the interrupt frequency of some interrupts, compared to the main code execution speed. And the higher the probability that an interrupt happens at one particular location in the code.
If an interrupt that happens at a fixed time interval (external/constant timing) of 1ms takes, say, 5000 clock cycles to execute, this means that at 25MHz, it happens every 20000 cycles of main execution. At 20MHz, it would happen every 15000 cycles of main execution. Which is 33% more often, compared to 20% less clock speed. And at 5MHz, an interrupt would happen every main instruction (actually, main execution would effectively stop).
If there are sections of disabled interrupts, interrupt probability further rises for the remaining part of main.

So your observation is perfectly explainable with an interrupt-caused stack problem.

To know why it happens, a deeper look at your code would be required.
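The arithmetic in this post can be reproduced with a few lines of C. The function name `main_cycles_per_period` is invented for the example, and the numbers (1ms interval, 5000-cycle ISR) are the hypothetical ones used above, not measurements from a real system.

```c
#include <assert.h>
#include <stdint.h>

/* MCLK cycles left for main code between two invocations of a
   fixed-period interrupt. Returns 0 when the ISR consumes the
   whole period (main code effectively stops). */
uint32_t main_cycles_per_period(uint32_t mclk_hz,
                                uint32_t period_us,
                                uint32_t isr_cycles)
{
    uint64_t total;

    assert(period_us > 0);
    /* Total MCLK cycles in one interrupt period. */
    total = (uint64_t)mclk_hz * period_us / 1000000u;

    return total > isr_cycles ? (uint32_t)(total - isr_cycles) : 0u;
}
```

With a 1000us period and a 5000-cycle ISR, this gives 20000 main cycles at 25MHz, 15000 at 20MHz and 0 at 5MHz. The fewer main cycles there are between interrupts, the more likely it is that an interrupt lands on any one particular instruction.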

0 Paul Schilling1 over 9 years ago in reply to Jens-Michael Gross

Prodigy 20 points

I have discovered that in LPM1 it reliably corrupts the program counter but in LPM0 it doesn't. It enters low power mode in the RTOS idle hook.

There may be some EMI problems on the board that could be causing issues.

0 Jens-Michael Gross over 9 years ago in reply to Jeff Tenney

Guru 227245 points

While the GIE instruction is executed, the next instruction is already fetched. There is a small window in which the interrupt can be granted while the fetch of the next instruction is already scheduled (and therefore the interrupt will be executed after this already fetched instruction) but before the GIE clear takes effect and will forbid the interrupt.

I don’t know whether it ever appeared in an errata sheet, however, I think I read about it in the MSPGCC compiler description as the reason for including a NOP into the DINT macro. Or in a discussion in the MSPGCC mailing list.
And since IAR and CCS also included the NOP into the DINT...

However, this will only be a problem on an asynchronous interrupt, as it can happen only on port pins or if the interrupt source runs from a different oscillator than the CPU (e.g. ADC from MODOSC).

The other ‘erratum’ you mentioned is not an erratum at all. Breakpoints are set on instruction fetch. And since the instruction after an LPM entry (not just GIE set) is fetched at the same MCLK cycle the LPM bits are being set (which will only prevent the next MCLK cycle from happening), the breakpoint is triggered before LPM is entered, and not after waking up. This is obvious if one spends a few thoughts on the processor’s inner workings. The instruction is then discarded if an interrupt wakes from LPM, but it has been fetched and therefore triggered the breakpoint.
However, the comment ‘//for debugger’ in the demo codes rather confuses people, leading them into the trap of placing a breakpoint exactly there where it shouldn’t be set.

0 Jens-Michael Gross over 9 years ago in reply to Paul Schilling1

Guru 227245 points

Paul Schilling1 said:
I have discovered that in LPM1 it reliably corrupts the program counter but in LPM0 it doesn't.

EMI is always a possibility. However, switching the DCO off in LPM1 means some time before DCO comes up again. And I remember an erratum ‘DCO comes up fast’, where the DCO might still run faster than it should (has not settled) when MCLK is reactivated.
Try to increase the MCLK divider before entering LPM (and set it back in the ISR). This increases ISR latency, but maybe it solves the problem even in LPM1.
