MSP430F5437A: Reading UCA0RXBUF causes the CPU to occasionally skip the subsequent instruction

Part Number: MSP430F5437A


I have a long-running product whose new firmware build started to fail occasionally when subjected to heavy incoming serial port traffic. Using the analysis tools built into our firmware and a debugger, I have been able to isolate the problem to a very specific part of the firmware code.

The firmware reads UCA0RXBUF (UART mode) repeatedly using a TST.B instruction, even if there is no pending character. (Do not ask why; but I have not found any document that forbids reading UCA0RXBUF when UCRXIFG is not set.) However, roughly once per 1,000,000 executions, the CPU jumps over the subsequent instruction for no apparent reason.

Because this subsequent instruction is RETA, jumping over it sends execution down a completely incorrect path, rendering the device nonoperational.

0x37C64 TST.B  0xC(R12)             ; this instruction reads UCA0RXBUF (R12=0x05C0)
0x37C68 RETA
0x37C6A MOVA   R13,0x8(R12)         ; unrelated instruction that should be never executed but is anyway sometimes executed after 0x37C64
0x37C6E RETA

Whenever execution is halted by a breakpoint on the not-to-be-executed instructions, State Storage shows that execution has indeed skipped over the RETA instruction at 0x37C68.

The occurrence probability, and the fact that there must be incoming bytes for this to happen, give me the feeling that this is related to situations where UCRXIFG is set by hardware at about the same time as RXBUF is read. However, UCRXIE and GIE are set, so all incoming characters should trigger interrupt processing, yet no interrupt processing can be seen in State Storage. (State Storage shows "Irq=1" shortly after the TST.B instruction reading RXBUF. Why is it there, and why did no interrupt processing start if it was an interrupt?)

I cannot find any advisory in the MSP430F5437A errata that could cover this kind of execution runaway. What is happening, and how can I ensure the reliability of the firmware? Of course I could add a dummy instruction after reading UCA0RXBUF, but without confirmation of the root cause I cannot be sure that it would fix the issue reliably. Undocumented behavior should not happen on a CPU in the first place anyway.

  • Amendment: If State Storage is configured to record all CPU cycles, the control bit "Irq" appears to be asserted exactly on the cycle that reads UCA0RXBUF (0x05CC), and execution then "stalls" on 0x37C6A for a while (having already ignored the RETA at 0x37C68?). Note that there are also unexpected writes(!) to addresses 0x133C and 0x133E (BSL flash area?!) before the second word of the instruction at 0x37C6A is finally read.

  • It could be possible that you have triggered a rare problem with the interrupt system. If it begins to process an interrupt but that interrupt source is cleared before it reads the vector, it can read the wrong vector.

    So what could happen is that a new byte arrives and RXIFG is set. The interrupt system then begins the process of fielding the interrupt. If, perchance, it has to wait for this particular TST.B instruction to finish, then RXIFG will be cleared. Now when the hardware picks which vector to read, it gets the wrong one. If some other interrupt is pending, it will use that. If no interrupt is pending, the vector used isn't specified. It could be almost anything.

    The usual cure for this is to have a spurious interrupt vector.

    So when you have the receive interrupt enabled, don't read RXBUF except in the ISR.
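
A minimal host-side sketch of that rule, assuming a single-producer ring buffer between the ISR and the rest of the firmware. All names here (`rx_ring`, `fake_rxbuf`, `uart_rx_isr_body`, `uart_rx_pop`) are hypothetical and not taken from the firmware discussed in this thread; `fake_rxbuf` merely stands in for UCA0RXBUF so the logic can be compiled and tried on a PC. On the target, the body of `uart_rx_isr_body` would sit inside the USCI_A0 receive interrupt handler, and it would be the only place that touches RXBUF.

```c
#include <stdbool.h>
#include <stdint.h>

#define RX_RING_SIZE 16u /* must be a power of two; holds RX_RING_SIZE - 1 bytes */

typedef struct {
    volatile uint8_t data[RX_RING_SIZE];
    volatile uint8_t head; /* written only by the ISR */
    volatile uint8_t tail; /* written only by the main code */
} rx_ring_t;

rx_ring_t rx_ring;

/* Stand-in for the hardware receive register (assumption: on the target
 * this would be UCA0RXBUF, whose read clears UCRXIFG). */
volatile uint8_t fake_rxbuf;

/* The only function that reads the receive register. */
void uart_rx_isr_body(void)
{
    uint8_t byte = fake_rxbuf;
    uint8_t next = (uint8_t)((rx_ring.head + 1u) & (RX_RING_SIZE - 1u));

    if (next != rx_ring.tail) { /* room left: store the byte */
        rx_ring.data[rx_ring.head] = byte;
        rx_ring.head = next;
    }
    /* else: buffer full, the byte is dropped */
}

/* Main code polls the ring buffer instead of polling RXBUF.
 * Returns true and stores a byte in *out if one was available. */
bool uart_rx_pop(uint8_t *out)
{
    if (rx_ring.tail == rx_ring.head) {
        return false; /* nothing received yet */
    }
    *out = rx_ring.data[rx_ring.tail];
    rx_ring.tail = (uint8_t)((rx_ring.tail + 1u) & (RX_RING_SIZE - 1u));
    return true;
}
```

Code that used to poll RXBUF "just in case" would call `uart_rx_pop()` instead, so no instruction outside the ISR can clear the receive flag while the interrupt is being accepted.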

  • Thanks David for your comment. Indeed, it looks like something unexpected is happening in the interrupt processing. Is there any official TI document that describes this behavior? Based on my literature research, it is not officially "forbidden" to read RXBUF at any time.

    Definitely there is no other interrupt pending in the examples above, otherwise it would have been serviced already/instead. So this goes to the "almost anything" category then. First I thought that just adding a NOP after the TST.B instruction ensures that everything works even if the subsequent instruction is occasionally skipped. But if "almost anything" can happen, this would not be a reliable solution.

    I wonder whether this could happen with any instruction that clears any hardware interrupt flag, for example by writing 0 to any xIFG flag at the same moment the corresponding interrupt occurs. If this is the case, there should be guidance that one should never write xIFGs unless the corresponding interrupt is disabled (via xIE or GIE). (In typical code this is the case, because xIFGs are cleared in interrupts and interrupts are run with GIE unset.)

  • The "almost anything" is constrained. The interrupt hardware that selects which vector to load will pick one. But which one will it use when there is no input? I suspect it would be the lowest priority. Most likely with the value 0xffff in it.

    The result is that the program counter would have 0xfffe loaded (the lsb being a don't care), which would cause the data at the reset vector to be used as an instruction. Not good.

    As for documentation, there is a note in the interrupt section of the guide warning about clearing an interrupt flag just before enabling interrupts. Not quite the same, but similar.

  • The observed behavior is not a jump to 0xfffe but a skip of the subsequent instruction. (This is less fatal than a jump to 0xfffe but is still likely to cause issues, depending on how significant the skipped instruction was. In our case the firmware recovered from the jump, but the unexpected code executed shortly before returning to the correct flow caused the serial port baud rate to change, which made the issue visible.) Of course, all of this falls well under "unexpected CPU execution".

    As you mentioned, the highlighted part is closest to this problem, but I originally considered it unrelated. This part of the code is not enabling or disabling interrupts at all; it is "only" clearing interrupt flags (while the corresponding interrupt is enabled, though). At first glance one might think that clearing an IFG that should be zero anyway (otherwise the interrupt would have been serviced, which clears the IFG...) does not hurt at all. But it seems to be something that should be avoided, for the reasons speculated above.

    Currently we have a manufacturing hold on our end products. We do not officially know why they are failing, because the firmware does not appear to be doing anything illegal. Because of the lost sales, in addition to working around the problem, I will be asked to make a detailed report on what the root cause was, why it happened (= who made the mistake), and how this can be avoided in the future. That is why I am looking for existing documentation that should have unambiguously prevented the original developers from coding like this.

    This firmware is full of apparently unnecessary actions on peripheral registers and useless variables. So far I have considered them only as lost code memory space but now I need to consider them as a risk for the firmware stability, too.

  • Because I may never find an "official" statement on what could happen if UCA0RXBUF is read while the interrupt is enabled, I tried to analyze my observations and derive my own understanding of what the CPU is actually doing in that case.

    (As a Commodore 64-era hacker I love these architectures dating from the 1990s, because one can expect the CPU to be based on simple logic, without any fancy microcode or dynamically changing parts. For completeness, and to entertain possible similarly minded readers of this thread, I will share my analysis below.)

    The explanation for the unexpected write to 0x133C (see the State Storage picture showing all cycles above):

    1. The IRQ glitch causes the CPU to read the RETA instruction from 0x37C68 as "non-fetch" and continue to the instruction word MOVA R13,X(R12) at 0x37C6A. (The read of UCA0RXBUF, 0x05CC, happens in between because of the pipeline.) "Something" is already happening in the CPU when the word at 0x37C68 is read, because the "Control Signals" contain the value 0x302, which differs significantly from other cycles.
    2. However, the destination index word for this indexed MOVA is also read from 0x37C6A (PC not advanced). This makes the MOVA destination 0x133C: 0x05C0 from R12 + 0x0D7C (the opcode word of MOVA R13,X(R12) at 0x37C6A) = 0x133C. (Because of the pipeline, the actual write is performed a few cycles later.)
    3. After this instruction the PC is still not advanced, and normal operation then resumes. The instruction MOVA R13,0x8(R12) at 0x37C6A is now fetched and executed normally, causing a write to 0x05C8 (UCA0MCTL): 0x05C0 from R12 + 0x0008 (destination index word) = 0x05C8.
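
The address arithmetic in the three steps above can be checked with a few lines of C. This is only a sketch of the sums quoted in this thread, not a model of the CPU: the helper name `indexed_dst` is hypothetical, and treating the index word as a signed 16-bit offset with the result masked to 20 bits is an assumption (both index values here are positive, so it does not affect the results).

```c
#include <stdint.h>

/* Values quoted in the thread. */
#define R12_BASE    0x05C0u /* value of R12 (USCI_A0 register base) */
#define RXBUF_INDEX 0x000Cu /* index word of TST.B 0xC(R12) */
#define MOVA_OPCODE 0x0D7Cu /* first word of MOVA R13,X(R12) */
#define MOVA_INDEX  0x0008u /* second word of MOVA R13,0x8(R12) */

/* Effective address of an indexed operand: base register plus index word.
 * Assumption: the index is a signed 16-bit offset and the result is
 * truncated to the 20-bit address space. */
uint32_t indexed_dst(uint32_t base, uint16_t index_word)
{
    int32_t offset = (int16_t)index_word;
    return (uint32_t)((int32_t)base + offset) & 0xFFFFFu;
}
```

With these values, `indexed_dst(R12_BASE, RXBUF_INDEX)` gives 0x05CC (UCA0RXBUF), `indexed_dst(R12_BASE, MOVA_OPCODE)` gives 0x133C (the unexpected write, where the opcode word is taken as the index), and `indexed_dst(R12_BASE, MOVA_INDEX)` gives 0x05C8 (UCA0MCTL, the write from the normally executed MOVA).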

    Assuming that this is how the CPU misbehaves, I can analyze the binaries of our firmware and give our internal stakeholders a statement on whether the CPU problem could cause any issues in the final products containing those firmware versions.

  • Reading RXBUF outside the ISR is a problem so shouldn't be done.

    Even assuming that the ISR gets invoked correctly, what happens in the ISR when RXIFG isn't set?

    If you insist on doing that, disable interrupts first.

  • I agree; I have already removed those useless RXBUF reads completely (excluding the valid one inside the ISR) from the next firmware version. But I still need to evaluate what the unnecessary RXBUF reads could cause in the hundreds of thousands of products shipped during the last 15(?) years. To do that, the only thing I can do is hope that the CPU behaves the same way every time. Assuming that, I just need to disassemble the binary images to see which instructions follow the RXBUF reads and how this kind of abnormal execution would affect product operation in each case, if at all.

    I have been called into this case just to explain what others did years ago and to determine whether it calls for a major recall.
