How to research and work around CPU errata - quickly?

Other Parts Discussed in Thread: MSP430F5438A

A couple days ago I stumbled across and diagnosed CPU42 (DINT immediately after EINT crashes the CPU if an interrupt is pending - this instruction sequence can happen from C code when the compiler inlines functions). Then I found a post where it was reported two years ago and learned its name. http://e2e.ti.com/support/microcontrollers/msp43016-bit_ultra-low_power_mcus/f/166/p/53319/832795.aspx The comments and bug-database imply that the fix may not be adequate for two reasons:

1) One commenter reported that, depending on the instruction, the CPU can execute one _or two_ instructions after EINT when an interrupt is pending.

2) The silicon-errata=CPU42 workaround is described as not being turned on by default.

Reading the threads here gives me a non-warm-and-fuzzy feeling about CPU errata. For one thing, CPU42 apparently didn't make it into the toolchain until 4.1.3, although it was reported in June 2010. For another, there are various bugs being reported where other CPU errata are not supported - at all - even by 4.1.3.

And CPU42 is not in the errata sheet for the processor I'm using (MSP430F5438A) that I downloaded yesterday.

It was literally a one-in-1000 chance that this bug cropped up repeatably rather than randomly crashing the CPU once every few days. If the timer-tick interrupt hadn't fallen right on the correct spot in the initialization code... ugh. We would not have caught it until it was too late.

So, we have to ship this product in a couple of weeks, and it'll be impossible to change the code after that. (It's actually a CubeSat.) Given how close this came to killing us, I want to make very sure that there aren't any other known-but-poorly-documented errata lurking, and that all known errata are actually dealt with correctly by our toolchain/environment.

I'm installing the latest CCSV5 right now (I'd been using toolchain 4.1.1, which is not that old). What else should I be doing? Where else should I be reading? I don't have the time (or the budget) to read every post in these forums, or every bug in the database, in case it contains a bug that might affect my chip.

Do I just _hope_ that there aren't any more problems? How do other MSP430 engineers handle this? Does anyone have any advice I can use?

Many thanks!

(BTW, if any TI engineers are reading this, _please_ document the DCO/FLL sensitivity to power supply fluctuations in more detail. We found that a 50 mV 1 us spike would change the master-clock timing enough to cause serial bit errors. Even after we'd debugged and characterized the problem, our hardware team claimed their power supply was good enough - and we couldn't prove them wrong, because I can't find anywhere that you describe the requirements for the power supply.)

  • Chris Phoenix said:
    How do other MSP430 engineers handle this?

    Well, because you're explicitly asking...

    I have some code (actually quite old) for writing to flash. Since a flash write takes some time and interrupts need to be disabled during a flash write operation, I briefly enable them between writes.
    Because of this (and the other) thread, I checked my code and found that I was explicitly placing two NOPs between EINT and DINT, with the comment 'give some time to any pending ISRs'.
    I don't know whether the code would have crashed without these NOPs. I just was aware that there might be more than one ISR pending and added the two NOPs. Without knowing of any possible silicon bug.
    It's not the only situation where I circumvented a bug without knowing about it at all, just by being 'aware' of the circumstances.
    Many problems go away or do not even pop up if you visualize what happens rather than what you want to happen.
    However, this requires some base knowledge. If you don't understand how things work, you cannot see where reality and your imagination drift apart (like pushing 10 bytes in a row into TXBUF and wondering why only the first and the last arrive - which obviously must happen if MCLK is much faster than the baud rate).
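
    The interrupt window described above (EINT, two NOPs, then DINT between flash writes) could look roughly like the sketch below. It is not the poster's actual code. It assumes the `__enable_interrupt()`, `__disable_interrupt()` and `__no_operation()` intrinsics that MSP430 compilers expose through `<msp430.h>`; `flash_write_word()`, `flash_write_buffer()` and the host-side trace buffer are hypothetical stand-ins so the ordering can be checked off-target.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    #if defined(__MSP430__)
    #include <msp430.h>
    #define IRQ_ENABLE()   __enable_interrupt()
    #define IRQ_DISABLE()  __disable_interrupt()
    #define NOP()          __no_operation()
    #else
    /* Host build (assumption): record the order of operations instead of
       touching hardware. */
    static char   trace[64];
    static size_t trace_len;
    static void trace_put(char c)
    {
        if (trace_len + 1 < sizeof trace) {
            trace[trace_len++] = c;
            trace[trace_len] = '\0';
        }
    }
    #define IRQ_ENABLE()   trace_put('E')
    #define IRQ_DISABLE()  trace_put('D')
    #define NOP()          trace_put('n')
    #endif

    /* Hypothetical placeholder for the real flash programming routine. */
    static void flash_write_word(uint16_t *dst, uint16_t value)
    {
        *dst = value;
    #if !defined(__MSP430__)
        trace_put('W');
    #endif
    }

    /* Each word is written with interrupts masked; afterwards a short window
       lets pending ISRs run before the next DINT. */
    static void flash_write_buffer(uint16_t *dst, const uint16_t *src, size_t words)
    {
        size_t i;

        for (i = 0; i < words; i++) {
            IRQ_DISABLE();                      /* DINT */
            flash_write_word(&dst[i], src[i]);
            IRQ_ENABLE();                       /* EINT */
            NOP();                              /* keeps EINT from being */
            NOP();                              /* directly followed by DINT */
        }
    }
    ```

    Whether one NOP or two is needed is exactly the open question in the CPU42 reports quoted earlier, so the count here simply mirrors the post.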

    Maybe it's just because I'm more curious than average. I want to understand how things work rather than just picking keywords from sample code. It takes a bit more time, but it pays off in the end.

    Ask yourself one question: how come the errata sheet contains entries at all? Apparently someone must have identified the problem. If they could, why can't you? I doubt that most of the problems were identified by looking at top-secret internal TI documents like silicon blueprints.
    Sure, it's convenient to have an environment that simply works. But the world is less than ideal. And being aware of this is one of the differences between a software engineer and a software designer.

  • Chris Phoenix said:
    (BTW, if any TI engineers are reading this, _please_ document the DCO/FLL sensitivity to power supply fluctuations in more detail. We found that a 50 mV 1 us spike would change the master-clock timing enough to cause serial bit errors. Even after we'd debugged and characterized the problem, our hardware team claimed their power supply was good enough - and we couldn't prove them wrong, because I can't find anywhere that you describe the requirements for the power supply.)

    I am not a TI engineer, but I found two FLL bugs that can upset async serial I/O.  After I worked with TI, these eventually became UCS7 and UCS10.  If I had to guess, I would say your problems were caused by UCS10.  My work around (which TI chose not to publish) was to use the FLL intermittently -- in other words, turn it on, wait for it to settle, turn it off, and then make sure UCS10 didn't happen after you decided it was settled.

    As to your primary question in this thread regarding dealing with silicon bugs quickly -- all you have is the errata sheet for your MCU and this forum.  This forum is most helpful for discussing work-arounds, both those suggested in the errata sheet and those that should have been.  In that case, search by erratum name (e.g., "UCS10").  Seems to me TI does about as good (and bad) a job as any other MCU maker in dealing with errata and providing related resources.

    I am intrigued by the CPU42 mystery, and I wonder why it is not included in the errata sheet.  The general rule, however, is that all known defects are included in the errata sheet.  CPU42 seems to be a strange exception to the rule.  Most people will not encounter this bug because DINT is typically preceded by MOV SR which makes it immune from CPU42.

    Jeff

  • In our case, it was definitely triggered by the power supply - we could put a scope on the power line and on a square wave output by the chip, trigger on extra-short pulses on the square wave, and the power line spikes would always be there, a microsecond or two before.

    If I understand UCS10, it can only happen around the time of a FLL-controlled DCO tap change, which can only happen when REFCLK transitions, which takes well over 10 us. So I don't think it's possible that the power supply spikes were activating UCS10.

    I think I also remember that the frequency shifts also happened when using the DCO with FLL turned off. I'm 90% sure of that.

    Congratulations on finding those, BTW.

    On CPU42, I was writing straight C code, with optimization level 4, speed-vs-size 0, and Configuration Debug (if that makes any difference), in toolchain 4.1.1, for MSP430F5438A. My code was essentially this:

    /* "my_struct" is a placeholder; the struct tag was omitted in the post */
    void BadFun(struct my_struct *param)
    {
      _disable_interrupts();
      param->field = volatile_variable;
      _enable_interrupts();
    }

    void CrashIt(void)
    {
      BadFun(var1);
      BadFun(var2);
    }

    When I looked in the disassembly view, it showed that the calls to BadFun had been inlined, and EINT was immediately followed by DINT.

    The crashes were interesting; depending on exactly what code I added to BadFun, sometimes the CPU would reboot immediately, and sometimes it would hang until the watchdog timer rebooted it. During the hang, the timer-tick interrupt would not be serviced (or else the I/O pins I was twiddling in the interrupt handler were disabled somehow).

    Interestingly, I could put one I/O pin twiddle in the middle of BadFun (e.g. P5OUT |= 1), but not two; that would make it stop crashing. I didn't check the disassembled code, so this is probably just optimizer magic.

    Chris

  • Maybe UCS10 actually caused your power-supply spikes.  The DCO accelerating by 10% "instantly" could easily have that effect on the power supply.  The "short" output pulse would come soon afterward as evidence that UCS10 happened.

    On CPU42, your function BadFun() isn't normal because it "blindly" enables interrupts.  Typically it would look like this instead:

    /* "my_struct" is a placeholder; the struct tag was omitted in the post */
    void BadFun(struct my_struct *param)
    {
      save_interrupt_status();     // MOV.W  SR, [ ]
      _disable_interrupts();       // DINT
      param->field = volatile_variable;
      restore_interrupt_status();  // MOV.W [ ], SR
    }

    which would have prevented CPU42.
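
    A minimal sketch of the same save-disable-restore pattern with real intrinsics, assuming a compiler that provides `__get_interrupt_state()`, `__disable_interrupt()` and `__set_interrupt_state()` through `<msp430.h>`. The names `SafeFun`, `struct sample`, `irq_state_t` and the simulated status register used for host builds are hypothetical and not from the thread.

    ```c
    #include <stdint.h>

    typedef unsigned short irq_state_t;   /* assumed width of the saved state */

    #if defined(__MSP430__)
    #include <msp430.h>
    #define IRQ_SAVE()       __get_interrupt_state()
    #define IRQ_DISABLE()    __disable_interrupt()
    #define IRQ_RESTORE(s)   __set_interrupt_state(s)
    #else
    /* Host stand-in (assumption): a fake status register. 0x0008 is the
       position of the GIE bit in the MSP430 SR. */
    #define SIM_GIE 0x0008u
    static irq_state_t sim_sr = SIM_GIE;
    #define IRQ_SAVE()       (sim_sr)
    #define IRQ_DISABLE()    ((void)(sim_sr &= (irq_state_t)~SIM_GIE))
    #define IRQ_RESTORE(s)   ((void)(sim_sr = (s)))
    #endif

    struct sample { uint16_t field; };          /* hypothetical struct */
    static volatile uint16_t volatile_variable;

    static void SafeFun(struct sample *param)
    {
        irq_state_t saved = IRQ_SAVE();   /* remember whether GIE was set */
        IRQ_DISABLE();
        param->field = volatile_variable;
        IRQ_RESTORE(saved);               /* put GIE back as the caller left it */
    }
    ```

    Because the exit path restores the saved state instead of unconditionally enabling interrupts, two inlined calls in a row should not end one call with an explicit enable right before the next call's disable. That is worth confirming in the disassembly for the toolchain in use.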

    Jeff

  • No, the spikes happened in the power supply without the CPU hooked up. Just bad design. The CPU did inject additional noise into the line, but the spikes were a different shape - very distinctive.

    But I'm realizing that I didn't verify that every spike caused the timing change. It's possible that it happened only when a spike coincided with REFCLK. So it could still be UCS10 after all... if my memory is wrong about the timing change happening even when the FLL was off. But the more I think about it, the more I remember that we looked at the UCS10 fix, tried it, and found it didn't solve the problem for us. Anyway, it's moot for us, because we were able to put an external resonator on the board.

    I can see why save/restore would prevent CPU42. In my code, I'm not calling BadFun() from anything that could have already disabled interrupts. Your version is more paranoid, and thus better. :-) But I wouldn't have thought that everyone would use it automatically. Am I missing some way that interrupts can be disabled without my code explicitly doing it?

    BTW, thanks for actually answering my question about where to find errata.

  • Chris Phoenix said:
    Am I missing some way that interrupts can be disabled without my code explicitly doing it?

    Interrupts are also disabled during interrupt acknowledge (when preparing to call your ISR), but you already knew that.

    Most of the commercial and professional code I see uses the save-disable-restore approach.  Just makes the function a bit more agile, ready for use by any caller.  Not really sure about code in the hobbyist community.  (Actually, off topic, I think most people over-use interrupt masking.  Good design often eliminates the need.)

    Anyway I think I'm going to do some CPU42 experiments too.  I can't understand why it does NOT appear in the errata sheet.

    Jeff

  • Jeff Tenney said:
    Most of the commercial and professional code I see uses the save-disable-restore approach. 

    Especially if you use it inside a function that is called from an unknown caller, it is mandatory to save the current state before disabling interrupts. (And code shouldn't enable interrupts at all if they are disabled, as this usually has a reason.)

    I once created a macro that increments and decrements a counter for "critical sections" and disables interrupts, but only (and then always) enables them when the counter is decremented to 0 at the end of the macro. It allows for unlimited (well, 65535) nested critical sections.
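
    The counting approach described above could be sketched as follows. This is an illustration under assumptions, not the poster's macro: `CRITICAL_ENTER`, `CRITICAL_EXIT`, `critical_depth` and the two example functions are hypothetical names, and the host branch replaces the interrupt intrinsics with a simple flag.

    ```c
    #include <stdint.h>

    #if defined(__MSP430__)
    #include <msp430.h>
    #define IRQ_ENABLE()   __enable_interrupt()
    #define IRQ_DISABLE()  __disable_interrupt()
    #else
    /* Host stand-in (assumption): a flag instead of the GIE bit. */
    static int sim_irq_enabled = 1;
    #define IRQ_ENABLE()   ((void)(sim_irq_enabled = 1))
    #define IRQ_DISABLE()  ((void)(sim_irq_enabled = 0))
    #endif

    /* Nesting level; 0 means no critical section is active. */
    static volatile uint16_t critical_depth;

    #define CRITICAL_ENTER()  do { IRQ_DISABLE(); critical_depth++; } while (0)
    #define CRITICAL_EXIT()   do { if (--critical_depth == 0) { IRQ_ENABLE(); } } while (0)

    static uint16_t shared_total;   /* hypothetical data shared with an ISR */

    static void add_sample(uint16_t v)
    {
        CRITICAL_ENTER();
        shared_total += v;
        CRITICAL_EXIT();
    }

    /* The outer section keeps interrupts masked across both inner sections. */
    static void add_pair(uint16_t a, uint16_t b)
    {
        CRITICAL_ENTER();
        add_sample(a);
        add_sample(b);
        CRITICAL_EXIT();
    }
    ```

    Note that two depth-0 sections placed back to back can still end with an enable immediately followed by a disable, which is the CPU42 pattern, so this scheme alone does not replace a NOP or a compiler workaround.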

    Jeff Tenney said:
    I think most people over-use interrupt masking.  Good design often eliminates the need.

    I completely agree. But when you use long global variables that are changed by an ISR and read by main, it is necessary to disable interrupts while reading the variable. And there are some other situations where you cannot allow interrupts, like writing to flash.
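
    As an illustration of the long-variable case: on a 16-bit CPU a 32-bit read takes more than one move, so an ISR that updates the variable between the two halves can produce a torn value. A minimal sketch, assuming the `__get_interrupt_state()` / `__disable_interrupt()` / `__set_interrupt_state()` intrinsics from `<msp430.h>`; `tick_count`, `timer_tick()` and `ticks_now()` are hypothetical names, and the host branch only simulates the interrupt-enable flag.

    ```c
    #include <stdint.h>

    typedef unsigned short irq_state_t;   /* assumed width of the saved state */

    #if defined(__MSP430__)
    #include <msp430.h>
    #define IRQ_SAVE_AND_DISABLE(s) \
        do { (s) = __get_interrupt_state(); __disable_interrupt(); } while (0)
    #define IRQ_RESTORE(s)  __set_interrupt_state(s)
    #else
    /* Host stand-in (assumption): 1 = interrupts enabled, 0 = masked. */
    static irq_state_t sim_gie = 1;
    #define IRQ_SAVE_AND_DISABLE(s) \
        do { (s) = sim_gie; sim_gie = 0; } while (0)
    #define IRQ_RESTORE(s)  ((void)(sim_gie = (s)))
    #endif

    /* Hypothetical counter that a timer ISR increments. */
    static volatile uint32_t tick_count;

    /* On the target this would be the body of the timer ISR. */
    static void timer_tick(void)
    {
        tick_count++;
    }

    /* Copy both 16-bit halves while the ISR cannot run in between. */
    static uint32_t ticks_now(void)
    {
        irq_state_t saved;
        uint32_t copy;

        IRQ_SAVE_AND_DISABLE(saved);
        copy = tick_count;
        IRQ_RESTORE(saved);
        return copy;
    }
    ```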
