Apparent uncommanded MSP430F1611 in field flash erasure

Other Parts Discussed in Thread: MSP430F1611, MSP430F1232, MSP430F415

We have MSP430F1611s that are apparently erasing segments of their on-board flash in the field on power-up. Note this isn't a low-power application; we just reused this processor because we had it on the shelf from other projects.

The circuit board using the MSP430 is powered by a DC power supply and is usually switched off at night and then back on in the morning.

Sometimes in the field the microcontroller fails to initialize the board properly. Looking at the devices in the lab, we find that segments of flash have been erased. They are not always the same segment, and sometimes there are multiple segments. Reloading the flash fixes the problem.


The firmware guys tell me they aren't re-writing the onboard flash ever and they have done a code audit to make sure it isn’t happening accidentally.

I’ve never been able to duplicate this failure in the lab and am wondering if anyone has any ideas. Meanwhile, I’m testing all of the usual suspects – Power Transients, ESD, etc.

We have thousands of these in battery powered applications and never had an issue like this so it’s something specific to the hardware/software design of this project.

  • Mel Costello said:
    We have MSP430F1611s that are apparently erasing segments of their on-board flash in the field on power-up. Note this isn't a low-power application; we just reused this processor because we had it on the shelf from other projects.

     Is this failure on just one board, or is it deterministic, with several boards failing at the same time?

     Is this failure frequent?

    Mel Costello said:
    Sometimes in the field the microcontroller fails to initialize the board properly. Looking at the devices in the lab, we find that segments of flash have been erased. They are not always the same segment, and sometimes there are multiple segments. Reloading the flash fixes the problem.

     Again, if more than one fails, do the erased segments follow the same pattern on all of them?

    Mel Costello said:

    The firmware guys tell me they aren't re-writing the onboard flash ever and they have done a code audit to make sure it isn’t happening accidentally.

     So they checked in some way that the flash registers are untouched at run time? Did they cover all software cases? (Very time-consuming.)

    Mel Costello said:

    We have thousands of these in battery powered applications and never had an issue like this so it’s something specific to the hardware/software design of this project.

     Does this mean just one fails, or are you referring to the processor type?

  • Mel Costello

    We have MSP430F1611s that are apparently erasing segments of their on-board flash in the field on power-up. Note this isn't a low-power application; we just reused this processor because we had it on the shelf from other projects.

     Is this failure on just one board, or is it deterministic, with several boards failing at the same time?

    Multiple boards have failed in the field in different pieces of the same model equipment at different times over the last year

     Is this failure frequent?

                Yes. If the trend holds we are looking at 5-10 percent of the population failing per year.

    Mel Costello

    Sometimes in the field the microcontroller fails to initialize the board properly. Looking at the devices in the lab, we find that segments of flash have been erased. They are not always the same segment, and sometimes there are multiple segments. Reloading the flash fixes the problem.

     Again, if more than one fails, do the erased segments follow the same pattern on all of them?

    Segments are different. Sometimes there are multiple segments erased, some of them are contiguous, some are not.

    Mel Costello

    The firmware guys tell me they aren't re-writing the onboard flash ever and they have done a code audit to make sure it isn’t happening accidentally.

     So they checked in some way that the flash registers are untouched at run time? Did they cover all software cases? (Very time-consuming.)

    No, they just did a code crawl. I’m sure they haven’t checked all the software cases, particularly things such as the effect of different combinations of interrupts occurring at different times. All I can say at this point is that they never intentionally set the FMC to erase mode (or unintentionally, as far as they can see from a code crawl).

    Mel Costello

    We have thousands of these in battery powered applications and never had an issue like this so it’s something specific to the hardware/software design of this project.

     Does this mean just one fails, or are you referring to the processor type?

    We use this device in three different products. Two of them are portable battery-powered devices where we’ve never had a failure of this type (thousands in the field). The device that is failing is powered by a 5 V DC supply (regulated down to 3.3 V using a TC55RP series regulator).

    I’m not saying this is a hardware issue – I’m just wondering if there is a known hardware based timing/startup/ESD issue that could cause this.

  •  Hi Mel, thanks for covering the failure statistics; two more details are needed:

    Mel Costello said:

    Is this failure on just one board, or is it deterministic, with several boards failing at the same time?

    Multiple boards have failed in the field in different pieces of the same model equipment at different times over the last year

     Are these failures from different customers or from the same plant? Are they seasonal or evenly scattered over time?

    (If from the same plant, is there a high-power RF field there, or some other particular condition?)

    Mel Costello said:

     Is this failure frequent?

                Yes. If the trend holds we are looking at 5-10 percent of the population failing per year.

     From this I infer that the failure has been showing up for a long time, with sporadic failures over more than a year. Have you ever experienced one of these failures in the lab?

     Is the processor protected or open to JTAG programming? That is, are you detecting the flash failure via the bootloader or by reading over JTAG?

     Could you post a picture of the device and some idea of how it works?

  • Hi Mel, thanks for covering the failure statistics; two more details are needed:

    Mel Costello

    Is this failure on just one board, or is it deterministic, with several boards failing at the same time?

    Multiple boards have failed in the field in different pieces of the same model equipment at different times over the last year

     Are these failures from different customers or from the same plant? Are they seasonal or evenly scattered over time?

    From the same plant but multiple customers at all times of the season, including from subtropical/tropical regions such as Florida and the Philippines where you would not expect static to be a problem

    (If from the same plant, is there a high-power RF field there, or some other particular condition?)

    The only RF would be from their 802.11 wireless. They have soldering irons and other typical electronic assembly equipment but that is about it.

     

    Mel Costello

     Is this failure frequent?

                Yes. If the trend holds we are looking at 5-10 percent of the population failing per year.

     From this I infer that the failure has been showing up for a long time, with sporadic failures over more than a year. Have you ever experienced one of these failures in the lab?

    Yes we’ve had failures for about a year and no we’ve never been able to create a failure in the lab, including using circuit cards that previously failed in the field and were re-programmed. If we could duplicate it, we could instrument up software and hardware and narrow down the failure area.

     Is the processor protected or open to JTAG programming? That is, are you detecting the flash failure via the bootloader or by reading over JTAG?

    If you mean not accessible to users, then it is protected from users. The card is in a metal box along with an open-frame switching power supply, some external connectors (all of which are buffered), and an optically isolated digital switch that is used to turn a 2 A, 60 Hz motor on and off. I mention that only because there will be some low-level EM fields associated with the power supply inverter and the motor AC. I have checked for DC transients on the 5 V supply during startup and see nothing. I’ve also done the same on the 3.3 V rail.

    Detecting flash failure by reading from JTAG

     Could you post a picture of the device and some idea of how it works?

                Will have to post tomorrow as I will be working on another project this afternoon

  • Mel Costello said:
    Yes we’ve had failures for about a year and no we’ve never been able to create a failure in the lab, including using circuit cards that previously failed in the field and were re-programmed.

     This sounds strange. After you sent these reprogrammed boards back, did they fail again?

    Mel Costello said:

    Detecting flash failure by reading from JTAG

     OK, so the processor is open to reading, writing, and modifying flash. It could be some noise on JTAG starting this trouble, though that seems really strange. If it were protected, only the whole flash could be erased, not segment by segment, unless one knows the password held in the vector area.

     Can sabotage of the firmware by a malicious user be excluded?

  • Mel Costello said:

    The firmware guys tell me they aren't re-writing the onboard flash ever and they have done a code audit to make sure it isn’t happening accidentally.

    Doing a code audit for things that could or could not happen accidentally is a daunting task!

    Instead of, or in addition to, a code audit, I would search for the words 0x0128, 0x012A, or 0x012C in the image of the entire undamaged flash memory. If any are found, I would disassemble the nearby words to see whether FCTL1, FCTL2, and FCTL3 could have been accessed intentionally or accidentally.
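The search suggested above could be sketched roughly as follows. This is a minimal host-side sketch, not a complete audit tool: it assumes the flash image has already been read out into a byte buffer that starts at an even address, and it relies on the MSP430 being little-endian with word-aligned instruction words. The function name `find_fctl_words` is made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Flash controller register addresses named in the post above
 * (FCTL1, FCTL2, FCTL3). */
static const uint16_t FCTL_ADDRS[] = { 0x0128, 0x012A, 0x012C };

/*
 * Scan a raw flash image for 16-bit little-endian words equal to one of
 * the FCTL addresses. Only even offsets are checked, on the assumption
 * that the buffer starts at an even address and code is word aligned.
 * Stores up to max_hits byte offsets in hits[] and returns the total
 * number of matches found (which may exceed max_hits).
 */
size_t find_fctl_words(const uint8_t *image, size_t len,
                       size_t *hits, size_t max_hits)
{
    size_t found = 0;
    for (size_t off = 0; off + 1 < len; off += 2) {
        uint16_t word = (uint16_t)(image[off] | (image[off + 1] << 8));
        for (size_t i = 0; i < sizeof FCTL_ADDRS / sizeof FCTL_ADDRS[0]; i++) {
            if (word == FCTL_ADDRS[i]) {
                if (found < max_hits)
                    hits[found] = off;
                found++;
                break;
            }
        }
    }
    return found;
}
```

A hit is only a candidate: the same bit pattern can occur in data or constants, so each offset still needs to be disassembled by hand. An access made through a pointer computed at run time would not show up in such a scan at all.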

  • Hello Mel Costello,

    The community here has some great suggestions for helping you solve your issue, but I wanted to chime in with a few more. One thing to highlight is the differences in hardware between the design that has an issue and the other two designs, specifically in regard to your power supply and possibly the power-up sequence. If you can look at the power rails during power-up, that may shed some light on the issue. Of course, to really nail this down, the failure condition will need to be recreated. From the thread, I see this has proven difficult to do in the lab. Are you using the same power supplies as in the field, or just a bench supply for testing? It would be best to recreate the field conditions as closely as possible.

    Care must also be taken before changing DCO settings to ensure VCC has ramped to VCC(min) for the updated frequency. This is because the CPU starts executing at default DCO values after being released by the BOR circuit. The timing may be different between the hardware designs, and you can use the SVS module to ensure that VCC is at the appropriate level before changing the DCO.
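As a rough illustration of the SVS step, here is a small sketch of the two pieces of logic involved: building the value to write to SVSCTL, and deciding from a read-back whether the SVS reports itself on. The bit positions (VLDx in bits 7 to 4, PORON 0x08, SVSON 0x04) and the use of levels 1 to 14 for the internal thresholds are assumptions taken from memory of the MSP430x1xx register layout, and the helper names are invented here; check them against the device header and user's guide before relying on them.

```c
#include <stdint.h>

/* Assumed SVSCTL bit layout; verify against the device header file. */
#define SVS_SVSON      0x04u  /* status: SVS is on                      */
#define SVS_PORON      0x08u  /* low-voltage event causes a POR         */
#define SVS_VLD_SHIFT  4      /* VLDx threshold field in bits 7..4      */

/*
 * Build the value to write to SVSCTL for an internal threshold level
 * (assumed range 1..14), with PORON set so that a low-voltage event
 * resets the device. Returns 0 (SVS off) for any other level.
 */
uint8_t svs_ctl_value(unsigned vld_level)
{
    if (vld_level < 1 || vld_level > 14)
        return 0;
    return (uint8_t)((vld_level << SVS_VLD_SHIFT) | SVS_PORON);
}

/* Nonzero once a read-back of SVSCTL shows the SVSON status bit set. */
int svs_armed(uint8_t svsctl_readback)
{
    return (svsctl_readback & SVS_SVSON) != 0;
}

/*
 * Intended use on the target (hypothetical, not compiled here):
 *
 *     SVSCTL = svs_ctl_value(level);
 *     while (!svs_armed(SVSCTL)) { }   // wait for the SVS to settle
 *     // only now raise the DCO frequency
 */
```

Keeping the register arithmetic in plain functions means it can be checked on a PC, while the actual register write and polling loop stay a few lines of target code.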

    Another aspect to look at is if the UART BSL is being accessed by accident and is erasing the flash. (Similar to concerns of JTAG access within thread.)

    Hope this helps!

    Regards,
    JH
  • Is the problem happening at power-up, or is it already happening at power-down?
    At startup, the MSPs run with 1MHz bus clock. The brownout detection (if the device has one) is suitable to deal with this. Especially if you have an R/C combo on the reset pin.
    But if you then run the MSP at a higher frequency, a higher voltage is required. Besides the risk of not having reached this level when you speed up the CPU (which can be prevented by using the internal SVS, if available, or the ADC), the same problem happens when power is lost. The MSP still runs at its higher speed, and so does the data bus. Suddenly the supply voltage is no longer high enough for the flash controller to operate properly on the CPU read requests, sometimes leading to flash corruption (if the CPU hasn't crashed before).
    At least this was the outcome of some other threads.

    Using the SVS will put the CPU into reset as soon as the supply begins to drop. Or you connect a voltage supervisor chip to the reset pin, or the power_good output of your supply, if available. In other cases, doing so has instantly stopped the failure.
    In our designs, we use a supervisor chip on the 5V rail, which powers the 3.6V regulator for the MSP430F1232. The MSP is held in reset if the 5V rail is below 4.75V (which means that the 3.6V supply is still/already stable for sure).
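The point above about higher clock speeds needing a higher supply voltage can be put into numbers with a simple linear interpolation between two datasheet points. The sketch below is only that, a sketch: the struct and function names are invented, and the endpoint values used in the example (4.15 MHz at 1.8 V, 8 MHz at 3.6 V) are an assumption about the frequency-versus-VCC limit that must be checked against the datasheet of the actual device.

```c
#include <stdint.h>

/* Two points on an (assumed linear) maximum-frequency-versus-VCC limit. */
typedef struct {
    uint32_t f_lo_khz;  /* frequency allowed at the lower voltage  */
    uint32_t v_lo_mv;   /* lower voltage, in millivolts            */
    uint32_t f_hi_khz;  /* frequency allowed at the upper voltage  */
    uint32_t v_hi_mv;   /* upper voltage, in millivolts            */
} vcc_limit_t;

/*
 * Minimum VCC in millivolts for a given clock frequency, rounded up.
 * Returns the lower voltage for frequencies at or below f_lo_khz and
 * 0 if the frequency is above f_hi_khz (outside the described range).
 */
uint32_t min_vcc_mv(const vcc_limit_t *lim, uint32_t f_khz)
{
    if (f_khz <= lim->f_lo_khz)
        return lim->v_lo_mv;
    if (f_khz > lim->f_hi_khz)
        return 0;

    uint32_t df  = lim->f_hi_khz - lim->f_lo_khz;
    uint32_t dv  = lim->v_hi_mv - lim->v_lo_mv;
    uint32_t num = (f_khz - lim->f_lo_khz) * dv;

    return lim->v_lo_mv + (num + df - 1) / df;  /* round up */
}

/*
 * Example with ASSUMED endpoints (check the datasheet):
 *
 *     vcc_limit_t lim = { 4150, 1800, 8000, 3600 };
 *     uint32_t need = min_vcc_mv(&lim, 7000);
 */
```

With those assumed endpoints, a 7 MHz clock works out to a minimum of roughly 3.13 V, which would leave well under 200 mV of margin on a nominal 3.3 V rail. That is one way to see why a supervisor on the upstream 5 V rail gives more headroom than watching VCC itself.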
  • I'm not sure if it is happening on Power Up or Down. We are only using an RC circuit on the reset and no voltage supervisor. The  processor is running at 7MHz.


    I suspect you are correct in that it is happening on Power Down. I should be able to test this scenario on the bench with a programmable supply which I will do next week.

    Thank you for the information

  • Greetings and thanks, (Mel, Jens, Jace )
    I have been wrestling with a nearly identical issue on a board based on a MSP430F415; random erasures of one or both flash INFO memory blocks in the field especially on units that are powered up and down at least daily.
    We are also running at 7 MHz with (previously) no use of the SVS to prevent operation at low voltage, and this thread gives me considerable hope that my intended improvements will be sufficient (since I can't recreate the failures either).

    --Thurston
  • Good luck. I've tried reproducing the failure on the bench all week with no success. I've tried running with the supply voltage just above where the processor cuts off (and below the recommended operating voltage for 7 MHz), as well as modulating the supply to cross the critical threshold. I can get the processor to hang but not to blow the cache.


    From reading other threads, though, this seems to be a common problem with the MSP430. Since our application uses an AC source and derives the 3.3 V from 5 V onboard, I plan to use an external voltage supervisor tied to the 5 V rail that will put the processor in reset whenever the 5 V gets below 4.5 V or so (as per Jens's suggestion). That way we can be sure we never get near a critical voltage on the 3.3 V rail.

  • I was able to induce some bad behavior of some of our output lines by manipulating the supply voltage to fall at just the right time. So I would expect that it isn't just the low voltage but having the right sort of operation performed at the same time.

    Can anyone answer with certainty whether a simple read of a byte-addressed memory location in the INFO flash area might be capable of inducing this unintended erase operation under low-VCC conditions? I will try to set up an experiment to see what happens...

    Or could the other theory account for lost flash data: cranking up the clock speed via DCO adjustment before VCC has risen sufficiently? Would it matter whether the memory is being accessed at all?

    A third possible mechanism that I dreamed up ( based on what my MSP430F415 software might be doing during a random power event ) is that the program does set up for an intentional erase operation and issues a dummy write to 0x1080 ( to erase INFO A block ) but due to the power conditions the address is corrupted to 0x1000 and it erases INFO B instead. Does this seem remotely plausible ? It is only a one bit error, but I'm still thinking it is implausible as I have *never* seen any of the main memory blocks erased by mistake.

    Our MSP supply voltage (3.3V) is likewise derived from a 12V source. So my maximal solution is to monitor that node( level shifted and voltage clamped ) on a regular input so the software will be able to see impending power down events with plenty of time to complete operations already in progress ( such as the INFO A operation in the above example )

    Thanks again, and good luck,
    --Thurston
  • I haven't been able to duplicate the Flash corruption but I have been able to verify erratic operation of the CPU when VCC drops under 2.5 volts on Power Down. Given the info on this and other forums, as well as the behavior I've been able to verify on the board, I'm convinced that the problem is with VCC on power down.


    Now the question is how to fix it. The consensus on the forums is that the on-chip SVS can't be trusted and that to fix this completely, you should use an external voltage supervisor. Given that we have a significant population of this board in the field, I'm being asked (and I think it's a valid question): "How much of an improvement would we get with the current boards by simply changing the code to use the on-chip SVS to ensure the VCC rail is as close to 3.3 V as possible?"


    This of course is because we can reprogram the devices in the field which is easier (and less costly) than replacing the board with a new one with an external voltage supervisor. We wouldn't want to do this if we see little or no improvement but if it reduces in-field failures on the current boards by 70-90 percent, it would be worth doing.

  • If the supply fails during an erase operation (which takes several ms), it is quite possible that a different segment gets erased or partly erased.
    Using the SVS (especially if the CPU is running at higher speed and therefore the SVS threshold is high) will prevent the CPU from initiating the erase, while the falling supply is likely still enough to complete an erase that has already started (there should be some buffer capacitance anyway).
    The on-chip SVS is indeed not precise enough (due to low-power considerations) to prevent a CPU crash in all cases. That isn't a big problem if you have the watchdog active (if the SVS doesn't trigger a reset on a brownout, the watchdog eventually will).
    However, an external SVS is indeed the best solution, especially if it has some 'headroom'. Like testing for 5V failure, so it triggers before VCC begins to drop. And keeps the MSP in reset until VCC is already fully up again.

    If you have enough buffer capacitance on VCC, the external SVS can be read in software, so the software can shut down and then stop triggering the Watchdog. And test for the signal after reset before starting-up.

    For existing hardware, using the internal SVS perhaps won't give 100% safety, but you should see a significant improvement (fewer failures).
  • Just a follow up on my unintended flash erasure/corruption problem with 430F415.  Thanks to everyone in the thread for your comments.

    I was finally able to replicate the problems we were seeing by switching AC power on and off (with relays) to the supplies used with our units.  I still don't understand why I can't replicate the problem as easily by switching the DC power, but it is a good example of how important it is to test systems as they will be deployed.

    Although a hardware based approach such as external SVS would clearly be a robust solution I wanted to do as much as possible within the constraints of our current PCB with no changes.

    The following improvements to the firmware have made a dramatic decrease in the rate of these errors in my testing  ( at least 99% ).  I did uncover and correct multiple contributing factors. 

    1.  Activated the internal SVS circuit to trigger reboot on loss of power.  Ideally below the datasheet min VCC for our chosen operating speed, but in order to avoid false resets and allowing for the datasheet tolerance of the SVS comparator and our limited VCC headroom I am setting the SVS nominal trip point a bit lower than ideal, but still higher than the min (2.7) voltage required for flash operations.

    2.  After activating the SVS, wait until the SVSON bit indicates that the SVS is armed before changing the DCOPLUS settings to increase the chip's MCLK.

    3.  Made sure that the FCTL register clock divider is set correctly to yield an fFTG frequency for flash operations that is within spec.

    4. At each flash write operation I am switching the PORON bit off right before manipulating the flash control bits to unlock the erase and write operations so as to minimize the chance of a reset in the middle of the write operation.  Then I switch PORON back on after the write has completed.  My goal is to keep the chip from operating at low voltage, but if the power fails at the worst possible time, I want to avoid data loss with an SVS reset *during* the erase-write operations.

    5. Make sure that the system power draw is as low as possible before initiating any flash operations so that the available capacitance will last as long as possible.
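Step 3 in the list above lends itself to a quick check on paper or on a PC. The sketch below assumes the flash timing generator must run between 257 kHz and 476 kHz and that the FCTL2 FNx field is six bits wide with a divider of FNx + 1 (so 1 to 64); both figures should be confirmed against the datasheet and user's guide for the part in use. The function name `ftg_divider` is made up for illustration.

```c
#include <stdint.h>

#define FTG_MIN_HZ  257000u  /* assumed lower limit of f_FTG             */
#define FTG_MAX_HZ  476000u  /* assumed upper limit of f_FTG             */
#define FN_DIV_MAX  64u      /* assumed 6-bit FNx field, divider FNx + 1 */

/*
 * Pick the smallest divider that brings the flash timing generator
 * clock at or below FTG_MAX_HZ, i.e. the fastest in-spec setting.
 * Returns the divider (1..64), or 0 if no divider puts the clock
 * inside the allowed window. The FNx field value is divider - 1.
 */
unsigned ftg_divider(uint32_t src_hz)
{
    uint32_t div = (src_hz + FTG_MAX_HZ - 1) / FTG_MAX_HZ;  /* ceiling */

    if (div == 0)
        div = 1;
    if (div > FN_DIV_MAX)
        return 0;
    if (src_hz < FTG_MIN_HZ * div)
        return 0;  /* divided clock would fall below the minimum */
    return (unsigned)div;
}
```

For a 7 MHz clock source this yields a divider of 15 (FNx = 14), giving about 467 kHz. A 500 kHz source has no valid setting under these assumptions: undivided it is too fast, and divided by two it is too slow.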

    In addition to flash problems, I had also experienced some data corruption on writes to external SPI-connected devices, so the changes described in steps 4 and 5 above were applied around those operations as well.

    I have not done any testing of subsets of these changes, and don't have any field results yet, but I feel pretty confident that this problem should be resolved.

    --Thurston

    PS: I have a similar device that is already using the SVS circuit's external input and a timer as a makeshift ADC.  However I believe I can still get improved reliability by alternating (with the basic timer ISR) my analog read operations and using the SVS in the internal power monitoring state.  The approach must also ensure that my flash operations only take place immediately after adequate VCC has been confirmed.   

  • Thanks for the update.
    Framing a flash write with a check for stable/sufficient supply voltage is a good idea.
    However, there are some indications that flash writes are not the only thing that can cause problems. If flash is read (even by the CPU) while power is failing, it seems that this may also cause flash corruption. Normally, the CPU fails long before the voltage is critical for the flash controller, but on some devices the CPU seems to be still operational when the flash begins to fail, and this seems to be a cause of flash corruption. The evidence is that in some of the cases, the code did not contain any flash write operations at all.
    Using the SVS (or an external one) for putting the CPU in reset on critical voltage levels seemed to fix the problem in these cases.
    There hasn't been any authoritative answer on this from TI.
