We have MSP430F1611's that are apparently erasing segments of their on board flash in the field on power up. Note this isn't a low power application, we just re-used this processor because we had it on the shelf from other projects.
The circuit board using the MSP430 is powered by a DC power supply and is usually switched off at night and then back on in the morning.
Sometimes in the field the microcontroller fails to initialize the board properly. When we examine the failed devices in the lab, we find that segments of flash have been erased. It is not always the same segment, and sometimes multiple segments are affected. Reloading the flash fixes the problem.
The firmware guys tell me they aren't re-writing the onboard flash ever and they have done a code audit to make sure it isn’t happening accidentally.
I’ve never been able to duplicate this failure in the lab and am wondering if anyone has any ideas. Meanwhile, I’m testing all of the usual suspects – Power Transients, ESD, etc.
We have thousands of these in battery powered applications and never had an issue like this so it’s something specific to the hardware/software design of this project.
Mel Costello said: We have MSP430F1611's that are apparently erasing segments of their on board flash in the field on power up. Note this isn't a low power application, we just re-used this processor because we had it on the shelf from other projects.
Is this failure on just one board, or is it deterministic, so that several boards fail at the same time?
Is this failure frequent?
Mel Costello said: Sometimes in the field the microcontroller fails to initialize the board properly. Looking at the devices in the lab segments of flash have been erased. They are not always the same segment and sometimes there are multiple segments. Reloading the flash fixes the problem.
Again, if more than one board fails, is the pattern of erased segments the same on all of them?
Mel Costello said: The firmware guys tell me they aren't re-writing the onboard flash ever and they have done a code audit to make sure it isn’t happening accidentally.
So did they check in some way that the flash registers are untouched at run time? Did they cover all software cases? (Very time consuming.)
Mel Costello said: We have thousands of these in battery powered applications and never had an issue like this so it’s something specific to the hardware/software design of this project.
Does this mean just one has failed? Or are you referring to the processor type?
Mel Costello
We have MSP430F1611's that are apparently erasing segments of their on board flash in the field on power up. Note this isn't a low power application, we just re-used this processor because we had it on the shelf from other projects.
Is this failure on just one board, or is it deterministic, so that several boards fail at the same time?
Multiple boards have failed in the field in different pieces of the same model equipment at different times over the last year
Is this failure frequent?
Yes. If the trend holds we are looking at 5-10 percent of the population failing per year.
Mel Costello
Sometimes in the field the microcontroller fails to initialize the board properly. Looking at the devices in the lab segments of flash have been erased. They are not always the same segment and sometimes there are multiple segments. Reloading the flash fixes the problem.
Again, if more than one board fails, is the pattern of erased segments the same on all of them?
Segments are different. Sometimes there are multiple segments erased, some of them are contiguous, some are not.
Mel Costello
The firmware guys tell me they aren't re-writing the onboard flash ever and they have done a code audit to make sure it isn’t happening accidentally.
So did they check in some way that the flash registers are untouched at run time? Did they cover all software cases? (Very time consuming.)
No, they just did a code crawl. I’m sure they haven’t checked all the software cases, particularly things such as the effect of different combinations of interrupts occurring at different times. All I can say at this point is that they never intentionally set the FMC to erase mode (or unintentionally, as far as they can see from a code crawl).
Mel Costello
We have thousands of these in battery powered applications and never had an issue like this so it’s something specific to the hardware/software design of this project.
Does this mean just one has failed? Or are you referring to the processor type?
We use this device in three different products. Two of them are portable battery-powered devices where we’ve never had a failure of this type (thousands in the field). The device that is failing is powered by a 5 V DC supply (regulated down to 3.3 V using a TC55RP series regulator).
I’m not saying this is a hardware issue – I’m just wondering if there is a known hardware based timing/startup/ESD issue that could cause this.
Hi Mel, thanks for covering the failure statistics. Two more details are needed:
Mel Costello said: Is this failure on just one board, or is it deterministic, so that several boards fail at the same time?
Multiple boards have failed in the field in different pieces of the same model equipment at different times over the last year
Are these failures from different customers or from the same plant? Are they tied to a particular season or evenly scattered over time?
(If they are from the same plant, is there a high-power RF field there, or some other particular condition?)
Mel Costello said: Is this failure frequent?
Yes. If the trend holds we are looking at 5-10 percent of the population failing per year.
From this I infer that the failure has been showing up for a long time, with sporadic failures over more than a year. Have you ever experienced one of these failures in the lab?
Is the processor protected or open to JTAG programming? That is, are you detecting the flash failure through the bootloader or by reading over JTAG?
Could you post a picture of the device and some idea of how it works?
Hi Mel, thanks for covering the failure statistics. Two more details are needed:
Mel Costello
Is this failure on just one board, or is it deterministic, so that several boards fail at the same time?
Multiple boards have failed in the field in different pieces of the same model equipment at different times over the last year
Are these failures from different customers or from the same plant? Are they tied to a particular season or evenly scattered over time?
From the same plant but multiple customers, at all times of the year, including from subtropical/tropical regions such as Florida and the Philippines, where you would not expect static to be a problem.
(If they are from the same plant, is there a high-power RF field there, or some other particular condition?)
The only RF would be from their 802.11 wireless. They have soldering irons and other typical electronic assembly equipment, but that is about it.
Mel Costello
Is this failure frequent?
Yes. If the trend holds we are looking at 5-10 percent of the population failing per year.
From this I infer that the failure has been showing up for a long time, with sporadic failures over more than a year. Have you ever experienced one of these failures in the lab?
Yes, we’ve had failures for about a year, and no, we’ve never been able to create a failure in the lab, even using circuit cards that previously failed in the field and were re-programmed. If we could duplicate it, we could instrument the software and hardware and narrow down the failure area.
Is the processor protected or open to JTAG programming? That is, are you detecting the flash failure through the bootloader or by reading over JTAG?
If you mean not accessible to users, then yes, it is protected from users. The card is in a metal box along with an open-frame switching power supply, some external connectors (all of which are buffered), and an optically isolated digital switch that is used to turn a 2 A, 60 Hz motor on and off. I mention that only because there will be some low-level EM fields associated with the power supply inverter and the motor AC. I have checked for DC transients on the 5 V supply during startup and see nothing. I’ve also done the same on the 3.3 V rail.
We are detecting the flash failure by reading from JTAG.
Could you post a picture of the device and some idea of how it works?
I will have to post tomorrow, as I will be working on another project this afternoon.
Mel Costello said: Yes we’ve had failures for about a year and no we’ve never been able to create a failure in the lab, including using circuit cards that previously failed in the field and were re-programmed.
This sounds strange. Did any of these reprogrammed boards fail again after being sent back?
Mel Costello said: Detecting flash failure by reading from JTAG
OK, so the processor is open to reading, writing and modifying flash. Noise on the JTAG lines could conceivably start this trouble, although that seems really strange. If the device were protected, only the whole flash could be erased, not sector by sector, unless one knows the vector area password.
Can sabotage of the firmware by a malicious user be excluded?
Mel Costello said: The firmware guys tell me they aren't re-writing the onboard flash ever and they have done a code audit to make sure it isn’t happening accidentally.
Auditing code for things that could or could not happen accidentally is a daunting task!
Instead of, or in addition to, a code audit, I would search for the words 0x0128, 0x012A, or 0x012C in the image of the entire undamaged flash memory. If any are found, I would disassemble the nearby words to see whether FCTL1, FCTL2, or FCTL3 could have been accessed intentionally or accidentally.
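The search suggested above is easy to script on a PC. Here is a minimal sketch, assuming you already have the flash contents as a raw binary dump; the function name, the demo bytes, and the 0x4000 base address are illustrative assumptions, not something taken from the firmware in question. It only looks at word-aligned, little-endian 16-bit words, which is how the MSP430 stores instruction words and absolute addresses.

```python
import struct

# Flash controller register addresses named in the post above.
FLASH_REGS = {0x0128: "FCTL1", 0x012A: "FCTL2", 0x012C: "FCTL3"}


def find_flash_reg_refs(image: bytes, base_addr: int = 0):
    """Return (address, register name) for every aligned 16-bit word
    in the image that equals one of the flash register addresses."""
    hits = []
    # Step through the image two bytes at a time (word-aligned scan).
    for offset in range(0, len(image) - 1, 2):
        (word,) = struct.unpack_from("<H", image, offset)  # little-endian
        if word in FLASH_REGS:
            hits.append((base_addr + offset, FLASH_REGS[word]))
    return hits


if __name__ == "__main__":
    # Hypothetical 8-byte demo image; a real run would read the dump file.
    # base_addr=0x4000 is an assumed load address for the dump.
    demo = bytes([0xB2, 0x40, 0x02, 0xA5, 0x28, 0x01, 0x30, 0x41])
    for addr, name in find_flash_reg_refs(demo, base_addr=0x4000):
        print(f"0x{addr:04X}: {name}")
```

Each hit is only a candidate to disassemble by hand, since the same bit pattern can also appear in data tables. An empty result is not proof either, because a register can be reached through a pointer computed at run time.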
I'm not sure if it is happening on Power Up or Down. We are only using an RC circuit on the reset and no voltage supervisor. The processor is running at 7MHz.
I suspect you are correct in that it is happening on Power Down. I should be able to test this scenario on the bench with a programmable supply which I will do next week.
Thank you for the information
Good luck. I've tried reproducing the failure on the bench all week with no success. I've tried running with the supply voltage just above where the processor cuts off (and below the recommended operating voltage for 7 MHz), as well as modulating the supply to cross the critical threshold. I can get the processor to hang but not to corrupt the flash.
From reading other threads, though, this seems to be a common problem with the MSP430. Since our application uses an AC source and derives the 3.3 V from 5 V on board, I plan to use an external voltage supervisor tied to the 5 V rail that will put the processor in reset whenever the 5 V drops below 4.5 V or so (as per Jens's suggestion). That way we can be sure we never get near a critical voltage on the 3.3 V rail.
I haven't been able to duplicate the Flash corruption but I have been able to verify erratic operation of the CPU when VCC drops under 2.5 volts on Power Down. Given the info on this and other forums, as well as the behavior I've been able to verify on the board, I'm convinced that the problem is with VCC on power down.
Now the question is how to fix it. The consensus on the forums is that the on-chip SVS can't be trusted and that to fix this completely, you should use an external voltage supervisor. Given that we have a significant population of this board in the field, I'm being asked (and I think it's a valid question): "How much of an improvement would we get with the current boards by simply changing the code to use the on-chip SVS to ensure the VCC rail is as close to 3.3 V as possible?"
This of course is because we can reprogram the devices in the field which is easier (and less costly) than replacing the board with a new one with an external voltage supervisor. We wouldn't want to do this if we see little or no improvement but if it reduces in-field failures on the current boards by 70-90 percent, it would be worth doing.
Just a follow-up on my unintended flash erasure/corruption problem with the 430F415. Thanks to everyone in the thread for your comments.
I was finally able to replicate the problems we were seeing by switching AC power on and off (with relays) to the supplies used with our units. I still don't understand why I can't replicate the problem as easily by switching the DC power, but it is a good example of how important it is to test systems as they will be deployed.
Although a hardware-based approach such as an external SVS would clearly be a robust solution, I wanted to do as much as possible within the constraints of our current PCB with no changes.
The following improvements to the firmware have produced a dramatic decrease in the rate of these errors in my testing (at least 99%). I did uncover and correct multiple contributing factors.
1. Activated the internal SVS circuit to trigger a reset on loss of power. Ideally the reset would trigger as soon as VCC falls below the datasheet minimum for our chosen operating speed, but to avoid false resets, and allowing for the datasheet tolerance of the SVS comparator and our limited VCC headroom, I am setting the nominal SVS trip point a bit lower than ideal, though still higher than the minimum voltage (2.7 V) required for flash operations.
2. After activating the SVS, wait until the SVSON bit indicates that the SVS is armed before changing the DCOPLUS settings to increase the chip's MCLK.
3. Made sure that the FCTL register clock divider is set correctly to yield a flash timing generator frequency (f_FTG) that is within spec.
4. At each flash write operation, I switch the PORON bit off right before manipulating the flash control bits that unlock the erase and write operations, so as to minimize the chance of a reset in the middle of the write, and switch PORON back on after the write has completed. My goal is to keep the chip from operating at low voltage, but if the power fails at the worst possible time, I want to avoid data loss from an SVS reset *during* the erase-write operations.
5. Made sure that the system power draw is as low as possible before initiating any flash operations, so that the available capacitance will last as long as possible.
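For step 3, the divider check is simple arithmetic that can be done off-target. The sketch below assumes the 257 kHz to 476 kHz flash timing generator window and a 1 to 64 divider range, which is what I recall from MSP430 flash documentation; confirm both against the datasheet and user's guide for your exact part. The 7 MHz source clock is only an example value borrowed from earlier in this thread, and the function name is made up for illustration.

```python
# Assumed flash timing generator limits; verify against your datasheet.
F_FTG_MIN_HZ = 257_000
F_FTG_MAX_HZ = 476_000


def valid_flash_dividers(source_hz, f_min=F_FTG_MIN_HZ, f_max=F_FTG_MAX_HZ,
                         max_div=64):
    """Return every integer divider that puts source_hz / divider
    inside the allowed flash timing generator window."""
    return [d for d in range(1, max_div + 1)
            if f_min <= source_hz / d <= f_max]


if __name__ == "__main__":
    source_hz = 7_000_000  # example flash clock source, not a measured value
    dividers = valid_flash_dividers(source_hz)
    print("usable dividers:", dividers)
    for d in (dividers[0], dividers[-1]):
        print(f"divide by {d}: f_FTG = {source_hz / d / 1000:.1f} kHz")
```

With a 7 MHz source this leaves dividers 15 through 27. If the clock source is switched or sped up after the divider is programmed (for example when DCOPLUS is changed in step 2), the check needs to be repeated for the new frequency.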
In addition to flash problems, I had also experienced some data corruption on writes to external SPI-connected devices, so the changes described in steps 4 and 5 above were applied around those operations as well.
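To put rough numbers on why step 5 helps, the hold-up time of the supply capacitance can be estimated with the constant-current discharge formula t = C * dV / I. This is a minimal sketch only: the 100 uF capacitance and the load currents are hypothetical placeholders, not measurements from my board, and a real load does not draw constant current as the rail sags.

```python
def hold_up_time_s(capacitance_f, v_start, v_min, load_current_a):
    """Seconds for a capacitor to sag from v_start to v_min
    while feeding a constant load current."""
    if load_current_a <= 0:
        raise ValueError("load current must be positive")
    # Constant-current discharge: dV = I * dt / C, so dt = C * dV / I.
    return capacitance_f * max(v_start - v_min, 0.0) / load_current_a


if __name__ == "__main__":
    # Hypothetical values: 100 uF of bulk capacitance on a 3.3 V rail,
    # discharging to the 2.7 V flash minimum quoted in step 1.
    for current_ma in (20, 5):
        t = hold_up_time_s(100e-6, 3.3, 2.7, current_ma / 1000)
        print(f"{current_ma} mA load: about {t * 1000:.1f} ms of hold-up")
```

With these placeholder numbers, dropping the load from 20 mA to 5 mA stretches the hold-up from about 3 ms to about 12 ms. The useful comparison is between that figure and the worst-case duration of the erase or write you are about to start.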
I have not done any testing of subsets of these changes, and don't have any field results yet, but I feel pretty confident that this problem should be resolved.
--Thurston
PS: I have a similar device that is already using the SVS circuit's external input and a timer as a makeshift ADC. However I believe I can still get improved reliability by alternating (with the basic timer ISR) my analog read operations and using the SVS in the internal power monitoring state. The approach must also ensure that my flash operations only take place immediately after adequate VCC has been confirmed.