Main memory corruption after power cycle

ArmenG

Other Parts Discussed in Thread: MSP430F2252, MSP430F2254, MSP430F235, MSP430F2370, MSP430F155, MSP430F2272, MSP430F149

Hello,

As I was testing for some issues with saving and retaining data in info memory, I stumbled across a more serious memory problem. The test consists of shutting down the power to the board (with MSP430F2252) for 5-10 sec, then restoring it for another 20-30 sec, doing so repeatedly, until a failure is detected. Occasionally, data stored while powered-up is lost during power recycle. (See "Data loss in info memory after power cycle" thread.) On very rare instances, the unit's behavior changes or it may completely "die". On such units that still have the JTAG fuse intact, I was able to read the contents of the program memory and compare to what's been saved there, originally. Specifically, I examined five such units, and here's what I found:

In one unit, the last segment of flash memory was erased. Thus the interrupt vector was gone, so the unit just "sat" on idle forever, at reset.

Another unit lost a segment (0xD600 - 0xD7FF) in the middle of the code, which normally extends up to 0xDFFF.

In three other boards, I found random bits cleared in a few random words (up to four words in each unit).

In normal operation the code never erases or writes to main memory. Only info segments A through D are used.

Has anyone seen such problems? What could be causing corruption of main memory during a power up or down? Since this happens a lot less frequently than my original problem, I have doubts that they are related.

over 15 years ago

0 Jens-Michael Gross over 15 years ago

Guru 227245 points

If an interrupt occurs during a flash write or flash erase operation, the vector table is not available. So the CPU will fetch 0x3fff ("JMP PC" opcode, which normally lets the CPU "hop in place" until flash is available again) as the ISR address. So once the write operation is finished, the CPU will execute the code at 0x3fff, which can be anything, including data that is interpreted as code. Since it is not an ISR, this code can do anything, including memory writes indexed by a register whose content is completely different from what it should be. And this happens before you have locked the flash controller again, leading to continued write or erase operations as well as other erratic behaviour.

This would explain the erased sector (a write happened during a flash erase) as well as the cleared bits (it happened during a data write, so the erroneous writes are executed instead of triggering a segment erase).

It is possible that the code eventually hits a RETI instruction and returns to the flash function (but likely with altered register values) or (more likely) hits a RET instruction, returning to what it thinks is the return address but is actually the saved status register on the stack. This then causes a 'return' to an even more wrong place. Eventually, the MSP is reset or the power backup depletes and you'll never know what happened.

0 ArmenG over 15 years ago in reply to Jens-Michael Gross

Prodigy 35 points

Jens-Michael, thanks for your reply. You gave me something to think about and go back and verify. The shut-down signal triggers an interrupt, which halts all operations, disables all interrupts (including NMI), reconfigures Timer A for about 100ms, and goes to LPM3 mode. The 100ms is necessary to give the system a chance to stabilize, in case something was happening physically, and for debouncing any signals due to shut-down. By this time, we still have at least 500 ms worth of stable 3.3V. The TimerA ISR then calls the Save function, which disables GIE, then proceeds with the write operation. Moreover, when EEI=EEIEX=0 (they are, in my project), interrupts are automatically disabled during any flash operation, according to the data sheet. So, interrupts should not occur.

However, let's assume that an interrupt occurs. If the CPU fetches 0x3FFF as the ISR address (or 0x3FFE as an appropriate instruction address), then, at least in the MSP430F2x family, that would point to non-existent memory. What happens if PC is loaded with such an address? I would imagine the next instruction fetch would return either 0xFFFF (or 0xFFFE), right? But that's where the Reset vector is, so my code will go through a normal reset. But that's exactly what my code does immediately after the write operation anyway. And after reset, I have a minimum of one second delay, to allow for voltages and oscillators to stabilize before any operation takes place, or to allow any remaining energy to be drained out of the capacitors.

In any case, I don't think interrupts are being triggered, nor would they explain random erase or write operations. But an interrupted write operation could explain my other problem :).

I'm now trying to check whether any Access Violations, Oscillator Faults, or Resets are occurring. Could any of these cause the random erase or writes? I would think these would only corrupt the segment being written to or erased, not anything else, right? Anyway, my problem is that the write operation is part of the shut-down process, and I only have a few hundred milliseconds to capture anything. Any ideas how I can detect those?

0 Jens-Michael Gross over 15 years ago in reply to ArmenG

Guru 227245 points

ArmenG said:
that would point to non-existent memory. What happens if PC is loaded with such an address?

In the 5x family, this would raise a vacant memory exception, leading to either a PUC (if the address is in unused I/O space) or an NMI (if outside I/O space).
If there is no interrupt, the MCU would try to execute this instruction. And no, it won't jump to the address 0xFFFE points to. It will execute the instruction whose binary value is 0xFFFF. Which is actually an AND instruction:

and.b @r15+, -1(r15)

It will read the byte value at the memory address pointed to by R15, AND it with the byte at (R15-1), and write the result back to (R15-1). Then it adds 1 to R15 and increases the program counter by 4, which repeats the process until the program counter finally reaches something with a different content. Maybe the start of the main flash? If you're lucky, your reset code is put there, or the flash is empty (returning 0xFFFF too).
Depending on the value of R15, the writeback can happen to any possible address, including the main flash area, initiating random segment erases or byte writes. And since it isn't executing from main flash, the CPU will continue with the next instruction unstopped, possibly causing a flash access violation, but only after the initial write started and the damage has been done.
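For anyone who wants to check the decoding of 0xFFFF themselves, here is a minimal host-side sketch in C. It splits a word into the fields of the MSP430 double-operand instruction format and works out the instruction length. The type and function names (`msp430_fmt1`, `decode_fmt1`, `fmt1_length_bytes`) are made up for this illustration and are not part of any TI tool; the sketch only applies to double-operand opcodes (0x4 to 0xF).

```c
#include <stdint.h>

/* Fields of an MSP430 double-operand (format I) instruction word. */
typedef struct {
    unsigned opcode;   /* bits 15..12, 0xF is AND                    */
    unsigned src_reg;  /* bits 11..8                                 */
    unsigned ad;       /* bit 7, destination mode (1 = indexed)      */
    unsigned byte_op;  /* bit 6, 1 = byte operation (.b)             */
    unsigned as;       /* bits 5..4, source mode (3 = @Rn+)          */
    unsigned dst_reg;  /* bits 3..0                                  */
} msp430_fmt1;

/* Hypothetical helper: split a 16-bit word into format I fields. */
msp430_fmt1 decode_fmt1(uint16_t word)
{
    msp430_fmt1 f;
    f.opcode  = (word >> 12) & 0xFu;
    f.src_reg = (word >> 8) & 0xFu;
    f.ad      = (word >> 7) & 0x1u;
    f.byte_op = (word >> 6) & 0x1u;
    f.as      = (word >> 4) & 0x3u;
    f.dst_reg = word & 0xFu;
    return f;
}

/* Instruction length in bytes, including extension words. */
unsigned fmt1_length_bytes(msp430_fmt1 f)
{
    unsigned words = 1;
    if (f.as == 1 && f.src_reg != 3) words++;  /* indexed/symbolic/absolute source */
    if (f.as == 3 && f.src_reg == 0) words++;  /* immediate source (@PC+)          */
    if (f.ad == 1) words++;                    /* indexed/symbolic/absolute dest   */
    return words * 2;
}
```

Feeding it 0xFFFF gives opcode 0xF (AND), source R15 in @Rn+ mode, byte operation, indexed destination on R15 and a length of 4 bytes, which matches the `and.b @r15+, -1(r15)` reading above (the -1 is simply the next word, again 0xFFFF, taken as the index).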

ArmenG said:
In any case, I don't think interrupts are being triggered, nor would they explain random erase or write operations.

They would, perfectly.

ArmenG said:
I'm now trying to check whether any Access Violations, Oscillator Faults, or Resets are occurring. Could any of these cause the random erase or writes?

Even a reset with the flash controller active may cause erratic behaviour.

ArmenG said:
when EEI=EEIEX=0 (they are, in my project), interrupts are automatically disabled during any flash operation

No, that's a misinterpretation of the bit description. If EEIEX is enabled, any occurring interrupt will cause an emergency exit and abort the current flash write operation (leaving the written flash address in an undefined state). If it is clear, the controller will ignore any interrupt and leave it to the programmer to deal with the havoc this may cause.
EEI, however, has only effect during an erase and will freeze the erase cycle during interrupt service.
My guess is that it will stop erasing when the interrupt triggers (and GIE is changed to 0) and resumes as soon as GIE is switched back to 1 when reading the original status register content back from stack. But this will only help during segment erase.

Both bits only have a meaning if interrupts are enabled. But they don't define whether interrupts may happen or not. The word 'enabled/disabled' means 'support for interrupts enabled or disabled' and not 'interrupts enabled/disabled'. And neither of them will help if it is an NMI.

I wish I had the EEI bit on the 54xx series. It would allow continuing timer and other interrupt actions during the eternity of up to 32 ms segment erase. (during data write, this 'dead' time is much shorter)

ArmenG said:
I'm now trying to check whether any Access Violations, Oscillator Faults, or Resets are occurring.

Most of the possible causes for a PUC will leave a trace in form of a status bit that isn't reset during a PUC while it is during a POR. You can identify those bits in the register description by looking for brackets around the default value. Those values are only set on power-on but not during (software caused) PUC. One example is the KEYV bit in FCTL3.
On the 54xx series, all these are grouped by a global reset vector register where you can read the cause of the last (and even last several) resets. But the older families require you to check all module registers.
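As a rough illustration of checking those status bits on a 2xx device, here is a small C sketch that turns the raw values of IFG1 and FCTL3 into a set of cause flags. The bit masks are written out by hand as I recall them from the 2xx headers (WDTIFG, OFIFG, PORIFG, RSTIFG in IFG1; KEYV and ACCVIFG in FCTL3), so please verify them against msp430.h for your device. The `CAUSE_*` names and `reset_causes()` are invented for this example.

```c
#include <stdint.h>

/* Bit masks as recalled from the F2xx headers; verify before use. */
#define WDTIFG_BIT   0x01u    /* IFG1: watchdog timeout or WDT key violation */
#define OFIFG_BIT    0x02u    /* IFG1: oscillator fault                      */
#define PORIFG_BIT   0x04u    /* IFG1: power-on reset                        */
#define RSTIFG_BIT   0x08u    /* IFG1: reset from the RST/NMI pin            */
#define KEYV_BIT     0x0002u  /* FCTL3: flash key violation                  */
#define ACCVIFG_BIT  0x0004u  /* FCTL3: flash access violation               */

/* Hypothetical cause flags for logging. */
enum {
    CAUSE_WATCHDOG   = 1u << 0,
    CAUSE_OSC_FAULT  = 1u << 1,
    CAUSE_POWER_ON   = 1u << 2,
    CAUSE_RST_PIN    = 1u << 3,
    CAUSE_FLASH_KEY  = 1u << 4,
    CAUSE_FLASH_ACCV = 1u << 5
};

/* Translate raw register snapshots into cause flags. */
unsigned reset_causes(uint8_t ifg1, uint16_t fctl3)
{
    unsigned causes = 0;
    if (ifg1 & WDTIFG_BIT)   causes |= CAUSE_WATCHDOG;
    if (ifg1 & OFIFG_BIT)    causes |= CAUSE_OSC_FAULT;
    if (ifg1 & PORIFG_BIT)   causes |= CAUSE_POWER_ON;
    if (ifg1 & RSTIFG_BIT)   causes |= CAUSE_RST_PIN;
    if (fctl3 & KEYV_BIT)    causes |= CAUSE_FLASH_KEY;
    if (fctl3 & ACCVIFG_BIT) causes |= CAUSE_FLASH_ACCV;
    return causes;
}
```

On the target one would presumably call `reset_causes(IFG1, FCTL3)` as the very first thing in main(), keep the result somewhere that can be read out later (a spare port pin pattern, for instance), and then clear the flags so that the next reset starts from a clean state.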

0 Tony Weir over 15 years ago in reply to Jens-Michael Gross

Prodigy 40 points

I'm having exactly the same problem as ArmenG : corruption of main (program) Flash, although I can't say for certain it's happening at powerup.

The thread (above) describing interrupts going off into the weeds makes no sense in our case, and I'm not convinced it makes sense in ArmenG's either. Although we do have a routine that writes to configuration Flash, it's used only once or twice during the lifetime of the product, and corruption has never been noted to occur during this operation. Write addresses are hard-coded constants, and the function that encapsulates the writes will not run unless called with a magic number. Once running, our system keeps running; so I can only say for certain that it's happening either at power-down or power-up.

This problem is weird enough for me to consider dropping the MSP430 for our products. I've used many Flash micros over the years and never come across this kind of issue. TI, I think I and ArmenG need some guru feedback here.

Tony

0 Daniel Auer over 15 years ago in reply to Tony Weir

Prodigy 70 points

Hi Tony, Hi ArmenG

I'm having exactly the same problem as you both have described. I'm using an MSP430F2254 (Revision G4).

My errors: corrupt reset vector (either 0xFFFF or 0x0000), corrupt flash sectors (a whole sector is suddenly erased = 0xFFFF), or several single bits have changed state to zero.

I don't know the root cause of this behaviour. But I found out that the MSP430 SW starts up again several times while my system is powering down. My suspicion is the power-down behaviour, i.e. how the voltage goes down (steadily / continuously or not). I have inserted a start-up delay (0.9sec) at the very beginning of main(), just after stopping the watchdog. With this delay the SW starts only once during power-down (not several times), and therefore the error probability is reduced dramatically, but sadly it's not zero.

I'm now trying to get more information from a TI expert.

Have you both used an external reset controller? At the moment I believe (or rather, I hope) this could solve my problem.

Regards

Daniel

0 Tony Weir over 15 years ago in reply to Daniel Auer

Prodigy 40 points

Hi Daniel

Wow ... another one! I also have a delay at the top (~0.5sec) because we do a clock-speed change immediately and it can take that long for the xtal oscillator to stabilise. I have also noticed my CPU starting up again when it ought to be switching off - I thought that was a bug in my software and ignored it, since our application isn't affected by such behaviour.

I agree with your hunch: it's happening at power-down, and a reset controller would probably help. The depressing thing is that the MSP430 is supposed to incorporate a proper reset/brownout circuit. I wonder if it has inadequate hysteresis - being a very low power MCU, the MSP430 will draw tiny currents from whatever capacitors are hanging off its voltage regulator, so VCC will fall very very slowly in most implementations.

As I recall, the SAR variants of the MSP430 have the ability to monitor their own supply voltage, so I'm going to attempt to use this feature to put the micro to sleep when the supply is falling ... I'll post again if I get any result.

However: my opinion is that it simply should not be this easy to cause Flash corruption - even if we're all doing something horribly wrong (which I'm fairly certain we're not). Come on, TI, we need a fast fix for this: personally, I'm assuming worst-case (bug in the silicon) and I'm porting my code to a different MCU before we get too much product shipped. I don't have the time or the inclination to fiddle with obscure "features".

regards

Tony

0 Jens-Michael Gross over 15 years ago in reply to Tony Weir

Guru 227245 points

One problem with a slowly falling supply voltage is that the voltage on electrolytic capacitors can rise again if the load falls off.
In case of the MSP, it is possible that when the voltage falls below brownout and the MSP switches off, the load almost completely vanishes (no more current through still-open port pins or active hardware modules). This may cause the voltage to rise again above the threshold, and the MSP starts over (with modules disabled and ports off). When the ports are reactivated, the voltage drops again, and so on. This can lead to several power cycles, especially if you use a larger electrolytic capacitor on VCC.

Try a test: charge an electrolytic capacitor, then short it for a second, then measure its voltage with a multimeter. You'll be surprised.

Also, signals coming in to the MSP's port pins from outside when it is already powered down may flow through the clamp diodes into VCC, reviving the processor once more.

It's not the fault of the brownout circuit or its hysteresis. It's the surrounding circuitry which can trigger such behaviour.

It's likely that you do not properly lock the flash controller after doing a config write or such. Then there is a chance that there will be erratic write operations to the flash memory which in turn will trigger an (incomplete because of power fail) erase or write cycle.

In any case, flash corruption never occurred to me, neither in development nor production stage for a long time now. (I had several with PICs and overvoltage VCC)

Personally, on several projects, each with many different versions of the hardware and gazillions of firmware test cycles, I cannot remember a single flash failure except for a verified bug in my flash write function.

0 Andreas Dannenberg over 15 years ago in reply to Jens-Michael Gross

TI__Guru 70192 points

Jens-Michael Gross said:
Personally, on several projects, each with many different versions of the hardware and gazillions of firmware test cycles, I cannot remember a single flash failure except for a verified bug in my flash write function

This reflects our experience here at the factory as well. Generally speaking, from a customer support perspective, I can say that probably the two most common reasons for Flash corruption issues are:

Undervoltage code execution causing program execution corruption. Basically an MCLK vs. VCC violation occurring at _any_ time during code execution or power-up or power-down ramp. To prevent this one can use either an external or the built-in SVS (when available), or at a minimum set the DCO frequency as low as possible for the given application.
An improper setup of the Flash timing generator.

I would also recommend reviewing the "MSP430 Flash Best Practices" list under: http://processors.wiki.ti.com/index.php/MSP430_Flash_Best_Practices and double-check the application and see if some of the best practices can be implemented.
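To make the flash timing generator point more concrete, here is a small C sketch that picks a divider for a given clock source. It assumes the flash timing generator must run between roughly 257 kHz and 476 kHz and that the divider can be 1 to 64 (FNx bits plus one), which is how I recall the 2xx family user's guide; please check the limits for your device. The function name `ftg_divider()` and the 366 kHz mid-range target are choices made for this example only.

```c
/* Assumed limits for the flash timing generator, in Hz (check the
   family user's guide for the device in use). */
#define FTG_MIN_HZ     257000UL
#define FTG_MAX_HZ     476000UL
#define FTG_TARGET_HZ  366000UL   /* aim for the middle of the window */

/* Return the divider (1..64) that brings src_hz closest to the target
   while staying inside the window, or 0 if no divider fits.
   The FNx field of FCTL2 would then be (divider - 1). */
unsigned ftg_divider(unsigned long src_hz)
{
    unsigned best = 0;
    unsigned long best_err = ~0UL;
    unsigned div;

    for (div = 1; div <= 64; div++) {
        unsigned long f = src_hz / div;
        unsigned long err;

        if (f < FTG_MIN_HZ || f > FTG_MAX_HZ)
            continue;                       /* outside the allowed window */
        err = (f > FTG_TARGET_HZ) ? f - FTG_TARGET_HZ : FTG_TARGET_HZ - f;
        if (err < best_err) {
            best_err = err;
            best = div;
        }
    }
    return best;
}
```

For a 1 MHz source this yields a divider of 3 (about 333 kHz), and for 16 MHz a divider of 44 (about 364 kHz). Aiming for the middle of the window rather than the edge leaves some margin for DCO tolerance, which matters if the clock is still settling when the write happens.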

Regards,
Andreas

0 Jens-Michael Gross over 15 years ago in reply to Andreas Dannenberg

Guru 227245 points

Andreas Dannenberg said:
Undervoltage code execution causing program execution corruption

That might be a common cause. I guess the recommended setup for the reset line is a bit too tight. In my projects I use 100nF (well, that's the cheapest one anyway) instead of the recommended 10 nF. This causes the device to be held in reset state a bit longer, giving VCC time to rise. On the devices without a BOR (e.g. 1232) this was necessary anyway. On the 1611, I implemented a function that waits for a given voltage (using the SVS) before proceeding with the clock settings.
One drawback of using the SVS is that the SVS registers are not reset if the SVS causes a reset: the device simply enters reset state and stays there. That is inconvenient, since during the voltage check I can output some LED signals to identify the device state, while during an (SVS-caused) permanent reset it just stays dark and I don't know why. So either I have no idea why the device isn't working, or I can't use the SVS to trigger a reset when I get a power fail (which leaves the device in an unsafe state when the power drops and the core is clocked too fast).

Since after a reset the normal startup code is executed and takes care of voltage and clock, an SVS reset could just restart the device as if it had just been powered up. The software has to cope with this situation anyway. There is no need to hold it in continuous reset until the voltage rises again, since everything is in startup condition anyway.

Andreas Dannenberg said:
An improper setup of the Flash timing generator.

Not only that. Improper usage of the flash controller in general might be a problem. Even if the flash writing function is working properly, a stack overflow or stack corruption during code execution may lead to uncontrolled execution of the flash write function, with random parameters.

Also, if one forgets to lock the flash after the write, any rampaging code may write randomly to flash, or erase it. The controller will then carry out these unintentional writes in hardware, because the flash is open to writes and all is valid (from the flash controller's view). If the door is left unlocked and unguarded, anyone can get in, not only the intended visitors.
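A minimal sketch of what a guarded byte write could look like, combining the points from this thread: refuse to run without a magic value, disable maskable interrupts around the write, and always put the LOCK bit back. The register stand-ins in the `#else` branch only exist so the sketch compiles on a PC; their values mirror the 2xx header as I recall it. `FLASH_WRITE_MAGIC`, `flash_write_byte()` and the `irq_*` macros are hypothetical names, and it is assumed that the flash timing generator (FCTL2) has already been set up and the watchdog stopped.

```c
#include <stdint.h>

#ifdef __MSP430__
#include <msp430.h>
#define irq_state()     (__get_SR_register() & GIE)
#define irq_disable()   __disable_interrupt()
#define irq_restore(s)  __bis_SR_register(s)
#else
/* Host stand-ins so the sketch compiles off-target (assumed values). */
#define FWKEY 0xA500u
#define WRT   0x0040u
#define LOCK  0x0010u
static uint16_t FCTL1 = 0, FCTL3 = LOCK;
#define irq_state()     0u
#define irq_disable()   ((void)0)
#define irq_restore(s)  ((void)(s))
#endif

#define FLASH_WRITE_MAGIC 0x5AA5u   /* hypothetical guard value */

/* Write one byte; returns 0 on success, -1 if the guard value is wrong. */
int flash_write_byte(uint16_t magic, volatile uint8_t *dst, uint8_t value)
{
    uint16_t gie;

    if (magic != FLASH_WRITE_MAGIC)
        return -1;              /* stray call, e.g. after stack corruption */

    gie = irq_state();          /* remember whether GIE was set            */
    irq_disable();              /* no maskable interrupts during the write */

    FCTL3 = FWKEY;              /* clear LOCK                              */
    FCTL1 = FWKEY | WRT;        /* enable byte/word write                  */
    *dst = value;               /* the actual flash write                  */
    FCTL1 = FWKEY;              /* clear WRT                               */
    FCTL3 = FWKEY | LOCK;       /* lock the controller again               */

    irq_restore(gie);           /* re-enable interrupts only if they were on */
    return 0;
}
```

Note that clearing GIE does nothing against an NMI, so the NMI sources would have to be dealt with separately if they are enabled. The guard value also only helps against calls that jump to the start of the function; code that lands in the middle of it bypasses the check, which is one more reason to keep the unlocked window as short as possible.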

0 Ran Cohen62182 over 15 years ago in reply to Jens-Michael Gross

Prodigy 20 points

I am also seeing the same problems.

When I asked TI for solution they told me I'm the only one having those problems! This is of course NOT true.

My only conclusion is not to use MSP430. They are too sensitive to noise/voltage change.

I don't see any reason whatsoever that only one bit would be changed in the flash.

It could occur only due to a MAJOR problem in the chip design.

I'm working with the MSP430F23xx, MSP430F1xx and MSP430F55xx on the same board.

The 55xx family was never damaged, only the 23xx and the 1xx. If it were a power supply problem I would see it on the 55xx too.

I think it's a major bug that TI refuses to acknowledge.

I am checking the option of starting a new layout with other components.

0 Tony Weir over 15 years ago in reply to Ran Cohen62182

Prodigy 40 points

Ran, that's very interesting ... do you actually have any code inside your application that erases or writes the flash? If not, the only possible cause is that the boot ROM is being (erroneously) called ... which according to TI is virtually unheard of. Also, are there any obvious differences between the three ICs on the board? For example, is the 55xx a processing engine while the others are connected to things in the "real world"?

The only thing I got back from TI was an app note describing some "best practices" for protecting Flash. It is, in fact, all good advice - but as you note, it seems to be way too easy to cause Flash corruption. It shouldn't be necessary to walk on eggshells. We're stuck with the MSP2132 for now, but I'm also looking at other possibilities. I've worked with a dozen different Flash MCU architectures over the years and none of them have given me this much trouble.

My personal theory (for what it's worth) is that the device sensitivity is a natural outcome of the low-power optimisation. Also, the shared RST line was a big mistake ... RST is the most important pin on the IC besides power and it is not to be messed with. The design engineer should be able to hang whatever external circuit he needs onto it without worrying about how it's going to affect the debug port or NMI behaviour.

0 Jens-Michael Gross over 15 years ago in reply to Tony Weir

Guru 227245 points

Tony Weir said:
My personal theory (for what it's worth) is that the device sensitivity is a natural outcome of the low-power optimisation.

Indeed. If you can waste any amount of power to keep things above threshold, it's easy to fight PCB designer errors. Put a step-up converter in to keep the voltage high etc. The sub-µA design of the MSP forbids this. So the PCB designer and the firmware programmer need to take care of keeping the device inside operating specs.
It's not that the MSPs are violating their specs or that it is impossible to keep them in a safe area. It's just not easy plug&play.

The 5x devices as well as few of the 1x (e.g. the 16xx) devices have a brownout detection and a supply voltage supervisor. Which is usually neither programmed nor used by most MSP projects. No wonder if things go wrong.
In our own projects, we used an external supply voltage supervisor for the 1232 processor but stopped using it once we switched to the 1611. The 5x series, however, internally works with much lower voltages than VCC, so things are easier again.

Anyway, while the MSPs may work with low voltages, there is a minimum voltage required for flashing. Below it the flash won't be programmed properly. It may or may not work. It may lose content after some days or keep it for years. The 5x flash controller has a boundary scan feature to check for safe programming.

It's the job of the board designer to keep the VCC high enough during programming process. This includes enough buffering for the suddenly increasing supply current during flash programming.

I cannot confirm that TI is denying problems and bugs in their MSPs. There are large errata sheets and sometimes devices are even pulled from the market (like the 54xxA and 55x devices with the nasty flash execution bug). But when I see how often people didn't look into the user's guide, how many more won't look at the device datasheets or the design guides? No wonder things go wrong if devices are treated as 'ideal' parts rather than 'real' ones.

Tony Weir said:
it seems to be way too easy to cause Flash corruption

Only if the device specs are violated. These are, I must admit, tighter than on other devices. As you already guessed, that is a result of the extremely-low-power design.

Tony Weir said:
RST is the most important pin on the IC besides power and it is not to be messed with. The design engineer should be able to hang whatever external circuit he needs onto it without worrying about how it's going to affect the debug port or NMI behaviour.

Here we totally agree. It's a shame that NMI is shared with RST. It makes NMI practically unusable.
OTOH, the interrupt handling of the MSP is quite good and has been further improved on the latest devices. So an NMI isn't necessary for most jobs. With the CCR units, most timing-critical jobs can be done way faster and more precisely than with any NMI-driven software. And the option to even trigger DMAs or other things based on external signals or timer events without software intervention makes the NMI quite obsolete for the vast majority of applications. If you really require it, well, then you have to sacrifice RST. With SVS/power module, BOR and WDT, RST isn't that important anymore anyway.

0 Ran Cohen62182 over 15 years ago in reply to Jens-Michael Gross

Prodigy 20 points

It seems that we have found a workaround for this problem.

For devices that don't have a built-in SVS, we are using the internal comparator.

The internal comparator has a fixed reference of 0.55V, so I set the voltage divider output to 0.64V.

I used 2.3Kohm and 10Kohm resistors as a voltage divider connected to P2.6.

When the voltage drops, the comparator triggers an interrupt and the MSP goes to low power mode.

Here is the code attached:

#pragma vector = COMPARATORA_VECTOR
__interrupt void CA_Vector(void)
{
    if (CACTL1 & 1)   // Test the comparator interrupt flag (CAIFG, bit 0)
        LPM4;         // Supply is low: stop the CPU and all clocks
}

On main function:

WDTCTL = WDTPW + WDTHOLD;                 // Stop WDT
CACTL1 = 0x38;                            // Comparator on, internal diode reference
CACTL2 = 0x30;                            // Select CA6 as comparator input
while ((CACTL2 & 1));                     // Wait while power is less than 3.2V
CACTL1 &= 0xFE;                           // Clear interrupt flag
if (CALBC1_16MHZ == 0xFF || CALDCO_16MHZ == 0xFF)
{
    while (1);                            // If calibration constants erased
}                                         // do not load, trap CPU!!

CACTL1 |= 2;                              // Comparator interrupt enable
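As a quick sanity check of the divider values, the arithmetic can be sketched in C. This assumes the 10 kohm resistor sits between VCC and the comparator pin and the 2.3 kohm resistor goes to ground (the post does not say which way round they are), and it uses the nominal 0.55 V reference. The function names are made up for this example.

```c
/* Divider output in mV for a given supply voltage. */
unsigned long divider_out_mv(unsigned long vcc_mv,
                             unsigned long r_top_ohm,
                             unsigned long r_bottom_ohm)
{
    return vcc_mv * r_bottom_ohm / (r_top_ohm + r_bottom_ohm);
}

/* Supply voltage in mV at which the divider output equals the reference. */
unsigned long divider_trip_mv(unsigned long vref_mv,
                              unsigned long r_top_ohm,
                              unsigned long r_bottom_ohm)
{
    return vref_mv * (r_top_ohm + r_bottom_ohm) / r_bottom_ohm;
}
```

With these assumptions, `divider_out_mv(3300, 10000, 2300)` gives 617 mV at a 3.3 V supply, and `divider_trip_mv(550, 10000, 2300)` gives a nominal trip point of 2941 mV. That is somewhat below the 3.2 V mentioned in the code comment, so it is worth measuring the real trip point on the board: the diode reference has a wide tolerance and drifts with temperature.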

Let me know if it helps you too.

Ran

0 Tony Weir over 15 years ago in reply to Ran Cohen62182

Prodigy 40 points

Wow ... great stuff, Ran. I have a good feeling about this, I was just about to try something similar using the SVS (which I confess I've ignored until now). We have some evidence - although we can't prove conclusively - that the flash corruption is happening only at powerdown. Our system is pretty weird; the supply falls veeeerrry slllowwwly, which is a bad thing at the best of times.

I'll try your suggestion and let you know how we get on.

best regards

Tony

0 Daniel Auer over 15 years ago in reply to Tony Weir

Prodigy 70 points

Hi Tony

Yes, I agree. In my system the flash corruption happens only at power-down as well. At power-down the MSP430 SW starts up again and again (Reset -> Boot -> Reset -> Boot -> .....) and during this time something happens to the flash. The reset happens at DCO calibration.

Our product with this MSP430 controller is already released, and therefore we have over 1000 systems in the market. The flash problem occurs much less often with a "Start Up Delay" in the SW. But I have still had 3 errors with this delay until today. We will rework the PCB and add a reset controller to the MSP430. Long-term tests show no more flash errors with this reset controller (because there is no more SW start-up at power-down). Hopefully the external reset device solves this problem completely.

Best Regards

Daniel

0 Jens-Michael Gross over 15 years ago in reply to Daniel Auer

Guru 227245 points

Daniel Auer said:
In my system the flash corruption happens only at power-down as well. At power-down the MSP430 SW starts up again and again (Reset -> Boot -> Reset -> Boot -> .....) and during this time something happens to the flash.

Well, for every processor speed there's a minimum system voltage. Below this, the CPU may do virtually anything. From freezing to erroneously executing any code. Also, there's a minimum operating voltage for flash. If the MSP is operated outside its operating conditions, then virtually everything can occur.

I agree that that's not a desirable condition. Anyway, the 1x series is oooooold. If you were still using an 8088 processor, you wouldn't complain about missing ECC support, would you?
And even in the 1x series, the 16x devices did have a brownout detection and a voltage supervisor. With the SVS, you can define the minimum voltage required for your device to safely continue running at the selected speed. Once VCC falls below, the processor is held in reset state by the SVS module until either the required voltage is there once more or the brownout detection kicks in, shutting the device completely down.

It's up to the programmer to use this functionality.
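For illustration, here is a small C sketch of how one might choose the SVS level for a required minimum supply voltage. The threshold table lists the nominal VLDx levels as I recall them from the family user's guide (VLD = 15 selects the external SVSIN input and is left out); the real thresholds have tolerance and hysteresis, so please check the datasheet. `svs_vld_for()` and `svsctl_value()` are hypothetical helper names, and PORON is assumed to be bit 3 of SVSCTL.

```c
/* Nominal SVS thresholds in mV, indexed by VLDx (0 = SVS off).
   Values recalled from the family user's guide; verify before use. */
static const unsigned svs_threshold_mv[15] = {
    0, 1900, 2100, 2200, 2300, 2400, 2500, 2650,
    2800, 2900, 3050, 3200, 3350, 3500, 3700
};

#define SVS_PORON 0x08u   /* assumed: SVS low condition causes a POR */

/* Smallest VLDx whose nominal threshold is at or above min_mv,
   or 0 if no internal level is high enough. */
unsigned svs_vld_for(unsigned min_mv)
{
    unsigned vld;
    for (vld = 1; vld < 15; vld++) {
        if (svs_threshold_mv[vld] >= min_mv)
            return vld;
    }
    return 0;
}

/* Value for SVSCTL: VLDx in the upper nibble, optionally PORON. */
unsigned svsctl_value(unsigned vld, int reset_on_low)
{
    return (vld << 4) | (reset_on_low ? SVS_PORON : 0u);
}
```

For a clock setting that needs at least 2.7 V, `svs_vld_for(2700)` returns 8 (nominally 2.8 V), and `svsctl_value(8, 1)` gives 0x88. Whether to set PORON is the trade-off discussed earlier in this thread: with it the device is held in reset while VCC is low, without it the firmware has to poll the SVS flag and react itself.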

Anyway, we too built several hundred devices for energy metering in industrial surroundings based on the 1232 (and RF transceivers based on the 1611) and never experienced a flash failure. Maybe because we added an external reset controller to the 1232 devices, since we knew from the datasheet that there is no proper brownout detection. We had (and needed) one on the previous PIC-based generation of devices too, where we did face flash failures.

0 Daniel Auer over 15 years ago in reply to Jens-Michael Gross

Prodigy 70 points

Hi Jens-Michael

Good for you to use the Reset Controller from the beginning :-)

We are using one device with a SVS (MSP430F235) and without initializing the SVS we had the Flash Problem. With the SVS working, there are no Flash Problems.

Sadly the other MSP we are using is an MSP430F2254 without this SVS. This one has only the brownout. TI confirmed that the repeated SW restart behaviour at power-down is possible even with the brownout.

Here is the statement from TI (sorry, but it is in German; translated below):

That is a known problem. When the brownout trips and puts the device into reset, the current consumption drops and the voltage can rise slightly (especially with a slowly falling VCC and large caps). If the rise is larger than the hysteresis of the brownout, the device starts running again, which pulls the voltage back down. This leads to the device starting up multiple times. In this state you are actually outside the spec, and to be completely safe you would have to use an external reset device. .......

I agree there is no obvious connection between this start-up behaviour at power-down and the flash problem itself, except that my system works correctly with the reset device and not without it :-). So, I now know as well that there is no proper brownout detection and therefore the brownout is useless..... Anyway, besides this, we are quite happy with the MSP430.

Regards

Daniel

0 Jens-Michael Gross over 15 years ago in reply to Daniel Auer

Guru 227245 points

Daniel Auer said:
Good for you to use the Reset Controller from the beginning :-)

From the previous experience with the PIC (which sometimes did not even restart at all after a brownout) and the fact that we had to solder the subsequently added controller to the wires on the PCBs (too many produced and stuffed for a redesign), we just added it to the MSP 'just in case'. And while we never saw any problems (except that the controller, which also included a watchdog, was interfering with the flashing process; a series resistor fixed this, allowing the JTAG to override), I see from this thread that it was a good decision.

Daniel Auer said:
sorry, but it is in German

I'm in German(y) too :)
And I think this reply is what I too wrote in this or a similar thread (I don't remember).
To prevent the continuous restarts, one could just consume MUCH current in the startup phase (for a short moment), so VCC will drop faster :) It would, however, present an additional load at power-up too, increasing the requirements on the VCC rise behaviour. Well, you can't eat the cake and keep it :(

Daniel Auer said:
So, I now know as well that there is no proper brownout detection and therefore the brownout is useless

Well, not 100%. It depends. If the surroundings fit, the brownout is useful. I think, however, that the hysteresis and the trigger levels are badly chosen. The minimum requirements for the default startup condition (startup DCO speed etc.) are known, and the brownout should trigger near these and not far below.
On devices with both brownout and SVS, the brownout is mostly used to detect whether the device was coming from power-up or resetting due to a short power failure, with RAM contents possibly still intact. This may be helpful for some applications. Also, BOR is used to reset the SVS, so the device can start over with default values. That way, code that detects undervoltage and shows it may be able to run, rather than the device being held in reset forever until the power supply finally fades away totally.
It all depends on the intended application.

Daniel Auer said:
Anyway beside this, we are quite happy with the MSP430

We are too. We weren't with PIC. And Atmel is way too complex (and expensive) for our applications. Even if not producing 10k devices, every $ counts. MSP gives a good bang for the buck.

0 chris.o. over 14 years ago in reply to Jens-Michael Gross

Prodigy 195 points

Hi All,

I too have a couple of units from a recent build that have come back to me dead after some time in the field. The boards have an MSP430F2370, but no external SVS. However, we did increase the R and C reset circuit values above the default values.

The difference between my units and those of others posting here is that rather than an entire sector being erased or corrupted, only a few bytes scattered throughout the main flash are affected. In particular, one unit had only three bytes in total that were affected: two in an almost 16k image (beginning at address 0xA300) and the upper byte of the reset vector (located at the suspicious address of 0xFFFF).

Am I potentially seeing the same issue? Also, does TI have an official recommendation on how to address this with completed units (i.e. firmware modification, minimally invasive HW mod)?

Thanks for the insight.

Chris

0 Jens-Michael Gross over 14 years ago in reply to chris.o.

Guru 227245 points

chris.o. said:
Am I potentially seeing the same issue?

Possible. But it is also possible that the problem lies in the programming cycle. While flash should retain data for several years, it only does so under optimum circumstances. If your programming equipment is doing the flashing with a relatively low voltage, the flash is probably programmed 'weak' and may lose its content after a relatively short time.

Newer MSPs (5xxx) have an internal programming voltage generator and supervisor for the flash, and also a 'marginal read' feature which allows checking the written flash data unbuffered, so you can detect weak cells easily.

Another possible source of flash failure might be short overvoltage spikes in the supply. I've seen this with PICs before (not with MSPs, but then, we didn't have this condition on MSPs): some of our PICs were flooded with 5V when the voltage regulator failed on 9V cells. This caused the PICs to lose their flash content (after reprogramming they worked fine). I cannot say whether this may or may not happen with MSPs too. Maybe the flash technology used on the MSP is resistant to this, maybe the internal supply routing protects the flash, or maybe it will happen too under certain circumstances, but I just never had these circumstances.

0 Lost Signal over 14 years ago

Prodigy 10 points

We saw exactly the same behavior with multiple units. After some number of power cycles, a few random bytes would get corrupted or a page would get erased. Corrupted bytes would always have some of the ones turned into zeros. All symptoms pointed to the flash controller accidentally executing a write or erase operation during a power cycle. Enabling the SVS fixed the problem. It seems the issue was caused by the fact that the CPU was allowed to run at a fast clock rate while the power rail was dropping down from 3.3V. I understand that this kind of "overclocking" can cause the CPU to execute random instructions, but corrupting password-protected flash memory should be very unlikely. Considering the severe consequences of corrupting code memory in an embedded device, I believe this should be classified as a silicon bug. The workaround is to prevent the CPU from overclocking by using the SVS, an external reset generator, or possibly a power-fail interrupt that switches the CPU to a slow clock. I've heard about several companies switching away from MSP430 because of the unexplained memory corruption. TI should be more forthcoming in explaining this issue and how to avoid it if they want people to keep using MSP430.
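The "ones turn into zeros" versus "page erased" distinction can be checked mechanically when comparing a JTAG readback against the original image. The following is only a host-side sketch, not a TI tool: the function names are hypothetical, and the 512-byte main-flash segment size is an assumption that should be checked against the datasheet of the device in question.

```python
SEGMENT_SIZE = 512  # ASSUMPTION: main-flash segment size in bytes; verify for your device


def classify_flash_diff(reference: bytes, readback: bytes, base_addr: int = 0):
    """Compare two equally sized images and classify every differing byte.

    Returns a list of (address, kind) tuples, where kind is one of:
      "erased"       - the byte now reads 0xFF (what an erased flash byte reads as)
      "bits_cleared" - only 1 -> 0 transitions (what an unintended write would do)
      "bits_set"     - at least one 0 -> 1 transition without a full erase
    """
    if len(reference) != len(readback):
        raise ValueError("images must have the same length")
    findings = []
    for offset, (ref, got) in enumerate(zip(reference, readback)):
        if ref == got:
            continue
        if got == 0xFF:
            kind = "erased"
        elif (ref & got) == got:
            kind = "bits_cleared"
        else:
            kind = "bits_set"
        findings.append((base_addr + offset, kind))
    return findings


def segments_hit(findings, segment_size: int = SEGMENT_SIZE):
    """Return the sorted start addresses of all segments containing a difference."""
    return sorted({addr - addr % segment_size for addr, _ in findings})
```

If every finding is "bits_cleared" or "erased", the damage is at least consistent with the flash controller having performed an unintended write or erase; a "bits_set" result on an otherwise intact segment would point elsewhere (for example a readback problem).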

Regards,

LostSignal308

0 Ray Her over 14 years ago in reply to Lost Signal

Prodigy 20 points

hi all

I have a problem very similar to yours. Our company released around 50 product samples based on MSP430F155 (very old chips :P ), and got 2 broken pieces back. I checked the memory and found that the calibration data stored in Info Flash were corrupted.

Yes, you got it, I did not enable the SVS. Because I don't know the root cause, I will do this.

Thanks for your suggestions. :)

Rayher

0 Mark Neary over 13 years ago in reply to Daniel Auer

Prodigy 70 points

Hi Daniel,

I am responding to your post of almost two years ago, as we believe we are currently suffering the same problem described in detail in this web thread.

I am wondering if you can give me an update on your progress in solving your problem as it may help our situation. Also, I am curious if you have any suggestions for a VCC Supervisor IC that may work well for us based on your experience.

Specifically we are experiencing a wiping of addresses 0xFFE0 to 0xFFFF after some number of power cycle events using a MSP430F2272 operating at 3.3V. The failure is always consistent and occurs only due to power cycle events. It is always possible to re-program the MSP to restore original performance.

In our product, there is some amount of noise on VCC and we may also have a significant amount of VCC capacitance making the 3.3V fall quite slowly. We also have 200ns transients (50KHz rate) at start up as high as +6V/-4V on the 3.3V supply (only periodically during the rising of VCC). (There is a 3.3V ESD device in place that is providing limiting to these transient levels.)

Based on the significant information in the thread posts, we are planning to implement an external VCC supervisor (reset) IC. We are thinking that a supervisor that implements a timeout delay will work best to hold the CPU in reset throughout the transitions of the VCC supply.

So my specific questions for you are:

1) Does it sound like we are observing the same issue with the fixed brownout circuit in the MSP device?

2) Have you been successful in solving a similar problem with an external VCC supervisor (reset) device, and if so what was the specific part number that you used?

3) Are there any specific characteristics of an external VCC supervisor that we should look for?

Any assistance you can provide here will be much appreciated.

Regards,

Mark

0 Daniel Auer over 13 years ago in reply to Mark Neary

Prodigy 70 points

Hi Mark

to 1) yes sounds like it is a similar issue....

to 2) Yes we solved the problem with an external reset IC perfectly. No failure since then. We are using TPS3808G33 and on another design a TPS3808G25.

to 3) No, just use an ordinary reset controller with an appropriate time delay. It can be from a manufacturer other than TI as well :-)

On MSP430s which provide an internal SVS unit, you can solve the problem with this SVS and don't need the external reset controller.

Regards

Daniel

0 Mark Neary over 13 years ago in reply to Daniel Auer

Prodigy 70 points

Hi Daniel,

Thanks so much for the prompt response.

We have now completed a significant amount of testing that shows clearly that the addition of an external VCC supervisor has corrected our observed issue.

As you indicated, I would agree that the exact supervisor part number is probably not critical. It is probably desirable to have an appropriate time delay, as you indicated, which will serve to hold the MSP in reset as VCC ramps up and ramps down. We had success with the Micrel MIC803-29D3VC3 TR (3 Pin SC-70) and the Microchip MCP120-300DI/TO (TO-92 Package). For the benefit of other readers, in general I found that the important parameters to look for are the voltage threshold, the timeout delay (on the order of 20ms to 200ms), the RESET assertion delay (on the order of 5us to 130us) and the temperature range, of course. The reset assertion delay may be important to ensure no false resets due to noise spikes ("transient immunity"), which can generally be improved with an external capacitor on VCC very close to the supervisor part.

We were lucky that our VCC was operating at nominally 3.3V so we had lots of headroom to choose a supervisor with a 2.9V threshold, which is well away from the 1.5V threshold in the MSP. I would be pretty frightened of operating this series of MSP at 1.8V using the internal brownout detection unless power was perfect, like maybe from a battery.

Another thing that may interest others is that the three signals of interest (VCC, GND and CPU RESET LOW) are present on the Spy-By-Wire header if it is used. In our case, we were able to make a band-aid solution whereby a small PCB plugs into the Spy-By-Wire header preventing a soldering modification on our production line. In trying this solution, one would have to watch out for adding too much inductance that would tend to isolate the supervisor from the CPU VCC signal. Generally the supervisor should be as closely linked to the CPU VCC as possible.

Thanks again for your input as it helped to validate and add ammunition to our own work. Still no response from my direct TI technical support request. I think they are a bit scared after I pointed out this web thread to them in relation to my problem. I'm still a big TI MSP fan, but this is a bit ridiculous.

Best Regards,

Mark

0 Todd Rimbey over 12 years ago in reply to Mark Neary

Prodigy 30 points

Mark,

We have also been experiencing something very similar to what is mentioned here in this forum. Would you mind sharing what testing you performed to verify or recreate the problem you were seeing? We have been doing a bunch of power up/power downs, but have been unsuccessful in recreating the problem (corrupted flash/cleared memory) consistently.

Regards,
Todd

0 Daniel Auer over 12 years ago in reply to Todd Rimbey

Prodigy 70 points

Hi Todd

In my case, I could easily recreate the flash failure by powering the system (the power supply of the MSP430) up and down. Of course it takes a few power up/down sequences to recreate the failure, so we used several systems for testing.

What type of MSP430 are you using? Do you have an external reset controller?

Regards

Daniel

0 Todd Rimbey over 12 years ago in reply to Daniel Auer

Prodigy 30 points

Daniel,

Thank you for your rapid response!!! We are using the MSP430F149 and we do not have an external reset controller. I have been wondering how many devices Mark used to test, at what frequency he was cycling power, what his power decay curve looked like, and how regularly the failure presented itself. We are currently executing out of flash, but only use about 9K of it. So failures could be occurring in other parts of the flash and we aren't seeing it. Also, I have read some discussion about the use of a "wait" during boot up and power down where the part is held in RESET. Has anyone had success with that feature? We may need to spin our hardware to add a reset controller, but we need a quicker fix initially.
Thanks again for the response - we are under the gun on a critical effort and are looking for something to greatly reduce the occurrence until we can spin our hardware.

Regards,

Todd

0 Mark Neary over 12 years ago in reply to Todd Rimbey

Prodigy 70 points

Hi Todd,

In our case we had the MSP430 on a circuit board where there were two factors that I believe contributed to the likelihood of the corruption problem. Firstly, there was a lot of power supply noise, as the MSP430 was physically close to a large (100W) AC/DC power supply, and the 3.3V supply for the MSP430 was derived from that AC/DC power supply and showed evidence of power supply noise at the switching frequency of the AC/DC supply. Secondly, there was significant capacitance in the system that made the power supply transition to zero slowly when the AC mains supply was removed from the system.

In our system, we found that some units were more susceptible to a memory corruption problem than others, and I am not sure why this is the case but I would guess that it has to do with circuit tolerances/variation in the circuitry external to the MSP430 or within the MSP430 brownout circuit. If we are considering a unit that showed a tendency for memory corruption, in our system, we found that we could reliably reproduce the problem if we applied the AC Mains with a duty cycle of one (1) minute ON and fifteen (15) minutes OFF. (The extended OFF time helped our output capacitance to transition fully to zero, which forced a noisier AC/DC power supply start-up in our system, which we thought might contribute to the problem, but in reality we are not sure if the corruption tends to happen at power up or power down, or both.) Running this cycle with a unit with a failure tendency, we would generally see a corruption within two days of testing. When we installed an external RESET supervisor IC, we felt that if we operated for two weeks with no corruption, the supervisor was correcting the problem.

So as you can see, the method used to cause the corruption will probably depend on the nature of your system. Furthermore, the failures will probably be fairly random and may take some period of time to occur, which is annoying. (That being said, other users on this thread have reported easier failure creation.)

I would recommend getting a test setup that has been shown to consistently create the problem within two days, or maybe one week at the most, and then test a possible solution considerably longer than that. (If you know someone good at statistics, some calculations may be performed to rationalize the success of your test.)
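As a rough illustration of the kind of calculation meant here: if failures are modelled as a Poisson process (an assumption, and a crude one for a fault this erratic), the chance of seeing no failure at all during a test of a given length can be estimated as below. The function names are made up for this sketch, and the rate of 0.5 failures per day simply encodes "generally fails within two days" for one susceptible unit.

```python
import math


def prob_no_failure(rate_per_day: float, test_days: float) -> float:
    """Probability of zero failures in test_days if the old failure rate still applied.

    ASSUMPTION: failures follow a Poisson process with a constant rate.
    """
    return math.exp(-rate_per_day * test_days)


def days_needed(rate_per_day: float, max_false_pass: float) -> float:
    """Test length needed so that a failure-free run happens by pure luck
    with probability at most max_false_pass (under the same assumption)."""
    return math.log(1.0 / max_false_pass) / rate_per_day


# One susceptible unit that used to fail about once every two days,
# tested for two weeks with the fix fitted:
p = prob_no_failure(rate_per_day=0.5, test_days=14)
print(f"chance of a clean two-week run without any real fix: {p:.4%}")
```

Under these assumptions a clean two-week run would be very unlikely to happen by chance, which matches the intuition of testing a fix considerably longer than the time the problem normally takes to show up. If the true failure rate of a unit is lower or varies over time, the required test time grows accordingly.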

Sorry my answer is wishy-washy, but this is a particularly unpleasant type of "firmware" problem.

Mark

0 Mark Neary over 12 years ago in reply to Todd Rimbey

Prodigy 70 points

Todd,

Additional points:

1) At one point we tested twenty (20) units simultaneously for about 12 days before we saw the first corruption.

2) In our case we almost always see the interrupt vector table cleared to zeroes, which causes a consistent checksum error upon reading the memory back with the Elprotronic programming tool. Recently we saw a case where the DCO constants in Info A had been wiped. The unit was a field failure where no external supervisor was fitted. It was not operating correctly in terms of timing and we traced it to the DCO constants because when we tried reprogramming the MSP430 (after ruling out the external crystal oscillator) the Elprotronic programming tool detected that the DCO constants were not correct and performed a DCO calibration. This is something to watch out for. We have not experienced corruption of our programmed firmware code memory aside from the interrupt vector table corruption (I don't think).
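For anyone checking readbacks by hand, the symptoms reported in this thread (vector table at 0xFFE0 to 0xFFFF cleared to zeroes or erased, or a damaged reset vector) can be tested with a few lines on the host. This is only a sketch with hypothetical names; the vector table range is taken from the posts above, the code range is whatever your own image occupies, and it is not how the Elprotronic tool computes its checksum.

```python
VECTOR_TABLE_START = 0xFFE0  # table range as reported in this thread
RESET_VECTOR_ADDR = 0xFFFE   # reset vector is the last word of the table


def check_vectors(image: bytes, image_base: int, code_start: int, code_end: int):
    """Return a list of problems found in the vector table of a flash readback.

    image must cover image_base .. 0xFFFF; code_start/code_end describe the
    address range your firmware occupies (values you supply, not detected).
    """
    if image_base > VECTOR_TABLE_START or image_base + len(image) != 0x10000:
        raise ValueError("image must cover the vector table up to 0xFFFF")
    table = image[VECTOR_TABLE_START - image_base:]
    problems = []
    if all(b == 0x00 for b in table):
        problems.append("vector table cleared to zeroes")
    elif all(b == 0xFF for b in table):
        problems.append("vector table erased")
    # MSP430 stores words little-endian: low byte at 0xFFFE, high byte at 0xFFFF.
    lo = image[RESET_VECTOR_ADDR - image_base]
    hi = image[RESET_VECTOR_ADDR + 1 - image_base]
    reset = lo | (hi << 8)
    if reset % 2 or not (code_start <= reset <= code_end):
        problems.append(f"reset vector 0x{reset:04X} is odd or outside the code range")
    return problems
```

A similar check against a saved copy of the Info A contents would catch lost DCO constants before a unit goes back into the field.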

Mark

0 Mark Neary over 12 years ago in reply to Mark Neary

Prodigy 70 points

Todd,

One more point. In our system we came up with a good band-aid solution. Our system had a 4 position connector for the Spy-By-Wire interface. This header contained GND, VCC (3.3V) and the Reset point for the CPU. We were therefore able to make a reset supervisor that could be plugged onto the Spy-By-Wire header. It consisted simply of a reset supervisor in a TO-92 package soldered to a header that mated with the Spy-By-Wire header. It worked out fine and was used for a period of time in production equipment, before we changed the main PCB to accept the supervisor IC. You would need to have a pull-up on your main board for the Reset net, or get a supervisor IC with an integrated pull-up.

Mark

0 Todd Rimbey over 12 years ago in reply to Mark Neary

Prodigy 30 points

Mark,

Would it be possible to find out all of the details of your exact test setup? We have been running two systems trying to catch the failure, but have not had much success. We have had 5 out of 60 fail in the past 5 months. I did see that pulling power down to zero is key in the test setup, but I also saw Jens-Michael Gross mention noise on the power input lines. Have any other conditions been found that help aggravate the issue? We are in a root cause analysis phase and are driving to reliably recreate the combination of conditions that causes this problem.

Many thanks for the help!

Todd.

0 Mark Neary over 12 years ago in reply to Todd Rimbey

Prodigy 70 points

Hi Todd,

For testing that we have done in our local production facility, we tested 20 units for about 10 days before we observed a single failure. In our system an Omron controller that allows different timing to be configured was used to switch the AC Mains for all units simultaneously with a voltage of 240VAC. The duty cycle here was 15 minutes ON / 15 minutes OFF. (I don't think that the 15 minute ON time was beneficial and really just slowed down the number of cycles.)

Subsequent to this testing, a unit failed in the field (with no reset supervisor fitted) and the corruption failure in this case was loss of the DCO constants stored in Info A, resulting in incorrect system timing and operation. We found that this particular unit would fail quite easily with a 1 minute ON / 15 minute OFF cycle of 240VAC AC Mains application. The switching was done with some custom timing system involving typical AC relays in this case. This was some piece of equipment available in this particular production facility and I am not fully clear on the details. This particular unit would typically exhibit corruption failures within two days under this test scenario.

In my own location, I configured a system for accelerated testing. This involved a programmable AC power supply that could be configured to be switched ON and OFF at an adjustable rate. I then rigged up an AC relay and a power resistor to kill the output capacitance in the unit under test quickly as soon as AC Mains was switched OFF. This allowed a greatly sped up test. However, despite tens of thousands of test cycles I was never able to make this setup cause a unit to fail, even for units that had exhibited the problem before.
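To put the three setups on a common scale, the ON/OFF timings above can be converted into power cycles. The helper names below are just for this illustration; the numbers are the ones given in this post.

```python
def cycles_per_day(on_minutes: float, off_minutes: float) -> float:
    """Number of complete ON/OFF power cycles that fit into 24 hours."""
    return 24 * 60 / (on_minutes + off_minutes)


def unit_cycles(units: int, days: float, on_minutes: float, off_minutes: float) -> float:
    """Total power cycles accumulated over all units under test."""
    return units * days * cycles_per_day(on_minutes, off_minutes)


# Production test: 20 units, about 10 days, 15 min ON / 15 min OFF
production = unit_cycles(20, 10, 15, 15)
# Susceptible field unit: 1 min ON / 15 min OFF, failure within about 2 days
field_unit = unit_cycles(1, 2, 1, 15)
print(production, field_unit)
```

That works out to roughly 9600 unit-cycles before the first failure in the production test, versus at most about 180 cycles for the susceptible unit, while the accelerated setup survived tens of thousands of cycles. The raw cycle count is evidently not what matters; the shape of the VCC decay and the unit itself are.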

This is a very annoying type of electronics problem. I would recommend trying to make units fail that have previously demonstrated the failure, as they are probably the most likely to fail.

I'm afraid this is all the detail I can provide concerning our exact test setup. Good Luck.

Mark

0 Jens-Michael Gross over 12 years ago in reply to Mark Neary

Guru 227245 points

Mark Neary said:
a unit failed in the field [...] We found that this particular unit would fail quite easily

So that means that some units are more sensitive to this problem than others. Those that have failed once are likely to fail again. Interesting. This fits into my theory.

Mark Neary said:
I then rigged up an AC relay and a power resistor to kill the output capacitance in the unit under test quickly as soon as AC Mains was switched OFF. This allowed a greatly sped up test. However, despite tens of thousands of test cycles I was never able to make this setup cause a unit to fail, even for units that had exhibited the problem before.

This further supports my theory about what's going on:

Most units where this problem happens are designed for low power consumption or unintentionally consume low power. So when the supply fails, VCC falls slowly. If VCC falls fast on power fail, the problem doesn't show up.

Also, if an external reset controller / supply voltage supervisor is used, or if the unit has an internal supply voltage supervisor and it is properly programmed, the problem also doesn't seem to show up.

So I think the problem is, that with slowly falling VCC, below minimum operating voltage, the flash controller might start to behave erratically, unlocking the flash. If now the CPU is still operating (VCC below minimum operating voltage doesn't automatically mean that it must fail at this point, and even if it fails, it may still do something) and accessing flash, unintended write or erase operations may occur until VCC is depleted.

It would be interesting to know whether in these failing units the CPU frequency was rather low (so a complete CPU crash would occur at a lower voltage, probably below the minimum voltage for flash operation)

0 Mark Neary over 12 years ago in reply to Jens-Michael Gross

Prodigy 70 points

Hello Jens-Michael,

(You are still interested in this issue after more than two years!)

I agree with your theory posted Oct 19 2010. It does seem likely to me that the problem involves slowly decaying power supply voltage, possibly in conjunction with load changes during the decay that cause the voltage to pop up and down around the CPU internal reset threshold (about 1.5 volts). Also power supply noise could play a part and in our system we have some nice big noise spikes (oscillation style) at a 50KHz rate.

Your comments about my accelerated test setup have gotten me thinking. In that setup, I was not killing the 3.3V directly; I was killing a higher voltage that is ultimately derived from the same AC switching power supply transformer. The situation is complex, but in thinking about the math relationships I think that the 3.3V would fall faster, and probably more consistently, if the higher output voltage falls faster. There may also have been less power supply noise at turn-off and less opportunity for a slow decay/noise problem to result in the CPU memory corruption. So by killing the higher voltage supply, the 3.3V likely fell faster, with less noise and very consistently, which means your theory is certainly not violated and is possibly supported.

Thanks for your input.

Mark
