Flash Corruption in Application that Never Writes

Other Parts Discussed in Thread: MSP430F1611, LM5574, MSP430F1612

Hello:

We currently have a product based on an MSP430F1611 that is exhibiting FLASH corruption in the field. Here are the details:

1.)   The application is "read-only".  At no time do we write the FLASH.

2.)  The system is powered by a "user". A user turns the unit on by cranking a DC brushed permanent magnet motor. The motor output is rectified and fed into a big (20000uF) cap.

3.)   The cap feeds a switching power converter (LM5574), which provides a regulated 5v to my system. The +3.3v for the MSP is provided by an LDO.

4.)   The MSP has all the standard (0.1uF) bypass on all pins.   Reset has a .01uF to ground with a 23k pull up.

5.)   There is a MOSFET attached to the reset line and one hooked to the TCK line to control BSL. We have a special adapter that can plug into the board (not user accessible) for BSL. When the adapter is not plugged in, the FET gates are held low with a 47k from gate to source.

6.)   The SVS is currently NOT enabled.   It was my understanding this was only needed when writing Flash, not accessing it.

I can upload schematics if it's helpful.

Has anyone seen similar behavior of FLASH corruption when an application never writes?

 

 

  • Elisha,

    Could you explain a bit more about what you mean by "Flash Corruption"? My guess is that you mean the code fails to run because it is corrupt. If so, have you enabled verification when the code is downloaded to ensure it is downloaded successfully? Is the same Flash location always affected, or does it change randomly?

    Have you tested multiple ICs to ensure that the corruption is not the result of FLASH failure? Please remember that Flash memory has a finite amount of Erase/Write cycles and therefore it might have been exceeded in some way (if the unit deployed to the field is the same as one that was used for development for example).

    Powering any digital system with an unstable power supply could be problematic. The MSP430 includes a Brownout reset that will not let the MSP430 operate unless the voltage is acceptable (please refer to pg 33 of the MSP430F1611 datasheet for this information).

    I would also highly suggest that you ensure that the power rails are clean and that any noise is kept to a minimum.

     

    Gustavo

     

     

  • Gustavo:

    FLASH failure is probably a better term. This particular project gets programmed and verified on the factory floor via BSL. I am using the reference code supplied by TI for the BSL. The product is fully tested before leaving the factory, so the device does function. We have been getting units back from the field that have about 75% of their FLASH erased and/or mangled. I use the Elprotronic FET-Pro430 tool to do a "difference" between the FLASH and the production programming file.
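
    A comparison like the one described above can be approximated on a host PC once both images are available as raw byte arrays. This is only a minimal sketch, not how FET-Pro430 works internally; `flash_diff_summary` and `flash_diff_t` are hypothetical names, and it assumes both buffers cover the same address range.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of how a read-back flash dump differs from the
 * production image. Names are illustrative, not from any vendor tool. */
typedef struct {
    size_t differing;   /* bytes that no longer match the reference    */
    size_t erased;      /* differing bytes that now read 0xFF (erased) */
    size_t mangled;     /* differing bytes holding any other value     */
} flash_diff_t;

flash_diff_t flash_diff_summary(const uint8_t *reference,
                                const uint8_t *dump,
                                size_t len)
{
    flash_diff_t d = {0, 0, 0};
    for (size_t i = 0; i < len; i++) {
        if (reference[i] == dump[i])
            continue;               /* byte is intact */
        d.differing++;
        if (dump[i] == 0xFF)
            d.erased++;             /* looks like an erased cell */
        else
            d.mangled++;            /* changed to something else */
    }
    return d;
}
```

    Separating "erased" from "mangled" bytes gives a first hint whether a segment erase or a stray write is the more likely mechanism.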

    All of these units start with a fresh IC.   I am looking into the power rails now.   The first check showed that they were very clean but we are trying to get a good test setup where we can bring the supply up and down to see if there is a brown out issue.

    My goal now is to try to manually induce the problem.....

     

    -Eli

  • A slowly rising supply voltage might be a problem.
    The brownout circuit in the 1611 will start the processor at a relatively low voltage. The reset pin isn't a real help, as it will likely rise fast enough that it is always at VCC, however low VCC might be. So the processor will start while VCC is still relatively low (check the datasheet for the brownout voltage).

    Moreover, this type of power source results in a varying supply voltage (the large capacitor won't help suppress the high-frequency ripple because of its ESR).

    It is possible that VCC will rise, fall, rise more, fall less, rise even more, etc. until it becomes stable. This may leave the MSP in a critical state.

    Still, it is strange that the content of the flash changes. The above would normally just result in a crash or a failure to start up, but would not destroy the flash content.

    The only case where I have seen flash content destroyed (reproducibly) was with the PICs I had to work with before the MSP. When the voltage regulator failed (e.g. because of a soldering problem with its GND reference), 9V instead of 4V was applied to the PIC and it completely lost its flash content. After re-flashing it, all was well (except for the lost calibration data).

    Keep in mind that most voltage regulators (especially the switching ones) cannot LOWER the output voltage if it gets too high for some reason. They just cease to add current to the output, but cannot actively draw it back if it flows in from somewhere else. A possible cause could be an overvoltage at the port pins which is routed to VCC through the port pin clamp diodes of the MSP. If this current is not consumed by the rest of the MSP, VCC will rise until it matches the port pin voltage, and this may be way too high for the MSP and may cause flash damage.

    The newer MSPs (54xx and up) have an internal voltage regulator for the core/flash, so this will not happen there that easily.

  • Hi,

    I'm also having flash corruption on devices equipped with an MSP430F1612. They have been in the field for two years and I recently received two of them with flash corruption problems. One of them had the same problem one year ago. Some blocks were erased and others were corrupted. My application only writes configuration data to the INFOA and INFOB blocks, and this data is only written when the devices are tested.

    I read on another forum that you can force a flash corruption by cycling the power. So I cycled the power for two days on my device, but the flash is still OK. I also tried lowering the power supply voltage, but no luck with that either. The VCC on the MSP430 drops very fast when the main supply drops below a certain level. I also ran an ESD test on the enclosure (buttons and connectors), and that did not damage the flash either.

    I noted that many people have a flash corruption problem with the MSP430. There are many forums with people complaining about flash corruption on the MSP430.

    I was wondering if there might be a bug in the MSP430 flash controller? Or that some MSP430s might be defective.

     

     

  • mrx said:
    I was wondering if there might be a bug in the MSP430 flash controller? Or that some MSP430s might be defective.

    Or there are significant differences in the board designs, especially in the power supply section?

    We built hundreds of devices in dozens of projects and never had a case of flash corruption. The devices are used in industrial environments and we had only a few failures. In those cases it was either the crystal that failed (due to vibrations) or the power supply (permanent extreme overvoltage, 400 instead of 230V AC, which still worked for some weeks before exploding). I had not a single failure because of flash corruption. And some of the devices do write to (info) flash.

    I must add that we only used the 1232 and 1611 in 'mass' production so far. So there is no general bug in the flash controller.

  • We have 3 known problem units out of 455 shipped over the past 6 months or so.  Keep in mind that at no time does the software write the FLASH memory.

    I am trying to replicate the problem on my bench.   I have tried holding the supplies in the "brown-out" areas for long time periods (days) with no success.    I am now trying to get an experiment setup where I power the MSP with an extremely noisy supply that spans a 0.5 to 3.5v range.

    Any other thoughts on how I could try to induce this failure?

  • Elisha Hughes said:
    Any other thoughts on how I could try to induce this failure?

    Slowly rising and falling VCC? MSP devices have problems when the maximum rise times are exceeded. Then the reset pin will rise with VCC (always appear high), so no proper reset can be generated.
    The brownout release level is sometimes (on some devices) far too low to ensure proper operation with default clocking. So when the brownout-induced reset expires, a slowly rising VCC is still too low and anything can happen. And even if VCC is expected to be high enough for 8MHz operation, maybe it isn't. In my projects, I program the SVS to ensure proper voltage before I raise MCLK.

    Also, there can be something called the 'Lazarus effect'. If a brownout is triggered by falling VCC, the MSP will suddenly cease operation. This may cause the voltage to rise again because of electrolytic capacitors (chemical 'battery' effects). This will 'restart' the MSP temporarily. Depending on the firmware, this may cause the MSP to execute code erratically. It should not, however, affect the flash controller unless the code is doing a write right at the start. But who knows... Outside operating conditions, anything is possible.

     

  • Hi Jens-Michael, Can you share how you did

    "I program the SVS to ensure proper voltage before I raise MCLK."

    Thanks

     

     

  • I tested power quality and was not able to get a failure.   See Attachment.  3823.MSP430 FLASH Failure - Preliminary Test Report.docx

    Has anyone had experience with operating temperature being a problem? We suspect that the customer may be (inadvertently) operating this unit at high temperatures (130F-140F).

  • Jim Carlson said:
    Hi Jens-Michael, Can you share how you did "I program the SVS to ensure proper voltage before I raise MCLK."

    Here's some example code I once wrote for the 1611.
    It checks for VCC>3V.
    The variation is too large to get closer to the desired 3.6V without risking that the condition will never be satisfied. But ensuring at least 3V is better than switching to 8MHz while VCC is still in the 2.5V range.
    Also, it will trigger a POR once the voltage falls below 3V (which will likely have crashed the CPU already) instead of waiting until it gets below the BOR point (which will probably never happen on a VCC power glitch).

    Use #define __NOWDT if the WDT is not used; otherwise the WDT will be triggered in the waiting loops.

    void CheckSystemVoltage(void){
      unsigned char i, j, k=0;
      SVSCTL=_SVS_SYSTEM_VOLTAGE;           // set threshold to 3.2V (2.94..3.42V)
      while (!(SVSCTL&SVSON)) {             // wait for the SVS to settle
    #ifndef __NOWDT
        WDTCTL=WDTPW|WDTCNTCL;              // keep the watchdog from firing while we wait
    #endif
      }
      SVSCTL&=~SVSFG;                       // clear the SVSFG bit
      for(i=0;i<255;){                      // need 255 consecutive passes without SVSFG
        i++;
        k++;
        for(j=0;j<255;j++) {                // short delay to allow SVSFG to come up again
    #ifndef __NOWDT
          WDTCTL=WDTPW|WDTCNTCL;
    #else
          WDTCTL=WDTPW|WDTCNTCL|WDTHOLD;    // dummy load, so the loop isn't empty and optimized away
    #endif
        }
        if(SVSCTL&SVSFG) i=0;               // supply dipped below the threshold: start counting again
        SVSCTL&=~SVSFG;                     // clear the SVSFG bit
        ((k&32)?LED1_ON:LED1_OFF);          // just to see something blinking :)
        ((i>32)?LED2_ON:LED2_OFF);          // indicate that power is good for at least 32 cycles in a row
      }
      SVSCTL|=PORON;                        // activate power on reset if VCC falls below threshold again
    }
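
    The core of the routine above is a "run of consecutive good polls" counter: any low-voltage flag restarts the count. The following host-side sketch isolates just that logic so it can be tested off-target. The function names, the callback and the poll limit are hypothetical; on the MSP430F1611 the callback would test and clear SVSFG in SVSCTL, and the original routine has no poll limit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Callback returning true if the supply dipped below the threshold since the
 * previous poll. On the target this would test and clear SVSFG; here it is
 * injected so the counting logic can run on a host. */
typedef bool (*low_voltage_seen_fn)(void *ctx);

/* Poll until `required` consecutive polls report no low-voltage event.
 * Returns the total number of polls taken, or 0 if `max_polls` was reached
 * first. */
uint32_t wait_for_stable_supply(low_voltage_seen_fn low_seen, void *ctx,
                                uint32_t required, uint32_t max_polls)
{
    uint32_t good = 0;
    for (uint32_t polls = 1; polls <= max_polls; polls++) {
        if (low_seen(ctx))
            good = 0;           /* any dip restarts the count */
        else
            good++;
        if (good >= required)
            return polls;
    }
    return 0;
}

/* Example stub: reports a dip on the first `*dips_left` polls only. */
bool stub_dips_then_clean(void *ctx)
{
    uint32_t *dips_left = ctx;
    if (*dips_left > 0) {
        (*dips_left)--;
        return true;
    }
    return false;
}
```

    With three initial dips and five required good polls, the stub releases on the eighth poll. A supply that never stops dipping is never released.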

    Elisha Hughes said:
    Has anyone had experience with operating temperature being a problem?

    Well, high temperatures as well as high voltages may indeed influence the flash content. I never had this problem on MSPs (mainly because we program them after the devices are built), but for our wireless fire sensors, we shipped pre-programmed PICs to the PCB production and a small fraction of devices didn't work after soldering. Reprogramming usually solved it.
    Also, we had some devices where the voltage regulator (9V battery from the sensor to 4V for the PIC with the transmitter) failed and leaked out 6 or 7V. These devices also lost their flash content (but could be revived by reprogramming).

    So yes, high temperature may cause flash data corruption. It increases the mobility of the electrons and may cause a read access to turn into an accidental write access.

    The maximum junction temperature (while VCC is applied) is given as 95°C for the 54xx devices. Given the junction-to-ambient thermal resistance of 50°C/W, the maximum still-air ambient temperature may not be higher than 93°C (using a typical 10mA current consumption). Recommended operating conditions are 10 degrees less. (Sorry, I'm not used to °F, and the datasheets list °C too.)
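
    The arithmetic behind that estimate is T_ambient,max = T_junction,max - θ_JA × VCC × ICC. A small sketch follows, assuming VCC = 3.3 V (the rail mentioned in the first post, not stated here). The 95°C, 50°C/W and 10mA figures are the ones quoted above for the 54xx and should be checked against the 1611 datasheet before relying on them.

```c
/* Maximum still-air ambient temperature for a given junction limit.
 * theta_ja is the junction-to-ambient thermal resistance in degC/W. */
double max_ambient_c(double tj_max_c, double theta_ja_c_per_w,
                     double vcc_v, double icc_a)
{
    double power_w = vcc_v * icc_a;              /* dissipated power */
    return tj_max_c - theta_ja_c_per_w * power_w;
}

/* Fahrenheit to Celsius, for comparing with the reported field temperatures. */
double fahrenheit_to_celsius(double f)
{
    return (f - 32.0) * 5.0 / 9.0;
}
```

    With these inputs the self-heating is only 50 × 0.033 W = 1.65°C, giving about 93.35°C. The suspected 130F-140F field temperature converts to roughly 54°C-60°C, well below that limit.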

  • Here is an update. I ran 3 units in an oven at 100C for 24 hours. The devices were powered and operating during the test and everything was OK.

    I am now running more EMI tests. We have a nasty DC permanent magnet brushed gear motor that is used as a generator. We notice that if we make a small loop with a scope probe, we can pick up some 100mV transients that make it through the power supply lines. Even with decoupling, etc., we cannot get rid of the interference. We have even tried aggressive shielding of the enclosure, which makes me think that there may be another dynamic at work.

    We do have BSL Reset/Start pins available (but pulled inactive) during operation.   Maybe a BSL start condition is spuriously initiated by EMI transient on the supply/ground lines.

    I will keep refreshing this thread as I make more progress.   

  • I'd like to suggest that you place an appropriately sized TVS (Transorb is one such brand) on your 5V and 3.3 V supplies to deal with any possible power supply transients. Also place an appropriately sized switching diode to deal with the negative spikes.

    Then, if necessary, add a small LC filter and a small series resistor to the 3.3 V supply. Decoupling is not just about sprinkling capacitors everywhere.

    This sort of decoupling is typical automotive environment stuff with the latter for the more difficult problems. The schematic might help.

    Without seeing the circuit, I'm also going to suggest looking into an external (chip based) reset circuit.

    I have witnessed and repaired CMOS-based electronics which were non-destructively damaged, but would not work. No component was replaced, yet I was able to fix the product. One was a vacuum fluorescent display clock in my car that ceased to function after the car was jumped. The clock was fixed without removing it from the dash.

    I'm betting a transient that needs to be tamed.  If the DC motor has brushes make sure they are bypassed at the motor.

  • I believe I already posted schematics. There are MOVs on all the input power supply lines. The analog VCC has a series resistor.

    At this point,  I am trying to find a reliable way to induce the problem.  Then I can find a reliable way to fix it!

     

    -Eli

  • I looked again at your posts.  I didn't see a schematic, just a report of your testing.  MOV's are too slow.

  • Ron Dozier said:
    I didn't see a schematic, just a report of your testing

    The report contained several schematics.

    Ron Dozier said:
    MOV's are too slow

    Yep. Good enough to produce a short circuit if you accidentally apply 220V instead of 110V AC, but almost totally blind to HF transients.

    Zener diodes are slow too. Also, they have an ugly characteristic: they usually require 1 or 10mA of current through the diode before they reach the defined voltage. And the voltage rises for larger currents.

    For cutting off transients (but not to limit to a precise voltage), I use transient voltage suppressor diodes P6KExxCA. They are bi-directional (CA type) and have a defined breakdown voltage, from 6.45..7.14V for the 6.8V type up to 440V. Way better than any Zener. Unfortunately, there are none available for below 6.8V.

    Alternatively, a simple linear voltage regulator might be a solution. While it wastes the surplus energy (well, a Zener does too) and requires 0.5..2V above the output voltage, it usually filters/suppresses transients and supply voltage changes way better than any Zener. Also, it does not simply short out excess charge from the charging capacitor, so it only takes what's needed to keep the output at the desired voltage.

    I've used Zeners too (in a capacitor power supply), but not for regulating, only for pre-filtering the input voltage so it does not exceed the maximum input voltage of the real regulator.

  • I didn't realize that the schematic was in the test report.

    Thanks Jens-Michael for backing me up. 

    Once you clamp the 5V supply, the processor supply may be clamped too.

    In one design I did in the 80's, using a single board computer, the POR circuitry would work if I used a switching supply. When I used a linear supply it would not. I had to change the manufacturer's POR circuitry. Fortunately the fix was replacing a socketed chip with a Schmitt-trigger version of the same part.

    FETs have been used as transient suppressors, but I don't know how. I could guess. I discovered a very nasty transient in a manufacturer's electrometer, which they did not believe at first, but they provided an external solution that worked in one mode and eventually modified the instrument at no cost. Our semiconductor devices could easily be damaged by 2V transients. I measured transients of 100 V or more when the instrument switched ranges.

    I'll look closer at the schematic in your report.

    Transient damage can be a non-destructive and reversible phenomenon, and it can be semi-permanent, i.e. re-programming fixes it.

     

     

     

     

  • I've used some TransZorb devices in previous designs and they work well. They're also quite fast. Maybe they're good for your design.

    Vishay has a few with 4.1V breakdown at 1mA (Look at MSP3V3 which oddly seems like MSP430 3.3V).


    http://en.wikipedia.org/wiki/Tranzorb

    http://www.vishay.com/diodes/protection-tvs-esd/trans-zorb/

     

    Gustavo

  • Thanks for all the feedback.   BTW:   the schematics are in the test report.

    Our position is that we are not going to try any fixes until we can reliably induce the problem.    The transient due to EMI/RFI seems plausible but we don't want to apply any fixes until we have a way to test them.

    Here is the latest: We do use the BSL interface to do factory programming. The RST and BSL START pins are controlled from a special adapter board that can pull the lines low through a MOSFET. These 2 MOSFETs have pulldowns from gate to source for when the programming adapter is not connected. One hypothesis was that the BSL was accidentally started by a random pulse that would pull on these lines.

    I wired up another microcontroller (an MBED!) to generate pseudo-random RST and BSL START pulses at about a 1kHz update rate. The idea was to see if the BSL could be tripped into doing something. I powered the unit and let it go for 48 hours with no result.

    We also have the same motor we are using for the generator being powered by a benchtop supply to generate the same EM field. We placed this motor on top of the circuit to try to induce the problem. This test has been going for a while now with no problems.

    Next I am going to wrap the entire PCB in a large coil and pulse it randomly at high currents to simulate an extreme EM field. Any ideas about how to properly execute this would be great; it's out of my area of expertise!

    We'll get to the bottom of this!

     

     

  • Elisha Hughes said:
    The idea was to see if the BSL could be tripped into doing something

    Unlikely. It would have to be a sequence that exactly fits either the erase command or one of the write commands (including prefix/checksum) for the BSL to trigger a mass erase.
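
    A rough back-of-the-envelope estimate shows how unlikely. The sketch below assumes each random update contributes one independent, uniformly random bit and that a valid command is one exact pattern of `frame_bits` bits. The 80-bit frame length used in the note is a hypothetical figure for illustration only, not a value taken from the BSL documentation.

```c
/* Expected number of times one specific bit pattern of length frame_bits
 * appears in `updates` independent random bits: updates / 2^frame_bits
 * (ignoring the small end effect). */
double expected_random_matches(double updates, unsigned frame_bits)
{
    double expected = updates;
    for (unsigned i = 0; i < frame_bits; i++)
        expected /= 2.0;
    return expected;
}
```

    A 48 hour run at 1kHz gives 48 × 3600 × 1000 = 172,800,000 updates. For a hypothetical 80-bit frame the expected number of accidental matches is on the order of 1e-16, so a purely random pulse train is effectively never going to form a valid command.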

    However, in a previous post you said that there is no flash write code in the MSP, so the processor cannot possibly jump to such code and execute it erroneously.
    But there IS code in the MSP that does flash writes: in the BSL. If you jump into the BSL area, it could be that you bypass the command reception and directly execute code that writes to flash or performs erase actions.

    On the 5x family, however, the BSL area turns into vacant memory after the BSL exits (missing entry sequence after reset). So on these devices, the processor cannot jump into the BSL area; it would cause a reset or jump in place, but not execute the BSL code.

  • I don't think it's EM.  Duplicating it will probably be futile.  I'll tell you some stories that hopefully will make you change your mind:

    1. A thermocouple scanner back in the 80's would die and destroy a few CMOS chips. It turned out that, to save money, the manufacturer left out the regulators, so the CMOS was powered by an unregulated supply. 1000 W quartz lamps were used with temperature controllers. The time to failure was about 6 months. Adding the Transorbs cured the problem.

    2. My car. The clock module failed to work after a jump. It survived many jumps, just not that one. Disconnecting the battery and shorting the car's battery leads restored operation. Large transients move charge and it gets trapped. Shorting the supply leads allows the charge to leave, restoring operation.

    3. An HP calculator stuck in "comma for the decimal place". Removing power and shorting the power leads restored regular function.

    4. I was asked to look at a bicycle computer that was dragged across a carpet in an SUV and ceased to work. Removing the battery and shorting the battery contacts restored operation.

    In 2, 3 and 4 the failures were not destructive failures.

    If you had the ability to install a power line disturbance monitor on the 3.3 V DC line WITHOUT disturbing the original circuit, you MAY be able to detect, but not induce, this problem.

    High speed voltage suppressors and/or decoupling with resistors, inductors and capacitors just help.

    Automotive environments typically have -200 and +50 V short transients. It's a harsh environment. I had a National Semiconductor databook that discussed remedies in this environment. I'm not sure I can locate the databook or if it's even in my library right now.

  • Another update.

    We have been able to induce the problem on several occasions now.     It does not appear to be large transient related.       Doing "clean" power cycles from a bench top supply is enough to make it happen.    The power supplies, etc are all very clean in this test (ideal case).

    Now that we have made it happen under our control, we hope to narrow this down.   I'll keep you posted!

     

     

  • Thanks for the update.

    The more you discover what it is NOT, the more I'm curious what it IS.

  • Can you specify how you induced the failures???   We seem to be having some similar flash corruption, equally mysterious.

    Is there a certain rise/fall time that seems to trigger it???

  • I do not think ESI or problems in the power supply can directly erase or rewrite a lot of the Flash cells. But they could easily cause the CPU to go wild. A runaway CPU may indirectly cause Flash "corruption".

    Your code normally does not erase or write to Flash. The question is, can it mistakenly erase or write to Flash?

  • I can only speak to my situation, but this is definitely bit rot. There are single-bit errors in 1 - 3 separate bytes of the Flash space. I don't know of any mechanism to "write" a random bit.

  • Are those bits changed from 1s to 0s, or from 0s to 1s?

    A runaway CPU may cause the Flash controller to write a single bit of Flash from 1 to 0. (But not from 0 to 1.)

    A runaway CPU may also cause the Flash  to partially erase a segment or multiple segments of Flash. If that happens, some 0 bits may change to 1, while some other 0 bits may not.
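
    The direction of each bit change can be counted mechanically from a dump and the reference image. This is a minimal sketch with hypothetical names, assuming both buffers cover the same addresses. It relies only on the fact stated above: programming can only take bits from 1 to 0, while a change from 0 to 1 needs an erase.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    unsigned programmed;  /* bits that went 1 -> 0 (possible via a stray write) */
    unsigned erased;      /* bits that went 0 -> 1 (normally needs an erase)    */
} bit_changes_t;

bit_changes_t count_bit_changes(const uint8_t *reference,
                                const uint8_t *dump, size_t len)
{
    bit_changes_t c = {0, 0};
    for (size_t i = 0; i < len; i++) {
        /* set in the reference but clear in the dump: bit was programmed */
        uint8_t went_low  = (uint8_t)(reference[i] & ~dump[i]);
        /* clear in the reference but set in the dump: bit was erased */
        uint8_t went_high = (uint8_t)(~reference[i] & dump[i]);
        for (int b = 0; b < 8; b++) {
            c.programmed += (went_low  >> b) & 1u;
            c.erased     += (went_high >> b) & 1u;
        }
    }
    return c;
}
```

    A dump where only the `erased` count is non-zero points away from a stray write by a runaway CPU.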

     

  • In every case except 1 it was a 0 to 1 transition, i.e. self erasing bits..

  • That is strange. A runaway CPU cannot do that. I do not think even a runaway Flash controller can do that. (It is not an EEPROM.) The erase process cannot be localized unless you use alpha particles or things like that.

  • old_cow_yellow said:
    That is strange. A runaway CPU cannot do that.

    Indeed. While a read-modify-write may clear single bits, setting a bit again, and only one bit, requires a precise sequence: read the segment, alter the bits, erase the segment, reprogram the segment with the altered bits.

    There are only two other explanations: faulty flash (maximum write cycles exceeded? factory fault?) or bad programming.
    The latter may happen.
    The normal FET programming is done by loading a program through JTAG into the MSP's RAM. This program flashes a segment (or part of it). On older MSPs, the program needs to set the write timing without knowing the actual DCO frequency. So it may well be that the required minimum or maximum flash clock frequency is exceeded and the write process is outside normal parameters. Things like that often happened with self-written flashing code (even in the original TI demo code!). Then the data retention time is greatly reduced and single bits may fall back from 0 to 1 after weeks.

    I can imagine that certain power-on or power-off or voltage spike/ESD conditions may aggravate the problem. However, there's no proof.
    On newer MSPs, the flash write timing is internal and should always be inside the allowed range. With an accent on 'should'.
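
    To illustrate the timing problem on the older parts: the flash timing generator clock is the source clock divided by a programmable divider, and it has to land in a narrow window. The sketch below uses the roughly 257 kHz to 476 kHz window and the 1..64 divider range as recalled from the MSP430x1xx family documentation; treat both as assumptions to verify against the device documentation. The function names are hypothetical.

```c
#include <stdint.h>

/* Allowed flash timing generator range (assumption: verify against the
 * family user's guide and device datasheet). */
#define F_FTG_MIN_HZ 257000u
#define F_FTG_MAX_HZ 476000u

/* Smallest divider (1..64) that brings clock_hz into the allowed range,
 * or 0 if no divider works. */
unsigned pick_ftg_divider(uint32_t clock_hz)
{
    for (unsigned div = 1; div <= 64; div++) {
        uint32_t f = clock_hz / div;
        if (f >= F_FTG_MIN_HZ && f <= F_FTG_MAX_HZ)
            return div;
    }
    return 0;
}

/* Returns 1 if a divider chosen for an assumed clock is still in range when
 * the real clock turns out to be different (e.g. an uncalibrated DCO). */
int ftg_in_range(uint32_t actual_clock_hz, unsigned div)
{
    if (div == 0)
        return 0;
    uint32_t f = actual_clock_hz / div;
    return f >= F_FTG_MIN_HZ && f <= F_FTG_MAX_HZ;
}
```

    For an assumed 1 MHz clock the sketch picks a divider of 3 (about 333 kHz). If the DCO actually runs at 1.5 MHz, the same divider yields 500 kHz, outside the window, which is the kind of marginal write described above.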

  • Ok, here is the latest update.      The problem could be induced by simply power cycling the device enough times.   We had a test setup where we could continuously cycle many boards at about a 1Hz rate.    

    The results: the only thing that fixed the problem was to assume that neither the brownout detection circuitry nor the SVS works. We could reduce the failure rate by getting the SVS trip point as close to the +3.3v rail as possible. Using an external MCU supervisor bugged in was the best solution. It didn't even have to be anywhere near the supply rail.....

    It's possible we got a bad batch.... In the end it had nothing to do with spikes, voodoo, etc. (Boards were continuously zapped with a 25kV static gun and we could not cause a failure.) The 1611 did not like the supply voltage going down.

    We investigated the power supply circuit and it always came up and down in a clean manner.   Also note that we have a product that uses the exact same power supply configuration but has a F2618 device with no issues.    I also have non-TI parts with the exact same power supply configuration and are placed in the same operating conditions with no issues.

    Bottom line,  don't trust the internal brownout or SVS in the MSP430.  I learned my lesson to always use an external supervisor and never trust  what is in a device.

     

     

     

     

     

     

  • I just did a quick scan through all the posts, and it doesn't appear that anyone has asked this yet, so I thought I would give it a shot.

    You mention that you are using the SVS in your tests to double check that the voltage is high enough before execution. But how are you checking that the voltage is high enough to run the SVS? It has a minimum voltage of 2V, which is well above where the BOR is going to release.

    I just wanted to confirm the test conditions before we throw in the towel.

  • darkwzrd said:
    But how are you checking that the voltage is high enough to run the SVS? It has a minimum voltage of 2V, which is well above where the BOR is going to release.

    My experience is that the SVS holds the device in reset state until the required voltage is reached.
    But on 1x devices, the SVS is inactive after power-up, so it needs to be programmed first. However, on a brownout after being programmed, the SVS holds the device in reset state (without resetting itself) until the programmed voltage is reached again. Or the BOR triggers, which deactivates the SVS.

    When I wrote the core firmware for the 1611, I experimented with the SVS a lot. I never experienced a flash failure, even with slowly rising and falling VCC. The only 'problem' was that after it was started once and reached the desired 3.4V minimum, after a drop of the supply it appeared dead (held in reset state) until 3.4V was reached again or the power was completely off. Since the 'wait for voltage' function was blinking the LEDs, this was not optimal. (On a power-on with low voltage, you could see the LEDs signalling the low voltage, but after an SVS reset, this code was not executed, of course, until the SVS had released the CPU at proper voltage.) But the SVS never failed me during the experiments (or later).

  • 3.4V threshold? Was that with a 3.6V supply? I'll keep your experiments in mind in the future.

    The OP also said he was using a high threshold (in his case 3.3V).

     

    JMG, have you done any tests with the SVS with a lower threshold? I would be interested to hear of your findings.

  • darkwzrd said:
    3.4V threshold? Was that with a 3.6V supply?

    Yes. For operating an MSP430F1611 on 8MHz.
    Unfortunately, the 1611's SVS was so imprecise that I couldn't risk going nearer to 3.6V than 3.4V, even if this did leave some risky area, as 3.4V (worst case even a bit lower) wouldn't be enough to ensure 8MHz stability. But well, in case of a crash, there was still the watchdog :)

  • Since you mentioned that you use the 1611 running at 8MHz, I thought I would send a friendly reminder that you may want to have a look at the new 1611 errata released June 22, 2011, if you haven't already. For some reason, even though I was subscribed to the 1611, I don't think I got a notification for it. I just happened to see the update when researching for this thread!

     

    It is an update on errata DMA9 (Timer B and DMA interrupt coincidence leads to bad vector fetch). I came across this bug late 2007, and probably spent close to a month whittling down the code to a repeatable test case. What a nightmare! Although, I do admit it was fun when we finally figured out what was going on. I hooked up a function generator to the timer capture input, and cranked up the frequency until it popped! ;)

     

    Anyways, CPU41 is the same problem with the bad vector fetching, but it is more generalized to occur with any two vectors which have a "priority delta" greater than 8. However, the errata says the race only happens if you are running above 6 MHz.

  • darkwzrd said:
    I thought I would send a friendly reminder that you may want to have a look at the new 1611 errata released June 22, 2011, if you haven't already

    Thank you very much for this reminder. That's very surprising.
    I'm almost done with the 1611, new projects will use the 5438, so I didn't check for updates.

    That's a pretty nasty bug. It basically means that you cannot use interrupts for TimerB and USART1. If it were just that the wrong priority is called first, it would be bearable. But a corrupted vector fetch is really bad.

    Luckily, the maximum priority difference in my projects is 7.

    I can easily see why you spent so much time tracking it down. Static bugs (which happen all the time) are already a nuisance, but dynamic bugs of this kind are really a pain in the A**.

  • Sorry for answering such an old thread but I came across this same problem and was able to completely arrest the errant flash writes with a solution that had not been suggested in the vast amount of attention that this problem has accrued. 

    I found that when the device powers down flash writes occur randomly when VCC falls below 1.7V.  I know this because I set-up a simple flash erase write sequence that is never called so I could set around the sequence the set and clear of a test output port.  In the start of the code I set the test port high.  What I found that during VCC ramp down below 1.7V to .9V the test point would on occasion pulse low to ground instead of floating down with VCC.  I found that I had to cycle power as much as 10000 times before the event would occur.  After the occurrence I would find some part of the flash erased or altered (1 gone to 0). 

    The fix.

    I tried all of the Wiki suggestions and everything that I could find in the forum with no success.  The problem was completely fixed by:

    1) Enable POR with the SVS set appropriately for the MCU speed (I used 2.9V, 7.8 MHz).  Add the following to _low_level_init, the first code to execute after a POR:

      /* SVS is enabled and its comparator still reports VCC below threshold */
      if ((SVSCTL & SVSON) && (SVSCTL & SVSOP)) {
        /* trap here: enter LPM4 without enabling interrupts */
        __bis_SR_register(LPM4_bits);
      }

    After a PUC the SVS will be cleared (off).  After a POR and SVS init, this will trap the SVS fault and shut off everything.  Now the MCU cannot flash anything until the next PUC.

    Now after 70000+ power cycles flash is not corrupted.

    Done!

  • Greg Greenwood said:
    I tried all of the Wiki suggestions and everything that I could find in the forum with no success.

    The problem is well addressed in the forum and wiki FAQ. You should read the chapter "running CPU out of specified freq". Excerpt: "it is advised to keep the device in reset state as long as the supply voltage drops below the adequate minimum voltage needed for running the CPU at the higher frequency. Failing to do so, the device can't be no longer guaranteed to work properly, and even it could cause severe damage such as flash memory corruption"

    I recall at least a few threads discussing flash corruption or erase problems caused by running with an out-of-spec freq/VCC combination.

    Anyway, thanks for reconfirming the problem and showing a verified solution!

  • So likely, it’s not the CPU itself that causes the problem, but the flash controller that runs wild if it is accessed (even for reading) when the supply voltage is too low for the operation. Well, makes no difference anyway.

  • Unfortunately I have a sad update to this issue.  After proving a fix in a unit that readily fails (1 in 20 power cycles) with code that reduced the failure to 0 (300K+ power cycles), we have had more failures.  All of the current failures were verified as containing the code fix.  Before turning to designing in an external SVS, the question was raised: will holding reset really fix what appears to be a rogue flash controller (or the flash itself)?  We set up a unit that fails 1 in 1200 power cycles with the latest code, with the cap on the reset input shorted.  In other words, the unit is ALWAYS reset.  After 24 hours of power cycling (14000+ cycles) the unit turned up sections of memory erased (All FF) which previously contained 3F FF.  In some of the erased segments, the first few bytes contained random values.

    The fix appears to be to design out the MSP430F1611.  It cannot be trusted.

  • Hello Greg,

    Wow... that's it! This is both sad and alarming news.

    Kudos to you & team for the excellent idea of a test that proves that the chip (whole series?) is faulty.

    Hopefully TI will comment.

  • "In other words, the unit is ALWAYS reset. After 24 hours of power cycling (14000+ cycles) the unit turned up sections of memory erased "
    Just to play the Devil's Advocate:
    Did the flash corruption happen during the 14000+ power cycles while the device was in reset or during the one where you had the device out of reset to check its flash content?
  • I ran the test a second time to be certain and got the same result.  The process is as follows:

    1) Power up the board

    2) Program the device

    3) read the flash and save the contents

    4) short the reset to ground (note: power still on)

    5) disconnect the debugging pod (note: power still on, device in reset)

    6) Turn power off

    7) Start power cycling with reset shorted to ground (cycle 3 seconds on, 3 seconds off, note: enough to get through initialization and post main start)

    8) Run cycles for 24 hours

    9) Power on (constant)

    10) Attach debugger (device still in reset)

    11) remove reset

    12) read contents and compare with read just after programming.
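
    For the comparison in step 12, it helps to classify what changed rather than just report a mismatch, since an erased byte (back to FF) and a partial write (bits gone 1 to 0) point to different failure modes. A minimal host-side sketch, assuming both reads were dumped to byte buffers of equal length; the `flash_diff_t` type and `flash_compare` function are hypothetical names, not part of any TI tool:

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Summary of the differences between two flash images (hypothetical type). */
    typedef struct {
        size_t changed;       /* bytes that differ at all                     */
        size_t erased;        /* differing bytes that now read 0xFF           */
        size_t bits_cleared;  /* differing bytes where bits only went 1 -> 0  */
        size_t mixed;         /* any other change (some bits set, not 0xFF)   */
        size_t first_offset;  /* offset of first difference, if changed > 0   */
    } flash_diff_t;

    flash_diff_t flash_compare(const uint8_t *before, const uint8_t *after,
                               size_t len)
    {
        flash_diff_t d = {0, 0, 0, 0, 0};

        for (size_t i = 0; i < len; i++) {
            if (before[i] == after[i]) {
                continue;
            }
            if (d.changed == 0) {
                d.first_offset = i;
            }
            d.changed++;

            /* Bits that are 1 now but were 0 in the reference image. */
            uint8_t newly_set = (uint8_t)(after[i] & (uint8_t)~before[i]);

            if (after[i] == 0xFF) {
                d.erased++;          /* looks like an erase */
            } else if (newly_set == 0) {
                d.bits_cleared++;    /* looks like a stray write */
            } else {
                d.mixed++;
            }
        }
        return d;
    }
    ```

    With the two dumps loaded into memory, `flash_compare(ref, now, len).changed == 0` is the pass condition; the other counters give a first hint of whether the damage looks like an erase or a write.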

    I am going to repeat this test with completely non-functional code, loading 3F FF throughout the entire address space.  I notice that this is what the compiler does from the end of my code to the end of flash.  I will let you know the results.

    Do not act on any conclusions at this point.  More verification is needed.

  • In the first test process, there is an error.  On the first power cycle there is code that initializes three areas of flash, which then changes the image read later.  So the procedure was changed: between steps 2 and 3, a reset is issued from the pod so that the code completes the first-time init.

    After the first 24 hours (14400 cycles) the flash has NOT changed.

    The test is still running and will be checked daily with a report made.

  • Day 2 (28800 cycles) the flash has not changed.
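
    The cycle counts quoted in these updates follow directly from the 3 seconds on / 3 seconds off timing in step 7. As a small sketch of the arithmetic (the function name `power_cycles` is made up for illustration):

    ```c
    #include <stdint.h>

    /* Number of complete on/off power cycles that fit into `hours` of testing,
     * given the on and off times in seconds. Returns 0 for a zero period. */
    uint32_t power_cycles(uint32_t hours, uint32_t on_s, uint32_t off_s)
    {
        uint32_t period_s = on_s + off_s;
        if (period_s == 0) {
            return 0;
        }
        return (uint32_t)((hours * 3600UL) / period_s);
    }
    ```

    With a 6 second period this gives 14400 cycles per 24 hours, which matches the day 1 and day 2 figures reported here.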

  • Thanks for running the long test

    Greg Greenwood said:

    ... 4) short the reset to ground...

    ... 7) Start power cycling with reset shorted to ground (cycle 3 seconds on, 3 seconds off, note: enough to get through initialization and post main start) ...

    ... 11) remove reset ...

    Do you mean that from step 4 through step 11, the reset pin is always grounded? If so, what is "enough to get through initialization and post main" in step 7?

    Another question.

    Greg Greenwood said:
    2) Program the device

    Could you provide an image of that program? E.g. a flash dump in either Intel-extended or msp430-txt format?

  • Answer to question 1)
    Between step 4 and step 11 the reset is always grounded. This test is being run because there is the question, "Will holding reset with an external monitor provide an everlasting complete fix or do we need a different micro?"

    Answer to question 2)
    Sorry, no. The code being used is the released code for the device and proprietary. However, the problem is not code dependent, hence its widespread nature across many designs. One could repeat the same test with "Blinky" or a main infinite loop.
  • Thanks.

    About the proprietary code:

    The power-up MCLK frequency is under 1MHz, when do you change to a higher frequency (if ever)? After some delay? Or after you checked the operating voltage?

    Does it include any instruction (whether it ever gets executed or not) that could change the contents of Flash Controller Registers?

    Does it include any instruction (whether it ever gets executed or not) that could write to Flash Memory Addresses?
  • In Lowinit.c, which runs before main, a check is made for the SVS flag, and the SVS is set up if no flag is detected.  If a flag is detected the micro is put into LPM4 as a trap.  The first code to run in main is an initialization routine to select the operating frequency driven by an external clock device, after which a 100us delay is forced.  After that the MCLK source is configured to use the external source, which takes the MSP430 from the 1MHz default to 7.98MHz.  During the MCLK routine the IFG1 bit is used to wait for a stable clock.  After a 0.5s delay the flash is read to test for non-erased user settings.  Proper voltage for operation is assumed at this point, since the SVS is set to 2.9V and a POR results if VCC goes below that.

    The code contains routines for reading and writing flash which uses FCTL1 - FCTL3.  Again it is assumed that if the Vcc falls below 2.9V a POR will result.

    After 5 days of power cycling (72000 cycles) while reset is asserted, no flash writes or erasures have been detected.  This is from a unit that would fail with flash erasure within 100 cycles with reset pulled up with a 47k resistor and a 10nF cap connected to ground.

    This gives reasonable confidence that an external SVS will fix our flash erasure issues.
