MSP430G2353: Flash Data Corruption Issue

Part Number: MSP430G2353
Other Parts Discussed in Thread: MSPWARE

Hello,

We are currently experiencing some field failures on our product which we have narrowed down to some sort of Flash memory corruption on our MSP430G2353.

We have found evidence from a few select examples of failures where portions of our information memory segment have been erased or corrupted, and were hoping to get to the root of why this is happening so we can prevent the issue from occurring entirely if possible. We’re not sure if the issue (which doesn’t seem to be widespread) is related to location/environment, or a by-product of something that we are doing in our firmware.

A few beginning questions:

  1. Are there any known environmental factors (static shock, electrical storms, humidity, etc) that can cause Flash corruption/erasure?
  2. Is there a cutoff voltage where the flash could be corrupted? Would this only happen if the low voltage event happened during a flash write/erase?
  3. We are operating at 8MHz and the “Safe Operating Area” graph in Figure 1 of the product datasheet seems to show the cutoff of flash writing at 2.2V, but this graph also seems to show that program execution would not occur under those same conditions.
    • Could we get specificity on what happens if our core voltage slides below the suggested supply voltage for program execution?
    • Figure 12 of the DS seems to say that BOR doesn’t happen till 1.35V… what happens if we are at 2.0V and attempt a flash write?
    • Is there a difference in voltage required for flash erase and flash write?

Below I have copied our base code for flash erase and flash writes. Our basic operation is to read the flash segment into RAM, modify the RAM buffer with our desired contents, erase the flash, then reprogram the entire segment with our desired contents, one 16-bit word at a time starting at the lowest address. Our MCLK is at 8MHz, SMCLK = MCLK/8 = 1MHz, flash timing generator = SMCLK/2 = 500kHz.

Is there anything in that process or the code below that should be changed?

/******************************************************************************
* EraseFlash - Erase a segment of flash in the MSP430
******************************************************************************/
void EraseFlash(unsigned int address)
{
	__disable_interrupt();				// Disable interrupts. This is important, otherwise,
										// a flash operation in progress while interrupt may
										// crash the system.
	HOLD_WATCHDOG_TIMER();              // Disable watchdog timer
	while(FCTL3 & BUSY);				// Check if Flash being used
	FCTL2 = FWKEY | FSSEL_2 | FN0;      // Clk = SMCLK/2
	FCTL3 = FWKEY;                      // Clear Lock bit
	FCTL1 = FWKEY | ERASE;              // Set Erase bit
	*((unsigned int*)address) = 0;		// Dummy write to erase Flash segment
	while(FCTL3 & BUSY);                // Check if Flash being used
	FCTL1 = FWKEY;                      // Clear Erase bit
	FCTL3 = FWKEY | LOCK;               // Set LOCK bit
	RESUME_WATCHDOG_TIMER();
	__enable_interrupt();
}

/******************************************************************************
* WriteFlash - Write a segment of flash in the MSP430
******************************************************************************/
void WriteFlash(unsigned int address, unsigned int *flash_buffer)
{
	unsigned char i;
	unsigned int* ptr_to_addr = (unsigned int*) address;

	__disable_interrupt();				// Disable interrupts.
	HOLD_WATCHDOG_TIMER();			    // Disable watchdog timer
	while(FCTL3 & BUSY);				// Check if Flash being used
	FCTL2 = FWKEY | FSSEL_2 | FN0;		// Clk = SMCLK/2
	FCTL3 = FWKEY;                      // Clear Lock bit
	FCTL1 = FWKEY | WRT;				// Set WRT bits for write operation

	for (i = 0; i < (FLASH_INFO_SEGMENT_SIZE/2); i++)	// Word write is faster than byte write
	{
	  *ptr_to_addr = *flash_buffer;		// copy value to flash
	  ptr_to_addr++;
	  flash_buffer++;
	  while(FCTL3 & BUSY);				// Check if word write is done
	}

	FCTL1 = FWKEY;						// Clear WRT bit
	while(FCTL3 & BUSY);				// Check if Flash being used
	FCTL3 = FWKEY | LOCK;               // Set LOCK bit
	RESUME_WATCHDOG_TIMER();
	__enable_interrupt();
}

One more thing of note: below is a sample of the data that we found to be corrupt.

The curious thing to note on this is although we are doing word-wise writes of 16-bit values, our data showed the low byte of one value (at 0x1044) intact, while the high byte of the same word, and multiple following bytes, all seemingly erased to 0xFF. Is there any reason this should be? Does the compiler break down these writes into single byte writes? Or is this indicative of a failed/interrupted flash write?

Segment D:
1000: xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx
1010: xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx
1020: xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx
1030: xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx
Segment C:
1040: xx xx xx xx 7C FF FF FF FF FF FF FF xx xx xx xx
1050: xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx
1060: xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx
1070: xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx
Segment B:
1080: xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx
1090: xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx
10A0: xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx
10B0: xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx xx

Regards,
Tyler Witt

  • Hello Tyler,

    The following app note should help you in trying to narrow down your issue. http://www.ti.com/lit/slaa729

    One question I had about your segments above: are the x's 0's or 1's? This would tell you whether it's a botched write or a botched erase.

    On another note, your flash timing generator frequency is out of spec for this part. 500kHz is too fast for the flash timing generator, which the datasheet limits to 257kHz to 476kHz. Due to this, we cannot guarantee correct operation of the device with respect to flash programming/erasing.

    1. Environmental influence can be a factor in flash corruption, but it would be atypical without a combination of other factors as well. For example, significant noise on the JTAG and/or BSL lines could cause a mass erase of the device.

    2/3.
    Again, too low a voltage can definitely be a factor, but just undervolting the part wouldn't corrupt flash. You can access flash down to the minimum operating voltage of the part (which depends on frequency). You cannot modify flash below 2.2V, though, as this is the minimum supply voltage for doing so. The 2.2V minimum applies to both programming and erase. If you try to write to flash below that limit, there is no guarantee the flash will program/erase correctly, and it will most likely be corrupted. It is a similar story for operating below the minimum voltage spec for a given frequency. I can't say exactly what will happen in a particular device, but PC corruption, and flash corruption as a result of it, is possible, especially if you have flash writing routines within your application.
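As a rough illustration of the timing generator point above, here is a minimal sketch of the divider arithmetic. It assumes the 257kHz to 476kHz window from the datasheet and that the FCTL2 FNx field encodes the divider minus one; `ftg_divider` is a hypothetical helper name, not a TI function, and the limits should be checked against the current datasheet revision.

```c
#include <stdint.h>

/* Flash timing generator limits in Hz (assumed from the MSP430G2x53
 * datasheet; verify against the current revision). */
#define FTG_MIN_HZ 257000UL
#define FTG_MAX_HZ 476000UL

/* Return the smallest divider (1..64) that brings src_hz into the allowed
 * window, or 0 if no divider does. On the target, the FCTL2 FNx field
 * would be loaded with (divider - 1). */
unsigned ftg_divider(uint32_t src_hz)
{
    for (uint32_t div = 1; div <= 64; div++) {
        /* Compare src_hz/div against the limits without losing precision. */
        if (src_hz <= FTG_MAX_HZ * div)
            return (src_hz >= FTG_MIN_HZ * div) ? (unsigned)div : 0u;
    }
    return 0u;
}
```

For a 1MHz SMCLK this returns 3 (about 333kHz), whereas a divider of 2 gives the out-of-spec 500kHz discussed in this thread.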
  • The x's are unfortunately unknown. This is a subsystem that we don't have complete visibility into for defects in the field. The "7C FF FF FF FF FF FF FF" array is the only part we've been able to glimpse. That should be a set of four 16-bit values, each in the range of 100-800, so some of the data definitely seems erased, but maybe not all of it. We were wondering: could the 0x7C indicate that, after page erasure, we only managed to write through address 0x1044, and then the rest didn't get written? And since the flash writes are done as 16-bit words, does it make sense that one byte of a 16-bit word would be intact, but the other not? Does that give a clue as to whether things went wrong 1) when writing to erased memory vs 2) erasing written memory vs 3) none of the above?

    Tyler mentioned the flash timing generator frequency. What happens if it's clocked too fast and there's a problem? How does that present? What would that look like? What happens to the BUSY flag in FCTL3? Because one thing that we've noticed is that the part seems to become unresponsive, sometimes for minutes, sometimes for hours, and then when it suddenly starts responding to serial commands again, we see that the data is now corrupt.
  • Hey Jace,

    Thanks for the linked app note - I'll be sure to dive into that as a way to start our debug process.

    As Chris was saying - the MSP is a subsystem that we can't readily read all the info memory sections from, but some portions are made available through some back channels in our application and we could get at least that out. Until we get a physical failure in house, we cannot say what the rest of the "xx's" are.

    I also have a hard time thinking the flash timing generator frequency is the issue - this is a system that has been in the field for the better part of 5 years, without issue until recently. We will look into changing this for future product, but as Chris was also alluding to - how do we know if this is the issue? How does overclocked flash present? Can you be more specific than "cannot be guaranteed to be correct"?

    Regards,
    Tyler Witt
  • Chris, Tyler,

    For the Flash timing being too fast, we specify a maximum limit for the generator in order to ensure proper operation of the flash control module. As for what could happen if it's too high? I'm not sure, but flash corruption could be a result. Some units may work fine, but across temperature and different lots, some may fail in some fashion. Unfortunately, I cannot be more specific, as our devices are not simulated or tested outside the specifications given. My speculation on possible scenarios would be corruption of individual bits because the charge pump doesn't have enough time to charge the Flash cells, or a timing/desynchronization issue between the flash controller and the address bus, causing either wrong data to be written to an address or the wrong address to be loaded for writing/erasing. Add in the possibility of undervolting the part while trying to run the controller too fast, and anything is possible.

    Without knowing what exactly the memory is being written as, there is not much you could do to narrow down the issue here. I can say that for this particular part, the most common issue is a frequency vs voltage violation or a DVCC voltage violation when trying to do in-system programming. Both of these situations could cause flash corruption, or a whole host of other issues, as the part is being undervolted in these situations. The typical solution on this particular part is adding an external voltage supervisor chip to hold the part in reset during these unstable power conditions, as this part does not have an internal SVS to perform this function.
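As a software complement to the external supervisor suggested above (not a replacement for it), firmware could refuse to start a flash operation when the supply is close to the 2.2V limit. The sketch below only shows the conversion arithmetic, assuming a 10-bit ADC reading of VCC/2 taken against a 2.5V reference. The function names and the 100mV guard band are hypothetical, and how the reading is obtained on the target is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define ADC10_FULL_SCALE  1023u  /* 10-bit converter */
#define VREF_MV           2500u  /* assumed: 2.5 V reference */
#define FLASH_MIN_VCC_MV  2200u  /* minimum supply for program/erase */
#define VCC_MARGIN_MV      100u  /* hypothetical guard band */

/* Convert a raw conversion result of VCC/2 into millivolts of VCC. */
uint32_t vcc_mv_from_adc(uint16_t raw_half_vcc)
{
    return ((uint32_t)raw_half_vcc * 2u * VREF_MV) / ADC10_FULL_SCALE;
}

/* True if the measured supply leaves some margin above the flash minimum. */
bool vcc_ok_for_flash(uint16_t raw_half_vcc)
{
    return vcc_mv_from_adc(raw_half_vcc) >= FLASH_MIN_VCC_MV + VCC_MARGIN_MV;
}
```

A single reading taken before the operation cannot catch a dip that happens during the erase or write, which is why a hardware supervisor remains the more robust option.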
  • Jace H said:
    Without knowing what exactly the memory is being written as, there is not much you could do to narrow down the issue here.

    Well, we do know part of it though.  We don't use all of Segment C, and we don't know what the first 4 bytes look like, but again, we have an 8-byte contiguous sample of what remains, and we posted it above:

    Segment C:
    1040: xx xx xx xx 7C FF FF FF FF FF FF FF xx xx xx xx

    As previously stated, there should be non-FF data at those addresses.  That doesn't look like "possible corruption of individual bits", and further examination of device battery data implies that we should NOT have been in danger of an undervoltage condition.  Hence my questions above, which I'll restate here, separated for clarity:

    1. Would the "7C" followed by a series of "FF"s indicate that, after page erasure, we only managed to write through address 0x1044, and then the rest didn't get written?
    2. Is that even possible, since the flash writes are done as 16-bit words?  Is it still done a byte at a time at some level, such that it makes sense that one byte of a 16-bit word would make it intact, but not the other?
    3. Does that snippet give a clue as to whether things went wrong...
      a. when writing to erased memory? (i.e., can you say this is what it looks like if you erased the page to all FFs and then started writing a buffer back to the segment but it errored partway through?)
      b. erasing written memory? (i.e., can you say this is what it (possibly also) looks like if something went wrong while erasing the page, or should it all erase simultaneously rather than sequentially from the highest address downward?)
      c. none of the above? (i.e., can you say that if either of those situations occurred, it would look different than a 7C followed by a series of FFs?)

  • Hello Chris,

    1-2: It could be possible, as this flash controller is capable of byte or word writes. Erasing can only happen at the segment level or greater.
    3: With the given information, both situations (a) and (b) are possible.
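One way to see why the sample cannot separate the two cases: programming only clears bits (1 to 0) and erasing only sets them (1s everywhere), so both an interrupted write onto an erased segment and an interrupted erase of good data leave an image that differs from the intended data only by extra 1 bits. The sketch below checks for that pattern. The helper names are hypothetical, and the expected values used in any test are illustrative, since the real expected data is not given in the thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if every 1 bit of the intended data is still 1 in the observed data,
 * i.e. the observed image differs from the intended one only by extra 1s.
 * That is consistent with either an interrupted program or an interrupted
 * erase; false suggests something else, such as wrong data being written. */
bool only_extra_ones(const uint8_t *seen, const uint8_t *want, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if ((seen[i] & want[i]) != want[i])
            return false;
    }
    return true;
}

/* Index of the first byte that differs from the intended data, or n if the
 * two images are identical. */
size_t first_mismatch(const uint8_t *seen, const uint8_t *want, size_t n)
{
    size_t i = 0;
    while (i < n && seen[i] == want[i])
        i++;
    return i;
}
```

With the observed bytes `7C FF FF FF FF FF FF FF` and any intended data whose first byte is 0x7C, `only_extra_ones` returns true, so the check agrees with the answer above that both scenarios remain possible.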
  • Just to clarify, 3b IS possible? Erasing doesn't happen all at once? Data at the end gets erased before data at the beginning?
  • Chris,

    I'm not sure of the exact order in which things get erased, but it is certainly possible for an erase to be interrupted, leaving a segment not fully erased and possibly containing garbage or partially erased data.
  • Okay, thanks. And even though we're doing word writes, it still actually does them a byte at a time so we could see one byte written and one byte not?

    Also, I want to address the fact that when this happens, the part is unresponsive to serial commands for sometimes hours on end. Per the example code for erasing and writing flash, we're disabling the WDT during the sequences, yet there's a while() loop where we wait forever for the BUSY flag. Is it possible that clocking the flash timing generator too fast could result in an indeterminate delay that could last hours? Or is there some other explanation that is more likely?
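For the unbounded `while(FCTL3 & BUSY);` loops raised above, a bounded wait is one option. This is a minimal sketch, not TI's recommended procedure: the busy check is passed in as a function so the logic can be exercised off-target, and `wait_flash_idle` and `fake_flash_busy` are hypothetical names. On the target, the callback would simply return `(FCTL3 & BUSY) != 0`.

```c
#include <stdbool.h>
#include <stdint.h>

typedef bool (*busy_fn)(void);

/* Poll is_busy() until it reports idle or max_polls polls have been made.
 * Returns true if the flash went idle, false on timeout. */
bool wait_flash_idle(busy_fn is_busy, uint32_t max_polls)
{
    while (max_polls--) {
        if (!is_busy())
            return true;
    }
    return false;
}

/* Host-side stand-in for the BUSY flag: reports busy for a preset number
 * of polls, then idle. */
static uint32_t fake_busy_polls;

bool fake_flash_busy(void)
{
    if (fake_busy_polls > 0) {
        fake_busy_polls--;
        return true;
    }
    return false;
}
```

The poll limit would need to be sized from the expected erase and write times plus a generous margin, and a segment whose operation timed out should be treated as untrusted until it has been erased and rewritten.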
  • Chris,

    Keep in mind, the Flash for this part can be written by byte or word. If you are using word writes, then it will perform word writes. Up above, I was just stating it's possible to write via byte.

    For your second question directly above: could it be possible? Sure. I could see a timing mismatch if an internal flag-clearing event happens so quickly that the flash controller misses it because it is being clocked too fast.
  • Hey Jace,

    So to start off - we are changing the flash timing generator frequency and putting it within the proper specification (dividing 1MHz SMCLK by 3 for 333kHz nominal), that's our hopeful "silver bullet" to the issue, but for now we see it as Step 1.

    Since reproducing our issue is very difficult, we're going to produce some units with this change and see if we can track things down further as we test with those. Following the specifications more closely should not introduce any risk as we roll out changes, but some of our other actions may.

    For future planning, if we have everything in line with spec (timing generation, voltage, etc), but have a similar situation arise, what would you suggest for us to do?

    • Should we get rid of our while loop around the FCTL BUSY bit? I noticed from Dung's code example in MSPWare (circa 2010) that he did not check this bit at all, and just progressed without ensuring the write was done. 
    • Is the waiting for the BUSY flag only necessary when enacting writes from code that is executing in RAM? 
    • Or is the BUSY flag something necessary when writing words rather than bytes? Is it better to only write one byte at a time?
    • Should we re-install the WDT to our flash writes to help us break out in the case that we are stuck in the same position?
    • Should we create a software-based timeout feature to break out of our BUSY while loop when writes fail?
      • Can we estimate how much time our flash writes (for info segments only) should take? We are always doing 64-byte writes, so this should be somewhat consistent timing.
    • Will we be able to trust the Flash ever after a failure? How do we ensure that IF we are stuck at the busy bit of the FCTL that we can reset the flash?
      • Do we need a WDT reset or POR to make this happen, or is there a software way of doing this?
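On the timing question in the list above, a rough estimate can be made from the flash timing generator cycle counts. The sketch below assumes 4819 cycles for a segment erase and 30 cycles per word write, which is how I read the datasheet's flash memory table; please verify both figures. It also ignores CPU overhead between words. `segment_rewrite_us` is a hypothetical helper.

```c
#include <stdint.h>

#define CYCLES_SEGMENT_ERASE 4819u  /* assumed datasheet value, FTG cycles */
#define CYCLES_WORD_WRITE      30u  /* assumed datasheet value, FTG cycles */

/* Estimated time in microseconds to erase one segment and program `words`
 * words, with the timing generator running at src_hz / divider. */
uint32_t segment_rewrite_us(uint32_t src_hz, uint32_t divider, uint32_t words)
{
    uint64_t cycles = CYCLES_SEGMENT_ERASE + (uint64_t)words * CYCLES_WORD_WRITE;
    return (uint32_t)((cycles * divider * 1000000ull) / src_hz);
}
```

Under those assumptions, a 64-byte (32-word) info segment at SMCLK/3 from a 1MHz SMCLK comes out to roughly 17ms, about 14.5ms of which is the erase.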

    Regards,
    Tyler Witt

  • Hey Tyler,

    Your first step sounds good. I'll mark TI Thinks Resolved on my reply here in order to close this post for now. If you get additional results after the new units are made, or if you have additional questions, replying to this thread will open it back up.

    For the BUSY flag, this only needs to be checked if initiating an erase or write from RAM. If doing so from flash, the procedure is slightly different. Please see sections 7.3.2.1 Initiating an Erase from Within Flash Memory, and section 7.3.3.2 Initiating a Byte or Word Write from Within Flash Memory for those procedures.

    If you are writing/erasing from RAM and access flash while BUSY is asserted, an access violation occurs, ACCVIFG is set, and the write result is unpredictable.

    If you were to create some kind of timeout and exit the procedure via the emergency exit function, then you cannot trust the segment of flash that was being operated on. You would need to erase and/or rewrite it. Some CRC checks could help you detect whether an issue has occurred.
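A minimal sketch of the CRC idea, assuming a hypothetical layout in which the last two bytes of each 64-byte info segment hold a CRC over the first 62. The polynomial and layout are illustrative choices, not something specified in this thread.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SEGMENT_SIZE  64u
#define PAYLOAD_SIZE  62u  /* hypothetical: last 2 bytes hold the CRC */

/* CRC-16/CCITT-FALSE: polynomial 0x1021, initial value 0xFFFF, no reflection. */
uint16_t crc16_ccitt(const uint8_t *data, size_t len)
{
    uint16_t crc = 0xFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((uint16_t)data[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000u)
                crc = (uint16_t)((crc << 1) ^ 0x1021u);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* True if the CRC stored at the end of the segment (low byte first) matches
 * the CRC computed over the payload. */
bool segment_is_valid(const uint8_t *seg)
{
    uint16_t stored = (uint16_t)(seg[PAYLOAD_SIZE] |
                                 ((uint16_t)seg[PAYLOAD_SIZE + 1u] << 8));
    return crc16_ccitt(seg, PAYLOAD_SIZE) == stored;
}
```

The CRC would be computed in the RAM buffer before the segment is programmed and checked after every write and at startup. Keeping a second copy in another info segment would give the application something to fall back on when the check fails.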
