nanddump ECC uncorrectable bitflips

Other Parts Discussed in Thread: AM3352

I frequently get ECC uncorrectable bitflip errors when running nanddump on one particular partition, /dev/mtd6. I also consistently get ECC corrected bitflips on at least 2 other partitions (of 17 in total). Micron believes the uncorrectable bitflip errors are due to a TI fix that is not in the SDK being used. Also, the consistent number of corrected bitflips for several partitions seems high (I'd expect none for new NAND).

I'm using the TI AM335x EVM Starter Kit ti-sdk-am335x-evm-06.00.00.00-Linux-x86-Install on a custom board with Micron MT29F4G08ABADAH4 NAND, AM3352 ARMv7 Processor rev 2 (v7l) CPU. Default ECC BCH8 is used.

/dev/mtd6 is flashed as follows:

   mmc rescan
   mw.b 0x81000000 0xFF 0x1E0000
   fatload mmc 0 0x81000000 u-boot.img
   iminfo 0x81000000
   nand erase 0x60000 0x1E0000
   nand write 0x81000000 0x600000 0x1E0000

nanddump (version 1.5.0) is run as follows (an SD card is mounted at /media/mmcblk0p2):

   nanddump --bb=skipbad -o -f /media/mmcblk0p2/nanddump_mtd06 /dev/mtd6

ECC uncorrectable bitflips occur about 80% of the time on /dev/mtd6, sometimes in the hundreds:

root:~# nanddump --bb=skipbad -o -f /media/mmcblk0p2/nanddump_mtd06 /dev/mtd6
   ECC failed: 0
   ECC corrected: 0
   Number of bad blocks: 0
   Number of bbt blocks: 0
   Block size 131072, page size 2048, OOB size 64
   Dumping data starting at 0x00000000 and ending at 0x001e0000...
   ECC: 3 uncorrectable bitflip(s) at offset 0x00000000
   ECC: 3 uncorrectable bitflip(s) at offset 0x00000800
   ECC: 3 uncorrectable bitflip(s) at offset 0x00001000
   ECC: 4 uncorrectable bitflip(s) at offset 0x00001800
   ...
   ECC: 2 uncorrectable bitflip(s) at offset 0x00176800
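
When a dump reports hundreds of these lines, it can help to summarize them by count and by erase block. The following is a minimal sketch, assuming the per-page line format shown in the output above and the 131072-byte block size reported there; the function name `summarize_ecc` is made up for illustration and is not part of mtd-utils.

```python
import re

# Matches the per-page lines printed by nanddump in the log above, e.g.
#   "ECC: 3 uncorrectable bitflip(s) at offset 0x00000800"
ECC_LINE = re.compile(
    r"ECC:\s+(\d+)\s+(uncorrectable|corrected)\s+bitflip\(s\)\s+at offset\s+0x([0-9a-fA-F]+)"
)

def summarize_ecc(log_text, block_size=131072):
    """Group ECC messages by kind: affected pages, total bitflips, erase blocks."""
    summary = {}
    for count, kind, offset_hex in ECC_LINE.findall(log_text):
        entry = summary.setdefault(kind, {"pages": 0, "bitflips": 0, "blocks": set()})
        entry["pages"] += 1
        entry["bitflips"] += int(count)
        # Erase block index that contains this page offset
        entry["blocks"].add(int(offset_hex, 16) // block_size)
    for entry in summary.values():
        entry["blocks"] = sorted(entry["blocks"])
    return summary

# The five lines that are visible in the log above (the "..." lines are not known)
sample = """\
ECC: 3 uncorrectable bitflip(s) at offset 0x00000000
ECC: 3 uncorrectable bitflip(s) at offset 0x00000800
ECC: 3 uncorrectable bitflip(s) at offset 0x00001000
ECC: 4 uncorrectable bitflip(s) at offset 0x00001800
ECC: 2 uncorrectable bitflip(s) at offset 0x00176800
"""
result = summarize_ecc(sample)
print(result)
```

Errors that cover every page of a partition from offset 0 onward point more toward a systematic cause (how the partition was written) than toward isolated weak cells.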

The file output by nanddump for /dev/mtd6 differs from the output for other partitions of the same size flashed with the same image.

When I run nanddump a second time on the same partition, the ECC uncorrectable bitflip errors do not necessarily occur again, but the subsequent run shows a record of 650 ECC failures:

root:~# nanddump --bb=skipbad -o -f nanddump_mtd06 /dev/mtd6
   ECC failed: 650
   ECC corrected: 1
   Number of bad blocks: 0
   Number of bbt blocks: 0
   Block size 131072, page size 2048, OOB size 64
   Dumping data starting at 0x00000000 and ending at 0x001e0000...
   root:~#

Running nandtest seems to work fine, but it does report the large number of ECC failures, and surprisingly (at least to me, as it's new NAND) shows a 1-bit ECC correction occurring:

root@skydrop:~# nandtest -p 10 /dev/mtd6
ECC corrections: 0
ECC failures : 650
Bad blocks : 0
BBT blocks : 0
001c0000: checking...
Finished pass 1 successfully
001c0000: checking...
Finished pass 2 successfully
00020000: reading...
1 bit(s) ECC corrected at 00020000
001c0000: checking...
Finished pass 3 successfully
001c0000: checking...
Finished pass 4 successfully
001c0000: checking...
Finished pass 5 successfully
001c0000: checking...
Finished pass 6 successfully
001c0000: checking...
Finished pass 7 successfully
001c0000: checking...
Finished pass 8 successfully
001c0000: checking...
Finished pass 9 successfully
001c0000: checking...
Finished pass 10 successfully

The uncorrectable bitflips reported by nanddump occur with or without the following patches intended for Spansion NAND: http://www.spansion.com/Support/Software/linux-psp-04.04.00.01-NAND.zip.

The biggest concern is factory reset partitions on this same NAND having uncorrectable bitflip errors when read and being subsequently copied to all the other partitions resulting in a bricked device.

A separate question: would doubling the timeout values in drivers/mtd/nand/nand_base.c adversely affect u-boot writes to NAND (nand write command)? I otherwise frequently get timeouts when writing NAND partitions, especially for larger UBI rootfs partitions on this same Micron NAND (the following values have all been doubled):

   u32 timeo = (CONFIG_SYS_HZ * 40) / 1000;
   ...
   if (state == FL_ERASING)
      timeo = (CONFIG_SYS_HZ * 800) / 1000;
   else
      timeo = (CONFIG_SYS_HZ * 40) / 1000;
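
For reference, the quoted expression converts a duration in milliseconds into timer ticks, so the constants 40 and 800 are upper bounds in milliseconds whatever CONFIG_SYS_HZ is. The sketch below reproduces that integer arithmetic. The value 1000 for CONFIG_SYS_HZ is an assumption (check the board configuration), and the helper name `timeout_ticks` is made up for illustration.

```python
def timeout_ticks(ms, sys_hz):
    # Same integer arithmetic as the quoted C expression:
    #   (CONFIG_SYS_HZ * ms) / 1000
    return (sys_hz * ms) // 1000

SYS_HZ = 1000  # assumption: CONFIG_SYS_HZ for this board; verify in the board config

for label, ms in (("program/other", 40), ("erase", 800)):
    ticks = timeout_ticks(ms, SYS_HZ)
    print(f"{label}: {ms} ms -> {ticks} ticks at CONFIG_SYS_HZ={SYS_HZ}")
```

If the wait loop exits as soon as the chip reports ready, a larger bound would only matter when the chip is slow or stuck, but that should be confirmed against the actual wait loop in the SDK's nand_base.c.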

  • This looks like a hardware failure. Get an oscilloscope and look at the signals.

    Check the timing. Check the NAND busy signal. Check the busy signal polling in the NAND driver.

    regards

    Wolfgang

  • Thanks Wolfgang.

    I too suspect that the ECC uncorrectable bitflips, which occur consistently on the same partition (/dev/mtd6), are most likely a hardware error. However, in 1 of ~8 NAND flashes followed by nanddumps, there were no ECC uncorrectable or corrected bitflips (i.e., it behaved perfectly), which argues against a hardware problem.

    Also, I regularly see ECC corrected bitflips, ~3 on average, when doing a nanddump of all 17 partitions for 512 MB of the same model of NAND on two separate devices, both of which are new. Would you consider this number of ECC corrected bitflips for this size of NAND during nanddump and/or nandtest runs on a regular basis to be normal?

    Thanks,
    Bob

  • Bob,

    If the errors are only on one partition of the flash, it might be because of the usage pattern. Are there log files on that partition? If the partition is small, it might wear out rapidly. NAND flash shows more errors as it wears out.

    3 bit errors on a 512 MByte device seem normal.
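
To put that figure in proportion, here is a rough back-of-the-envelope calculation, assuming about 3 corrected bitflips observed over one full read of 512 MB and the 2048-byte page size reported by nanddump. It is only arithmetic on the numbers in this thread, not a vendor specification.

```python
def corrected_bit_rate(bitflips, size_bytes):
    """Corrected bitflips per bit read."""
    return bitflips / (size_bytes * 8)

SIZE_BYTES = 512 * 1024 * 1024  # 512 MB device, from the thread
PAGE_SIZE = 2048                # page size reported by nanddump

rate = corrected_bit_rate(3, SIZE_BYTES)
pages = SIZE_BYTES // PAGE_SIZE

print(f"{rate:.2e} corrected bitflips per bit read")
print(f"{pages} pages in total, roughly 1 corrected bitflip per {pages // 3} pages")
```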

    But 17 partitions? How do you manage to do an effective wear leveling with 17 partitions? Wear leveling is not done across partition boundaries.

    I use 1 UBI device on the same NAND device, and wear leveling is done across the whole chip.

    regards

    Wolfgang

  • Thanks Wolfgang.

    Partition /dev/mtd6 (that with the ECC uncorrectable bitflips) isn't actively read from or written to as it's reserved for other uses. I've flashed the NAND partitions 12+ times trying to diagnose this problem, so that's the only aspect of wear leveling potentially affecting this problematic partition. Given that it's uniquely exhibited the uncorrectable bitflips since having first flashed it and only once not shown any ECC related errors, I'm still inclined to believe it's a hardware issue in this area of NAND.

    I'm glad to hear you consider 3 ECC corrected bitflips per 512 MB nanddump to be normal. It'd be nice if Micron and/or TI published "normal" bitflip correction rates. Micron claims this NAND operates "flawlessly", which doesn't help with understanding whether 1 or 1e06 ECC corrected bitflips is normal, even if "flawless."

    MLO, u-boot.img, u-boot env, uImage and rootfs are commonly in separate partitions. TI's default NAND partitioning scheme actually allocates 4 separate partitions just for copies of MLO. When copies of these are also imaged for factory reset, that already brings the count to 10. None of the factory reset partitions are expected to need much wear leveling.

    Thanks,
    Bob

  • Any chance a TI expert could chime in on this post?

    I have 2 very big concerns regarding the source in TI's 06.00.00.00 AM335x EVM Starter Kit SDK regarding the NAND driver:

    1) The source code in 06.00.00.00 in board_support/linux-3.2.0-psp04.06.00.11/drivers/mtd/nand is quite different than that for TI's Linux 3.2.0 kernel source at http://git.ti.com/ti-linux-kernel/ti-linux-kernel/blobs/v3.2/drivers/mtd/nand/.

    2) The TI Linux kernel NAND driver source code has changed immensely between versions 3.2.0 (http://git.ti.com/ti-linux-kernel/ti-linux-kernel/blobs/v3.2/drivers/mtd/nand/) and 3.16 (http://git.ti.com/ti-linux-kernel/ti-linux-kernel/blobs/v3.16/drivers/mtd/nand/omap2.c), especially in areas surrounding ECC.

    As such, what does TI recommend using for reliable NAND driver code on the AM335x?

  • I am starting to look at this post.

    Steve K.

  • Hello Steve,

    Any update?

    Also, I've included the information I provided to Michael Stevens at TI Applications Support on Aug. 11, 2014.

    Thanks,
    Bob

    We are using Micron part number MT29F4G08ABADAH4 (512 MB NAND) on a custom board with an AM3352 ARMv7 Processor rev 2 (v7l) CPU. We also use the TI AM335x EVM Starter Kit SDK ti-sdk-am335x-evm-06.00.00.00 (ti-sdk-am335x-evm-06.00.00.00-Linux-x86-Install), which embeds Linux kernel (linux-3.2.0-psp04.06.00.11) and u-boot (u-boot-2013.01.01-psp06.00.00.00). The default ECC, BCH8, is being used.
     
    We are experiencing an unusually high number of ECC uncorrectable bitflips and corrected bitflips when running nanddump (version 1.5.0) on NAND-flashed partitions. We require nanddump to produce the final NAND image for manufacturing.
     
    Micron has stated that the problems cited below are similar to those reported ~2 months ago by a client of theirs, who ultimately solved the problem with newer TI software, but we were not told which software.
     
    The concerns are:
     
    1) nanddump on /dev/mtd6 repeatedly (9 out of 10 tests) and consistently reports ECC uncorrectable bitflips. This is for a partition that is not actively used and does not suffer from wear. However, it's not reproducible on another device with the same kind of NAND, and as such we're willing to dismiss this as a hardware issue, but we do wonder why nanddump did not report any ECC-related errors when only this partition was flashed and then had nanddump run on it. We flash NAND using the u-boot commands shown below.
     
    2) nanddump on average reports 3 or more ECC corrected bitflips on several other generally larger partitions, those with the kernel image or rootfs, and consistently on more than one device with this type of NAND. Micron has suggested this is a higher than expected number of corrections, especially given it's new NAND.
     
    3) The source code in 06.00.00.00 in board_support/linux-3.2.0-psp04.06.00.11/drivers/mtd/nand is quite different than that for TI's Linux 3.2.0 kernel source at http://git.ti.com/ti-linux-kernel/ti-linux-kernel/blobs/v3.2/drivers/mtd/nand/. How can TI label this kernel version 3.2.0 in the SDK, given it has substantial differences? How do we ultimately know what kernel s/w we're using and how reliable is the NAND driver in this SDK?

    4) The TI Linux kernel NAND driver source code has changed immensely between versions 3.2.0 (http://git.ti.com/ti-linux-kernel/ti-linux-kernel/blobs/v3.2/drivers/mtd/nand/) and 3.16 (http://git.ti.com/ti-linux-kernel/ti-linux-kernel/blobs/v3.16/drivers/mtd/nand/omap2.c), especially in areas surrounding ECC. How do we safely ameliorate the version in the SDK?

    Which TI software should we be using to get around the issues we're seeing above, and what solution does TI recommend regarding reliable NAND (and ECC correction) software? Note that we cannot move to a newer TI SDK version or newer kernel at this point, as we're currently attempting to go to production and have an assembly line awaiting resolution of this problem. Also, is nanddump 1.5.0 reliable?

    A separate question: would doubling the timeout values in drivers/mtd/nand/nand_base.c adversely affect u-boot writes to NAND (nand write command)? I otherwise frequently get timeouts when writing NAND partitions, especially for larger UBI rootfs partitions on this same Micron NAND (the following values have all been doubled):

       u32 timeo = (CONFIG_SYS_HZ * 40) / 1000;
       ...
       if (state == FL_ERASING)
          timeo = (CONFIG_SYS_HZ * 800) / 1000;
       else
          timeo = (CONFIG_SYS_HZ * 40) / 1000;

    Details follow.
    Thanks,
    Bob
     

    /dev/mtd6 is flashed as follows (other partitions are flashed similarly):

       mmc rescan
       mw.b 0x81000000 0xFF 0x1E0000
       fatload mmc 0 0x81000000 u-boot.img
       iminfo 0x81000000
       nand erase 0x60000 0x1E0000
       nand write 0x81000000 0x600000 0x1E0000

    nanddump (version 1.5.0) is run as follows (an SD card is mounted at /media/mmcblk0p2):

       nanddump --bb=skipbad -o -f /media/mmcblk0p2/nanddump_mtd06 /dev/mtd6

    ECC uncorrectable bitflips occur about 90% of the time on /dev/mtd6, sometimes in the hundreds:

    root:~# nanddump --bb=skipbad -o -f /media/mmcblk0p2/nanddump_mtd06 /dev/mtd6
       ECC failed: 0
       ECC corrected: 0
       Number of bad blocks: 0
       Number of bbt blocks: 0
       Block size 131072, page size 2048, OOB size 64
       Dumping data starting at 0x00000000 and ending at 0x001e0000...
       ECC: 3 uncorrectable bitflip(s) at offset 0x00000000
       ECC: 3 uncorrectable bitflip(s) at offset 0x00000800
       ECC: 3 uncorrectable bitflip(s) at offset 0x00001000
       ECC: 4 uncorrectable bitflip(s) at offset 0x00001800
       ...
       ECC: 2 uncorrectable bitflip(s) at offset 0x00176800

    The file output by nanddump for /dev/mtd6 differs from the output for other partitions of the same size flashed with the same image.

    When nanddump is run a second time on the same /dev/mtd partition, the ECC uncorrectable bitflip errors do not necessarily occur again, but the subsequent run shows a record of 650 ECC failures:

    root:~# nanddump --bb=skipbad -o -f nanddump_mtd06 /dev/mtd6
       ECC failed: 650
       ECC corrected: 1
       Number of bad blocks: 0
       Number of bbt blocks: 0
       Block size 131072, page size 2048, OOB size 64
       Dumping data starting at 0x00000000 and ending at 0x001e0000...
       root:~#

    Running nandtest seems to work fine, but it does report the large number of ECC failures, and surprisingly (at least to me, as it's new NAND) shows a 1-bit ECC correction occurring:

    root@skydrop:~# nandtest -p 10 /dev/mtd6
    ECC corrections: 0
    ECC failures : 650
    Bad blocks : 0
    BBT blocks : 0
    001c0000: checking...
    Finished pass 1 successfully
    001c0000: checking...
    Finished pass 2 successfully
    00020000: reading...
    1 bit(s) ECC corrected at 00020000
    001c0000: checking...
    Finished pass 3 successfully
    001c0000: checking...
    Finished pass 4 successfully
    001c0000: checking...
    Finished pass 5 successfully
    001c0000: checking...
    Finished pass 6 successfully
    001c0000: checking...
    Finished pass 7 successfully
    001c0000: checking...
    Finished pass 8 successfully
    001c0000: checking...
    Finished pass 9 successfully
    001c0000: checking...
    Finished pass 10 successfully

    Frequently, other generally bigger partitions (those w/ uImage (5MB) or rootfs (~140MB)) report ECC corrected bitflips:

    nanddump --bb=skipbad -o -f /media/mmcblk0p2/nanddump_mtd12 /dev/mtd12
    ECC failed: 0
    ECC corrected: 0
    Number of bad blocks: 0
    Number of bbt blocks: 0
    Block size 131072, page size 2048, OOB size 64
    Dumping data starting at 0x00000000 and ending at 0x00500000...
    ECC: 1 corrected bitflip(s) at offset 0x0027a000

  • I am very disappointed that Steve K. has contributed absolutely nothing to the resolution of this problem. The good news is that it has been resolved, and the resolution is being shared here with the rest of the community:

    We determined the root cause of the ECC uncorrectable bitflips on /dev/mtd6 and why it was only happening on certain boards. In the script we use for flashing NAND from u-boot, which we shared early on, the range parameter passed to 'nand erase' for the 4 u-boot partitions was not large enough; it was only erasing the first 3 of the 4 u-boot partitions (each is 0x1E0000 in size). It was
       nand erase 0x60000 0xA50000
    and should have been:
       nand erase 0x60000 0x780000
    The reason we were seeing this inconsistently is that we also perform factory resets, and the factory reset code does not have this nand erase range error (it erases each partition separately before writing). It turns out the NAND that didn't exhibit the ECC uncorrectable bit errors had had a factory reset done previously.
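
The range arithmetic behind this fix can be checked with a few lines. This is a minimal sketch, assuming the 4 u-boot partitions are contiguous, each 0x1E0000 in size, with the first one starting at 0x60000 (the start offset passed to 'nand erase'); the helper names are made up for illustration. Note that three partitions of 0x1E0000 add up to 0x5A0000, so the 0xA50000 quoted above looks like a digit transposition; that reading is an assumption.

```python
PART_SIZE = 0x1E0000    # size of each u-boot partition, from the post
FIRST_START = 0x60000   # start offset passed to 'nand erase', from the post
NUM_PARTS = 4           # number of u-boot partitions, from the post

def partitions(first_start=FIRST_START, size=PART_SIZE, count=NUM_PARTS):
    """(start, end) offsets of each partition, assuming they are contiguous."""
    return [(first_start + i * size, first_start + (i + 1) * size) for i in range(count)]

def fully_erased(erase_start, erase_len, parts):
    """For each partition, True if the erase range covers it completely."""
    erase_end = erase_start + erase_len
    return [erase_start <= start and end <= erase_end for start, end in parts]

parts = partitions()
for length in (3 * PART_SIZE, 4 * PART_SIZE):
    covered = fully_erased(FIRST_START, length, parts)
    print(f"nand erase {FIRST_START:#x} {length:#x} -> covered: {covered}")
```

Under these assumptions the fourth partition starts at 0x600000, which matches the offset used by the 'nand write' command shared earlier in the thread: data was being written to a region that the erase never reached.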
    We're also satisfied that the number of ECC-corrected bitflips we're seeing is far below the spec of 3 corrected bits for every 528 bits, and as such we have no other outstanding issues or concerns regarding NAND integrity.

  • Hi,


    I'm facing a similar issue. I get this message while reading NAND:

    ECC: 1 uncorrectable bitflip(s) at offset 0xbef3cb6000040000

    My problem is that I want to change data from 0xFFFFFFFF to 0xFFFFFFFE at a particular offset. Since I cannot access a particular offset directly (only page-wise access is allowed), I am performing the steps below.

    MY ASSUMPTION:

    0xFFFFFFFF -> 0xFFFFFFFE doesn't require erase cycle.


    STEPS:

    ======

    1. Read the contents of NAND.

    2. Change only the required data (0xFFFFFFFF -> 0xFFFFFFFE).

    3. Write back the entire page (no errors, success).

    4. Read back the page. Got the message ECC: 1 uncorrectable bitflip(s) at offset 0xbef3cb6000040000.

    But the data is written as expected, and I am able to do a nand dump.

    Can anybody help in resolving this?

    1. Can we ignore this message during read?

    2. Can I do bit change of 1->0 without erase?

    3. Or is there any way to access only a particular offset?

    Thanks and Regards

    Vaishnavi
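
One plausible explanation for the symptom in this last post can be sketched as follows. Programming NAND without an erase can only change bits from 1 to 0, so the data change 0xFFFFFFFF -> 0xFFFFFFFE can succeed while the ECC bytes stored alongside the page cannot be updated freely. The model below is a toy under stated assumptions: programming is modelled as a bitwise AND with the stored cells, and `toy_ecc` is a one-byte XOR checksum standing in for the real ECC (it is not BCH8). The data values are arbitrary.

```python
from functools import reduce

def nand_program(stored: bytes, new: bytes) -> bytes:
    # Programming without erase can only clear bits (1 -> 0): model it as AND
    return bytes(s & n for s, n in zip(stored, new))

def toy_ecc(data: bytes) -> bytes:
    # Stand-in for the real ECC (NOT BCH8): XOR of all data bytes
    return bytes([reduce(lambda a, b: a ^ b, data, 0)])

ERASED = b"\xff"
old_data = bytes([0x12, 0x34, 0xFF, 0xFF, 0xFF, 0xFF])
new_data = bytes([0x12, 0x34, 0xFF, 0xFF, 0xFF, 0xFE])  # one bit changed 1 -> 0

# First write, onto erased cells
data_cells = nand_program(ERASED * len(old_data), old_data)
ecc_cells = nand_program(ERASED, toy_ecc(old_data))

# Second write of the same page, without an erase in between
data_cells = nand_program(data_cells, new_data)
ecc_cells = nand_program(ecc_cells, toy_ecc(new_data))

data_ok = data_cells == new_data
ecc_ok = ecc_cells == toy_ecc(new_data)
print("data as intended:", data_ok)
print("stored ECC matches new data:", ecc_ok)
```

In this toy case the data ends up as intended but the stored checksum does not match it, because the new checksum would need a bit to go from 0 back to 1. Whether the real driver behaves this way for a given page depends on the data and on the ECC layout, so treat this as an illustration of the mechanism rather than a diagnosis.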