I frequently get ECC uncorrectable bitflip errors when running nanddump on one particular partition, /dev/mtd6. I also consistently get ECC corrected bitflips on at least 2 other partitions (out of 17 partitions in total). Micron believes the uncorrectable bitflip errors are due to a TI fix that is not in the SDK being used. Also, the consistent number of corrected bitflips on several partitions seems high (I'd expect none for new NAND).
I'm using the TI AM335x EVM Starter Kit ti-sdk-am335x-evm-06.00.00.00-Linux-x86-Install on a custom board with Micron MT29F4G08ABADAH4 NAND, AM3352 ARMv7 Processor rev 2 (v7l) CPU. Default ECC BCH8 is used.
/dev/mtd6 is flashed as follows:
mmc rescan
mw.b 0x81000000 0xFF 0x1E0000
fatload mmc 0 0x81000000 u-boot.img
iminfo 0x81000000
nand erase 0x60000 0x1E0000
nand write 0x81000000 0x600000 0x1E0000
nanddump (version 1.5.0) is run as follows (an SD card is mounted at /media/mmcblk0p2):
nanddump --bb=skipbad -o -f /media/mmcblk0p2/nanddump_mtd06 /dev/mtd6
ECC uncorrectable bitflips occur about 80% of the time on /dev/mtd6, sometimes in the hundreds:
root:~# nanddump --bb=skipbad -o -f /media/mmcblk0p2/nanddump_mtd06 /dev/mtd6
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 0
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x001e0000...
ECC: 3 uncorrectable bitflip(s) at offset 0x00000000
ECC: 3 uncorrectable bitflip(s) at offset 0x00000800
ECC: 3 uncorrectable bitflip(s) at offset 0x00001000
ECC: 4 uncorrectable bitflip(s) at offset 0x00001800
...
ECC: 2 uncorrectable bitflip(s) at offset 0x00176800
The file output by nanddump for /dev/mtd6 differs from the output for other partitions of the same size flashed with the same image.
When I run nanddump a second time on the same partition, the ECC uncorrectable bitflip errors do not necessarily occur again, but the second run shows a record of 650 ECC failures:
root:~# nanddump --bb=skipbad -o -f nanddump_mtd06 /dev/mtd6
ECC failed: 650
ECC corrected: 1
Number of bad blocks: 0
Number of bbt blocks: 0
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x001e0000...
root:~#
Running nandtest appears to work, but it does report the large number of ECC failures and, surprisingly (at least to me, as the NAND is new), shows a 1-bit ECC correction occurring:
root@skydrop:~# nandtest -p 10 /dev/mtd6
ECC corrections: 0
ECC failures : 650
Bad blocks : 0
BBT blocks : 0
001c0000: checking...
Finished pass 1 successfully
001c0000: checking...
Finished pass 2 successfully
00020000: reading...
1 bit(s) ECC corrected at 00020000
001c0000: checking...
Finished pass 3 successfully
001c0000: checking...
Finished pass 4 successfully
001c0000: checking...
Finished pass 5 successfully
001c0000: checking...
Finished pass 6 successfully
001c0000: checking...
Finished pass 7 successfully
001c0000: checking...
Finished pass 8 successfully
001c0000: checking...
Finished pass 9 successfully
001c0000: checking...
Finished pass 10 successfully
The uncorrectable bitflips reported by nanddump occur with or without the following patches intended for Spansion NAND: http://www.spansion.com/Support/Software/linux-psp-04.04.00.01-NAND.zip.
The biggest concern is that the factory reset partitions on this same NAND may have uncorrectable bitflip errors when read and then be copied to all the other partitions, resulting in a bricked device.
A separate question: would doubling the timeout values in drivers/mtd/nand/nand_base.c adversely affect U-Boot writes to NAND (the nand write command)? Otherwise I frequently get timeouts when writing NAND partitions, especially the larger UBI rootfs partitions on this same Micron NAND (all of the following values have been doubled):
u32 timeo = (CONFIG_SYS_HZ * 40) / 1000;
...
if (state == FL_ERASING)
	timeo = (CONFIG_SYS_HZ * 800) / 1000;
else
	timeo = (CONFIG_SYS_HZ * 40) / 1000;
This looks like a hardware failure. Get an oscilloscope and look at the signals.
Check the timing. Check the NAND busy signal. Check the busy signal polling in the NAND driver.
regards
Wolfgang
Thanks Wolfgang.
I too suspect that the ECC uncorrectable bitflips occurring consistently on the same partition (/dev/mtd6) are most likely due to a hardware error. However, in 1 of ~8 cycles of flashing the NAND followed by nanddump, there were no ECC uncorrectable or corrected bitflips (i.e., it behaved perfectly), which argues against it being a hardware problem.
Also, I regularly see ECC corrected bitflips, ~3 on average, when doing a nanddump of all 17 partitions of the 512 MB NAND on two separate devices with the same NAND model, both of which are new. Would you consider this number of ECC corrected bitflips for this size of NAND during regular nanddump and/or nandtest runs to be normal?
Thanks,
Bob
Bob,
if the errors are only on one partition of the flash, it might be because of the usage pattern. Are there log files on that partition? If the partition is small, it might wear out rapidly. NAND flashes get more errors as they wear out.
Three bit errors on a 512 MByte device seem to be normal.
But 17 partitions? How do you manage to do an effective wear leveling with 17 partitions? Wear leveling is not done across partition boundaries.
I use 1 UBI device on the same NAND device, and wear leveling is done across the whole chip.
regards
Wolfgang
Any chance a TI expert could chime in on this post?
I have 2 very big concerns about the NAND driver source in TI's 06.00.00.00 AM335x EVM Starter Kit SDK:
1) The source code in 06.00.00.00 in board_support/linux-3.2.0-psp04.06.00.11/drivers/mtd/nand is quite different from that in TI's Linux 3.2.0 kernel source at http://git.ti.com/ti-linux-kernel/ti-linux-kernel/blobs/v3.2/drivers/mtd/nand/.
2) The TI Linux kernel NAND driver source code has changed immensely between versions 3.2.0 (http://git.ti.com/ti-linux-kernel/ti-linux-kernel/blobs/v3.2/drivers/mtd/nand/) and 3.16 (http://git.ti.com/ti-linux-kernel/ti-linux-kernel/blobs/v3.16/drivers/mtd/nand/omap2.c), especially in areas surrounding ECC.
As such, what does TI recommend using for reliable NAND driver code on the AM335x?
Hello Steve,
Any update?
Also, I've included the information I provided to Michael Stevens at TI Applications Support on Aug. 11, 2014.
Thanks,
Bob
u32 timeo = (CONFIG_SYS_HZ * 40) / 1000;
...
if (state == FL_ERASING)
	timeo = (CONFIG_SYS_HZ * 800) / 1000;
else
	timeo = (CONFIG_SYS_HZ * 40) / 1000;
/dev/mtd6 is flashed as follows (other partitions are flashed similarly):
mmc rescan
mw.b 0x81000000 0xFF 0x1E0000
fatload mmc 0 0x81000000 u-boot.img
iminfo 0x81000000
nand erase 0x60000 0x1E0000
nand write 0x81000000 0x600000 0x1E0000
nanddump (version 1.5.0) is run as follows (an SD card is mounted at /media/mmcblk0p2):
nanddump --bb=skipbad -o -f /media/mmcblk0p2/nanddump_mtd06 /dev/mtd6
ECC uncorrectable bitflips occur about 90% of the time on /dev/mtd6, sometimes in the hundreds:
root:~# nanddump --bb=skipbad -o -f /media/mmcblk0p2/nanddump_mtd06 /dev/mtd6
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 0
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x001e0000...
ECC: 3 uncorrectable bitflip(s) at offset 0x00000000
ECC: 3 uncorrectable bitflip(s) at offset 0x00000800
ECC: 3 uncorrectable bitflip(s) at offset 0x00001000
ECC: 4 uncorrectable bitflip(s) at offset 0x00001800
...
ECC: 2 uncorrectable bitflip(s) at offset 0x00176800
The file output by nanddump for /dev/mtd6 differs from the output for other partitions of the same size flashed with the same image.
When nanddump is run a second time on the same /dev/mtd partition, the ECC uncorrectable bitflip errors do not necessarily occur again, but the second run shows a record of 650 ECC failures:
root:~# nanddump --bb=skipbad -o -f nanddump_mtd06 /dev/mtd6
ECC failed: 650
ECC corrected: 1
Number of bad blocks: 0
Number of bbt blocks: 0
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x001e0000...
root:~#
Running nandtest appears to work, but it does report the large number of ECC failures and, surprisingly (at least to me, as the NAND is new), shows a 1-bit ECC correction occurring:
root@skydrop:~# nandtest -p 10 /dev/mtd6
ECC corrections: 0
ECC failures : 650
Bad blocks : 0
BBT blocks : 0
001c0000: checking...
Finished pass 1 successfully
001c0000: checking...
Finished pass 2 successfully
00020000: reading...
1 bit(s) ECC corrected at 00020000
001c0000: checking...
Finished pass 3 successfully
001c0000: checking...
Finished pass 4 successfully
001c0000: checking...
Finished pass 5 successfully
001c0000: checking...
Finished pass 6 successfully
001c0000: checking...
Finished pass 7 successfully
001c0000: checking...
Finished pass 8 successfully
001c0000: checking...
Finished pass 9 successfully
001c0000: checking...
Finished pass 10 successfully
Other, generally larger partitions (those with uImage (5 MB) or rootfs (~140 MB)) frequently report ECC corrected bitflips:
nanddump --bb=skipbad -o -f /media/mmcblk0p2/nanddump_mtd12 /dev/mtd12
ECC failed: 0
ECC corrected: 0
Number of bad blocks: 0
Number of bbt blocks: 0
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x00500000...
ECC: 1 corrected bitflip(s) at offset 0x0027a000
I am very disappointed that Steve K. has contributed absolutely nothing to the resolution of this problem. The good news is that it has been resolved and is being shared with the rest of the community:
Hi,
I'm facing a similar issue. I get this message while reading NAND:
ECC: 1 uncorrectable bitflip(s) at offset 0xbef3cb6000040000
I want to change the data at a particular offset from 0xFFFFFFFF to 0xFFFFFFFE. Since I cannot access an individual offset (only page-wise access is allowed), I am performing the steps below.
MY ASSUMPTION:
0xFFFFFFFF -> 0xFFFFFFFE does not require an erase cycle.
STEPS:
======
1. Read the contents of NAND.
2. Change only the required data (0xFFFFFFFF -> 0xFFFFFFFE).
3. Write back the entire page (no errors; success).
4. Read back the page. I got the message: ECC: 1 uncorrectable bitflip(s) at offset 0xbef3cb6000040000.
However, the data is written as expected, and I am able to do a nand dump.
Can anybody help in resolving this?
1. Can this message be ignored during the read?
2. Can I change a bit from 1 to 0 without an erase?
3. Or is there any way to access only a particular offset?
Thanks and Regards
Vaishnavi