We have a board based on the AM335x EVM. It uses the AM3358 with NAND flash; the NAND part is the MT29F4G08ABADAH4 (512 MBytes). We are running U-Boot and Linux. Every once in a while, Linux reports at boot that the rootfs is corrupted.
The MTD partitions look like this:
[root@MT2000 proc]# cat mtd
dev: size erasesize name
mtd0: 00020000 00020000 "SPL"
mtd1: 00020000 00020000 "SPL.backup1"
mtd2: 00020000 00020000 "SPL.backup2"
mtd3: 00020000 00020000 "SPL.backup3"
mtd4: 001e0000 00020000 "U-Boot"
mtd5: 00020000 00020000 "U-Boot Env"
mtd6: 00500000 00020000 "Kernel0"
mtd7: 08000000 00020000 "RootFS0"
mtd8: 00500000 00020000 "Kernel1"
mtd9: 08000000 00020000 "RootFS1"
mtd10: 0f380000 00020000 "Data"
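As a sanity check on the layout, the listing above can be parsed and totalled. This is only a sketch: the start offsets it prints assume the partitions are contiguous and laid out in the order listed, which /proc/mtd itself does not state.

```python
# Sketch: parse the /proc/mtd listing quoted above and total the partition sizes.
PROC_MTD = """\
dev: size erasesize name
mtd0: 00020000 00020000 "SPL"
mtd1: 00020000 00020000 "SPL.backup1"
mtd2: 00020000 00020000 "SPL.backup2"
mtd3: 00020000 00020000 "SPL.backup3"
mtd4: 001e0000 00020000 "U-Boot"
mtd5: 00020000 00020000 "U-Boot Env"
mtd6: 00500000 00020000 "Kernel0"
mtd7: 08000000 00020000 "RootFS0"
mtd8: 00500000 00020000 "Kernel1"
mtd9: 08000000 00020000 "RootFS1"
mtd10: 0f380000 00020000 "Data"
"""


def parse_proc_mtd(text):
    """Return a list of (dev, size, erasesize, name) tuples."""
    parts = []
    for line in text.splitlines()[1:]:  # skip the header row
        # maxsplit=3 keeps names containing spaces ("U-Boot Env") in one field
        dev, size, erasesize, name = line.split(None, 3)
        parts.append((dev.rstrip(":"), int(size, 16), int(erasesize, 16),
                      name.strip('"')))
    return parts


partitions = parse_proc_mtd(PROC_MTD)

offset = 0  # assumption: partitions are contiguous, starting at offset 0
for dev, size, erasesize, name in partitions:
    print(f"{dev:6} {name:12} start=0x{offset:08x} "
          f"size=0x{size:08x} blocks={size // erasesize}")
    offset += size

total = offset
print(f"total = 0x{total:08x} ({total // (1024 * 1024)} MiB)")
```

The sizes add up to 0x20000000 bytes, i.e. exactly the 512 MBytes of the NAND part, and each RootFS partition covers 1024 erase blocks of 128 KiB.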
We have two sets of kernel and rootfs partitions that we swap when needed. The NAND programming is done from within Linux, using "flash_erase /dev/mtdx 0 0" followed by "nandwrite -ap /dev/mtdx <file>", and always targets the non-active kernel and rootfs partitions. After programming, we set a U-Boot environment variable and reboot. U-Boot uses that variable to decide which set of partitions to load.
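For reference, the programming step described above can be sketched as a small wrapper that runs the same two commands and stops on the first non-zero exit status. The function names, the `dry_run` parameter and the image file name are hypothetical and only for illustration; the command lines themselves are the ones quoted above.

```python
import subprocess


def build_commands(mtd_index, image_path):
    """Build the erase and write command lines for one MTD partition."""
    dev = f"/dev/mtd{mtd_index}"
    return [
        ["flash_erase", dev, "0", "0"],         # erase the whole partition
        ["nandwrite", "-ap", dev, image_path],  # write the image, as in the post
    ]


def program_partition(mtd_index, image_path, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in build_commands(mtd_index, image_path):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit status
            subprocess.run(cmd, check=True)


# Dry run only: prints the commands without touching any flash.
# "rootfs.squashfs" is a placeholder file name.
program_partition(7, "rootfs.squashfs")
```

A wrapper like this only catches failures that the tools report through their exit status; it says nothing about data that was written without error but later reads back badly.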
When the unit boots and shows corruption, things seem to be mostly intact. A recent event from the console startup log is this (we use squashfs for the rootfs):
------------------------
Starting system message bus: [ 6.829193] end_request: I/O error, dev mtdblock7, sector 12480
[ 6.844573] SQUASHFS error: squashfs_read_data failed to read block 0x6131fc
[ 6.851989] SQUASHFS error: Unable to read fragment cache entry [6131fc]
[ 6.859039] SQUASHFS error: Unable to read page, block 6131fc, size d78d
[ 6.866119] SQUASHFS error: Unable to read fragment cache entry [6131fc]
[ 6.873138] SQUASHFS error: Unable to read page, block 6131fc, size d78d
[ 6.880218] SQUASHFS error: Unable to read fragment cache entry [6131fc]
[ 6.887237] SQUASHFS error: Unable to read page, block 6131fc, size d78d
[ 6.895538] SQUASHFS error: Unable to read fragment cache entry [6131fc]
[ 6.902587] SQUASHFS error: Unable to read page, block 6131fc, size d78d
/etc/init.d/S30dbus: line 26: /usr/bin/dbus-uuidgen: Input/output error done
------------------------
So in this case the problem was on mtdblock7, which is the RootFS0 partition. This happened on the first boot after programming mtd6 with the kernel and mtd7 with the rootfs. There were no indications of errors during programming. I rebooted a few times and got the same errors each time. I programmed the same files into other boards and had no problems.
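The sector number and squashfs block address in the log above can be mapped back to erase blocks within the partition. The sketch below assumes the block layer's usual 512-byte sector units, a 2 KiB NAND page (an assumption about the part, not something shown in the log), and that the "size d78d" value is the on-disk length of the squashfs block.

```python
SECTOR_SIZE = 512     # assumption: sector numbers in the I/O error use 512-byte units
ERASE_SIZE = 0x20000  # erasesize reported in /proc/mtd
PAGE_SIZE = 2048      # assumption: 2 KiB NAND page


def locate(offset):
    """Return (erase block index, page index within that block) for an offset."""
    block = offset // ERASE_SIZE
    page = (offset % ERASE_SIZE) // PAGE_SIZE
    return block, page


# Offset of the failing sector, relative to the start of mtdblock7
sector_offset = 12480 * SECTOR_SIZE

# Extent of the squashfs block that could not be read
sq_start = 0x6131FC
sq_end = sq_start + 0xD78D - 1  # assumption: reported size is the on-disk length

print(f"failing sector offset: 0x{sector_offset:x} -> block/page {locate(sector_offset)}")
print(f"squashfs block start:  0x{sq_start:x} -> block/page {locate(sq_start)}")
print(f"squashfs block end:    0x{sq_end:x} -> block/page {locate(sq_end)}")
print("sector inside squashfs block:", sq_start <= sector_offset <= sq_end)
```

Under those assumptions the failing sector sits at offset 0x618000, in erase block 48 of the partition (block start 0x600000), and it falls inside the squashfs block that the later messages complain about.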
I switched to the other set of partitions and rebooted. I then ran nandtest for the first time, expecting to find errors, but did not:
------------------------
[root@MT2000 ~]# nandtest -p 5 /dev/mtd7
ECC corrections: 0
ECC failures : 0
Bad blocks : 0
BBT blocks : 0
07fe0000: checking...
Finished pass 1 successfully
07fe0000: checking...
Finished pass 2 successfully
002c0000: reading...
1 bit(s) ECC corrected at 002c0000
00840000: reading...
1 bit(s) ECC corrected at 00840000
00c40000: reading...
1 bit(s) ECC corrected at 00c40000
07fe0000: checking...
Finished pass 3 successfully
004e0000: reading...
1 bit(s) ECC corrected at 004e0000
03840000: reading...
1 bit(s) ECC corrected at 03840000
07f20000: reading...
1 bit(s) ECC corrected at 07f20000
07fe0000: checking...
Finished pass 4 successfully
02260000: reading...
1 bit(s) ECC corrected at 02260000
048e0000: reading...
1 bit(s) ECC corrected at 048e0000
07fe0000: checking...
Finished pass 5 successfully
[root@MT2000 ~]#
------------------------
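To see where the corrected bits landed, the "ECC corrected" lines from the run above can be converted to erase block numbers and compared with the block containing the sector that failed at boot (sector 12480 of mtdblock7). This is a rough sketch that assumes 512-byte sectors and takes the addresses printed by nandtest as offsets within the partition.

```python
import re

ERASE_SIZE = 0x20000  # erasesize reported in /proc/mtd

# "ECC corrected" lines copied from the nandtest run above
NANDTEST_LOG = """\
1 bit(s) ECC corrected at 002c0000
1 bit(s) ECC corrected at 00840000
1 bit(s) ECC corrected at 00c40000
1 bit(s) ECC corrected at 004e0000
1 bit(s) ECC corrected at 03840000
1 bit(s) ECC corrected at 07f20000
1 bit(s) ECC corrected at 02260000
1 bit(s) ECC corrected at 048e0000
"""

# Extract the hexadecimal offsets and convert them to erase block indices
corrected = [int(addr, 16)
             for addr in re.findall(r"ECC corrected at ([0-9a-fA-F]+)", NANDTEST_LOG)]
blocks = sorted(addr // ERASE_SIZE for addr in corrected)

# Erase block holding the sector from the boot-time I/O error
# (assumption: 512-byte sectors)
failing_block = (12480 * 512) // ERASE_SIZE

print("blocks with corrected bits:", blocks)
print("block that failed at boot: ", failing_block)
print("failing block among corrected blocks:", failing_block in blocks)
```

In this run the eight single-bit corrections fall in eight different erase blocks, and none of them is the block that produced the I/O error at boot.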
I re-programmed the partitions with the same kernel and rootfs data, and the board is now working fine. I also re-ran nandtest on the same partition, expecting it to report outstanding ECC corrections at the start of its run, but there were none. Is this how ECC is supposed to work? Or is ECC handled entirely inside the AM3358, with nandtest doing its own "software" ECC (and maybe confusing things)?
Are flash_erase and nandwrite -ap the correct utilities to use?
Any insights on what to look for to solve this issue would be appreciated.
Thanks,
Rod Campbell