AM335x NAND data intermittent problems

Rod Campbell1

Other Parts Discussed in Thread: AM3358

We have a board based on the AM335x EVM. We are using the AM3358 with NAND; the NAND is the MT29F4G08ABADAH4 (512 MBytes). We are running U-Boot and Linux. Every once in a while we get an indication at Linux boot-up that the rootfs is corrupted.

The MTD partitions look like this:

[root@MT2000 proc]# cat mtd
dev: size erasesize name
mtd0: 00020000 00020000 "SPL"
mtd1: 00020000 00020000 "SPL.backup1"
mtd2: 00020000 00020000 "SPL.backup2"
mtd3: 00020000 00020000 "SPL.backup3"
mtd4: 001e0000 00020000 "U-Boot"
mtd5: 00020000 00020000 "U-Boot Env"
mtd6: 00500000 00020000 "Kernel0"
mtd7: 08000000 00020000 "RootFS0"
mtd8: 00500000 00020000 "Kernel1"
mtd9: 08000000 00020000 "RootFS1"
mtd10: 0f380000 00020000 "Data"
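As a sanity check on the layout, here is a minimal Python sketch that parses the listing above and totals the partition sizes. The helper name `parse_proc_mtd` is only illustrative; the data is copied from the `cat mtd` output.

```python
# Partition table copied from the /proc/mtd listing above.
PROC_MTD = """\
dev: size erasesize name
mtd0: 00020000 00020000 "SPL"
mtd1: 00020000 00020000 "SPL.backup1"
mtd2: 00020000 00020000 "SPL.backup2"
mtd3: 00020000 00020000 "SPL.backup3"
mtd4: 001e0000 00020000 "U-Boot"
mtd5: 00020000 00020000 "U-Boot Env"
mtd6: 00500000 00020000 "Kernel0"
mtd7: 08000000 00020000 "RootFS0"
mtd8: 00500000 00020000 "Kernel1"
mtd9: 08000000 00020000 "RootFS1"
mtd10: 0f380000 00020000 "Data"
"""


def parse_proc_mtd(text):
    """Return a list of (dev, size, erasesize, name) tuples (hypothetical helper)."""
    partitions = []
    for line in text.splitlines()[1:]:  # skip the header line
        if not line.strip():
            continue
        # The name may contain spaces ("U-Boot Env"), so split at most 3 times.
        dev, size, erasesize, name = line.split(None, 3)
        partitions.append(
            (dev.rstrip(":"), int(size, 16), int(erasesize, 16), name.strip('"'))
        )
    return partitions


partitions = parse_proc_mtd(PROC_MTD)
total_bytes = sum(size for _, size, _, _ in partitions)

for dev, size, erasesize, name in partitions:
    print(f"{dev:6s} {name:12s} {size // erasesize:5d} erase blocks")
print(f"total: {total_bytes // (1024 * 1024)} MiB")
```

The sizes add up to exactly the 512 MBytes of the NAND part, so the partitions cover the whole device with no gap or overlap.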

We have two sets of kernel and rootfs partitions that we swap when needed. The NAND programming is done from within Linux, using "flash_erase /dev/mtdx 0 0" and "nandwrite -ap /dev/mtdx <file>", writing to the non-active rootfs and kernel partitions. After programming, we set a U-Boot environment variable and reboot. U-Boot uses the variable to load the right set of partitions.
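The update flow described above can be sketched roughly as follows. This is only an illustration: the set indices 0 and 1 and the function names are assumptions, the image file names are placeholders, and the sketch only builds the command lines (copied from the description above) without running them.

```python
# Partition numbers come from the /proc/mtd listing; the set indices are assumed.
PARTITION_SETS = {
    0: {"kernel": "/dev/mtd6", "rootfs": "/dev/mtd7"},
    1: {"kernel": "/dev/mtd8", "rootfs": "/dev/mtd9"},
}


def inactive_set(active):
    """Return the index of the partition set that is not currently booted."""
    if active not in PARTITION_SETS:
        raise ValueError(f"unknown partition set: {active}")
    return 1 - active


def update_commands(active, kernel_image, rootfs_image):
    """Build the erase/write command lines for the non-active set."""
    target = PARTITION_SETS[inactive_set(active)]
    commands = []
    for key, image in (("kernel", kernel_image), ("rootfs", rootfs_image)):
        commands.append(["flash_erase", target[key], "0", "0"])
        commands.append(["nandwrite", "-ap", target[key], image])
    return commands


# Example: set 0 is running, so set 1 (mtd8/mtd9) gets programmed.
for command in update_commands(0, "uImage", "rootfs.squashfs"):
    print(" ".join(command))
```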

When the unit boots and shows corruption, things seem to be mostly intact. A recent event from the console startup log is this (we use squashfs for the rootfs):

------------------------

Starting system message bus: [ 6.829193] end_request: I/O error, dev mtdblock7, sector 12480
[ 6.844573] SQUASHFS error: squashfs_read_data failed to read block 0x6131fc
[ 6.851989] SQUASHFS error: Unable to read fragment cache entry [6131fc]
[ 6.859039] SQUASHFS error: Unable to read page, block 6131fc, size d78d
[ 6.866119] SQUASHFS error: Unable to read fragment cache entry [6131fc]
[ 6.873138] SQUASHFS error: Unable to read page, block 6131fc, size d78d
[ 6.880218] SQUASHFS error: Unable to read fragment cache entry [6131fc]
[ 6.887237] SQUASHFS error: Unable to read page, block 6131fc, size d78d
[ 6.895538] SQUASHFS error: Unable to read fragment cache entry [6131fc]
[ 6.902587] SQUASHFS error: Unable to read page, block 6131fc, size d78d
/etc/init.d/S30dbus: line 26: /usr/bin/dbus-uuidgen: Input/output error
done

------------------------

So in this case the problem was on mtdblock7, which is the RootFS0 partition. This happened at the first boot after programming mtd6 with the kernel and mtd7 with the rootfs. There were no indications of errors during the programming. I rebooted a few times and got the same errors. I programmed the same files into other boards and had no problems.
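To relate the failing sector to a NAND location, the numbers in the log can be converted as in the sketch below. It assumes 512-byte block-layer sectors, the 0x20000 erase size from /proc/mtd, and a 2 KiB page (the page size nanddump reports for this device). Treating the squashfs "size" field as a plain byte length is also an assumption.

```python
SECTOR_SIZE = 512      # block-layer sectors are 512 bytes
ERASE_SIZE = 0x20000   # erase block size from /proc/mtd
PAGE_SIZE = 2048       # NAND page size, as reported by nanddump


def locate(sector):
    """Map an mtdblock sector number to an offset, erase block and page."""
    offset = sector * SECTOR_SIZE
    return {
        "offset": offset,
        "erase_block": offset // ERASE_SIZE,
        "page_in_block": (offset % ERASE_SIZE) // PAGE_SIZE,
    }


loc = locate(12480)  # sector from the end_request line
print(f"offset 0x{loc['offset']:x}, erase block {loc['erase_block']}, "
      f"page {loc['page_in_block']} within that block")

# Squashfs block reported in the log (size assumed to be a byte length).
squashfs_start = 0x6131FC
squashfs_end = squashfs_start + 0xD78D
print("failing sector inside reported squashfs block:",
      squashfs_start <= loc["offset"] < squashfs_end)
```

Sector 12480 works out to partition offset 0x618000, which is in erase block 48 of mtd7 and falls inside the squashfs block that failed to read, so the I/O error and the SQUASHFS errors point at the same region.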

I switched to the other set of partitions and rebooted. I then ran nandtest for the first time, expecting to find errors, but did not:

------------------------

[root@MT2000 ~]# nandtest -p 5 /dev/mtd7
ECC corrections: 0
ECC failures : 0
Bad blocks : 0
BBT blocks : 0
07fe0000: checking...
Finished pass 1 successfully
07fe0000: checking...
Finished pass 2 successfully
002c0000: reading...
1 bit(s) ECC corrected at 002c0000
00840000: reading...
1 bit(s) ECC corrected at 00840000
00c40000: reading...
1 bit(s) ECC corrected at 00c40000
07fe0000: checking...
Finished pass 3 successfully
004e0000: reading...
1 bit(s) ECC corrected at 004e0000
03840000: reading...
1 bit(s) ECC corrected at 03840000
07f20000: reading...
1 bit(s) ECC corrected at 07f20000
07fe0000: checking...
Finished pass 4 successfully
02260000: reading...
1 bit(s) ECC corrected at 02260000
048e0000: reading...
1 bit(s) ECC corrected at 048e0000
07fe0000: checking...
Finished pass 5 successfully
[root@MT2000 ~]#
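A small Python sketch for summarizing the corrections in a nandtest log like the one above. The log lines are copied from that output; the function name `summarize` is just illustrative. As far as I understand, nandtest rewrites each erase block with new test data on every pass, so corrections from one pass are not expected to carry over to the next.

```python
import re

# Relevant lines copied from the nandtest output above.
NANDTEST_LOG = """\
Finished pass 1 successfully
Finished pass 2 successfully
1 bit(s) ECC corrected at 002c0000
1 bit(s) ECC corrected at 00840000
1 bit(s) ECC corrected at 00c40000
Finished pass 3 successfully
1 bit(s) ECC corrected at 004e0000
1 bit(s) ECC corrected at 03840000
1 bit(s) ECC corrected at 07f20000
Finished pass 4 successfully
1 bit(s) ECC corrected at 02260000
1 bit(s) ECC corrected at 048e0000
Finished pass 5 successfully
"""

CORRECTED = re.compile(r"(\d+) bit\(s\) ECC corrected at ([0-9a-fA-F]+)")


def summarize(log):
    """Group (offset, corrected_bits) pairs by the pass they were reported in."""
    per_pass = {}
    current_pass = 1
    for line in log.splitlines():
        match = CORRECTED.search(line)
        if match:
            entry = (int(match.group(2), 16), int(match.group(1)))
            per_pass.setdefault(current_pass, []).append(entry)
        elif line.startswith("Finished pass"):
            # Corrections are printed before the "Finished pass N" line.
            current_pass = int(line.split()[2]) + 1
    return per_pass


summary = summarize(NANDTEST_LOG)
offsets = [offset for entries in summary.values() for offset, _ in entries]
worst = max(bits for entries in summary.values() for _, bits in entries)

for pass_number, entries in sorted(summary.items()):
    print(f"pass {pass_number}: " + ", ".join(f"0x{o:08x}" for o, _ in entries))
print(f"{len(offsets)} corrections, {len(set(offsets))} distinct offsets, "
      f"worst case {worst} bit(s)")
```

In this run every correction is a single bit and no offset repeats between passes.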

I re-programmed the partitions with the same kernel and rootfs data, and it is now working fine. I also re-ran nandtest on the same partition, expecting to see outstanding ECC corrections at the start of its run, but there were none. Is this how ECC is supposed to work? Or is that all internal to the AM3358, with nandtest doing "software" ECC (and maybe confusing things)?

Are flash_erase and nandwrite -ap the correct utilities to use?
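One way to double-check a programming run, independent of which utility wrote it, is to read the partition back and compare it with the image. Below is a minimal sketch of that comparison. It assumes the read-back dump contains page data only (no OOB), that it was taken the same way the image was written with respect to bad blocks, and that the last page was padded with 0xFF. How the dump is produced is left out, since the nanddump options differ between mtd-utils versions. The function names are hypothetical.

```python
def matches_image(image, dump, page_size=2048):
    """True if `dump` starts with `image` followed by 0xFF padding to a page boundary."""
    padded_len = -(-len(image) // page_size) * page_size  # round up to a full page
    if len(dump) < padded_len:
        return False
    padding = dump[len(image):padded_len]
    return dump[:len(image)] == image and all(byte == 0xFF for byte in padding)


def first_mismatch(image, dump):
    """Return the offset of the first differing byte, or None if none differ."""
    for offset, (expected, actual) in enumerate(zip(image, dump)):
        if expected != actual:
            return offset
    return None


# Self-contained demonstration with synthetic data standing in for real files.
image = bytes(range(256)) * 12                      # 3072-byte "image"
good_dump = image + b"\xff" * (4096 - len(image))   # padded to two 2 KiB pages
bad_dump = bytearray(good_dump)
bad_dump[1000] ^= 0x01                              # simulate one flipped bit

print(matches_image(image, good_dump))
print(matches_image(image, bytes(bad_dump)), first_mismatch(image, bytes(bad_dump)))
```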

Any insights on what to look for to solve this issue would be appreciated.

Thanks,

Rod Campbell

over 9 years ago

0 Biser Gatchev-XID over 9 years ago

TI__Guru**** 393215 points

Hi Rod,

I will forward this to the SW team. Can you post what Linux version you are using?

0 Wolfgang Muees1 over 9 years ago

Genius 3685 points

Your NAND chip requires 4-bit ECC. Which ECC do you use? I know this type of error. It comes from bit errors in the NAND. You seem to have problems with your ECC correction.

0 Ivan Matrakov over 9 years ago in reply to Wolfgang Muees1

TI__Expert 4475 points

Hi,

ECC is configured in a .dts file for SDK 7 and in a board file for SDK 6. ECC is active in the nandtest application. Please tell us which SDK you are using, SDK 6 or 7. Also, can you check whether this NAND works well in U-Boot on any of your sets? A good site for MTD utils is:
http://processors.wiki.ti.com/index.php/Mtdutils

BR
Ivan

0 Rod Campbell1 over 9 years ago in reply to Biser Gatchev-XID

Prodigy 110 points

We are using Buildroot to manage the kernel and rootfs build. Here's the Linux version info from the boot log:

Linux version 3.2.0 (rodc@rodc-linux-lap) (gcc version 4.7.3 20130328 (prerelease) (crosstool-NG linaro-1.13.1-4.7-2013.04-20130415 - Linaro GCC 2013.04) ) #24 Mon Sep 15 15:35:36 EDT 2014

0 Rod Campbell1 over 9 years ago in reply to Wolfgang Muees1

Prodigy 110 points

We are using BCH8. In the board setup code this is ecc_opt = OMAP_ECC_BCH8_CODE_HW.

The Micron datasheet for our flash says that 4-bit ECC is the minimum required ECC. I am pretty certain that we did not change this from what TI uses in the original EVM code, and the datasheet for the flash they use says the same thing as ours regarding ECC.

0 Rod Campbell1 over 9 years ago in reply to Ivan Matrakov

Prodigy 110 points

We are using SDK am335x-evm-05.06.00

I have not "torture tested" the "nand erase.chip" and "nand write" commands in U-Boot, but I have used them several times and have not noticed any issues. One thing I did notice: when I have the boot-up problem with the rootfs, then reboot, stop at U-Boot, and run "nand bad" to check for bad blocks, no bad blocks are ever detected.

The NAND configuration is done in board-am335xevm.c:

static void evm_nand_init(int evm_id, int profile)
{
   struct omap_nand_platform_data *pdata;
   struct gpmc_devices_info gpmc_device[2] = {
       { NULL, 0 },
       { NULL, 0 },
   };

   setup_pin_mux(nand_pin_mux);
   /* Build the NAND platform data from the partition table and GPMC timings */
   pdata = omap_nand_init(am335x_nand_partitions,
       ARRAY_SIZE(am335x_nand_partitions), 0, 0,
       &am335x_nand_timings);
   if (!pdata)
       return;
   /* BCH8 ECC in hardware, with the ELM used for error location */
   pdata->ecc_opt = OMAP_ECC_BCH8_CODE_HW;
   pdata->elm_used = true;
   gpmc_device[0].pdata = pdata;
   gpmc_device[0].flag = GPMC_DEVICE_NAND;

   omap_init_gpmc(gpmc_device, sizeof(gpmc_device));
   omap_init_elm();
}

0 Ivan Matrakov over 9 years ago in reply to Rod Campbell1

TI__Expert 4475 points

Hi,

Did you change the timing parameters in the am335x_nand_timings structure?

BR
Ivan

0 Rod Campbell1 over 9 years ago in reply to Ivan Matrakov

Prodigy 110 points

No, the parameters were left unchanged. And I re-compared the datasheets for the NAND that TI used on the EVM and the NAND that we use on our boards. These are both Micron NANDs and the timing, DC and AC electrical characteristics are identical.

I have discovered that the file system we picked for our rootfs, squashfs, may have problems when used with NAND. It is a read-only fs, but it apparently does not do NAND bad-block detection. So it could apparently exhibit intermittent problems even though it never modifies the NAND. Squashfs is running on top of MTD.

However, my device (which is currently exhibiting the problem I reported) does not have a bad block problem. I checked the NAND with U-Boot's bad block checker. From Linux I also ran "nanddump -f /dev/null /dev/mtd7" to have a NAND-aware utility read over the partition (/dev/mtd7). Here is the result:

[root@MT2000 sbin]# nanddump -f /dev/null /dev/mtd7
ECC failed: 0
ECC corrected: 5
Number of bad blocks: 0
Number of bbt blocks: 0
Block size 131072, page size 2048, OOB size 64
Dumping data starting at 0x00000000 and ending at 0x08000000...
ECC: 1 corrected bitflip(s) at offset 0x00537800
[root@MT2000 sbin]#

So it found some ECC corrections and corrected a bitflip, but found no bad blocks. I don't know whether the bitflip or the ECC correction is something that squashfs would fail to detect, causing it to read incorrect data and report the errors I showed earlier.

Does anyone know if this could explain the issues I reported?

Does someone have an idea about what read-only rootfs to use with NAND?

Thanks.

0 Ivan Matrakov over 9 years ago in reply to Rod Campbell1

TI__Expert 4475 points

Hi Rod,

Usually squashfs works with a journal. Are you sure that you switched off the journal in your squashfs? Which version of the squashfs filesystem do you use?

BR
Ivan

0 Rod Campbell1 over 9 years ago in reply to Ivan Matrakov

Prodigy 110 points

Hi Ivan,

We are using squashfs 4.0. I don't see anything about a journal either in the make menuconfig for the Linux squashfs options or in the documentation for squashfs. Squashfs is a read-only filesystem, so wouldn't a journal be unnecessary and unwanted?

When I dig into the web info, there are definitely some warnings about not using squashfs with NAND unless you have an underlying way to manage the bad blocks (like using UBIFS underneath). Here are a few:

Dealing with NAND bad blocks in production programming 12/2013
http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/100/t/251129.aspx

NAND flash bad blocks combined with squashfs could be a dangerous combination 5/2010
http://www.lg-hack.info/cgi-bin/sn_forumr.cgi?cid=2675&fid=2679&tid=2721

Discussion about FS. There is a link inside warning about using squashfs with nand
http://wiki.openwrt.org/doc/techref/filesystems#squashfs

Some of these links are older, but there does not seem to be any update to squashfs that I can find about being able to use it with NAND by itself. Also, I'm not sure if the AM335x on-board NAND assist (the ELM) changes any of this.

Thanks,

Rod

0 Wolfgang Muees1 over 9 years ago in reply to Rod Campbell1

Genius 3685 points

Using squashfs over bare NAND is not a sensible idea.

Even if your data on the squashfs is read-only, there will be bits toggling in the NAND device, and you need a layer of wear/error management above bare NAND. If you use parts of the NAND in R/W mode, writing to the NAND might toggle bits inside the R/O area.

Using UBI is a good idea. You can also use a R/O partition of ubifs to store your firmware (as we do). It works well.

0 Rod Campbell1 over 9 years ago in reply to Wolfgang Muees1

Prodigy 110 points

Wolfgang,

Thanks for your input. I contacted the people who work on MTD and UBI at linux-mtd@lists.infradead.org. They replied, just as you did, that UBI is a good answer. The link at the bottom of the following email snippet points to documentation on how to set up UBI for just this usage:

--------------------------------

On Mon, Nov 24, 2014 at 10:21:16PM +0100, Richard Weinberger wrote:
> On Mon, Nov 24, 2014 at 9:41 PM, Rod Campbell <rod.campbell@telosalliance.com> wrote:
> > Hello,
> >
> > Is there an update about implementing a read-only file system on
> > NAND? The most recent discussion I could find was from May 2012:
> >
> > http://lists.infradead.org/pipermail/linux-mtd/2012-May/041200.html
> >
> > The conclusion seemed to be to use ubi underneath something like
> > squashfs to provide bad-block awareness. Is this still the recommended solution?
>
> Yes. Using the new UBI block driver this is trivial.

To expand on that:

http://www.linux-mtd.infradead.org/doc/ubi.html#L_ubiblock

Brian

--------------------------------

So I think using UBI underneath squashfs is how we will proceed. The main thing I'm wondering about is increased boot time. But data integrity comes first.

One gotcha I experienced tracking this down is that the reports I read talked about squashfs not handling "bad blocks". And as far as I know (after running nand-aware utilities like nanddump and nandtest) this NAND MTD partition does NOT have bad blocks. But it DOES have a "bitflip" reported when I run nanddump. So I guess there are other issues that cause squashfs to read bad data besides an actual "bad block" situation.

I say this not because I want to try to use squashfs without UBI underneath, but because I thought that there may be some additional issue causing the intermittent problems we are experiencing - such as ECC not set up right, other NAND parameters not right, difference between NAND parameters in u-boot compared to linux and compared to the NAND chip itself, etc.

Thanks,

Rod

0 Rod Campbell1 over 9 years ago in reply to Wolfgang Muees1

Prodigy 110 points

I wanted to add some information about what we ended up doing.

I wanted to use UBI under squashfs, but it seems that this capability is not available until Linux 3.15 or so. We are using 3.2.0, and I did not want to do kernel patching, etc. to get this capability into our kernel.

So I decided to use ubifs. I tell ubi that the volume is static (as opposed to dynamic), and mount the ubifs rootfs as ro. This has been working well in my trials so far. The only problem is that the boot times are longer. We had a 24 second boot time using squashfs on NAND. Now we have a 30 second boot time using ubifs. Obviously the squashfs version was unreliable, so we needed to do something.
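For anyone sizing a UBI volume on a similar part, here is a rough Python sketch of how much of each 128 KiB erase block is left for data once UBI's two headers are placed. It follows the layout described in the UBI documentation (EC header at the start of the block, VID header at the next sub-page boundary, data starting at the next page boundary). Whether sub-page writes are available depends on the NAND driver and ECC setup; the first call below assumes they are not, which is an assumption about this configuration.

```python
UBI_EC_HDR_SIZE = 64    # size of the UBI erase counter header in bytes
UBI_VID_HDR_SIZE = 64   # size of the UBI volume ID header in bytes


def align_up(value, alignment):
    """Round `value` up to the next multiple of `alignment`."""
    return -(-value // alignment) * alignment


def leb_size(peb_size, min_io, sub_page):
    """Usable bytes per physical erase block after the two UBI headers."""
    vid_hdr_offset = align_up(UBI_EC_HDR_SIZE, sub_page)
    data_offset = align_up(vid_hdr_offset + UBI_VID_HDR_SIZE, min_io)
    return peb_size - data_offset


PEB_SIZE = 0x20000   # erase size from /proc/mtd
PAGE_SIZE = 2048     # page size reported by nanddump

no_subpage = leb_size(PEB_SIZE, PAGE_SIZE, PAGE_SIZE)  # assumed: no sub-page writes
with_subpage = leb_size(PEB_SIZE, PAGE_SIZE, 512)      # for comparison only

pebs_in_rootfs = 0x08000000 // PEB_SIZE                # RootFS partition size
print(f"LEB size without sub-pages: {no_subpage} bytes")
print(f"LEB size with 512-byte sub-pages: {with_subpage} bytes")
print(f"upper bound for a {pebs_in_rootfs}-block partition: "
      f"{pebs_in_rootfs * no_subpage} bytes before UBI's own reserved blocks")
```

The upper bound ignores the erase blocks UBI keeps back for bad-block handling and its own bookkeeping, so the real volume capacity is somewhat lower.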

In my research I saw that there is a way to speed up ubifs mount times, "UBI Fastmap", but that is also only available in a later kernel (3.7). Here is a link that mentions both UBI under squashfs and UBI Fastmap:
http://free-electrons.com/pub/conferences/2014/elc/opdenacker-boot-time/
