Bad Data CRC during kernel boot

Richard Counts

Other Parts Discussed in Thread: DA8XX

I realize it is unlikely anyone will have an answer, but we are soliciting experiences. We are using the 6446 DaVinci chip and booting from NAND. This product has been in production for some time with no problems. We are starting to see booting problems with some systems, more or less randomly, in the field and in production. Once a system gets into the no-boot mode it continues. However, if we rewrite the kernel image to the exact same place, everything works just fine.

Investigation with nanddump shows the kernel is indeed corrupt with one or two bits flipped.

Any insights would be welcome. We are using u-boot 1.3.4.

The actual error message we get is:

Loading from NAND 256MiB 1,8V 8-bit, offset 0xa0000
Image Name: Linux-2.6.10_mvl401-davinci_evm_
Image Type: ARM Linux Kernel Image (uncompressed)
Data Size: 1448100 Bytes = 1.4 MB
Load Address: 80008000
Entry Point: 80008000
## Booting kernel from Legacy Image at 80700000 ...
Image Name: Linux-2.6.10_mvl401-davinci_evm_
Image Type: ARM Linux Kernel Image (uncompressed)
Data Size: 1448100 Bytes = 1.4 MB
Load Address: 80008000
Entry Point: 80008000
Verifying Checksum ... Bad Data CRC
ERROR: can't get kernel image!
BeadPix U-Boot >
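For readers who want to reproduce the "Verifying Checksum" step offline on a dumped image, here is a minimal sketch in Python. It assumes the standard 64-byte legacy uImage header (big-endian, magic 0x27051956) and a data CRC that is a plain CRC-32 over the payload. The function names and the OS/arch/type values in the helper are illustrative only, and the header CRC is deliberately not checked.

```python
import struct
import zlib

UIMAGE_MAGIC = 0x27051956
# magic, hcrc, time, size, load, ep, dcrc, os, arch, type, comp, name[32]
HEADER = struct.Struct(">7I4B32s")  # 64 bytes, big-endian


def check_legacy_image(blob):
    """Return 'OK', 'Bad Magic', 'Truncated' or 'Bad Data CRC' for a uImage blob."""
    if len(blob) < HEADER.size:
        return "Truncated"
    magic, _hcrc, _time, size, _load, _ep, dcrc, *_rest = HEADER.unpack_from(blob, 0)
    if magic != UIMAGE_MAGIC:
        return "Bad Magic"
    data = blob[HEADER.size:HEADER.size + size]
    if len(data) < size:
        return "Truncated"
    # The data CRC covers only the payload that follows the header.
    if (zlib.crc32(data) & 0xFFFFFFFF) != dcrc:
        return "Bad Data CRC"
    return "OK"


def make_legacy_image(data, name=b"demo"):
    """Hypothetical helper: wrap data in a header so the check can be exercised.

    The header CRC field is left at 0 because this sketch never verifies it.
    """
    dcrc = zlib.crc32(data) & 0xFFFFFFFF
    header = HEADER.pack(UIMAGE_MAGIC, 0, 0, len(data),
                         0x80008000, 0x80008000, dcrc,
                         5, 2, 2, 0, name)  # illustrative os/arch/type/comp values
    return header + data


if __name__ == "__main__":
    image = make_legacy_image(bytes(range(256)) * 16)
    print(check_legacy_image(image))
    damaged = bytearray(image)
    damaged[HEADER.size + 1000] ^= 0x04  # flip a single bit in the payload
    print(check_legacy_image(bytes(damaged)))
```

A single flipped bit anywhere in the payload is enough to change the CRC, which is consistent with a one- or two-bit corruption stopping the boot.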

over 14 years ago

0 Jeff Hayes over 14 years ago

TI__Expert 6850 points

Richard,

Can you explain what you mean when you say, "Once a system gets into the no-boot mode it continues."? What particular NAND device are you using, and are you aware of the manufacturer's claims with regard to device lifetime and the number of reads/writes guaranteed? How often are you booting these devices and how often are you re-flashing?

0 Richard Counts over 14 years ago in reply to Jeff Hayes

Prodigy 60 points

Hi, thank you for the help. What we mean by "no-boot mode continues" is that typically a system will be working just fine for some time. Booting two or three times per day is average. Then we will get the CRC DATA error. From that time on the system will always show the CRC DATA error.

If we re-flash only the kernel, the system will boot just fine. The no-boot condition sometimes happens after only 1 day of use and sometimes after more than a year of use. We have run reliability tests with over 8000 re-boots and NO failures. So the reproduction of this issue is very problematic.

We are using the NAND flash part number MT29F2G08, which is a Micron 2 Gbit part. Actual markings on chip (9GD12 NW101).

When we examine the contents of the NAND using nanddump we see a 1 or 2 bit difference between the original image and the image nanddump gets. If we simply reflash only the kernel, the system boots OK. As far as the NAND reliability goes, we only see the problem in the kernel space, which should only be written once during manufacturing and only read during boot. We use u-boot to flash the kernel.
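The comparison between the original image and a dump can be sketched in Python as below. This is only an illustration: it assumes both inputs hold page data only (no OOB bytes interleaved) and a 2048-byte page, and the function names are made up for the example.

```python
def flipped_bits(reference, dump):
    """List (byte_offset, bit_index, direction) for every bit that differs."""
    diffs = []
    for offset, (good, read) in enumerate(zip(reference, dump)):
        delta = good ^ read
        for bit in range(8):
            if (delta >> bit) & 1:
                direction = "1->0" if (good >> bit) & 1 else "0->1"
                diffs.append((offset, bit, direction))
    return diffs


def group_by_page(diffs, page_size=2048):
    """Group flips by NAND page so errors sharing a page stand out."""
    pages = {}
    for offset, bit, direction in diffs:
        entry = (offset % page_size, bit, direction)
        pages.setdefault(offset // page_size, []).append(entry)
    return pages


if __name__ == "__main__":
    reference = bytes(4096)          # stand-in for the original kernel image
    dump = bytearray(reference)
    dump[2100] |= 0x10               # simulate one flipped bit in the second page
    print(group_by_page(flipped_bits(reference, bytes(dump))))
```

Grouping by page is useful here because ECC is computed per page (or per 512-byte piece of a page), so the position of a flip within its page matters.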

U-Boot loader version 1.3.4

Linux Kernel version MontaVista 2.6.10

The flash is partitioned as follows:

Bootloader: 512K

u-boot params: 128K

kernel: 4MB

Filesystem: 128MB
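As a quick cross-check of this layout against the boot log, here is a small Python sketch. It assumes the partitions are contiguous, start at offset 0 and appear in the order listed, which the post does not state explicitly.

```python
KIB = 1024
MIB = 1024 * KIB

# Sizes as listed in the post; order and contiguity are assumptions.
PARTITIONS = [
    ("bootloader", 512 * KIB),
    ("u-boot params", 128 * KIB),
    ("kernel", 4 * MIB),
    ("filesystem", 128 * MIB),
]


def layout(partitions):
    """Return (name, start_offset, size) tuples for back-to-back partitions."""
    table = []
    offset = 0
    for name, size in partitions:
        table.append((name, offset, size))
        offset += size
    return table


if __name__ == "__main__":
    for name, start, size in layout(PARTITIONS):
        print(f"{name:14s} start=0x{start:08x} size=0x{size:08x}")
```

Under these assumptions the kernel partition starts at 512K + 128K = 0xa0000, which agrees with the "offset 0xa0000" reported when U-Boot loads the image.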

0 Richard Counts over 14 years ago in reply to Richard Counts

Prodigy 60 points

One other tidbit of information. When we read the NAND with u-boot we get different bit errors in the kernel image than we get with Linux nanddump. The two images agree, except that in general nanddump will show one of the two errors moved or not existing at all.

So to summarize, if we use u-boot to get the kernel image, we see it is correct with the exception of two bits in the same page which are flipped, i.e. 0 to 1 or 1 to 0. Now if we boot from RAM to bypass u-boot, we see the NAND has a bad kernel image, but usually one of the flipped bits will be either not flipped or moved.

Any ideas would be welcome! Micron says that their NAND is perfect and that this is not a part problem but a software issue. Once the kernel boots, the NAND is not really used, and it's never written after the initial manufacturing write either. So we don't see how we could wear it out.

0 Duy Khanh Tran over 14 years ago in reply to Richard Counts

Prodigy 10 points

Hi,

we have the same problem with Micron's MT29F1G SLC NAND flash. I think the reason you don't see the bit flipping when reading in Linux is that you have ECC enabled, so the bit flips are detected and fixed when reading from the NAND flash.

Have you found a solution to increase the reliability of the system? I mean from the bootstrap and U-Boot through to the Linux system? We switched to UBIFS and are using HW ECC, but still need to increase the reliability of the bootstrap and U-Boot.

Best regards, TDK

0 Zhiqian Wang over 14 years ago in reply to Duy Khanh Tran

Prodigy 10 points

Duy Khanh Tran said:

Hi,

we have the same problem with Micron's MT29F1G SLC NAND flash. I think the reason you don't see the bit flipping when reading in Linux is that you have ECC enabled, so the bit flips are detected and fixed when reading from the NAND flash.

Have you found a solution to increase the reliability of the system? I mean from the bootstrap and U-Boot through to the Linux system? We switched to UBIFS and are using HW ECC, but still need to increase the reliability of the bootstrap and U-Boot.

Best regards, TDK

We had exactly the same problem and got it fixed. The problem was with the timing of reading the FSR register after the addr_calc_st bit is set in FSR. It is very subtle: sometimes the read returns wrong results, and the bit flip is then skipped. In other words, there is a bit flip in the NAND, and the OMAP-L138's ECC hardware doesn't report it if FSR is read at the wrong time.

We fixed it by adding a udelay(this->chip_delay) after NANDFCR's bit 13 is set to start the address calculation in the uboot/cpu/arm926ejs/da8xx/nand.c module.

Also, the armubl we are using has the same timing issue and needs to be fixed too. Otherwise, it won't even start uboot if uboot's partition has bit flips.

Hope it helps

0 Richard Counts over 14 years ago in reply to Zhiqian Wang

Prodigy 60 points

In the end we found that our problem was related to a major bug in u-boot version 1.3.4. That version of u-boot does not properly support large page size NAND devices. It is too bad that neither TI nor DENX makes it easy to see which bugs have been verified in the various versions of u-boot. In many industries, upgrading to new versions of third-party software after market release is very difficult and time-consuming. Often the reply we get from support groups is to just upgrade to the latest version. This cannot be taken on lightly.

Below is a copy of our final analysis for this issue. Be careful with this version of u-boot, because this failure can take many months to surface as the NAND devices degrade over very long periods of time. Without proper ECC correction, NAND devices won't be reliable in any case.

Problem Summary – Boot Failure/Kernel Corruption

· Systems failing to boot due to corrupted Linux Kernel.

◦ May fail on first attempt to boot

◦ May boot successfully many times before failure is observed

◦ May not fail at all

· Once failure has occurred, the unit cannot recover on its own

· Reflashing just the kernel appears to correct the issue, at least temporarily

Failure Analysis

· Failed systems all show an invalid CRC calculated for the kernel image stored on the NAND flash.

· When compared to a valid kernel image, the corrupt image showed either a 1 or 2 bit error.

· Each corrupted kernel showed corruption at a different bit address.

· System fails to boot due to a CRC error detected in the Linux kernel image, not due to a read failure.

Root Cause

The root cause of these problems was that the NAND driver in the old U-Boot (1.3.4) had defective ECC usage, and was incompatible with the Kernel's NAND driver.

The u-boot driver did not compute valid ECC codes for bit error detection/correction over an entire page, only over the first 512 bytes.

◦ One-bit error in first 512 bytes could be corrected.

◦ One or more bit errors in second, third or fourth 512-byte area could not be detected or corrected

This would affect any page written by the U-Boot driver on initial flashing, with undetermined side effects – it would all depend on what data may be corrupted.

ECC Usage and Effects of defective ECC

To be able to correct a 1-bit error and detect 2 or more bit errors requires 24 bits of ECC per 512 bytes of encoded data, or 3 bytes per 512 bytes.

For 512-byte per page NAND flash chips, this was sufficient to cover the entire page.

For 2048-byte per page NAND, such as the Micron NAND chip, 4*3 bytes or 12 bytes of ECC were needed.

The old U-Boot driver would write 12 bytes of ECC data; however, the 4th through 12th bytes were all 0's, regardless of the page data.

To validate the integrity of the data, ECC algorithms compare the ECC computed while reading the media with the ECC stored on the media. By incorrectly computing the ECC (both on the write and on the read-back), U-Boot would fail to detect any bit errors occurring in bytes 512 through 2047. Three quarters of every page written and read by U-Boot would not have any ECC protection. The only way an error would be detected would be by the kernel CRC check.
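The effect described above can be modelled with a short Python sketch. This is not the DaVinci hardware's ECC bit layout. It is a generic Hamming-style parity code that uses 24 bits per 512 bytes, corrects 1-bit errors and detects 2-bit errors. The covered_chunks parameter is a made-up way of modelling a driver that only computes ECC for the first 512 bytes and leaves the rest as zeros on both write and read.

```python
CHUNK = 512    # bytes protected by one 24-bit ECC
PAGE = 2048    # large-page NAND: four chunks per page


def ecc24(chunk):
    """24-bit parity code: XOR of the positions of all set bits, plus its complement form."""
    odd = even = 0
    for index, byte in enumerate(chunk):
        for bit in range(8):
            if (byte >> bit) & 1:
                pos = index * 8 + bit      # 12-bit position, 0..4095
                odd ^= pos
                even ^= pos ^ 0xFFF
    return (even << 12) | odd


def page_ecc(page, covered_chunks):
    """ECC per chunk; chunks at or beyond covered_chunks get 0 (old-driver model)."""
    return [ecc24(page[i * CHUNK:(i + 1) * CHUNK]) if i < covered_chunks else 0
            for i in range(PAGE // CHUNK)]


def classify(diff):
    """Interpret computed XOR stored ECC for one chunk."""
    if diff == 0:
        return "ok"
    odd, even = diff & 0xFFF, diff >> 12
    if (odd ^ even) == 0xFFF:
        return "correctable"    # single flip, located at bit position `odd`
    return "uncorrectable"


def read_status(page, stored, covered_chunks):
    """Recompute ECC the same way it was written and compare chunk by chunk."""
    computed = page_ecc(page, covered_chunks)
    return [classify(c ^ s) for c, s in zip(computed, stored)]


if __name__ == "__main__":
    page = bytes((i * 7) & 0xFF for i in range(PAGE))
    old_ecc = page_ecc(page, covered_chunks=1)   # only first 512 bytes covered
    new_ecc = page_ecc(page, covered_chunks=4)   # full page covered
    damaged = bytearray(page)
    damaged[1000] ^= 0x01                        # one flip in the second chunk
    print("old driver:", read_status(damaged, old_ecc, 1))
    print("full ECC:  ", read_status(damaged, new_ecc, 4))
```

In this model a flip at byte 1000 passes unnoticed when only the first chunk is covered, while full-page coverage flags it as correctable, which matches the behaviour the analysis describes.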

This was proved by testing the ECC effectiveness over an entire NAND Flash page using the old U-Boot Flash Driver and the Linux kernel NAND driver. It was found that the old NAND driver would fail to detect many errors, and worse yet, would not provide ECC coverage for most of the NAND Page, meaning the kernel driver would not be able to correct or detect 1 or more bit errors in a page.

The kernel NAND driver was tested using the exact same tests, to determine whether or not its NAND driver was defective. It was found that the kernel NAND driver behaved as expected. The test results are also included below.

Solution: Replace U-Boot with Mainline version with fixed NAND Drivers.

The mainline version of U-Boot had fixed the NAND driver and ECC usage, so to fully enable ECC correction and detection over an entire page, the new NAND driver would be necessary. Once the mainline U-Boot source was ported to the system hardware, the ECC was again tested, to ensure that the ECC codes fully covered the entire page. Testing showed that the new U-Boot did write ECC codes that covered the full page of data, allowing up to 4 bits of data to be corrected per page over the entire media, regardless of whether the data is read by U-Boot or by the Kernel NAND driver.

0 John Anderson over 14 years ago in reply to Richard Counts

Genius 5240 points

Thanks for posting this Richard. It appears the version delivered with DVSDK 4 (dm365) is much later. And they've gone to using year and month as a version number now.

John A

0 Leon Pollak over 13 years ago in reply to Richard Counts

Intellectual 960 points

Dear Richard.

Thank you for this very informative mail.

I tried to upgrade the u-boot to the latest TI-provided version (in DVSDK-4.02). The problem was that this upgrade pulls in an upgrade of everything else! And this seems to be almost impossible: some of the drivers and framework seem not to exist, some have changed drastically, etc.

Could you be so kind as to share how you solved this by updating ONLY the u-boot?

Many thanks in advance for any help.

0 Leon Pollak over 13 years ago in reply to Leon Pollak

Intellectual 960 points

OK, the problem of update is solved - we have now the new U-Boot running.

What remains still unclear is the following:

The old U-Boot burned not only the kernel (which it reads back by itself) but also the YAFFS root file system, which is read by the kernel. If the kernel has the correct ECC algorithms while the old U-Boot has the incorrect ones, how does the kernel succeed in reading the YAFFS data with its bad ECC correctly?

I will be very thankful for clarification.

0 Borisa Jevtic over 8 years ago in reply to Leon Pollak

Prodigy 10 points

Does anybody know if this was fixed and, if so, in which version of u-boot?

I am afraid to use the latest u-boot as I am running old hardware:

U-Boot 1.3.4T2 (Feb 28 2012 - 20:31:25) - AT91SAM9260@200 MHz -
DRAM: 64 MB
NAND: 256 MiB
Net: smc911x, macb0
