[EZSDK] NAND ECC Errors

Jon S.

Other Parts Discussed in Thread: AM3517

Hi there,

I'm seeing a large number of JFFS2 bad-block and NAND CRC errors during normal use on our DM8148-based platform running off of NAND. The error messages are either "Read of newly-erased block at <address> failed: -74. Putting on bad_list" or CRC check failures.

Aside from the stream of messages, the OS runs fine. I'm concerned that the entire NAND will gradually be mismarked "bad" due to issues with the CRC scheme.

This exact issue appears to be covered in this forum post, but the fix appears to be relevant only for the IPNC RDK.

http://e2e.ti.com/support/embedded/linux/f/354/t/191572.aspx

Could someone shed some light on this as it pertains to the DM8148/8168 EZSDKs?

With respect to ECC algorithm, I've been using the defaults provided with the EZSDK, and have followed the PSP Flashing Tools Guide for flashing U-Boot, the kernel and the rootfs to NAND, which indicates BCH8 should be used.

I'm using EZSDK 5.05 for the DM8148.

Thank you,

Jon

over 12 years ago

0 Jon S. over 12 years ago

Expert 1240 points

After further research, I see that my question has already been very clearly answered in this thread by Pekon Gupta. (Many thanks, Pekon!)

http://e2e.ti.com/support/embedded/linux/f/354/p/230270/808995.aspx#808995

From the PSP user guides, it was not clear to me that a couple of one-line changes need to be made to the EZSDK-provided kernel when using NAND + JFFS2.

However, I certainly might have missed something. I'd still love to hear any further details folks have to share regarding the ECC scheme situation for the 8148, especially if changes are planned in future EZSDK releases.

- Jon

0 Jon S. over 12 years ago in reply to Jon S.

I unmarked the above post as "Verified" as I'm still running into some issues. I've tried to re-summarize the issue here.

I should also note that we're using a 29F2G16ABAEA NAND, which I believe should be the same as the NAND on the EVM, or at least have the same timing parameters.

From my understanding (based upon the PSP documentation), BCH8 *cannot* be used for JFFS2 on NAND devices. Please correct me here if I'm wrong.

I had been using JFFS2 + NAND, and running into messages like the one below. When I wrote files to the JFFS2 file system, I'd see plenty of these errors, and blocks would be mismarked bad. Over time, the NAND had little free space left, due to blocks being mismarked bad.

mtd->read(0x100 bytes from 0x0) returned ECC error

At this point, I started following the post I linked above, switching from BCH8 to Hamming for the filesystem. (I also see that anything accessed from Linux should use Hamming as well.)

Below is the procedure I followed, and the issues encountered:

(1) Per Pekon's E2E post, I applied the two attached patches and rebuilt the kernel, to replace the use of BCH8 with Hamming code.

5483.board-flash-hamming.patch.txt

3683.omap2-nand-hamming.patch.txt

(2) Reflashed the kernel and root file system, ensuring that I switched to Hamming ECC mode prior to flashing (nandecc hw 0). I should note here that for both U-Boot and U-Boot-min, BCH8 is used (nandecc hw 2).

(3) Booted the system. The system appears to boot fine and everything operates as expected. Our applications run fine; we're able to capture and record video.

I exercised the flash a little bit, using dd (if=/dev/urandom). I no longer saw any of the MTD messages noting CRC failures.

However, I do see a failure in the early boot stages, which appears to be a failed write and CRC error. (Looks like single bit errors to me, based upon a diff of the "write" and "read" data.) A boot log is attached; you can see this failure at the end of the log:

6786.first_boot_after_flash.txt

(4) Rebooted the system. At this point we start hitting kernel oopses and panics, somewhere in the JFFS2 code. Log attached:

5775.kernel_crash.txt

0 Pekon Gupta over 12 years ago in reply to Jon S.

TI__Prodigy 540 points

Hi Jon,

Jon S. said:

However, I do see a failure in the early boot stages, which appears to be a failed write and CRC error. (Looks like single bit errors to me, based upon a diff of the "write" and "read" data.) A boot log is attached; you can see this failure at the end of the log:

Yes, you are correct: there are single-bit flips in the data dump you provided. But the single-bit flips occur at multiple places within the same page, so I assume the HAM1 ECC was not able to correct them.

(a)

00000190: 2e 69 73 2d 77 72 69 74 65 61 62 6c 65 ff ff ff

00000190: 2e 69 73 2d 77 72 69 74 65 61 62 6c 65 ff ff df

(b)

00000200: 17 08 cc d5 aa ed a7 87 df 3d a5 4a 37 30 2d 70

00000200: 17 08 cc d5 aa ef a7 87 df 3d a5 4a 37 30 2d 70
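The two dump excerpts above can be compared mechanically. Below is a minimal Python sketch, assuming each pair of lines shows the same 16 bytes from the "write" and "read" sides of the verify error; the helper name `flipped_bits` is made up for illustration and is not part of any TI tool.

```python
def flipped_bits(first_hex, second_hex):
    """Return (byte_offset, bit_number) for every bit that differs."""
    first = bytes.fromhex(first_hex)
    second = bytes.fromhex(second_hex)
    flips = []
    for offset, (a, b) in enumerate(zip(first, second)):
        diff = a ^ b  # set bits mark the positions that changed
        for bit in range(8):
            if (diff >> bit) & 1:
                flips.append((offset, bit))
    return flips


# Byte values copied from excerpts (a) and (b), without the address column.
PAIR_A = ("2e 69 73 2d 77 72 69 74 65 61 62 6c 65 ff ff ff",
          "2e 69 73 2d 77 72 69 74 65 61 62 6c 65 ff ff df")
PAIR_B = ("17 08 cc d5 aa ed a7 87 df 3d a5 4a 37 30 2d 70",
          "17 08 cc d5 aa ef a7 87 df 3d a5 4a 37 30 2d 70")

print(flipped_bits(*PAIR_A))  # [(15, 5)] -> address 0x19f, bit 5
print(flipped_bits(*PAIR_B))  # [(5, 1)]  -> address 0x205, bit 1
```

Each excerpt differs in exactly one bit, but the two flips sit at different addresses within the same page, which is the situation described above.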

To confirm this, could you please provide the logs of your first-boot sequence (when the CRC error was first observed) in the following two cases:

CASE-1: Apply the following patch and re-run the setup without any other change. (The diff below is based on the latest code in the Arago/omap3 git repo.)

0245.omap2_v1.patch.txt

CASE-2: Use ECC-type==HAM1_SW.

Use the S/W implementation for calculating the HAM1 ECC code instead of the H/W one. There will be some performance penalty during NAND R/W, but this will help in isolating the problem. You can change the ECC type by following the directions at this link:

http://processors.wiki.ti.com/index.php/TI81XX_PSP_UBOOT_User_Guide#BCH_Flash_OOB_Layout

with regards, pekon

0 Jon S. over 12 years ago in reply to Pekon Gupta

Hello Pekon,

Thank you for taking a look at this. I will obtain that information first thing when I return to the office tomorrow, and post it here ASAP.

In the meantime, here's a more extensive log that I collected yesterday, for what it's worth.

4705.log.txt

Some items of interest in this log:

Line 51: The command used to flash the JFFS2 rootfs from U-Boot.
Line 187: 'nand dump' of the page that will cause problems at the first boot.

This is one that was "emptied" via the mw.b $load_addr 0xFF $rootfs_size prior to tftp'ing the image to $load_addr.

Line 719: The first boot's "Write verify error." Again, a diff shows a couple single bit errors in the same page.
Line 1131: After a reboot, this 'nand dump' from U-Boot shows that the data matches the "write" portion of the error message.

So this means the read back that occurred @ line 719 is what failed? (As opposed to the write?)
I'm not too familiar with the OOB layout, but aren't those 0's in this page's OOB suspicious?

Line 1634: A 'nand read' from this problematic page now shows 2 bad compares. (The two bad bytes we had?)
Line 1547: ECC errors preceding kernel panic, on 2nd boot.

Other notes:

I am using EZSDK version 5.05.02.01, which includes the PSP 04.04.00.01 (2.6.37) kernel. I see that a number of NAND-related fixes have been pushed to linux-omap3.git after this version (e.g., your 976d48c63a2d1a22f26832883ebafbbfbe7f9b8d commit).

Please let me know if there's a particular tag or changeset you recommend I switch to, given what I'm experiencing.

After unknowingly having BCH8 hardcoded in the kernel, but flashing my (JFFS2) rootfs from U-Boot using the HW Hamming ECC, my NAND was eventually "filled" with mismarked bad blocks, leaving me no free space. I was forced to perform a 'nand scrub'.

I did however try booting a NFS rootfs and running 'nandtest -m' to see if I could re-identify any blocks that might have been bad from the factory. No bad blocks were found after running quite a few iterations.
I will switch over to a "fresh" NAND to test this on as soon as I can get one (early this week), to rule out bad block issues.

Thank you for your time and help,

Jon

0 Pekon Gupta over 12 years ago in reply to Jon S.

Hi Jon,

I could re-create the issue you are seeing. I think there are 'uncorrectable' bit-flips occurring during NAND access.

Because you have enabled 'NAND.write_verify', these 'uncorrectable' ECC errors become visible during the NAND write accesses themselves (otherwise they would have been caught during NAND reads). Thus, NAND writes are failing and exiting.

So please apply the following patches, and send the logs for confirmation. (The patches below enable error reporting in the HAM1 ECC scheme.)

6318.omap2_v2.patch.txt

4382.nand_base_v2.patch.txt

with regards, pekon

0 Jon S. over 12 years ago in reply to Pekon Gupta

Hi Pekon,

Here's a log for Case 1. Note the "*Err bit-flips found" messages at lines 546, 547, and 1335-1347.

3808.log_case1.txt

I'm still working on getting a log for Case 2 (SW ECC). I tried making the two changes for this, but keep getting a stream of "NAND: OMAP2: omap_correct_data: *Err: bit-flips found" messages at boot, followed by a kernel panic. Here are the changes I made for this case:

Set SW ECC via 'nandecc sw' in U-Boot
Changed board_nand_data.ecc_opt = OMAP_ECC_HAMMING_CODE_DEFAULT in arch/arm/mach-omap2/board-flash.c (under the correct CPU conditional, of course)

Here's a log for the v2 patches. I was able to reboot multiple times without failure after applying these patches. "*Err" messages around lines 501, 1148, 1176, 1588.

4213.log_v2patches.txt

[Update] After further usage of the v2 patches, I did see some uncorrectable errors.

*Err: Correcting single bit ECC error at offset: 112, bit: 6
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: Correcting single bit ECC error at offset: 73, bit: 7
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: Correcting single bit ECC error at offset: 177, bit: 0
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: UNCORRECTED_ERROR

*Err: UNCORRECTED_ERROR default
mtd->read(0x70 bytes from 0x294d044) returned ECC error
*Err: Correcting single bit ECC error at offset: 112, bit: 6
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: Correcting single bit ECC error at offset: 73, bit: 7
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: Correcting single bit ECC error at offset: 177, bit: 0
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: UNCORRECTED_ERROR default
mtd->read(0x2c bytes from 0x294d844) returned ECC error
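A log like the one above can be summarized with a few lines of Python. This is only a sketch: the regular expression is written against the message text printed by the v2 patches as quoted here, and the function name `tally` and the shortened `SAMPLE` string are illustrative.

```python
import re
from collections import Counter

# Matches the "corrected" message format quoted in the log above.
CORRECTED = re.compile(
    r"Correcting single bit ECC error at offset: (\d+), bit: (\d+)")


def tally(log_text):
    """Count corrected flips per (offset, bit) and uncorrected errors."""
    corrected = Counter()
    uncorrected = 0
    for line in log_text.splitlines():
        match = CORRECTED.search(line)
        if match:
            corrected[(int(match.group(1)), int(match.group(2)))] += 1
        elif "UNCORRECTED_ERROR" in line:
            uncorrected += 1
    return corrected, uncorrected


# A shortened sample assembled from lines in the log above.
SAMPLE = """\
*Err: Correcting single bit ECC error at offset: 112, bit: 6
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: Correcting single bit ECC error at offset: 73, bit: 7
*Err: UNCORRECTED_ERROR
mtd->read(0x70 bytes from 0x294d044) returned ECC error
*Err: Correcting single bit ECC error at offset: 112, bit: 6
"""

corrected, uncorrected = tally(SAMPLE)
print(corrected.most_common(), uncorrected)
```

Seeing the same (offset, bit) pairs repeat across reads, as they do in the full log, suggests the flips are stored in the NAND rather than being transient read noise.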

Thank you,

Jon

0 Pekon Gupta over 12 years ago in reply to Jon S.

Hi Jon,

CASE-1: (Using HAM1_HW)

Jon S. said:

[Update] After further usage of the v2 patches, I did see some uncorrectable errors:

*Err: Correcting single bit ECC error at offset: 112, bit: 6
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: Correcting single bit ECC error at offset: 73, bit: 7
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: Correcting single bit ECC error at offset: 177, bit: 0
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: UNCORRECTED_ERROR

These 'uncorrectable' ECC errors are likely the real issue causing your kernel to panic. HAM1 (Hamming code) can detect up to 2-bit-flip errors but correct only 1-bit-flip errors, so somewhere your NAND is seeing multi-bit-flip errors.
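The "correct one, detect two" behaviour can be illustrated with a toy single-error-correcting, double-error-detecting code. The Python sketch below is not the OMAP HAM1 layout or the kernel's Hamming implementation; it is a simplified scheme (XOR of the positions of all set bits, plus one overall parity bit) chosen only to show the principle, and it assumes the stored ECC itself is never corrupted.

```python
def position_ecc(data):
    """Return (XOR of 1-based positions of all set bits, overall parity)."""
    pos_xor = 0
    parity = 0
    for byte_index, byte in enumerate(data):
        for bit in range(8):
            if (byte >> bit) & 1:
                pos_xor ^= byte_index * 8 + bit + 1  # +1 keeps position 0 unused
                parity ^= 1
    return pos_xor, parity


def check_and_correct(data, stored_ecc):
    """Check a bytearray against stored_ecc; repair a single flip in place."""
    pos_xor, parity = position_ecc(data)
    syndrome = pos_xor ^ stored_ecc[0]
    parity_changed = parity ^ stored_ecc[1]
    if syndrome == 0 and not parity_changed:
        return "clean"
    if parity_changed and 1 <= syndrome <= len(data) * 8:
        index = syndrome - 1  # an odd number of flips: treat it as one
        data[index // 8] ^= 1 << (index % 8)
        return "corrected"
    return "uncorrectable"  # even number of flips: position is unknown


original = bytearray(range(256)) * 2  # one 512-byte sector of test data
ecc = position_ecc(original)

one_flip = bytearray(original)
one_flip[112] ^= 1 << 6
print(check_and_correct(one_flip, ecc), one_flip == original)  # corrected True

two_flips = bytearray(original)
two_flips[73] ^= 1 << 7
two_flips[177] ^= 1 << 0
print(check_and_correct(two_flips, ecc))  # uncorrectable
```

With two flips the overall parity is unchanged while the position syndrome is non-zero, so the error is detected but cannot be located. Three or more flips in one sector can be silently miscorrected by any code of this class, which is why a sector that regularly sees multi-bit errors needs a stronger scheme.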

The patches do nothing except print the exact cause of the error and avoid returning a NAND->write_verify failure, so that your application continues. But eventually your application will crash somewhere, as your NAND data has been corrupted by 'uncorrectable' bit-flips.

Therefore I suggest moving to a stronger ECC scheme like BCH8. But given the constraints of the NAND OOB/spare area layout, BCH8 cannot be used along with JFFS2 on 2K/64B NAND devices, which have 64 bytes of OOB (spare) region per page. So you have the following choices:

(a) Move to UBIFS File-system. (preferred)

(b) Use a NAND device which has more OOB(spare) bytes per page of NAND (like 4K/224B NAND).

Caution: TI ROM code automatically selects BCH16 ECC scheme for 4K/224B NAND device. However both U-boot and Kernel currently do _not_ support BCH16. So you have to use some 3rd party tool to flash your U-Boot image on NAND, if using 4K/224B NAND device.
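The OOB constraint behind these choices is simple arithmetic. The Python sketch below uses figures that are assumptions rather than values stated in this thread: 14 ECC bytes per 512-byte sector for the BCH8 layout, 3 bytes per sector for HAM1, a 2-byte bad-block marker, and 8 bytes of free OOB for the JFFS2 cleanmarker. Check them against the PSP documentation for your kernel before relying on the result.

```python
SECTOR_BYTES = 512
BAD_BLOCK_MARKER_BYTES = 2   # assumption: marker size for this OOB layout
JFFS2_CLEANMARKER_BYTES = 8  # assumption: free OOB that JFFS2 wants per page

# Assumed ECC bytes stored per 512-byte sector for each scheme.
ECC_BYTES_PER_SECTOR = {"HAM1": 3, "BCH8": 14}


def oob_free_bytes(page_bytes, oob_bytes, scheme):
    """OOB bytes left after the bad-block marker and the ECC are placed."""
    sectors = page_bytes // SECTOR_BYTES
    used = BAD_BLOCK_MARKER_BYTES + sectors * ECC_BYTES_PER_SECTOR[scheme]
    return oob_bytes - used


for page, oob in ((2048, 64), (4096, 224)):
    for scheme in ("HAM1", "BCH8"):
        free = oob_free_bytes(page, oob, scheme)
        fits = free >= JFFS2_CLEANMARKER_BYTES
        print(f"{page}/{oob}B {scheme}: {free} bytes free, JFFS2 fits: {fits}")
```

Under these assumptions a 2K/64B page has 6 bytes left with BCH8, which is less than the cleanmarker needs, while HAM1 on the same page and BCH8 on a 4K/224B page both leave ample room. UBIFS avoids the problem because it does not store metadata in the OOB area.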

CASE-2: (Using HAM1_SW)

Also, a few observations from your log. I hope you are taking care of the following for CASE-2 (switching from HW_ECC to SW_ECC):

(1) To change the ECC scheme in the kernel you have to edit 2 separate files (not just 1):

$KERNEL/arch/arm/mach-omap2/board-flash.c: board_nand_init(): board_nand_data.ecc_opt = OMAP_ECC_HAMMING_CODE_DEFAULT;
$KERNEL/drivers/mtd/nand/omap2.c: omap_nand_probe(): pdata->ecc_opt = OMAP_ECC_HAMMING_CODE_DEFAULT; // (Hope you didn't miss this)

(2) Your U-Boot flashing command still contains the following // (hope you didn't miss this)

flash_rootfs=setenv serverip $tftp_server; mw.b $load_addr 0xFF $rootfs_size; if tftp $load_addr $tftp_path/$rootfs_file; then nandecc hw 0;
You need to change it to:
flash_rootfs=setenv serverip $tftp_server; mw.b $load_addr 0xFF $rootfs_size; if tftp $load_addr $tftp_path/$rootfs_file; then nandecc sw 0;

with regards, pekon

0 Jon S. over 12 years ago in reply to Pekon Gupta

Hi Pekon,

I was sure to change to 'nandecc sw' in (2), but I accidentally missed omap_nand_probe() in (1). I can perform the Case-2 tests and report back if you think it's worthwhile, but it sounds like I should just go with BCH8 + UBIFS. I'll start going that route.

I saw that there have been some UBIFS-related patches to linux-omap3.git in December 2012 -- any advice on whether I need to pull some patches down or just go with what's provided in the EZSDK? (EZSDK 5.05.02.00 comes with linux-2.6.37-psp04.04.00.01). For example, I saw this patch:

http://arago-project.org/git/projects/?p=linux-omap3.git;a=commit;h=243977171ae666f012cc38c76e28bc0fe3d532f5

Thank you for your time and help -- it's greatly appreciated.

Jon

0 Pekon Gupta over 12 years ago in reply to Jon S.

Hi Jon,

Yes, you should pull in all the latest TI81xx patches from the Arago/omap3 repo.

And choosing BCH8 + UBIFS is a better and more scalable solution for you.

Also, please mark this post as Answered | Verified if your issues are resolved. Thanks.

with regards, pekon

0 Jon S. over 12 years ago in reply to Pekon Gupta

Hi Pekon,

I'll be sure to do that. My apologies, I intended to mark the thread verified last night before I left.

Thank you very much for your time and help.

- Jon

0 Doug Brown over 12 years ago in reply to Jon S.

Intellectual 320 points

Hey there,

I just wanted to add my two cents -- I've been working with the AM3517 Craneboard which has the exact same NAND chip (MT29F2G16ABAEA). According to its datasheet, this particular NAND chip requires at least a 4-bit ECC per 528 bytes (512 + 16 OOB, presumably), except on the first 128KB block (which only requires a 1-bit ECC per 528 bytes). So I think that would explain why you were sometimes seeing more than one bit flip. I was seeing similar problems with 1-bit ECC so I switched to BCH4 for storing u-boot, the kernel, and the filesystem (JFFS2). It hasn't been very long, but so far, so good...hopefully it continues to work okay. Sounds like BCH8 + UBIFS is another great approach too--good to know!
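To put the datasheet requirement in perspective, the minimum parity size of a binary BCH code can be estimated from its correction strength. The Python sketch below relies on the standard bound that a t-error-correcting binary BCH code over GF(2^m) needs at most m*t parity bits; the function name is illustrative, and real controller layouts may pad each sector's ECC to a larger size than this minimum.

```python
def bch_parity_bytes(t, sector_bytes=512):
    """Bytes needed for m*t parity bits covering one sector."""
    data_bits = sector_bytes * 8
    m = 1
    # Smallest field size whose code length 2^m - 1 holds data plus parity.
    while (1 << m) - 1 < data_bits + m * t:
        m += 1
    return (m * t + 7) // 8  # round the bit count up to whole bytes


SECTORS_PER_2K_PAGE = 2048 // 512

for t in (4, 8, 16):
    per_sector = bch_parity_bytes(t)
    per_page = per_sector * SECTORS_PER_2K_PAGE
    print(f"BCH{t}: {per_sector} bytes per 512-byte sector, "
          f"{per_page} bytes per 2K page")
```

For 512-byte sectors this gives m = 13, so BCH4 needs 7 bytes per sector (28 per 2K page) and BCH8 needs 13 (52 per 2K page). That is consistent with BCH4 meeting a 4-bit requirement while leaving more of a 64-byte OOB free than BCH8 does.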

Doug
