Hi there,
I'm seeing a large number of JFFS2 bad-block and NAND CRC errors during normal use on our DM8148-based platform running off of NAND. The error messages are either "Read of newly-erased block at <address> failed: -74. Putting on bad_list" or CRC check failures.
Aside from the stream of messages, the OS runs fine. I'm concerned that the entire NAND will gradually be mismarked "bad" due to issues with the CRC scheme.
This exact issue appears to be covered in this forum post, but the fix appears to be relevant only for the IPNC RDK.
http://e2e.ti.com/support/embedded/linux/f/354/t/191572.aspx
Could someone shed some light on this as it pertains to the DM8148/8168 EZSDKs?
With respect to ECC algorithm, I've been using the defaults provided with the EZSDK, and have followed the PSP Flashing Tools Guide for flashing U-Boot, the kernel and the rootfs to NAND, which indicates BCH8 should be used.
I'm using EZSDK 5.05 for the DM8148.
Thank you,
Jon
After further research, I see that my question has already been very clearly answered in this thread by Pekon Gupta. (Many thanks, Pekon!)
http://e2e.ti.com/support/embedded/linux/f/354/p/230270/808995.aspx#808995
From the PSP user guides, it was not clear to me that a couple of one-line changes need to be made to the kernel provided in the EZSDK when using NAND + JFFS2.
However, I certainly might have missed something. I'd still love to hear any further details folks have to share regarding the ECC scheme situation for the 8148, especially if changes are planned in future EZSDK releases.
- Jon
I unmarked the above post as "Verified" as I'm still running into some issues. I've tried to re-summarize the issue here.
I should also note that we're using a 29F2G16ABAEA NAND, which I believe should be the same as the NAND on the EVM, or at least have the same timing parameters.
From my understanding (based upon the PSP documentation), BCH8 *cannot* be used for JFFS2 on NAND devices. Please correct me here if I'm wrong.
I had been using JFFS2 + NAND, and running into messages like the one below. When I wrote files to the JFFS2 file system, I'd see plenty of these errors, and blocks would be mismarked bad. Over time, the NAND had little free space left, due to blocks being mismarked bad.
mtd->read(0x100 bytes from 0x0) returned ECC error
At this point, I started following the post I linked above, switching from BCH8 to Hamming for the filesystem. (I also see that anything accessed from Linux should use Hamming as well.)
Below is the procedure I followed, and the issues encountered:
(1) Per Pekon's E2E post, I applied the two attached patches and rebuilt the kernel, to replace the use of BCH8 with Hamming code.
5483.board-flash-hamming.patch.txt
3683.omap2-nand-hamming.patch.txt
(2) Reflashed the kernel and root file system, ensuring that I switched to Hamming ECC mode prior to flashing (nandecc hw 0). I should note here that for both U-Boot and U-Boot-min, BCH8 is used (nandecc hw 2).
(3) Boot the system. The system appears to boot fine and everything operates as expected. Our applications run fine; we're able to capture and record video.
I exercised the flash a little bit, using dd (if=/dev/urandom). I no longer saw any of the MTD messages noting CRC failures.
However, I do see a failure in the early boot stages, which appears to be a failed write and CRC error. (Looks like single bit errors to me, based upon a diff of the "write" and "read" data.) A boot log is attached; you can see this failure at the end of the log:
6786.first_boot_after_flash.txt
(4) Reboot the system. At this point we start hitting kernel oops and panics, somewhere in JFFS2 code. Log attached:
Hi Jon,
Jon S. said: However, I do see a failure in the early boot stages, which appears to be a failed write and CRC error. (Looks like single bit errors to me, based upon a diff of the "write" and "read" data.) A boot log is attached; you can see this failure at the end of the log:
Yes, you are correct: there are single-bit flips in the data dump you provided. But the single-bit flips are occurring at multiple places within the same page, so I assume HAM1 ECC was not able to correct them.
(a)
00000190: 2e 69 73 2d 77 72 69 74 65 61 62 6c 65 ff ff ff
00000190: 2e 69 73 2d 77 72 69 74 65 61 62 6c 65 ff ff df
(b)
00000200: 17 08 cc d5 aa ed a7 87 df 3d a5 4a 37 30 2d 70
00000200: 17 08 cc d5 aa ef a7 87 df 3d a5 4a 37 30 2d 70
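The comparison of the two dump rows above can be reproduced with a short script. This is only a sketch: the helper name `bit_flips` and its `base` parameter are made up for illustration, and the two rows of each pair are simply called "first" and "second", since the dump does not say which one is the written data and which one is the read-back data.

```python
def bit_flips(first_hex, second_hex, base=0):
    """Return (address, bit, direction) for every bit that differs.

    base is the address printed at the start of the dump row.
    direction is "1->0" or "0->1", going from the first row to the second.
    """
    first = bytes.fromhex(first_hex)
    second = bytes.fromhex(second_hex)
    flips = []
    for offset, (a, b) in enumerate(zip(first, second)):
        diff = a ^ b  # set bits mark the positions that differ
        for bit in range(8):
            if (diff >> bit) & 1:
                direction = "1->0" if (a >> bit) & 1 else "0->1"
                flips.append((base + offset, bit, direction))
    return flips


# Rows (a) and (b) copied from the dump above.
flips_a = bit_flips(
    "2e 69 73 2d 77 72 69 74 65 61 62 6c 65 ff ff ff",
    "2e 69 73 2d 77 72 69 74 65 61 62 6c 65 ff ff df",
    base=0x190,
)
flips_b = bit_flips(
    "17 08 cc d5 aa ed a7 87 df 3d a5 4a 37 30 2d 70",
    "17 08 cc d5 aa ef a7 87 df 3d a5 4a 37 30 2d 70",
    base=0x200,
)

for address, bit, direction in flips_a + flips_b:
    print(f"0x{address:03x} bit {bit} {direction}")
```

Each row differs in exactly one bit, which matches the "single-bit flips at multiple places" observation.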
Just to confirm this, can you please provide the logs of your first-boot sequence (when the CRC error was first observed) in the following 2 cases:
CASE-1: Apply the following patch and re-run the setup without any change. (below diff is based on latest code on Arago/omap3 git repo)
CASE-2: Use ECC-type==HAM1_SW.
Use the S/W implementation for calculating the HAM1 ECC code instead of H/W. There will be some performance penalty during NAND R/W, but this will help isolate the problem. You can change the ECC type using the directions at the following link:
http://processors.wiki.ti.com/index.php/TI81XX_PSP_UBOOT_User_Guide#BCH_Flash_OOB_Layout
with regards, pekon
Hello Pekon,
Thank you for taking a look at this. I will obtain that information first thing when I return to the office tomorrow, and post it here ASAP.
In the meantime, here's a more extensive log that I collected yesterday, for what it's worth.
Some items of interest in this log:
- This is one that was "emptied" via the mw.b $load_addr 0xFF $rootfs_size prior to tftp'ing the image to $load_addr.
Other notes:
Thank you for your time and help,
Jon
Hi Jon,
I could re-create the issue at your side. I think there are 'uncorrectable' bit-flips occurring during NAND access.
But as you have enabled 'NAND.write_verify', these 'uncorrectable' ECC errors are visible during the NAND write accesses themselves (they would otherwise have been caught during NAND reads). Thus, NAND writes are failing and exiting.
So please apply the following patches and send the logs for confirmation. (The patches below enable error reporting in the HAM1 ECC scheme.)
with regards, pekon
Hi Pekon,
Here's a log for Case 1. Note the "*Err bit-flips found" @ line 546, 547, and 1335-1347.
I'm still working on getting a log for Case 2 (SW ECC). I tried making the two changes for this, but keep getting a stream of "NAND: OMAP2: omap_correct_data: *Err: bit-flips found" messages at boot, followed by a kernel panic. Here are the changes I made for this case:
Here's a log for the v2 patches. I was able to reboot multiple times without failure after applying these patches. "*Err" messages around lines 501, 1148, 1176, 1588.
[Update] After further usage of the v2 patches, I did see some uncorrectable errors.
*Err: Correcting single bit ECC error at offset: 112, bit: 6
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: Correcting single bit ECC error at offset: 73, bit: 7
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: Correcting single bit ECC error at offset: 177, bit: 0
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: UNCORRECTED_ERROR
*Err: UNCORRECTED_ERROR default
mtd->read(0x70 bytes from 0x294d044) returned ECC error
*Err: Correcting single bit ECC error at offset: 112, bit: 6
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: Correcting single bit ECC error at offset: 73, bit: 7
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: Correcting single bit ECC error at offset: 177, bit: 0
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: UNCORRECTED_ERROR default
mtd->read(0x2c bytes from 0x294d844) returned ECC error
Thank you,
Jon
Hi Jon,
CASE-1: (Using HAM1_HW)
Jon S. said: [Update] After further usage of the v2 patches, I did see some uncorrectable errors:
*Err: Correcting single bit ECC error at offset: 112, bit: 6
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: Correcting single bit ECC error at offset: 73, bit: 7
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: Correcting single bit ECC error at offset: 177, bit: 0
NAND: OMAP2: omap_correct_data: *Err: bit-flips found
*Err: UNCORRECTED_ERROR
These 'uncorrectable' ECC errors should be the real issue causing your kernel to panic. HAM1 (Hamming code) can detect up to 2-bit-flip errors but correct only 1-bit-flip errors, so somewhere your NAND is seeing multi-bit-flip errors.
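To illustrate the "detect 2, correct 1" property, here is a minimal sketch of a position-parity Hamming code over a 256-byte chunk. It is not the OMAP hardware ECC layout or the kernel's implementation; the function names `ham1_ecc` and `ham1_check` and the 256-byte chunk size are assumptions chosen for the example.

```python
DATA_BYTES = 256           # chunk size assumed for this sketch
POS_BITS = 11              # 256 * 8 = 2048 bit positions = 2**11
MASK = (1 << POS_BITS) - 1


def ham1_ecc(data):
    """Return (p, p_inv): XOR of the positions of all set bits, and the
    XOR of the bitwise-inverted positions of all set bits."""
    assert len(data) == DATA_BYTES
    p = 0
    p_inv = 0
    for byte_idx, byte in enumerate(data):
        for bit in range(8):
            if (byte >> bit) & 1:
                pos = byte_idx * 8 + bit
                p ^= pos
                p_inv ^= ~pos & MASK
    return p, p_inv


def ham1_check(data, stored_ecc):
    """Check a bytearray against its stored ECC, fixing one flipped bit in place."""
    p, p_inv = ham1_ecc(bytes(data))
    s = p ^ stored_ecc[0]
    s_inv = p_inv ^ stored_ecc[1]
    if s == 0 and s_inv == 0:
        return "ok"
    if s ^ s_inv == MASK:
        # One data bit flipped: s is its position.
        data[s >> 3] ^= 1 << (s & 7)
        return "corrected"
    if bin(s).count("1") + bin(s_inv).count("1") == 1:
        # One bit of the stored ECC itself flipped; the data is intact.
        return "ecc_bit_error"
    return "uncorrectable"


good = bytearray(range(256))
ecc = ham1_ecc(bytes(good))

one_flip = bytearray(good)
one_flip[112] ^= 1 << 6            # same offset/bit as one of the log lines
result_one = ham1_check(one_flip, ecc)

two_flips = bytearray(good)
two_flips[73] ^= 1 << 7
two_flips[177] ^= 1 << 0
result_two = ham1_check(two_flips, ecc)

print(result_one, result_two)
```

With two flipped bits the two syndromes are equal and non-zero, so the error is detected but cannot be located. Three or more flips can be mistaken for a single-bit error and "corrected" wrongly, which is why a stronger scheme is needed when multi-bit errors occur.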
The patches do nothing except print the exact cause of the error and avoid returning a NAND->write_verify failure, so that your application continues. But eventually your application will crash somewhere, as your NAND data has been corrupted by 'uncorrectable' bit-flips.
Therefore I suggest it's better to move to a higher ECC scheme like BCH8. But given the constraints of the NAND OOB/spare area layout, BCH8 cannot be used along with JFFS2 for 2K/64B NAND devices, which have 64 bytes of OOB (spare) region per page. So you have the following choices:
(a) Move to UBIFS File-system. (preferred)
(b) Use a NAND device which has more OOB(spare) bytes per page of NAND (like 4K/224B NAND).
Caution: TI ROM code automatically selects BCH16 ECC scheme for 4K/224B NAND device. However both U-boot and Kernel currently do _not_ support BCH16. So you have to use some 3rd party tool to flash your U-Boot image on NAND, if using 4K/224B NAND device.
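The OOB constraint behind these choices can be sketched as simple arithmetic. The byte counts used here are assumptions for illustration and should be checked against the PSP documentation and the driver's actual OOB layout: 14 ECC bytes per 512-byte sector for BCH8, 3 bytes per sector for HAM1, 2 bytes for the bad-block marker, and 8 free OOB bytes needed for the JFFS2 cleanmarker. The helper name `free_oob_bytes` is hypothetical.

```python
# All constants below are assumptions for this sketch, not values from the thread.
JFFS2_CLEANMARKER_BYTES = 8
BCH8_BYTES_PER_SECTOR = 14
HAM1_BYTES_PER_SECTOR = 3


def free_oob_bytes(oob_size, page_size, ecc_bytes_per_sector,
                   sector_size=512, bad_block_marker_bytes=2):
    """OOB bytes left per page after the bad-block marker and ECC bytes."""
    sectors = page_size // sector_size
    return oob_size - bad_block_marker_bytes - sectors * ecc_bytes_per_sector


bch8_2k = free_oob_bytes(64, 2048, BCH8_BYTES_PER_SECTOR)    # 64 - 2 - 4*14
ham1_2k = free_oob_bytes(64, 2048, HAM1_BYTES_PER_SECTOR)    # 64 - 2 - 4*3
bch8_4k = free_oob_bytes(224, 4096, BCH8_BYTES_PER_SECTOR)   # 224 - 2 - 8*14

for name, free in (("BCH8 on 2K/64B", bch8_2k),
                   ("HAM1 on 2K/64B", ham1_2k),
                   ("BCH8 on 4K/224B", bch8_4k)):
    fits = free >= JFFS2_CLEANMARKER_BYTES
    print(f"{name}: {free} free OOB bytes, JFFS2 cleanmarker fits: {fits}")
```

Under these assumptions, BCH8 leaves too little free OOB on a 2K/64B device for the JFFS2 cleanmarker, while a larger spare area (or a file system such as UBIFS, which does not use the OOB area) avoids the conflict.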
CASE-2: (Using HAM1_SW)
Also, a few observations from your log. I hope you are taking care of the following for CASE-2 (switching from HW_ECC to SW_ECC):
(1) To change the ECC scheme in the kernel you have to edit 2 separate files (not just 1)
(2) Your U-Boot flashing code still has the following command (hope you didn't miss this)
with regards, pekon
Hi Pekon,
I was sure to change to 'nandecc sw' in (2), but I accidentally missed omap_nand_probe() in (1). I can perform the Case-2 tests and report back if you think it's worthwhile, but it sounds like I should just go with BCH8 + UBIFS. I'll start going that route.
I saw that there have been some UBIFS-related patches to linux-omap3.git in December 2012 -- any advice on whether I need to pull some patches down or just go with what's provided in the EZSDK? (EZSDK 5.05.02.00 comes with linux-2.6.37-psp04.04.00.01.) For example, I saw this patch:
http://arago-project.org/git/projects/?p=linux-omap3.git;a=commit;h=243977171ae666f012cc38c76e28bc0fe3d532f5
Thank you for your time and help -- it's greatly appreciated.
Jon
Hi Jon,
Yes, you should pull in all the latest patches for TI81xx from the Arago/omap3 repo.
And choosing BCH8 + UBIFS is a better and more scalable solution for you.
Also, please mark this post as Answered | Verified if your issues are resolved. Thanks.
with regards, pekon
Hi Pekon,
I'll be sure to do that. My apologies, I intended to mark the thread verified last night before I left.
Thank you very much for your time and help.
- Jon
Hey there,
I just wanted to add my two cents -- I've been working with the AM3517 Craneboard which has the exact same NAND chip (MT29F2G16ABAEA). According to its datasheet, this particular NAND chip requires at least a 4-bit ECC per 528 bytes (512 + 16 OOB, presumably), except on the first 128KB block (which only requires a 1-bit ECC per 528 bytes). So I think that would explain why you were sometimes seeing more than one bit flip. I was seeing similar problems with 1-bit ECC so I switched to BCH4 for storing u-boot, the kernel, and the filesystem (JFFS2). It hasn't been very long, but so far, so good...hopefully it continues to work okay. Sounds like BCH8 + UBIFS is another great approach too--good to know!
Doug