DM8148 ECC issues: BCH8 + UBIFS

Jon S.

Other Parts Discussed in Thread: DM385

Hi there,

After having issues with JFFS2 and Hamming HW ECC on the DM8148 and our NAND, I switched to BCH8 + UBIFS, per the solution in this thread.

I got the BCH8 + UBIFS running, but I'm still seeing ECC errors, such as the following.

UBI error: ubi_io_read: error -74 (ECC error) while reading 129024 bytes from PEB 449:2048, read 129024 bytes
UBI error: ubi_io_read: error -74 (ECC error) while reading 131072 bytes from PEB 1813:0, read 131072 bytes
UBI error: torture_peb: read problems on freshly erased PEB 1813, must be bad
UBI error: erase_worker: failed to erase PEB 1813, error -5
UBI: mark PEB 1813 as bad
UBI: 17 PEBs left in the reserve

After a number of reboots, all PEBs reserved for bad-block handling are used up, and the filesystem goes into read-only mode. (We need it to be RW in our application.)

The procedure I followed was:

1. Start with EZSDK 5.05.02.00.
2. Fetch master from linux-omap3.git so I have all the latest NAND/ECC-related patches. (At the time of this post, [master] and [ti81xx-master] are @ 5efb81ac9e9b0c72a8e4edf159b7b5e06980ee56.) I added my defconfig, updated the EZSDK make rules to point to this kernel, and did a clean build.
3. Take my existing rootfs and create my UBI image. Here are my notes on this, which show the parameters used. It appears that our device supports 512-byte subpages. 3808.ubifs_notes.txt
[Update - The PEB --> LEB conversion in my attached notes is wrong: It should be 128 KiB / 126 KiB]
4. Flash the UBI image and boot the system as shown in the attached log. In this log you'll see the ECC errors and PEBs being marked bad at each reboot. 7838.ecc_error_log.txt
[Update - Please see the log from the next post instead; it is more up-to-date]
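The PEB --> LEB conversion mentioned in the update above can be sketched as follows. This is only an illustration: it assumes UBI's 64-byte EC and VID headers and a 2048-byte NAND page (inferred from the "PEB 449:2048" data offset in the log, not stated in the post). The names `align_up` and `leb_size` are hypothetical helpers, not part of any tool.

```python
# Sketch of how UBI derives the LEB size from the PEB geometry.
# Assumptions: 64-byte EC/VID headers, 2048-byte page (inferred from the log).

UBI_EC_HDR_SIZE = 64   # erase-counter header, stored at offset 0 of each PEB
UBI_VID_HDR_SIZE = 64  # volume-ID header, stored at the VID header offset


def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def leb_size(peb_size, min_io, hdr_io=None, vid_hdr_offset=None):
    """Return the usable LEB size in bytes.

    peb_size       -- physical eraseblock size
    min_io         -- minimum I/O unit (NAND page size)
    hdr_io         -- unit used for header writes (sub-page size when
                      sub-pages are used, otherwise the page size)
    vid_hdr_offset -- explicit VID header offset (like the -O option)
    """
    if hdr_io is None:
        hdr_io = min_io
    if vid_hdr_offset is None:
        vid_hdr_offset = align_up(UBI_EC_HDR_SIZE, hdr_io)
    # LEB data starts at the first page boundary after the VID header.
    data_offset = align_up(vid_hdr_offset + UBI_VID_HDR_SIZE, min_io)
    return peb_size - data_offset


# With 512-byte sub-pages: 128 KiB PEB -> 126 KiB LEB
print(leb_size(128 * 1024, 2048, hdr_io=512))  # 129024
# Without sub-pages: 128 KiB PEB -> 124 KiB LEB
print(leb_size(128 * 1024, 2048))              # 126976
```

The first result matches the 129024-byte reads at offset 2048 in the error log above. Note that the LEB size shrinks when sub-pages are not used, so an image built for one layout would presumably not fit the other.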

In the meantime, I'm going to try switching to SW ECC to rule out any HW ECC issues. I'll post back with those results as soon as I can.

Has anyone else resolved this issue? I've seen a number of similar threads without clear solutions. However, I'm not sure if they ended up using the latest in linux-omap3.git. If anyone could provide some advice on isolating and addressing this issue, it would be greatly appreciated.

Thank you!

Jon

over 11 years ago

0 Jon S. over 11 years ago

Expert 1240 points

So a bit of progress has been made... I fixed up our board file to use NAND_OMAP_BUS_16 in the board_nand_init() call.

We based our code on the EVM code from the EZSDK, which checked bit 16 of a status register for the bus width and then passed board_nand_init() a 0 or 2. After further reviewing the changes made in linux-omap3.git, I see that NAND_OMAP_BUS_16 is now defined as 1. I'm wondering if an 8-bit bus width is still hard-coded down in U-Boot...

At this point the board's booting but seemingly getting stuck doing UBI torture tests on a PEB:

UBI: run torture test for PEB 1813
UBI: PEB 1813 passed torture test, do not mark it as bad
UBI error: ubi_io_read: error -74 (ECC error) while reading 512 bytes from PEB 412:512, read 512 bytes
UBI error: ubi_io_read: error -74 (ECC error) while reading 512 bytes from PEB 1813:512, read 512 bytes
UBI: run torture test for PEB 1813
UBI: PEB 1813 passed torture test, do not mark it as bad
UBI error: ubi_io_read: error -74 (ECC error) while reading 512 bytes from PEB 412:512, read 512 bytes
UBI error: ubi_io_read: error -74 (ECC error) while reading 512 bytes from PEB 1813:512, read 512 bytes
... And so...

It seems to be stuck in an infinite loop torturing that one PEB. Are the "passed torture test" message and the error that follows contradictory?

Boot log: 3187.ecc_err_log_2.txt

[UPDATE]: We tried marking those PEBs as bad (via 'nand bad' in U-Boot). The system will boot fine a few times, and then fall back into this "infinite torture test loop" with two different PEBs after successive reboots.

It was brought to my attention that similar issues are documented in the following thread. As I noted in my first post, I cloned the latest from Arago's linux-omap3.git. I'm unsure whether Renjith's patches still need to be applied, or whether the existing commits in the Arago repo should suffice.

http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/211477.aspx?pi70909=2

At the moment, I'm reviewing all patches mentioned in the above thread, trying to determine which may be relevant to a fix for us.

Thanks,

- Jon

0 Daniel70334 over 11 years ago in reply to Jon S.

Expert 1745 points

I had a similar problem with my part. The solution for me was to disable subpage access by adding "-s 1024 -O 4096" to every ubiformat and "-O 4096" to ubiattach ...

That was the advice in http://processors.wiki.ti.com/index.php/UBIFS_Support
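As a rough sketch of the workaround described above, the following builds the two command lines with the VID header offset set to a full page. The MTD number (3) is a hypothetical placeholder, and `ubi_commands` is an illustrative helper; only the `-s` and `-O` values come from the post. The script just prints the commands and does not run them.

```python
# Sketch: build ubiformat/ubiattach command lines that place the VID header
# at a page-aligned offset, so UBI never issues sub-page writes.
# The MTD number below is a hypothetical placeholder.


def ubi_commands(mtd_num, page_size, subpage_size, image=None):
    """Return (ubiformat_cmd, ubiattach_cmd) as argument lists."""
    ubiformat = [
        "ubiformat", f"/dev/mtd{mtd_num}",
        "-s", str(subpage_size),   # sub-page size
        "-O", str(page_size),      # VID header offset = one full page
    ]
    if image is not None:
        ubiformat += ["-f", image]  # optionally flash a UBI image
    ubiattach = [
        "ubiattach", "/dev/ubi_ctrl",
        "-m", str(mtd_num),
        "-O", str(page_size),      # must match the offset used at format time
    ]
    return ubiformat, ubiattach


# Values from the post: 1024-byte sub-pages, 4096-byte VID header offset.
for cmd in ubi_commands(3, page_size=4096, subpage_size=1024):
    print(" ".join(cmd))
```

An image prepared offline with ubinize would presumably need the same VID header offset, since the offset used at attach time has to match the on-flash layout.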

0 Jon S. over 11 years ago in reply to Daniel70334

Expert 1240 points

Hi Daniel,

Disabling subpage access did the trick. Many thanks for sharing this -- I really appreciate it!

I will be marking your answer "Verified" in a day or two. I'm holding off just to see if TI might share some additional info regarding the NAND controller/HW ECC and subpage accesses.

Under this section of that wiki page I did see that they were not using sub-pages. However, it was not clear to me that sub-pages were unsupported. I simply figured that the author was not using a NAND with subpages, and that I should use this functionality since my device supports it.

I've been spending some time today with the UBIFS FAQ, and saw this:

"If the NAND flash supports sub-pages, then what can be done is ECC codes can be calculated on per-sub-page basis, instead of per-NAND page basis. In this case it becomes possible to read and write sub-pages independently.

But obviously, even though the NAND chip may support sub-pages, the NAND controller may disallow them. Indeed, if the flash is managed by a controller which calculates ECC codes on per-NAND page basis, then it is impossible to do I/O in sub-page fractions. E.g. this is the case for the OLPC XO-1 laptop - its NAND chip supports sub-pages, but the NAND controller does not."

When mtdinfo -u indicated that my device supported subpages, I started going down this sub-page route. I didn't see any limitations expressed in the TRM or datasheet. (I was looking under the GPMC and associated ECC sections). So my final question is, does the DM8148 and current SW support subpage access? (If so, could you point me to the TRM section?)

Thanks,

Jon

0 Pekon Gupta over 11 years ago in reply to Jon S.

TI__Prodigy 540 points

Hi Jon,

Jon S. said:

So my final question is, does the DM8148 and current SW support subpage access? (If so, could you point me to the TRM section?)

(1)

Currently we do not support any sub-page access; this is documented in the Release Notes.

(Please always refer to the Release Notes of every release, as this will save you time testing and debugging limitations or bugs that are already known.)

But yes, sub-page support is our next priority, so we might get it done soon.

(2)

Also, please pull in the latest code for TI814x from arago/omap3.git, as it has some patches for the BCH8 ECC scheme which may be important for you.

It also has multiple patches for auto-detection of the NAND bus width on ONFI-compliant devices.

(3)

Yes, BCH8+UBIFS is the way to go, both in terms of scalability and robustness of the system. And once we add sub-page support, UBIFS will become more efficient in using the NAND storage capacity.

with regards, pekon

0 Jon S. over 11 years ago in reply to Pekon Gupta

Expert 1240 points

Hello Pekon,

(1)

Thank you for verifying the sub-page access limitation and pointing those release notes out. Coming from the PSP bundled with the EZSDK, I admittedly missed those release notes while quickly trying to bring my kernel up to date.

(2) We got the latest arago linux-omap3 kernel running -- so far so good!

We did indeed see the bus width auto-detected correctly. It actually helped us debug an incorrect bus-width setting when we first upgraded to this kernel. After reviewing the git log, I see there are actually a number of items in the current repo that are useful to us!

Thanks again for your time and help!

Jon

0 Pekon Gupta over 11 years ago in reply to Jon S.

TI__Prodigy 540 points

Hi Jon,

Not sure if you are still working on a TI device, but just as an update, we now support:

- NAND subpage read and write accesses

- NAND BCH16 ECC in kernel.

For updates, please refer to the latest patches on arago/omap3.git

with regards, pekon

0 Jon S. over 11 years ago in reply to Pekon Gupta

Expert 1240 points

Hi Pekon,

Thank you very much for sending this information my way -- we are indeed moving forward with this part and are pleased to hear about this update!

I'll be sure to review the latest arago/omap3.git changes. Within a month or two, I may be looking for or starting a thread regarding some of the arago/omap3.git changes with respect to Ethernet functionality. After we last spoke, I found that pulling in the arago/omap3 changes broke some Ethernet support (larger packets seem to grind things to a halt), but I haven't had time to merge the hunks back in a few at a time to identify the culprit. Note that I have not yet concluded that there's any issue in the code base; it may very well be a defect introduced during my merging.

Just wanted to mention that networking item, in case you or someone else has already seen this. If I remember correctly, in the changes, some CPSW code or inits have been moved to or from the board file, and some queuing patches had been applied...

0 Pekon Gupta over 11 years ago in reply to Jon S.

TI__Prodigy 540 points

Hi Jon,

Sorry, I do not have any clue about network drivers, so I can't help you with that. You will have to follow up independently with an expert there.

It would be good to open an independent E2E forum post for your issue, or raise it via the AE supporting you.

with regards, pekon

0 Michael Po over 11 years ago in reply to Pekon Gupta

Intellectual 290 points

I've been trying to get "something" working on my 2GB NAND flash here. JFFS2 refuses to work with large erase blocks/OOB counts, so I'm down to using UBIFS. I'm using the latest kernel with BCH16 support. So far I've tried both BCH8 and BCH16, with no difference.

/ # mtdinfo /dev/mtd3
mtd3
Name: Rootfs
Type: nand
Eraseblock size: 262144 bytes, 256.0 KiB
Amount of eraseblocks: 2048 (536870912 bytes, 512.0 MiB)
Minimum input/output unit size: 4096 bytes
Sub-page size: 1024 bytes
OOB size: 224 bytes
Character device major/minor: 90:6
Bad blocks are allowed: true
Device is writable: true

I've been following this topic and read the wiki page to try to format and use a partition on that NAND (using ubiformat /dev/mtd3 -s 1024 -O 2048, or without the subpage parameter), but as soon as I run ubiattach with the same -O, I get hundreds of errors (one or more per erase block):

UBI error: ubi_io_read: error -74 (ECC error) while reading 64 bytes from PEB 2047:0, read 64 bytes
UBI error: ubi_io_read: error -74 (ECC error) while reading 1024 bytes from PEB 2047:2048, read 1024 bytes

Does anyone have a clue how to use that NAND? The attach eventually works, but the filesystem I create on it also reports ECC errors when unmounted/remounted, so there is an underlying problem...

UBI: max. sequence number: 0
UBI: attached mtd3 to ubi0
UBI: MTD device name: "Rootfs"
UBI: MTD device size: 512 MiB
UBI: number of good PEBs: 2048
UBI: number of bad PEBs: 0
UBI: number of corrupted PEBs: 0
UBI: max. allowed volumes: 128
UBI: wear-leveling threshold: 4096
UBI: number of internal volumes: 1
UBI: number of user volumes: 0
UBI: available PEBs: 2024
UBI: total number of reserved PEBs: 24
UBI: number of PEBs reserved for bad PEB handling: 20
UBI: max/mean erase counter: 0/0
UBI: image sequence number: 1609032271
UBI: background thread "ubi_bgt0d" started, PID 204
UBI device number 0, total 2048 LEBs (528482304 bytes, 504.0 MiB), available 2024 LEBs (522289152 bytes, 498.1 MiB), LEB size 258048 bytes (252.0 KiB)
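One thing the numbers above allow checking is whether the chosen VID header offset falls on a page boundary, which is what the earlier advice in this thread relied on. The sketch below parses a few fields from the mtdinfo output and tests an offset against the minimum I/O unit. The helper names are hypothetical, and whether a non-page-aligned offset explains these particular errors is not established in the thread.

```python
# Sketch: parse mtdinfo geometry and check whether a VID header offset
# is page-aligned. Helper names are illustrative only.
import re

# Relevant lines copied from the mtdinfo output above.
MTDINFO = """\
Eraseblock size: 262144 bytes, 256.0 KiB
Minimum input/output unit size: 4096 bytes
Sub-page size: 1024 bytes
OOB size: 224 bytes
"""

FIELDS = {
    "Eraseblock size": "peb",
    "Minimum input/output unit size": "min_io",
    "Sub-page size": "subpage",
}


def parse_geometry(text):
    """Extract byte sizes for the fields listed in FIELDS."""
    geometry = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s*(\d+) bytes", line)
        if match and match.group(1) in FIELDS:
            geometry[FIELDS[match.group(1)]] = int(match.group(2))
    return geometry


def vid_offset_is_page_aligned(geometry, vid_hdr_offset):
    """True if the offset is a non-zero multiple of the page size."""
    return vid_hdr_offset > 0 and vid_hdr_offset % geometry["min_io"] == 0


geo = parse_geometry(MTDINFO)
print(geo)
print(vid_offset_is_page_aligned(geo, 2048))  # False: inside the first page
print(vid_offset_is_page_aligned(geo, 4096))  # True: one full page
```

On this device, "-O 2048" places the VID header inside the first 4096-byte page, so it is still a sub-page access; "-O 4096" would be the page-aligned equivalent of the workaround reported earlier.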

0 Leon Pollak over 9 years ago in reply to Pekon Gupta

Intellectual 950 points

Hello, Pekon and all.

We can't overcome the NAND-BCH8-UBIFS problems.

We took the latest Arago git kernel (from Oct 2013), which is supposed to contain all known patches for all known issues, and we still get ECC errors on simple nand_write / nand_dump operations. Although the dumped content is exactly the same as what was written, we see ECC error messages on the console.

We also noted that the patches from Thomas discussed here
e2e.ti.com/.../211477
do not seem to appear in the Arago git repository, which is questionable...

Can somebody shed light on the situation? How can a working kernel with some file system (UBIFS or JFFS2-with-external-clean-marker) be obtained?

Thanks!

0 Ivan Frederiks over 8 years ago in reply to Pekon Gupta

Intellectual 370 points

Hello!

I'm almost sure that BCH8 support was broken by the series of commits that add BCH16 support. I mean the commits starting from "mtd:nand:omap2: cleaning of omap_correct_data for BCH8 ECC scheme" and ending at "mtd:nand:omap2: [4/4] add support for BCH16 ECC scheme".

I found this issue while porting the DM385 IPNC RDK Linux kernel to a custom DMVA4 board with an 8-bit NAND chip with 2k pages. All mtd utilities were misbehaving with BCH8 enabled until I reverted the BCH16 patches. Maybe I will find some time to investigate this problem and search for the BCH8 setup regression.

Pekon Gupta, if you are still here, please check your patches for BCH8 issues.

Thank you in advance!

0 ranchu over 8 years ago in reply to Ivan Frederiks

Guru 20755 points

Hello,

Did anyone solve the ECC nightmare with BCH8 or BCH4?

Regards,

Ran

0 Ivan Frederiks over 8 years ago in reply to ranchu

Intellectual 370 points

I got BCH8 fully working in ROM bootloader, U-Boot and Linux.

0 Ron Shpasser over 8 years ago in reply to Ivan Frederiks

Prodigy 60 points

Hello Ivan,

Does it mean that in order to make it work you had to revert all of the commits you mentioned before?

Thanks in advance,

Ron.

0 Ivan Frederiks over 8 years ago in reply to Ron Shpasser

Intellectual 370 points

Hello, Ron!

Ron Shpasser said:
Does it mean that in order to make it work you had to revert all of the commits you mentioned before?

Yep. To be more specific:

arago-project.org/.../

df2a2be mtd:nand:omap2: [4/4] add support for BCH16 ECC scheme
6fa1d35 mtd:nand:omap2: [3/3] add support for BCH16 ECC scheme
72bff96 mtd:nand:omap2: [2/3] add support for BCH16 ECC scheme
142b05d mtd:nand:omap2: [1/3] add support for BCH16 ECC scheme
ec6e2aa mtd:nand:omap2: [0/3] add support for BCH16 ECC scheme
1f956da mtd:nand:omap2: cleaning of omap_correct_data for BCH8 ECC scheme
