UBIFS error with 2.6.32 TI kernel (AM35x-OMAP35x-PSP-SDK-03.00.01.06.tgz) - !!!read-only filesystem!!!

Matteo Mattei

Prodigy 220 points

Other Parts Discussed in Thread: OMAP3530

Hello,

We have two sets of custom boards with the OMAP3530 and different NAND chips (BOARD1 and BOARD2).

The relevant configurations for our original hardware revision are as follows:

BOARD1

Micron NAND flash - MT29F4G08ABCHC-ET
2.6.29 Linux kernel
YAFFS2 flash file system (system boots from flash memory)

BOARD2

Micron NAND flash - MT29F4G08ABBDAH4-IT
2.6.32 Linux kernel (long term support release)
UBIFS flash file system (system boots from flash memory)

The issue initially occurred during power cycle testing with BOARD2, Linux 2.6.32, and the UBIFS file system:

Power on for 4 minutes (more than enough time for the system to boot completely), power off for 1 minute, and repeat.
Intermittently, UBIFS initialization fails during recovery from the unclean shutdown at power off. The failure persists across subsequent power cycles, and recovering requires reinitializing the board's flash memory. The specific error was:
- The UBIFS function ubifs_leb_unmap() returns an error, which is handled by switching the file system to read-only mode, and this prevents normal startup.

Observations regarding the problem:

Occurs at a constant 25C or during temperature cycling over -30C to 70C. Conclusion: temperature is not a factor.
Occurs with BOARD1 and BOARD2 when using Linux 2.6.32 and UBIFS file system.
Does not occur with BOARD1 using Linux 2.6.29 (old TI PSP) and YAFFS2 file system.

Our initial attempt to address the issue (with the Linux 2.6.32/UBIFS configuration) was:

Analyze OMAP NAND flash timing. Timing was ruled out as cause of the issue, although improvements were identified and implemented.
Apply a UBIFS patch to address ubifs_leb_unmap() error.
Set 2.6.32 Linux kernel IO scheduler to NOOP to return to 2.6.29 behavior (by default IO scheduling is enabled in 2.6.32).

The results were:

Significant decrease in failure rate -- from 3-4 out of 8 test units in < 1 day to 1 out of 8 test units in 6 days.
The ubifs_leb_unmap() error appears to be resolved, although the external symptoms are the same (file system switched to read only mode, failure to boot).
The current failure appears to be at a lower level beneath UBIFS in UBI or MTD, and related to incorrect handling of bad blocks during error recovery at system start. It is possible that UBIFS is triggering the error via incorrect usage of UBI/MTD.

The new error we have is this:

UBI: attaching mtd6 to ubi0
UBI: physical eraseblock size: 131072 bytes (128 KiB)
UBI: logical eraseblock size: 126976 bytes
UBI: smallest flash I/O unit: 2048
UBI: VID header offset: 2048 (aligned 2048)
UBI: data offset: 4096
UBI: max. sequence number: 1621513
UBI: attached mtd6 to ubi0
UBI: MTD device name: "FileSystem"
UBI: MTD device size: 847 MiB
UBI: number of good PEBs: 6769
UBI: number of bad PEBs: 7
UBI: number of corrupted PEBs: 0
UBI: max. allowed volumes: 128
UBI: wear-leveling threshold: 4096
UBI: number of internal volumes: 1
UBI: number of user volumes: 1
UBI: available PEBs: 0
UBI: total number of reserved PEBs: 6769
UBI: number of PEBs reserved for bad PEB handling: 134
UBI: max/mean erase counter: 2417/239
UBI: image sequence number: 0
UBI: background thread "ubi_bgt0d" started, PID 587
UBIFS: recovery needed
UBIFS error (pid 590): ubifs_scan: corrupt empty space at LEB 997:116771
UBIFS error (pid 590): ubifs_scanned_corruption: corruption at LEB 997:116771
UBIFS error (pid 590): ubifs_scan: LEB 997 scanning failed
UBIFS error (pid 590): do_commit: commit failed, error -117
UBIFS warning (pid 590): ubifs_ro_mode: switched to read-only mode, error -117
UBIFS: recovery completed
UBIFS: mounted UBI device 0, volume 0, name "ROOTFS"
UBIFS: file system size: 839819264 bytes (820136 KiB, 800 MiB, 6614 LEBs)
UBIFS: journal size: 33521664 bytes (32736 KiB, 31 MiB, 264 LEBs)
UBIFS: media format: w4/r0 (latest is w4/r0)
UBIFS: default compressor: none
UBIFS: reserved for root: 4952683 bytes (4836 KiB)

Do you have any idea on what we can do to solve this issue?

Thanks,

Matteo

over 13 years ago

0 Lorenzo Indiani over 13 years ago

TI__Prodigy 40 points

Matteo,

Since you are seeing corruption at the file system level, you should go one layer deeper and check what is happening at the MTD level (raw block level).

Build the kernel with CONFIG_MTD_TESTS=m so that the MTD tests are built as modules.

Make sure the MTD tools (a selection of ARM binaries) are available on the target, especially flash_eraseall.

Use flash_eraseall to erase an MTD partition.

Then use the module mtd_stresstest.ko to read / write the MTD partition. Stress the MTD partition for hours.

mtd_stresstest may report errors. These are potentially caused by a mismatch between the number of ECC bits the memory specification requires and the strength of the ECC engine in use. The MT29F4G08ABBDAH4-IT requires 4-bit ECC.

Which ECC engine and algorithm are you using? The Memory internal one or the one provided in the OMAP GPMC?

Regards

Lorenzo

0 Matteo Mattei over 13 years ago in reply to Lorenzo Indiani

Prodigy 220 points

Hi Lorenzo,

I tested the new code (kernel 2.6.32) with BOARD1 and BOARD2.

All mtd_tests passed on BOARD1 (with the old flash chip), but they failed on BOARD2 (with the new flash part).

I am using the ECC engine provided in the OMAP GPMC.

Over the past few days I backported drivers/mtd code from 2.6.37 (from PSP-04.02.00.07) and 3.2.x (from kernel.org) in order to align our code with the newer drivers. The only parts I did not backport are the prefetch mechanism, since I don't use it, and some pieces of omap2.c.

However, the mtd_tests continue to fail on BOARD2.

Furthermore, when I run a script that stresses the flash with continuous reads/writes, I get a lot of "UBI: scrubbed" messages in dmesg and sometimes a BCH decoding failure:

BCH decoding failed
UBI error: ubi_io_read: error -74 (ECC error) while reading 4144 bytes from PEB 4998:124928, read 4144 bytes
UBIFS error (pid 2000): try_read_node: cannot read node type 1 from LEB 2202:120832, error -74
BCH decoding failed
UBI error: ubi_io_read: error -74 (ECC error) while reading 4144 bytes from PEB 4998:124928, read 4144 bytes
UBI: scrubbed PEB 4232 (LEB 0:5820), data moved to PEB 5105
UBI: scrubbed PEB 2702 (LEB 0:4052), data moved to PEB 2289
UBI: scrubbed PEB 930 (LEB 0:5512), data moved to PEB 5105
UBI: scrubbed PEB 1524 (LEB 0:6486), data moved to PEB 1975
UBI: scrubbed PEB 4769 (LEB 0:5940), data moved to PEB 1975
UBI: scrubbed PEB 2702 (LEB 0:5342), data moved to PEB 1975
UBI: scrubbed PEB 5474 (LEB 0:6490), data moved to PEB 1975
UBI: scrubbed PEB 2896 (LEB 0:6290), data moved to PEB 6426
UBI: scrubbed PEB 2041 (LEB 0:6426), data moved to PEB 1975
UBI: scrubbed PEB 3611 (LEB 0:5662), data moved to PEB 1975
UBI: scrubbed PEB 5545 (LEB 0:6367), data moved to PEB 1136
UBI: scrubbed PEB 3187 (LEB 0:5914), data moved to PEB 3697
UBI: scrubbed PEB 143 (LEB 0:5441), data moved to PEB 3697
UBI: scrubbed PEB 4962 (LEB 0:5761), data moved to PEB 3697
UBI: scrubbed PEB 4063 (LEB 0:5806), data moved to PEB 3697
UBI: scrubbed PEB 2934 (LEB 0:2858), data moved to PEB 3697
UBI: scrubbed PEB 1349 (LEB 0:4295), data moved to PEB 3697
UBI: scrubbed PEB 223 (LEB 0:6478), data moved to PEB 3697
UBI: scrubbed PEB 5513 (LEB 0:628), data moved to PEB 3697
UBI: scrubbed PEB 5368 (LEB 0:5825), data moved to PEB 3697
UBI: scrubbed PEB 1349 (LEB 0:4693), data moved to PEB 3697
UBI: scrubbed PEB 764 (LEB 0:6219), data moved to PEB 3697
UBI: scrubbed PEB 4417 (LEB 0:6373), data moved to PEB 3697
UBI: scrubbed PEB 5649 (LEB 0:6052), data moved to PEB 3697
UBI: scrubbed PEB 4450 (LEB 0:2857), data moved to PEB 3697
UBI: scrubbed PEB 5368 (LEB 0:6530), data moved to PEB 3697
UBI: scrubbed PEB 4200 (LEB 0:6298), data moved to PEB 3697
UBI: scrubbed PEB 3213 (LEB 0:6356), data moved to PEB 3697
UBI: scrubbed PEB 2774 (LEB 0:3130), data moved to PEB 3697
UBI: scrubbed PEB 4723 (LEB 0:6517), data moved to PEB 3697
UBI: scrubbed PEB 4492 (LEB 0:5674), data moved to PEB 3697
UBI: scrubbed PEB 3391 (LEB 0:6235), data moved to PEB 3697
UBI: scrubbed PEB 3009 (LEB 0:6356), data moved to PEB 3697
UBI: scrubbed PEB 1387 (LEB 0:5618), data moved to PEB 3697
UBI: scrubbed PEB 2642 (LEB 0:5914), data moved to PEB 3697
UBI: scrubbed PEB 3011 (LEB 0:818), data moved to PEB 3697
UBI: scrubbed PEB 4691 (LEB 0:1587), data moved to PEB 3697
UBI: scrubbed PEB 4579 (LEB 0:1708), data moved to PEB 3697
UBI: scrubbed PEB 1467 (LEB 0:5872), data moved to PEB 3697
UBI: scrubbed PEB 3431 (LEB 0:5714), data moved to PEB 3697
BCH decoding failed
UBI error: ubi_io_read: error -74 (ECC error) while reading 126976 bytes from PEB 3214:4096, read 126976 bytes
UBI: scrubbed PEB 2545 (LEB 0:5916), data moved to PEB 3697
UBI: scrubbed PEB 3413 (LEB 0:3528), data moved to PEB 3697
UBI: scrubbed PEB 3391 (LEB 0:5772), data moved to PEB 3697
UBI: scrubbed PEB 3103 (LEB 0:1609), data moved to PEB 3697
UBI: scrubbed PEB 3778 (LEB 0:5668), data moved to PEB 3697
UBI: scrubbed PEB 2749 (LEB 0:6388), data moved to PEB 3697
BCH decoding failed
UBI error: ubi_io_read: error -74 (ECC error) while reading 126976 bytes from PEB 3214:4096, read 126976 bytes
UBI: scrubbed PEB 3975 (LEB 0:3213), data moved to PEB 3697
UBI: scrubbed PEB 4652 (LEB 0:2353), data moved to PEB 3697
UBI: scrubbed PEB 3466 (LEB 0:5488), data moved to PEB 3697
UBI: scrubbed PEB 5723 (LEB 0:6063), data moved to PEB 3697
UBI: scrubbed PEB 4074 (LEB 0:6119), data moved to PEB 3697
UBI: scrubbed PEB 3975 (LEB 0:3213), data moved to PEB 3697
UBI: scrubbed PEB 421 (LEB 0:6613), data moved to PEB 3697
UBI: scrubbed PEB 4403 (LEB 0:997), data moved to PEB 3697
UBI: scrubbed PEB 5649 (LEB 0:5781), data moved to PEB 3697
UBI: scrubbed PEB 4489 (LEB 0:5117), data moved to PEB 3697
UBI: scrubbed PEB 2365 (LEB 0:2882), data moved to PEB 3697
UBI: scrubbed PEB 2812 (LEB 0:6140), data moved to PEB 3697
UBI: scrubbed PEB 963 (LEB 0:5541), data moved to PEB 3697
UBI: scrubbed PEB 3574 (LEB 0:6223), data moved to PEB 3697
UBI: scrubbed PEB 5496 (LEB 0:6402), data moved to PEB 3697
UBI: scrubbed PEB 4315 (LEB 0:6523), data moved to PEB 3697
UBI: scrubbed PEB 5353 (LEB 0:6108), data moved to PEB 3697
BCH decoding failed
UBI error: ubi_io_read: error -74 (ECC error) while reading 126976 bytes from PEB 152:4096, read 126976 bytes
UBI: scrubbed PEB 888 (LEB 0:5511), data moved to PEB 3697
UBI: scrubbed PEB 3024 (LEB 0:6399), data moved to PEB 6335
UBI: scrubbed PEB 169 (LEB 0:6485), data moved to PEB 6318
UBI: scrubbed PEB 4503 (LEB 0:1004), data moved to PEB 6318
UBI: scrubbed PEB 5424 (LEB 0:3640), data moved to PEB 6542
UBI: scrubbed PEB 3579 (LEB 0:5447), data moved to PEB 6523
UBI: scrubbed PEB 3099 (LEB 0:5760), data moved to PEB 6518
BCH decoding failed
UBI error: ubi_io_read: error -74 (ECC error) while reading 4144 bytes from PEB 3332:12384, read 4144 bytes
UBIFS error (pid 6674): try_read_node: cannot read node type 1 from LEB 6522:8288, error -74
BCH decoding failed
UBI error: ubi_io_read: error -74 (ECC error) while reading 4144 bytes from PEB 3332:12384, read 4144 bytes
UBI: scrubbed PEB 963 (LEB 0:5304), data moved to PEB 6518
UBI: scrubbed PEB 4492 (LEB 0:1459), data moved to PEB 10
UBI: scrubbed PEB 4314 (LEB 0:6320), data moved to PEB 6757
UBI: scrubbed PEB 5872 (LEB 0:5743), data moved to PEB 6757
UBI: scrubbed PEB 5887 (LEB 0:2381), data moved to PEB 6757
UBI: scrubbed PEB 4883 (LEB 0:2833), data moved to PEB 6372
UBI: scrubbed PEB 6752 (LEB 0:6540), data moved to PEB 4877
UBI: scrubbed PEB 1428 (LEB 0:6364), data moved to PEB 4877
UBI: scrubbed PEB 3367 (LEB 0:6404), data moved to PEB 4877
BCH decoding failed
UBI error: ubi_io_read: error -74 (ECC error) while reading 126976 bytes from PEB 3214:4096, read 126976 bytes
UBI: scrubbed PEB 4723 (LEB 0:5381), data moved to PEB 4877
UBI: scrubbed PEB 5405 (LEB 0:6565), data moved to PEB 5460
UBI: scrubbed PEB 5891 (LEB 0:4377), data moved to PEB 5460
UBI: run torture test for PEB 5460
UBI: PEB 5460 passed torture test, do not mark it a bad
UBI: scrubbed PEB 5004 (LEB 0:555), data moved to PEB 5460
UBI: scrubbed PEB 5257 (LEB 0:6123), data moved to PEB 5460
UBI: scrubbed PEB 362 (LEB 0:6060), data moved to PEB 5460
UBI: scrubbed PEB 5004 (LEB 0:555), data moved to PEB 5460
UBI: scrubbed PEB 3157 (LEB 0:6391), data moved to PEB 5460
UBI: scrubbed PEB 510 (LEB 0:636), data moved to PEB 5460
UBI: scrubbed PEB 362 (LEB 0:6147), data moved to PEB 5460

Let me know if you have any idea on what I can do to solve the issue.

Thanks,

Matteo

0 Matteo Mattei over 13 years ago in reply to Matteo Mattei

Prodigy 220 points

With this fix the BCH algorithm works correctly:

===================================================================
--- linux-2.6.32/drivers/mtd/nand/omap_bch_decoder.c (revision 1897)
+++ linux-2.6.32/drivers/mtd/nand/omap_bch_decoder.c (working copy)
@@ -107,7 +107,7 @@
 if (elp_sum == 0) {
 /* calculate bit position in main data area */
 bit = ((i-1) & ~7)|(7-((i-1) & 7));
- if (i >= 2 * ecc_bits)
+ if (i >= ecc_bits)
 location[count++] =
 kk_shorten - (bit - 2 * ecc_bits) - 1;
 }
@@ -116,7 +116,7 @@
 /* Failure: No. of detected errors != No. or corrected errors */
 if (count != err_nums) {
 count = -1;
- printk(KERN_ERR "BCH decoding failed\n");
+ printk(KERN_ERR "BCH decoding failed count=%d err_nums=%d\n",count,err_nums);
 }
 for (i = 0; i < count; i++)
 pr_debug("%d ", location[i]);
@@ -386,6 +386,8 @@
 no_of_err = berlekamp(select_4_8, syn, err_poly);
 if (no_of_err <= (4 << select_4_8))
 no_of_err = chien(select_4_8, no_of_err, err_poly, err_loc);
+ else
+ return -1;
 
 return no_of_err;
 }

However, the MTD tests on BOARD2 still fail and the read-only error is still present. As far as I understand, the root cause is most likely the unstable bits issue.

A possible workaround to prevent the switch to read-only mode is this:

===================================================================
--- linux-2.6.32/fs/ubifs/scan.c (revision 1897)
+++ linux-2.6.32/fs/ubifs/scan.c (working copy)
@@ -339,7 +339,7 @@
 if (!quiet)
 ubifs_err("corrupt empty space at LEB %d:%d",
 lnum, offs);
- goto corrupted;
+ //goto corrupted;
 }
 
 return sleb;

I understand that this change hides a real (potential) issue, but for an unattended system I absolutely cannot allow the file system to switch to read-only mode on its own.

At this point, do you have any hints on this?

Thanks.

0 Fabrice Goucem over 13 years ago in reply to Matteo Mattei

TI__Prodigy 30 points

What's the difference between MT29F4G08ABCHC-ET and MT29F4G08ABBDAH4-IT?

What are their ECC requirements?

Do you use the GPMC ECC engine (Hamming code only -> can only correct 1 bit per block) or the software engine (BCH)?

0 Matteo Mattei over 13 years ago in reply to Fabrice Goucem

Prodigy 220 points

The MT29F4G08ABCHC-ET requires 1-bit ECC.

The MT29F4G08ABBDAH4-IT requires 4-bit ECC.

I use the software engine (8-bit BCH because of the errata with 4-bit ECC and OMAP3530).

0 Fabrice Goucem over 13 years ago in reply to Matteo Mattei

TI__Prodigy 30 points

Could you give more details on why the MTD tests fail? Do you mean the mtd_stresstest?

Does the mtd_stresstest pass on BOARD1?

What do you mean by "unstable bit issues"?

0 Matteo Mattei over 13 years ago in reply to Fabrice Goucem

Prodigy 220 points

The failures happen with mtd_stresstest, mtd_oobtest, mtd_subpagetest etc... but only with BOARD2.

On BOARD1 all mtd_tests pass (also with 2.6.32 kernel).

The "unstable bits issue" is documented here http://www.linux-mtd.infradead.org/doc/ubifs.html#L_unstable_bits

0 Fabrice Goucem over 13 years ago in reply to Matteo Mattei

TI__Prodigy 30 points

How can you have errors with the MTD tests if your BCH algorithm works well?

If you properly reinitialise your NAND partition (with flash_eraseall) and then run each of the MTD tests (without any reboot), do you see errors?

0 Matteo Mattei over 13 years ago in reply to Fabrice Goucem

Prodigy 220 points

This is exactly what I did:

flash_eraseall filesystem partition
insmod mtd_*.ko dev=X
no powercycles.

I always have errors with BOARD2.

No errors with BOARD1 (using identical u-boot and kernel).

0 David Andrey over 13 years ago in reply to Matteo Mattei

Prodigy 235 points

Hi Matteo,

Have you reached any conclusion on this?

I have more or less the same situation (2.6.32, BCH4/8, MT29F4G16ABBDAHC, UBIFS) and frequent read/write errors, which seem to be false positives since the data is not corrupted.

I haven't updated to the 2.6.37 code yet, since it seems it did not solve your problem. (?)

David

0 Ron Olson over 13 years ago in reply to Matteo Mattei

Intellectual 680 points

Hi Matteo,

In your posting of March 29, 2012, you listed a patch to omap_bch_decoder.c that you said made the algorithm work correctly.

What was its origin, or did you create it yourself? I checked the latest TI 2.6.37 distribution from last month, and saw no similar patch there.

My interest is that this same module appears to be working incorrectly for me, using BCH8. So, I'm about to start down the grueling path of finding out why, and am looking for any head start I can find.

Regards,
Ron

0 Matteo Mattei over 13 years ago in reply to Ron Olson

Prodigy 220 points

Hi Ron,

I wrote the patch for omap_bch_decoder.c myself. Beyond that, there were several problems with 8-bit ECC and UBI/UBIFS.

I now have a fairly good configuration with a very low failure rate, but tuning the system took a long time and relied on help from Micron.

First, I suggest upgrading your UBI/UBIFS code to the latest from the backports: http://www.linux-mtd.infradead.org/doc/ubifs.html#L_source

Then apply the BCH patch I posted last March, and look for a way to protect the empty space as well.

As a final step, to prevent power-fail errors, I suggest adding power-fail detection and changing the NAND driver accordingly.

Matteo
