Bug fix for UBI ECC errors on TI814x/AM387x (master-ti81xx branch)

Other Parts Discussed in Thread: AM3874

Hi, I'm using the latest code from the master-ti81xx branch of the linux-omap3 arago repository, as I needed 16-bit flash support (and also wanted BCH16). For the most part the code is working great, but I ran into issues with UBI partitions. As far as I can tell, the cause is commit 72bff96.

The issue seems to affect both my TI8148 and AM3874 hardware (in both BCH8 and BCH16 modes). I'm not sure whether it's relevant to other related architectures. The easiest way I found to reproduce the bug is as follows:

  1. Erase NAND flash partition
  2. ubiformat the partition
  3. ubiattach to the partition

The UBI attach works but produces the following kernel dump:

UBI error: ubi_io_read: error -74 (ECC error) while reading 64 bytes from PEB 2:0, read 64 bytes
Backtrace:
[<c0048dc8>] (dump_backtrace+0x0/0x110) from [<c036031c>] (dump_stack+0x18/0x1c)
r6:00000040 r5:ffffffb6 r4:dd876800 r3:60000013
[<c0360304>] (dump_stack+0x0/0x1c) from [<c02317cc>] (ubi_io_read+0x1cc/0x2a4)
[<c0231600>] (ubi_io_read+0x0/0x2a4) from [<c0231afc>] (ubi_io_read_ec_hdr+0x74/0x204)
[<c0231a88>] (ubi_io_read_ec_hdr+0x0/0x204) from [<c0235b68>] (ubi_scan+0x12c/0x132c)
[<c0235a3c>] (ubi_scan+0x0/0x132c) from [<c022c690>] (ubi_attach_mtd_dev+0x554/0xbf8)
[<c022c13c>] (ubi_attach_mtd_dev+0x0/0xbf8) from [<c022cf64>] (ctrl_cdev_ioctl+0xd8/0x168)
[<c022ce8c>] (ctrl_cdev_ioctl+0x0/0x168) from [<c00d2a18>] (do_vfs_ioctl+0x4d4/0x548)
r6:40186f40 r5:dd8af700 r4:bec29bb8
[<c00d2544>] (do_vfs_ioctl+0x0/0x548) from [<c00d2ae4>] (sys_ioctl+0x58/0x7c)
r9:dd810000 r8:00000000 r7:00000003 r6:40186f40 r5:bec29bb8
r4:dd8af700
[<c00d2a8c>] (sys_ioctl+0x0/0x7c) from [<c0045280>] (ret_fast_syscall+0x0/0x30)
r8:c0045428 r7:00000036 r6:00000003 r5:0000c9ce r4:bec29bb8

What seems to happen is that the first page written by the ubiformat command actually has the wrong ECC. This is then detected and corrected by ubiattach (but the block is never marked bad because the next write to it works fine). Once UBIFS volumes are set up and actual data is written to the partition, a lot more of these errors crop up.

The bug seems to be caused by the ordering of the GPMC ECC configuration register writes, and as far as I could find, this behaviour is not documented anywhere. The only way I could get it to work was to reorganise the register accesses to match the order used before the above-mentioned commit.

Here is a patch that fixes the issue: 6472.gpmc_ecc.diff

I'm not sure why it works though (all the other ordering combinations I tried don't seem to work). If anyone has any insights into what's going on here or documentation I've missed I would appreciate the feedback.

  • It appears there's a second issue that comes up with the new branch that looks very similar to the one above, once again causing ECC errors when using BCH ECC.

    The easiest way I found to reproduce it:

    1. Erase NAND flash partition
    2. ubiformat the partition
    3. ubiattach to the partition
    4. ubimkvol create 2 volumes (important to make more than one)
    5. ubidetach from the partition
    6. ubiattach to the partition again

    The resulting kernel dumps look like this:

    UBI error: ubi_io_read: error -74 (ECC error) while reading 1024 bytes from PEB 47:1024, read 1024 bytes
    Backtrace:
    [<c0048dc8>] (dump_backtrace+0x0/0x110) from [<c03601a8>] (dump_stack+0x18/0x1c)
    r6:00000400 r5:ffffffb6 r4:dd8ca800 r3:60000113
    [<c0360190>] (dump_stack+0x0/0x1c) from [<c0231658>] (ubi_io_read+0x1cc/0x2a4)
    [<c023148c>] (ubi_io_read+0x0/0x2a4) from [<c0231b90>] (ubi_io_read_vid_hdr+0x78/0x21c)
    [<c0231b18>] (ubi_io_read_vid_hdr+0x0/0x21c) from [<c0230988>] (ubi_eba_copy_leb+0x34c/0x510)
    [<c023063c>] (ubi_eba_copy_leb+0x0/0x510) from [<c0233674>] (wear_leveling_worker+0x274/0x5ec)
    [<c0233400>] (wear_leveling_worker+0x0/0x5ec) from [<c023324c>] (do_work+0xb0/0xec)
    [<c023319c>] (do_work+0x0/0xec) from [<c0234670>] (ubi_thread+0xd8/0x164)
    r6:c394c000 r5:00000000 r4:dd8ca800 r3:00000000
    [<c0234598>] (ubi_thread+0x0/0x164) from [<c00843ac>] (kthread+0x90/0x98)
    [<c008431c>] (kthread+0x0/0x98) from [<c0070358>] (do_exit+0x0/0x5d8)
    r6:c0070358 r5:c008431c r4:dd869e40
    UBI: run torture test for PEB 47
    UBI: PEB 47 passed torture test, do not mark it as bad

    The issue appears to arise because the UBI driver believes subpage writes are supported by the lower layers, so at some point it issues a subpage write. The OMAP2 NAND driver does not support this, however, so the intermediate layer translates the subpage write into a full-page write, padding the rest of the page with 0xFF bytes.

    Multiple writes like that, with no erase commands in between, would ordinarily work, as the 0xFF bytes don't overwrite existing data in NAND flash. However, this breaks in the ECC region: 0xFF data has its own ECC signature, so the ECC bytes from one write are programmed on top of those from the previous write and subsequent writes end up overwriting each other's ECC output.
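
    The overwrite effect described above can be sketched with a toy model. This is only an illustration: toy_ecc is a made-up XOR checksum standing in for BCH, the 4-byte sector size is arbitrary, and none of the names come from the OMAP2 NAND driver. The only property it relies on is that programming NAND without an erase can clear bits but never set them.

```python
# Toy model (hypothetical names): programming NAND without an erase can only
# clear bits (1 -> 0), so a second write ANDs new bytes with the stored ones.

SECTOR = 4  # bytes per ECC sector; deliberately tiny, for illustration only
ERASED = b"\xff"


def toy_ecc(sector: bytes) -> int:
    """Stand-in for BCH: XOR of all bytes. NOT the real GPMC/BCH algorithm."""
    ecc = 0
    for b in sector:
        ecc ^= b
    return ecc


def program(stored: bytes, new: bytes) -> bytes:
    """Program over existing contents: each bit can only go from 1 to 0."""
    return bytes(s & n for s, n in zip(stored, new))


def write_page(page: bytes, oob: bytes, data: bytes):
    """Full-page write: program the data plus one ECC byte per sector."""
    ecc = bytes(toy_ecc(data[i:i + SECTOR]) for i in range(0, len(data), SECTOR))
    return program(page, data), program(oob, ecc)


def sector_ok(page: bytes, oob: bytes, idx: int) -> bool:
    """Recompute the ECC for one sector and compare it with the stored byte."""
    return toy_ecc(page[idx * SECTOR:(idx + 1) * SECTOR]) == oob[idx]


# Start from an erased two-sector page and its erased ECC area.
page, oob = ERASED * (2 * SECTOR), ERASED * 2

# First "subpage" write: real data in sector 0, 0xFF padding in sector 1.
page, oob = write_page(page, oob, b"\x12\x34\x56\x78" + ERASED * SECTOR)

# Second "subpage" write, no erase: 0xFF padding in sector 0, data in sector 1.
page, oob = write_page(page, oob, ERASED * SECTOR + b"\x9a\xbc\xde\xf1")

print("data:", page.hex())  # both data sectors survive intact
print("ecc ok:", sector_ok(page, oob, 0), sector_ok(page, oob, 1))
```

    In this model the data bytes of both sectors survive, but the stored ECC bytes are the AND of two different codes and no longer match either sector, which is the same shape of failure as the -74 errors above.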

    This issue did not previously show up because the code contained a hack that disabled ECC entirely for non-page-aligned reads. That hack was removed in commit 45fc6a7, which exposed the issue.

    The easiest fix seems to be to simply tell the driver we don't support subpage writes early enough for this to be propagated in the initialisation code. Here is a patch to do this: 3581.subpage_ecc.diff
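
    To illustrate why disabling subpage writes avoids the double write, here is a rough sketch of how UBI places its two per-block headers by default: the EC header at offset 0 and the VID header at the next multiple of the smallest write unit. The 4096-byte page and the shift of 2 are assumptions, picked only because they give the 1024-byte offset seen in the dump above; the function names are hypothetical and this is not the UBI source.

```python
# Sketch of UBI's default header placement (hypothetical helper names).
# Assumption: the VID header offset is not overridden when attaching.

UBI_EC_HDR_SIZE = 64  # erase-counter header size in bytes


def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def ubi_header_layout(page_size: int, subpage_shift: int):
    """Return (ec_hdr_offset, vid_hdr_offset) within an eraseblock.

    subpage_shift == 0 models a driver that reports no subpage support.
    """
    hdrs_min_io = page_size >> subpage_shift  # smallest unit UBI will write
    return 0, align_up(UBI_EC_HDR_SIZE, hdrs_min_io)


def headers_share_page(page_size: int, subpage_shift: int) -> bool:
    """True if both headers land in the same NAND page (written twice)."""
    ec_off, vid_off = ubi_header_layout(page_size, subpage_shift)
    return ec_off // page_size == vid_off // page_size


PAGE = 4096  # assumed page size, not taken from the thread

print(ubi_header_layout(PAGE, 2), headers_share_page(PAGE, 2))
print(ubi_header_layout(PAGE, 0), headers_share_page(PAGE, 0))
```

    With subpages advertised, both headers fall in the first page, which is therefore programmed twice at different times. With subpages disabled, the VID header moves to the second page and each page is programmed only once.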

    I would appreciate some feedback on this issue to confirm I'm on the right track and haven't missed something here.