TI folks,
I'm on a DM814x project, using the same NAND geometry as the DM814x-EVM (512 MB, 2048-byte write size, 64-byte OOB; MT29F4G16). I have all of the latest Arago patches except those that support sub-page access, so sub-page access is completely disabled in UBIFS, as instructed by the TI wiki. After more than a year on the current hardware, we have started to get reports of flaky behavior, at least one of which is definitely a NAND issue: UBIFS fails to mount because ubi_io_read() fails with -74 (-EBADMSG). That happens because omap_correct_data() fails, because elm_decode_bch_error() fails, because the ECC_CORRECTABLE bit in the ELM_LOCATION_STATUS register doesn't go to 1, indicating that the DM814x's error location process failed.
When I look at the OOB data for the block that cannot be corrected, I find:
ff f6 12 10 c1 10 06 0b 5b 90 14 36 47 43 32 00
03 dd 04 30 19 59 66 b0 1e 02 82 8b 54 00 01 9a
6e 31 15 c0 1d 2a c4 43 05 3a 09 00 14 70 6a 8b
92 40 0d 24 37 40 40 20 08 00 d7 ff 17 12 17 12
Notice:
- BBM bytes are 0xFF 0xF6. I expected 0xFF 0xFF.
- The last six bytes (d7 ff 17 12 17 12) are not all 0xFF. I expected six 0xFF bytes, since the BBM and ECC bytes cover only 58 of the 64 OOB bytes (2 bytes BBM + (2048/512)*14 bytes ECC).
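To make that expectation concrete, here is a minimal sketch of the layout arithmetic applied to the dump above. It assumes the layout I described (2 BBM bytes followed by 14 ECC bytes per 512-byte sector); the constant and function names are my own, for illustration only, and are not taken from omap2.c.

```c
#include <assert.h>
#include <stddef.h>

/* Geometry from this post; names are illustrative, not from omap2.c. */
enum {
    PAGE_SIZE_BYTES = 2048,
    OOB_SIZE_BYTES  = 64,
    SECTOR_BYTES    = 512,
    ECC_PER_SECTOR  = 14,   /* assumed BCH8 ECC bytes per 512-byte sector */
    BBM_BYTES       = 2
};

/* Number of OOB bytes taken by the bad block marker plus ECC. */
static size_t oob_used_bytes(void)
{
    return BBM_BYTES + (PAGE_SIZE_BYTES / SECTOR_BYTES) * ECC_PER_SECTOR;
}

/* Count bytes in oob[start .. OOB_SIZE_BYTES) that are not 0xFF. */
static size_t count_non_ff(const unsigned char *oob, size_t start)
{
    size_t n = 0;

    assert(start <= OOB_SIZE_BYTES);
    for (size_t i = start; i < OOB_SIZE_BYTES; i++)
        if (oob[i] != 0xFF)
            n++;
    return n;
}

/* OOB dump of the page that cannot be corrected, copied from above. */
static const unsigned char failing_oob[OOB_SIZE_BYTES] = {
    0xff, 0xf6, 0x12, 0x10, 0xc1, 0x10, 0x06, 0x0b,
    0x5b, 0x90, 0x14, 0x36, 0x47, 0x43, 0x32, 0x00,
    0x03, 0xdd, 0x04, 0x30, 0x19, 0x59, 0x66, 0xb0,
    0x1e, 0x02, 0x82, 0x8b, 0x54, 0x00, 0x01, 0x9a,
    0x6e, 0x31, 0x15, 0xc0, 0x1d, 0x2a, 0xc4, 0x43,
    0x05, 0x3a, 0x09, 0x00, 0x14, 0x70, 0x6a, 0x8b,
    0x92, 0x40, 0x0d, 0x24, 0x37, 0x40, 0x40, 0x20,
    0x08, 0x00, 0xd7, 0xff, 0x17, 0x12, 0x17, 0x12
};
```

Calling count_non_ff(failing_oob, oob_used_bytes()) counts how many of the bytes past the BBM and ECC area have been written; on a healthy page with this layout I would expect it to return 0.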
Neither the U-Boot "nand" tools nor Linux "nanddump" recognizes this block as a bad block, but both U-Boot and Linux conclude that there are uncorrectable ECC errors.
So I've started running the kernel-level MTD tests. The mtd_oobtest module reports 2 to 15 single-bit flips over the whole flash each time I run it. Should this test be passing with zero errors? It calls mtd->read_oob(), which I think may not be using the ELM?
I've also run mtd_pagetest, mtd_readtest, mtd_nandecctest - they finished with zero errors.
What else should I try?
Also, I noticed an accusation of a TI bug in omap2.c by Ron Olson at this post: http://e2e.ti.com/support/embedded/linux/f/354/t/208299.aspx
Ron Olson said: We're using BCH8, and had to make the following change to get BCH8 to work. The code, after detecting that a bit correction was needed, made the correction in the wrong byte. My omap2.c module is a hybrid of TI's distributions for the am35x (Linux 2.6.37) and am335x (Linux 3.2).
Original code (module omap2.c, function omap_correct_data(), within case OMAP_ECC_BCH8_CODE_HW):
byte_pos = (BCH8_ECC_MAX - err_loc[j] - 1) / 8;
Modified code:
byte_pos = err_loc[j] / 8;
Is it true? The success of mtd_pagetest makes me think not. But then I tried to match the omap2.c code to the 528-byte example in the DM814x TRM, Table 1-151, where bits are ordered from bit 7 to bit 0, then from bit 15 to bit 8, and I cannot understand how the BCH8 clause of omap_correct_data() can result in the mapping shown in the example.
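To show what differs between the two versions, here is a small sketch comparing the mappings. It assumes BCH8_ECC_MAX is (512 + 13) * 8 = 4200 bits, which is how I read the TI source; please correct me if that value is wrong. The two helper function names are mine and do not exist in omap2.c.

```c
#include <assert.h>

/* ASSUMPTION: 512 data bytes + 13 ECC bytes per sector, counted in bits. */
#define BCH8_ECC_MAX ((512 + 13) * 8)

/* Byte offset as computed by the original omap_correct_data() line. */
static int byte_pos_original(int err_loc)
{
    assert(err_loc >= 0 && err_loc < BCH8_ECC_MAX);
    return (BCH8_ECC_MAX - err_loc - 1) / 8;
}

/* Byte offset as computed by Ron Olson's modified line. */
static int byte_pos_modified(int err_loc)
{
    assert(err_loc >= 0 && err_loc < BCH8_ECC_MAX);
    return err_loc / 8;
}
```

Under that assumption the two results always sum to 524, so the mappings are mirror images: the original counts bytes from the end of the 525-byte codeword and the modified one counts from the start. For a given ELM error location they agree only at the middle byte (offset 262), so which one is right should depend on the order in which the ELM numbers the bits.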
Dan -
