ECC failing when correcting errors from BCH8 codes

Other Parts Discussed in Thread: DM3730

I'm verifying the ECC operation in a client's product that is experiencing unreliable NAND flash.

I'm testing the product by writing corrupted data and OOB data into flash and reading it back. I use the MTD utilities from an Angstrom distribution running Linux 2.6.32.

The system correctly fixes 1-8 bits of random errors in the data. However, when I corrupt a single bit of the BCH8 codes stored in the OOB, the correction fails and none of the data corrections are applied.

The problem is in the algorithm starting at the decode_bch() function call. This source file (omap_bch_decoder.c) is identical to the source file in Arago. 

Here is a test case that uses all 0xFFFF for data and has the following BCH8 codes. The read code has a single-bit error (the first byte should be 0x10).

Calculated ECC
0x6c, 0xa0, 0xd3, 0x66, 0x5f, 0x79, 0x17, 0xb5, 0x31, 0xd4, 0x7e, 0x32, 0xe6

Read ECC
0x11, 0xae, 0xd1, 0xf6, 0x12, 0x6c, 0x65, 0x3d, 0x68, 0x86, 0x1a, 0xdb, 0x4a

BCH decoding failed detect=1 correct=0
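
For anyone reproducing this kind of test, a minimal sketch of a single-bit corruption helper follows. The function name `flip_bit` and the LSB-first bit numbering are assumptions made for illustration; they are not part of the MTD utilities or the OMAP driver.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Flip one bit in buf. bit_index counts from bit 0 (the least significant
 * bit) of byte 0; this numbering is an assumption of the sketch.
 * Returns 0 on success, -1 if bit_index lies past the end of the buffer.
 */
int flip_bit(uint8_t *buf, size_t len, size_t bit_index)
{
    size_t byte = bit_index / 8;

    if (byte >= len)
        return -1;

    buf[byte] ^= (uint8_t)(1u << (bit_index % 8));
    return 0;
}
```

Calling `flip_bit(read_ecc, 13, 0)` on the 13 Read ECC bytes above changes the first byte from 0x11 back to the intended 0x10 and leaves the other bytes untouched.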

Can someone please explain why this is happening? It appears to be a shortcoming in the BCH decoding algorithm. I would expect the algorithm to correct error bits in its own BCH8 codes stored in the OOB of the flash.

 

  • Hi Steve,

    The decode_bch function uses hardware-generated decoding, and there is an erratum that corresponds to your issue: Advisory 1.54 - GPMC Has Incorrect ECC Computation for 4-Bit BCH Mode. You can find the whole description of this erratum in the Silicon Errata for DM3730 document at the link below:

    http://www.ti.com/lit/er/sprz318e/sprz318e.pdf

    You can also check whether the following patches are present in your source code. They implement the 4b/8b BCH based correction and detection.

    Kernel : http://arago-project.org/git/projects/?p=linux-omap3.git;a=shortlog;h=refs/heads/OMAPPSP_03.00.01.06

    X-loader: http://arago-project.org/git/projects/?p=x-load-omap3.git;a=shortlog;h=refs/heads/OMAPPSP_03.00.01.06

    U-boot : http://arago-project.org/git/projects/?p=u-boot-omap3.git;a=shortlog;h=refs/heads/OMAPPSP_03.00.01.06

    As a note, these patches implement the 4b/8b BCH based correction and detection, but do not include support for enabling on-die ECC processing.

    BR
    Tsvetolin Shulev

  • Thanks Tsvetolin 

    The errata does not apply since we are using BCH8.

    Advisory 1.54 GPMC Has Incorrect ECC Computation for 4-Bit BCH Mode

    Revision(s) Affected: 1.0

    Details: The GPMC supports 4- or 8-bit error detection BCH code. 4-bit error mode is using a wrong polynomial, as a result for this mode the GPMC will:

    • On page write, generate incorrect ECC parity.

    • On page read, generate an incorrect syndrome.

    This bug prevents having correct error location.

    Workaround(s): There is no workaround for this issue.

    As for the code, I had previously reviewed the BCH implementation and verified that it's implemented correctly and that none of the current bugs apply.

    Our driver handles reads from and writes to the GPMC interface via DMA rather than having the processor handle them directly. We believe that the transfer is correct because we can test and verify that corrections of up to 8 bits occur on the data bits. The problem only occurs on corruption of the BCH8 codes in the OOB of the flash.

  • I performed an additional test on a different project board that uses a DM8168 with the same NAND flash part. In that test the driver did correct the single-bit error. In that case the software is based on a third-party SDK that was itself primarily based on Arago. Unfortunately I did not have access to instrument the kernel to dump the BCH codes. But I can say from a NAND dump that the BCH codes for all 0xFFFF data were identical between the two projects.

    I also confirmed that the BCH decoder functions were identical. This means one of two things.

    1. That the BCH code read from the GPMC interface is different between OMAP35 and the DM8168 

    2. That there is an operation between the BCH read from the GPMC and the decoding calls that my driver is not performing or performing incorrectly.

    For point 2, my current driver immediately reads the GPMC BCH code and then applies the result to the decoder (with the read BCH data).
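
For point 1, comparing dumped BCH codes from the two platforms by eye is error-prone. A minimal sketch of a helper that counts how many bits differ between two equal-length ECC buffers follows; the name `count_bit_diffs` is hypothetical and the helper is not part of either driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the bits that differ between two byte buffers of the same length. */
unsigned int count_bit_diffs(const uint8_t *a, const uint8_t *b, size_t len)
{
    unsigned int diffs = 0;

    for (size_t i = 0; i < len; i++) {
        uint8_t x = a[i] ^ b[i];        /* set bits mark mismatches */

        while (x) {
            x &= (uint8_t)(x - 1);      /* clear the lowest set bit */
            diffs++;
        }
    }
    return diffs;
}
```

A result of 0 means the two dumps are bit-identical, and a result of 1 means they differ by exactly one bit, as with the intended 0x10 versus the corrupted 0x11 in the test case.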

      

      

  • On a previous post, I had requested access to software tools to generate the BCH codes. 

    http://e2e.ti.com/support/dsp/omap_applications_processors/f/447/p/286578/1004761.aspx#1004761

    Could that be made available to me? 

  • Hi Steve,

    According to the description of the MTD utils, the MTD subsystem supports bare NAND flashes with software and hardware ECC. I haven't tested the possibility of using the software codes, but it seems like a good idea. For more detailed information you can look at the MTD documentation page:

    http://www.linux-mtd.infradead.org/doc/general.html

    BR

    Tsvetolin Shulev

  • Thanks Tsvetolin 

    I've tested with the MTD utilities and found that an error in the BCH code is not corrected by my driver, but it was corrected by another TI driver on a different project. The code is quite similar; the difference would be in the BCH code calculated and read on the DaVinci versus the OMAP3.

    I'm getting conflicting statements about the BCH8 algorithm. Can you tell me whether a BCH decoder can correct errors in the BCH code read from NAND flash?

    When I look at the BCH decoder, it seems that it does not even use the BCH code read from NAND flash. But the driver does detect that there is a bit error (which I had placed in the BCH code in the NAND flash OOB). Can you explain how this can happen?

     


  • Hello Steve

    I found the same problem on AM3707 (flip bit in OOB area and "BCH decoding failed").

    To fix it, you should comment out one line in omap_bch_decoder.c (function "chien"):

    ...

                if (i >= 2 * ecc_bits)
    ...

    I think it's a bug because the "chien" function stores corrections ONLY for the data area and skips corrections for the OOB area.
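
To illustrate why skipping roots that fall in the ECC area could make decoding fail, here is a toy model. It is not the real `chien()` code. It assumes, without confirmation from the thread, that the decoder reports failure when the number of roots it counts differs from the degree of the error-locator polynomial. The names `collect_errors`, `ecc_area_bits` (standing in for the `2 * ecc_bits` threshold in the quoted line) and `count_ecc_roots` are hypothetical. Note that this sketch counts ECC-area roots without storing them, which is a variant of the idea and not the one-line change described above.

```c
#include <stddef.h>

/*
 * Toy model of root collection. roots[] holds the bit positions where the
 * error-locator polynomial evaluated to zero. Positions below ecc_area_bits
 * are assumed to lie in the stored ECC; the rest lie in the data.
 * Data-area errors are written to data_errors[] as offsets into the data.
 * Returns 0 if the counted roots match degree, -1 otherwise.
 */
int collect_errors(const unsigned int *roots, size_t n_roots,
                   unsigned int ecc_area_bits, unsigned int degree,
                   int count_ecc_roots,
                   unsigned int *data_errors, size_t *n_data_errors)
{
    unsigned int found = 0;

    *n_data_errors = 0;
    for (size_t i = 0; i < n_roots; i++) {
        if (roots[i] >= ecc_area_bits) {
            /* Error in the data: record it for correction. */
            data_errors[(*n_data_errors)++] = roots[i] - ecc_area_bits;
            found++;
        } else if (count_ecc_roots) {
            /* Error in the ECC bytes: nothing to fix in the data. */
            found++;
        }
    }
    return found == degree ? 0 : -1;
}
```

With a single root inside the ECC area and `degree` equal to 1, the model returns -1 when `count_ecc_roots` is 0 and returns 0 with no data corrections when it is 1, which matches the symptom reported in this thread.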