AM1808: Repair and move blocks with bitflips

Part Number: AM1808

We have some problems/questions regarding the scrubbing procedure in the AM1808 using UBIFS. We use 4 bit ECC.
In the scrubbing procedure, we read the full UBIFS partition and we observed that in case of a bitflip error, the whole block is repaired and moved to another location.

However, we have encountered some situations that we do not understand.

Your help is very much appreciated.

1         Questions:

1)      What is the meaning of “error_address > 512” (ref. davinci_nand.c)?
We have observed that error_address in davinci_nand.c was greater than 512, with the result that the related block was not repaired.
What is the meaning of NAND_ERR_ADD1/2_OFFSET and NAND_ERR_ERRVAL1/2_OFFSET (the NANDERRADD1/2 and NANDERRVAL1/2 registers, see References)?
Under which conditions can this value be greater than or equal to 512?

2)      What are the limitations of the scrubbing procedure?
According to our understanding it is guaranteed to correct at least 4 bitflips.
As the number of bitflips increases over time, we need to ensure that a block gets repaired and moved before it reaches the 4 bitflips limit.
We observed that under certain situations a bitflip is not repaired by the scrubbing procedure (ref. Question 1).
Can there be situations in which the number of bitflips reaches the 4-bit limit and the block does not get repaired/moved?
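
The concern in question 2 can be written down as a small decision rule. The sketch below assumes an ECC strength of 4 correctable bitflips per ECC step, as stated above. `classify_read`, `enum read_verdict` and `ECC_STRENGTH` are hypothetical names introduced for illustration; they are not part of davinci_nand.c or UBI.

```c
#include <assert.h>

#define ECC_STRENGTH 4 /* assumed: correctable bitflips per ECC step */

enum read_verdict {
	READ_CLEAN,         /* no bitflips seen */
	READ_NEEDS_SCRUB,   /* correctable, block should be moved soon */
	READ_UNCORRECTABLE  /* beyond what the ECC can repair */
};

/* Hypothetical helper: classify one ECC step by its bitflip count. */
static enum read_verdict classify_read(int bitflips)
{
	assert(bitflips >= 0);

	if (bitflips == 0)
		return READ_CLEAN;
	if (bitflips <= ECC_STRENGTH)
		return READ_NEEDS_SCRUB;
	return READ_UNCORRECTABLE;
}
```

A scrubbing pass only helps while every ECC step is still in the `READ_NEEDS_SCRUB` range. The open point of question 2 is whether bitflips that the driver does not count can push a step into `READ_UNCORRECTABLE` without ever passing through this decision.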

2         References

PROCESSOR:

root@am1808-evm:~# cat /proc/cpuinfo
Processor       : ARM926EJ-S rev 5 (v5l)
BogoMIPS        : 149.50
Features        : swp half thumb fastmult edsp java
CPU implementer : 0x41
CPU architecture: 5TEJ
CPU variant     : 0x0
CPU part        : 0x926
CPU revision    : 5

Hardware        : AM1808 EVM
Revision        : 0000
Serial          : 0000000000000000


KERNEL:

root@am1808-evm:~# uname -a
Linux am1808-evm 2.6.37-1.97.0-r22504 #1 PREEMPT Tue Mar 26 02:23:30 CET 2019 armv5tejl GNU/Linux


spruh82b AM1808 tech ref manual 2016-07.pdf, chapter 19.4

Offset  Register     Description                                     Section
D0h     NANDERRADD1  NAND Flash 4-Bit ECC Error Address Register 1   19.4.20
D4h     NANDERRADD2  NAND Flash 4-Bit ECC Error Address Register 2   19.4.21
D8h     NANDERRVAL1  NAND Flash 4-Bit ECC Error Value Register 1     19.4.22
DCh     NANDERRVAL2  NAND Flash 4-Bit ECC Error Value Register 2     19.4.23


davinci_nand.c

The relevant code section is here:

 

 

/* Correct up to 4 bits in data we just read, using state left in the
 * hardware plus the ecc_code computed when it was first written.
 */
static int nand_davinci_correct_4bit(struct mtd_info *mtd,
                u_char *data, u_char *ecc_code, u_char *null)
{
...

correct:
        /* correct each error */
        for (i = 0, corrected = 0; i < num_errors; i++) {
                int error_address, error_value;

                if (i > 1) {
                        error_address = davinci_nand_readl(info,
                                                NAND_ERR_ADD2_OFFSET);
                        error_value = davinci_nand_readl(info,
                                                NAND_ERR_ERRVAL2_OFFSET);
                } else {
                        error_address = davinci_nand_readl(info,
                                                NAND_ERR_ADD1_OFFSET);
                        error_value = davinci_nand_readl(info,
                                                NAND_ERR_ERRVAL1_OFFSET);
                }

                if (i & 1) {
                        error_address >>= 16;
                        error_value >>= 16;
                }
                error_address &= 0x3ff;
                error_address = (512 + 7) - error_address;
                if (error_address < 512) {
                        data[error_address] ^= error_value;
                        corrected++;
                }
        }

        return corrected;
}
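
To make the address arithmetic in the loop above concrete, here is a minimal standalone sketch of the same decoding step. `decode_error_offset`, `offset_is_in_data` and `DATA_BYTES` are hypothetical names introduced for illustration; the 10-bit mask, the choice of register half and the `(512 + 7) - address` mapping are copied from the posted code. The lower-bound check in `offset_is_in_data` is an extra guard in this sketch; the posted code only tests the upper bound.

```c
#include <assert.h>
#include <stdint.h>

#define DATA_BYTES 512 /* data bytes handled per call in the posted code */

/*
 * Hypothetical helper: take one 10-bit error address out of a
 * NANDERRADD1/2 register value and map it to a byte offset the same
 * way the posted loop does. half = 0 selects the low 16 bits,
 * half = 1 selects the high 16 bits.
 */
static int decode_error_offset(uint32_t err_add_reg, int half)
{
	uint32_t field;

	assert(half == 0 || half == 1);

	field = half ? (err_add_reg >> 16) : err_add_reg;
	field &= 0x3ff;
	return (DATA_BYTES + 7) - (int)field;
}

/* Nonzero if the offset falls inside data[0..511]. */
static int offset_is_in_data(int offset)
{
	return offset >= 0 && offset < DATA_BYTES;
}
```

With this mapping, a raw address field of 519 decodes to offset 0 and a field of 8 decodes to offset 511. Fields of 7 and below decode to offsets 512 through 519, which are outside the data buffer, so the posted loop neither corrects nor counts them.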

 

  • Klaus,

    There is a NAND boot device erratum that discusses error handling of ECC data in the spare area, which applies to your use case. The workaround for that NAND boot issue needs to be implemented in your Linux NAND run-time driver. Errors >512 typically indicate bit flips in the spare bytes.

    http://www.ti.com/lit/er/sprz313h/sprz313h.pdf (Check Advisory 2.3.13)

    Typically, we have seen users mark the block as bad if more than 4 bits or an uncorrectable number of bits are reported, so that the boot code or driver can skip over the area, but I am not sure how this works in the context of UBIFS. I will consult with our Linux expert to check whether our latest drivers have the workaround implemented, and try to understand how this applies when UBIFS is used as the NAND-based file system.

    Please review the errata document and its workaround and let us know if you have any further questions.

    Regards,

    Rahul

  • Rahul,

    thank you for your answer.

    The aspect you describe is related to the case of ECC errors in combination with the boot block. However, our problem is related to the UBIFS partition, such as the root partition. So, I would appreciate it if you could contact your Linux expert on this topic for further information.


    My understanding is that when we access a block with an ECC bitflip error, the block should be moved to another physical location and re-written with the bitflips corrected. That means that if we access all blocks early enough, before the number of bitflips is too high to repair, blocks with bitflips will be moved and thus corrected. So, the number of available blocks may be reduced over time, but as long as we have enough blocks available, we will never encounter blocks that are unreadable.

    However, my observation is different if I create a bitflip error in the spare data area. I noticed that the block can be read correctly, but it is not moved and the error remains (Error>512 in this case). So, I assume that after a while, when the number of bitflips in the spare area exceeds 4, the block will become unreadable, which will result in an outage of the device.

    So, my questions are:

    1) Is my assumption correct that, when the number of bitflips in the spare area exceeds a limit, the block will become unreadable, even if the block is frequently accessed and could be moved?

    2) If there is a bitflip in the spare data area AND in the data section, would the block be moved when it is accessed, as in the case of a bitflip only in the data section?

    Thanks and Regards

    Klaus

  • Hi, Klaus,

    In the case of more than 4 bitflips, the driver code reads the NANDFSR register, and if the ECC_STATE bits indicate more than 4 bitflips, the driver will not correct them and exits. As for why the limit is 512, we are still investigating and trying to understand why it is coded that way.

    Rex

  • Hi,

    maybe my questions were not precise enough.

    The purpose of my last questions was to estimate the risk that a module becomes unusable due to bitflips. We have lots of controllers in the field and we have encountered modules that stopped working, so we added the scrubbing procedure to avoid module outages.
    We are using UBIFS with a Linux file system, and the scope is a booted controller running Linux.

    As I understand, it is guaranteed that at least 4 bitflips can be corrected. The scrubbing procedure ensures that blocks with bitflips are moved to another location correctly.
    If we run the scrubbing procedure periodically, we can ensure that there are no defective blocks.
    However, this statement is not valid for the OOB section, as bitflips here will not result in a move of a block. If there are more than 4 bitflips in the OOB area the block will become unusable, which may result in the module becoming unusable.

    Now, consider the case where we have 3 bitflips in the OOB and then another bitflip occurs in the related non-OOB section (data section). Will the block be moved?
    I would assume "yes", but I would like to have this confirmed.
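
Going only by the loop posted earlier in this thread, the scenario can be reasoned through with a small sketch. It rests on two assumptions that are not confirmed in this thread: that a block is scheduled for scrubbing whenever the correction routine reports at least one corrected error, and that decoded offsets of 512 and above refer to the ECC bytes in the spare area. `count_data_corrections` and `would_trigger_scrub` are hypothetical names, not driver or UBI functions.

```c
#include <assert.h>
#include <stddef.h>

#define DATA_BYTES 512 /* data bytes handled per call in the posted code */

/*
 * Count the errors the posted loop would fix and report: only decoded
 * offsets below 512. The lower-bound check is an extra guard in this
 * sketch and is not present in the posted code.
 */
static int count_data_corrections(const int *offsets, size_t n)
{
	int corrected = 0;
	size_t i;

	assert(offsets != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (offsets[i] >= 0 && offsets[i] < DATA_BYTES)
			corrected++;
	}
	return corrected;
}

/* Assumption: any nonzero corrected count leads to a scrub request. */
static int would_trigger_scrub(int corrected)
{
	return corrected > 0;
}
```

With three errors at offsets 513, 515 and 518 (spare area) plus one at offset 100 (data), the count is 1, so under the stated assumptions the block would be scrubbed. With only the three spare-area errors the count is 0 and nothing is reported upward, which matches the behaviour described above.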

    An additional question is related to the fact that I am not able to create such a situation (3 bitflips in the OOB section, 1 in the data section).
    With the following command I can write the OOB section only:

    nandwrite  --oob  -s 0x230000 /dev/mtd12 nand-chg.dmp

    So, I dump a block, change it with a hex editor and write it again, and I have an OOB bitflip.

    However, I did not find a procedure with nanddump/nandwrite to create a bitflip in the data section.
    Dumping, changing and re-writing a block did not work for me, as I did not find a "nandwrite" option that writes the data only, but not the OOB.
    When I changed a bit in the data section (dump, change, write) and then re-applied the old OOB, the whole block was messed up.

    nanddump version is 1.31

    Thanks and regards

    Klaus

  • Hi, Klaus,

    Scrubbing seems to be a Linux upstream feature which we are not familiar with. Searching the internet, I came across this Q&A page:

    https://superuser.com/questions/372422/can-linux-scrub-memory

    It mentions that scrubbing involves Linux updating the virtual memory address to point to the new page location. The new location should still be corrected on a subsequent read if a bitflip happens, and should also be relocated. However, the page didn't mention whether the new area is in the OOB section.

    Is the OOB section the spare area which is also used for storing the ECC checksum? If it is just another area in the NAND, then as described earlier, it should be moved again if a bitflip occurs. If it is the spare area, we don't know how the upstream driver handles it.

    Rex