UBIFS mount failure due to uncorrectable data

Daniel70334

TI folks,

I'm on a DM814x project, using the same NAND geometry as the DM814x-EVM (512 MB, 2048-byte write size, 64-byte OOB - MT29F4G16). I have all of the latest Arago patches except those that support sub-page access, so we have sub-page access completely disabled in UBIFS, as instructed by the TI wiki. After more than a year on the current hardware, we have started to get reports of flaky behavior, at least one of which is definitely a NAND issue: UBIFS fails to mount because ubi_io_read() fails with -74 (EBADMSG), because omap_correct_data() fails, because elm_decode_bch_error() fails, because the ECC_CORRECTABLE bit in the ELM_LOCATION_STATUS register doesn't go to 1, indicating that the DM814x's error location process failed.

When I look at the OOB data for the block that cannot be corrected, I find:

ff f6 12 10 c1 10 06 0b 5b 90 14 36 47 43 32 00
03 dd 04 30 19 59 66 b0 1e 02 82 8b 54 00 01 9a
6e 31 15 c0 1d 2a c4 43 05 3a 09 00 14 70 6a 8b
92 40 0d 24 37 40 40 20 08 00 d7 ff 17 12 17 12

Notice:

BBM bytes are 0xFF 0xF6. I expected 0xFF 0xFF.
Last six bytes are not 0xFF. I expected six 0xFF bytes, since the BBM and ECC bytes together occupy only 58 bytes (2 bytes BBM + (2048/512)*14 bytes ECC).
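The layout arithmetic above can be checked directly against the dump. The following is a minimal sketch, assuming the OOB is laid out as 2 BBM bytes, then 4 x 14 ECC bytes, then 6 unused bytes, as described in the post; the helper names split_oob and non_ff_offsets are made up for illustration and are not part of any TI or MTD tool.

```python
# Sketch: split the 64-byte OOB dump from the post into its three regions
# and report which bytes outside the ECC area are not 0xFF.
# Layout assumption (from the post): 2 BBM bytes + (2048/512)*14 ECC bytes.

OOB_SIZE = 64
BBM_BYTES = 2
ECC_BYTES_PER_SECTOR = 14
SECTORS = 2048 // 512

DUMP = """
ff f6 12 10 c1 10 06 0b 5b 90 14 36 47 43 32 00
03 dd 04 30 19 59 66 b0 1e 02 82 8b 54 00 01 9a
6e 31 15 c0 1d 2a c4 43 05 3a 09 00 14 70 6a 8b
92 40 0d 24 37 40 40 20 08 00 d7 ff 17 12 17 12
"""


def split_oob(dump_text):
    """Return (bbm, ecc, tail) byte strings for a 64-byte OOB hex dump."""
    data = bytes.fromhex("".join(dump_text.split()))
    if len(data) != OOB_SIZE:
        raise ValueError("expected %d OOB bytes, got %d" % (OOB_SIZE, len(data)))
    ecc_end = BBM_BYTES + SECTORS * ECC_BYTES_PER_SECTOR  # 2 + 56 = 58
    return data[:BBM_BYTES], data[BBM_BYTES:ecc_end], data[ecc_end:]


def non_ff_offsets(region, base):
    """Return the OOB offsets (starting at base) of bytes that are not 0xFF."""
    return [base + i for i, b in enumerate(region) if b != 0xFF]


bbm, ecc, tail = split_oob(DUMP)
print("BBM non-FF offsets: ", non_ff_offsets(bbm, 0))
print("ECC region length:  ", len(ecc))
print("Tail non-FF offsets:", non_ff_offsets(tail, BBM_BYTES + len(ecc)))
```

Running the same check on a dump from a healthy page should give two empty lists, which makes it a quick way to screen nanddump output for this failure signature.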

Neither the U-Boot "nand" tools nor Linux "nanddump" recognize this block as a bad block, but both U-Boot and Linux conclude that there are uncorrectable ECC errors.

So I've started running the kernel level MTD tests. The mtd_oobtest module is reporting 2 to 15 single bit flips over the whole FLASH when I run it. Should this test be passing with zero errors? This test is calling mtd->read_oob(), which I think may not be using ELM???

I've also run mtd_pagetest, mtd_readtest, mtd_nandecctest - they finished with zero errors.

What else should I try?

Also, I noticed an accusation of a TI bug in omap2.c by Ron Olson at this post: http://e2e.ti.com/support/embedded/linux/f/354/t/208299.aspx

Ron Olson said:

We're using BCH8, and had to make the following change to get BCH8 to work. The code, after detecting that a bit correction was needed, made the correction in the wrong byte. My omap2.c module is a hybrid of TI's distributions for the am35x (Linux 2.6.37) and am335x (Linux 3.2).

Original code (module omap2.c, function omap_correct_data(), within case OMAP_ECC_BCH8_CODE_HW):
byte_pos = (BCH8_ECC_MAX - err_loc[j] - 1) / 8;
Modified code:
byte_pos = err_loc[j] / 8;

Is it true? The success of mtd_pagetest makes me think not. But then I tried to match the omap2.c code to the 528-byte example in the DM814x TRM, Table 1-151, where bits are ordered from bit 7 to bit 0, then from bit 15 to bit 8, and I cannot understand how the BCH8 clause of omap_correct_data() can result in the mapping shown in the example.
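To see how the two byte_pos formulas quoted above relate, here is a small sketch that evaluates both. It assumes BCH8_ECC_MAX is the codeword length in bits, (512 + 13) * 8 = 4200; that value is not stated in this thread and should be checked against the omap2.c in use. Under that assumption the two formulas are mirror images: one counts bytes from the start of the codeword and the other from the end. The sketch does not decide which direction matches the ELM's bit numbering.

```python
# Sketch: compare the original and modified byte_pos formulas.
# ASSUMPTION: BCH8_ECC_MAX = (512 data bytes + 13 ECC bytes) * 8 bits.
# This constant is not given in the thread; verify it in your omap2.c.

BCH8_ECC_MAX = (512 + 13) * 8  # 4200 bits (assumed)


def byte_pos_original(err_loc):
    """Formula quoted as the original code (C integer division)."""
    return (BCH8_ECC_MAX - err_loc - 1) // 8


def byte_pos_modified(err_loc):
    """Formula quoted as Ron Olson's modification."""
    return err_loc // 8


last_byte = BCH8_ECC_MAX // 8 - 1  # index of the final byte of the codeword

for err_loc in (0, 7, 8, 4191, 4199):
    print(err_loc, byte_pos_original(err_loc), byte_pos_modified(err_loc))

# For every bit location the two results add up to the last byte index,
# i.e. the formulas index the same buffer from opposite ends.
mirrored = all(
    byte_pos_original(e) + byte_pos_modified(e) == last_byte
    for e in range(BCH8_ECC_MAX)
)
print("mirror images:", mirrored)
```

So the disagreement reduces to whether error location 0 reported by the ELM refers to the first or the last bit of the codeword as the driver lays it out in memory.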

Dan -

over 12 years ago

0 Daniel70334 over 12 years ago

Expert 1745 points

Some more info. I read on-line about the mtd test util called integck. So I started running the following line on the mount point of my largest UBI volume:

integck -p -e /root/tst/

This utility basically writes and modifies files, unmounts the volume, remounts it, and checks that everything came back. I find that, if I remove power from the unit while integck is running, then the issue is easily reproduced:

Non-FF BBM and last six bytes of OOB
UBIFS mount fails due to uncorrectable ECC error

Is there a possibility that GPMC interface is malfunctioning?

Dan -

0 Pavel Botev over 12 years ago in reply to Daniel70334

TI Guru 170625 points

Dan,

This is the feedback I have from the team:

I think there is a misconfiguration in selecting the ecc-scheme when writing these pages.

(Was there any problem while updating or re-flashing the images remotely?)

ff f6 12 10 c1 10 06 0b 5b 90 14 36 47 43 32 00
03 dd 04 30 19 59 66 b0 1e 02 82 8b 54 00 01 9a
6e 31 15 c0 1d 2a c4 43 05 3a 09 00 14 70 6a 8b
92 40 0d 24 37 40 40 20 08 00 d7 ff 17 12 17 12

(1) “17 12 17 12” bytes have a recurring pattern which does not indicate bit-flips; it's some issue in the driver or in ecc-scheme usage.

(2) “ff f6” These bytes are used for bad-block marker, so they are never touched by omap driver.

The above pattern indicates that at some point you have used the ‘HAMMING_CODE_DEFAULT’ ecc-scheme (perhaps by mistake), which could have caused this over-writing of the OOB.

0 Daniel70334 over 12 years ago in reply to Pavel Botev

Expert 1745 points

Pavel,

Thanks for responding.

Pavel Botev said:

The above pattern indicates that at some point you have used the ‘HAMMING_CODE_DEFAULT’ ecc-scheme (perhaps by mistake), which could have caused this over-writing of the OOB.

That is interesting, because it may explain why the junk BBM and junk last six bytes are curiously similar when I examine them after reproducing the problem several times - lots of "12"s:

    BBM     ECC data        Last six bytes
1:  ff f6   12 10...08 00   d7 ff 17 12 17 12
2:  ff 3b   16 12...e1 00   1f ff 17 12 17 12
3:  bf 5e   12 12...2a 00   df ff 1f 12 1f 12
4:  df d7   13 10...27 00   5f ff 1f 12 1f 12
5:  ff fb   fb d3...26 00   ff ff ff f3 ff f3
6:  37 de   02 00...13 00   df ff 17 12 17 12
7:  5f 73   14 10...38 00   57 ff 17 12 17 12
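The similarity between the seven samples can be made concrete with a short script. This is only a sketch over the "last six bytes" values listed above; the function names are made up for illustration. It checks two things: whether the final two byte pairs (OOB offsets 60-61 and 62-63) repeat, and whether offset 59 stays at 0xFF.

```python
# Sketch: look for structure in the last six OOB bytes (offsets 58..63)
# from the seven reproductions listed in the post.
from collections import Counter

TAILS = [
    "d7 ff 17 12 17 12",
    "1f ff 17 12 17 12",
    "df ff 1f 12 1f 12",
    "5f ff 1f 12 1f 12",
    "ff ff ff f3 ff f3",
    "df ff 17 12 17 12",
    "57 ff 17 12 17 12",
]


def parse(tail_hex):
    """Convert a space-separated hex string into bytes."""
    return bytes.fromhex("".join(tail_hex.split()))


def pair_repeats(tail):
    """True if bytes at offsets 60-61 equal the bytes at offsets 62-63."""
    return tail[2:4] == tail[4:6]


tails = [parse(t) for t in TAILS]

all_repeat = all(pair_repeats(t) for t in tails)
offset_59_always_ff = all(t[1] == 0xFF for t in tails)
pair_counts = Counter(t[2:4].hex() for t in tails)

print("pair at 60-61 repeated at 62-63 in every sample:", all_repeat)
print("offset 59 is 0xFF in every sample:", offset_59_always_ff)
print("repeated pair counts:", dict(pair_counts))
```

Random bit flips would be unlikely to produce a duplicated two-byte pair in every sample, which is consistent with the view that these bytes are written by something rather than corrupted in place.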

So, following up on this:

board-support/linux-2.6.37-psp04.04.00.02/arch/arm/mach-omap2/gpmc.c, in gpmc_enable_hwecc():
- I added an early return in the "default" case of "switch (mode)", in case a bad "mode" is specified, which would lead to a bad bch_mod, which would result in a non-BCH8 setting in the GPMC_ECC_CONFIG register.
- If bch_mod was ever non-1 (i.e. not BCH8), I logged it and returned early. Again, this is to prevent a Hamming setting in the GPMC_ECC_CONFIG register.
board-support/linux-2.6.37-psp04.04.00.02/arch/arm/mach-omap2/gpmc.c, in gpmc_calculate_ecc():
- I logged any attempt to use non-BCH8, and then forced ecc_type to OMAP_ECC_BCH8_CODE_HW.
board-support/linux-2.6.37-psp04.04.00.02/drivers/mtd/nand/nand_base.c, in nand_write_page_hwecc():
- In the oob_poi for loop, I put in a check to ensure that eccpos[i] was never less than 2 or greater than 56+2, and then did a "continue" if it was.
board-support/linux-2.6.37-psp04.04.00.02/drivers/mtd/nand/omap2.c, in omap_write_buf_pref():
- I added a check to log and exit early if len is ever not 512 and not 64.
- I added a check to log and exit early if the 58th byte of len-64 data is ever non-0xFF.

None of these checks were triggered - not during normal operation, and not when I removed the power (as far as I can tell). None of the early-returns prevented the issue from happening. So I cannot identify a place where GPMC's ECC engine is mis-programmed.

On the other hand ... I found that this error does not happen when we hit the PMIC's HDRST pin (with a push button switch on our board). So I guess that the issue occurs when 3.3V goes away, since that voltage is not provided by PMIC.

So I am wondering if PMIC's VMBCH2 interrupt can help us. We have the VCCS input connected to a 5V, which is generated from 3.3V. The PMIC IN1 interrupt is connected to DM814x NMIn pin input. Maybe I can avoid this issue by using NMI to stop CPU activity when voltage level gets too low?

Do you have any pointers of where and how to set up NMI? Sec. 14.1.4 of TRM gives me the impression that I can follow the example of an existing interrupt connected through GPIO ...

Dan -

0 Pavel Botev over 12 years ago in reply to Daniel70334

TI Guru 170625 points

Dan,

Daniel70334 said:
Do you have any pointers of where and how to set up NMI?

The NMIn signal is mapped to the H7 pin, via register PINCNTL261[7:0] MUXMODE = 0x1. The NMI is mapped to Cortex-A8 INTC interrupt number 7.

I can provide the pointers below; I hope they help:

http://processors.wiki.ti.com/index.php/TI81XX_PSP_GPIO_Driver_User_Guide#IRQ_handling

http://processors.wiki.ti.com/index.php/Configuring_GPIO_Interrupts

http://e2e.ti.com/support/arm/sitara_arm/f/791/p/170369/933454.aspx#933454

http://e2e.ti.com/support/microcontrollers/stellaris_arm/f/471/t/63836.aspx

Best regards,
Pavel

0 Daniel70334 over 12 years ago in reply to Pavel Botev

Expert 1745 points

Thanks for the hints Pavel. Actually, the GPIO-based interrupts weren't quite helpful since NMIn is its own dedicated input rather than a GPIO, but nonetheless I was able to follow the example of some of the existing interrupts to set something up.

I was able to test NMI operation using the PMIC's RTC, and that worked fine - NMIn asserted when expected, and my ISR ran. But when I tried setting up the VMBCH2 interrupt to tell me when VCCS got too low, I found that NMIn did a slow decay rather than a nice transition. I guess that the voltage input is decaying too slowly for the PMIC to give us a useful warning.

0 Pavel Botev over 12 years ago in reply to Daniel70334

TI Guru 170625 points

Dan,

Regarding the PMIC issue, we have a special E2E forum for the PMIC recommended for DM814x device (TPS659xx):

http://e2e.ti.com/support/power_management/pmu/f/43.aspx

After discussing with the PMU TPS659xx team, I have the feedback below, which might help:

If I understand your issue correctly, you are trying to turn on COMP2. If so, here is how you do it.

COMP2 is disabled by default and can be enabled by software. The comparator trigger generates an interrupt which is programmable on the rising (VMBCH2_H_IT) or falling edge (VMBCH2_L_IT), hence the comparator can be used for detecting high or low battery scenarios. COMP2 generates an interrupt for the host. In sleep mode, this creates a wake-up interrupt for the host. In off mode, the comparator trigger generates a turn-on event. In backup or no supply modes, the comparator is not active.

The COMP2 threshold can be set from 2.5 to 3.5 V with 50-mV steps. Enabling the comparator is done through the voltage threshold selection bit VMBDCH2_SEL, which is set to 0 by default.

The interrupt is an output signal, and VCCS has the input voltage connected to it; whenever the voltage crosses above or below this threshold, an interrupt is generated. INT is a digital signal and there is no decay related to it.
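The selectable COMP2 thresholds described above (2.5 V to 3.5 V in 50 mV steps) can be enumerated with a few lines of code when choosing a trip point. This is a rough sketch only: the function names are hypothetical, and it deliberately does not model how a threshold is encoded in the PMIC register, since that encoding is not given in this thread.

```python
# Sketch: list the COMP2 thresholds implied by "2.5 to 3.5 V with 50-mV steps"
# and snap a desired trip voltage to the nearest selectable value.
# NOTE: register encoding is NOT modeled here; consult the PMIC datasheet.

MIN_MV = 2500
MAX_MV = 3500
STEP_MV = 50


def comp2_thresholds_mv():
    """All selectable thresholds in millivolts, lowest first."""
    return list(range(MIN_MV, MAX_MV + STEP_MV, STEP_MV))


def nearest_threshold_mv(target_mv):
    """Closest selectable threshold to target_mv (lower one wins a tie)."""
    return min(comp2_thresholds_mv(), key=lambda t: abs(t - target_mv))


print(len(comp2_thresholds_mv()), "selectable thresholds")
print("requested 3320 mV ->", nearest_threshold_mv(3320), "mV")
```

Whatever threshold is chosen, the warning is only useful if the monitored rail crosses it while the PMIC and CPU still have enough supply left to react.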

0 Daniel70334 over 12 years ago in reply to Pavel Botev

Expert 1745 points

Pavel,

Thanks for the info.

Pavel Botev said:

The comparator trigger generates an interrupt which is programmable on the rising (VMBCH2_H_IT) or falling edge (VMBCH2_L_IT), hence the comparator can be used for detecting high or low battery scenarios.

Yes, I set bit VMBCH2_L_IT_MSK (mask = 0x40) in INT_MSK3_REG (address offset = 0x55).

Pavel Botev said:

The COMP2 threshold can be set from 2.5 to 3.5 V with 50-mV steps. Enabling the comparator is done through the voltage threshold selection bit VMBDCH2_SEL, which is set to 0 by default.

Yes, I set VMBCH2_REG (address offset = 0x6B) to 0x30, using code similar to that found at https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/mfd/tps65911-comparator.c?id=refs/tags/v3.14-rc4

Pavel Botev said:

The interrupt is an output signal, and VCCS has the input voltage connected to it; whenever the voltage crosses above or below this threshold, an interrupt is generated. INT is a digital signal and there is no decay related to it.

Unfortunately, when I removed power from the board, I was not able to get an interrupt. My assumption is that the 5V regulator output connected to VCCS did not drop fast enough before PMIC itself was dead (hence the decay down instead of sharp transition). So I have given up on this idea.

Instead, I've looked into ignoring ELM failures (since the origin of the bad ECC bytes cannot be identified), and enabling UBIFS-level checksumming (the "chk_data_crc" mount option) instead.

Dan -

0 ashok kumar6 over 10 years ago in reply to Daniel70334

Expert 1800 points

Hi Daniel,

I am facing a few issues porting the MT29F4G16 NAND driver to Android.

Here I reported the issue.

If possible, could you please share the U-Boot and kernel patches for the MT29F4G16 driver? Thanks in advance.

Regards

Ashok

0 Daniel70334 over 10 years ago in reply to ashok kumar6

Expert 1745 points

Ashok,

This thread was not specifically about 16-bit NAND support. However, at this point, all the 16-bit support for DM814xx is on-line at:

arago-project.org/.../

This is past "patch2", so you cannot just get the latest psp.

I actually do not have the BCH16 and sub-page support added in the last few commits, as by then TI had stopped making DM81xx releases, so such a big change seemed risky. But the basic NAND support was all in place by then.

Good luck,

Dan -
