Bad NAND or Memory (DRAM) ?

ranchu

Guru 20755 points

Other Parts Discussed in Thread: OMAPL138, OMAP3530

Hello,

This is not a question about a specific chip, but a general question about ECC on TI chips.

I am trying to understand the NAND ECC hardware.

There is a table with these columns:

1. hardware error detection

2. driver solution - error correction

My questions:

1. Does "driver solution" here mean hardware correction (not software)?

2. How do we choose the hw error detection strength (1/4/8 bits)? Or is it a feature of the NAND chip that we can't change?

3. If we have a chip with 4-bit ECC, does it mean we can use NAND ECC 8 (driver)? According to a comment on this page, this is not a good idea:

What will happen if an 8-bit ECC NAND is used with our 4-bit ECC capable devices?

In this scenario, if more than 4 errors are detected, the errors can't be corrected. This can have serious consequences including boot failure. It is advisable to keep correcting the ECC errors in the designated read-only/boot sections of the NAND to reduce the chances of boot failure.

Regards,

Ran

over 9 years ago

Titusrathinaraj Stalin over 9 years ago

TI__Guru** 116100 points

HW error detection means that errors are detected by the TI SoC (such as DaVinci or OMAP), not by external SW.
The HW takes care of error detection and correction.

1. Does "driver solution" here mean hardware correction (not software)?

No, it means that the SW must support the ECC feature.

2. How do we choose the hw error detection strength (1/4/8 bits)? Or is it a feature of the NAND chip that we can't change?

You need to enable it in U-Boot and in the kernel board file.

By default, the RBL supports 1-bit HW ECC.

3. If we have a chip with 4-bit ECC, does it mean we can use NAND ECC 8 (driver)? According to a comment on this page, this is not a good idea:

"What will happen if an 8-bit ECC NAND is used with our 4-bit ECC capable devices?

In this scenario, if more than 4 errors are detected, the errors can't be corrected. This can have serious consequences including boot failure. It is advisable to keep correcting the ECC errors in the designated read-only/boot sections of the NAND to reduce the chances of boot failure."

You have to enable 8-bit SW ECC support in U-Boot and the kernel, using a third-party algorithm or your own (BCH, Reed-Solomon, etc.).
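
To make the quoted warning concrete, here is a toy sketch (not TI's or any driver's actual ECC code) using a Hamming(7,4) code, which can correct exactly one flipped bit per codeword. Real NAND ECC works on whole 256- or 512-byte sectors with 1-bit Hamming or 4/8-bit BCH, but the failure when a sector holds more bit errors than the code was designed for is of the same kind: the decoder either gives up or, as here, silently "corrects" the wrong bit. The names `encode`, `decode`, and `flip` are made up for this illustration.

```python
# Toy illustration only: Hamming(7,4), one correctable bit per 7-bit codeword.
# Codeword layout (1-based positions): p1 p2 d1 p3 d2 d3 d4.

def encode(nibble):
    """Encode 4 data bits into a 7-bit codeword given as a list of bits."""
    d1, d2, d3, d4 = [(nibble >> i) & 1 for i in (3, 2, 1, 0)]
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def decode(code):
    """Return (nibble, syndrome).

    A non-zero syndrome is interpreted as the 1-based position of a single
    flipped bit, and that bit is flipped back before extracting the data.
    """
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)
    if syndrome:
        c[syndrome - 1] ^= 1
    nibble = (c[2] << 3) | (c[4] << 2) | (c[5] << 1) | c[6]
    return nibble, syndrome

def flip(code, *indexes):
    """Return a copy of the codeword with the given 0-based bits inverted."""
    out = list(code)
    for i in indexes:
        out[i] ^= 1
    return out

if __name__ == "__main__":
    data = 0b1011
    code = encode(data)
    print("clean     :", decode(code))
    print("one flip  :", decode(flip(code, 4)))
    print("two flips :", decode(flip(code, 2, 4)))
```

With one flipped bit the original nibble comes back. With two flipped bits the syndrome is still non-zero, so the decoder "repairs" a third, innocent bit and returns wrong data without any indication of failure.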

ranchu over 9 years ago in reply to Titusrathinaraj Stalin

We have found a severe issue with the omap35x chip on a board that is going into large-scale production. We have been working on it intensively for almost 3 weeks, with no solution yet. We see that on some writes a bit at some address goes bad.
It reproduces in one specific scenario (writing one large file after another), and it happens at the same address in NAND.

We have tried sw ecc and hw ecc, with the same result.
We are not sure what we can do. We are thinking of reading the ecc information and trying to understand from it what is happening.

Any idea will help.

Shankari G over 9 years ago in reply to ranchu

TI__Mastermind 43955 points

Ran,

This question does not actually belong in the Linux forum.

Please post it in the OMAP35x forum; you can expect a more appropriate response there.

Please let me know if you want to move your post there.

ranchu over 9 years ago in reply to Shankari G

Hi Titus, Shankari,

Thank you.

This is not specific to OMAP; it applies to any TI chip using NAND. I am not sure the OMAP forum is good for that, because it is an archived forum. I can't reply or post there (can anyone else?), so it is effectively a dead forum.

As I said, we have a severe issue here, because this chip is going into mass production in a few days, and we are in a real mess because of this NAND issue. We have no idea so far.

I would like to add more information:

We are using a Micron MT29 NAND.

We have upgraded to the latest TI Arago version from the tree, but we still get failures:

NAND write: device 0 offset 0x8500000, size 0x950000

9764864 bytes written: OK

NAND read: device 0 offset 0x8500000, size 0x950000

err_loc=3216

NAND read from offset 86c0000 failed -117

1835008 bytes read: ERROR

byte at 0x851e0014 (0xf5) != byte at 0x861e0014 (0x00)
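
The numbers in this log can be cross-checked with a few lines of arithmetic. The sketch below is only a reading aid under stated assumptions: the 128 KiB erase-block size is inferred from the "Bad eraseblock 2048 at 0x10000000" line in the boot log posted in this thread, the RAM buffer bases 0x85000000 and 0x86000000 are inferred from the cmp addresses and the tftp load addresses mentioned in this thread, and the errno names are the generic Linux values (74 = EBADMSG, 117 = EUCLEAN) that U-Boot's MTD code reuses. In the Linux MTD convention, -EUCLEAN reports bit flips that ECC was able to correct, while -EBADMSG reports an uncorrectable error; whether this old U-Boot build follows that convention exactly is an assumption.

```python
# Values copied from the U-Boot log in this post.
WRITE_OFFSET = 0x8500000      # "NAND write: device 0 offset 0x8500000"
BYTES_READ = 1835008          # "1835008 bytes read: ERROR"
FAIL_OFFSET = 0x86C0000       # "NAND read from offset 86c0000 failed -117"
RETURN_CODE = -117
CMP_A = (0x851E0014, 0xF5)    # (address, byte) from the cmp line
CMP_B = (0x861E0014, 0x00)

# Assumptions, not stated in this post (see the paragraph above).
ERASE_BLOCK = 0x20000                    # 128 KiB erase blocks
BUF_A, BUF_B = 0x85000000, 0x86000000    # RAM buffer base addresses
ERRNO_NAMES = {74: "EBADMSG", 117: "EUCLEAN"}  # generic Linux errno values

# Where did the read stop, in NAND terms?
fail_block = FAIL_OFFSET // ERASE_BLOCK
blocks_read_ok = BYTES_READ // ERASE_BLOCK
read_stopped_at_failure = (WRITE_OFFSET + BYTES_READ == FAIL_OFFSET)
errno_name = ERRNO_NAMES.get(-RETURN_CODE, "unknown")

# Where is the cmp mismatch, relative to the start of each RAM buffer?
mismatch_offset = CMP_A[0] - BUF_A
same_offset_in_both = (mismatch_offset == CMP_B[0] - BUF_B)
beyond_bytes_read = mismatch_offset >= BYTES_READ
differing_bits = bin(CMP_A[1] ^ CMP_B[1]).count("1")

print(f"failing erase block  : {fail_block}")
print(f"blocks read before it: {blocks_read_ok}")
print(f"return code          : {RETURN_CODE} ({errno_name})")
print(f"cmp mismatch offset  : {mismatch_offset:#x} (bytes read: {BYTES_READ:#x})")
print(f"bits differing       : {differing_bits}")
```

If these assumptions hold, the read aborted at erase block 1078 after 14 good blocks, and the byte reported by cmp sits at buffer offset 0x1e0014, which is past the 0x1c0000 bytes actually read. In that case the cmp mismatch may simply be stale RAM content left behind by the aborted read, not a second corrupted location.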

1. Can it be a matter of bad blocks ?

2. Should we mark some blocks so that the chip shall ignore them ?

3. Is it possible that there are no bad blocks in device:

HH_UBOOT # nand bad

Device 0 bad blocks:
HH_UBOOT #

Thanks,

Ran

Titusrathinaraj Stalin over 9 years ago in reply to ranchu

A1) Maybe.
A2) Yes, we can mark a block as bad using U-Boot commands.
A3) In my opinion, no. Every NAND device should have 4 to 5 bad blocks left over from chip production.

What problem does the ECC issue cause? Are you unable to boot U-Boot or the kernel, or is it a filesystem issue?

Have you contacted your local TI FAE?

Shankari G over 9 years ago in reply to ranchu

Hi Ran,

I understand how serious this is, with mass production about to start.

I have sent an email with your questions to the OMAP35x factory team and asked them to have a look.

We have used the OMAPL138 device with Micron NAND and did not come across issues of this sort. In my view, we cannot generalise this issue to all TI chipsets.

Would you please give us a test case to reproduce this issue in software? We will try it and see what suggestions or solutions we can offer.

ranchu over 9 years ago in reply to Shankari G

Hi Shankari, Titus,

Thank you very much for the suggestions.

1. We are using a board with Micron MS29C4G48MAZAKC1-6I

2. All NAND programming is done from U-Boot

3. We use OMAP3530

4. The log from the start of X-Loader/U-Boot can be seen below

5. We use hw ecc; the U-Boot version is quite old:

#define U_BOOT_VERSION "U-Boot 2008.10-00018-gc420173-dirty"

===

40Wrt reset

HH X-Loader 1.42

Elbit version 1.1
Starting OS Bootloader...
##################################################
############# ELBIT SYSTEMS LTD ##################
############# BSP GROUP ##################
############# VERSION 1.3.29 #################
##################################################
NAND device: Manufacturer ID: 0x2c, Chip ID: 0xbc (Micron NAND 512MiB 1,8V 16-bit)
NAND 512 MiB
NAND TEST ... Scanning device for bad blocks
Bad eraseblock 2048 at 0x10000000
total number of bad blocks - 1
U-Boot 2008.10-00018-gc420173-dirty (Jul 27 2016 - 07:57:12)
OMAP35X-Family-GP rev 2, CPU-OPP2 L3-165MHz
OMAP3 EVM board + LPDDR/NAND
In: serial
Out: serial
Err: serial
OMAP DISPC rev 3.0
Could not find exact pixel clock. Requested 6500 kHz, got 0 kHz
dispc_setup_plane 0, 83d6a057, sw 240, 0,0, 240x320 -> 240x320, (ilace 0)
Hit any key to stop autoboot: 0
HH_UBOOT #

If there is more information required to repeat this testing, please tell me.

We ran an additional test today. We suspect that it is a bad block issue (not ecc), so I tried to erase/read/write '1' and '0' across the whole NAND.

But I get no errors when comparing. I also see that the NAND from the manufacturer has no bad blocks (isn't it supposed to have some bad blocks even from the factory?)

Regards,

Ran

ranchu over 9 years ago in reply to ranchu

More info:

1. We write and read a block of data.
2. We (sometimes) get the issue when comparing what we read with the data we wrote.
3. It usually happens at the same address.

Does it look like an ecc issue or a bad block issue?

ranchu over 9 years ago in reply to ranchu

We have continued the investigation.

1. After loading a 9M file to 0x85000000 and to 0x86000000 with tftp (from U-Boot) and then running "cmp", we find that a word at some offset is wrong.
2. We then started suspecting tftp or memory (is this the same issue as the NAND one, or a different one?).
3. Running mw and md on the same area (9M) with a simple 0x55 pattern is OK (no problems).
4. Running mtest on the specific address showed no issue (we stopped the test after 5 minutes; maybe we need to let it run longer?).
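
A note on step 3: filling the area with a constant 0x55 and reading it back cannot reveal some classes of RAM fault, for example two addresses aliasing to the same cell, because every cell holds the same value anyway. The sketch below is a pure simulation in Python (nothing here runs on the target, and `FaultyRam` is a made-up model of one stuck address line). It compares a constant fill with a pattern derived from the address.

```python
class FaultyRam:
    """Toy byte-wide RAM model. If stuck_addr_bit is given, that address
    line is stuck low, so pairs of addresses alias to the same cell."""

    def __init__(self, size, stuck_addr_bit=None):
        self.cells = [0] * size
        # With no fault, the mask is all ones and addresses pass through.
        self.mask = -1 if stuck_addr_bit is None else ~(1 << stuck_addr_bit)

    def write(self, addr, value):
        self.cells[addr & self.mask] = value & 0xFF

    def read(self, addr):
        return self.cells[addr & self.mask]


def constant_fill_test(ram, size, pattern=0x55):
    """Write the same byte everywhere, then verify. Returns failing addresses."""
    for addr in range(size):
        ram.write(addr, pattern)
    return [addr for addr in range(size) if ram.read(addr) != pattern]


def address_pattern_test(ram, size):
    """Write a byte derived from each address, then verify.
    Aliased addresses end up holding a neighbour's value and fail."""
    def expected(addr):
        return (addr ^ (addr >> 8)) & 0xFF

    for addr in range(size):
        ram.write(addr, expected(addr))
    return [addr for addr in range(size) if ram.read(addr) != expected(addr)]


if __name__ == "__main__":
    size = 1024
    bad = FaultyRam(size, stuck_addr_bit=4)
    print("constant 0x55 failures  :", len(constant_fill_test(bad, size)))
    print("address pattern failures:", len(address_pattern_test(bad, size)))
```

In this model the constant fill passes on the faulty RAM while the address-derived pattern fails on half of the addresses. It says nothing about what is actually wrong on the board; it only shows why a clean 0x55 run does not rule out the memory.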

Any idea will help,

Thanks,
Ran

Titusrathinaraj Stalin over 9 years ago in reply to ranchu

This is not specific for omap, but for any TI chip using nand.

We have not heard of such issues on other TI boards.
Please provide the commands you used in the U-Boot shell to reproduce it, so that I can try them on our OMAPL138 board.
Do you have any of our TI EVM boards, and have you tried the same on them?
Have you tried loading the kernel and checking whether any bad blocks are listed in the kernel boot log while the NAND driver loads?

Try flashing the filesystem and kernel from Linux itself, booting from NFS or an SD card.
It may give a clue about the issue; the problem might be in U-Boot's ECC/bad block handling.

ranchu over 9 years ago in reply to Titusrathinaraj Stalin

Hi Titus,

I am trying to rebuild the system with the best ecc scheme.

I now use sw ecc instead of hw ecc.

But if I write the jffs partition with sw ecc, I get errors when mounting the filesystem:

[ 10.188018] mtd->read(0x590 bytes from 0x14270) returned ECC error
[ 10.194915] mtd->read(0x32c bytes from 0x13cd4) returned ECC error
[ 10.201782] mtd->read(0x9c bytes from 0x13764) returned ECC error
[ 10.208587] mtd->read(0x668 bytes from 0x13198) returned ECC error
[ 10.215454] mtd->read(0x434 bytes from 0x12bcc) returned ECC error
[ 10.222351] mtd->read(0x1b0 bytes from 0x12650) returned ECC error
[ 10.229217] mtd->read(0x620 bytes from 0x619e0) returned ECC error
[ 10.236114] mtd->read(0x758 bytes from 0x120a8) returned ECC error
[ 10.242919] mtd->read(0x51c bytes from 0x11ae4) returned ECC error
[ 10.249816] mtd->read(0x314 bytes from 0x114ec) returned ECC error
[ 10.256683] mtd->read(0x110 bytes from 0x10ef0) returned ECC error

[ 15.174163] Kernel panic - not syncing: Attempted to kill init!
[ 15.180419] [<c0046398>] (unwind_backtrace+0x0/0xf8) from [<c02c867c>] (panic+0x74/0x1a4)
[ 15.189117] [<c02c867c>] (panic+0x74/0x1a4) from [<c0070c34>] (do_exit+0x5d4/0x6d0)
[ 15.197235] [<c0070c34>] (do_exit+0x5d4/0x6d0) from [<c0070d6c>] (do_group_exit+0x3c/0xbc)
[ 15.205993] [<c0070d6c>] (do_group_exit+0x3c/0xbc) from [<c007ce68>] (get_signal_to_deliver+0x31c/0x444)
[ 15.216033] [<c007ce68>] (get_signal_to_deliver+0x31c/0x444) from [<c00432ac>] (do_notify_resume+0xb4/0x680)
[ 15.226409] [<c00432ac>] (do_notify_resume+0xb4/0x680) from [<c0040824>] (work_pending+0x24/0x28)

When I write the filesystem with hw ecc, it is OK.

So, in the current scheme, everything is read and written with sw ecc, except for the filesystem partition, which is written with hw ecc instead.

But I am not sure about this now.

As far as I know, jffs2 has no limitation with sw ecc.

Do you know why it behaves like this?

Regards,

Ran

ranchu over 9 years ago in reply to Titusrathinaraj Stalin

Hi,

The problem is that I don't have a scenario that reproduces this issue.

Is it temperature dependent? DDR? NAND? Tftp? It's hard to say.

It comes and goes.

Even when I catch it on one board, successive tests (without resetting the board) may reproduce the same result.

But after a reset, or some other change, it no longer reproduces.

It's easier to catch a pokemon...

Any idea ?

ranchu over 9 years ago in reply to ranchu

Hi,

Although the whole ecc scheme is now sw ecc, I still must flash the jffs filesystem as hw; otherwise the filesystem will not mount.

I see in code:

linux/drivers/mtd/nand/omap2.c

pdata->ecc_opt = OMAP_ECC_HAMMING_CODE_HW;

On trying to change it to:

pdata->ecc_opt = OMAP_ECC_HAMMING_CODE_DEFAULT;

The filesystem is not mounted.

Is it because jffs2 requires Hamming?

Regards,

Ran
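
One explanation that fits these symptoms, though it is not confirmed anywhere in this thread, is that the tool writing the partition and the kernel reading it disagree on the ECC algorithm or on where the ECC bytes live in the spare (OOB) area. In that case every page fails its ECC check even though the data itself is intact. The toy model below shows only that effect; the two check functions, the scheme names, and the OOB offsets are invented for illustration and are not the OMAP software or hardware Hamming layouts.

```python
from functools import reduce

OOB_SIZE = 16  # hypothetical spare-area size for this toy model


def check_a(data):
    """Stand-in for one ECC scheme: 3 check bytes built from an XOR fold."""
    x = reduce(lambda acc, b: acc ^ b, data, 0)
    return bytes([x, x ^ 0xFF, len(data) & 0xFF])


def check_b(data):
    """Stand-in for a different scheme: 3 check bytes from a byte sum."""
    return (sum(data) & 0xFFFFFF).to_bytes(3, "big")


# Each scheme is (check function, offset of its 3 bytes inside the OOB).
# Both the functions and the offsets are hypothetical.
SCHEMES = {"sw": (check_a, 0), "hw": (check_b, 8)}


def program_page(data, scheme):
    """Return (data, oob) as the writer using `scheme` would store them."""
    func, offset = SCHEMES[scheme]
    oob = bytearray(b"\xff" * OOB_SIZE)   # erased flash reads back as 0xFF
    oob[offset:offset + 3] = func(data)
    return bytes(data), bytes(oob)


def page_passes_ecc(page, scheme):
    """Check a stored page the way a reader using `scheme` would."""
    data, oob = page
    func, offset = SCHEMES[scheme]
    return oob[offset:offset + 3] == func(data)


if __name__ == "__main__":
    page = program_page(bytes(range(256)), "sw")
    print("written sw, read sw:", page_passes_ecc(page, "sw"))
    print("written sw, read hw:", page_passes_ecc(page, "hw"))
```

The page data is identical in both reads; only the reader's idea of the check bytes differs. If something like this is happening on the board, the choice is not about what jffs2 requires, but about making U-Boot and the kernel use the same scheme for that partition.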

ranchu over 9 years ago in reply to ranchu

Still seeing failures, always at the same address, after about 1 hour.
We've tried both hw and sw ecc.
How is it that it is always at the same address?
