AM3354: eMMC write timeout on 6.1.119 kernel

Part Number: AM3354

Hi,

To fix CVEs, I upgraded the SDK to the latest version, 09.03.05.02, which ships kernel 6.1.119. With this kernel I am seeing an eMMC write timeout during fio testing. The issue was never observed with the old v3.12 kernel.

Here is the detailed information:

I ran a fio stress test with the following command:

fio --name=concurrent --directory=$TEST_DIR --rw=randrw --rwmixread=70 --bs=4k --size=200M --numjobs=4 --runtime=$DURATION --group_reporting --output=/tmp/fio_loop_$loop.json --output-format=json
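
For anyone trying to reproduce this, here is a minimal Python sketch of a loop wrapper around the command above. It is not the exact script used for these tests: the function names, the use of dmesg to check for the failure, and the stop-on-first-timeout policy are all assumptions made for illustration.

```python
import re
import subprocess

# Kernel message that marks the failure described in this post.
TIMEOUT_RE = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(mmc\d+): Timeout waiting for hardware interrupt"
)

def build_fio_cmd(test_dir, duration_s, loop):
    """Return the fio argv for one iteration (same options as the command above)."""
    return [
        "fio", "--name=concurrent", f"--directory={test_dir}",
        "--rw=randrw", "--rwmixread=70", "--bs=4k", "--size=200M",
        "--numjobs=4", f"--runtime={duration_s}", "--group_reporting",
        f"--output=/tmp/fio_loop_{loop}.json", "--output-format=json",
    ]

def find_sdhci_timeouts(kernel_log):
    """Return (timestamp_seconds, host) for every SDHCI timeout line in the log text."""
    return [(float(m.group(1)), m.group(2)) for m in TIMEOUT_RE.finditer(kernel_log)]

def run_loops(test_dir, duration_s, max_loops):
    """Hypothetical driver: run fio repeatedly, stop at the first timeout in dmesg."""
    for loop in range(max_loops):
        subprocess.run(build_fio_cmd(test_dir, duration_s, loop), check=False)
        log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
        hits = find_sdhci_timeouts(log)
        if hits:
            return loop, hits
    return None, []
```

Stopping at the first timeout keeps the file system from being written to further after the failure, which may help preserve state for debugging.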

After around 1 hour, the kernel reported an SDHCI timeout error:

[  406.240892] mmc1: Timeout waiting for hardware interrupt.
[  406.246364] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
[  406.252832] mmc1: sdhci: Sys addr:  0x00000000 | Version:  0x00003101
[  406.259302] mmc1: sdhci: Blk size:  0x00000200 | Blk cnt:  0x00000008
[  406.265770] mmc1: sdhci: Argument:  0x00e9f8c9 | Trn mode: 0x00000023
[  406.272238] mmc1: sdhci: Present:   0x01f70506 | Host ctl: 0x00000000
[  406.278706] mmc1: sdhci: Power:     0x0000000f | Blk gap:  0x00000000
[  406.285174] mmc1: sdhci: Wake-up:   0x00000000 | Clock:    0x00000107
[  406.291641] mmc1: sdhci: Timeout:   0x0000000b | Int stat: 0x00000000
[  406.298109] mmc1: sdhci: Int enab:  0x027f000b | Sig enab: 0x027f000b
[  406.304577] mmc1: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
[  406.311044] mmc1: sdhci: Caps:      0x01e10080 | Caps_1:   0x00000000
[  406.317511] mmc1: sdhci: Cmd:       0x0000193a | Max curr: 0x00000000
[  406.323979] mmc1: sdhci: Resp[0]:   0x00000900 | Resp[1]:  0x00000000
[  406.330446] mmc1: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
[  406.336913] mmc1: sdhci: Host ctl2: 0x00000000
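
To make the dump easier to read, here is a small Python sketch that decodes the Cmd, Trn mode, Present and Blk size/Blk cnt fields. It assumes the standard SD Host Controller register bit layout; the function names are made up for illustration, and only a subset of the Present State bits is decoded.

```python
def decode_command(cmd):
    """Decode the 16-bit SDHCI Command register."""
    resp_types = {0: "none", 1: "136-bit", 2: "48-bit", 3: "48-bit with busy"}
    return {
        "index": (cmd >> 8) & 0x3F,           # bits 13:8, command index
        "response": resp_types[cmd & 0x3],    # bits 1:0, response type
        "crc_check": bool(cmd & (1 << 3)),
        "index_check": bool(cmd & (1 << 4)),
        "data_present": bool(cmd & (1 << 5)),
    }

def decode_transfer_mode(mode):
    """Decode the SDHCI Transfer Mode register."""
    auto_cmd = {0: "none", 1: "auto CMD12", 2: "auto CMD23", 3: "other"}
    return {
        "dma": bool(mode & 0x01),
        "block_count_enable": bool(mode & 0x02),
        "auto_cmd": auto_cmd[(mode >> 2) & 0x3],
        "direction": "read" if mode & 0x10 else "write",
        "multi_block": bool(mode & 0x20),
    }

def decode_present_state(state):
    """Return the names of the transfer-related Present State bits that are set."""
    flags = {
        0: "cmd_inhibit", 1: "data_inhibit", 2: "dat_line_active",
        8: "write_active", 9: "read_active",
        10: "buffer_write_enable", 11: "buffer_read_enable",
    }
    return [name for bit, name in flags.items() if state & (1 << bit)]

def transfer_bytes(blk_size, blk_cnt):
    """Total transfer length; the block size is in bits 11:0 of the Blk size register."""
    return (blk_size & 0xFFF) * blk_cnt

print(decode_command(0x193A))
print(decode_transfer_mode(0x23))
print(decode_present_state(0x01F70506))
print(transfer_bytes(0x200, 0x8))
```

Under that assumption, the dump shows command index 25 (CMD25) with data present, issued as a multi-block DMA write of 8 x 512 = 4096 bytes, which matches the 4k block size used by fio. Present State still reports the data line as inhibited and a write transfer as active, while Int stat is 0.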

This CMD25 timeout is not recoverable. After a while the file system reported serious errors and became corrupted. After I power-cycled the board, the file system was completely corrupted, the kernel could no longer mount it, and the board was bricked.

 

I have done some investigation, with the following results:

1. The problem reproduces on all boards. I have tested more than 10.

2. With the old software image (v3.12 kernel), the problem cannot be reproduced.

3. The problem was first observed with a 25 MHz clock. I lowered the SDHCI clock frequency to 13 MHz and it still reproduced. P.S. The v3.12 kernel image used a 50 MHz clock.

4. Logic analyzer capture: during the CMD25 transfer, the eMMC replied with CRC status OK, but the host did not continue sending data.

5. After comparing with the 3.12 driver code, I added SDHCI_QUIRK2_HOST_NO_CMD23 to disable CMD23. The CMD25 timeout became much rarer: it did not occur in more than 60 hours of testing. However, a CMD6 cache flush timeout occurred instead.

6. With PIO mode (DMA disabled), the CMD25 timeout did not occur, but the CMD6 cache flush timeout did.

7. With the MMC_CAP_AGGRESSIVE_PM capability removed, the problem still reproduces.

8. With CMD23 disabled and the cache flush timeout increased to 10 minutes, the CMD25 timeout reproduced after more than 30 hours of testing.
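
For spotting the cache flush in a logic analyzer capture, here is a short Python sketch of how the CMD6 (SWITCH) argument for a cache flush is composed. The constant values are my understanding of the definitions in the kernel's include/linux/mmc/mmc.h and of the eMMC SWITCH argument layout, so treat them as assumptions and check them against your kernel tree; the helper names are hypothetical.

```python
# Assumed to mirror include/linux/mmc/mmc.h; verify against your kernel source.
MMC_SWITCH_MODE_WRITE_BYTE = 0x03   # access mode: write byte
EXT_CSD_FLUSH_CACHE = 32            # EXT_CSD byte index for FLUSH_CACHE
EXT_CSD_CMD_SET_NORMAL = 1          # command set field

def switch_arg(index, value,
               access=MMC_SWITCH_MODE_WRITE_BYTE,
               cmd_set=EXT_CSD_CMD_SET_NORMAL):
    """Build a CMD6 argument: access[25:24] | index[23:16] | value[15:8] | set[2:0]."""
    return (access << 24) | (index << 16) | (value << 8) | cmd_set

def decode_switch_arg(arg):
    """Split a captured CMD6 argument back into its fields."""
    return {
        "access": (arg >> 24) & 0x3,
        "index": (arg >> 16) & 0xFF,
        "value": (arg >> 8) & 0xFF,
        "cmd_set": arg & 0x7,
    }

flush_arg = switch_arg(EXT_CSD_FLUSH_CACHE, 1)
print(hex(flush_arg))
print(decode_switch_arg(flush_arg))
```

Under these assumptions, a CMD6 with EXT_CSD index 32 and value 1 in the capture would be the cache flush whose timeout is described in items 5 and 6.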

In the 6.1.119 kernel, the host driver has changed from omap_hsmmc to sdhci-omap. This may be the cause of the problem.

Thanks.