Part Number: AM3354
Hi,
To fix CVEs, I upgraded to the latest SDK, version 09.03.05.02, which ships kernel 6.1.119. Since the upgrade I have been hitting an eMMC write timeout during fio testing. This issue was never observed with the old v3.12 kernel.
Here is the detailed information:
I ran a fio stress test with the following command:
fio --name=concurrent --directory=$TEST_DIR --rw=randrw --rwmixread=70 --bs=4k --size=200M --numjobs=4 --runtime=$DURATION --group_reporting --output=/tmp/fio_loop_$loop.json --output-format=json
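The `$loop` variable suggests the command runs inside a loop, with `TEST_DIR` and `DURATION` set beforehand. As a minimal sketch of how one iteration's command line could be assembled (the function name `build_fio_command` and the example values are hypothetical; only the fio options come from the command above):

```python
import shlex


def build_fio_command(test_dir, duration, loop):
    """Return the fio argument list for one iteration of the stress loop."""
    return [
        "fio",
        "--name=concurrent",
        f"--directory={test_dir}",
        "--rw=randrw",
        "--rwmixread=70",
        "--bs=4k",
        "--size=200M",
        "--numjobs=4",
        f"--runtime={duration}",
        "--group_reporting",
        f"--output=/tmp/fio_loop_{loop}.json",
        "--output-format=json",
    ]


if __name__ == "__main__":
    # Hypothetical values: the post does not state TEST_DIR or DURATION.
    for loop in range(1, 4):
        print(shlex.join(build_fio_command("/mnt/emmc/fio", 600, loop)))
```

This only prints the command lines. Passing the list to `subprocess.run` would execute them, one JSON result file per iteration.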
About 1 hour later, the kernel reported an SDHCI timeout error:
[ 406.240892] mmc1: Timeout waiting for hardware interrupt.
[ 406.246364] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========
[ 406.252832] mmc1: sdhci: Sys addr: 0x00000000 | Version: 0x00003101
[ 406.259302] mmc1: sdhci: Blk size: 0x00000200 | Blk cnt: 0x00000008
[ 406.265770] mmc1: sdhci: Argument: 0x00e9f8c9 | Trn mode: 0x00000023
[ 406.272238] mmc1: sdhci: Present: 0x01f70506 | Host ctl: 0x00000000
[ 406.278706] mmc1: sdhci: Power: 0x0000000f | Blk gap: 0x00000000
[ 406.285174] mmc1: sdhci: Wake-up: 0x00000000 | Clock: 0x00000107
[ 406.291641] mmc1: sdhci: Timeout: 0x0000000b | Int stat: 0x00000000
[ 406.298109] mmc1: sdhci: Int enab: 0x027f000b | Sig enab: 0x027f000b
[ 406.304577] mmc1: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
[ 406.311044] mmc1: sdhci: Caps: 0x01e10080 | Caps_1: 0x00000000
[ 406.317511] mmc1: sdhci: Cmd: 0x0000193a | Max curr: 0x00000000
[ 406.323979] mmc1: sdhci: Resp[0]: 0x00000900 | Resp[1]: 0x00000000
[ 406.330446] mmc1: sdhci: Resp[2]: 0x00000000 | Resp[3]: 0x00000000
[ 406.336913] mmc1: sdhci: Host ctl2: 0x00000000
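For readers who want to check the dump by hand, here is a minimal Python sketch that decodes the Cmd, Trn mode, and Present fields. It assumes the dump follows the standard SDHCI register layout (the bit masks mirror those in the Linux `sdhci.h` header). The function names are illustrative and not part of any driver.

```python
# Bit positions follow the standard SDHCI register layout.
# Assumption: the sdhci-omap register dump uses this same layout.

RESP_TYPES = {0: "none", 1: "136-bit", 2: "48-bit", 3: "48-bit with busy"}


def decode_cmd(cmd):
    """Decode the 16-bit Command register."""
    return {
        "index": (cmd >> 8) & 0x3F,        # command index, bits 13:8
        "response": RESP_TYPES[cmd & 0x03],
        "crc_check": bool(cmd & 0x08),
        "index_check": bool(cmd & 0x10),
        "data_present": bool(cmd & 0x20),
    }


def decode_transfer_mode(mode):
    """Decode the 16-bit Transfer Mode register."""
    return {
        "dma": bool(mode & 0x01),
        "block_count_enable": bool(mode & 0x02),
        "auto_cmd12": bool(mode & 0x04),
        "auto_cmd23": bool(mode & 0x08),
        "direction": "read" if mode & 0x10 else "write",
        "multi_block": bool(mode & 0x20),
    }


def decode_present_state(state):
    """Decode selected bits of the Present State register."""
    return {
        "cmd_inhibit": bool(state & 0x1),
        "dat_inhibit": bool(state & 0x2),
        "dat_line_active": bool(state & 0x4),
        "write_active": bool(state & 0x100),
        "read_active": bool(state & 0x200),
        "buffer_write_enable": bool(state & 0x400),
        "dat_levels": (state >> 20) & 0xF,  # DAT[3:0] line levels
    }


if __name__ == "__main__":
    # Values copied from the register dump above.
    print(decode_cmd(0x193A))
    print(decode_transfer_mode(0x23))
    print(decode_present_state(0x01F70506))
    print("transfer bytes:", (0x200 & 0xFFF) * 0x8)
```

With the values from the dump, this gives command index 25 with a data phase, a multi-block DMA write with block count enabled, and 8 blocks of 512 bytes (4096 bytes, matching fio's 4k block size). The Present field shows a write transfer still active with all four DAT lines high.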
This CMD25 timeout could not be recovered from. After a while the file system reported serious errors and became corrupted. After I power-cycled the board, the file system was completely corrupted and the kernel could no longer mount it, so the board was bricked.
I have done some investigation, with the following results:
1. The problem reproduces on all boards. I have tested more than 10 boards.
2. With the old software image (v3.12 kernel), the problem cannot be reproduced.
3. The problem was first observed with a 25 MHz clock. I lowered the SDHCI clock frequency to 13 MHz and it still reproduced. P.S. The v3.12 kernel image used a 50 MHz clock.
4. Logic analyzer capture: during the CMD25 transfer, the eMMC replied with CRC status OK, but the host did not continue sending data.
5. After comparing with the 3.12 driver code, I added SDHCI_QUIRK2_HOST_NO_CMD23 to disable CMD23. The CMD25 timeout then became much rarer: it did not occur in more than 60 hours of testing. However, a CMD6 cache flush timeout occurred instead.
6. In PIO mode (DMA disabled), the CMD25 timeout did not occur, but the CMD6 cache flush timeout did.
7. With the MMC_CAP_AGGRESSIVE_PM capability removed, the problem still reproduces.
8. With CMD23 disabled and the cache flush timeout increased to 10 minutes, the CMD25 timeout reproduced after 30+ hours of testing.
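Because several of these runs last tens of hours, it can help to timestamp each failure from a saved kernel log. The following is a minimal sketch that matches only the timeout message quoted in the dump above. The CMD6 cache flush message is not quoted in this post, so it is not matched. The name `find_timeouts` is hypothetical.

```python
import re

# Matches lines such as:
# [ 406.240892] mmc1: Timeout waiting for hardware interrupt.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<host>mmc\d+): "
    r"Timeout waiting for hardware interrupt\."
)


def find_timeouts(log_text):
    """Return (timestamp_seconds, host) for each SDHCI timeout line."""
    return [(float(m["ts"]), m["host"]) for m in TIMEOUT_RE.finditer(log_text)]


if __name__ == "__main__":
    sample = (
        "[ 406.240892] mmc1: Timeout waiting for hardware interrupt.\n"
        "[ 406.246364] mmc1: sdhci: ============ SDHCI REGISTER DUMP ===========\n"
    )
    for ts, host in find_timeouts(sample):
        print(f"{host} timed out at {ts:.6f} s after boot")
```

Feeding it the saved output of `dmesg` after each run gives a list of failure times that can be compared across the configurations above.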
In the 6.1.119 kernel, the host driver has changed from omap_hsmmc to sdhci-omap. Maybe this is the cause of the problem.
Thanks.