AM623: copy files to eMMC trigger CQE error

Tony Tang

Part Number: AM623
Other Parts Discussed in Thread: SK-AM62B-P1

The SDK11.0.9 release notes list the issue below as reported from SDK10. Is it a new issue that started in SDK10, or has it existed for longer?

EXT_EP-12076

copying files to eMMC triggers cqe error

Is it a software driver issue or is it hardware related? Are there any further analysis results?

5 months ago

0 Prashant Shivhare 5 months ago

TI__Guru 73081 points

Hi, the SIR or JIRA doesn't have any information available so I would have to reach out to the team for the information.

https://sir.ext.ti.com/jira/browse/EXT_EP-12076

0 Bin Liu 5 months ago in reply to Prashant Shivhare

TI__Guru**** 171061 points

Hi Tony,

The original issue EXT_EP-12086 has been closed in SDK10. The kernel log still shows the "CQE recovery" message in certain scenarios, but the MMC register dump no longer happens. The new JIRA EXT_EP-12076 tracks what causes the "CQE recovery" message.

0 Tony Tang 5 months ago in reply to Bin Liu

TI__Mastermind 29362 points

One customer's issue is almost the same as EXT_EP-12086: as in TI's test results, the CQE error can be triggered by copying a specific file, from SDK8.x through the latest SDK11.0.

0 Bin Liu 5 months ago in reply to Tony Tang

TI__Guru**** 171061 points

Hi, I am out of office for the next two weeks. Please expect delayed response.

0 Karan Saxena 5 months ago in reply to Bin Liu

TI__Guru* 77924 points

Hi Tony

I see you actively discussing this with the developers on JIRA as well.

To answer your outstanding question about what changed between 9.01.00.08 and 9.02.01.09 to resolve the kernel crash issue: this information is available on the internal JIRA in the last comment from the development team. There are patch sets posted on the internal Linux patch review which you can check.

https://jira.itg.ti.com/browse/LCPD-34059 / EXT_EP-12086

The CQE recovery message, as Bin mentioned, is still under debug and tracked under https://jira.itg.ti.com/browse/LCPD-40996 / EXT_EP-12076

Regards

Karan

0 Praneeth Bajjuri 5 months ago in reply to Karan Saxena

TI__Expert 6436 points

Tony,

1. On "One of customer's issue is almost same as EXT_EP-12086":

Please share the full log of the CQE error/recovery (including any kernel crash, which would imply that recovery is not happening).

We would like to compare it with the currently open investigation task (10.1/11.0 baseline with a particular test pattern injected from an older codebase).

2. The https://sir.ext.ti.com/jira/browse/EXT_EP-12086 record has been updated to show all important fixes that went into the 10.1 baseline (tuning algorithm fixes in U-Boot and kernel, updated TAP settings).

0 Tony Tang 5 months ago in reply to Praneeth Bajjuri

TI__Mastermind 29362 points

Hi Praneeth,

An eMMC kernel crash issue was fixed by referring to the kernel update; it was an out-of-bounds access problem and is not relevant to this kind of issue.

SDK8.3 log of “Buffer I/O error” when the file copy fails:

I/O error on device mmcblk0gp0p3, logical block 29696
[   68.026112] Buffer I/O error on device mmcblk0gp0p3, logical block 29697
[   68.026122] Buffer I/O error on device mmcblk0gp0p3, logical block 29698
[   68.026130] Buffer I/O error on device mmcblk0gp0p3, logical block 29699
[   68.026138] Buffer I/O error on device mmcblk0gp0p3, logical block 29700
[   68.026151] Buffer I/O error on device mmcblk0gp0p3, logical block 29701
[   68.026161] Buffer I/O error on device mm[   68.026003] blk_update_request: I/O error, dev mmcblk0gp0, sector 933904 op 0x1:(WRITE) flags 0x4000 phys_seg 8 prio class 0
[   68.026032] EXT4-fs warning (device mmcblk0gp0p3): ext4_end_bio:345: I/O error 10 writing to inode 12 starting block 117250)
[   68.026070] Buffer cblk0gp0p3, logical block 29702
[   68.026169] Buffer I/O error on device mmcblk0gp0p3, logical block 29703
[   68.026176] Buffer I/O error on device mmcblk0gp0p3, logical block 29704
[   68.026191] Buffer I/O error on device mmcblk0gp0p3, logical block 29705
[   68.037820] JBD2: Detected IO errors while flushing file data on mmcblk0gp0p3-8

With SDK9.2 there are two error types: "running CQE recovery" and/or "Buffer I/O error".

Below are logs from 5 rounds of file copying, covering each error type.

Rounds 1 and 2 trigger "running CQE recovery" only.
Rounds 3 and 4 trigger both error types.
Round 5: no error. The “exe (1099): drop_caches: 3” line is logged by echo 3 > /proc/sys/vm/drop_caches at the end of each copy.

==USB COPY ROUND 1==
[   29.585040] mmc0: running CQE recovery
[   29.594630] mmc0: running CQE recovery
[   29.990018] exe (1099): drop_caches: 3

==USB COPY ROUND 2==
[   34.658565] mmc0: running CQE recovery
[   35.382460] exe (1099): drop_caches: 3

==USB COPY ROUND 3==
[   40.010500] mmc0: running CQE recovery
[   40.052957] mmc0: running CQE recovery
[   40.093203] mmc0: running CQE recovery
[   40.103971] I/O error, dev mmcblk0gp0, sector 1324048 op 0x1:(WRITE) flags 0x4000 phys_seg 9 prio class 2
[   40.326743] EXT4-fs warning (device mmcblk0gp0p3): ext4_end_bio:343: I/O error 10 writing to inode 28 starting block 166274)
[   40.338162] Buffer I/O error on device mmcblk0gp0p3, logical block 75776
[   40.345023] Buffer I/O error on device mmcblk0gp0p3, logical block 75777
[   40.351834] Buffer I/O error on device mmcblk0gp0p3, logical block 75778
[   40.358636] Buffer I/O error on device mmcblk0gp0p3, logical block 75779
[   40.365438] Buffer I/O error on device mmcblk0gp0p3, logical block 75780
[   40.372240] Buffer I/O error on device mmcblk0gp0p3, logical block 75781
[   40.379040] Buffer I/O error on device mmcblk0gp0p3, logical block 75782
[   40.385813] Buffer I/O error on device mmcblk0gp0p3, logical block 75783
[   40.392620] Buffer I/O error on device mmcblk0gp0p3, logical block 75784
[   40.399458] Buffer I/O error on device mmcblk0gp0p3, logical block 75785
[   40.642028] JBD2: Detected IO errors while flushing file data on mmcblk0gp0p3-8
[   40.870965] exe (1099): drop_caches: 3

==USB COPY ROUND 4==
[   45.377435] mmc0: running CQE recovery
[   45.439714] mmc0: running CQE recovery
[   45.652207] mmc0: running CQE recovery
[   45.662183] I/O error, dev mmcblk0gp0, sector 1324048 op 0x1:(WRITE) flags 0x4000 phys_seg 8 prio class 2
[   45.923407] EXT4-fs warning (device mmcblk0gp0p3): ext4_end_bio:343: I/O error 10 writing to inode 28 starting block 166274)
[   45.934807] buffer_io_error: 6134 callbacks suppressed
[   45.934831] Buffer I/O error on device mmcblk0gp0p3, logical block 75776
[   45.946869] Buffer I/O error on device mmcblk0gp0p3, logical block 75777
[   45.953767] Buffer I/O error on device mmcblk0gp0p3, logical block 75778
[   45.960622] Buffer I/O error on device mmcblk0gp0p3, logical block 75779
[   45.967447] Buffer I/O error on device mmcblk0gp0p3, logical block 75780
[   45.974289] Buffer I/O error on device mmcblk0gp0p3, logical block 75781
[   45.981108] Buffer I/O error on device mmcblk0gp0p3, logical block 75782
[   45.987925] Buffer I/O error on device mmcblk0gp0p3, logical block 75783
[   45.994758] Buffer I/O error on device mmcblk0gp0p3, logical block 75784
[   46.001566] Buffer I/O error on device mmcblk0gp0p3, logical block 75785
[   46.198519] JBD2: Detected IO errors while flushing file data on mmcblk0gp0p3-8
[   46.444361] exe (1099): drop_caches: 3

==USB COPY ROUND 5==
[   57.098847] exe (1099): drop_caches: 3
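
To compare rounds at a glance, the logs above could be tallied per round. The following is only a minimal sketch: it assumes the log is saved as text with the same "==... ROUND n==" separators used above, and the function name and sample text are illustrative, not part of any TI tool.

```python
import re

ROUND_RE = re.compile(r"^==(.+?)==$")
CQE_RE = re.compile(r"mmc\d+: running CQE recovery")
BUF_RE = re.compile(r"Buffer I/O error on device")


def summarize(log_text):
    """Count CQE recovery and Buffer I/O error lines for each round marker."""
    summary = {}
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        marker = ROUND_RE.match(line)
        if marker:
            current = marker.group(1)
            summary[current] = {"cqe_recovery": 0, "buffer_io_error": 0}
            continue
        if current is None:
            continue  # ignore anything before the first round marker
        if CQE_RE.search(line):
            summary[current]["cqe_recovery"] += 1
        elif BUF_RE.search(line):
            summary[current]["buffer_io_error"] += 1
    return summary


# Short excerpt copied from rounds 1 and 5 above, used only as sample input.
SAMPLE = """\
==USB COPY ROUND 1==
[   29.585040] mmc0: running CQE recovery
[   29.594630] mmc0: running CQE recovery
[   29.990018] exe (1099): drop_caches: 3
==USB COPY ROUND 5==
[   57.098847] exe (1099): drop_caches: 3
"""

if __name__ == "__main__":
    for name, counts in summarize(SAMPLE).items():
        print(name, counts)
```

Note that the kernel rate-limits "Buffer I/O error" messages (see the "callbacks suppressed" line in round 4), so the counts are a lower bound rather than the true number of failed blocks.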

0 Judith Mendez 5 months ago in reply to Tony Tang

TI__Expert 8510 points

Hi Tony,

Can you please provide more context: what file are you flashing to MMC? Does the issue happen every time you write to eMMC? Is it the same 8.6 tar file that triggers CQE recovery on the am62x SK, or is it a different data pattern?

Judith

0 Judith Mendez 5 months ago in reply to Judith Mendez

TI__Expert 8510 points

Also,

After CQE error recovery is triggered, can you show the output of /sys/kernel/debug/mmc0/*? For example:

> cat /sys/kernel/debug/mmc0/*
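
If the output needs to be collected per file for attaching to a thread, the same dump could be scripted. This is a hedged sketch only: it assumes debugfs is mounted at /sys/kernel/debug and the script runs as root, and the function name is hypothetical.

```python
from pathlib import Path


def dump_debugfs(root="/sys/kernel/debug/mmc0"):
    """Return {file name: contents} for regular files directly under root."""
    out = {}
    for entry in sorted(Path(root).iterdir()):
        if not entry.is_file():
            continue  # skip sub-directories
        try:
            out[entry.name] = entry.read_text(errors="replace").strip()
        except OSError as exc:
            # Some debugfs entries may refuse reads; record that instead of aborting.
            out[entry.name] = f"<unreadable: {exc}>"
    return out


if __name__ == "__main__":
    for name, text in dump_debugfs().items():
        print(f"--- {name} ---")
        print(text)
```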

Thanks.

~ Judith

0 Tony Tang 5 months ago in reply to Judith Mendez

TI__Mastermind 29362 points

Judith Mendez said:
what file are you flashing to MMC?

The customer copies their own filesystem tar file from a USB flash drive to eMMC.

There are two different filesystem tar files, one each for the SDK8.3-based and SDK9.2-based projects/boards.

Judith Mendez said:
Does the issue happen every time you write to eMMC?

The log is provided above. It doesn't happen on every power cycle or on every board, but on some boards it happens very easily, even 100% of the time.

Judith Mendez said:
Is it the same 8.6 tar file that triggers CQE recovery on am62x SK or is it a different data pattern?

Customer uses their own file system tar file, not SDK8.6 tar file.

Customer did not replicate the issue on AM62-SK, just on their own board with their own filesystem tar file.

Judith Mendez said:
After CQE error recovery is triggered, can you show the output of /sys/kernel/debug/mmc0/*

https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/1663.SDK8.3_5F00_cat-mmc0-log

https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/SDK9.2_5F00_cat-mmc0-log

0 Judith Mendez 5 months ago in reply to Tony Tang

TI__Expert 8510 points

Hi Tony,

Seems like you also have data timeouts. We are also seeing data timeouts on am62x SK with 8.6 tar file. This is currently still under investigation.

The current waveforms don't show anything out of the ordinary, still not sure why the controller is reporting so many data timeouts and triggering CQE recovery.

~ Judith

0 Tony Tang 4 months ago in reply to Judith Mendez

TI__Mastermind 29362 points

Hi Judith,

To distinguish whether a READ or a WRITE operation triggered the error, the customer ran the experiment below.

Linux SDK8.3, custom board, with a 22 ohm series resistor already mounted on the data lines. Refer to this post for background on the series resistor:

https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1507836/am623-why-can-serial-resistor-eliminate-emmc-io-error

Experiment method: Run md5sum on a ~12 MB file repeatedly and compare the result each time, so that only read operations are performed.

Experiment condition: In a temperature chamber at 65 °C. The I/O error log was captured; the error persists once triggered.

dmesg_IO_error_of_eMMC_md5sum_operation.log
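
The read-only experiment described above could be automated roughly as follows. This is a sketch under assumptions, not the customer's actual script: the function names are hypothetical, and dropping the page cache between rounds is an addition made here so that each round actually re-reads from eMMC (it needs root).

```python
import hashlib
import sys


def md5_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large files do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def drop_caches():
    """Assumption: drop the page cache so the next read goes to the device."""
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


def read_stress(path, rounds, drop=True):
    """Return the indexes of rounds whose digest differs from round 0."""
    reference = None
    mismatches = []
    for i in range(rounds):
        if drop:
            drop_caches()
        digest = md5_of(path)  # an I/O error here raises OSError
        if reference is None:
            reference = digest
        elif digest != reference:
            mismatches.append(i)
    return mismatches


if __name__ == "__main__":
    print(read_stress(sys.argv[1], rounds=100))
```

Round 0 is used as the reference, so a digest computed on a known-good system would be a safer baseline if the very first read may already be corrupted.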

0 Bin Liu 4 months ago in reply to Tony Tang

TI__Guru**** 171061 points

Tony,

Thanks for sharing the log. I am not sure if the customer issue is the same as the one on the SK-AM62B-P1 EVM. The CQE message on the EVM only happens on eMMC writes, not on reads.

0 Christopher Roberts 4 months ago in reply to Bin Liu

TI__Genius 9540 points

Tony,

The issue on SK-AM62B-P1 EVM has been understood and resolved so that the CQE recovery no longer occurs.

The issue was narrowed down once the failing data transfer was localized to a pattern of 0xFF00, which allowed us to reliably recreate the failure. This pattern makes the data lines toggle between 0 and 1 on every transfer, stressing the IO power supply. Once this was identified, the focus shifted to the EVM PDN. It is worth noting that this pattern only failed when transferred in sizes of 1480 B and above.

The SK-AM62B-P1 EVM had poor placement of the SOC_DVDD1V8 decaps. These should be placed near the SOC pins, but on the SK-AM62B-P1 EVM they were far from the SOC.

The EVM was modified following the attached rework instructions:

/cfs-file/__key/communityserver-discussions-components-files/791/ECN-SK_2D00_AM62B_2D00_P1-_2800_PROC142A_2900_-_1320_-add-capacitor-to-VDDSHV4-pin-T7.pdf

After these improvements were made, the 0xFF00 pattern transfer no longer failed.

For the customer to determine if they are similarly impacted they can do two things:

1. Transfer a pattern of 0xFF00 and see if it consistently reproduces the failure.
2. Inspect their board layout to determine if the placement of the high frequency decaps is close to the SOC or if it resembles the EVM implementation.
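
For the first check, a 0xFF00 pattern file could be generated along these lines. This is a minimal sketch with assumptions: the byte order (0xFF first, then 0x00), the 3072-byte default size (chosen only because it is above the 1480 B threshold mentioned above) and the output file name are all illustrative.

```python
import os


def ff00_pattern(size_bytes):
    """Return size_bytes of repeating 0xFF, 0x00 bytes (byte order is an assumption)."""
    pair = bytes([0xFF, 0x00])
    return (pair * (size_bytes // 2 + 1))[:size_bytes]


def write_pattern(path, size_bytes=3072):
    """Write the pattern and fsync so the data is pushed out to the device."""
    data = ff00_pattern(size_bytes)
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    return len(data)


if __name__ == "__main__":
    # Hypothetical output name; copy the result onto the eMMC partition under test.
    print(write_pattern("ff00-test.dat"))
```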

Thanks,

Chris

0 Bin Liu 4 months ago in reply to Christopher Roberts

TI__Guru**** 171061 points

Hi Tony,

The 0xFF00 pattern file is attached below if the customer wants to test it.

ff00-3k.dat

0 Tony Tang 4 months ago in reply to Christopher Roberts

TI__Mastermind 29362 points

Hi Chris,

Testing with the ff00-3k.dat file on the custom board triggers the IO error 100% of the time, even on boards where the customer's own test data file does not trigger it.

The board capacitors were then modified:

HW: On the custom board, the VDDSHV4 pin capacitor placement is as in the snapshot: 2 mm trace to via, 0201 10V 1uF and 0402 6.3V 4.7uF.

Due to the limited space around the via, extra caps were added on the original cap pads:

#1. Add one 0402 0.1uF and one 0402 1uF: still 100% IO error.

#2. Add two 0402 4.7uF: still 100% IO error.

The customer did not observe the VDDSHV4 1V8 rail fall outside the recommended range during the test. How large is the VDDSHV4 power supply fluctuation on AM62-SK-P1 during the stress test?

Any further suggestions for the custom board?

0 Bin Liu 4 months ago in reply to Tony Tang

TI__Guru**** 171061 points

Hi Tony,

Is this test on the project using SDK8.3 or the one using SDK9.1?

0 Tony Tang 4 months ago in reply to Bin Liu

TI__Mastermind 29362 points

Further test: adding 0.1uF on the via on both the SDK8.3 and SDK9.2 boards gave no improvement; the error is still triggered 100% of the time.

0 Mark M 4 months ago in reply to Tony Tang

TI__Mastermind 30140 points

Hi Tony,

Can you share information about the capacitors for the eMMC device and their placement? If the problem occurs on reads with the FF00 pattern, that may indicate an eMMC power issue.

Is it possible to share the layout file privately?

Also, the software team wants to focus on the SDK9.2 board instead of SDK8.3.

Regards,
Mark

0 Christopher Roberts 4 months ago in reply to Tony Tang

TI__Genius 9540 points

Tony,

We have root-caused the AM62x EVM failure to an MMC IO power supply issue and resolved it on the EVM, so this is where we are focusing for the customer's board, to see whether it has the same root cause.

We have not been able to run a simulation to measure loop inductance with the layout files shared by the customer. Mark will follow up via email with additional requests to try to get this information. This will help give us a quantitative measure of PDN performance on the customer board.

We also request the customer to perform an experiment using the test setup they currently have to attempt to measure the power supply noise in both the read and write direction.

A way to get an estimate of the impact of power supply noise is to:

Hold one data signal high, toggle all other data signals 0->1->0->1... Probe the held high signal and look for fluctuations in voltage level.
Hold one data signal low, toggle all other data signals 0->1->0->1... Probe the held low signal and look for fluctuations in voltage level.
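
The two patterns above could be produced as byte streams roughly as sketched below. This rests on several assumptions that are not stated in the thread: an 8-bit data bus, one byte transferred per clock, bit k of each byte driven on DATk, and the bytes reaching the bus unmodified. The function name is hypothetical.

```python
def held_line_pattern(held_bit, held_high, n_bytes):
    """Bytes where data line `held_bit` stays constant and the other 7 lines toggle.

    Assumes bit k of each byte maps to DATk on an 8-bit bus.
    """
    if not 0 <= held_bit <= 7:
        raise ValueError("held_bit must be in 0..7")
    mask = 1 << held_bit
    if held_high:
        # Held line stays 1; the other lines alternate 0, 1, 0, 1...
        phase_a, phase_b = mask, 0xFF
    else:
        # Held line stays 0; the other lines alternate 0, 1, 0, 1...
        phase_a, phase_b = 0x00, 0xFF & ~mask
    pair = bytes([phase_a, phase_b])
    return (pair * (n_bytes // 2 + 1))[:n_bytes]


if __name__ == "__main__":
    print(held_line_pattern(0, True, 8).hex())
    print(held_line_pattern(0, False, 8).hex())
```

Writing such a buffer through a filesystem adds metadata traffic around the payload, so the scope capture still needs to be triggered on the data phase of interest.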

For reference, on the failing EVM this test was performed in the write direction only and the static high signal showed fluctuations from 1.4V to 2.V (1.8V nominal).

The customer will have to take care to ideally measure the impact of reads and writes separately since they are seeing issues in both directions. This will help to confirm there is some impact from power supply noise to see if the failure is the same as the EVM or if this is a separate issue.

Please follow up if there are any questions on setup or data pattern.

Thanks,

Chris

0 Bin Liu 4 months ago in reply to Christopher Roberts

TI__Guru**** 171061 points

Hi Tony,

Here is the eMMC controller register information:

The kernel devicetree k3-am62-main.dtsi, sdhci0 node has:

reg = <0x00 0x0fa10000 0x00 0x1000>, <0x00 0x0fa18000 0x00 0x400>;

This specifies the base address and size of the MMC0 controller MMR regions:

Region #1: base address 0x0fa10000, size 0x1000. MMR details are in the TRM (revB), Table 14-1888;

Region #2: base address 0x0fa18000, size 0x400. MMR details are in the TRM (revB), Table 14-1889.
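
As a cross-check of how the reg cells map to the two regions, the decoding can be sketched as below. It assumes #address-cells = 2 and #size-cells = 2 for the parent bus, which is consistent with the four cells per region shown above but is not quoted from the dtsi; the function name is illustrative.

```python
def parse_reg(cells, address_cells=2, size_cells=2):
    """Split a flat list of 32-bit reg cells into (base, size) tuples."""
    stride = address_cells + size_cells
    if len(cells) % stride:
        raise ValueError("cell count is not a multiple of address+size cells")
    regions = []
    for i in range(0, len(cells), stride):
        base = 0
        for cell in cells[i:i + address_cells]:
            base = (base << 32) | cell  # most significant cell comes first
        size = 0
        for cell in cells[i + address_cells:i + stride]:
            size = (size << 32) | cell
        regions.append((base, size))
    return regions


# Cells copied from the sdhci0 reg property quoted above.
SDHCI0_REG = [0x00, 0x0FA10000, 0x00, 0x1000, 0x00, 0x0FA18000, 0x00, 0x400]

if __name__ == "__main__":
    for n, (base, size) in enumerate(parse_reg(SDHCI0_REG), start=1):
        print(f"Region #{n}: base 0x{base:08x}, size 0x{size:x}")
```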

0 Tony Tang 3 months ago in reply to Bin Liu

TI__Mastermind 29362 points

An update on the waveform captured during the failure:

#1. The failure occurs in the middle of a transfer, so it's not related to signal integrity or eMMC device compatibility.

#2. eMMC is accessed in blocks of 512 bytes. In the failing block, the signal stops for ~128 clocks, yet the total block length is still 512 clocks. It seems the eMMC controller does not stop transmitting even though the signal output has stopped.

#3. This needs analysis at the IP and SoC architecture level, not at the application level.

0 Bin Liu 3 months ago in reply to Tony Tang

TI__Guru**** 171061 points

Tony,

Thanks for the update. The communication continues offline.
