AM4376: MMC/SD reset timeout handling

Ihar Filipau

Part Number: AM4376

Hello All!

We would like TI to clarify the timing of the MMC reset, specifically the timing of the '0->1' and '1->0' steps described for the SRC and SRD bits in the spruhl7i RTM, section 17.5.1.18 (SD_SYSCTL).

Problem:

We have observed on our system* a random 20ms blockage of RT applications.

(*System: Linux RT, kernel v5.10, SDK v8.0-ish. ARM runs at 300MHz. MMCSD is clocked at 48MHz.)

Investigations have localized the issue to the "broken-cd" handling of the SD card interface connected to MMCSD1. (We couldn't connect the CD due to lack of free pins on the SoC. The issue disappears if the SD card is inserted.)

Analyzing the code* and comparing it to the RTM (chapter 17.5.1.18 SD_SYSCTL Register in "spruhl7i"), we don't understand where the 20ms timeout comes from. (The "MMC_TIMEOUT_US" in the Linux driver.)

(*Code: https://elixir.bootlin.com/linux/v5.10.149/source/drivers/mmc/host/omap_hsmmc.c#L975 )

(From RTM, briefly, SRC/SRD resets: set reset bit to 1; wait for transition of the bit from 0 to 1; wait for transition of the bit from 1 to 0.)

The timing of the reset is not specified in the RTM. (Longest time I could find in the RTM was 80 cycles for some internal initialization, which at 48MHz (clock of our MMC) is way way shorter than 20ms.)
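To put numbers on that comparison, here is a small sketch of the arithmetic. The 80 cycles, the 48MHz clock and the 20ms timeout are the figures from this post; that the 80 cycles are counted in the 48MHz domain is an assumption, and the helper names are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: clock cycles that elapse in `us` microseconds. */
static uint64_t cycles_in_us(uint64_t us, uint64_t clk_hz)
{
    return us * clk_hz / 1000000u;
}

/* Hypothetical helper: duration of `cycles` clock cycles in nanoseconds,
 * rounded down. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t clk_hz)
{
    return cycles * 1000000000u / clk_hz;
}

/* How many times an operation of `op_cycles` cycles fits into a timeout
 * of `timeout_us` microseconds, both measured against `clk_hz`. */
static uint64_t timeout_to_op_ratio(uint64_t timeout_us, uint64_t op_cycles,
                                    uint64_t clk_hz)
{
    return cycles_in_us(timeout_us, clk_hz) / op_cycles;
}
```

Under those assumptions, cycles_to_ns(80, 48000000) gives 1666 ns (about 1.7 us), and timeout_to_op_ratio(20000, 80, 48000000) gives 12000, so the 20ms timeout is roughly four orders of magnitude longer than the longest hardware delay found in the RTM.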

It appears that sometimes this reset is triggered from interrupt context, and thus blocks most/all of the system for 20ms. (On RT Linux tests where we see RT errors, sometimes it appears to block everything except for the WDT interrupt.)

It's unclear why the reset routine waits the full 20ms (as if running into the timeout) yet doesn't produce the error message. And the SD card works when plugged in.

Right now my assumption is that MMCSD1 performs the reset faster than the ARM/driver can detect the 0->1 transition, and thus the driver is stuck waiting 20ms for the start of the reset, while in fact MMCSD1 has long since reset itself (the '0' that is being read back is an indicator that the reset has finished, thus no error message).
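The suspected race can be illustrated with a small user-space simulation. This is only a sketch, loosely modelled on the two polling loops in omap_hsmmc.c; the simulated bit, the clock and all sim_* names are hypothetical, and the reset durations used are made-up values, not TI figures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MMC_TIMEOUT_US 20000u /* 20 ms, same value as in the Linux driver */

/* Hypothetical model of an SRC/SRD bit: reads 1 from the moment it is
 * set until the hardware finishes the reset, and 0 afterwards. */
struct sim_reset_bit {
    uint64_t now_us;           /* simulated time since the bit was set */
    uint64_t reset_done_at_us; /* time at which the hardware clears the bit */
};

struct sim_result {
    uint64_t elapsed_us; /* simulated time spent in the reset routine */
    bool error;          /* true if the bit is still 1 at the end */
};

static bool sim_read_bit(const struct sim_reset_bit *b)
{
    return b->now_us < b->reset_done_at_us;
}

static void sim_udelay(struct sim_reset_bit *b, uint64_t us)
{
    b->now_us += us;
}

/* reset_duration_us: how long the simulated hardware keeps the bit at 1.
 * first_read_latency_us: delay between setting the bit and the first
 * read-back (interconnect latency, preemption, and so on). */
static struct sim_result sim_two_phase_reset(uint64_t reset_duration_us,
                                             uint64_t first_read_latency_us)
{
    struct sim_reset_bit b = { 0, reset_duration_us };
    struct sim_result r;
    uint32_t i;

    assert(MMC_TIMEOUT_US > 0);
    sim_udelay(&b, first_read_latency_us);

    /* Phase 1: wait for the 0->1 transition (start of reset). */
    for (i = 0; !sim_read_bit(&b) && i < MMC_TIMEOUT_US; i++)
        sim_udelay(&b, 1);

    /* Phase 2: wait for the 1->0 transition (reset complete). */
    for (i = 0; sim_read_bit(&b) && i < MMC_TIMEOUT_US; i++)
        sim_udelay(&b, 1);

    r.elapsed_us = b.now_us;
    r.error = sim_read_bit(&b); /* only a bit stuck at 1 counts as an error */
    return r;
}
```

In this model, sim_two_phase_reset(2, 0) finishes after 2 us. With sim_two_phase_reset(2, 3) the first read arrives after the reset is already over, phase 1 burns the whole timeout, and the routine returns after 20003 us with no error flagged, because the final read of '0' looks like a completed reset. That matches the silent 20ms stall described above.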

We would like TI to clarify to us the timing of the MMC reset, specifically the timing of all the steps specified in the RTM, '0->1' and '1->0'.

over 2 years ago

0 peaves over 2 years ago

TI__Guru 62855 points

The procedure is clearly defined in the register description.

The proper procedure is: (a) Set to 1 to start reset, (b) Poll for 1 to identify start of reset, and (c) Poll for 0 to identify reset is complete. The mmc_dat finite state machines in both clock domains are also reset.

Some of the time that is required for the reset to complete is related to hardware, but most of it is related to the software operating system and device interconnect delays due to other internal traffic. We do not define the hardware delays, so you must follow the reset procedure of polling for completion.

Regards,
Paul

0 Ihar Filipau over 2 years ago in reply to peaves

Prodigy 70 points

I'm inquiring about the *timing* of the reset, not the procedure. The timing of the reset is *not* documented.

The 20ms used in the Linux drivers (omap_hsmmc and sdhci-omap) is mentioned nowhere in the RTM. (Just for comparison, the worst reset timing I could find in other drivers for other hardware is 0.5ms.) Also note that for OMAP it's two busy-waiting loops, one per transition, and each busy-waits for up to 20ms in interrupt context.

0 peaves over 2 years ago in reply to Ihar Filipau

TI__Guru 62855 points

The time it takes to complete the reset is variable based on the state of the hardware when applied, and we have no plans to document the worst-case time. That is why we provided a way to poll the status to minimize the impact on a time-sensitive application.

I'm not familiar with the software implementation, but I suspect a large polling interval was implemented in Linux to avoid any bandwidth impact on other processes. I will assign this thread to someone on our Linux software team to see if they can answer your software question.

Regards,
Paul

+1 Andreas Dannenberg over 2 years ago

TI__Guru 69632 points

Ihar Filipau said:
Right now my assumption is that MMCSD1 performs the reset faster than the ARM/driver can detect the 0->1 transition, and thus the driver is stuck waiting 20ms for the start of the reset, while in fact MMCSD1 has long since reset itself (the '0' that is being read back is an indicator that the reset has finished, thus no error message).

Based on the analysis you have done I think this looks like a reasonable assumption as to the root cause.

For testing purposes, can you please try removing "ti,needs-special-reset" from all applicable arch/arm/boot/dts/am43* device tree files your system is using and see if this makes the issue go away? This should skip the initial waiting for the reset control bits to become '1', and if it works, could help further validate your theory (I'm not suggesting this to be used as a workaround!).

Thanks, Andreas

0 Andreas Dannenberg over 2 years ago in reply to Andreas Dannenberg

TI__Guru 69632 points

Ihar,

also can you please try the attached patch. Should your assumption of the driver missing the 0->1 transition be true this could be a more graceful way to handle those instances (instead of disabling ti,needs-special-reset). Please let me know what you find in terms of real-time behavior, I don't have a good way to test this.

Regards, Andreas

https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/0001_2D00_HACK_2D00_mmc_2D00_omap_2D00_hsmmc_2D00_Do_2D00_not_2D00_disable_2D00_interrupts_2D00_while_2D00_.patch

0 Ihar Filipau over 2 years ago in reply to Andreas Dannenberg

Prodigy 70 points

Thanks for the feedback! We will test the suggestions at the earliest opportunity (the team is back from a brief vacation next week).

0 Ihar Filipau over 2 years ago in reply to Andreas Dannenberg

Prodigy 70 points

Hi Andreas,

The patch looks wrong to me. Or is this intentional? You do "spin_unlock_irqrestore()" before "spin_lock_irqsave()". The only side effect is to enable interrupts and then disable them again (and leave them disabled after the loop). But as far as I understand the logic of Linux MMC, the interrupts are not disabled here - only preemption is disabled. Please correct me if I'm wrong.

0 Andreas Dannenberg over 2 years ago in reply to Ihar Filipau

TI__Guru 69632 points

Hi Ihar,

the goal for the patch was to still allow other activity to happen if we hit the "20ms timeout condition" corner case by briefly opening up that spinlock that is used throughout the driver ("interrupt" probably wasn't the best way to describe this). This seemed like an experiment worth doing, hence I suggested it, since I don't have a good way to re-create your exact setup and the "blocks most/all of the system for 20ms" issue.

I'm primarily interested in seeing what you get by removing ti,needs-special-reset as discussed earlier. If we can confirm your suspicion as the root cause we can think about how to work around it, and the patch I created may or may not help in this case.

Regards, Andreas

0 Ihar Filipau over 2 years ago in reply to Andreas Dannenberg

Prodigy 70 points

Hi Andreas,

I've marked your first comment with "ti,needs-special-reset" as the solution for the issue, because after applying a patch that removed the option from the SD card configuration, the firmware did pass the tests.

Unfortunately, I don't have log files to verify whether any errors appeared in the syslog. Further tests couldn't be conducted due to time constraints (and the normal reaction of everyone to pick the first patch that passed through the test team). Alas.

Thanks for your support and regards. Ihar

0 Andreas Dannenberg over 2 years ago in reply to Ihar Filipau

TI__Guru 69632 points

Hi Ihar,

thanks for the feedback, and glad to see this seems to solve the issue for you, at least that's what it looks like. I'd still recommend additional system-level / stress testing on your side for now to make sure there are no side effects.

I took a note here on my side to try to find somebody from the design team (after the US Thanksgiving break) to discuss the concern some more, and to see if there is any way to keep the existing code logic intact but reduce the timeout from the 20ms, which seems excessive assuming some controller logic reset is what the timeout is supposed to cover. Ideally we would not "miss" the initial 0->1 transition of the reset bits, but given the way the HW is designed I don't think there's anything clean we can do on the kernel side to completely eliminate this concern and ensure we always "catch" the transition in time.

Regards, Andreas
