Linux/OMAP-L138: Long delay in cppi41_dma_channel_abort causing application level errors

Trevor Sutter

Part Number: OMAP-L138

Tool/software: Linux

I am working with a custom board designed around the OMAP-L138, running Linux. After seeing application-level issues (CPU starvation) during USB disconnects, I traced the cause to cppi41_dma_channel_abort (in drivers/usb/musb/musb_cppi41.c). This function runs in atomic context with all IRQs blocked, and it contains a long mdelay:

	/* DA8xx Advisory 2.3.27: wait 250 ms before to start the teardown */
	if (musb->io.quirks & MUSB_DA8XX)
		mdelay(250);

It seems like a bad idea to have such a long delay run in this context. I have tried searching around to find more information on the origins or necessity of this delay, but there was little information in the referenced advisory and the comments accompanying the commit to linux-davinci (593bc4622a98c172dbb939103aef917d1800a663) don't provide much detail. I would like to know:

How necessary is this delay, and what are the risks of removing it? I ask because it is causing my applications to experience CPU starvation. If it is only in place to catch an edge case seen in a very specific use case, the best option might be to do without it.
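To make the cost concrete, here is a minimal userspace sketch of the difference between a busy-wait and a sleeping wait. It is only an analogy: it is not kernel code, it does not call mdelay(), and it cannot block IRQs. The helper names (busy_wait_ms, sleep_wait_ms, cpu_ms_during) are hypothetical and exist only for this illustration. It assumes a POSIX system that provides clock_gettime and nanosleep.

```c
/* Userspace analogy only: NOT kernel code, does not call mdelay(). */
#define _POSIX_C_SOURCE 200809L
#include <time.h>

/* Read the given clock and return its value in milliseconds. */
static double clock_ms(clockid_t id)
{
	struct timespec ts;

	clock_gettime(id, &ts);
	return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

/* Spin until `ms` of wall-clock time has passed (stand-in for a busy-wait delay). */
static void busy_wait_ms(long ms)
{
	double end = clock_ms(CLOCK_MONOTONIC) + ms;

	while (clock_ms(CLOCK_MONOTONIC) < end)
		;	/* burn CPU, nothing else in this thread can progress */
}

/* Sleep for `ms`, handing the CPU back to the scheduler. */
static void sleep_wait_ms(long ms)
{
	struct timespec req = { ms / 1000, (ms % 1000) * 1000000L };

	nanosleep(&req, NULL);
}

/* Return the CPU time (ms) this process consumed while running wait_fn(ms). */
static double cpu_ms_during(void (*wait_fn)(long), long ms)
{
	double start = clock_ms(CLOCK_PROCESS_CPUTIME_ID);

	wait_fn(ms);
	return clock_ms(CLOCK_PROCESS_CPUTIME_ID) - start;
}
```

Comparing cpu_ms_during(busy_wait_ms, 250) with cpu_ms_during(sleep_wait_ms, 250) should show the busy-wait consuming CPU time for most of the 250 ms, while the sleeping wait consumes almost none. In the driver the situation is worse than in this sketch, because the busy-wait runs with interrupts blocked, so nothing else can be scheduled on that CPU until it finishes.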

I am hoping someone here has more knowledge of this driver than I do, and I would appreciate any additional information or suggestions.

Thank you,

Note: If this is not the correct forum for such a question, please correct me. This is my first post.

over 6 years ago

Yordan Kovachev over 6 years ago

TI Guru 161600 points

Hi,

Which Linux SDK version are you using?

Best Regards,
Yordan

Yordan Kovachev over 6 years ago

TI Guru 161600 points

Hi,

Trevor Sutter said:

How necessary is this delay, and what are the risks run in removing it?

The delay was added for the following reason:
Teardown of the receive DMA does not work reliably. The problem occurs when software initiates a teardown while the endpoint is still active. Frequent teardowns can leave the XDMA hung.

All kernel versions (up until latest TISDK -> kernel v4.9.41) have this delay.

The risk is XDMA hanging up when there are frequent teardowns.

Best Regards,
Yordan

Trevor Sutter over 6 years ago in reply to Yordan Kovachev

Prodigy 175 points

Yordan,

Thank you for a quick response.

Yordan Kovachev said:

Which Linux SDK version are you using?

I am actually using kernel v4.12.0.

Yordan Kovachev said:

All kernel versions (up until latest TISDK -> kernel v4.9.41) have this delay.

Do you mean to say that all these versions do NOT have this delay? I was under the impression that until this was committed (April 18, 2017, seen in v4.12-rc1) the delay did not exist.

Yordan Kovachev said:

The risk is XDMA hanging up when there are frequent teardowns.

My question could have been phrased better. I understand that the DMA hanging is the risk here. However, what I do not understand is why there is such a long delay in an atomic context where we can be assured other processes will not run. It seems like such a fix would have a good explanation for why it was done this way. Perhaps I should be contacting the author directly. I figure that asking here might prove beneficial to others who run into this problem.

Thanks,

Yordan Kovachev over 6 years ago in reply to Trevor Sutter

TI Guru 161600 points

Hi,

Trevor Sutter said:

Do you mean to say that all these versions do NOT have this delay? I was under the impression that until this was committed (April 18, 2017, seen in v4.12-rc1) the delay did not exist.

This is a silicon bug described in the OMAP-L138 Errata document, and it has existed for a long time, so all software released for the OMAP-L138 should have the described workaround (250 ms): Linux kernel, TI RTOS, and custom bare-metal drivers.
I was not involved in validating the bug and the workaround, so I cannot answer why the delay is so long, but I can assure you that this has been extensively tested and that the team arrived at the exact value of 250 ms.
You can choose to use a shorter delay, making sure that no other process is running (if this works for your application), but this is at your own risk.

Best Regards,
Yordan
