Linux/AM3354: Linux DMAengine/EDMA Question

Amanda Ross95

Part Number: AM3354

Tool/software: Linux

Hello there,

I am looking to do scatter-gather DMA transfers to a device connected to the GPMC bus. The GPMC bus operates in non-NAND mode. Our driver uses the dmaengine interface.

I can do single transfers by passing DMA_MEMCPY to dma_request_channel and then using dmaengine_prep_memcpy, and that works. The issue is that I need to context switch after every transfer (via wait_for_completion plus a callback on the DMA descriptor). This seems inefficient for many small transfers, and because of the way the device works, we need to do a lot of tiny transfers.

I can't operate the DMA to GPMC in slave mode (i.e. dmaengine_prep_slave_sg, dmaengine_prep_slave_single), and as I understand it, slave only works when the GPMC is configured in NAND mode.

dmaengine_prep_dma_sg() seems to be exactly what I need. The EDMA driver in linux 4.4 doesn't implement this. I've briefly scanned TI's linux tree on gitorious, but didn't find any such implementation in either 4.4 or 4.9 branches.

Is there another similar interface that I can use to do these scatter-gather transfers? Are there plans to implement the DMA_SG interface in the edma driver?

Thanks for any input you can provide!

-Amanda

over 8 years ago

Yordan Kovachev over 8 years ago

TI Guru, 161600 points

Hi Amanda,

"Is there another similar interface that I can use to do these scatter-gather transfers? Are there plans to implement the DMA_SG interface in the edma driver?"

I am not aware of any plans to implement this in the dma driver.

Best Regards,
Yordan

Pavel Botev over 8 years ago

TI Guru, 170625 points

Hi Amanda,

Scatter-gather DMA refers to the ability of a DMA engine to automatically perform a string of DMA operations from non-contiguous memory blocks in a single operation. "Scatter" refers to the ability to write to a number of non-contiguous blocks, while "gather" refers to the ability to read from a number of such blocks.

Scatter-gather requires the DMA programmer to provide a linked list of so-called "descriptors", each of which contains the source and destination addresses of a (contiguous) block of memory. The DMA controller is (typically) told where this list resides in host memory, and fetches each descriptor before each transfer.
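As a rough illustration of the descriptor idea, here is a minimal user-space sketch in C that models a descriptor chain in software. The struct layout and the names sg_desc and sg_run_chain are hypothetical and chosen only for this example; a real controller such as EDMA defines its own hardware descriptor format, and the walk below is done by the CPU, not by DMA hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor layout, for illustration only. Real controllers
 * use their own hardware-defined formats. */
struct sg_desc {
    const void *src;            /* start of one contiguous source block */
    void *dst;                  /* start of one contiguous destination block */
    size_t len;                 /* number of bytes to move for this block */
    const struct sg_desc *next; /* next descriptor, or NULL to end the chain */
};

/* Software stand-in for the controller: fetch each descriptor in turn and
 * perform its transfer. Returns the total number of bytes moved. */
static size_t sg_run_chain(const struct sg_desc *d)
{
    size_t total = 0;

    for (; d != NULL; d = d->next) {
        /* A non-empty block must have valid source and destination. */
        assert(d->len == 0 || (d->src != NULL && d->dst != NULL));
        memcpy(d->dst, d->src, d->len);
        total += d->len;
    }
    return total;
}
```

The point of the model is that the whole chain is handed over once: the caller builds the list, starts it, and is only involved again when the last descriptor has been processed.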

Yes, in PSDK 3.03 (kernel 4.4.41) we have only "slave_sg" support:

linux-4.4.41/drivers/dma/edma.c -> s_ddev->device_prep_slave_sg = edma_prep_slave_sg;

This scatter-gather usage is described in the files below:

linux-4.4.41/Documentation/dmaengine/client.txt
linux-4.4.41/Documentation/dmaengine/provider.txt
linux-4.4.41/Documentation/DMA-API.txt
linux-4.4.41/Documentation/DMA-API-HOWTO.txt
linux-4.4.41/Documentation/dma-buf-sharing.txt

The links below may also help:

e2e.ti.com/.../546507
e2e.ti.com/.../149572
www.linuxjournal.com/.../7104

Regards,
Pavel

Alexandru Gagniuc over 8 years ago in reply to Pavel Botev

Prodigy 170 points

Hi,

I'm the one that opened up this question originally. The problem that I'm seeing is a huge overhead associated with chaining DMA transactions in software. Consider the following IO chain:

gpmc_write(SOME_CTL_REG, 4);
gpmc_read(SOME_CTL_REG, 4);
gpmc_write(CTL_REG2, 8);
gpmc_write(DATA_WINDOW, 100);

I can measure the latency by watching the GPMC CSN line with an oscilloscope. On a bare-metal system, I can chain these transactions in software with a latency of about 2 us. That is acceptable.

On the other hand, on a Linux system, the minimum latency I see is about 20 us, and it goes as high as 60 us. This minimum latency is the same whether I use wait_for_completion_timeout() or dma_sync_wait(), and whether or not I pass the DMA_PREP_INTERRUPT flag to device_prep_dma_memcpy(). This latency limits the throughput to levels that are not suitable for driving the device on the other end of the bus.

What I'm trying to do is lower this 20us latency. I think the way to do this is to chain the DMA transactions in hardware, hence my original question. Is there some other optimization that I'm oblivious to, or do I have a better chance at hacking the edma driver?
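To put rough numbers on this, here is a back-of-the-envelope sketch in C. It assumes that every transfer after the first is preceded by one fixed software gap and it ignores the time the data actually spends on the bus; the function name chain_gap_overhead_us is made up for this example.

```c
#include <assert.h>

/* Assumption: each transfer after the first costs one fixed software gap,
 * and the bus transfer time itself is ignored. Returns the total gap time
 * in microseconds for a chain of n_transfers transfers. */
static unsigned chain_gap_overhead_us(unsigned n_transfers, unsigned gap_us)
{
    assert(n_transfers > 0);
    return (n_transfers - 1) * gap_us;
}
```

For the four-transfer chain above, this gives 6 us of gap time at 2 us per gap, 60 us at 20 us per gap, and 180 us at 60 us per gap.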

CSN activation, bare-metal system:

CSN activation, linux 4.4 with dmaengine:

Pavel Botev over 8 years ago in reply to Alexandru Gagniuc

TI Guru, 170625 points

Alexandru,

EDMA is the right direction to reduce GPMC latency.

We don't quote latency numbers for interrupts due to the complexity of the Cortex-A8 and the many possible software variations. Latencies can differ greatly between Linux, RTOS, and bare-metal solutions, depending on many factors such as whether the MMU is used and how it is used. Other major factors include any critical sections in the customer's code base.

Also, depending on your latency requirements, there are methods available to limit latency. One example is to lock any code that requires low latency into the L2 cache.

Note that there are other options (instead of the Linux PSDK) that might reduce latency, such as StarterWare, Linux-RT, or an RTOS.

www.ti.com/.../processor-sdk-am335x

The e2e threads below may also help:

e2e.ti.com/.../1056619

e2e.ti.com/.../36625
e2e.ti.com/.../143795

e2e.ti.com/.../356576
e2e.ti.com/.../573232
e2e.ti.com/.../477311

Regards,
Pavel
