How to maximize EDMA3 performance?

yanbin xing

Other Parts Discussed in Thread: OMAP-L138

I'm using the OMAP-L138. I have two-channel interleaved source data stored in internal SRAM, like this:

a1 b1 a2 b2 a3 b3............. (Both a and b are 1 byte data)

I want to pick out only the channel-a data, using EDMA3 to copy a1 a2 a3 ... to external DRAM.

Unfortunately, my EDMA3 transfer reaches only 37 MB/s, whereas a general linear EDMA3 copy runs at about 500 MB/s. Is something wrong? Or is there a good way to make it faster?

over 13 years ago

0 kcastille over 13 years ago

TI__Guru 54422 points

Yanbin,

To accomplish this, I expect you're setting up an EDMA transfer with ACNT = 1, BCNT = N, AB-sync, SRCBIDX = 2, DSTBIDX = 1 (right?).

In this case, the EDMA TC is forced to issue a series of 1-byte read requests to the source addresses, which does not take advantage of the wide bus interfaces and burst capability provided by the EDMA, the interconnect, and the SRAM buses. This is the root cause of the bandwidth degradation you observe compared to a pure linear transfer (large ACNT, which results in the EDMA TC issuing burst requests to the source addresses).

As a side note, on the DST side, the EDMA TC logic recognizes that the transfer is really linear (since ACNT == DSTBIDX) and is able to perform burst transfers to the destination. While this doesn't improve the raw throughput you observe (since the SRC side is still slow), it does provide better efficiency on that side of the transfer. Put differently, you're using the SRAM bandwidth inefficiently but the DRAM bandwidth efficiently.
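To make the indexing concrete, here is a minimal host-side sketch of how a transfer with ACNT = 1, BCNT = N, SRCBIDX = 2, DSTBIDX = 1 walks the two buffers. The type `edma_model_t` and the functions `edma_model_run` and `channel_a_descriptor` are hypothetical names invented for illustration. This is a software model only: it is not the PaRAM register layout, it touches no hardware, and it says nothing about timing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model of one EDMA3 transfer with CCNT = 1.
 * Field names mirror the parameters discussed in the thread; this is
 * NOT the device register layout. */
typedef struct {
    uint16_t acnt;     /* bytes per array (first dimension)       */
    uint16_t bcnt;     /* number of arrays (second dimension)     */
    int16_t  srcbidx;  /* source byte offset between arrays       */
    int16_t  dstbidx;  /* destination byte offset between arrays  */
} edma_model_t;

/* Copy bcnt arrays of acnt bytes each, advancing source and destination
 * by their respective B indexes after every array.
 * Returns the number of bytes written. */
static size_t edma_model_run(const edma_model_t *p,
                             const uint8_t *src, uint8_t *dst)
{
    size_t written = 0;
    for (uint16_t b = 0; b < p->bcnt; ++b) {
        for (uint16_t a = 0; a < p->acnt; ++a) {
            dst[a] = src[a];
            ++written;
        }
        src += p->srcbidx;
        dst += p->dstbidx;
    }
    return written;
}

/* Parameters for picking channel a out of n interleaved a/b byte pairs:
 * one byte per array, skip two bytes on the source, one on the destination. */
static edma_model_t channel_a_descriptor(uint16_t n)
{
    edma_model_t p = { 1, n, 2, 1 };
    return p;
}
```

With ACNT = 1 the inner loop runs exactly once per array, so every source byte is its own access. That is the pattern that prevents bursting on the source side.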

So ... your question is .... how can you make it faster?

You may be able to use the CPU to rearrange the buffer in L2 SRAM into the form a1 a2 a3 ...; b1 b2 b3 ..., and then use the EDMA to transfer linear blocks to DRAM. This may be unattractive since it consumes CPU MIPS. I'm not sure you'll end up with a net faster result, but I expect you would if the DSP code is written in a reasonably optimal way.

Other than that, unless you're able to rearrange the data buffers somehow, ACNT = 1 fundamentally limits the EDMA to a relatively low bandwidth.
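As a rough illustration of the CPU rearrangement step, the sketch below splits the interleaved buffer into two planar buffers in plain portable C. The function name `deinterleave_ab` is hypothetical, and this is not tuned DSP code: an optimized version would likely handle several bytes per iteration, which is not shown here.

```c
#include <stddef.h>
#include <stdint.h>

/* Split n_pairs interleaved pairs (a1 b1 a2 b2 ...) into two planar
 * buffers (a1 a2 ...) and (b1 b2 ...). The output buffers must each
 * hold n_pairs bytes and must not overlap the input. */
static void deinterleave_ab(const uint8_t *in, size_t n_pairs,
                            uint8_t *a_out, uint8_t *b_out)
{
    for (size_t i = 0; i < n_pairs; ++i) {
        a_out[i] = in[2 * i];      /* even offsets carry channel a */
        b_out[i] = in[2 * i + 1];  /* odd offsets carry channel b  */
    }
}
```

Once `a_out` is contiguous, it can be sent to DRAM with a single linear transfer using a large ACNT.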

Regards
Kyle

0 yanbin xing over 13 years ago in reply to kcastille

Prodigy 170 points

Kyle,

Thank you, I got it.

0 Doug Kim over 10 years ago in reply to kcastille

Intellectual 300 points

Hi, I am also using the L138. I need to burst-copy data from shared RAM to DDR RAM, and I have the same situation: a1 b1 a2 b2 a3 b3 ... (both a and b are 1-byte data). My settings are ACNT = 1, BCNT = 0x40, CCNT = 1, SRCBIDX = 2, DSTBIDX = 1. I understand your point that ACNT = 1 is not efficient, but the transfer takes more than 20 us, which I believe is too much. Would QDMA be better? Do you have a QDMA example for the L138? Thanks in advance. - Doug

0 kcastille over 10 years ago in reply to Doug Kim

TI__Guru 54422 points

Doug,

QDMA and EDMA have the same performance characteristics. It may be faster to rearrange the data buffer in SRAM (using either the EDMA or the CPU), and then DMA the buffer out to SDRAM using a large ACNT. The point is to use the faster on-chip SRAM for the inefficient 1-byte accesses, and then use long bursts to SDRAM.

Regards,

Kyle

0 Doug Kim over 10 years ago in reply to kcastille

Intellectual 300 points

Hi Kyle,

I agree about the ACNT issue, but this is far too slow. The EDMA clock is 228 MHz (a period of about 4.4 ns), but the measured result is 20 us / 0x40 = about 312 ns per byte, and 312 ns / 4.4 ns is about 71, so it is roughly 71 times slower than one byte per clock. I believe there is too much waiting for each byte sent, which is why I thought about QDMA. I know QDMA has basically the same architecture, but if QDMA could reduce this waiting time, it might be possible to optimize the speed. I cannot understand why stepping through BCNT is so much slower than using ACNT. Is there any handshaking scheme in BCNT processing? If so, we might be able to optimize that handshaking process. I don't have enough material on QDMA: I have the EDMA3 document for the L138, but it contains no QDMA example. If QDMA follows the same process as EDMA, could you find out why there is so much delay? I have to get the copy down to about 30 us per 4K bytes, with high- and low-byte samples, using DMA (shared RAM to DDR RAM).
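The arithmetic above can be checked with a few lines of C. The helper names are hypothetical, the inputs are only the figures quoted in this post (20 us for 0x40 bytes, a 228 MHz clock, a target of 4K bytes in about 30 us), and the clock period is taken as 1/228 MHz. Nothing here is a measurement.

```c
/* Average time per element in nanoseconds, given a total in microseconds. */
static double ns_per_element(double total_us, unsigned count)
{
    return total_us * 1000.0 / (double)count;
}

/* Clock period in nanoseconds for a frequency given in MHz. */
static double clock_period_ns(double freq_mhz)
{
    return 1000.0 / freq_mhz;
}

/* Throughput in MB/s (10^6 bytes per second): bytes per microsecond. */
static double mb_per_s(double bytes, double time_us)
{
    return bytes / time_us;
}
```

Under those assumptions, `ns_per_element(20.0, 0x40)` is 312.5 ns per byte, which is about 71 periods of `clock_period_ns(228.0)`, and `mb_per_s(4096.0, 30.0)` puts the target at roughly 136.5 MB/s.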

Best regards,

Doug
