
EDMA vs Cache for SDRAM data transfer



Hello,

I'm running a test that transfers 4096 bytes from a 16-bit SDRAM (through the EMIF) to internal SRAM on a C6414.

In the first test, I disable the L2 cache, so the SDRAM is not cacheable. I run the memory transfer using the DAT module of the CSL (which uses an EDMA channel). I get a data throughput close to the EMIF bandwidth, and I can see in the profiler that a single EDMA transfer occurs.
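For reference, this is roughly how the DAT transfer is set up (a minimal sketch of my test; error handling is omitted and the exact header names depend on the CSL version):

    #include <csl.h>
    #include <csl_dat.h>

    void dat_test(void *src, void *dst)
    {
        Uint32 id;

        CSL_init();                            /* initialize the chip support library */
        DAT_open(DAT_CHAANY, DAT_PRI_LOW, 0);  /* grab any free EDMA channel          */

        id = DAT_copy(src, dst, 4096);         /* submit the 4096-byte transfer       */
        DAT_wait(id);                          /* block until the EDMA completes      */

        DAT_close();
    }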

In the second test, I enable the L2 cache (256K) and configure the SDRAM as cacheable. I run the memory transfer using memcpy. Compared to the first test, it takes twice as many cycles to execute the same transfer. I was indeed expecting a longer transfer time, because the cache controller has to perform one EDMA transfer every 128 bytes (one per cache line), which introduces overhead; I can see in the profiler that the cache management issues 32 EDMA requests (4096/128). The profiler also reports cpu.stall.mem.l2.cache.miss.read=8653 and CPU.stall.mem.L1D=9101 out of the ~10000 cycles of the memcpy.
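For completeness, the cache configuration of the second test looks roughly like this (a sketch; the CACHE_* constant names are taken from my CSL version and may differ in yours):

    #include <csl.h>
    #include <csl_cache.h>

    void cache_setup(void)
    {
        CACHE_setL2Mode(CACHE_256KCACHE);       /* configure L2 as 256K cache          */
        CACHE_enableCaching(CACHE_EMIFA_CE00);  /* make the SDRAM region (EMIFA CE0)   */
                                                /* cacheable via its MAR bits          */
    }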

I don't really understand the factor of 2 between the two configurations (cache vs. EDMA).

Can anyone help?

 Thanks in advance.

Laurent.

  • Laurent,

    I understand from your message that you're using a software simulator. Even if you're on a cycle-accurate simulator, it's better to double-check the conclusions on a board, at least for the EMIF traffic.

    Anyway, there can be several reasons why a memcpy might be slower than an EDMA transfer on a C6414. First of all, memcpy() works only on bytes, since it cannot make any assumption about the word alignment of the arrays.

    You can refer to this application note to get an estimate of  data movement  performance of C641x:

    TMS320C64x EDMA Performance Data (spraa02.pdf)
    http://www.ti.com/litv/pdf/spraa02

    To move data efficiently with the C6400 CPU you should try to exploit the full internal bus width, using the LDW/STW instructions. If I remember correctly, the C6400 has 64-bit internal buses, so it should also support LDDW/STDW.
    If the arrays are 32-bit aligned, I would expect this code to be much faster than memcpy() when compiled with -O3:

    #include <csl.h>   /* for Uint8; the _amem4 intrinsic is built into the TI compiler */

    #define SIZE 4096

    #pragma DATA_ALIGN(source, 4)
    #pragma DATA_ALIGN(destination, 4)
    Uint8 source[SIZE];
    Uint8 destination[SIZE];

    void copy32(void)
    {
        int i;

        /* one aligned 32-bit load/store (LDW/STW) per iteration */
        for (i = 0; i < SIZE; i += 4) {
            _amem4((void *) &destination[i]) = _amem4((void *) &source[i]);
        }
    }
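    If the arrays are 8-byte aligned, you can go one step further with _amemd8, which should map to LDDW/STDW (same assumptions as above; just a sketch of the inner loop):

    #pragma DATA_ALIGN(source, 8)
    #pragma DATA_ALIGN(destination, 8)

    for (i = 0; i < SIZE; i += 8) {
        /* one aligned 64-bit load/store (LDDW/STDW) per iteration */
        _amemd8((void *) &destination[i]) = _amemd8((void *) &source[i]);
    }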

    The C6000 Optimization Workshop on http://processors.wiki.ti.com/index.php/Hands-On_Training_for_TI_Embedded_Processors might offer some more hints.

    Hope this helps

    Best regards
    Massimo

  • Laurent,

    In general, we consider EDMA to be a higher-performing choice than CPU reads/writes, even with the cache enabled. The biggest advantage is that you can start the EDMA channel on its transfer and then have the DSP execute instructions in parallel. The best scenario tends to be the ping/pong buffering method, where you let the DSP operate on one buffer (ping) while the EDMA is moving data into or out of the other buffer (pong).
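    As an illustration, a ping/pong loop built on the CSL DAT module might look like the sketch below; process_block() is just a placeholder for whatever work the DSP does on each buffer:

    #include <csl.h>
    #include <csl_dat.h>

    #define BLKSIZE 4096

    extern void process_block(Uint8 *buf, Uint32 size);  /* placeholder */

    Uint8 ping[BLKSIZE], pong[BLKSIZE];

    void run_pingpong(Uint8 *extSrc, Uint32 nBlocks)
    {
        Uint8 *work = ping, *fill = pong;
        Uint32 id, i;

        DAT_open(DAT_CHAANY, DAT_PRI_LOW, 0);

        /* prime the first buffer */
        id = DAT_copy(extSrc, work, BLKSIZE);
        DAT_wait(id);

        for (i = 1; i < nBlocks; i++) {
            /* start filling the idle buffer in the background ... */
            id = DAT_copy(extSrc + i * BLKSIZE, fill, BLKSIZE);

            /* ... while the CPU works on the buffer that is ready */
            process_block(work, BLKSIZE);

            DAT_wait(id);  /* make sure the background fill finished */

            { Uint8 *tmp = work; work = fill; fill = tmp; }  /* swap buffers */
        }

        process_block(work, BLKSIZE);  /* last buffer */
        DAT_close();
    }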

    The primary efficiency that you gain with the EDMA is its use of longer bursts when accessing the SDRAM. The cache controller does as well as it can, but there are design tradeoffs that tend to make EDMA faster even when you block waiting for the transfer to complete, for example with a QDMA submit-and-wait.

    Check the assembly implementation of your memcpy to see if it is optimized for using LDDW/STDW. Some RTS releases do that when possible.

    Some people have had good luck improving cache performance for an array by first running a "touch" loop that pre-loads the cache, reading the first word of each cache line across the depth of the array. I remember someone reporting better overall performance even when including the time for the touch loop. Just something you could try, if you want to.
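    A touch loop for this device reads one word per 128-byte L2 line, something like this sketch:

    #include <csl.h>   /* Uint8/Uint32 */

    #define L2_LINE_SIZE 128   /* C64x L2 line size in bytes */

    void touch(const Uint8 *array, Uint32 size)
    {
        volatile Uint8 dummy;
        Uint32 i;

        /* one read per line is enough to trigger the full line fill */
        for (i = 0; i < size; i += L2_LINE_SIZE)
            dummy = array[i];
    }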

    I recommend using the EDMA, though. And try to design your system so you are not polling for completion the whole time the EDMA is running, but doing something useful or using a ping/pong technique.

    Regards,
    RandyP

  • Hello Massimo,

    My data buffers are declared as long long and are 64-bit aligned.

    I've tried the _amem4 loop, but it seems to be slower than the TI memcpy, which appears to be optimized already: it uses the LDNDW instruction (_memd8), and I can see the 8-byte access increments when stepping through in the debugger.

    Best regards.

    Laurent.

  • Hello,

    Thanks for your answer.

    I agree that EDMA has higher performance than CPU reads/writes. But in my test I didn't want to compare direct EDMA access against memcpy for copying data buffers from external memory. In fact, I have structured data (not signal-processing or algorithm buffers) that I want to access in the same way wherever it is mapped, in internal or external memory (e.g. myStruct myData; foo = myData.field1;). With the cache enabled, a read or write access loads the cache line with the data and triggers an EDMA in a transparent way. The first access to the data is more expensive because of the cache miss, but the following accesses are fast.

    To see exactly what this transparency costs, I made a test in which I access the data directly. So forget my previous comparison between memcpy and EDMA; instead, I now run the following:

    1) A read access to a datum mapped in external memory. Because the L1/L2 caches are enabled, this single access triggers a cache line fill from external memory, transparently performed by the hardware as a 128-byte EDMA transfer.

    2) For comparison, a 128-byte EDMA transfer started by software using the DAT module of the CSL.

    The first test is:

        long long* mySDRAMvalues = (long long*)0x80000000;   // EMIFA => SDRAM start address
        tCacheStruct* myStruct = (tCacheStruct*)mySDRAMvalues;
        myValue = myStruct->b;  // single data access => triggers a cache miss and an EDMA line fill
        myValue++;

    I compare it to:
        DAT_copy((long long *)0x80000000, myValues, 128);

    The DAT copy is about 150 cycles faster than the single access, for several CPU/EMIF frequency configurations. I can't find an explanation for this 150-cycle delta (it seems to come from L1D/L2 cache stalls), given that the cache line is also loaded with an EDMA. I'm going to take a detailed look at spraa02 to understand where those cycles are lost.

    Note: I now use the CSL timer service for the measurements.
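    For reference, the measurement is done roughly like this (a sketch; the timer control value 0x000003C0 is my assumption for "internal clock source, free running", so check it against the timer peripheral documentation before reusing it):

        #include <csl.h>
        #include <csl_timer.h>

        TIMER_Handle hTimer;
        Uint32 t0, t1, cpuCycles;

        hTimer = TIMER_open(TIMER_DEVANY, TIMER_OPEN_RESET);
        TIMER_configArgs(hTimer,
                         0x000003C0,   /* ctl: internal clock, free run (assumed) */
                         0xFFFFFFFF,   /* prd: maximum period                     */
                         0x00000000);  /* cnt: start counting from zero           */

        t0 = TIMER_getCount(hTimer);
        /* ... code under test ... */
        t1 = TIMER_getCount(hTimer);

        cpuCycles = (t1 - t0) * 8;     /* C64x timers tick at CPU clock / 8 */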

    Regards,
    Laurent