DMA vs. memcpy

Does anyone have any metrics on when to use DMA to do a copy versus just using a memcpy? I guess I am asking how large a buffer needs to be before DMA becomes more efficient. We can assume that the data is aligned on 32-bit boundaries and consists of whole words (32 bits). With DMA, several registers need to be programmed, and polling for transfer completion adds to the overhead.

  • Hello!

    If you picture DMA use as set up, transfer, and poll for completion, then there is no performance benefit. It is better to design your application so that once the DMA transfer is started, the CPU switches to other useful work while the DMA runs on its own. A good approach is to enable a completion interrupt: the CPU is then notified when the transfer completes and can schedule processing of the copied data accordingly.

    One more note on moving data. memcpy() tries to maximize copy throughput, but it has to perform additional checks before it reaches that maximum. If you know properties of your data, such as its alignment and that its length is a multiple of some size, a for loop might work even better. Suppose your arrays of 32-bit values are not only 32-bit aligned but 64-bit aligned, and the element count is even. Then the following piece of code

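    int src[COUNT];   /* assumed 64-bit aligned, COUNT even */
    int dst[COUNT];   /* assumed 64-bit aligned */
    
    long long *psrc = (long long *) src;
    long long *pdst = (long long *) dst;
    int i;
    
    /* each iteration moves one aligned 64-bit double word (two ints) */
    for (i = 0; i < COUNT/2; i++) 
        _amem8(pdst++) = _amem8_const(psrc++);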

    will copy 2 elements at a time, and so use the maximum bandwidth of the C64x and beat memcpy too.

  • David Boles,

    DMA:
    The DMA transfers can be triggered 3 different ways:

    • Manual START
    • Sync event from a peripheral
    • Chained event

    The DMA can be configured to respond to sync events from peripherals. If the memcpy() transfer size could potentially be greater than 100, the benefit of using hardware DMA increases significantly.
    If this is the case, configure the DMA peripheral available on the device to perform the data/memory transfer instead of using the memcpy() function call.

    If you plan to do only memory-to-memory transfers, similar to a memcpy() in C, you can use the Quick DMA (QDMA).
    The QDMA uses a trigger word to start the transfer and is used for memory-to-memory data movement.
    It cannot be synced to a peripheral event.

    MEMCPY:
    A call to memcpy() in the code may call either of two functions in the library:
    strasg - This function disables interrupts by clearing the GIE bit in the CSR.
    memcpy - This function does not disable interrupts.

    The choice of strasg() vs. memcpy() depends on the alignment and length of the object to be copied.
    If the alignment is exactly 32 bits and the length is greater than 28, strasg() would be called.

  • David,

    Memory transfer performance depends on your board and your application.

    - There are 4 combinations of transfer src/dst using L2 SRAM and External Memory, or more if you consider L1D SRAM.
    - There can be considerations for cache coherency that need to be included in the benchmarking, when external memory is being used.
    - There are timing requirements, buffer sizes, and shared resources that may need to be considered for the full application.

    The best thing to do is run the tests in your environment the way you need it to be done:

    - Run memcpy with typical alignment and take TSCL readings before and after.
    - Run QDMA with the same alignment and take TSCL readings before and after, using cache coherency commands as needed.
    - If the alignment and length and compiler settings work out, try the for-loop in rrlagic's well-experienced post.
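
    The comparison in the list above can be sketched as a small timing harness. This is only an illustration: the names copy_fn, copy_memcpy, copy_words, and time_copy are made up for this example, and the standard clock() call stands in for the TSCL readings so the code also builds on a host PC. A QDMA routine with the same signature could be added as a third candidate. Check the generated code to confirm the optimizer has not removed the repeated copies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Signature shared by every copy routine under test (hypothetical). */
typedef void (*copy_fn)(void *dst, const void *src, size_t bytes);

/* Candidate 1: the library memcpy. */
static void copy_memcpy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Candidate 2: plain 32-bit word loop. Assumes word-aligned buffers
   and a byte count that is a multiple of 4. */
static void copy_words(void *dst, const void *src, size_t bytes)
{
    uint32_t *d = (uint32_t *)dst;
    const uint32_t *s = (const uint32_t *)src;
    size_t i;

    for (i = 0; i < bytes / sizeof(uint32_t); i++)
        d[i] = s[i];
}

/* Run one routine `reps` times and return the elapsed clock() ticks.
   On a C6000 target, replace the two clock() calls with TSCL reads. */
static clock_t time_copy(copy_fn fn, void *dst, const void *src,
                         size_t bytes, int reps)
{
    clock_t start, stop;
    int r;

    assert(reps > 0);
    start = clock();
    for (r = 0; r < reps; r++)
        fn(dst, src, bytes);
    stop = clock();
    return stop - start;
}
```

    Call time_copy() once per candidate with the same buffers and size, then repeat over the range of sizes and memory placements (L2 SRAM, external memory) that matter to your application.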

    15 years ago, we measured 5 words as being the decision point. <5 words use memcpy, >5 words use EDMA.

    With the faster memories on the DSK6455 and the faster internal buses inside the C6455, that number moved higher. I have not measured it since then, but I would use memcpy for <50 words, QDMA for >100 words, and measure carefully in between.
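
    As a rough illustration of that rule of thumb, the choice could be wrapped in a helper like the one below. The enum, the macro names, and choose_copy_method are hypothetical, and the 50 and 100 word thresholds are only the unmeasured estimates quoted above, so treat them as starting values to replace with your own benchmark results.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the three outcomes of the rule of thumb. */
enum copy_method { COPY_MEMCPY, COPY_MEASURE, COPY_QDMA };

/* Estimated thresholds in 32-bit words; tune after benchmarking. */
#define MEMCPY_BELOW_WORDS 50u    /* below this, use memcpy */
#define QDMA_ABOVE_WORDS   100u   /* above this, use QDMA */

/* Pick a copy method from the transfer length in 32-bit words. */
static enum copy_method choose_copy_method(size_t words)
{
    /* sanity check on the configured thresholds */
    assert(MEMCPY_BELOW_WORDS <= QDMA_ABOVE_WORDS);

    if (words < MEMCPY_BELOW_WORDS)
        return COPY_MEMCPY;
    if (words > QDMA_ABOVE_WORDS)
        return COPY_QDMA;
    return COPY_MEASURE;   /* 50 to 100 words: benchmark both */
}
```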

    You may find QDMA faster than memcpy or optimized loops for external memories that benefit from optimized longer transfer commands.

    In the simple case in your application, you will want to post the QDMA as early as possible in your thread so you can do as many instructions as possible before starting to poll for completion. For memcpy in that case, you will want to wait as long as possible so the data will be as fresh as possible in the cache.

    The best case is to use ping-pong buffers so you can be transferring one buffer's data while processing the other buffer's data. This can virtually eliminate the time spent on the transfers. In many applications, the EDMA can be set up to do the transfers once a buffer has filled and to send an interrupt that starts the processing on that buffer, even while the next buffer is filling. This is one of the biggest opportunities for performance improvement in DSP applications.
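
    The ping-pong control flow can be sketched in portable C as follows. This is a simplified model, not EDMA code: start_transfer and wait_transfer are hypothetical stand-ins, and here the "transfer" is a synchronous memcpy, so nothing actually overlaps. On hardware, start_transfer would post the DMA and return at once, and wait_transfer would check a completion flag or be replaced by the completion interrupt.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_WORDS 8   /* words per buffer; arbitrary for this sketch */

/* Stand-in for posting a DMA transfer (synchronous memcpy here). */
static void start_transfer(int *dst, const int *src, size_t words)
{
    memcpy(dst, src, words * sizeof(int));
}

/* Stand-in for waiting on the DMA completion flag or interrupt. */
static void wait_transfer(void)
{
    /* nothing to wait for: the memcpy above has already finished */
}

/* Example processing step: sum the words of one block. */
static long process_block(const int *buf, size_t words)
{
    long sum = 0;
    size_t i;

    for (i = 0; i < words; i++)
        sum += buf[i];
    return sum;
}

/* Process `blocks` consecutive blocks from `stream`, fetching block
   n+1 into one buffer while block n is processed from the other. */
static long process_stream(const int *stream, size_t blocks)
{
    int buf[2][BLOCK_WORDS];
    long total = 0;
    size_t n;
    int active = 0;

    if (blocks == 0)
        return 0;

    start_transfer(buf[active], stream, BLOCK_WORDS);  /* prime "ping" */
    for (n = 0; n < blocks; n++) {
        wait_transfer();               /* block n is now in buf[active] */
        if (n + 1 < blocks)            /* fetch block n+1 into the idle buffer */
            start_transfer(buf[active ^ 1],
                           stream + (n + 1) * BLOCK_WORDS, BLOCK_WORDS);
        total += process_block(buf[active], BLOCK_WORDS);
        active ^= 1;                   /* swap ping and pong */
    }
    return total;
}
```

    The key point is the ordering inside the loop: the next transfer is posted before the current block is processed, so with a real DMA the copy time hides behind the processing time.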

    Regards,
    RandyP

  • Yes, you must wait for the DMA transfer to finish, so you should run the DMA in parallel with other work.

  • David,

    Is your question answered, for now? Or do you have more you would like to hear?

    If you run some of the benchmarks above, please post your results here for others to understand.

    Regards,
    RandyP

  • Should I understand from the last part of this description that if I call memcpy on aligned 1MB of memory, interrupts will be disabled through the duration of the copy because it will internally select strasg? 

    I am using C64x+ (C6455, C6437), C6740, C6672

  • Yaels,

    The last part of what description?

    There is no C6437 device name and no C6740 device name. The C674x family of devices and the C6672 do not have the C64x+ core; they use new DSP cores that include the C64x+ features but may have updated software libraries.

    For the runtime-support library that you are using or plan to use, you can look at the source code for memcpy to see what it does.

    Regards,
    RandyP