Codec Engine DMA performance



Hi,

After evaluating some CE examples without DMA, I've started testing them with it. For this purpose I'm evaluating the video1_copy example, but I cannot see any performance improvement from using the DMA resource. Here's a comparison table of the first two frames after executing:

CE_DEBUG=1 CE_DSP0TRACE="ti.sdo.ce.VISA=5" ./app_remote.xv5T -s xe674

IOFRAMESIZE [bytes]: 1024
DSP [MHz]: 800

With DMA:

Event       Ticks [hex]  Ticks [dec]  Delta [ticks]  Time [us]  Stage
enter VISA  26c6fcd      40660941                               encoder frame0
exit VISA   26e87fc      40798204     137263         171.58
enter VISA  278afb3      41463731     665527         831.91     decoder
exit VISA   27a7d3e      41581886     118155         147.69
enter VISA  285c67d      42321533     739647         924.56     encoder frame1
exit VISA   287cefe      42454782     133249         166.56
enter VISA  290d9f7      43047415     592633         740.79     decoder
exit VISA   292a636      43165238     117823         147.28

Without DMA:

Event       Ticks [hex]  Ticks [dec]  Delta [ticks]  Time [us]  Stage
enter VISA  2e4b84f      48543823                               encoder frame0
exit VISA   2e69ed9      48668377     124554         155.69
enter VISA  2f1edd7      49409495     741118         926.40     decoder
exit VISA   2f3cca7      49532071     122576         153.22
enter VISA  3025a33      50485811     953740         1192.18    encoder frame1
exit VISA   3045643      50615875     130064         162.58
enter VISA  30f8d1f      51350815     734940         918.68     decoder
exit VISA   31163f9      51471353     120538         150.67
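
For reference, the time values in the table follow from dividing each tick delta by the DSP clock in MHz. The helper below is only a sketch of that arithmetic: the name `ticks_to_us` is made up for illustration, and it assumes the trace counter increments once per CPU cycle at 800 MHz and does not wrap between the two readings.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed time in microseconds between two tick-counter readings.
 * Assumption: the counter runs at the CPU clock (cpu_mhz ticks per
 * microsecond) and did not wrap between the two readings. */
double ticks_to_us(uint32_t start_ticks, uint32_t end_ticks, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    assert(end_ticks >= start_ticks);
    return (double)(end_ticks - start_ticks) / cpu_mhz;
}
```

For the first encoder call with DMA, `ticks_to_us(0x26c6fcd, 0x26e87fc, 800.0)` gives 137263 / 800, the 171.58 us listed in the table.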

In each case the enter/exit VISA times are almost the same (roughly 150 to 170 us per call). I've double-checked the output of the DMA example with CE_DEBUG=3 and everything seems to work fine: I can see the DMAN3 and ACPY3 calls as expected. What could be missing?

CE 3.22.01.06
EZSDK 5.04.00.11
dm816x evm

Regards,
gaston

  • Gaston,

    I asked around regarding your question and the short answer is that DMA doesn't help much unless you perform the memory movement in parallel with CPU processing. Here is the longer answer.


    The bottleneck is typically the EMIF. It doesn't matter whether you access external memory using the DMA or the CPU; the external memory accesses can only go so fast.

    Codecs typically get a performance lift from the DMA when they run this loop:
        1. DMA a subset of the input data buffer into internal memory
        2. Process it from that [fast] internal memory
        3. DMA the results from internal mem to the output buffer in external memory
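
    The three steps above can be sketched roughly as follows. This is a host-side illustration, not codec code: `dma_copy_blocking` is a hypothetical stand-in simulated with `memcpy` (on the DSP it would be an ACPY3 transfer followed by a wait), the static arrays stand in for internal memory, and `TILE_BYTES` and `process_tile` are made-up placeholders.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define TILE_BYTES 256               /* hypothetical tile size */

    /* Stand-ins for buffers placed in fast internal memory. */
    static uint8_t in_tile[TILE_BYTES];
    static uint8_t out_tile[TILE_BYTES];

    /* Hypothetical blocking DMA transfer, simulated with memcpy. */
    static void dma_copy_blocking(void *dst, const void *src, size_t n)
    {
        memcpy(dst, src, n);
    }

    /* Placeholder "codec" work: invert every byte of the tile. */
    static void process_tile(uint8_t *dst, const uint8_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = (uint8_t)~src[i];
    }

    /* Walk the external buffers one tile at a time. */
    void process_buffer_tiled(uint8_t *out_ext, const uint8_t *in_ext, size_t total)
    {
        assert(out_ext != NULL && in_ext != NULL);
        for (size_t off = 0; off < total; off += TILE_BYTES) {
            size_t left = total - off;
            size_t n = left < TILE_BYTES ? left : TILE_BYTES;

            dma_copy_blocking(in_tile, in_ext + off, n);   /* step 1: external -> internal */
            process_tile(out_tile, in_tile, n);            /* step 2: work on internal memory */
            dma_copy_blocking(out_ext + off, out_tile, n); /* step 3: internal -> external */
        }
    }
    ```

    In this form each transfer still blocks, so the gain comes only from the CPU working on internal memory rather than external memory.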

    Here is another response I received.


    For synchronous raw transfers of large blocks of one-dimensional data to or from external memory, the benefit of DMA over a CPU copy is typically bounded by the DDR/EMIF bandwidth of the target device. For 2D transfers, DMA should perform better than a CPU copy because it eliminates the CPU overhead of offset/skip calculations and similar bookkeeping.
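
    To make the offset/skip overhead concrete, here is a minimal sketch of a 2D block copy done on the CPU. The `Transfer2D` struct and `cpu_copy_2d` are hypothetical names for illustration only; a DMA engine takes comparable parameters (line length, line count, source and destination pitch) once and then steps the addresses in hardware.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical description of a 2D transfer. */
    typedef struct {
        const uint8_t *src;   /* first byte of the source block */
        uint8_t *dst;         /* first byte of the destination block */
        size_t line_bytes;    /* bytes copied per line */
        size_t num_lines;     /* number of lines */
        size_t src_pitch;     /* bytes between starts of source lines */
        size_t dst_pitch;     /* bytes between starts of destination lines */
    } Transfer2D;

    /* CPU version: the per-line address arithmetic below is the
     * offset/skip work that a 2D DMA transfer takes off the CPU. */
    void cpu_copy_2d(const Transfer2D *t)
    {
        assert(t != NULL);
        assert(t->src_pitch >= t->line_bytes && t->dst_pitch >= t->line_bytes);
        for (size_t line = 0; line < t->num_lines; line++) {
            const uint8_t *s = t->src + line * t->src_pitch;
            uint8_t *d = t->dst + line * t->dst_pitch;
            memcpy(d, s, t->line_bytes);
        }
    }
    ```

    For example, copying a 4x3 block out of a frame that is 8 bytes wide uses `line_bytes = 4`, `num_lines = 3`, `src_pitch = 8` and `dst_pitch = 4`.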

    Real performance gains from DMA come from careful scheduling logic that has the CPU do work while DMA transfers run in the background, instead of blocking on the completion of each data transfer.
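
    One common way to arrange that scheduling is ping-pong (double) buffering, sketched below under stated assumptions. `dma_start` and `dma_wait` are hypothetical asynchronous calls, simulated here with an immediate `memcpy` and a no-op so the sketch runs on a host; on the DSP they would map to whatever start/wait pair the DMA library provides. `CH_IN`, `CH_OUT`, `TILE_BYTES` and `process_tile` are likewise made up for illustration.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define TILE_BYTES 256                    /* hypothetical tile size */
    enum { CH_IN = 0, CH_OUT = 1 };           /* hypothetical DMA channels */

    /* Two input and two output tiles, standing in for internal memory. */
    static uint8_t in_buf[2][TILE_BYTES];
    static uint8_t out_buf[2][TILE_BYTES];

    /* Hypothetical async DMA calls. The simulation copies at once;
     * real hardware would copy in the background after dma_start. */
    static void dma_start(int ch, void *dst, const void *src, size_t n)
    {
        (void)ch;
        memcpy(dst, src, n);
    }
    static void dma_wait(int ch) { (void)ch; } /* wait for all transfers on ch */

    /* Placeholder "codec" work: invert every byte of the tile. */
    static void process_tile(uint8_t *dst, const uint8_t *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = (uint8_t)~src[i];
    }

    static size_t tile_len(size_t total, size_t off)
    {
        size_t left = total - off;
        return left < TILE_BYTES ? left : TILE_BYTES;
    }

    void process_buffer_pingpong(uint8_t *out_ext, const uint8_t *in_ext, size_t total)
    {
        assert(out_ext != NULL && in_ext != NULL);
        if (total == 0)
            return;

        int cur = 0;
        dma_start(CH_IN, in_buf[cur], in_ext, tile_len(total, 0)); /* prime first tile */

        for (size_t off = 0; off < total; off += TILE_BYTES) {
            size_t n = tile_len(total, off);
            size_t next = off + TILE_BYTES;

            dma_wait(CH_IN);                       /* current input tile has arrived */
            if (next < total)                      /* prefetch next tile in background */
                dma_start(CH_IN, in_buf[cur ^ 1], in_ext + next, tile_len(total, next));

            process_tile(out_buf[cur], in_buf[cur], n); /* CPU works while DMA runs */

            dma_wait(CH_OUT);                      /* previous write-back has finished */
            dma_start(CH_OUT, out_ext + off, out_buf[cur], n);
            cur ^= 1;                              /* swap ping and pong */
        }
        dma_wait(CH_OUT);                          /* drain the last write-back */
    }
    ```

    The CPU only blocks when a transfer has not finished by the time its data is needed, so whether this helps depends on how the processing time per tile compares with the transfer time.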

    The codec author should be the one to comment on the expected benefit w/DMA enabling.

    I hope this answers your question.

    ~Ramsey