Latency of DMA vs QDMA Channels [C6678, EDMA3]

Hi,

I am using EDMA3 for double-buffering, loading and storing data from/to DDR in the background while the processing code operates on L2 SRAM.
However, I am currently investigating performance issues and found that I am frequently spinning while waiting for the DMA transfers to finish,
so the DMA transfers seem to take longer than the actual processing. As soon as I disable waiting for the DMA transfers, throughput improves by 10%.

Because the bandwidth is not that high (~1.2 GByte/s), I thought latency might be the issue, as the slices I transfer are rather small (~12 kB input, ~96 kB output per iteration) and one iteration takes ~80k cycles.
I am currently using plain DMA channels (as there are situations where I run out of QDMA channels elsewhere). Could those DMA channels have a very high latency (~10 kCycles) between triggering a channel and starting the actual transfer?
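
As a rough cross-check of the numbers above, the following sketch converts the per-iteration figures into an average data rate and shows what share of an iteration a 10 kCycle stall would take. The function names are made up for illustration, and the 1 GHz clock used to read the result as GByte/s is an assumption, not something stated in the post.

```c
#include <stdint.h>

/* Average number of bytes moved per 1000 CPU cycles in one iteration
 * (input slice loaded plus output slice stored). */
static uint32_t bytes_per_kcycle(uint32_t bytes_in, uint32_t bytes_out,
                                 uint32_t iter_cycles)
{
    return (uint32_t)(((uint64_t)bytes_in + bytes_out) * 1000u / iter_cycles);
}

/* Share of one iteration, in per mille, taken by a fixed stall. */
static uint32_t stall_permille(uint32_t stall_cycles, uint32_t iter_cycles)
{
    return (uint32_t)((uint64_t)stall_cycles * 1000u / iter_cycles);
}
```

With the figures from the post, `bytes_per_kcycle(12000, 96000, 80000)` gives 1350, i.e. about 1.35 GByte/s at an assumed 1 GHz, which is the same order as the ~1.2 GByte/s quoted. `stall_permille(10000, 80000)` gives 125, so a 10 kCycle delay per iteration would amount to 12.5% of the iteration.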

Have numbers been published about the latency of DMA / QDMA channels for EDMA3 on the C6678?


Thank you in advance, Clemens

  • Hi,

    The EDMA throughput for various scenarios and test cases is documented in the Throughput Performance Guide for C66x KeyStone Devices:
    http://www.ti.com/lit/sprabk5

    Check the options below to improve EDMA transfer performance:

    • Don’t use the same priority (e.g. Q0) for too many transfers (causes congestion).
    • You can adjust the priority of TC0-2 on the SCR (see the User Guide and QUEPRI).
    • In general, place small transfers at higher priorities.
    • Match ACNT to internal or external bus widths. Align src/dst on 16-byte boundaries.
    • Whenever possible, break long non-real-time transfers into smaller transfers using features like self chaining, with intermediate chaining enabled.
    • Some LLD functions write directly to the EDMA3 parameter RAM. You can achieve better EDMA3 performance by writing an entire PaRAM set at once using EDMA3_DRV_setPaRAM.
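
The alignment and ACNT advice in the list above can be checked before a transfer is submitted. The sketch below is a minimal, host-testable helper. Its names are hypothetical, and reading "match ACNT to the bus width" as "ACNT is a non-zero multiple of the bus width in bytes" is an interpretation, so the bus width is passed in as a parameter rather than hard-coded.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a multiple of alignment (alignment must be non-zero). */
static bool is_aligned(uintptr_t addr, uintptr_t alignment)
{
    return (addr % alignment) == 0;
}

/* True if src and dst are 16-byte aligned and acnt is a non-zero
 * multiple of bus_bytes (assumed interpretation of "matching"). */
static bool transfer_is_well_shaped(uintptr_t src, uintptr_t dst,
                                    uint32_t acnt, uint32_t bus_bytes)
{
    return is_aligned(src, 16) && is_aligned(dst, 16) &&
           acnt != 0 && bus_bytes != 0 && (acnt % bus_bytes) == 0;
}
```

Such a check is cheap enough to leave in debug builds as an assertion in front of the code that fills the PaRAM set.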
  • Hi Pubesh,

    I know the "Throughput Performance Guide", and according to it the throughput targets should be met easily, regardless of the EDMACC/TC used, as the required throughput of all transfers in progress is as low as 1.25 GB/s.

    As throughput hardly seems to be a problem, my suspicion was DMA channel latency. This, however, isn't answered by the Throughput Performance Guide. So I'll repeat my question: how long does it take between writing to the trigger word / firing the event and the beginning of the actual transfer, and is there a huge difference between QDMA and "normal" DMA channels?

    Some notes to your suggestions:

    * I only have 2 transfers in parallel and nobody else touching DDR3, so priorities shouldn't be a problem.
    * ACNT is fairly large (> 256 bytes for input and ~2 kB for output) but with different IDX values, which according to several recommendations is a good scenario (there was a document measuring the influence of IDX & ACNT for old C64 devices).
    * I am using CSL directly, because EDMA3LLD is a pain to configure and, in my personal opinion, bloated. I configure a single PaRAM set entirely in memory and then write it to PaRAM space using something similar to EDMA3_DRV_setPaRAM.
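
The approach in the last note (build the PaRAM set in memory, then write it out in one go) could look roughly like the sketch below. It is not CSL or EDMA3LLD code. `ParamSet`, `make_2d_param` and `write_param_set` are hypothetical names, the struct only mirrors the eight 32-bit words of an EDMA3 PaRAM entry, and OPT is left at zero here although a real transfer needs it filled in (TCC, sync dimension, completion interrupt and so on).

```c
#include <stddef.h>
#include <stdint.h>

/* Model of one 32-byte EDMA3 PaRAM entry (eight 32-bit words). */
typedef struct {
    uint32_t opt;
    uint32_t src;
    uint32_t a_b_cnt;      /* BCNT[31:16]    | ACNT[15:0]    */
    uint32_t dst;
    uint32_t src_dst_bidx; /* DSTBIDX[31:16] | SRCBIDX[15:0] */
    uint32_t link_bcntrld; /* BCNTRLD[31:16] | LINK[15:0]    */
    uint32_t src_dst_cidx; /* DSTCIDX[31:16] | SRCCIDX[15:0] */
    uint32_t ccnt;
} ParamSet;

static uint32_t pack16(uint16_t hi, uint16_t lo)
{
    return ((uint32_t)hi << 16) | lo;
}

/* Build a 2D (ACNT x BCNT) transfer description entirely in memory. */
static ParamSet make_2d_param(uint32_t src, uint32_t dst,
                              uint16_t acnt, uint16_t bcnt,
                              int16_t src_bidx, int16_t dst_bidx)
{
    ParamSet p = {0};
    p.src = src;
    p.dst = dst;
    p.a_b_cnt = pack16(bcnt, acnt);
    p.src_dst_bidx = pack16((uint16_t)dst_bidx, (uint16_t)src_bidx);
    p.link_bcntrld = pack16(0, 0xFFFF); /* LINK = 0xFFFF: null link */
    p.ccnt = 1;
    return p;
}

/* Copy all eight words to the PaRAM entry in ascending order,
 * so the last word (CCNT) is written last. */
static void write_param_set(volatile uint32_t *param_entry, const ParamSet *p)
{
    const uint32_t *w = (const uint32_t *)p;
    for (size_t i = 0; i < sizeof *p / sizeof *w; ++i) {
        param_entry[i] = w[i];
    }
}
```

For example, a 12 kB input slice could be described as ACNT = 256 and BCNT = 48 (the addresses and index values in any such call are illustrative only). Writing the words in ascending order also suits a QDMA channel whose trigger word is the last word of the set.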

    Thank you in advance & best regards, Clemens

  • Clemens

    The primary difference between DMA and QDMA is just the triggering mechanism. With a DMA channel, assuming you are triggering manually, you program the PaRAM set and then do a configuration write to the EDMA ESR register to trigger the transfer. With QDMA, you preprogram one of the PaRAM words to be the trigger word, so when you program the PaRAM set, the write to the trigger word triggers the transfer. At best, you are just going to save a single write to the ESR register in the EDMA MMR (needed for DMA).
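
To make the difference in triggering concrete, here is a small host-side model that only counts configuration writes. It does not touch real EDMA3 registers. `EdmaModel`, `submit_dma` and `submit_qdma` are hypothetical names, and choosing word 7 as the trigger word in the usage note is an assumption for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define PARAM_WORDS 8u

/* Toy model of one PaRAM entry plus the event set register (ESR). */
typedef struct {
    uint32_t param[PARAM_WORDS];
    uint32_t esr;
    uint32_t trigger_word;      /* QDMA: index of the PaRAM trigger word */
    uint32_t cfg_writes;        /* configuration writes issued so far */
    uint32_t transfers_started;
} EdmaModel;

static void write_param_word(EdmaModel *m, uint32_t idx, uint32_t val, bool qdma)
{
    m->param[idx] = val;
    m->cfg_writes++;
    if (qdma && idx == m->trigger_word) {
        m->transfers_started++; /* the trigger-word write starts the transfer */
    }
}

/* DMA channel, manual trigger: program the set, then write the ESR.
 * channel must be below 32 in this model. */
static uint32_t submit_dma(EdmaModel *m, const uint32_t set[PARAM_WORDS],
                           uint32_t channel)
{
    for (uint32_t i = 0; i < PARAM_WORDS; ++i) {
        write_param_word(m, i, set[i], false);
    }
    m->esr |= (uint32_t)1u << channel;
    m->cfg_writes++;
    m->transfers_started++;
    return m->cfg_writes;
}

/* QDMA channel: programming the set is the trigger. */
static uint32_t submit_qdma(EdmaModel *m, const uint32_t set[PARAM_WORDS])
{
    for (uint32_t i = 0; i < PARAM_WORDS; ++i) {
        write_param_word(m, i, set[i], true);
    }
    return m->cfg_writes;
}
```

Starting from a zeroed model with `trigger_word = 7`, `submit_dma` returns 9 writes and `submit_qdma` returns 8, and each starts exactly one transfer. That one extra ESR write is the entire difference the reply describes.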

    QDMA transfers are best suited to linked lists of transfers, where you set up a number of PaRAM sets and then use the DSP IDMA to write to the trigger words of multiple QDMA channels together to trigger the transfers.

    There shouldn't be 10K cycles of latency between triggering a channel (by writing to the ESR register or by an event) and the transfer actually starting to fetch the read packet from your source memory. If you are seeing high latency, it is likely that either the queue/TC you are using is already servicing a previous TRP, or you are somehow spending more time waiting (polling?) for a previous transfer to finish before triggering a new one.
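
The first explanation (the queue/TC is still busy with an earlier request) can be pictured with a very simplified timing model. It is only a sketch: it treats the queue as strictly serial, ignores any pipelining inside the TC, and `base_latency` is a hypothetical placeholder, not a published figure.

```c
#include <stdint.h>

/* Cycles between triggering a transfer and its first read being issued,
 * if the queue cannot start new work before queue_busy_until. */
static uint64_t apparent_trigger_latency(uint64_t trigger_cycle,
                                         uint64_t queue_busy_until,
                                         uint64_t base_latency)
{
    uint64_t start = (trigger_cycle > queue_busy_until) ? trigger_cycle
                                                        : queue_busy_until;
    return (start - trigger_cycle) + base_latency;
}
```

In this model, a transfer triggered at cycle 1000 on an idle queue sees only `base_latency`. If the queue is busy until cycle 11000, the same transfer appears to take 10000 cycles longer to start, even though the channel type plays no role.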

    Hope this helps some.

    Regards

    Mukul 

  • Hi Mukul,


    Thank you very much for the detailed explanation and for clarifying that the use of plain DMA channels is most likely not the source of our issue. I have some other non-burst load on the DDR (writes issued continuously by DSP0); maybe this is interfering with the DMA in some way, so I'll investigate in that direction.


    Again, thanks a lot!


    Regards, Clemens