
C66x DMA transfers

Hi All,

I am interested in the C66x processor for doing some finite element computations. These involve a lot of indirect addressing, which normally maps poorly onto DMA because it produces many small, scattered transfers. I have found a paper on how this was solved on the Cell processor:

http://www.hpc.lsu.edu/training/tutorials/sc10/papers/SC10/pdfs/pap253s4.pdf

Does anyone know whether the above method could apply to the C66x? I have searched for documents explaining how to do efficient DMA transfers on the C66x and what its alignment constraints are, but unfortunately I have not been able to find much.
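
For concreteness, the access pattern is essentially an index-table-driven gather, something like this (all the names here are just illustrative):

    #include <stddef.h>

    #define NODES_PER_ELEMENT 4   /* e.g. linear tetrahedra; illustrative only */

    /* Gather nodal field values through a connectivity (index) table -- the
     * indirect-addressing pattern that turns into many small, scattered reads. */
    void gather_elements(const double *nodal_field,
                         const int    *connectivity,  /* num_elements * NODES_PER_ELEMENT */
                         double       *element_buf,   /* NODES_PER_ELEMENT of scratch     */
                         size_t        num_elements)
    {
        for (size_t e = 0; e < num_elements; e++) {
            for (int n = 0; n < NODES_PER_ELEMENT; n++) {
                element_buf[n] = nodal_field[connectivity[e * NODES_PER_ELEMENT + n]];
            }
            /* ... per-element computation on element_buf ... */
        }
    }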

Any help is appreciated! Thanks in advance.

  • EDMA3 itself has no hard alignment constraints (some peripherals do), but obviously transfers aligned to the EDMA3 interface width will be more efficient.  SPRABK5 describes the interconnect fairly well (in Figure 1) -- illustrating why EDMA0 (sometimes called TPCC0, which issues transfers to EDMA [or TP] TC0 and TC1) is strongly recommended for memory-to-memory transfers, and showing that 32-byte-aligned transfers will be optimal for it.
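
    As a rough illustration, one 1D memory-to-memory copy is described by a single 8-word PaRAM entry, along the lines of the sketch below.  The PaRAM base address and the entry/channel numbering are assumptions for illustration only -- check your device's data manual, or let the EDMA3 LLD do this for you.

        #include <stdint.h>

        #define EDMA0_PARAM_BASE  0x02704000u   /* assumption: TPCC0 PaRAM base, verify per device */
        #define PARAM_ENTRY(n)    ((volatile uint32_t *)(EDMA0_PARAM_BASE + (n) * 32u))

        /* Fill one PaRAM entry for a 1D copy of nbytes (must fit in the 16-bit
         * ACNT field) from src to dst; both ideally 32-byte aligned.           */
        static void setup_memcpy_param(unsigned entry, uint32_t src, uint32_t dst, uint32_t nbytes)
        {
            volatile uint32_t *p = PARAM_ENTRY(entry);

            p[0] = (1u << 20)                   /* OPT: TCINTEN, completion interrupt */
                 | ((entry & 0x3Fu) << 12)      /*      TCC = entry number            */
                 | (1u << 3);                   /*      STATIC (no link/reload)       */
            p[1] = src;                         /* SRC                                */
            p[2] = (1u << 16) | nbytes;         /* BCNT = 1, ACNT = nbytes            */
            p[3] = dst;                         /* DST                                */
            p[4] = 0;                           /* SRCBIDX / DSTBIDX (unused for 1D)  */
            p[5] = 0xFFFFu;                     /* LINK = NULL, BCNTRLD = 0           */
            p[6] = 0;                           /* SRCCIDX / DSTCIDX                  */
            p[7] = 1;                           /* CCNT = 1                           */
            /* The channel mapped to this entry is then triggered via ESR (manual
             * trigger) or chaining -- see the EDMA3 user guide or the LLD.           */
        }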

    I have not looked at DMA performance specifically on the C66x processors, but on a previous-generation chip it took hundreds of cycles to program a single EDMA3 descriptor.  The performance benchmarks in SPRABK5 gloss over the actual size of the DMA programming overhead.  Assuming that latency is still significant, it can be reduced and moved into the background by building descriptors in core-local memory and using IDMA0 to ship them to the desired EDMA3 unit.  It should also be possible to use EDMA3 to load its own descriptors, which would make more sense for large numbers of elements -- at least if each element only goes to one core.
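
    To make the "EDMA3 loading its own descriptors" idea concrete, one common pattern for many small elements is a gather list built out of linked PaRAM entries with transfer-completion chaining, so a single trigger walks the whole list in the background.  The sketch below is heavily simplified; the base address, the free PaRAM entry range, and the channel-to-entry mapping are all assumptions (the EDMA3 LLD normally handles this bookkeeping).

        #include <stdint.h>
        #include <string.h>

        #define EDMA0_PARAM_BASE  0x02704000u   /* assumption: verify for your device */
        #define PARAM_ADDR(n)     (EDMA0_PARAM_BASE + (n) * 32u)

        /* Image of one 8-word PaRAM entry, staged in core-local (L2) memory. */
        typedef struct {
            uint32_t opt, src, a_b_cnt, dst;
            uint32_t src_dst_bidx, link_bcntrld, src_dst_cidx, ccnt;
        } param_image_t;

        /* Build a gather list: copy elem_bytes from each scattered source address
         * into a packed destination buffer, one linked PaRAM entry per element.   */
        static void build_gather_list(param_image_t *img,      /* n_elem images in L2    */
                                      const uint32_t *src_list, uint32_t dst,
                                      uint32_t elem_bytes, uint32_t n_elem,
                                      uint32_t first_entry,    /* first free PaRAM entry */
                                      uint32_t chan)           /* DMA channel being used */
        {
            for (uint32_t i = 0; i < n_elem; i++) {
                uint32_t last = (i == n_elem - 1u);
                img[i].opt = ((chan & 0x3Fu) << 12)            /* TCC = our own channel     */
                           | (last ? (1u << 20)                /* last: interrupt (TCINTEN) */
                                   : (1u << 22));              /* else: chain     (TCCHEN)  */
                img[i].src          = src_list[i];
                img[i].a_b_cnt      = (1u << 16) | elem_bytes; /* BCNT = 1, ACNT = bytes    */
                img[i].dst          = dst + i * elem_bytes;
                img[i].src_dst_bidx = 0;
                img[i].link_bcntrld = last ? 0xFFFFu           /* NULL link terminates      */
                                           : (PARAM_ADDR(first_entry + i + 1u) & 0xFFFFu);
                img[i].src_dst_cidx = 0;
                img[i].ccnt         = 1;
            }

            /* Ship the staged images into real PaRAM space.  A plain copy works;
             * IDMA channel 0 (or another EDMA channel) could do it in the background. */
            memcpy((void *)PARAM_ADDR(first_entry), img, n_elem * sizeof(param_image_t));

            /* Finally, map 'chan' to first_entry (or copy img[0] into the channel's
             * own entry) and set the channel's bit in ESR once; chaining then walks
             * the rest of the list with no further CPU involvement.                   */
        }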

    The other major thing to note is that the numbers of DMA descriptors (PaRAM entries) and QDMA channels are fixed per EDMA3 unit; depending on the connectivity of mesh elements and the number of cores you are using, those may become bottlenecks.

    (There are also peripherals that use Packet DMA; I am not familiar enough with those to say whether they could be exploited to help that kind of processing.)

  • The EDMA3 probably would not be ideal for the finite element computations described in the paper, assuming the locations are fairly randomized.  It is great at structured data, even with multiple dimensions of rows/columns and strided or skipped indexing, but not so great for randomly located elements.
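
    For contrast, the kind of access it handles very well is regular striding: for example, pulling an n_rows x row_bytes sub-block out of a larger 2-D array with a single PaRAM entry.  A rough sketch, with the field packing per the EDMA3 user guide and TCC left at 0 for brevity:

        #include <stdint.h>

        /* Rough sketch: one AB-synchronized PaRAM entry that pulls an
         * n_rows x row_bytes sub-block out of a 2-D source array with a line
         * pitch of src_pitch bytes into a packed destination buffer.          */
        static void setup_subblock_param(volatile uint32_t *p,   /* points at a PaRAM entry */
                                         uint32_t src, uint32_t dst,
                                         uint16_t row_bytes, uint16_t n_rows,
                                         int16_t  src_pitch)
        {
            p[0] = (1u << 20) | (1u << 3) | (1u << 2);     /* TCINTEN, STATIC, AB-sync      */
            p[1] = src;                                    /* top-left corner of sub-block  */
            p[2] = ((uint32_t)n_rows << 16) | row_bytes;   /* BCNT = rows, ACNT = bytes/row */
            p[3] = dst;
            p[4] = ((uint32_t)row_bytes << 16)             /* DSTBIDX: rows packed tightly  */
                 | (uint16_t)src_pitch;                    /* SRCBIDX: source line pitch    */
            p[5] = 0xFFFFu;                                /* no link                       */
            p[6] = 0;                                      /* CIDX unused                   */
            p[7] = 1;                                      /* CCNT = 1                      */
        }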

    What would be a good fit for this is the Multicore Navigator, which is on the KeyStone architecture (C66x) devices.  Using Host Packet Descriptors to point to the various elements is more likely what you want to do.
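
    Purely as a schematic illustration of that idea (the actual host packet descriptor layout, queue manager setup, and PKTDMA flow configuration are described in SPRUGR9 and handled by the QMSS/CPPI LLDs in the PDK; the struct below is a deliberately simplified stand-in, not the real register-level format):

        #include <stdint.h>
        #include <stddef.h>

        /* Simplified stand-in for a host packet descriptor: each one points at a
         * scattered finite-element record and links to the next, so the packet
         * DMA can stream the whole gather list without CPU copies.             */
        typedef struct host_desc {
            uint32_t           buf_len;   /* bytes at buf_ptr                 */
            void              *buf_ptr;   /* one scattered element in memory  */
            struct host_desc  *next;      /* next descriptor, NULL at the end */
        } host_desc_t;

        /* Chain pre-allocated descriptors over a list of scattered elements.
         * In the real flow the head would then be pushed onto a hardware
         * transmit queue and the Navigator hardware takes it from there.      */
        static host_desc_t *build_chain(host_desc_t *pool, void *const *elems,
                                        const uint32_t *lens, size_t n)
        {
            for (size_t i = 0; i < n; i++) {
                pool[i].buf_len = lens[i];
                pool[i].buf_ptr = elems[i];
                pool[i].next    = (i + 1u < n) ? &pool[i + 1] : NULL;
            }
            return (n > 0) ? &pool[0] : NULL;
        }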

    Here's a link to the Multicore Navigator User Guide (SPRUGR9), and a bit of online training for it can be found here as well.

    Best Regards,

    Chad