EDMA API performance information.

The information found in the Throughput Performance Guide for C66x KeyStone Devices (Rev. A) is good, but the story is incomplete from a programmer's perspective.

Moving forward, there are two APIs intended for library and application programmers to program the EDMA engine. The particular use case I'm discussing here is a common one: moving data between DDR and the levels of cache/SRAM while simultaneously computing on another portion that is already in cache/SRAM. There are three 'levels' of support for the EDMA unit: the CSL headers, the LLD, and the ECPY API (currently misplaced in MCSDK-video instead of MCSDK). Website documentation on these APIs is sporadic and inconsistent, but there are some authoritative and useful presentations floating around on their usage.
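For readers who have not seen the pattern, here is a minimal sketch of that use case (double buffering, also called ping-pong) in portable C. It is an illustration under stated assumptions, not TI code: `xfer_start`, `xfer_wait` and `sum_double_buffered` are hypothetical names, and the "DMA" is a blocking `memcpy` stand-in, so nothing actually overlaps here. On the device, the start/wait pair would wrap whichever EDMA layer is chosen, and `fast` would point at L1/L2 SRAM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an asynchronous DMA channel. The "transfer" is a
 * blocking memcpy so the control flow can be compiled and run anywhere. */
typedef struct { int pending; } xfer_t;

static void xfer_start(xfer_t *x, float *dst, const float *src, size_t count)
{
    memcpy(dst, src, count * sizeof *dst);
    x->pending = 1;
}

static void xfer_wait(xfer_t *x)
{
    /* Real code would poll or block on a completion event here. */
    x->pending = 0;
}

/* Placeholder compute kernel: sum one tile that sits in fast memory. */
static float sum_tile(const float *tile, size_t count)
{
    float acc = 0.0f;
    for (size_t i = 0; i < count; ++i)
        acc += tile[i];
    return acc;
}

/* Sum n floats from slow memory using 2 * tile floats of fast scratch.
 * While tile k is being summed, tile k+1 is (conceptually) in flight. */
float sum_double_buffered(const float *slow, size_t n, float *fast, size_t tile)
{
    assert(tile > 0);

    float *buf[2] = { fast, fast + tile };
    xfer_t x[2] = { {0}, {0} };
    float total = 0.0f;
    size_t off = 0;
    size_t cnt = n < tile ? n : tile;
    int cur = 0;

    if (cnt == 0)
        return 0.0f;

    /* Prime the pipeline with the first tile. */
    xfer_start(&x[cur], buf[cur], slow, cnt);

    while (off < n) {
        size_t next_off = off + cnt;
        size_t next_cnt = 0;

        /* Kick off the next transfer into the other buffer before computing. */
        if (next_off < n) {
            next_cnt = (n - next_off) < tile ? (n - next_off) : tile;
            xfer_start(&x[cur ^ 1], buf[cur ^ 1], slow + next_off, next_cnt);
        }

        /* Wait for the current tile, then compute on it. */
        xfer_wait(&x[cur]);
        total += sum_tile(buf[cur], cnt);

        off = next_off;
        cnt = next_cnt;
        cur ^= 1;
    }
    return total;
}
```

The choice of `tile` is exactly the knob the rest of this post is about: too small and the per-transfer setup and completion overhead dominates, too large and the two buffers no longer fit in the target SRAM level.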

Proper high-performance algorithm design requires careful resource planning: the performance curves of the underlying hardware and software layers directly drive changes in the algorithm. Citing speeds and feeds is fine for high-level architecture, but performance is in the details. So, without further verbiage, here's what would have been helpful to my team during the implementation of a high-performance BLAS for the C6678.

  • Benchmark curves for EDMA transfers between DDR3 and the various levels of SRAM.
    • CSL, LLD and ECPY implementations
    • Transfer size from 4B to 100% capacity of SRAM level
    • Transfer segment size from 4B to 100% (one large vs N small chained)
    • Bandwidth (including all overheads)
    • Latency (roundtrip/2, including all overheads)
    • Preconfigured vs mailbox vs full PIO triggering
    • Completion notification (PIO vs interrupt status vs interrupt status + mailbox)
    • Simultaneous computation in L1
    • Simultaneous staggered transfers
    • Simultaneous multi-level transfers (DDR->L2, L2->L1 instead of DDR->L1)
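To pin down what I mean by the bandwidth and latency figures and the size sweep in the list above, here is a rough sketch of the arithmetic in C. The function names are hypothetical, the cycle counts are assumed to come from whatever timer the benchmark uses (started before the transfer is set up and stopped after completion is observed), and the power-of-two stepping is just one reasonable way to cover the 4B-to-capacity range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Effective bandwidth in bytes per second, all overheads included:
 * 'cycles' covers setup, triggering, the transfer itself and the
 * completion notification. */
double effective_bandwidth(uint64_t bytes, uint64_t cycles, double clock_hz)
{
    assert(cycles > 0 && clock_hz > 0.0);
    return (double)bytes * clock_hz / (double)cycles;
}

/* One-way latency, estimated as half of a measured round trip. */
double one_way_latency(uint64_t roundtrip_cycles)
{
    return (double)roundtrip_cycles / 2.0;
}

/* Fill 'sizes' with transfer sizes from 4 bytes upward in powers of two,
 * finishing with 'capacity' itself (100% of the SRAM level).
 * Returns the number of entries written, never more than 'max'. */
size_t sweep_sizes(size_t capacity, size_t *sizes, size_t max)
{
    size_t n = 0;
    for (size_t s = 4; s < capacity && n < max; s *= 2)
        sizes[n++] = s;
    if (capacity >= 4 && n < max)
        sizes[n++] = capacity;
    return n;
}
```

Each curve requested above would then be one pass over `sweep_sizes` for a given API layer, triggering method and completion method, recording `effective_bandwidth` and `one_way_latency` at every point.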

This information would be used to decide how to set the optimal transfer size for one's algorithm in order to get the maximum performance.