EDMA API performance information.

Philip Mucci

Other Parts Discussed in Thread: SYSBIOS

The information found in the Throughput Performance Guide for C66x KeyStone Devices (Rev. A) is good, but the story is incomplete from a programmer's perspective.

Moving forward, there are 2 APIs intended for library and application programmers to program the EDMA engine. The particular use case I'm discussing here is a common one: moving memory between DDR and levels of cache/SRAM while simultaneously computing on another portion also in cache/SRAM. There are 3 'levels' of support for the EDMA unit: the CSL headers, the LLD device and the ECPY API (currently misplaced in MCSDK-video instead of MCSDK). Website documentation is sporadic and inconsistent with regard to these APIs, but there are some authoritative and useful presentations floating around on their usage.

Proper high-performance algorithm design requires careful resource planning: the performance curves of the underlying hardware and software layers directly affect how the algorithm should be changed. Citing speeds and feeds is fine for high-level architecture, but performance is in the details. So, without further verbiage, here's what would have been helpful to my team during the implementation of a high-performance BLAS for the C6678.

Benchmark curves for EDMA transfers between DDR3 and various levels of SRAM:
- CSL, LLD and ECPY implementations
- Transfer size from 4B to 100% capacity of SRAM level
- Transfer segment size from 4B to 100% (one large vs N small chained)
- Bandwidth (including all overheads)
- Latency (roundtrip/2, including all overheads)
- Preconfigured vs mailbox vs full PIO triggering
- Completion notification (PIO vs interrupt status vs interrupt status + mailbox)
- Simultaneous computation in L1
- Simultaneous staggered transfers
- Simultaneous multi-level transfers (DDR->L2,L2->L1 instead of DDR->L1)

This information would be used to decide exactly how to set up the optimal transfer size for one's algorithm in order to get the maximum performance.
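As a rough illustration of how such curves could be tabulated, here is a minimal sketch in portable C. It only generates a doubling sweep of transfer sizes and converts a measured cycle count into bandwidth and one-way latency (roundtrip/2, as in the list above). The function names, the doubling step and the `cpu_hz` parameter are assumptions made for this example; the actual EDMA transfer and cycle measurement are left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill sizes[] with a doubling sweep from min_bytes up to capacity_bytes.
 * The final entry is clamped to capacity_bytes so the 100% point is always
 * included (if max_entries allows). Returns the number of entries written. */
size_t sweep_sizes(size_t min_bytes, size_t capacity_bytes,
                   size_t *sizes, size_t max_entries)
{
    size_t n = 0;
    size_t s = min_bytes;

    if (min_bytes == 0 || capacity_bytes < min_bytes)
        return 0;

    while (n < max_entries) {
        sizes[n++] = s;
        if (s == capacity_bytes)
            break;
        /* Compare against capacity/2 to avoid overflow when doubling. */
        s = (s > capacity_bytes / 2) ? capacity_bytes : s * 2;
    }
    return n;
}

/* Bandwidth in MB/s (1 MB = 1e6 bytes) for `bytes` moved in `cycles`
 * CPU cycles, all overheads included in the cycle count. */
double bandwidth_mb_per_s(size_t bytes, uint64_t cycles, double cpu_hz)
{
    if (cycles == 0)
        return 0.0;
    return ((double)bytes * cpu_hz) / ((double)cycles * 1.0e6);
}

/* One-way latency in nanoseconds, estimated as half the round trip. */
double one_way_latency_ns(uint64_t roundtrip_cycles, double cpu_hz)
{
    return ((double)roundtrip_cycles / 2.0) * (1.0e9 / cpu_hz);
}
```

A benchmark loop would call `sweep_sizes(4, capacity, ...)` once per SRAM level, time one transfer per size with whichever API is under test (CSL, LLD or ECPY), and feed the cycle counts to the two conversion helpers.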

over 12 years ago

Clement FR over 12 years ago

Genius 4750 points

That would be very helpful indeed. (subscribing to this post)

dzhou over 12 years ago in reply to Clement FR

TI__Genius 9065 points

Philip,

Thanks for all the good suggestions. I just want to acknowledge that we will be working on this. Since we are multitasking, it will probably take us a couple of weeks to come back to you with meaningful results. I will keep you posted on the progress.

best regards,

David Zhou

Philip Mucci over 12 years ago in reply to dzhou

Prodigy 185 points

Hi David,

Thanks. I'd also like to point out some terribly unclear and outdated stuff in the publicly available documentation (aka Google). Nearly all of it is on the processors.wiki.ti.com site.

http://processors.wiki.ti.com/index.php/Programming_the_EDMA3_using_the_Low-Level_Driver_(LLD)

This page should be revised immediately. It is the FIRST hit on Google and talks about how one should be using ACPY, which is long deprecated.

http://processors.wiki.ti.com/index.php/Dma_overview

This page, aside from being terribly confusing, is also quite dated. It also greatly confuses the requirements for just running the LLD or ECPY API. It should state quite clearly that SYSBIOS is NOT REQUIRED to use ECPY or LLD, and simply that the RM functions help arbitrate access to the EDMA resources. This works just as well on bare metal.

http://processors.wiki.ti.com/index.php/Framework_Components_DMAN3/ACPY3_Users_Guide

This page, while properly acknowledging that ACPY3 is deprecated, does not link to anything useful for ECPY. If you need these APIs, no one (really) cares about RMAN, which it links to. That's an artifact of the implementation and just a dependency that needs to be satisfied at link time. This page should directly link to ECPY documentation and tutorials.

http://processors.wiki.ti.com/index.php/Framework_Components_FAQ

Also, outdated.

Lastly, doing a Google search on ECPY DMA, you get nothing but a header file...

http://www.google.com/search?client=safari&rls=en&q=EDMA+ACPY&ie=UTF-8&oe=UTF-8#client=safari&rls=en&sclient=psy-ab&q=ECPY+DMA&oq=ECPY+DMA&gs_l=serp.3...3126.4056.3.4096.4.4.0.0.0.0.84.188.3.3.0...0.0...1c.1.17.psy-ab.ToSKbWqLMf0&pbx=1&bav=on.2,or.r_qf.&bvm=bv.47810305,d.eWU&fp=e744396e7d184d63&biw=1440&bih=764

dzhou over 12 years ago in reply to Philip Mucci

TI__Genius 9065 points

Philip,

I apologize for the delay. The key engineer working on this is on a 3-week vacation and won't be back in the office until 07/22. But I want to assure you we will be looking into this, and thanks so much for your feedback!

regards,

David

Philip Mucci over 12 years ago in reply to dzhou

Prodigy 185 points

No problem. This information will be very useful indeed.

Xiaohui Li over 12 years ago in reply to Philip Mucci

TI__Intellectual 1870 points

Hi Philip,

Could you please elaborate on the following:

Preconfigured vs mailbox vs full PIO triggering
Completion notification (PIO vs interrupt status vs interrupt status + mailbox)
Simultaneous staggered transfers

Thanks,

Xiaohui

Philip Mucci over 12 years ago in reply to Xiaohui Li

Prodigy 185 points

Hi Xiaohui,

Sure.

Point 1 has to do with initiating transfers. There are a number of ways to trigger a transfer on the EDMA hardware as I understand it. One can write all the registers directly; one can set up a memory location that triggers a transfer when written to; and one can 'preconfigure' a transfer and then just write the last control register that sets it loose. Understanding the latencies of each is important. The difference between the first and the last, for example, is precisely the latency of writing the control registers in the Paramset. The second approach does sound fast, but what are the latency implications? I probably have my details a bit wrong, but the idea is to explore the performance space of the various ways of initiating a transfer. (Consider the direct vs non-direct LLD functions...)
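To make the comparison concrete, here is a minimal sketch of a timing harness for trigger latency, written as portable C. Everything in it is an assumption for illustration: `min_trigger_cycles`, the `trigger_fn` and `cycle_counter_fn` types, and the `fake_*` host stand-ins are hypothetical names, and no real LLD or CSL call is shown. On the device, each trigger method would be wrapped in its own `trigger_fn`, and the counter would presumably read the core's time-stamp counter.

```c
#include <stdint.h>

/* Caller-supplied cycle counter and trigger operation (both hypothetical). */
typedef uint64_t (*cycle_counter_fn)(void);
typedef void (*trigger_fn)(void *ctx);

/* Time `trigger` `iters` times and return the smallest observed cost in
 * cycles. Taking the minimum filters out interference from interrupts or
 * cache misses. Returns UINT64_MAX if iters is 0. */
uint64_t min_trigger_cycles(trigger_fn trigger, void *ctx,
                            cycle_counter_fn now, unsigned iters)
{
    uint64_t best = UINT64_MAX;
    unsigned i;

    for (i = 0; i < iters; i++) {
        uint64_t t0 = now();
        uint64_t t1;
        trigger(ctx);
        t1 = now();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}

/* Host-side stand-ins so the sketch runs without hardware. The simulated
 * cost is whatever the caller passes in through ctx; it is NOT a measured
 * value for any real trigger method. */
uint64_t fake_clock = 0;

uint64_t fake_now(void)
{
    return fake_clock;
}

void fake_trigger(void *ctx)
{
    fake_clock += *(const uint64_t *)ctx;  /* advance by the simulated cost */
}
```

Running the same harness over a full-PIO trigger, a mailbox-style trigger and a preconfigured trigger would give three directly comparable numbers.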

Point 2 has to do with completions. There are a number of approaches, right? One can set up an interrupt and just poll the ISR. One can set up a real handler that writes a memory location and poll that location. One can also poll the secondary ISRs... There seem to be two tiers of them on this device. Polling the ISR register is horribly dog slow as far as we're concerned. We'd like to know what the fastest way to check completions is. Polling a mailbox might be good, but not if the interrupt routine has a really long dispatch latency.
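The mailbox variant (a handler writes a memory location and the compute loop polls it) can be sketched in portable C as follows. The names `mailbox_t`, `completion_handler` and `poll_mailbox` are hypothetical, and the handler here is an ordinary function standing in for a real interrupt routine. On an actual device the mailbox would also have to live in memory that both sides see coherently, which this sketch does not address.

```c
#include <stdint.h>

/* Memory location written by the completion handler and read by the
 * poller. volatile forces a fresh load on every poll. */
typedef volatile uint32_t mailbox_t;

/* Stand-in for the interrupt handler: record which transfer finished. */
void completion_handler(mailbox_t *mbox, uint32_t transfer_id)
{
    *mbox = transfer_id;
}

/* Spin until the mailbox holds transfer_id or max_spins polls have been
 * made. Returns the number of unsuccessful polls before the match
 * (0 means the transfer had already completed), or -1 on timeout. */
long poll_mailbox(mailbox_t *mbox, uint32_t transfer_id, long max_spins)
{
    long spins;

    for (spins = 0; spins < max_spins; spins++) {
        if (*mbox == transfer_id)
            return spins;
    }
    return -1;
}
```

The returned spin count, combined with the cost of one poll, gives a rough measure of how long completion took to become visible, which is the quantity that suffers when the handler is dispatched late.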

Point 3 has to do with setting up simultaneous transfers. We wrote tests to cover the entire design space to figure out how many simultaneous transfers of what size would give us the least latency and best throughput. Numerous docs indicate that for large transfers, it's best to set up a number of small staggered (or even linked) transfers on a number of channels. So the question here is: what size should the transfers be, how many should be queued, and across how many channels?
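The design space in question can be enumerated with a small planning helper like the one below, a sketch in portable C under stated assumptions. `chunk_t` and `plan_chunks` are hypothetical names, and round-robin channel assignment is just one simple policy chosen for the example, not something the documentation prescribes. The helper only computes offsets, lengths and channel numbers; submitting each chunk to the EDMA is left to the API under test.

```c
#include <stddef.h>

/* One piece of a large transfer and the channel it is assigned to. */
typedef struct {
    size_t   offset;   /* byte offset into the source buffer */
    size_t   length;   /* bytes in this chunk (last one may be short) */
    unsigned channel;  /* channel index in [0, num_channels) */
} chunk_t;

/* Split total_bytes into chunks of at most chunk_bytes, assigning them to
 * channels round-robin. Writes at most max_chunks entries to out[] and
 * returns the number written. */
size_t plan_chunks(size_t total_bytes, size_t chunk_bytes,
                   unsigned num_channels, chunk_t *out, size_t max_chunks)
{
    size_t n = 0;
    size_t offset = 0;

    if (chunk_bytes == 0 || num_channels == 0)
        return 0;

    while (offset < total_bytes && n < max_chunks) {
        size_t remaining = total_bytes - offset;

        out[n].offset  = offset;
        out[n].length  = remaining < chunk_bytes ? remaining : chunk_bytes;
        out[n].channel = (unsigned)(n % num_channels);
        offset += out[n].length;
        n++;
    }
    return n;
}
```

Sweeping `chunk_bytes` and `num_channels` over a grid and timing each resulting plan would produce the size/queue-depth/channel-count curves asked about above.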

Hope this helps.
