EDMA3 Performance in different memories

Pay Giesselmann

Hi everybody,

Since EDMA is a good tool for designing fast algorithms, I ran some basic tests on the C6657. I configured the DMA to copy a 64 kB array and got some interesting timing results.

Can anybody verify these measurements, not exactly but in general?

DDR3 to L2 ~ 100us
DDR3 to MSMC ~ 27us
L2 to L2 ~ 218us
DDR3 to DDR3 ~ 70us

What I'm wondering is why access to the faster memories such as L2 is slower than access to MSMC or DDR3.
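To compare the four measurements on a common scale, the times can be converted to throughput. The following is only a minimal sketch in C, assuming that the 64 kB array is exactly 65536 bytes and that 1 MB means 10^6 bytes; the helper names `throughput_mb_per_s` and `print_measurements` are made up for illustration.

```c
#include <stdio.h>

/* Throughput in MB/s (1 MB = 1e6 bytes) for a copy of `bytes` bytes that
 * took `time_us` microseconds. Bytes per microsecond is the same as MB/s. */
static double throughput_mb_per_s(double bytes, double time_us)
{
    return bytes / time_us;
}

/* Print the throughput for the four measured copy times quoted above. */
static void print_measurements(void)
{
    const double bytes = 64.0 * 1024.0; /* assumption: 64 kB = 65536 bytes */

    printf("DDR3 to L2:   %.0f MB/s\n", throughput_mb_per_s(bytes, 100.0));
    printf("DDR3 to MSMC: %.0f MB/s\n", throughput_mb_per_s(bytes, 27.0));
    printf("L2 to L2:     %.0f MB/s\n", throughput_mb_per_s(bytes, 218.0));
    printf("DDR3 to DDR3: %.0f MB/s\n", throughput_mb_per_s(bytes, 70.0));
}
```

Under these assumptions the measured times correspond to roughly 655, 2427, 301 and 936 MB/s respectively.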

best regards

Pay Gießelmann

over 13 years ago

0 Chad Courtney over 13 years ago

TI__Mastermind 30825 points

All those numbers look bad, even the DDR3 to MSMC number. Please see the Throughput Performance Guide for C66x Devices App Note for performance numbers.

Best Regards,
Chad

0 Pay Giesselmann over 13 years ago in reply to Chad Courtney

Prodigy 250 points

Thank you Chad,

so, what might I be doing wrong? In the case of MSMC to DDR3 I get only about 2500 MB/s, roughly 1/4 of what I could expect. I configured one DMA channel of EDMA instance 2 on the C6657 as AB-synchronized.

As a next step, which I tried just now, I configured 2 DMA channels to copy one array (upper and lower half); this gives me double the throughput. Is this the right way to maximize data throughput?

-edit-

I tried using the RDRATE register in the Transfer Controller to accelerate the transfer. Setting it to 0 and 1 results in the same timing, i.e. I have at least four cycles of latency in every read request. I confirmed that I was writing the right RDRATE register by setting it to 0x2, which increased the time from 25us to 28us. Setting it to 0x3 results in about 53us.

best regards,

Pay Gießelmann

0 Chad Courtney over 13 years ago in reply to Pay Giesselmann

TI__Mastermind 30825 points

Even the slow EDMA1 and EDMA2 which are 128bit & CPU/3 should be able to handle it with one channel, but you could try EDMA0 which is 256bit wide and CPU/2 to make sure.

I'm not sure whether something is off in your EDMA setup or whether it's something else. Can you dump the PaRAM values you're using? Also, how are you measuring this?

Best Regards,

Chad

0 Pay Giesselmann over 13 years ago in reply to Chad Courtney

Prodigy 250 points

Hi,

this is the parameter set I'm using:

a_edmaParameter.m_edmaITCCHEN = 0; // Intermediate transfer completion chaining enable
a_edmaParameter.m_edmaTCCHEN = 0; // Transfer complete chaining enable
a_edmaParameter.m_edmaITCINTEN = 0; // Intermediate transfer completion interrupt enable
a_edmaParameter.m_edmaTCINTEN = 1; // Transfer complete interrupt enable
a_edmaParameter.m_edmaTCC = 0; // Transfer complete code
a_edmaParameter.m_edmaTCCMODE = 0; // Transfer complete code mode 0: normal completion 1:early completion
a_edmaParameter.m_edmaFWID = 0x5; // FIFO width 0 - 5h: 8, 16, 32, 64, 128, 256 bit
a_edmaParameter.m_edmaSTATIC = 1; // Static set
a_edmaParameter.m_edmaSYNCDIM = 1; // Transfer synchronization dimension 0:A 1:AB
a_edmaParameter.m_edmaDAM = 0; // Destination address mode
a_edmaParameter.m_edmaSAM = 0; // Source address mode
a_edmaParameter.m_edmaSRC = (Uint32)t_source; // Source address
a_edmaParameter.m_edmaACNT = 64; // Count 1st dimension
a_edmaParameter.m_edmaBCNT = 1024; // Count 2nd dimension
a_edmaParameter.m_edmaDST = (Uint32)t_destination; // Destination address
a_edmaParameter.m_edmaDSTBIDX = 64; // Destination BCNT index
a_edmaParameter.m_edmaSRCBIDX = 64; // Source BCNT index
a_edmaParameter.m_edmaBCNTRLD = 0; // BCNT reload
a_edmaParameter.m_edmaLINK = 0xFFFFF; // Link address
a_edmaParameter.m_edmaDSTCIDX = 0; // Destination CCNT index
a_edmaParameter.m_edmaSRCCIDX = 0; // Source CCNT index
a_edmaParameter.m_CCNT = 1; // Count 3rd dimension
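For cross-checking against a memory dump, the option fields above can be packed into the 32-bit OPT word. This is a minimal sketch, assuming the OPT bit positions as I understand them from the EDMA3 documentation (SAM bit 0, DAM bit 1, SYNCDIM bit 2, STATIC bit 3, FWID bits 10:8, TCCMODE bit 11, TCC bits 17:12, TCINTEN bit 20, ITCINTEN bit 21, TCCHEN bit 22, ITCCHEN bit 23). The type `EdmaOpt` and the function `edma_pack_opt` are hypothetical names, not part of any TI library.

```c
#include <stdint.h>

/* Option fields of one PaRAM set, named after the parameter set above.
 * Hypothetical helper type, not a TI CSL/LLD structure. */
typedef struct {
    uint32_t itcchen, tcchen, itcinten, tcinten;
    uint32_t tcc, tccmode, fwid, stat, syncdim, dam, sam;
} EdmaOpt;

/* Pack the fields into the OPT word (bit positions are an assumption
 * taken from the EDMA3 documentation, see the text above). */
static uint32_t edma_pack_opt(const EdmaOpt *o)
{
    return ((o->itcchen  & 0x1u)  << 23) |
           ((o->tcchen   & 0x1u)  << 22) |
           ((o->itcinten & 0x1u)  << 21) |
           ((o->tcinten  & 0x1u)  << 20) |
           ((o->tcc      & 0x3Fu) << 12) |
           ((o->tccmode  & 0x1u)  << 11) |
           ((o->fwid     & 0x7u)  <<  8) |
           ((o->stat     & 0x1u)  <<  3) |
           ((o->syncdim  & 0x1u)  <<  2) |
           ((o->dam      & 0x1u)  <<  1) |
           ((o->sam      & 0x1u)  <<  0);
}
```

With the values listed above (TCINTEN = 1, FWID = 0x5, STATIC = 1, SYNCDIM = 1, everything else 0) this yields 0x0010050C. A value read back from PaRAM may additionally have bit 31 (PRIV) set, which this sketch does not model.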

Use of EDMA0 is not supported on the C6657. I measure execution time with Timestamp_get32() in SYS/BIOS. There might be some overhead caused by SYS/BIOS, but comparing the timings with 1, 2 or 4 channels (27 us, 13 us, 7 us) suggests that something is wrong in the configuration.

What I'm wondering about while writing this is that I map all channels to one queue, i.e. they should all be executed by the same TC.

thank you for your help,

best regards

Pay Gießelmann

0 Chad Courtney over 13 years ago in reply to Pay Giesselmann

TI__Mastermind 30825 points

What are you using for the 2 and 4 channel setups?

Can you dump what values you're getting back from the timestamps?

What's the DDR speed you're running at? What's the SYSCLK speed you're running at?

Can you dump the PaRAM values from the memory window when it's setup?

Best Regards,
Chad

0 Pay Giesselmann over 13 years ago in reply to Chad Courtney

Prodigy 250 points

In the 2 and 4 channel setups I use almost the same parameter set: the A dimension stays the same, and the B dimension is 1/2 (1/4) of the whole data block. The start addresses of the channels are base, base + 1/4, base + 1/2 and base + 3/4 of the block (base and base + 1/2 for two channels).
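The split described here can be sketched as follows. This is only an illustration of the address arithmetic, assuming the block divides evenly among the channels; `ChannelSplit` and `split_transfer` are hypothetical names and no EDMA registers are touched.

```c
#include <stdint.h>

/* Per-channel slice of one large AB-synchronized transfer.
 * Hypothetical helper type for illustration only. */
typedef struct {
    uint32_t src;   /* source start address of this slice */
    uint32_t dst;   /* destination start address of this slice */
    uint16_t acnt;  /* A dimension, unchanged for every channel */
    uint16_t bcnt;  /* B dimension, 1/nch of the original BCNT */
} ChannelSplit;

/* Split a transfer of acnt * bcnt bytes into nch equal slices.
 * Returns 0 on success, -1 if bcnt is not divisible by nch. */
static int split_transfer(uint32_t src, uint32_t dst,
                          uint16_t acnt, uint16_t bcnt,
                          unsigned nch, ChannelSplit *out)
{
    unsigned i;
    uint16_t part;

    if (nch == 0u || (bcnt % nch) != 0u) {
        return -1;
    }
    part = (uint16_t)(bcnt / nch);

    for (i = 0; i < nch; i++) {
        /* Byte offset of slice i: i * (BCNT / nch) * ACNT */
        uint32_t offset = (uint32_t)i * part * acnt;
        out[i].src  = src + offset;
        out[i].dst  = dst + offset;
        out[i].acnt = acnt;
        out[i].bcnt = part;
    }
    return 0;
}
```

For ACNT = 64, BCNT = 1024 and four channels, each slice has BCNT = 256 and the start addresses are 0x4000 bytes (16 kB) apart.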

DDR speed is 1333 MHz. I haven't set up SYSCLK yet and I'm not sure about its reset value, but even if it were 1.25 GHz, the timings are too slow.

memory content for single direction setup: Base address: 0x02744000

0x8010050C 0x80000000 0x04000040 0x0C000000 0x00400040 0x00004000 0x00000000 0x00000001
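To read such a dump more easily, the eight words can be unpacked into named fields. The sketch below assumes the usual PaRAM word order (OPT, SRC, BCNT:ACNT, DST, DSTBIDX:SRCBIDX, BCNTRLD:LINK, DSTCIDX:SRCCIDX, CCNT) with the first-named field in the upper 16 bits; `ParamSet`, `decode_param` and `param_total_bytes` are hypothetical names.

```c
#include <stdint.h>

/* Decoded view of one 8-word PaRAM set (hypothetical helper type). */
typedef struct {
    uint32_t opt, src, dst;
    uint16_t acnt, bcnt, ccnt;
    int16_t  srcbidx, dstbidx;   /* index fields are signed 16-bit */
    int16_t  srccidx, dstcidx;
    uint16_t link, bcntrld;
} ParamSet;

/* Unpack eight raw PaRAM words; the word layout is an assumption
 * based on the EDMA3 documentation, see the text above. */
static ParamSet decode_param(const uint32_t w[8])
{
    ParamSet p;
    p.opt     = w[0];
    p.src     = w[1];
    p.acnt    = (uint16_t)(w[2] & 0xFFFFu);
    p.bcnt    = (uint16_t)(w[2] >> 16);
    p.dst     = w[3];
    p.srcbidx = (int16_t)(uint16_t)(w[4] & 0xFFFFu);
    p.dstbidx = (int16_t)(uint16_t)(w[4] >> 16);
    p.link    = (uint16_t)(w[5] & 0xFFFFu);
    p.bcntrld = (uint16_t)(w[5] >> 16);
    p.srccidx = (int16_t)(uint16_t)(w[6] & 0xFFFFu);
    p.dstcidx = (int16_t)(uint16_t)(w[6] >> 16);
    p.ccnt    = (uint16_t)(w[7] & 0xFFFFu);
    return p;
}

/* Total number of bytes moved by the set: ACNT * BCNT * CCNT. */
static uint32_t param_total_bytes(const ParamSet *p)
{
    return (uint32_t)p->acnt * p->bcnt * p->ccnt;
}
```

Applied to the dump above, this gives SRC = 0x80000000, DST = 0x0C000000, ACNT = 64, BCNT = 1024, both BIDX values 64, CCNT = 1 and a total of 65536 bytes. The LINK field decodes to 0x4000, which differs from the 0xFFFFF written in the parameter listing earlier in the thread.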

An example timestamp is below. The cycle count after "Time:" is taken from the software trigger to the ISR; the overhead is from the ISR (which posts an event) to the process.

Time: 25667 Overhead: 1876

Then I made a mistake in my program, output for the 4-channel setup is:

Time: 7279 Overhead: 18030

To explain: I wait for the interrupt of the first channel, then go to my handler task and poll the interrupt pending register until it shows 0xF (all channels complete). So what confused me in my earlier post, namely that I use the same queue for all transfers, was correct, and I do NOT get better timings with this setup.

As a next step I corrected the queue setup; now each of the four channels has its own queue, i.e. its own transfer controller.

Time: 21556 Overhead: 3772

That's again almost the same, which means that my first result was wrong: I don't get better timings using more channels. The board is the EVM from D.SignT; I will ask the distributor for some benchmarks on their board and stop my study here.

Thank you very much for your help, I will post it here when I have new results.

best regards

Pay Gießelmann
