C64x Vs C66x EDMA driver performance

Louis Leung

I am comparing EDMA data transfer performance between the C64x and the C66x. The C64x project uses direct CSL macros to modify the registers, while the C66x project uses CSL functional layer driver calls, as recommended by TI. In my case, I am running a C6472 @ 700 MHz with DDR2 and a C6678 @ 1 GHz with DDR3.

Here is my driver design:

In C64x:

Transfer:

directly set up the PARAM set registers
set the corresponding bit in ESR to trigger the transfer

Wait for completion:

poll the corresponding bit in IPR register
after transfer completes, set the corresponding bit in ICRH to clear the interrupt pending bit
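
For readers who have not seen the register-level flow, the C64x steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ParamSetSketch`, `EdmaRegsSketch` and the helper names are hypothetical, the register block is a plain struct rather than the device's real memory map, and the OPT value is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* One PaRAM set is eight 32-bit words. Field names are illustrative. */
typedef struct {
    volatile uint32_t opt;
    volatile uint32_t src;
    volatile uint32_t a_b_cnt;       /* BCNT in bits 31:16, ACNT in bits 15:0 */
    volatile uint32_t dst;
    volatile uint32_t src_dst_bidx;
    volatile uint32_t link_bcntrld;
    volatile uint32_t src_dst_cidx;
    volatile uint32_t ccnt;
} ParamSetSketch;

/* Hypothetical, simplified register image; the real offsets come from the
 * device header files, not from this struct. */
typedef struct {
    volatile uint32_t esr;   /* event set: write 1 to trigger a channel  */
    volatile uint32_t ipr;   /* interrupt pending: set on completion     */
    volatile uint32_t icr;   /* interrupt clear: write 1 to clear a bit  */
} EdmaRegsSketch;

/* Fill a PaRAM set for a single contiguous block of `bytes` bytes. */
static void param_fill_1d(ParamSetSketch *p, uint32_t opt,
                          uint32_t src, uint32_t dst, uint32_t bytes)
{
    assert(bytes > 0u && bytes <= 0xFFFFu);  /* ACNT is a 16-bit field */
    p->opt = opt;
    p->src = src;
    p->a_b_cnt = (1u << 16) | bytes;         /* BCNT = 1, ACNT = bytes */
    p->dst = dst;
    p->src_dst_bidx = 0u;
    p->link_bcntrld = 0xFFFFu;               /* null link, no BCNT reload */
    p->src_dst_cidx = 0u;
    p->ccnt = 1u;
}

/* Trigger channel ch (0..31) by setting its bit in ESR. */
static void edma_trigger(EdmaRegsSketch *r, unsigned ch)
{
    assert(ch < 32u);
    r->esr = (uint32_t)1u << ch;
}

/* Return nonzero once the completion bit for tcc is pending in IPR. */
static int edma_done(const EdmaRegsSketch *r, unsigned tcc)
{
    assert(tcc < 32u);
    return (int)((r->ipr >> tcc) & 1u);
}

/* Clear the pending bit by writing 1 to the clear register. */
static void edma_clear(EdmaRegsSketch *r, unsigned tcc)
{
    assert(tcc < 32u);
    r->icr = (uint32_t)1u << tcc;
}
```

On real hardware the wait step would be a loop such as `while (!edma_done(regs, tcc)) { }` followed by `edma_clear(regs, tcc)`; the point of the sketch is that the whole path is a handful of register writes and one polled read.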

In C66x:

Transfer:

Create a local PARAM register struct (localParamSet) and set it up
Call CSL_edma3MapDMAChannelToParamBlock() to map the DMA channel to the PARAM block
Call CSL_edma3GetParamHandle() to obtain a handle to the PARAM set
Call CSL_edma3ParamSetup() to copy localParamSet to the actual PARAM set registers
Call CSL_edma3DMAChannelEnable() to enable the channel
Call CSL_edma3SetDMAChannelEvent() to trigger the transfer

Wait for completion:

poll the corresponding bit by calling CSL_edma3GetHwStatus(), cmd = CSL_EDMA3_QUERY_INTRPEND
Call CSL_edma3HwControl() with cmd = CSL_EDMA3_CMD_INTRPEND_CLEAR to clear the interrupt pending bit
Call CSL_edma3ClearDMAChannelEvent() to clear channel event
Call CSL_edma3DMAChannelDisable() to disable the channel
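
To see where the time goes in a driver like this, it can help to timestamp the setup phase and the wait phase separately. The following is a minimal sketch, assuming a caller-supplied tick source (on a C66x this would typically be the core's free-running time-stamp counter). The callbacks and the `fake_*` host stand-ins are hypothetical; the stand-in values simply reuse the 0.942 us and 0.618 us figures from this post, expressed as ticks at 1 GHz.

```c
#include <assert.h>
#include <stdint.h>

/* Tick source supplied by the caller. */
typedef uint32_t (*TickFn)(void);

typedef struct {
    uint32_t setup_ticks;  /* time spent in the PARAM setup and trigger calls */
    uint32_t wait_ticks;   /* time from trigger until completion is observed  */
} PhaseTimes;

/* Time the two phases separately. setup() and wait() are hypothetical
 * callbacks wrapping the driver's "Transfer" and "Wait for completion" steps. */
static PhaseTimes time_phases(TickFn now, void (*setup)(void), void (*wait)(void))
{
    PhaseTimes t;
    uint32_t t0, t1, t2;

    assert(now && setup && wait);
    t0 = now();
    setup();
    t1 = now();
    wait();
    t2 = now();
    t.setup_ticks = t1 - t0;  /* unsigned subtraction tolerates one wrap */
    t.wait_ticks  = t2 - t1;
    return t;
}

/* Host-side stand-ins so the sketch runs without hardware. */
static uint32_t fake_clock = 0u;
static uint32_t fake_now(void)   { return fake_clock; }
static void     fake_setup(void) { fake_clock += 942u; }
static void     fake_wait(void)  { fake_clock += 618u; }
```

Calling `time_phases(fake_now, fake_setup, fake_wait)` returns the two phase durations independently, which is the split used in the numbers quoted in this thread.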

I have realized that the overhead of the C66x driver is quite significant compared to the C64x, especially for small data transfers (smaller than 32KB). Also, the performance is not as fast as I expected, given that I have a faster DSP @ 1 GHz and faster DDR memory.

In the C66x test project, I'm using CC1 only. For the LL2 to DDR test, the elapsed time for a 1KB transfer is 0.618 us, but the function overhead to set up the PARAM set is 0.942 us. The overhead takes longer than the transfer itself.
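
To put those two numbers in perspective, the effective rate can be worked out with and without the setup cost. This is just arithmetic on the figures above; the function name is made up for illustration.

```c
#include <assert.h>

/* Effective rate in bytes per microsecond once a fixed setup cost is
 * added to the raw transfer time. */
static double effective_bytes_per_us(double bytes, double transfer_us,
                                     double setup_us)
{
    assert(transfer_us + setup_us > 0.0);
    return bytes / (transfer_us + setup_us);
}
```

With 1024 bytes, 0.618 us of transfer time and 0.942 us of setup, the rate drops from roughly 1657 bytes/us (transfer only) to roughly 656 bytes/us (setup included).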

I am wondering whether I have missed anything in my setup that causes this, whether there is any way I can speed it up, or whether there is nothing I can do about it. In the latter case, could we conclude that we shouldn't use EDMA for small data transfers (like 32KB or less) and should use memcpy instead?

Thanks,

-- Louis

over 8 years ago

0 Raja over 8 years ago

TI__Guru* 81335 points

We are working on this and will get back to you.
Thank you for the post and patience.

0 Sivaraj Kuppuraj over 8 years ago

TI__Mastermind 35645 points

Hi,

Thanks for your post.

In general, the latency of the DSP core accessing external DDR memory depends heavily on the cache; the average cycles per instruction are reported in Figures 5 and 6 of the document attached below. The latency of the DSP core accessing LL2 was also measured, and the average cycles per instruction are reported in Figure 2 of the same document.

In general, it is very hard to measure the initial latency between the DMA event and the start of the actual data transfer. So we measured the transfer overhead instead, which is the sum of the latency and the time to transfer the smallest element. To be specific, the EDMA CC1 transfer overhead was measured as the average number of cycles between the EDMA trigger (write ESR) and EDMA completion (read IPR=1) for the smallest transfer (1 word) between different ports on a 1.2GHz KeyStone II EVM with 64-bit 1600MTS DDR. The test was performed between different types of source and destination endpoints; see LL2->DDR3A, LL2->DDR3B, etc. in Table 6 of the KeyStone memory performance document attached below.

You can also check the throughput comparison between the EDMA TCs in Table 7 of the attached document, and Table 4 of the same document for the transfer throughput comparison between DSP core, EDMA and IDMA.

/cfs-file/__key/communityserver-discussions-components-files/791/8357.K2-SOC-Memory-Performance.doc

Thanks & regards,
Sivaraj K

-------------------------------------------------------------------------------------------------------
Please click the Verify Answer button on this post if it answers your question.
--------------------------------------------------------------------------------------------------------

0 Louis Leung over 8 years ago in reply to Sivaraj Kuppuraj

Intellectual 290 points

Thanks, Sivaraj, for your reply. I am aware of the EDMA throughput and the doc you mentioned. But I would still like to hear if anyone can address the driver overhead issue and the comparison with the C64x version.

Thanks,

-- Louis

0 Sivaraj Kuppuraj over 8 years ago in reply to Louis Leung

TI__Mastermind 35645 points

Hi,

I would recommend walking through the complex EDMA throughput scenarios, for overhead considerations, in the throughput performance guide for C66x KeyStone devices:

http://www.ti.com/lit/an/sprabk5a/sprabk5a.pdf

To remove the overhead from the EDMA throughput measurement, the test process is as follows:

1. Get the transfer time for a payload of 32KB/channel (includes overhead).

2. Get the transfer time for 0 bytes - this is called a dummy transfer and closely approximates the overhead.

3. Subtract 2 from 1 to get the transfer time with overhead removed - t3.

4. Throughput = [(32KB * number of channels)/t3] * 1GHz

Kindly try the above test setup to measure the throughput with the overhead removed.
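
The four steps above amount to a small calculation. As a sketch, assuming the two transfer times are measured in CPU cycles and the clock rate is passed in by the caller (the function and parameter names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in bytes per second with the dummy-transfer overhead removed.
 * cycles_with_payload: step 1, time for the real payload (includes overhead)
 * cycles_dummy:        step 2, time for the 0-byte dummy transfer
 * clock_hz:            clock the cycle counts were measured against */
static double throughput_bytes_per_s(uint64_t payload_bytes_per_channel,
                                     unsigned channels,
                                     uint64_t cycles_with_payload,
                                     uint64_t cycles_dummy,
                                     double clock_hz)
{
    uint64_t t3;

    assert(cycles_with_payload > cycles_dummy);
    t3 = cycles_with_payload - cycles_dummy;               /* step 3 */
    return ((double)payload_bytes_per_channel * (double)channels
            / (double)t3) * clock_hz;                      /* step 4 */
}
```

For example, with made-up counts of 9192 cycles for a 32KB payload on one channel and 1000 cycles for the dummy transfer at 1 GHz, t3 is 8192 cycles and the result is 4 bytes per cycle, i.e. 4e9 bytes/s.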

Thanks & regards,

Sivaraj K

0 Gunter Schmer over 8 years ago in reply to Sivaraj Kuppuraj

TI__Genius 13647 points

Hi,

we are continuing to see a larger overhead for small EDMA transfers (here 1KB) between LL2 and DDR on the C6678. We are trying to understand architecturally (also in comparison to the C6472) why that could be and what knobs can be turned.

Here are the assumptions, based on the use-case described in the very beginning:

[] I am assuming that CSL functional layer calls vs direct PARAM writes have minimal impact. The EDMA3 CC and EDMA3 TC in both the C6472 and C6678 have the same conceptual architecture, but different configurations.

[] I believe the EDMA3 CC will behave the same way from a latency standpoint. Here I am assuming that any CC priorities you have set up are the same on the two devices, that for this test the CPU writes one single event to kick off the EDMA channel, and that this channel is the only channel in use. This should cause the transfer request sent from the CC to the TC to behave the same way on both devices. So I think we can take the CC out of the loop for now.

Therefore I think we need to focus on the differences between TC and TeraNet (in the C6678) vs TC and Switch Central Resource (SCR) in the C6472. While the TC has the same conceptual architecture, the TC configurations are different.

In the C6678 the CC1 with TC0 is used.

http://www.ti.com/lit/ds/symlink/tms320c6678.pdf

describes the CC/TC configuration in Table 7-34

For C6472

http://www.ti.com/lit/ug/spru727e/spru727e.pdf

describes the CC/TC configuration in Table 2-20 in comparison.

Then the System Interconnect is different between C6678 and C6472, i.e. Teranet vs SCR.

Here we need to look at how TC transfer requests are serviced by the Interconnect, how the priorities at the Teranet boundary are set up. The TC will act as a master on the Teranet or SCR.

4.3 Bus Priorities

The priority level of all master peripheral traffic is defined at the TeraNet boundary. User programmable priority registers allow software configuration of the data traffic through the TeraNet. Note that a lower number means higher priority - PRI = 000b = urgent, PRI = 111b = low.

We believe PRI is set to 000b in both cases.

Given all that, what could cause the overhead of a 1KB transfer to be considerably higher on the C6678? Is there an architectural reason (TeraNet vs SCR) for this, and what knobs could be turned?

Thanks,

--Gunter

0 Gunter Schmer over 8 years ago in reply to Gunter Schmer

TI__Genius 13647 points

Hi,

to also respond to the suggestion of measuring the true overhead by running a dummy (NULL) transfer in comparison to a transfer with payload, the customer measured:

CC0 TC0 Overhead (Dummy transfer): 1.014μs
CC0 Overhead (Actual transfer): 0.942μs
CC0 Transfer Time for 1KB (Does not include overhead): 0.900μs

CC1 Overhead (Dummy transfer): 1.056μs
CC1 Overhead (Actual transfer): 0.876μs
CC1 Transfer Time for 1KB (Does not include overhead): 0.918μs

The payload transfer was a 1KB payload transfer from LL2 to DDR.

As you can see from the above, the overhead is on the order of 1us, which would amount to 500 CPU/2 cycles on the TeraNet. This is very high and we are REALLY questioning why that is architecturally.
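
The cycle figure follows directly from the clock rate. As a small worked sketch, assuming the C6678 core clock of 1 GHz (1000 MHz) mentioned earlier in the thread and a hypothetical helper name:

```c
#include <assert.h>

/* Convert a time in microseconds to cycles of a clock running at
 * clock_mhz divided by `divider` (1 for CPU/1, 2 for CPU/2, ...). */
static double us_to_cycles(double us, double clock_mhz, unsigned divider)
{
    assert(divider > 0u);
    return us * clock_mhz / (double)divider;
}
```

An overhead of 1 us is therefore 1000 cycles at CPU/1 and 500 cycles at CPU/2.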

Thanks,
--Gunter

0 Sivaraj Kuppuraj over 8 years ago in reply to Gunter Schmer

TI__Mastermind 35645 points

Hi,

Thanks for your update.

Yes, you are correct. Transfer overhead is a big concern for short transfers, and it needs to be included when scheduling DMA traffic in a system. Single-element transfer performance will be latency dominated. So, for small transfers, you have to weigh the trade-off between DMA and CPU.
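
One rough way to frame that trade-off is a break-even size: the DMA pays a fixed overhead and then moves data at its own rate, while a CPU copy has no fixed cost but a lower rate. The sketch below is a simplified model, not a measurement; it assumes a blocking transfer (the CPU waits for the DMA), constant rates, and a DMA rate higher than the CPU copy rate. All names and the example numbers are hypothetical.

```c
#include <assert.h>

/* Size in bytes above which overhead + n/dma_rate is smaller than
 * n/cpu_rate. Rates are in bytes per microsecond. */
static double break_even_bytes(double overhead_us,
                               double cpu_bytes_per_us,
                               double dma_bytes_per_us)
{
    assert(cpu_bytes_per_us > 0.0);
    assert(dma_bytes_per_us > cpu_bytes_per_us);
    return overhead_us / (1.0 / cpu_bytes_per_us - 1.0 / dma_bytes_per_us);
}
```

For instance, with placeholder values of 1 us overhead, 500 bytes/us for the CPU copy and 1000 bytes/us for the DMA, the break-even point is 1000 bytes. The model ignores the fact that a DMA transfer can overlap with CPU work, which often matters more than the raw copy time.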

As I understand it, EDMA CC0 and CC4 are connected to the TeraNet switch fabric close to MSRAM and DDR3A, so their overhead to access MSRAM and DDR3A is smaller. EDMA CC1, CC2 and CC3 are connected to the TeraNet switch fabric close to DDR3B and the DSP CorePac, which includes LL2, so their overhead to access LL2 and DDR3B is smaller.

The average IDMA transfer overhead measured is about 66 cycles. The following tables show the average cycles measured between EDMA trigger (write ESR) and EDMA completion (read IPR=1) for the smallest transfer (1 word) between different ports on a 1.2GHz KeyStone II EVM with 64-bit 1600MTS DDR.

Table 1. EDMA CC0 and CC4 Transfer Overhead

destination source	LL2	MSRAM	DDR3A	DDR3B
LL2	376	325	376	376
MSRAM	325	325	325	325
DDR3A	427	376	478	427
DDR3B	427	376	427	478

Table 2. EDMA CC1, CC2 and EDMA CC3 Transfer Overhead

destination source	LL2	MSRAM	DDR3A	DDR3B
LL2	325	376	376	376
MSRAM	325	376	478	376
DDR3A	427	427	529	478
DDR3B	376	427	478	529

For large transfers, this overhead can probably be ignored. For more details on EDMA throughput and DMA transfer overhead, please refer to the attachment below:

/cfs-file/__key/communityserver-discussions-components-files/791/3175.0003.K2-SOC-Memory-Performance.doc

Thanks & regards,

Sivaraj K

0 Louis Leung over 8 years ago in reply to Sivaraj Kuppuraj

Intellectual 290 points

Thanks Sivaraj for the investigation and your information.

I'm curious what the equivalent overhead would be for the C64x+. We are evaluating our code and the use of EDMA in different modules to determine whether we need to perform a significant rewrite. This information would be very helpful in deciding whether we have to switch our EDMA data fetching mechanism to cache-based data fetching for small data transfers.

Many thanks.

-- Louis

0 Gunter Schmer over 8 years ago in reply to Louis Leung

TI__Genius 13647 points

Hi Sivaraj,

is there someone who can provide the latency numbers that you have shown for the C6678 (Table 1 and 2 above) for the C6472 in comparison?

That comparison would be very important.

Thanks,

--Gunter

0 Gunter Schmer over 8 years ago in reply to Gunter Schmer

TI__Genius 13647 points

Hi Sivaraj,

two clarifications:
[] We realize the latency data above is for a K2 device.
[] What is a cycle in the tables above? Is it a CPU/1 cycle?

Please let us know where we can find comparison C6472 overhead numbers.

Thanks,
--Gunter
