
C64x Vs C66x EDMA driver performance

I am comparing EDMA data transfer performance between the C64x and the C66x. The C64x project modifies the registers directly through CSL macros, while the C66x project uses the CSL functional layer driver calls, as recommended by TI. I am running a C6472 at 700 MHz with DDR2 and a C6678 at 1 GHz with DDR3.

Here is my driver design:

In C64x:

Transfer:

  1.  Directly set up the PARAM set registers
  2.  Set the corresponding bit in ESR to trigger the transfer

Wait for completion:

  1. Poll the corresponding bit in the IPR register
  2. After the transfer completes, set the corresponding bit in ICRH to clear the interrupt pending bit
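
The C64x sequence above can be sketched roughly as follows. This is an illustration, not the actual driver: edma_regs_sim and sim_hw_step() are made-up names that simulate ESR, the interrupt pending register and the interrupt clear register in plain memory so the code runs on a host. It assumes the completion bit has the same number as the triggered event and only handles bits 0-31.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the EDMA3 CC registers named in the steps above.
 * Plain memory here; memory-mapped registers on the real device. */
typedef struct {
    volatile uint32_t esr; /* event set: write 1 to trigger a transfer      */
    volatile uint32_t ipr; /* interrupt pending: 1 = transfer has completed */
    volatile uint32_t icr; /* interrupt clear: write 1 to clear an IPR bit  */
} edma_regs_sim;

/* Simulation only: the "hardware" completes every triggered transfer at once
 * and applies pending clears. Real hardware does this asynchronously. */
static void sim_hw_step(edma_regs_sim *r)
{
    r->ipr |= r->esr;
    r->esr = 0;
    r->ipr &= ~r->icr;
    r->icr = 0;
}

/* Transfer step 2: set the bit in ESR (PARAM set assumed already written). */
static void edma_trigger(edma_regs_sim *r, unsigned bit)
{
    assert(bit < 32);
    r->esr = 1u << bit;
    sim_hw_step(r);
}

/* Wait steps 1 and 2: poll the pending bit, then clear it.
 * Returns 0 on completion, -1 if max_polls expired first. */
static int edma_wait(edma_regs_sim *r, unsigned bit, unsigned max_polls)
{
    uint32_t mask;
    assert(bit < 32);
    mask = 1u << bit;
    while (max_polls-- > 0) {
        if (r->ipr & mask) {
            r->icr = mask;
            sim_hw_step(r);
            return 0;
        }
    }
    return -1;
}
```

The point of the sketch is that this path is only a handful of register accesses per transfer, which is why its software overhead is small.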

In C66x:

Transfer:

  1. Create a local PARAM set struct (localParamSet) and fill it in
  2. Call CSL_edma3MapDMAChannelToParamBlock() to map the DMA channel to the PARAM block
  3. Call CSL_edma3GetParamHandle() to obtain a handle to the PARAM set
  4. Call CSL_edma3ParamSetup() to copy localParamSet to the actual PARAM set registers
  5. Call CSL_edma3DMAChannelEnable() to enable the channel
  6. Call CSL_edma3SetDMAChannelEvent() to trigger the transfer

Wait for completion:

  1. Poll the corresponding bit by calling CSL_edma3GetHwStatus() with cmd = CSL_EDMA3_QUERY_INTRPEND
  2. Call CSL_edma3HwControl() with cmd = CSL_EDMA3_CMD_INTRPEND_CLEAR to clear the interrupt pending bit
  3. Call CSL_edma3ClearDMAChannelEvent() to clear channel event
  4. Call CSL_edma3DMAChannelDisable() to disable the channel

I have found that the overhead of the C66x driver is significant compared to the C64x version, especially for small transfers (smaller than 32KB). The performance is also lower than I expected given the faster DSP (1 GHz) and the faster DDR memory.

In the C66x test project I am using CC1 only. For the LL2 to DDR test, a 1KB transfer takes 0.618 us, but the function-call overhead to set up the PARAM set is 0.942 us. The overhead takes longer than the transfer itself.
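
To put those two numbers in proportion, here is a small sketch. It assumes the setup and the transfer run back to back with no overlap, 1KB = 1024 bytes and 1 MB = 10^6 bytes; edma_timing and the function names are made up for illustration.

```c
#include <assert.h>

/* Measured times for one transfer, in microseconds. */
typedef struct {
    double setup_us;    /* software overhead to set up the PARAM set */
    double transfer_us; /* time for the transfer itself              */
} edma_timing;

/* Fraction of the total time spent in setup (0.0 to 1.0). */
static double overhead_share(edma_timing t)
{
    double total = t.setup_us + t.transfer_us;
    assert(total > 0.0);
    return t.setup_us / total;
}

/* Effective rate including setup. Bytes per microsecond equals MB/s. */
static double effective_mb_per_s(edma_timing t, double bytes)
{
    double total = t.setup_us + t.transfer_us;
    assert(total > 0.0);
    return bytes / total;
}
```

With 0.942 us of setup and 0.618 us of transfer, the setup is about 60% of the 1.56 us total, and the effective rate for the 1KB transfer is roughly 656 MB/s instead of about 1657 MB/s for the transfer alone.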

I am wondering whether I have missed something in my setup that causes this, whether there is a way to speed it up, or whether nothing can be done about it. In the latter case, should we conclude that EDMA is not suitable for small transfers (32KB or less) and use memcpy instead?

Thanks,

-- Louis

  • We are working on this and will get back to you.
    Thank you for the post and patience.
  • Hi,

    Thanks for your post.

    In general, the latency of DSP core accesses to external DDR memory depends heavily on the cache; the average cycles per instruction are reported in Figures 5 and 6 of the document attached below. The latency of DSP core accesses to LL2 was also measured, and the average cycles per instruction are reported in Figure 2 of the same document.

    It is very hard to measure the initial latency between the DMA event and the start of the actual data transfer. So we measured the transfer overhead instead, which is the sum of that latency and the time to transfer the smallest element. Specifically, the EDMA CC1 transfer overhead was measured as the average number of cycles between EDMA trigger (write ESR) and EDMA completion (read IPR=1) for the smallest transfer (1 word) between different ports on a 1.2GHz KeyStone II EVM with 64-bit 1600MTS DDR. The test covers different source and destination endpoints; for LL2->DDR3A, LL2->DDR3B, etc., refer to Table 6 of the attached KeyStone memory performance document.

    You could also check the throughput comparison between the EDMA TCs in Table 7 of the attached document, and Table 4 of the same document for the transfer throughput comparison between DSP core, EDMA and IDMA.

    /cfs-file/__key/communityserver-discussions-components-files/791/8357.K2-SOC-Memory-Performance.doc

    Thanks & regards,
    Sivaraj K


  • Thanks Sivaraj for your reply. I am aware of the EDMA throughput figures and the document you mentioned. But I would still like to hear whether anyone can address the driver overhead issue and the comparison with the C64x version.

    Thanks,

    -- Louis

  • Hi,

    I would recommend walking through the complex EDMA throughput scenarios and the overhead considerations in the throughput performance guide for C66x KeyStone devices:

    http://www.ti.com/lit/an/sprabk5a/sprabk5a.pdf

    To remove the overhead from the EDMA throughput measurement, the test process is set up as follows:

    1. Get the transfer time for a payload of 32KB/channel (includes overhead).

    2. Get the transfer time for 0 bytes - this is called a dummy transfer and closely approximates the overhead.

    3. Subtract 2 from 1 to get the transfer time with overhead removed - t3.

    4. Throughput = [(32KB * number of channels)/t3] * 1GHz

    Kindly try the above test setup to separate the overhead from the transfer time.
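
    The four steps can be written as a small helper. This is only a sketch: it assumes both times are measured in CPU cycles (which is what multiplying by the clock frequency in step 4 implies), and the function and parameter names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in bytes per second with the dummy-transfer overhead removed.
 *   payload_bytes  : bytes per channel (32KB in the steps above)
 *   channels       : number of channels used
 *   cycles_payload : step 1, cycles for the payload transfer (with overhead)
 *   cycles_dummy   : step 2, cycles for the 0-byte dummy transfer
 *   clock_hz       : frequency of the clock the cycles were counted with */
static double edma_throughput_bytes_per_s(uint32_t payload_bytes,
                                          uint32_t channels,
                                          uint64_t cycles_payload,
                                          uint64_t cycles_dummy,
                                          double clock_hz)
{
    uint64_t t3;
    assert(cycles_payload > cycles_dummy);
    assert(clock_hz > 0.0);
    t3 = cycles_payload - cycles_dummy;                 /* step 3 */
    return ((double)payload_bytes * channels / (double)t3) * clock_hz; /* step 4 */
}
```

    As a made-up example, 32KB on one channel measured at 9192 cycles with a 1000-cycle dummy transfer gives t3 = 8192 cycles, i.e. 4 bytes per cycle, or 4 GB/s at 1 GHz.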

    Thanks & regards,

    Sivaraj K


  • Hi,

    We are continuing to see a large overhead for small EDMA transfers (here 1KB) between LL2 and DDR on the C6678. We are trying to understand architecturally (also in comparison to the C6472) why that could be and what knobs can be turned.

    Here are the assumptions, based on the use-case described in the very beginning:

    [] Assuming that CSL functional layer vs direct PARAM writes have minimal impact. The EDMA3 CC and EDMA3 TC in both the C6472 and the C6678 have the same conceptual architecture but different configurations.

    [] I believe the EDMA3 CC will behave the same way from a latency standpoint. Here I am assuming that any CC priorities you have set up are the same on the two devices, that for this test the CPU writes one single event to kick off the EDMA channel, and that this channel is the only channel in use. The transfer request sent from the CC to the TC should then behave the same way on both devices. So I think we can take the CC out of the loop for now.

    Therefore I think we need to focus on the differences between TC and TeraNet (in the C6678) vs TC and Switch Central Resource (SCR) in the C6472. While the TC has the same conceptual architecture, the TC configurations are different.

    In the C6678 the CC1 with TC0 is used.

    http://www.ti.com/lit/ds/symlink/tms320c6678.pdf

    describes the CC/TC configuration in Table 7-34

    For C6472

    http://www.ti.com/lit/ug/spru727e/spru727e.pdf

    describes the CC/TC configuration in Table 2-20 in comparison.

     

    The system interconnect is also different between the C6678 and the C6472, i.e. TeraNet vs SCR.

    Here we need to look at how TC transfer requests are serviced by the interconnect and how the priorities at the TeraNet boundary are set up. The TC acts as a master on the TeraNet or SCR.

    4.3 Bus Priorities

    The priority level of all master peripheral traffic is defined at the TeraNet boundary. User programmable priority registers allow software configuration of the data traffic through the TeraNet. Note that a lower number means higher priority - PRI = 000b = urgent, PRI = 111b = low.

     

    We believe PRI is set to 000b in both cases.

     

    Given all that, what could cause the overhead of a 1KB transfer to be considerably higher on the C6678? Is there an architectural reason (TeraNet vs SCR) for this, and what knobs could be turned?

     

    Thanks,

    --Gunter


  • Hi,

    To also respond to the suggestion of measuring the true overhead by running a dummy (null) transfer in comparison to a transfer with payload, the customer measured:

    CC0 TC0 Overhead (Dummy transfer): 1.014μs
    CC0 Overhead (Actual transfer): 0.942μs
    CC0 Transfer Time for 1KB (Does not include overhead): 0.900μs

    CC1 Overhead (Dummy transfer): 1.056μs
    CC1 Overhead (Actual transfer): 0.876μs
    CC1 Transfer Time for 1KB (Does not include overhead): 0.918μs

    The payload transfer was a 1KB payload transfer from LL2 to DDR.

    As you can see from the above, the overhead is on the order of 1us, which would amount to 500 CPU/2 cycles on the TeraNet. This is very high and we are really questioning why that is architecturally.
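
    The cycle figure follows from a simple conversion, sketched below. It assumes the C6678 runs at 1 GHz (as stated at the start of the thread) and takes the CPU/2 clock for the TeraNet from the sentence above; us_to_cycles is a made-up helper name.

```c
#include <assert.h>

/* Converts a duration in microseconds to cycles of a clock running at
 * clock_hz / divider (divider 1 = CPU clock, 2 = CPU/2, and so on). */
static double us_to_cycles(double us, double clock_hz, unsigned divider)
{
    assert(us >= 0.0 && clock_hz > 0.0 && divider > 0);
    return us * (clock_hz / 1e6) / divider;
}
```

    For example, us_to_cycles(1.0, 1e9, 2) gives 500 CPU/2 cycles, and the measured 1.014 us dummy transfer corresponds to about 1014 CPU cycles or about 507 CPU/2 cycles.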

    Thanks,
    --Gunter
  • Hi,

    Thanks for your update.

    Yes, you are correct. Transfer overhead is a big concern for short transfers and it needs to be included when scheduling DMA traffic in a system. Again, single-element transfer performance will be latency dominated.  So, for small transfers, you should make the trade off between DMA and CPU.
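
    One rough way to frame that trade-off is a break-even size, sketched below. This is a deliberately simple model and not a measured result: it assumes a fixed DMA overhead plus a constant DMA rate, a constant CPU copy rate with no fixed cost, and it ignores the fact that the CPU is free to do other work while the DMA runs. All names and the example rates are hypothetical.

```c
#include <assert.h>

/* Model: t_dma(n) = overhead_us + n / dma_bytes_per_us
 *        t_cpu(n) = n / cpu_bytes_per_us
 * Returns the transfer size in bytes above which the DMA finishes sooner,
 * or -1.0 if the CPU copy is at least as fast per byte, in which case the
 * DMA never catches up in this model. */
static double break_even_bytes(double overhead_us,
                               double dma_bytes_per_us,
                               double cpu_bytes_per_us)
{
    assert(overhead_us >= 0.0);
    assert(dma_bytes_per_us > 0.0 && cpu_bytes_per_us > 0.0);
    if (cpu_bytes_per_us >= dma_bytes_per_us) {
        return -1.0;
    }
    return overhead_us / (1.0 / cpu_bytes_per_us - 1.0 / dma_bytes_per_us);
}
```

    With placeholder values of 1 us overhead, 2000 bytes/us for the DMA and 1000 bytes/us for the CPU copy, the break-even size is 2000 bytes. The real rates have to be measured on the target for the memories involved.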

    In my understanding, EDMA CC0 and CC4 are connected to the TeraNet switch fabric close to MSRAM and DDR3A, so their overhead to access MSRAM and DDR3A is smaller. EDMA CC1, CC2 and CC3 are connected to the TeraNet switch fabric close to DDR3B and the DSP CorePac (which includes LL2), so their overhead to access LL2 and DDR3B is smaller.

    On average, the IDMA transfer overhead measured is about 66 cycles. The following tables show the average cycles measured between EDMA trigger (write ESR) and EDMA completion (read IPR=1) for the smallest transfer (1 word) between different ports on a 1.2GHz KeyStone II EVM with 64-bit 1600MTS DDR.

    Table 1. EDMA CC0 and CC4 Transfer Overhead (average cycles)

    source \ destination    LL2    MSRAM    DDR3A    DDR3B
    LL2                     376    325      376      376
    MSRAM                   325    325      325      325
    DDR3A                   427    376      478      427
    DDR3B                   427    376      427      478

    Table 2. EDMA CC1, CC2 and CC3 Transfer Overhead (average cycles)

    source \ destination    LL2    MSRAM    DDR3A    DDR3B
    LL2                     325    376      376      376
    MSRAM                   325    376      478      376
    DDR3A                   427    427      529      478
    DDR3B                   376    427      478      529

    For large transfers this overhead can probably be ignored. For more details on EDMA throughput and DMA transfer overhead, please refer to the attachment below:

    /cfs-file/__key/communityserver-discussions-components-files/791/3175.0003.K2-SOC-Memory-Performance.doc
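
    To compare the table values with times measured in microseconds, the cycles can be converted as sketched below. Two assumptions are baked in and neither is confirmed in this post: that a "cycle" is a CPU clock cycle at 1.2GHz, and that table rows are sources and columns are destinations. The identifier names are made up.

```c
#include <assert.h>

enum endpoint { LL2, MSRAM, DDR3A, DDR3B, NUM_ENDPOINTS };

/* Table 2 (EDMA CC1, CC2 and CC3), indexed [source][destination].
 * Assumption: rows are sources and columns are destinations. */
static const unsigned cc1_overhead_cycles[NUM_ENDPOINTS][NUM_ENDPOINTS] = {
    {325, 376, 376, 376}, /* from LL2   */
    {325, 376, 478, 376}, /* from MSRAM */
    {427, 427, 529, 478}, /* from DDR3A */
    {376, 427, 478, 529}, /* from DDR3B */
};

/* Converts a cycle count to nanoseconds for a clock running at clock_hz. */
static double cycles_to_ns(unsigned cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return cycles * (1e9 / clock_hz);
}
```

    Under those assumptions, LL2 -> DDR3A through CC1 (376 cycles) works out to about 313 ns.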

    Thanks & regards,

    Sivaraj K


  • Thanks Sivaraj for the investigation and your information.

    I'm curious what the equivalent overhead would be for the C64x+. We are evaluating our code and its use of EDMA in different modules to determine whether we need to perform a significant rewrite. This information would help us decide whether to switch from EDMA data fetching to cache-based data fetching for small transfers.

    Many thanks.

    -- Louis

  • Hi Sivaraj,

    Is there someone who can provide, for the C6472, the same latency numbers that you have shown for the C6678 (Tables 1 and 2 above)?

    That comparison would be very important.

     

    Thanks,

    --Gunter

  • Hi Sivaraj,

    Two clarifications:
    [] We realize the latency data above is for a K2 device.
    [] What is a cycle in the tables above? Is it a CPU/1 cycle?

    Please let us know where we can find comparison C6472 overhead numbers.


    Thanks,
    --Gunter