TMS320C6678: C6678 Memory transfer (copy) performance

Part Number: TMS320C6678
Other Parts Discussed in Thread: MATHLIB

Hi,

I would like to ask what the fastest way is to transfer data from MSMC to MSMC.

The source buffer is in a virtual non-cached region, while the destination buffer is in standard cached MSMC. Caching is enabled at L1.

The buffer holds 2048 integers (8192 B, 8 kB); timings are measured with the TSCL/TSCH registers. I ran several tests:

1) Transferring data with memcpy using the real cached address instead of the virtual address for the source buffer, without any cache_inv or cache_wb, takes 2.3 us on average. Of course, in this case the data is not really transferred from memory to memory; it stays in the cache of the specific core. Memory performance (in cache) is 8 kB / 2.3 us = 3.48 GB/s (not even close to the 16 GB/s declared for MSMC, and this is in cache, which should be faster).

2) Transferring data with memcpy using the real cached address instead of the virtual address for the source buffer, doing cache_wb only, takes 3 us on average. The memcpy takes a little more than 2 us and the cache_wb a little less than 1 us. Memory performance (data is read from cache and written to MSMC after the wb) is 8 kB / 3 us = 2.7 GB/s.

3) Transferring data with memcpy using the real cached address instead of the virtual address for the source buffer, doing cache_wb and cache_inv, takes 6.1 us on average. The memcpy takes 4.5 us, the invalidate 0.9 us and the wb 0.7 us. I can't understand why the memcpy takes so much longer than before for the same operation. Memory performance (data correctly transferred MSMC to MSMC) is 8 kB / 6.1 us = 1.3 GB/s.

 

4) Transferring data with memcpy using the virtual address for the source buffer, doing cache_wb, takes 40 us on average. I will not go into the detail of the two operations; the memcpy is the one that takes around 39 us, and I can't understand why. Memory performance (data correctly transferred MSMC to MSMC) is 8 kB / 40 us = 200 MB/s.


5) Transferring data with EDMA needs no cache operations; data is correctly transferred from MSMC to MSMC and takes around 4 us. Let's say there is 1 us of overhead (even though I know it is less). Memory performance is 8 kB / 3 us = 2.7 GB/s.
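
For reference, the throughput figures in the five tests can be reproduced with a small host-side calculation. This is only a sketch: the helper name `throughput_gbps` is made up for illustration, 8 kB is taken as 8000 bytes (which is what the figures above imply), and the 1 us EDMA overhead is the guess from test 5, not a measured value.

```c
#include <assert.h>

/* Hypothetical helper (not part of any TI library): throughput in GB/s
 * (1 GB = 1e9 bytes) for `bytes` moved in `total_us` microseconds,
 * after subtracting a fixed `overhead_us` such as EDMA trigger cost. */
static double throughput_gbps(double bytes, double total_us, double overhead_us)
{
    double transfer_us = total_us - overhead_us;
    assert(transfer_us > 0.0);
    /* bytes per microsecond equals MB/s; divide by 1000 to get GB/s */
    return bytes / transfer_us / 1000.0;
}

/* Measured totals from the five tests, in microseconds. */
static const double test_total_us[5]    = { 2.3, 3.0, 6.1, 40.0, 4.0 };
/* Assumed overhead per test; only the EDMA case (test 5) subtracts 1 us. */
static const double test_overhead_us[5] = { 0.0, 0.0, 0.0, 0.0, 1.0 };

/* Throughput of test `i` (0-based) for the 8 kB buffer, in GB/s. */
static double test_throughput_gbps(int i)
{
    assert(i >= 0 && i < 5);
    return throughput_gbps(8000.0, test_total_us[i], test_overhead_us[i]);
}
```

Calling `test_throughput_gbps(0)` through `test_throughput_gbps(4)` gives roughly the 3.48, 2.7, 1.3, 0.2 and 2.7 GB/s quoted in the list.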

Based on this topic I would expect a much faster transfer. Is there something wrong with what I am doing?

I would like this data transfer to take the shortest time possible. For double access to MSMC (read and write) I would expect something close to 8 GB/s for the complete transfer. Am I wrong?

Any advice or suggestions would be much appreciated.

Thank you very much in advance for your help.

Best Regards,

Fabrizio

  • The SW team is notified. They will post their feedback directly here.

    Please share which Processor SDK RTOS version you are using.

    Best Regards,
    Yordan
  • BIOS 6.5.0.12
    MATHLIB C66x 3.1.1.0
    IPC 3.47.0.00
    EDMA3 2.12.5
    NDK 2.26.0.08
    c667x PDK 2.0.8
  • Hi,

    The best way to do an MSMC to MSMC transfer is to use EDMA. Please check www.ti.com/.../sprabh2a.pdf, section 5.3. You may get some improvement by changing your code, for example by using a TC with a 128-byte DBS and compiling with -o3, and you will probably need 2-3 EDMA transfers in parallel to achieve that number. Also, when you calculate the throughput, the EDMA setup time needs to be excluded.

    Regards, Eric
  • Hi,

    Thank you very much for your reply.

    I have already read the document, which is why I can't understand why the transfer takes so long.

    In my previous post I didn't mention that the measurements cover the transfer only. The channel setup is done beforehand and then only the transfer is triggered. I assumed 1 us of overhead; if it is more than that, please advise.

    Triggering multiple EDMA transfers in parallel looks like a good idea to me. How can I do that? How can I know when all the transfers are complete?

    Thank you in advance.

    Regards,

    Fabrizio

  • Hi,

    You need to set up a few EDMA channels (e.g. 2 or 3); for each channel, set up the OPT for the transfer parameters. For example, the first channel transfers the first half of the data and the second channel transfers the second half. Then, when selecting the transfer controller (TC), allocate the channels to different TCs.

    To trigger the transfers and check for completion, do the following sequentially on a single core:
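
    As a rough illustration of dividing the data between channels, the sketch below splits one copy into contiguous chunks, one per channel. The `chunk_t` type and `split_copy` function are hypothetical names, not part of the EDMA3 driver; each resulting chunk would still have to be programmed into that channel's transfer parameters, and the sketch does not try to align chunk boundaries to the TC's DBS.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical descriptor for one channel's share of the copy. */
    typedef struct {
        const uint8_t *src;   /* start of this channel's source region */
        uint8_t       *dst;   /* start of this channel's destination region */
        size_t         bytes; /* number of bytes this channel moves */
    } chunk_t;

    /* Split `total` bytes into `n` contiguous chunks and fill out[0..n-1].
     * All chunks get total / n bytes; the last one absorbs any remainder. */
    static void split_copy(const uint8_t *src, uint8_t *dst,
                           size_t total, size_t n, chunk_t *out)
    {
        size_t base, i;
        assert(n > 0);
        base = total / n;
        for (i = 0; i < n; ++i) {
            out[i].src   = src + i * base;
            out[i].dst   = dst + i * base;
            out[i].bytes = (i == n - 1) ? total - i * base : base;
        }
    }
    ```

    For the 8192-byte buffer in this thread and two channels, this yields two 4096-byte halves.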

    set ESR bit for channel 0
    set ESR bit for channel 1
    ....
    poll IPR bit for channel 0
    poll IPR bit for channel 1
    ....

    This guarantees that the EDMA transfers run in parallel most of the time.

    Or, if your code runs on multiple cores, you can have each core do one EDMA transfer: synchronize the cores, then set the ESR bits to start the transfers at the same time.
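
    The single-core trigger and poll sequence above could look roughly like the following sketch. It is written against plain pointers so the logic can be tried on a host; on the device they would point at the channel controller's ESR, IPR and ICR registers. The `edma_regs_t` type and `edma_start_and_wait` function are hypothetical names, and the sketch assumes channels 0-31 only, that each channel's completion code (TCC) equals its channel number, and that completion reporting is enabled in OPT so the IPR bit actually gets set.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical view of the three EDMA3 CC registers used here. */
    typedef struct {
        volatile uint32_t *esr; /* Event Set Register: write 1 to trigger a channel */
        volatile uint32_t *ipr; /* Interrupt Pending Register: 1 = transfer complete */
        volatile uint32_t *icr; /* Interrupt Clear Register: write 1 to clear an IPR bit */
    } edma_regs_t;

    /* Trigger every channel in chans[0..n-1] back to back, then wait for
     * each completion bit in turn and finally clear all of them. */
    static void edma_start_and_wait(const edma_regs_t *regs,
                                    const uint8_t *chans, size_t n)
    {
        uint32_t done_mask = 0;
        size_t i;

        /* Set the ESR bit for each channel first, so the transfers overlap. */
        for (i = 0; i < n; ++i) {
            assert(chans[i] < 32); /* ESR/IPR cover channels 0-31 only */
            *regs->esr = (uint32_t)1 << chans[i];
            done_mask |= (uint32_t)1 << chans[i];
        }

        /* Poll the IPR bit for each channel (assumes TCC == channel number). */
        for (i = 0; i < n; ++i) {
            while ((*regs->ipr & ((uint32_t)1 << chans[i])) == 0) {
                /* busy-wait until this channel reports completion */
            }
        }

        /* Clear the completion bits so the next run starts clean. */
        *regs->icr = done_mask;
    }
    ```

    Triggering all channels before polling any of them is what lets the transfers overlap; polling right after each trigger would serialize them again.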

    Regards, Eric
  • Hi,

    Thank you for your reply. Do you think that my results are as expected from your benchmarks?

    Thank you in advance,

    BR

    Fabrizio

  • Hi,

    Please check the document for the numbers we obtained. Can you use 2-3 EDMA channels in parallel to reach that number?

    Regards, Eric
  • Hi

    Thank you Eric, the parallel transfers sped up the copy.

    BR

    Fabrizio