TMS320C6657: Throughput measurement of PCIe between two EVMs

Part Number: TMS320C6657
Other Parts Discussed in Thread: TMS320C6655

Hello,

Could you please verify the results I obtained for a PCIe throughput measurement between two EVMs (TMS320C6657)?

I am measuring PCIe throughput using EDMA writes and EDMA reads between two EVMs (TMS320C6657); one is configured as Root Complex and the other as Endpoint.

Since the EDMA3 data burst size (DBS) is 64 bytes, I was expecting to reach a constant throughput for packet sizes above 64 bytes.

Please find the results below and help me understand why the throughput keeps increasing even for packet sizes > 64 bytes.

PCIe throughput for a x2 port at a PHY line rate of 2.5 Gbps (in Mbps):

| Packet size (bytes) | EDMA write, L2 source | EDMA write, DDR3 source |
| ------------------: | --------------------: | ----------------------: |
| 4     | 1.67    | 1.66    |
| 8     | 3.34    | 3.31    |
| 16    | 6.67    | 6.62    |
| 32    | 13.34   | 13.24   |
| 64    | 26.65   | 26.45   |
| 128   | 53.16   | 52.43   |
| 256   | 104.98  | 104.2   |
| 512   | 202.64  | 201.15  |
| 1024  | 378.77  | 376.33  |
| 2048  | 669.97  | 665.53  |
| 4096  | 1088.17 | 1082.38 |
| 8192  | 1573.95 | 1564.67 |
| 16384 | 2026    | 2013.02 |
| 32768 | 2365.77 | 2350.86 |
| 65536 | 2581.94 | 2565.9  |

Thank you

  • Hello Khouloud,

    Thanks for running these PCIe throughput benchmarks. This is a good data reference for others.

    Please note that we do not plan to run and verify this ourselves, as the software releases for this platform are frozen and no further development or enhancements are planned.

    Thanks.

  • Thank you for the reply. 

    I understand that, but I was expecting a particular behavior, since the data burst size equals 64 bytes.

    So, based on your expertise, do you think these results are correct?

  • Sure, we will check with our internal team and see whether we have any feedback on this benchmark data.

    Also, could you provide the following details:

    - Which core are the PCIe drivers running on?
    - What frequency is it running at?
    - Where does the data for the benchmark reside - is it DDR?
    - Is this the benchmark provided with the TI SDK release? If yes, which release version? If not, how was the benchmark test written?

    Thanks.

  • Hello!

    It is a bit unclear what "packet size" means in your measurements. My guess is that it is the EDMA transfer size. If so, there is overhead from EDMA submission and from the transfer request traveling through the EDMA internals, so whenever you submit a large chunk of data, the EDMA engine processes it in pieces of the burst size until the whole chunk is served. On a x1 Gen1 link with a 128B burst size I could get up to 212 MBps, or 1.7 Gbps, per lane. Your setup is x2 but with a 64B burst size, so top numbers like 2581 Mbps / 8 bits / 2 lanes = 161 MBps look as expected to me.
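    As a sanity check (my assumption here: Gen1 8b/10b encoding, i.e. 250 MBps of raw capacity per lane), that peak corresponds to roughly 64% of the per-lane raw rate:

    $$\frac{2581.94\ \text{Mbps}}{8\ \text{bits/byte} \times 2\ \text{lanes}} \approx 161\ \text{MBps per lane}, \qquad \frac{161}{250} \approx 64\%$$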

    Please understand that the PCIe subsystem has incoming buffer capacity, and thanks to the posted nature of write transactions, that incoming buffer can accept data for a subsequent transfer even while the previous one is still in progress. So if the transfer controller has more data to serve for the same EDMA transfer request, it pushes the data out as soon as there is room in the PCIe input buffer.

  • Hello Victor, thank you for your response.

    Just to clarify: my issue is the throughput with small packets.

    I am quite satisfied with the throughput for large packets; I account for 20 bytes of overhead for each 64 bytes when using EDMA.

    I would like to understand why the throughput is so low with small packets (4 bytes to 512 bytes).

    Shouldn't I already reach the maximum throughput when using 64-byte packets?

    Thank you

  • Hello Praveen,

    Thank you for your feedback.

    - Which core are the PCIe drivers running on? Core 0.

    - What frequency is it running at? The system reference clock is 1 GHz; the PCIe reference clock is 100 MHz.

    - Where does the data for the benchmark reside - is it DDR? I tested both: in the table above, the "EDMA write, L2 source" column uses L2 SRAM buffers and the "EDMA write, DDR3 source" column uses DDR3 buffers.

    - Is this the benchmark provided with the TI SDK release? No. The TI SDK benchmark does not build in my case (I get compilation errors; I don't think it works for the C6657), so I took the SRIO benchmark as inspiration and adapted it to PCIe.

  • Hi,

    Could you please clarify: what is a "packet" in your explanation?

  • Hello,

    For me, a packet is the payload sent (without overhead). Let me explain the steps of my test; a sketch follows the list:

    1. I created a section in the (.cfg) file, located in L2 SRAM.

    2. I then allocated the source buffer in this section.

    3. I fill this buffer with data -> send it to the endpoint -> check some special characters on the receiving side.
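    A minimal sketch of these steps on the source side, assuming the .cfg file places a section named ".pcieSrcBuf" in L2 SRAM (the section and symbol names are illustrative, not from any TI SDK):

    ```c
    /* Illustrative only: source-buffer setup for the test described above.
     * Assumes the .cfg file maps the ".pcieSrcBuf" section into L2 SRAM. */
    #include <stdint.h>
    #include <string.h>

    #define MAX_PKT_SIZE  65536u
    #define TEST_PATTERN  0xA5

    #pragma DATA_SECTION(srcBuf, ".pcieSrcBuf") /* steps 1-2: buffer in L2 SRAM */
    #pragma DATA_ALIGN(srcBuf, 64)              /* align to the 64B EDMA burst  */
    static uint8_t srcBuf[MAX_PKT_SIZE];

    /* Step 3: fill the buffer with a pattern plus marker bytes that the
     * endpoint checks after the EDMA transfer completes. */
    void prepareSourceBuffer(uint32_t pktSize)
    {
        memset(srcBuf, TEST_PATTERN, pktSize);
        srcBuf[0]           = 0x5A;             /* the "special characters"     */
        srcBuf[pktSize - 1] = 0x5A;
    }
    ```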

  • So let me ask this another way: when you say a 64B packet, do you mean the source buffer holds 64 bytes and the EDMA transfer size is 64 bytes too? If so, there is EDMA submission overhead as well. Beyond that, one should consider that PCIe does not start delivering your data immediately: there is some latency between the moment the data are presented and the actual transmission. If you repeatedly submit short transfers, you will never see anything close to the advertised PCIe throughput. However, if you present a long piece of data, that latency disappears behind the actual transfer time. Moreover, when consecutive transfers are presented, there is a gap between the previous TLP's EOF and the next TLP's SOF, as in the diagram below.

    I believe throughput measurements are meaningless unless one presents some really large chunks of data.

  • Many thanks for this clear response.

    I am just wondering then,

    Q1: In my case (EDMA3, DBS = 64B), starting from which size should I obtain the maximum throughput?

    Q2: Is it the same for CPU writes as well?

    I really appreciate your help 

    Thank you

  • Hi again.

    Q1. Here is your equation: the numerator is the useful data; the denominator is the sum of the useful data, the data overhead, the EDMA submission and processing latency, and whatever gaps remain. With this, the larger your chunk of data, the better the throughput, which will indeed rise quickly at first and more slowly afterwards. I cannot say exactly where this asymptote saturates, though my experiments show it happens somewhere above hundreds of kilobytes, when hundreds or thousands of TLPs are used to transfer the chunk. So it is nowhere near the DBS, but many multiples of the DBS. I believe this happens because there is multiple buffering on the data path, and each command adds latency (EDMA, PCIe). I believe the measurements in your original post show that your system operates just fine; one cannot expect more.

    Q2. CPU access, or so-called PIO (Programmed I/O), is a completely different story. To the best of my knowledge this kind of access is served with one DWORD of payload, so your useful capacity is just 4 bytes against 12 to 16 bytes of overhead at the transaction layer, and even more at the data link layer. Things also depend heavily on whether it is a posted (write) or non-posted (read) transaction. With the former, on a Gen1 x1 link we got up to 45 MBps, but for reads only 2 (TWO) MBps. Thus DMA is crucial when data must be pumped from outside into the DSP.
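    To put rough numbers on this, using the overhead figures quoted in this thread (about 16 bytes per TLP at the transaction layer for PIO, about 20 bytes per 64B burst for EDMA; both approximations), the per-TLP efficiency bounds are:

    $$\eta_{\text{PIO}} \lesssim \frac{4}{4+16} = 20\%, \qquad \eta_{\text{EDMA,64B}} \lesssim \frac{64}{64+20} \approx 76\%$$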

  • Hello Victor, thank you very much for your support and clear responses.

    My actual equation (for EDMA) is: throughput = useful data / (useful data + overhead), where overhead = (useful data / 64 bytes) * overhead per 64-byte burst.

    ECRC enabled ==> overhead per 64 bytes = 24 bytes

    ECRC disabled ==> overhead per 64 bytes = 20 bytes
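    So, ignoring latency, the best-case (asymptotic) efficiency from this equation would be:

    $$\eta_{\text{ECRC on}} = \frac{64}{64+24} \approx 72.7\%, \qquad \eta_{\text{ECRC off}} = \frac{64}{64+20} \approx 76.2\%$$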

    If I understand correctly, I should also add the latency. Is it added per 64B packet?

    Q1: How much is the data overhead?

    Q2: How much is the EDMA submission latency?

    Q3: How much is the processing latency?

    Q4: In the case of PIO, how much is the overhead? Can it be calculated for 1 kB of data, for example?

    Thank you!

  • Hi,

    I did not mean for you to really work through your equation, but rather to give you the idea that in a/(a+b) the ratio improves as 'a' increases, yet there is no perfect point as long as 'b' is nonzero. What is important is that your 'b' term contains not only the overhead, but all latencies and delays. My overhead numbers are somewhat different, but the point is that they are consistent and fixed, and can be found simply by reading the spec. Delays and latencies, on the other hand, are sometimes hard to quantify. Still, we can get some idea of what helps. In the picture I showed in a previous message, there is a clear pause on the TRN interface between two write TLPs. I cannot say for sure whether 2 clocks is okay or not, but I saw it consistently in my setup. Now, if there is always a 2-clock delay, then in a/(a+b+2) I always get a better result when the 'a' term, the payload, is larger. That holds true up to the DBS of 64B, and it suggests we are better off using DMA access with its multi-DWORD payload capacity rather than CPU-initiated PIO transfers with only one DWORD of payload.

    Above this scheme, however, there are also EDMA latencies and delays. I cannot tell for sure how long it takes between the transfer request and the first bit of data appearing on the link, but I think about it the following way. Imagine we sequentially submit transfers of DBS size. The EDMA engine must detect each request, pass it through its internal path to the actual transfer controller, and the latter then executes the transfer itself. Now compare that to the situation where a kilobyte of data is requested in one transfer. There is still a delay for request detection, and a delay for passing the request down to the transfer controller, but it happens once per request. Moreover, knowing that a larger piece of data has to be transferred, the transfer controller can pump data towards the PCIe input buffer even while the previous portion is still being sent on the serial link. So this example tells us that large requests reduce the impact of the EDMA engine's processing delays.
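    Here is a toy model of that amortization, a sketch with assumed (not measured) cost constants: a fixed per-request submission cost plus a fixed per-64B-burst overhead. Only the shape of the curve is meaningful, not the absolute numbers:

    ```c
    /* Toy efficiency model: the per-request submission cost is amortized
     * over the whole transfer, while the per-burst overhead is not. The
     * cost constants below are assumed placeholders, not C6657 data. */
    #include <stdio.h>

    #define DBS             64u     /* C6657 EDMA3 default burst size, bytes */
    #define BURST_OVERHEAD  20.0    /* bytes per 64B TLP, ECRC disabled      */
    #define SUBMIT_COST     500.0   /* assumed fixed submission cost, in     */
                                    /* byte-times on the link                */

    static double efficiency(unsigned bytes)
    {
        unsigned bursts = (bytes + DBS - 1u) / DBS;       /* ceil(bytes/DBS) */
        return (double)bytes
             / ((double)bytes + bursts * BURST_OVERHEAD + SUBMIT_COST);
    }

    int main(void)
    {
        const unsigned sizes[] = { 64u, 512u, 4096u, 65536u };
        for (int i = 0; i < 4; i++)                       /* rises fast, then */
            printf("%6u B -> %5.1f%% of line rate\n",     /* saturates slowly */
                   sizes[i], 100.0 * efficiency(sizes[i]));
        return 0;
    }
    ```

    With these placeholder constants the model climbs from roughly 11% at 64B to roughly 76% at 64KB, approaching the 64/(64+20) asymptote only for large chunks, which is the same qualitative shape as the table in the original post.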

    The EDMA submission latency depends on the usage scenario. For instance, one may configure all descriptor fields each time, or have them preconfigured. I cannot simply say what it takes; I have never measured it.

    In the case of PIO, your calculation is simple: there is just one DWORD of payload per write transaction. Then, depending on which layer you look at, the actual overhead will differ. See the picture below.
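    For Q4 specifically, assuming roughly 20 bytes of overhead per TLP (a round figure consistent with the numbers above), 1 kB moved as single-DWORD PIO writes takes 256 TLPs, so:

    $$\eta \approx \frac{1024}{1024 + 256 \times 20} = \frac{1024}{6144} \approx 16.7\%$$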

  • Do you have any news from the internal team?

    I found benchmark tests performed on the C66xx Keystone devices by TI in this document:
    e2e.ti.com/.../Throughput-Performance-Guide-for-C66x-KeyStone-Devices-_2800_Rev.-A_2900_-_2800_7_2900_.pdf

    This document shows considerably higher throughput than the measurements in the first post.
    Are the performance tests in this document still valid, or do you have an updated version of the document?

  • Hello,

    That test was performed using two C6678 devices. They provide a Gen2 x2 link (four times the raw rate of a Gen1 x1 link), plus the C6678 can use a DBS of 128B. That is the simple explanation for why the measurements in that document exceed the topic starter's numbers on the C6657.

    Before you ask: later experiments found the inbound 256B burst size capability to be pointless, as its actual performance is worse than with 128B.

    Second, despite being documented as capable of producing 128B outbound transfers, in reality the maximum is 64B. I have seen this with a scope on the FPGA side against a C6670. Because of that, the mentioned document specifies read performance higher than write performance, which is nonsense. I have raised this question before; the most I got was an acknowledgement that my observations look correct, but no further action was taken.

  • Hello,

    Do you have any news from the internal team?

    We don't have anything to add from our side. Just letting you know that this device is considered a "legacy device", and there will be limited support based on the existing SDK and documentation. The engineers who worked on this have moved on, so we cannot get you any news from them.

    Are the performance tests in this document still valid, or do you have an updated version of the document?

    Yes, they are valid for the device that is mentioned in the App Note.

    Thanks.

  • Hi Victor

    Thanks for your reply.
    The specifications for the C6678 and C6657 are identical (according to the datasheets), so it should be possible to achieve results on the C6657 comparable to the C6678.

  • Hi Praveen

    Thanks for your reply.

    Just to clarify - the TMS320C6655 and C6657 are "Active" devices but you do not provide any support for them?

    Do you have a list of "supported" DSP devices (not SoCs)?

  • Just to clarify - the TMS320C6655 and C6657 are "Active" devices but you do not provide any support for them?

    Yes, devices listed as "Active" can be bought on ti.com. The software support for these is provided as PROCESSOR-SDK-C665X; support will be provided for this release, and anything outside it will be limited and best-effort only.

    Do you have a list of "supported" DSP devices (not SoCs)?

    See ti.com: https://www.ti.com/microcontrolers-mcus-processors/digital-signal-processors/products.html

    Thanks.

  • Hi Praveen

    Thanks for the link, but this is the list of "Active" devices. The TMS320C5557 is in this list.
    I would like a list of "NON-legacy" devices.

  • I would like a list of "NON-legacy" devices.

    All available devices are on ti.com. For any roadmap devices not listed, we suggest contacting your local TI sales representative for details.

    Thanks.

  • Yes, I know all available devices are on ti.com

    I am asking for a list of DSPs for which I can expect support from you.

    Take the list of "Active" devices and remove the legacy devices (like the C6655/57/78); that should yield a list of devices for which I can expect support from TI if necessary.

  • Allow me to disagree. The C667x and C665x differ in their EDMA3 features: the former can use a burst size of 128B, while the latter supports at most 64B. This difference will show up as a performance difference.