
Realistic data throughput of C667x PCIe Gen2 x2

Hi,

The theoretical data throughput of a PCIe Gen2 x2 link is 1 GB/s. Has anyone tested the C6678 PCIe Gen2 x2 data throughput? What is a realistic figure I can expect from this link?

I plan to transfer about 560 MB/s of video data through this PCIe link from a video frame grabber to the C6678's DDR3 memory. Will the video data be delivered by DMA to the DDR3 memory?

regards,

Hui Peng

  • Hui,

    Based on our initial performance data, you should have a good margin over your bandwidth requirement of 560 MB/s. You will have to program the EDMA3 controller to transfer data to external DDR3 memory. Please note that the choice of EDMA transfer controller (TC) will impact the overall system bandwidth: a TC with a larger burst size will give better performance than one with a smaller burst size. The PCIe user guide (SPRUGS6A) has pseudo-code examples in Section 2.8.3 that use EDMA3 for data transfer to and from PCIe memory. We will be posting a performance app note on the website in the near future.
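
    For reference, here is a minimal sketch of the kind of EDMA3 PaRAM setup those pseudo-code examples describe, assuming the standard 32-byte PaRAM entry layout; the addresses, counts, and OPT settings below are illustrative placeholders rather than values from SPRUGS6A:

    /* Sketch only: one EDMA3 PaRAM entry for an AB-synchronized copy from the
     * C6678 PCIe data window into DDR3. Channel mapping, TC selection, and
     * triggering are omitted -- see SPRUGS6A Section 2.8.3 for the full flow. */
    #include <stdint.h>

    typedef struct {
        uint32_t opt;               /* transfer options (SYNCDIM, TCC, ...)   */
        uint32_t src;               /* source address                         */
        uint16_t acnt, bcnt;        /* bytes per array, arrays per frame      */
        uint32_t dst;               /* destination address                    */
        int16_t  srcBIdx, dstBIdx;  /* address offset between arrays          */
        uint16_t link, bcntRld;     /* PaRAM link / BCNT reload               */
        int16_t  srcCIdx, dstCIdx;  /* address offset between frames          */
        uint16_t ccnt, rsvd;        /* frames per transfer, reserved          */
    } EdmaParam;                    /* 32-byte PaRAM entry, little-endian view */

    void setup_pcie_to_ddr3(volatile EdmaParam *prm)
    {
        prm->opt     = (1u << 2);       /* SYNCDIM = 1: AB-synchronized          */
        prm->src     = 0x60000000u;     /* PCIe data window (placeholder)        */
        prm->dst     = 0x80000000u;     /* DDR3 destination buffer (placeholder) */
        prm->acnt    = 1024;            /* 1 KB per array                        */
        prm->bcnt    = 64;              /* 64 arrays -> 64 KB per trigger        */
        prm->ccnt    = 1;               /* single frame                          */
        prm->srcBIdx = 1024;            /* contiguous arrays                     */
        prm->dstBIdx = 1024;
        prm->link    = 0xFFFFu;         /* no PaRAM linking                      */
        prm->bcntRld = 0;
        prm->srcCIdx = 0;
        prm->dstCIdx = 0;
    }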

  • Hui,

    The theoretical throughput that you quoted includes packet overhead. If you exclude the overhead, the theoretical throughput actually drops below 1 GB/s; how far below depends on the packet size.
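
    To make that concrete, here is a rough back-of-the-envelope estimate; the ~24 bytes of per-TLP framing overhead is a typical PCIe figure assumed for illustration, not a number from this thread:

    /* Rough estimate of Gen2 x2 effective throughput after TLP overhead. */
    #include <stdio.h>

    int main(void)
    {
        /* 5 GT/s per lane, 2 lanes, 8b/10b encoding -> 1 GB/s of raw bandwidth. */
        const double raw_bytes_per_s = 2 * 5.0e9 / 10.0;
        const double payload_bytes   = 128.0;  /* TLP payload size assumed here   */
        const double overhead_bytes  = 24.0;   /* assumed TLP + framing overhead  */

        double effective = raw_bytes_per_s * payload_bytes
                           / (payload_bytes + overhead_bytes);
        printf("~%.0f MB/s of raw data\n", effective / 1e6);   /* ~842 MB/s */
        return 0;
    }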

    Measured Throughput

    The measured throughput for PCIe on the C6678 device depends on the packet size: larger packets achieve better throughput than smaller packets. When using DDR3 as a source/destination with PCIe, we have measured the throughput to be 6-7 Gb/s (750-875 MB/s) including overhead, and 5-6 Gb/s (625-750 MB/s) when excluding overhead (raw data). The measured difference in performance between reads and writes was negligible.

    Since you indicated that you will be using PCIe to transfer video data, I assume that you will be using large packets, so I expect that you will see throughput at the higher end of the range above. Whether using small packets or large packets, you should be able to achieve the desired 560 MB/s.

    Regards,

    Derek

  • Hi Derek and Aditya,

    Thank you for the reassuring information that the C667x PCIe Gen2 x2 link can support my required data throughput.

    regards,

    Hui Peng

  • Hui,

    You are welcome. Can you please mark the thread as verified?

    Regards,

    -Aditya

  • Hi Aditya,

    How can I mark the thread as verified?

    regards,

    Hui Peng

  • Hello Derek,

    May I know how the PCIe throughput measurement was conducted? Was the C6678 device functioning as the Root Complex or the End Point? What was the other PCIe device in the test? What OS was running on the TI C6678 in this test, and what OS was running on the other PCIe device?

    It would be great if you could provide some details of the test setup.

    Thank you in advance.

    Regards,

    Hui Peng

  • Hui Peng,

    The test was conducted by connecting two C6678 devices together using SMA cables. The transmitter was acting as the root complex, and the receiver was acting as the end point. Both devices were using the TI Chip Support Library (CSL). Neither device was running an OS.

    The throughput was measured using the TSC register and the number of bytes transferred. Immediately before starting the test, the TSC register was read to get the starting cycle count. Then the data transfer began. Immediately after the data transfer completed, the TSC was read again. The throughput was then calculated using the following formula:

    throughput = bytes_transferred / test_duration

    where test_duration = stop_cycle_count - start_cycle_count

    The DSP was operating at 1 GHz for these tests, so each cycle takes 1 nanosecond and the difference between the start and stop cycle counts directly yields the duration of the test in nanoseconds.
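
    In case it helps, here is a minimal sketch of that measurement, assuming the TI C6000 compiler's c6x.h control-register symbols (TSCL/TSCH); the transfer function is a hypothetical placeholder:

    #include <c6x.h>      /* TSCL/TSCH control registers, _itoll intrinsic */
    #include <stdio.h>
    #include <stdint.h>

    extern void do_pcie_transfer(uint32_t bytes);   /* hypothetical transfer under test */

    void measure_throughput(uint32_t bytes_transferred)
    {
        TSCL = 0;                       /* any write to TSCL starts the counter */

        uint32_t lo0 = TSCL;            /* reading TSCL latches TSCH            */
        uint32_t hi0 = TSCH;
        uint64_t start = _itoll(hi0, lo0);

        do_pcie_transfer(bytes_transferred);

        uint32_t lo1 = TSCL;
        uint32_t hi1 = TSCH;
        uint64_t stop = _itoll(hi1, lo1);

        /* At 1 GHz one cycle is 1 ns, so cycles == nanoseconds. */
        uint64_t test_duration_ns = stop - start;
        double mb_per_s = (double)bytes_transferred / (double)test_duration_ns * 1000.0;
        printf("Throughput: %.1f MB/s\n", mb_per_s);
    }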

    Regards,

    Derek

  • Hi Derek,

    Measured Throughput

    The measured throughput for PCIe on the C6678 device depends on the packet size: larger packets achieve better throughput than smaller packets. When using DDR3 as a source/destination with PCIe, we have measured the throughput to be 6-7 Gb/s (750-875 MB/s) including overhead, and 5-6 Gb/s (625-750 MB/s) when excluding overhead (raw data). The measured difference in performance between reads and writes was negligible.

    What were the payload sizes of the larger packets and smaller packets in the test above?

    Regards,

    Michael

  • Hi Michael,

    I guess the payload size you mentioned is the data payload in each Transaction Layer Packet (TLP) in the PCIe protocol.

    The maximum payload size for the PCIe module in C66x devices is fixed at 128 bytes for outbound transactions (our local device initiates the read/write transfer to/from the remote device) and 256 bytes for inbound transactions (the remote device initiates the read/write transfer to/from our device). This is mentioned in the PCIe user guide.

    For example, a 2048-byte raw data buffer will be split and transferred as 2048/128 = 16 packets. Typically, the larger the payload size, the less overhead per PCIe transaction and the better the effective data throughput.

    The "larger packets" and "smaller packets" Derek mentioned actually refer to the buffer sizes used in the throughput testing, which ranged from 2 KB and 4 KB (smaller) to 18 MB and 24 MB (larger). The buffer size can also affect the throughput numbers, but the TLP payload size in the PCIe module is the same in every scenario.
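
    As a quick illustration of that arithmetic (just a sketch; the buffer sizes below are examples):

    #include <stdio.h>
    #include <stdint.h>

    /* Number of TLPs needed to move a buffer with a fixed payload size. */
    static uint32_t tlp_count(uint32_t buffer_bytes, uint32_t payload_bytes)
    {
        return (buffer_bytes + payload_bytes - 1) / payload_bytes;  /* round up */
    }

    int main(void)
    {
        printf("2048-byte buffer, outbound (128 B payload): %u TLPs\n",
               tlp_count(2048, 128));   /* 16 */
        printf("2048-byte buffer, inbound  (256 B payload): %u TLPs\n",
               tlp_count(2048, 256));   /*  8 */
        return 0;
    }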

    Hope it helps. Thanks.

     

    Sincerely,

    Steven

  • Steven Ji,

    If you would, please be more specific about the data points that you have stated.

    Derek's post said that there was a "throughput range of 5-6 Gb/s (625-750 MB/s) when excluding overhead (raw data)." Your post says that the buffer sizes "ranged from 2 KB and 4 KB (smaller) to 18 MB and 24 MB (larger)."

    Your post seems to be saying that the 5 Gb/s rate corresponds to the 2 KB and 4 KB buffer sizes, while the 6 Gb/s rate corresponds to the 18 MB and 24 MB buffer sizes. Is it true that you get 5 Gb/s with the 2 KB and 4 KB buffer sizes and 6 Gb/s with the 18 MB and 24 MB buffer sizes?

    Thank you for the data and the clarifications.

    Regards,
    RandyP

  • Hi RandyP,

    Thanks a lot for bringing it up. I think some clarification of my previous reply is needed.

    We tried transactions with 2 KB, 4 KB, and 8 KB buffer sizes between different memory endpoints (L2->L2, MSMC->L2, DDR->L2). They all show the same trend: the throughput (Tput) of 8 KB > Tput of 4 KB > Tput of 2 KB for each combination. However, there is not much difference in Tput between the memory combinations (for example, the Tput for L2->L2 at 8 KB is almost the same as the Tput for DDR->L2 at 8 KB).

    We also tried a few data points for DDR->DDR, such as 18 MB and 24 MB. The Tput at 24 MB is almost the same as at 18 MB, not much better.

    However, it is not appropriate to compare the Tput of 24 MB for DDR->DDR with the Tput of 8 KB for DDR->L2; the former is actually slower than the latter, even though 24 MB is larger than 8 KB.

    So it is probably better to say that the Tput of 8 KB (larger buffer, around 6 Gb/s) is better than the Tput of 2 KB (smaller buffer, around 5 Gb/s) as the effective throughput for L2->L2 or DDR->L2, and that the Tput of 18 MB (around 5.4 Gb/s) is almost the same as the Tput of 24 MB (around 5.5 Gb/s) for DDR->DDR.

    Hope it is clearer now. Thanks.

     

    Sincerely,

    Steven

  • This is very helpful, in my opinion.

    Thank you.

    RandyP

  • Thanks Steven.

    But I still have some questions.

    1. Regarding your statement that "the TLP payload size in the PCIe module is the same in every scenario": do you mean that you used the same payload size when doing the throughput testing, and that the numbers are 128 bytes for outbound transactions and 256 bytes for inbound transactions?

    2. Is the buffer size used for the throughput testing the buffer size handled by the DMA engine of the C66x (inside the PCIe controller or a dedicated DMA)? I mean, these are not PIO transactions, right?

        Is that why the buffer size in a DMA descriptor may affect the throughput numbers?

     

    P.S. Some data shows that the realistic data throughput of USB 3.0, which is also 5 Gbps, is about 250 MB/s. But on the C66x it seems easy to achieve 560 MB/s / 2 = 280 MB/s per lane (5 Gbps). Just wondering.

  • Hi Derek,

    I would like to connect the C6678 evaluation board TMDXEVM6678LE, which is an AMC form-factor card, to a video frame grabber PCIe card. Is it possible to use the SMA cable setup? Can you suggest some proven methods of putting the two cards together?

    I was thinking of putting the evaluation board on an AMC-to-PCIe adapter and then putting the two cards on a PCIe backplane, with the evaluation board as the root complex and the frame grabber card as the end point. Is this possible? My concern is that once the evaluation board is put into an AMC-to-PCIe adapter, it can only work as an end point.

    Regards,

    Hui Peng

     

  • Hi Michael,

    The answer is Yes for both of your questions.

    The payload size is fixed, and we use DMA for the outbound transactions. For inbound, the remote device initiates the transaction (we use another C66x as the remote device, also using DMA), and the local device's PCIe master port delivers the data to the destination without using the DMA engine.

    Regarding the DMA parameter setup, it seems that A-synchronized (A-sync) transfers achieve better throughput. You can experiment with these parameters in your own testing as well.
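
    In case it helps, the A-sync vs. AB-sync choice comes down to the SYNCDIM bit in the EDMA3 PaRAM OPT word; a minimal, self-contained illustration (the helper name is ours, not from the CSL):

    #include <stdint.h>

    #define EDMA3_OPT_SYNCDIM   (1u << 2)   /* OPT bit 2: 0 = A-sync, 1 = AB-sync */

    /* Illustrative helper: select the EDMA3 synchronization mode in an OPT word. */
    static uint32_t edma3_opt_set_sync(uint32_t opt, int ab_sync)
    {
        return ab_sync ? (opt | EDMA3_OPT_SYNCDIM) : (opt & ~EDMA3_OPT_SYNCDIM);
    }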

    I am not sure about USB 3.0, but you are welcome to post your measured PCIe throughput after you have done some testing on your system. It would be great to share your experience. Thanks.

     

    Sincerely,

    Steven

  • Hui Peng Lim said:

    Hi Derek,

    I would like to connect the C6678 evaluation board TMDXEVM6678LE, which is an AMC form-factor card, to a video frame grabber PCIe card. Is it possible to use the SMA cable setup? Can you suggest some proven methods of putting the two cards together?

    I was thinking of putting the evaluation board on an AMC-to-PCIe adapter and then putting the two cards on a PCIe backplane, with the evaluation board as the root complex and the frame grabber card as the end point. Is this possible? My concern is that once the evaluation board is put into an AMC-to-PCIe adapter, it can only work as an end point.

    Regards,

    Hui Peng

    Hi Hui Peng,

    We have done two-board PCIe testing, C66x to C66x, via SMA cables. Basically, we have SMA connectors for the PCIe signals on each board and connect the two PCIe modules via SMA cables, one working as the RC and the other as the EP.

    So if you can find SMA adapters or some other means to connect the EVM and the video card via SMA cables, it should be OK. You probably want to try a common PCIe reference clock first in your testing. A common clock gives the two PCIe devices a better chance of working together, although we tried both common and separate reference clocks in our C66x-to-C66x test and both worked.

    The backplane seems like a neat option. I do not have experience with it, but I think a couple of teams within TI (and some customers) are working on it. I am not sure whether the EVM can only work as an end point; probably an external host is needed on the backplane for the configuration. But one PCIe end point can still initiate data transactions to another EP after the configuration.

    Hopefully I (or somebody else) can post an update later on the connection options, and it would be great if you shared your progress here as well. Thanks.

     

    Sincerely,

    Steven

     

  • Hi,

    Is there a TI C6678 PCIe software development kit to ease the design effort for PCIe?

    regards,

    Hui Peng

  • Hi Hui Peng,

    You can get software support from the Multicore Software Development Kit (MCSDK) for C66x devices.

    In MCSDK v2.0 BETA, we only have CSL register-level support for PCIe in the package (so you will probably need to configure the registers directly).

    More PCIe support is listed as a future feature for the MCSDK in the release notes.
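
    To give an idea of what register-level configuration looks like, here is an illustrative sketch; the base address and offset are placeholders to be checked against the C6678 data manual and SPRUGS6A, and the helper names are ours:

    #include <stdint.h>

    #define PCIESS_APP_BASE   0x21800000u   /* placeholder: PCIESS application register base */
    #define PCIE_CMD_STATUS   0x004u        /* placeholder register offset                   */

    /* Write a PCIESS application register through its memory-mapped address. */
    static inline void pcie_reg_write(uint32_t offset, uint32_t value)
    {
        *(volatile uint32_t *)(PCIESS_APP_BASE + offset) = value;
    }

    /* Read a PCIESS application register. */
    static inline uint32_t pcie_reg_read(uint32_t offset)
    {
        return *(volatile uint32_t *)(PCIESS_APP_BASE + offset);
    }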

     

    Sincerely,

    Steven

     

  • Hi Steven,

    Can you provide me with the code for the PCIe throughput test, especially on the root complex side? I wish to understand how the root complex's PCIe boot-up configuration enumerates PCIe devices and assigns PCIe addresses. I also wish to understand the DMA settings on the root complex side for transferring the incoming data to external DDR3 SDRAM.

    Regards,

    Hui Peng