This thread has been locked.


PCIe throughput measurements



Hi all,

While measuring the throughput of PCIe Memory Write transactions through a PCIe switch, I sometimes obtained better throughput than in my earlier measurements using the BOC between two C6678 DSPs!

That looks abnormal, doesn't it?

Some details:

- Using EDMA: the throughput through the PCIe switch is lower than with the BOC (this looks normal, OK)

- Using CPU:

1 - For data sizes below roughly 1 KB: the throughput through the PCIe switch is better than with the BOC (this looks abnormal; is there an explanation for that?)

2 - For data sizes above 1 KB: the throughput through the PCIe switch is lower than with the BOC (this looks normal, OK)


PS: the throughput is measured with the TSCL/TSCH counter.

  • Delared,

What program is running on the DSP? Is it the PCIe example project from the MCSDK/PDK package?

    Regards, Eric

     

  • Eric,

Yes, my programs are based on the example project in the MCSDK/PDK provided by TI.

  • Delared,

The PCIe example uses the CPU only (not EDMA) to move data, and there is no throughput measurement code in it, so I assume you added something extra on top of it. How do you measure the throughput with TSCL/TSCH? Is it the time delta between the first data word received and the last one (PCIE_EXAMPLE_BUF_FULL) on the EP side? I think it is possible that the PCIe switch buffers data and delays transmission. The switch may accept the data faster (buffering it first) than in the BOC case, and the effect is pronounced for small transfers (<1 KB in your case). On the RC side, if you add code to count the time between the first data word being written to the EP and the last word being looped back, do you see the PCIe switch case taking longer than the BOC case?

    Regards, Eric

      

  • Eric,

Yes, I made some modifications to perform PCIe memory transactions using EDMA, but I care more about the CPU case:

- Throughput measurement method (requester side, RC):

    /* RC writes data to EP using the CPU */

    tsc_start = _CSL_tscRead();
    for (i = 0; i < PCIE_BUFSIZE_APP; i++)
    {
        *((volatile uint32_t *)pcieBase + i) = srcBuf[i];
    }
    tsc_stop = _CSL_tscRead();
    tsc_cycle = tsc_stop - tsc_start;

    measured throughput = data size / (tsc_cycle * (1 / freq_CPU))

Note: a TI employee recommended this measurement method to me.

- OK, I will try to measure and compare the loopback throughput. Meanwhile, you said that "if the switch has buffered and delayed data transmission, maybe it can accept the data faster (buffering it first) than in the BOC case."

Does that mean:

1. The data transfer appears faster due to buffered transmission.
2. On the RC side, I am only measuring the transfer between the DSP RC (MSMC) and the switch (buffer), and not really the transfer from RC (MSMC) to EP (MSMC)?

I need more explanation, please!

PS: I'm using MSMC for all PCIe data transfers.

  • Delared,

I think the DSP PCIe peripheral and the PCIe switch both have some buffers. At the SERDES level, each write needs an acknowledgement before the next one can go out, but this is transparent to software. In the PDK test application, there is no issue with measuring the cycles like this:

    tsc_start = _CSL_tscRead();
    for (i = 0; i < PCIE_BUFSIZE_APP; i++)
    {
        *((volatile uint32_t *)pcieBase + i) = srcBuf[i];
    }
    tsc_stop = _CSL_tscRead();

Writes posted by software complete independently of whether the actual PCIe transfer has finished, at least for the first few dozen writes, as long as the buffer is not full. Once the buffer is full, the next software PCIe write can only complete when buffer space becomes available again.

I only have a BOC setup. I tried transferring 2000 words (#define PCIE_BUFSIZE_APP 2000) and logged the delta for each data transfer:

    for (i = 0; i < PCIE_BUFSIZE_APP; i++)
    {
        old = TSCL;
        *((volatile uint32_t *)pcieBase + i) = srcBuf[i];
        gBuffer[i] = TSCL - old;
    }

I saw that for the first ~100 words, the delay is uniform at around 17 cycles. After that, every 16th transfer shows a bump to ~0x120 cycles, and this behavior continues to the end of the 2000 words. This seems to indicate that the buffer is initially empty, so you can write faster from the software point of view (data attached).

In your setup with a switch in between, it is possible that the switch has a bigger buffer, which makes the transfer look faster from the software point of view. Maybe you can try my approach for the first 2000 words in the switch case. Or you can do the measurement at the EP side; that would give a clearer explanation. (Attachment: 5074.cycle_2048.dat)

    Regards, Eric

  • Eric,

Thank you for your help. I have some questions:

1. Given the switch's buffering mechanism, and my throughput measurement being limited to the RC side, the measurements reflect the transfer between the RC and the switch buffer.

Note: in x2 mode, the switch has an Ingress Frame Buffer (IFB) that supports up to 32 TLPs. With 32-bit CPU writes, each TLP carries 4 B of payload, so 32 * 4 B = 128 B.

That could explain the measurement peak for data < 128 B. Beyond that, the buffer is full, so the CPU has to wait for the switch to release stored TLPs, and with this delay the performance drops.

Overall, I would say that with this measurement method (RC side), I am only measuring the RC-to-switch-buffer transfer (posted transactions) and not RC-to-EP.

Do you think that is right?

2. Also, this measurement difference does not appear when using EDMA (throughput_BOC > throughput_switch in all cases). May I ask why?

Unfortunately, I cannot yet perform memory reads to compare the two measurements (posted vs. non-posted).

3. On the other hand, when using EDMA, the effect of the switch is considerable! (e.g., for a 1 KB transfer, throughput_BOC = 4.16 Gbps vs. throughput_switch = 2.87 Gbps)

Are there explanations or hypotheses for this limitation (possible causes)?

4. How can I do the measurements at the EP side?

  • Delared,

The DSP PCIe peripheral also has its own buffer (I don't know how big it is). The PCIe data link layer has an acknowledgement protocol to ensure reliable delivery of TLPs. When you write data from the RC to the EP through a switch, the switch's egress buffer drains the packets toward the EP. Any slowness in the path back-pressures the RC side, causing the switch's ingress buffer and the RC Tx buffer to fill up, until eventually the write can no longer be posted. The RC-side measurement, I think, reflects the fastest speed achievable while none of the buffers in the path is full.

When you use EDMA, the burst size can be 128 or 64 bytes, instead of the 4 bytes of the CPU case. It seems the switch takes some time to switch each chunk of data to the port connected to the EP, so the throughput is lower compared to the BOC.

The test pattern writes 0, 1, 2, 3, 4, ... to the EP side, and the EP polls the last location for the final value, indicating all data has been received. Perhaps you can use a fixed pattern from the RC side; then at the EP side, poll the first location for the pattern to start the timer, and poll the last location for the pattern to stop it.

    Regards, Eric