
SRIO x4 FPGA master direct IO performance

Other Parts Discussed in Thread: TMS320C6671

Hello again,

We have made good progress with the FPGA - DSP SRIO x4 connection. Unfortunately, I have run into a problem for which I can't find a reasonable answer (in either the documentation or the e2e forum).


The goal is to achieve the highest possible full-duplex throughput between the TMS320C6671 DSP and an Artix-7 family FPGA. The FPGA acts as the master and generates the direct IO requests (NREAD, NWRITE). At the moment we can run the connection in x4 mode at 2.5 Gbps and we get the expected transfer speed. For a 5 Gbps x4 connection, however, the speed is the same as at 2.5 Gbps. The problem with the 5 Gbps x4 connection appears only during simultaneous full-duplex transfers; with transfers in one direction only, the speed matches the expected theoretical throughput. Full-duplex transmission over an x2 link at 5 Gbps also achieves the expected throughput (for that link width).


No transmission errors are reported. The FPGA analyzer also shows that after a while the DSP side delays accepting packets (there are no retransmissions, unlike when the priority 0 packets were used).

Question:
Is it possible to achieve the theoretical speed for this link during full-duplex transmission?

If there is a limitation (single port, MAU unit, data VBUS), is there a way to work around it (2x x2 ports, a different configuration)?

I'll be thankful for any hints.

  • Hi Lukasz,

    Thanks for posting. I would recommend taking a look at the Keystone Throughput User Guide (http://www.ti.com/lit/pdf/sprabk5). Depending on your payload size, you can see comparable data rates across line rates if you are not sending a near maximum payload. This is reflected in the data collected and presented in the guide.

    Thanks,

    Clinton

  • Hi Clinton,

    I have already done that, and what I saw confused me a little because of the payload size. As far as I know, the maximum SRIO payload size is 256 bytes (surely the values in the guide describe the whole transfer size). The other thing is that the situation described in sprabk5 is different from the one I described:

    1. In the guide the transmission is initiated by the DSP SRIO LSU unit; I use the FPGA to generate the direct IO packets.

    2. The LSU can generate transfers in only one direction at a time; I want to achieve full-duplex transmission.

    As I described before, on the 5 Gbps x4 link with transmission in one direction only I get a speed close to the theoretical maximum: 4 lanes x 5 Gbps x 0.8 (8b/10b coding) = ~2 GB/s, and I measured about 1.7 GB/s (the difference comes from protocol overhead and FPGA packet generation). In all other link and speed configurations I get the maximum expected speed for both one-direction and full-duplex transfers. Only the x4 5 Gbps full-duplex case falls short.
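
    For completeness, here is the same estimate spelled out as a small C snippet (it only restates the arithmetic above; the lane count, line rate, and 8b/10b factor are the values already quoted):

    ```c
    #include <stdio.h>

    /* Theoretical payload ceiling of an SRIO x4 port at 5 Gbaud per lane,
     * before RapidIO packet/protocol overhead is taken into account.     */
    int main(void)
    {
        const int    lanes          = 4;
        const double lane_rate_gbps = 5.0;        /* Gbaud per lane  */
        const double coding_eff     = 8.0 / 10.0; /* 8b/10b encoding */

        double gbit_per_s  = lanes * lane_rate_gbps * coding_eff; /* 16 Gbit/s */
        double gbyte_per_s = gbit_per_s / 8.0;                    /*  2 GB/s   */

        printf("theoretical ceiling: %.1f Gbit/s = %.1f GB/s per direction\n",
               gbit_per_s, gbyte_per_s);
        /* Measured ~1.7 GB/s once packet headers and the FPGA's
         * request-generation gaps are subtracted.                */
        return 0;
    }
    ```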

    I'm interested in whether the maximum throughput in full-duplex transmission is possible in the system I described, where the FPGA generates the NREAD and NWRITE packets. Perhaps there is no way to achieve the maximum throughput in full-duplex at the highest speed. Or, if it is possible, what am I doing wrong (system concept, missing register configuration)?

  • Hello again,

    Because my company's management is pushing me to resolve the problem of maximum SRIO performance for the full-duplex 5 Gbps x4 link, I want to ask again for some help.

    Let me briefly describe the system in which I use the SRIO connection. The TMS320C6671 DSP is connected to a Xilinx Artix-7 FPGA using 4 SRIO lane pairs. We need to achieve the highest possible throughput between the DSP's DDR3 memory and the FPGA. The data is transferred using direct IO packets of type NREAD and NWRITE generated in the FPGA (the memory controller is implemented in the FPGA). The LSU in the DSP's SRIO module is responsible only for configuring the FPGA; for the high-throughput data transfers, SRIO frame generation is done by the FPGA.

    I ran many tests with different port configurations (x1, x2, x4) and different speeds (1.25, 2.5, 5 Gbps), in both simplex and full-duplex transmission. The designed system has some transmission overhead (an additional 10-20% of bandwidth). For all configurations except x4 at 5 Gbps full-duplex, the speed was as expected. The measured speed for the fastest link (x4, 5 Gbps) was about 1.7 GB/s in simplex mode, but when switching to full-duplex the speed was only half of this value. The performance drop for full-duplex transfers was observed only on the fastest link. To me it looks like there is a bottleneck somewhere in the system with a maximum throughput of about 1.7 GB/s that can operate in only one direction at a time. We analyzed the FPGA-side behavior with a logic analyzer and it looked like the DSP side was generating the delay.

    At this point I don't know whether it is possible to achieve the theoretical speed in the fastest configuration during full-duplex transfer. I assume the DDR3 memory is able to sustain full-duplex transfers at this speed. I'd be very thankful for any hints.

    Have you run the same full-duplex test at x4 with 5 Gbps but using L2 memory?  That would be a quick test, but I'm guessing you are running into DDR limitations because you are accessing multiple banks of DDR memory simultaneously.  Opening and closing the banks will drop the throughput; I can't comment on how much.  Do you have any accesses to DDR from any master other than SRIO?  If so, this would further have the potential to reduce throughput.  That is my best guess at this point.
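
    If it helps with that quick test, a minimal sketch for moving the endpoint buffers into on-chip memory with the TI C6000 compiler is below. The section name .msmc_buf and the buffer sizes are assumptions; the section has to be mapped to MSMCSRAM (or to L2SRAM for the L2 variant) in your linker command file.

    ```c
    /* Place the SRIO landing/source buffers in on-chip MSMC RAM instead of
     * DDR3, so DDR bank open/close effects are taken out of the test.
     * Section name and sizes are placeholders; adjust to your memory map.  */
    #pragma DATA_SECTION(srioRxBuf, ".msmc_buf")
    #pragma DATA_ALIGN(srioRxBuf, 64)
    unsigned char srioRxBuf[512 * 1024];   /* landing buffer for FPGA NWRITEs */

    #pragma DATA_SECTION(srioTxBuf, ".msmc_buf")
    #pragma DATA_ALIGN(srioTxBuf, 64)
    unsigned char srioTxBuf[512 * 1024];   /* source buffer for FPGA NREADs   */
    ```

    With a line such as ".msmc_buf > MSMCSRAM" in the linker command file, both buffers land in the on-chip shared SRAM, and the FPGA can then target the corresponding global addresses with its NREAD/NWRITE requests.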

    Regards,

    Travis

  • Hello,

    Thanks for the answer.

    As I recall, the results were the same using MSMC memory. During the test only SRIO accessed the DDR3. For full duplex I also ran a test using DDR3 in one direction and MSMC in the other.

    I'm not familiar with the SRIO design in the KeyStone, but I was wondering whether a configuration with 2 x2 ports could use more of the SRIO module's resources and improve the throughput.

    Two x2 ports will only give you more connectivity options; they will not improve the overall throughput.  I'm surprised at your results showing MSMC and DDR behaving the same.

  • Lukasz-san,

    I looked at the Xilinx RapidIO IP document.
    (P11) Table 2-1: Minimum Supported Speed Grade Details
    http://www.xilinx.com/support/documentation/ip_documentation/srio_gen2/v2_0/pg007_srio_gen2.pdf

    It seems that the 5 Gbps x4-link RapidIO is NOT officially supported on the FPGA side,
    so I am wondering how you are able to communicate with the DSP.
    Have you checked with Xilinx support whether this is a problem?

    If you have already checked, please get back to me,
    because I am interested in whether it can be achieved or not.

    Best regards, RY

  • Hi Lukasz,

    I performed the DirectIO throughput tests using the throughput example found in the PDK. This example project can be found at the following location after you install the MCSDK:

    <install directory>\pdk_C6670_x_x_x_x\packages\ti\drv\exampleProjects\SRIO_TputBenchmarkingTestProject

    I configured the example in the following way:

    • Internal loopback (this would ensure full duplex transmission)
    • 1 x4 port operating at 5 Gbps per lane

    The results for both the NREAD and NWRITE tests yielded throughput values similar to those presented in the Throughput User Guide. Note that the data presented in that guide was captured using L2 memory endpoints. I plan to run these same tests using MSMC memory to see if different results are achieved.
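
    For reference, a minimal standalone sketch of how the numbers can be cross-checked on the DSP side with the C66x time-stamp counter is shown below (this is not necessarily how the PDK example reports its results, and the 1 GHz core clock is an assumption you should adjust for your device):

    ```c
    #include <c6x.h>     /* TI compiler: TSCL/TSCH time-stamp counter, _itoll() */
    #include <stdio.h>

    #define DSP_CORE_HZ 1000000000ULL   /* assumption: 1.0 GHz core clock */

    static unsigned long long t_start;

    void tput_start(void)
    {
        TSCL = 0;                        /* first write starts the free-running counter */
        t_start = _itoll(TSCH, TSCL);
    }

    void tput_stop(unsigned long long bytes_moved)
    {
        unsigned long long cycles = _itoll(TSCH, TSCL) - t_start;
        double seconds = (double)cycles / (double)DSP_CORE_HZ;

        printf("%llu bytes in %.6f s = %.2f MB/s\n",
               bytes_moved, seconds, (double)bytes_moved / seconds / 1.0e6);
    }
    ```

    Call tput_start() before kicking off the transfer and tput_stop() with the number of bytes moved once the transfer is reported complete.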

    Have you reached out to Xilinx about this issue? Based on RY's post above it sounds like the 5 Gbps x4 link might not be supported on their end. Please let us know what you find out.

    Thanks,

    Clinton

  • Hello,

    Sorry for the delayed answer.

    It's true that Xilinx states that the 5 Gbps x4 link is not supported for the Artix-7 family. However, the core can be generated (with some constraints not met). Because a previous version of the documentation stated that this port width and speed were supported, we ran some tests with data verification and found no problems. One-direction transfers between the DSP and FPGA with fixed pattern generation and verification showed that the link was stable. In our tests the speed measured on the DSP side was about 1.7 GB/s.

  • Hi Clinton.

    I believe you got the same results as in the "Throughput Performance Guide for C66x KeyStone Devices". I'll try to run a simple example with loopback enabled, but I'm afraid this example does not reflect our case: the example uses the LSU to generate the direct IO packets, whereas we are using the FPGA and the DSP is the slave in the transaction. Another thing I'm not sure about is whether the loopback in the example uses the digital-domain loopback (LOOPBACK[3-0] in PER_SET_CNTL1) or the loopback mode in SRIO_SERDES_CFGRX.
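
    To check which loopback is actually active, I'm thinking of dumping the two registers mentioned above along the lines of the sketch below. The register addresses are deliberately left as parameters and the bit positions are assumptions; both need to be verified against the KeyStone SRIO user guide and the C6671 data manual.

    ```c
    #include <stdint.h>
    #include <stdio.h>

    /* Read a memory-mapped register. Addresses must be taken from the
     * device documentation; none are hard-coded here on purpose.       */
    static uint32_t read_reg(uint32_t addr)
    {
        return *(volatile uint32_t *)addr;
    }

    /* Dump the loopback-related state for one port.
     * Assumption: the per-port LOOPBACK[3-0] bits of PER_SET_CNTL1 map
     * port n to bit n (digital-domain loopback). The CFGRX value is
     * printed raw because the exact LOOPBACK field position in the
     * SERDES register should be checked in the data manual.            */
    void dump_loopback_state(uint32_t per_set_cntl1_addr,
                             uint32_t serdes_cfgrx_addr, int port)
    {
        uint32_t per_set_cntl1 = read_reg(per_set_cntl1_addr);
        uint32_t serdes_cfgrx  = read_reg(serdes_cfgrx_addr);

        printf("PER_SET_CNTL1 = 0x%08x, port %d digital loopback bit = %d\n",
               per_set_cntl1, port, (int)((per_set_cntl1 >> port) & 0x1u));
        printf("SRIO_SERDES_CFGRX%d = 0x%08x (decode LOOPBACK field per manual)\n",
               port, serdes_cfgrx);
    }
    ```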

  • Hi Lukasz,


    I had asked Clinton to verify the 5 Gbps x4 mode with that PDK example targeting MSMC memory, simply to show that in full-duplex mode (packet generation and reception) the KII SRIO IP shouldn't be the bottleneck.  From our perspective, you should be able to achieve similar results with any external device that can meet the same performance.


    Regards,

    Travis

  • Travis, Clinton

    Thanks a lot for the information. I'll need to take a closer look at all the registers that can give me any status on the possible transfer speed degradation. I'll also analyze the debug information from the FPGA IP core again. I'll report what I find.

    thanks again,

    Lukasz

  • Lukasz-san,

    Lukasz Mitrega said:

    Because a previous version of the documentation stated that this port width and speed were supported

    Could you let me know which document you looked at?

    I am looking for a description supporting the 5 Gbps x4 link, but I cannot find one.

      http://japan.xilinx.com/support/documentation/ip_documentation/srio_gen2/v3_1/pg007_srio_gen2.pdf
      http://japan.xilinx.com/support/documentation/ip_documentation/srio_gen2/v3_0/pg007_srio_gen2.pdf
      http://japan.xilinx.com/support/documentation/ip_documentation/srio_gen2/v2_0/pg007_srio_gen2.pdf
      http://japan.xilinx.com/support/documentation/ip_documentation/srio_gen2/v1_6/pg007_srio_gen2.pdf
      http://japan.xilinx.com/support/documentation/ip_documentation/srio_gen2/v1_5/pg007_srio_gen2.pdf
      http://japan.xilinx.com/support/documentation/ip_documentation/srio_gen2/v1_4/pg007_srio_gen2.pdf
      http://japan.xilinx.com/support/documentation/ip_documentation/srio_gen2/v1_3/pg007_srio_gen2.pdf
      http://japan.xilinx.com/support/documentation/ip_documentation/srio_gen2/v1_2/pg007_srio_gen2.pdf

    Best regards, RY

  • Hi RY,

    The http://japan.xilinx.com/support/documentation/ip_documentation/srio_gen2/v1_5/pg007_srio_gen2.pdf documentation describes the minimum speed grade for the whole 7 series family. The newer documentation has separate requirements for Virtex, Kintex, and Artix.

     

    regards

    Lukasz

  • Lukasz-san,
    Thank you so much !! 
    Best regards, RY