66AK2H12: [SRIO] Question about write-operation throughput

Part Number: 66AK2H12

Hi,

This forum always helps me when I'm stuck on a problem. Thanks again.

I have some questions about write-operation throughput in SRIO.

Here is my environment:

 - Custom board (DSP: 66AK2H12, FPGA: Xilinx Virtex-7)
 - PDK: v4.0.4
 - CCS: v7.0.1
 - SRIO: 5 Gbps per lane * 4 lanes (20 Gbps in total)

When the DSP sends a write-operation packet with a 256 KB payload, the FPGA receives it, so I can measure the time difference between them.

Based on this data, I calculated the throughput: it shows almost 10 Gbps, which is not the maximum throughput.

I found that there are several factors that decrease the speed, such as 8b/10b encoding and the SerDes module. But 10 Gbps still does not seem like a reasonable throughput.

My questions are:

1. The test scenario sends 256,000 bytes from the DSP to the FPGA. There is only a single send operation on the DSP side (it takes 400 ns), and the SRIO IP splits the data into 256-byte packets, which is the maximum packet payload size. The signal diagram looks like this (sorry, I cannot upload an image due to the security policy at my workplace):

          ______________           ______________           ____
         |              |         |              |         |
_________|              |_________|              |_________|        .....
             #1 (145ns)   #2 (62ns)

These pulses show the packet-receive signal in the FPGA. During pulse #1 the FPGA is handling a packet. But during period #2 no packet is being sent from the DSP.
I assume packet handling occurs on the DSP side during that time (splitting into 256-byte packets, attaching the packet header, and so on).

  Is there any way to decrease the #2 period? I tried to find a register that controls it, but packet generation occurs in the PHY layer of the SRIO IP and is done automatically.

2. I measured the throughput of two write operations, NWRITE and SWRITE (streaming write), but there is no difference between them. According to the SRIO white paper, SWRITE has a shorter header format and lower overhead, so I expected it to affect the throughput, but nothing changed.

  Are there any results comparing the performance of NWRITE and SWRITE for the same transfer size?

I couldn't explain my problem in more detail due to my poor English. Sorry about that.

Thanks again.

Regards,

chanseok

  • Hi Chanseok,

    I've forwarded this to the design experts. Their feedback should be posted here.

    BR
    Tsvetolin Shulev
  • 1)  Have you seen table 18 in http://www.ti.com/lit/an/sprabk5b/sprabk5b.pdf

    The HW can support 13.4 Gbps in the scenario you are discussing.  The SRIO peripheral will burst out the packets as fast as possible, with virtually no inter-packet gap.  So I think one of two things is happening: either the FPGA is not receiving the packets fast enough and is issuing physical-layer retries, or you have contention for resources inside the K2H device from memory accesses.  A quick, simple test for this is to disable all other memory accesses and do only the SRIO TX operations.  What throughput do you get then?

    2) Virtually no difference in performance between SWRITE and NWRITE for the same size.

    Regards,

    Travis

  • Dear Travis,

    Thanks for the very detailed answer. I will try that test scenario.

    I have one more question, if you don't mind. Actually, I send SRIO packets through a DIO socket. In the driver code, there is no relationship between DIO and PKTDMA (QMSS + CPPI), I think.

    So I removed that code, and it works. Does disabling PKTDMA affect the overall performance of SRIO?

    (One thing makes me doubtful: when I ran the tput_benchmark SRIO example, it didn't work with my modified driver, but when I rolled back to the original driver, it worked. The only difference between them is QMSS initialization; my modified code does not contain QMSS initialization.)

    Thanks for your help.

    Regards,

    chanseok

  • Chanseok Kang said:
    I have one more question, if you don't mind. Actually, I send SRIO packets through a DIO socket. In the driver code, there is no relationship between DIO and PKTDMA (QMSS + CPPI), I think.

    Correct!

    Chanseok Kang said:

    So I removed that code, and it works. Does disabling PKTDMA affect the overall performance of SRIO?

    (One thing makes me doubtful: when I ran the tput_benchmark SRIO example, it didn't work with my modified driver, but when I rolled back to the original driver, it worked. The only difference between them is QMSS initialization; my modified code does not contain QMSS initialization.)

    You do not need any of the PKTDMA initialization when using DIO (LSU) mode.  I'm not certain what in the code is causing it not to work, unless you didn't enable the messaging tests or you removed too much of the initialization (maybe SerDes related), which prevented the DIO from working.

    Regards,

    Travis

  • Dear Travis,

    Finally, I've finished my test and found the cause.

    Actually, I had allocated the source buffer in the DDR area (it was defined as a global buffer, so it may be allocated in the heap section).

    After I changed its location from DDR to L2 SRAM, the release cycle (period #2 in my question) decreased to 25 ns (64 ns -> 25 ns).

    In this case, the result shows nearly the maximum throughput (13.1 Gbps). Here is the result (DIO_NWRITE):

    Bytes    Write API   FPGA receive   Release cycle   Overall (us)   Throughput (Mbps)
    256      414 ns      138 ns         -               1.214          1686.985173
    512      412 ns      137 ns         25 ns           1.37           2989.781022
    1024     415 ns      138 ns         26 ns           1.68           4876.190476
    2048     410 ns      138 ns         26 ns           2.285          7170.2407
    4096     410 ns      138 ns         25 ns           3.49           9389.111748
    8192     412 ns      137 ns         25 ns           5.93           11051.60202
    16384    412 ns      137 ns         25 ns           10.82          12113.86322
    32768    -           -              -               20.522         12773.80372
    65536    -           -              -               40.007         13104.90664
    131072

    But another problem occurred. When I send 128 KB of data (in L2 SRAM) from the DSP to the FPGA, the FPGA does not receive some of the data (the overall time ends at 46 us; when I test from DDR, all of the data is received).

    I thought it might be the FPGA's problem, but when the FPGA sends bulk data to the DSP with the same options, the DSP also misses some data.

    Here is my question. When the send API (srio_sockSend_DIO, defined in the LLD) is called, the data is loaded into the LSU shadow registers, split into 256-byte packets, and sent to the destination.

    So I guess the SRIO IP has an internal buffer. Is there any chance of an overflow in that internal buffer? (I mean, the data loading is too fast for the LSU registers to keep up.) Or is there a limitation when using an SRAM source buffer?

    thanks for your support as always.

    Best Regards,

    Chanseok.

  • Chanseok Kang said:
    So I guess the SRIO IP has an internal buffer. Is there any chance of an overflow in that internal buffer? (I mean, the data loading is too fast for the LSU registers to keep up.) Or is there a limitation when using an SRAM source buffer?

    No, there is no way to overflow the buffers.  What completion codes are you getting from the LSU?

    Regards,

    Travis

  • Dear Travis,

    Thanks for reply,
    I tested with a non-blocking socket, so I didn't check the completion code (I assumed the completion code is only used to signal transfer completion on a blocking socket). For a non-blocking socket, when can I check the completion code? Right after sending the packet?
    Anyway, I will check it again.

    Thanks for your help.

    Regards,
    Chanseok
    I checked the completion code and found its value was 0xB (0b1011). Then, after a few steps, it changed to 0xA (0b1010).
    According to the datasheet, this means a DMA data transfer error. I couldn't find any information about this DMA; maybe it is the EDMA, I guess. (There is no explanation for configuring this DMA in the datasheet or the example code.)
    How can I handle this?

    Regards,
    Chanseok.
    Finally found the reason. Actually, some modules were being initialized, like UIA, NIMU, and NDK. After disabling them, the problem is gone. Maybe one of those modules takes a resource or causes bus contention.
    Thank you very much for all the help.

    Best regards,
    Chanseok.