This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

SRIO Type 11 lost packets, performance issue

Other Parts Discussed in Thread: SYSBIOS

Hi,

I am having a problem with SRIO Type 11 data transfer. Basically, when an external device sends a stream of Type 11 messages, some packets get lost. The link partner receives responses without errors. This happens at speeds of 6-10 Gbps (SRIO in 4-lane mode at 3.125 Gbps per lane). At around 5 Gbps I can receive large chunks (tens of MB) of data without losses, but at higher speeds the packet counts on the two ends do not match. I use the SRIO driver.

I receive data by calling Srio_sockRecv in a loop. If there is some nontrivial code inside the loop, the number of lost packets is large; if the loop is empty, the number is small or sometimes all data is received (compiling in release mode also helps). That seems strange, because there are around 3k CPU cycles between packets (4 KB per packet at 10 Gbps). If I send a small number of packets (fewer than srio_driver_config.u.drvManagedCfg.rxCfg.numRxBuffers), then it seems that all messages are received regardless of speed.

There are a couple of things that might cause this problem:

  1. I use NDK that uses QMSS/CPPI.
  2. The main cycle is run inside SYSBIOS task.
  3. Instead of using the simplified OSAL memory allocation as in the example, the function Osal_srioDataBufferMalloc uses HeapBuf_alloc.
  4. Data from SRIO is transferred into DDR.

Do I need to go to a bare-metal design and get rid of all of the above in order to ensure reliable data transfer? Is there a way to detect packets that were received by the SRIO subsystem but dropped due to a lack of free QMSS/CPPI RX buffers?

Thanks in advance,

Alexey

  • Alexey

    What device are you using? The information below is for Nyquist.

    According to the SRIO User's Guide: "The RapidIO Physical Layer 1x/4x LP-Serial Specification currently covers four frequency points: 1.25, 2.5, 3.125, and 5 Gbps. Due to the 8-bit/10-bit encoding overhead, the effective data bandwidth per differential pair is 1.0, 2.0, 2.5, and 4 Gbps, respectively."

    Did you tie your lanes together to have a 4X port width? Did you ensure the SRIO is synced with the external device?

    I am attaching a GEL file to this post. A GEL file is a debugging script used by CCS to help the user catch bugs. I assume you have CCS v5. This GEL file will work with all Nyquist devices.

    To use the GEL file:

    1. Enter CCS Debug mode (should be in the top right of CCS, next to CCS Edit) after you have run or paused your program.
    2. Select Tools->GEL Files.
    3. A GEL Files tab should open.
    4. Right-click an entry under the Script/Status table and select "Load GEL...".
    5. Select the GEL file I have attached and wait for its status under the Script/Status table to change to "Success".
    6. Select Scripts to see all the different tests provided.

    I recommend running the SRIO Errors Scan (both physical and logical) and SRIO dump to check your port width.

    Elush

    3681.TCI6618_8_Srio_v0.13.gel

  • Hi,

    Sorry, I forgot to mention: I am using a C6678 in 4x mode at 3.125 Gbps. Thank you very much for the GEL file; I will try it when I get back to work.

    At the moment I am working with two C6678 boards connected through a switch. The whole setup is quite shaky because I use a virtual machine to connect to one of the boards, and neither CCS nor my OS plays nicely with this setup. On top of this, it seems I have not quite figured out how to work with the switch, because my code does not work 100% of the time (I think it always works right after I power on all the devices).

    I managed to run the throughput benchmark code that comes with the PDK in a board-switch-board configuration; the measured throughput using Type 11 packets was around 8200 Mbit/s. Unfortunately, when I use my own init and transfer code I get at most 5400 Mbit/s quite consistently (sometimes it drops to 3400 or 1900 Mbit/s for reasons I don't understand). Basically, the TX buffer allocation function fails, so I call it until I get a TX buffer (to my understanding this means the TX board cannot send packets because the RX side is busy).

    Right now I am trying to merge my code with the code from the benchmark example. I cannot just take the benchmark code as-is because we need Ethernet, which also uses the CPPI/QMSS subsystem. I was able to verify that if I replace the GEL file in the benchmark code with my initialization procedure, the throughput does not degrade. I also changed some parts of my CPPI/QMSS initialization function to follow the example more closely, but this did not improve throughput in my case. I have a couple of questions regarding the throughput test:

    1. Is there a reason for compiling srio_drv.c instead of using the precompiled binary, other than to simplify debugging?

    2. The example manages descriptors manually, but I let the driver do this for me. Can this impact performance?
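    For context on question 2, my setup uses the driver-managed configuration, roughly like this (a fragment only; I am quoting the field names from memory of srio_drv.h, so please verify against the PDK header, and the numRxBuffers value is just my example):

```c
Srio_DrvConfig srio_driver_config;

/* Driver-managed mode: the driver allocates and recycles descriptors
 * and data buffers itself, unlike the benchmark example, which manages
 * descriptors manually (application-managed mode). */
srio_driver_config.bAppManagedConfig = FALSE;
srio_driver_config.u.drvManagedCfg.rxCfg.numRxBuffers = 4;  /* example */
```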

    Thank you in advance,

    Alexey

  • Alexey

    I am not exactly following what you did or what you are trying to do, so I can't answer those questions. I am unsure of what your code contains.

    Elush

  • Alexey Naydenov said:

    I receive data by calling Srio_sockRecv in a loop. If there is some nontrivial code inside the loop, the number of lost packets is large; if the loop is empty, the number is small or sometimes all data is received (compiling in release mode also helps). That seems strange, because there are around 3k CPU cycles between packets (4 KB per packet at 10 Gbps). If I send a small number of packets (fewer than srio_driver_config.u.drvManagedCfg.rxCfg.numRxBuffers), then it seems that all messages are received regardless of speed.

    Hi,

    I am seeing a similar issue to this on my current setup: C6678, SRIO Type 11, 5 Gbps. 

    When a large number of packets are sent to a particular core within a certain time, the core retains only as many messages as there are receive queues. This happens regardless of the size of the messages.

    Has this issue been resolved?

    Thanks,

    -Dan

  • Hi,

    I had an issue that sounds similar to yours, and I have a possible workaround.

    I was experiencing dropped packets because I was sending packets to the DSP faster than any one core could receive them. Most of the TI SRIO examples use 4 separate receive queues because they know it takes time to update a queue when receiving a message. With multiple receive queues, additional SRIO messages can be accepted while other queues are updating.

    Unfortunately, even with the 4 receive queues, I was still able to send messages faster than the queues could update. I have experienced this less with larger message sizes (4 KB), because they require more packets and therefore higher latency between messages. The increased time between messages allows the receive queues to update in time. The only other option I have found is to "pace" how often you send the messages; hence the "pacing" metric in the SRIO throughput example. It's not an ideal solution, but it works.

    -Dan