
Issues with throughput and processing speed when using SRIO

Issue 1:

We are still investigating the throughput issues we observe when sending packets over SRIO. As part of that investigation, we took the TI throughput example provided in the PDK and ran it on the EVM in loopback mode in a big-endian build. We have a question about the performance numbers reported by the example. Take one of the packet sizes from the Type 11 results, 512 bytes, which the table reports as 4784 cycles per packet (AgPCycs). At the 1.250 GBaud line rate the actual data throughput reported is 856 Mbps, which corresponds to 209030 packets per second being sent from one core to the other. Receiving those 209030 packets therefore costs 209030 x 4784 cycles, i.e. roughly 1 Gcycles per second, which is the entire processing capability of the core. So my question is: how can one do any actual processing on the data if it takes the entire 1 GHz core just to transfer it? Granted, we will not be using Type 11 with message responses going back from the consumer core to the producer core, so the cost of a single packet may be on the order of 50% of that number; even so, I would expect the time it takes to receive the packets to be far more negligible, so that one can actually process the data rather than spend roughly 50% of the available cycles just receiving it. I am recopying that line (just below) from the full results further down.

[C66xx_0] 0        1             1.250     C-I-C      Type-11               24           512        0             856.19   209030.09               1200000              No          4784      592        4082      110        5.74
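To make the cycle-budget arithmetic explicit, here is a small illustrative calculation using the figures from the 512-byte row above and our 1 GHz core clock (the code and variable names are just a sketch, not part of the TI example):

#include <stdio.h>

int main(void)
{
    /* Figures taken from the 512-byte Type-11 RX row above */
    double pktsPerSec   = 209030.09;   /* PktsSec.                 */
    double cyclesPerPkt = 4784.0;      /* AgPCycs, cycles/packet   */
    double coreHz       = 1.0e9;       /* our CPU runs at 1 GHz    */

    double cyclesPerSec = pktsPerSec * cyclesPerPkt;   /* ~1.0 Gcycles/s */
    printf("RX cost: %.2f Gcycles/s = %.0f%% of a 1 GHz core\n",
           cyclesPerSec / 1.0e9, 100.0 * cyclesPerSec / coreHz);
    return 0;
}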

Issue 2:

Additionally, in our multi-core application we have configured 8 cores, with one driver per port (3 ports) per core, one socket per core for TX, and one driver per core for RX, since each queue requires a separate driver. In addition, we have 5 sockets per core for receive, so 64 sockets in total across the 8 cores.

We also have 16K descriptors available for SRIO, 2048 per core. On a test basis I have assigned 648 of those for TX, and the remaining 1400 are spread over the 5 RX sockets. We put the 1400 descriptors into an RX free queue for each core and make use of starvation queues, so packets are dropped on starvation.
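For reference, the per-core descriptor split described above can be written down as follows (the macro names are ours and purely illustrative, not from the PDK):

/* Per-core SRIO descriptor budget (illustrative names only) */
#define SRIO_DESC_TOTAL        (16 * 1024)   /* descriptors available for SRIO   */
#define SRIO_DESC_PER_CORE     2048          /* 8 cores x 2048 = 16K             */
#define SRIO_DESC_TX_PER_CORE  648           /* pushed onto the TX free queue    */
#define SRIO_DESC_RX_PER_CORE  1400          /* spread over the 5 RX sockets     */

#if (SRIO_DESC_TX_PER_CORE + SRIO_DESC_RX_PER_CORE) != SRIO_DESC_PER_CORE
#error "TX + RX descriptors must add up to the per-core allocation"
#endif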

In a given 10 ms processing cycle we create a packet of approximately 500 bytes and send it 640 times by calling Srio_sockSend with a descriptor popped from the TX free queue. We do that on 2 cores and send both streams to a single core on another chip via an SRIO switch. We can see the statistics for the number of bytes received on the various ports of the switch, each of which is connected to a DSP chip.

On the RX side we call the driver function Srio_rxCompletionIsr in polled mode, which gives us a count of the number of descriptors in the RX completion queue. We then receive all the packets for a given 10 ms window on the RX DSP by calling Srio_sockRecv on all the sockets until there are no more packets on them, and for now we simply discard the packets and do no further processing. A sketch of this receive path is shown below.

What we are seeing is that the driver does not receive all the packets we expect in any given 10 ms window. We can determine this by keeping a running count of the RX completion queue, i.e. Qmss_getQueueEntryCount(ptr_srioDrvInst->rxCompletionQueue). However, the statistics from the SRIO switch show that the packets have been sent on to the destination chip, so it appears that the hardware is dropping some of them and not all packets get pushed onto the RX completion queue (a general purpose queue). We should not be seeing any starvation on the RX free queue, because we have 1400 descriptors and in any given 10 ms we expect 1280 packets, hence at most 1280 descriptors in use. I experimented with the number of packets sent from the two cores: 500 from each core works without any drops, but 520 does not.
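For clarity, here is a minimal sketch of the polled receive pass described above. It assumes the standard PDK SRIO driver calls (Srio_rxCompletionIsr, Srio_sockRecv, Srio_freeRxDrvBuffer); the function name, socket array and counter are ours and purely illustrative:

#include <stdint.h>
#include <ti/drv/srio/srio_drv.h>

#define NUM_RX_SOCKETS 5

/* One polled receive pass, run every 10 ms window (illustrative sketch). */
uint32_t pollSrioRx (Srio_DrvHandle hSrioDrv, Srio_SockHandle rxSock[NUM_RX_SOCKETS])
{
    Srio_DrvBuffer    hDrvBuffer;
    Srio_SockAddrInfo from;
    uint32_t          rxCount = 0;
    uint32_t          i;

    /* In polled mode the application calls the "ISR" directly; it moves
     * descriptors from the RX completion queue to the matching sockets. */
    Srio_rxCompletionIsr (hSrioDrv);

    /* Drain every socket until it reports no more packets. */
    for (i = 0; i < NUM_RX_SOCKETS; i++)
    {
        while (Srio_sockRecv (rxSock[i], &hDrvBuffer, &from) > 0)
        {
            rxCount++;
            /* For now the payload is discarded; return the descriptor
             * to the RX free queue. */
            Srio_freeRxDrvBuffer (rxSock[i], hDrvBuffer);
        }
    }
    return rxCount;
}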

 

[C66xx_0] ********************************

[C66xx_0] *********** CONSUMER ***********

[C66xx_0] ********************************

[C66xx_0] WARNING: Please ensure that the CONSUMER is executing before running the PRODUCER!!

[C66xx_0] Debug: Waiting for module reset...

[C66xx_0] Debug: Waiting for module local reset...

[C66xx_0] Debug: Waiting for SRIO ports to be operational... 

[C66xx_0] Debug: SRIO port 0 is operational.

[C66xx_0] Debug: SRIO port 1 is operational.

[C66xx_0] Debug: SRIO port 2 is operational.

[C66xx_0] Debug: SRIO port 3 is operational.

[C66xx_0] Debug:   Lanes status shows lanes formed as four 1x ports

[C66xx_0] Debug: AppConfig Tx Queue: 0x2a0 Flow Id: 0

[C66xx_0] Debug: SRIO Driver Instance 0x@00861540 has been created

[C66xx_0] Debug: Running test in polled mode.

[C66xx_0] Debug: SRIO Driver handle 0x861540.

[C66xx_0]

[C66xx_0]

[C66xx_1] ********************************

[C66xx_1] *********** PRODUCER ***********

[C66xx_1] ********************************

[C66xx_1] WARNING: Please ensure that the CONSUMER is executing before running the PRODUCER!!

[C66xx_1] Debug(Core 1): Waiting for SRIO to be initialized.

[C66xx_1] Debug: AppConfig Tx Queue: 0x2a1 Flow Id: 1

[C66xx_1] Debug: SRIO Driver Instance 0x@00861450 has been created

[C66xx_1] Debug: Running test in polled mode.

[C66xx_1] Debug: SRIO Driver handle 0x861450.

[C66xx_1]

[C66xx_1]

 [C66xx_1] Latency: (Type-11, 1.250GBaud, 1X, tab delimited)

[C66xx_1] Core  Lanes     Speed    Conn      MsgType              PktSize  NumPkts              MnLCycs               AgLCycs               MxLCycs

[C66xx_1] 1        1             1.250     C-I-C      Type-11               16           100        2344      2376      2447

[C66xx_1] 1        1             1.250     C-I-C      Type-11               32           100        2528      2581      2621

[C66xx_1] 1        1             1.250     C-I-C      Type-11               64           100        2951      2988      3055

[C66xx_1] 1        1             1.250     C-I-C      Type-11               128        100        3746      3810      3922

[C66xx_1] 1        1             1.250     C-I-C      Type-11               256        100        5439      5477      5570

[C66xx_1] 1        1             1.250     C-I-C      Type-11               512        100        7484      7562      7643

[C66xx_1] 1        1             1.250     C-I-C      Type-11               1024      100        11948    12031    12150

[C66xx_1] 1        1             1.250     C-I-C      Type-11               2048      100        22340    22428    22534

[C66xx_1] 1        1             1.250     C-I-C      Type-11               4096      100        41188    41319    41480

[C66xx_1]

[C66xx_0] Throughput: (RX side, Type-11, 1.250GBaud, 1X, tab delimited)

[C66xx_0] Core  Lanes     Speed    Conn      MsgType              OHBytes              PktSize  Pacing   Thruput               PktsSec.               NumPkts              PktLoss AgPCycs               AgLCycs               AgICycs AgOCycs               Seconds

[C66xx_0] 0        1             1.250     C-I-C      Type-11               24           16           0             173.91   1358695.63               6800000              No          736        593        33           110        5.01

[C66xx_0] 0        1             1.250     C-I-C      Type-11               24           32           0             348.30   1360544.25               6800000              No          735        589        36           110        5.00

[C66xx_0] 0        1             1.250     C-I-C      Type-11               24           64           0             609.52   1190476.25               6000000              No          840        595        135        110        5.04

[C66xx_0] 0        1             1.250     C-I-C      Type-11               24           128        0             754.61   736919.69               3600000              No          1357      594        653        110        4.89

[C66xx_0] 0        1             1.250     C-I-C      Type-11               24           256        0             856.19   418060.19               2200000              No          2392      595        1687      110        5.26

[C66xx_0] 0        1             1.250     C-I-C      Type-11               24           512        0             856.19   209030.09               1200000              No          4784      592        4082      110        5.74

[C66xx_0] 0        1             1.250     C-I-C      Type-11               24           1024      0             856.10   104504.13               600000  No          9569      593        8866      110        5.74

[C66xx_0] 0        1             1.250     C-I-C      Type-11               24           2048      0             856.05   52249.33               400000  No          19139    594        18435    110        7.66

[C66xx_0] 0        1             1.250     C-I-C      Type-11               24           4096      0             856.05   26124.67               200000  No          38278    592        37576    110        7.66

[C66xx_0]

[C66xx_1] Throughput: (TX side, Type-11, 1.250GBaud, 1X, tab delimited)

[C66xx_1] Core  Lanes     Speed    Conn      MsgType              OHBytes              PktSize  Pacing   Thruput               PktsSec.               NumPkts              PktLoss AgPCycs               AgLCycs               AgICycs AgOCycs               Seconds

[C66xx_1] 1        1             1.250     C-I-C      Type-11               24           16           228        173.91   1358695.63               6800000              No          736        135        584        17           5.01

[C66xx_1] 1        1             1.250     C-I-C      Type-11               24           32           228        348.30   1360544.25               6800000              No          735        135        583        17           5.00

[C66xx_1] 1        1             1.250     C-I-C      Type-11               24           64           0             609.52   1190476.25               6000000              No          840        135        688        17           5.04

[C66xx_1] 1        1             1.250     C-I-C      Type-11               24           128        0             754.61   736919.69               3600000              No          1357      135        1205      17           4.89

[C66xx_1] 1        1             1.250     C-I-C      Type-11               24           256        0             856.19   418060.19               2200000              No          2392      135        2240      17           5.26

[C66xx_1] 1        1             1.250     C-I-C      Type-11               24           512        0             856.19   209030.09               1200000              No          4784      135        4632      17           5.74

[C66xx_1] 1        1             1.250     C-I-C      Type-11               24           1024      0             856.10   104504.13               600000  No          9569      135        9417      17           5.74

[C66xx_1] 1        1             1.250     C-I-C      Type-11               24           2048      0             856.14   52254.79               400000  No          19137    135        18985    17           7.66

[C66xx_1] 1        1             1.250     C-I-C      Type-11               24           4096      0             856.19   26128.76               200000  No          38272    135        38120    17           7.65

[C66xx_1]

  • Hi Aamir,
    Are you running the examples on the C6678 EVM?
    Thank you.
  • Issue 1 occurs while running on the C6678 EVM,

    while Issue 2 occurs on specialized hardware consisting of 20 C6678s on a board.

    Aamir

  • Just to confirm my understanding: you are running the SRIO throughput example in loopback mode with a 1x lane configuration. Is that correct?

    Have you configured the maximum CPU frequency (1.2 GHz) on your setup?

    Thanks,
  • My configuration is 4 ports x 1 lane, and my CPU is running at 1 GHz.

    thanks, Aamir

  • Hi Aamir,

    I have tested the SRIO example project (SRIO_TputBenchmarkingTestProject) on the C6678 DSP with a 1.2 GHz CPU frequency and compared the results with the values in the throughput document; the two mostly match.

    Please take a look at the thread below:
    http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/p/322606/1154810.aspx#1154810

    Thanks,
  • Aamir Husain said:
    So my question is: how can one do any actual processing on the data if it takes the entire 1 GHz core just to transfer it?


    The SRIO messaging and internal Navigator are designed to be a "fire and forget" type of interface. What I mean by that is that the CPU can write a descriptor to send data and place it on the SRIO TX queue; regardless of the number of cycles it takes to actually send the message or messages, the CPU can be off doing other processing. The data movement internal to the chip for transmission is all done by the hardware and the Navigator PDSPs. The throughput example is not a real-world example; it is really there to show what the SRIO hardware is capable of doing. It doesn't use interrupts on the RX side, but simply polls the receive queues.
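    To illustrate the point, here is a rough sketch of that TX path, assuming the usual PDK socket calls (Srio_allocTransmitBuffer, Srio_sockSend); the helper function, its arguments and the error handling are only illustrative, and the destination address setup is not shown:

    #include <string.h>
    #include <ti/drv/srio/srio_drv.h>

    /* Queue one Type-11 message and return immediately (illustrative sketch). */
    int32_t sendOneMessage (Srio_DrvHandle hSrioDrv, Srio_SockHandle txSock,
                            Srio_SockAddrInfo* to, uint8_t* payload, uint32_t len)
    {
        Srio_DrvBuffer hDrvBuffer;
        uint8_t*       ptrTxData;
        uint32_t       bufferLen;

        /* Pop a descriptor/buffer from the driver's TX free queue. */
        hDrvBuffer = Srio_allocTransmitBuffer (hSrioDrv, &ptrTxData, &bufferLen);
        if (hDrvBuffer == NULL)
            return -1;

        memcpy (ptrTxData, payload, len);

        /* Hand the descriptor to the driver. From here on the Navigator
         * PDSPs and the SRIO hardware move the data out on the wire;
         * the CPU is free to go do other processing. */
        return Srio_sockSend (txSock, hDrvBuffer, len, to);
    }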

    Aamir Husain said:
    However, the statistics from the SRIO switch show that the packets have been sent on to the destination chip, so it appears that the hardware is dropping some of them and not all packets get pushed onto the RX completion queue (a general purpose queue).


    Type 11 SRIO messages cannot simply be dropped. What are the TX CCs (completion codes) on the packets that you don't see arriving at the destination device?

    Regards,

    Travis