SRIO throughput is higher than theoretical speed

Peter010

Hello

We are running an SRIO throughput test on the K2H EVM. The measurement is board to board through an SRIO switch, using NWrite and NRead packets in 4-lane, 5 Gbps mode. The theoretical link rate is 2000 MB/s before packet overhead (4 lanes * 5 Gbps * 0.8 for 8b/10b encoding / 8 bits per byte = 2000 MB/s).

The throughput in MB/s is calculated as Number_of_Bytes * DSP_clock_MHz / elapsed_cycles. The elapsed cycles are counted from the point where the LSU is set up to the point where the completion code is read from LSU_STAT_REG, as shown below.

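t_start = _itoll(TSCH, TSCL);   /* read the 64-bit time stamp counter before setting up the LSU */
KeyStone_SRIO_LSU_transfer(&lsuTransfer);

/* wait until the completion code for this transaction is available */
uiCompletionCode = KeyStone_SRIO_wait_LSU_completion(0, lsuTransfer.transactionID, lsuTransfer.contextBit);
t_stop = _itoll(TSCH, TSCL);
t_total = (t_stop - t_start) - t_overhead;   /* elapsed cycles minus t_overhead */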

When the source and destination are both in L2 SRAM, NWrite reaches 2015 MB/s, which is higher than the 2000 MB/s theoretical rate.

Can you tell me why the measured SRIO speed can be higher than the theoretical one?

The following figure shows the performance we measured.

Thanks

Xining

over 9 years ago

Raja over 9 years ago

TI__Guru* 81335 points

Hi,

What do you mean by Core and Chip memory?

I assume you are running the MCSDK SRIO throughput example. If so, can you please provide the test log?

Note: We have validated SRIO throughput with various memory configurations, and the results did not exceed the theoretical speed.

I hope you have referred to the Throughput Performance Guide for KeyStone II Devices for SRIO throughput.

Thank you.

Peter010 over 9 years ago in reply to Raja

Genius 4165 points

Hi Raja

Our test is not based on MCSDK. In the figure above, "In core memory" means L2 SRAM and "On Chip memory" means MSMC.

Oh, it seems you have updated the throughput document. I have the first version of this document.

The document you pointed to measures SRIO throughput by using a doorbell. It says: "The total time taken for the packet transfer is calculated by using the doorbell time stamps". It is not clear to me how the doorbell time stamps are used to measure the throughput.

Besides the SRIO performance differences between my setup and TI's, one of my test results contradicts the document. In the document, the throughput for L2 memory and DDR3 is almost the same (shown in the figure below). In my setup, the location of the memory strongly influences the SRIO throughput. Usually, the closer the source/destination memory is to the core, the higher the throughput.

The test log for L2 SRAM in our test is shown below. The test logs for MSMC and DDR have the same format but show different performance; you can see their performance in the first post.

DSP speed grade = 800MHz, ARM speed grade= 800MHz
Initialize main core clock = 122.88MHz/4x39 = 1198MHz
DDR3A initialization
Initialize DDR data rate = 100.000/1*20/6*4= 1333.3 MTS, bus width = 64 bits.
DDR PHY status PGSR0=0xb0000fff.
DDR3B initialization
Initialize DDR data rate = 100.000/1*16/4*4= 1600.0 MTS, bus width = 64 bits.
DDR PHY status PGSR0=0xb0000fff.
SRIO test between two Devices start............................................

SRIO path configuration 4xLaneABCD                     
NREAD    from 0x10800200 to 0x1080a200,      4 bytes, completion code = 0, Cycles = 1504, Thruput = 3MB/s
NREAD    from 0x10800200 to 0x1080a200,      8 bytes, completion code = 0, Cycles = 1533, Thruput = 6MB/s
NREAD    from 0x10800200 to 0x1080a200,     16 bytes, completion code = 0, Cycles = 1532, Thruput = 12MB/s
NREAD    from 0x10800200 to 0x1080a200,     32 bytes, completion code = 0, Cycles = 1604, Thruput = 23MB/s
NREAD    from 0x10800200 to 0x1080a200,     64 bytes, completion code = 0, Cycles = 1676, Thruput = 45MB/s
NREAD    from 0x10800200 to 0x1080a200,    128 bytes, completion code = 0, Cycles = 1821, Thruput = 84MB/s
NREAD    from 0x10800200 to 0x1080a200,    256 bytes, completion code = 0, Cycles = 2179, Thruput = 140MB/s
NREAD    from 0x10800200 to 0x1080a200,    512 bytes, completion code = 0, Cycles = 2397, Thruput = 255MB/s
NREAD    from 0x10800200 to 0x1080a200,   1024 bytes, completion code = 0, Cycles = 2756, Thruput = 445MB/s
NREAD    from 0x10800200 to 0x1080a200,   2048 bytes, completion code = 0, Cycles = 3476, Thruput = 705MB/s
NREAD    from 0x10800200 to 0x1080a200,   4096 bytes, completion code = 0, Cycles = 4917, Thruput = 997MB/s
NREAD    from 0x10800200 to 0x1080a200,   8192 bytes, completion code = 0, Cycles = 7868, Thruput = 1247MB/s
NREAD    from 0x10800200 to 0x1080a200,  16384 bytes, completion code = 0, Cycles = 13627, Thruput = 1440MB/s
NREAD    from 0x10800200 to 0x1080a200,  32768 bytes, completion code = 0, Cycles = 25365, Thruput = 1547MB/s
NREAD    from 0x10800200 to 0x1080a200,  65536 bytes, completion code = 0, Cycles = 48621, Thruput = 1614MB/s

NWRITE   from 0x10812200 to 0x1081a200,      4 bytes, completion code = 0, Cycles = 452, Thruput = 10MB/s
NWRITE   from 0x10812200 to 0x1081a200,      8 bytes, completion code = 0, Cycles = 452, Thruput = 21MB/s
NWRITE   from 0x10812200 to 0x1081a200,     16 bytes, completion code = 0, Cycles = 453, Thruput = 42MB/s
NWRITE   from 0x10812200 to 0x1081a200,     32 bytes, completion code = 0, Cycles = 453, Thruput = 84MB/s
NWRITE   from 0x10812200 to 0x1081a200,     64 bytes, completion code = 0, Cycles = 452, Thruput = 169MB/s
NWRITE   from 0x10812200 to 0x1081a200,    128 bytes, completion code = 0, Cycles = 452, Thruput = 339MB/s
NWRITE   from 0x10812200 to 0x1081a200,    256 bytes, completion code = 0, Cycles = 453, Thruput = 677MB/s
NWRITE   from 0x10812200 to 0x1081a200,    512 bytes, completion code = 0, Cycles = 595, Thruput = 1030MB/s
NWRITE   from 0x10812200 to 0x1081a200,   1024 bytes, completion code = 0, Cycles = 883, Thruput = 1389MB/s
NWRITE   from 0x10812200 to 0x1081a200,   2048 bytes, completion code = 0, Cycles = 1460, Thruput = 1680MB/s
NWRITE   from 0x10812200 to 0x1081a200,   4096 bytes, completion code = 0, Cycles = 2684, Thruput = 1828MB/s
NWRITE   from 0x10812200 to 0x1081a200,   8192 bytes, completion code = 0, Cycles = 4988, Thruput = 1967MB/s
NWRITE   from 0x10812200 to 0x1081a200,  16384 bytes, completion code = 0, Cycles = 9741, Thruput = 2014MB/s
NWRITE   from 0x10812200 to 0x1081a200,  32768 bytes, completion code = 0, Cycles = 21044, Thruput = 1865MB/s
NWRITE   from 0x10812200 to 0x1081a200,  65536 bytes, completion code = 0, Cycles = 44373, Thruput = 1769MB/s

STREAM   from 0x10802200,   4096 bytes transfer complete.

SWRITE   from 0x10802200 to 0x1080a200,     16 bytes, completion code = 0, Cycles = 453, Thruput = 42MB/s
SWRITE   from 0x10802200 to 0x1080a200,     32 bytes, completion code = 0, Cycles = 453, Thruput = 84MB/s
SWRITE   from 0x10802200 to 0x1080a200,     64 bytes, completion code = 0, Cycles = 453, Thruput = 169MB/s
SWRITE   from 0x10802200 to 0x1080a200,    128 bytes, completion code = 0, Cycles = 452, Thruput = 339MB/s
SWRITE   from 0x10802200 to 0x1080a200,    256 bytes, completion code = 0, Cycles = 451, Thruput = 680MB/s
SWRITE   from 0x10802200 to 0x1080a200,    512 bytes, completion code = 0, Cycles = 595, Thruput = 1030MB/s
SWRITE   from 0x10802200 to 0x1080a200,   1024 bytes, completion code = 0, Cycles = 885, Thruput = 1386MB/s
SWRITE   from 0x10802200 to 0x1080a200,   2048 bytes, completion code = 0, Cycles = 1460, Thruput = 1680MB/s
SWRITE   from 0x10802200 to 0x1080a200,   4096 bytes, completion code = 0, Cycles = 2684, Thruput = 1828MB/s
SWRITE   from 0x10802200 to 0x1080a200,   8192 bytes, completion code = 0, Cycles = 4989, Thruput = 1967MB/s
SWRITE   from 0x10802200 to 0x1080a200,  16384 bytes, completion code = 0, Cycles = 9740, Thruput = 2015MB/s
SWRITE   from 0x10802200 to 0x1080a200,  32768 bytes, completion code = 0, Cycles = 21045, Thruput = 1865MB/s
SWRITE   from 0x10802200 to 0x1080a200,  65536 bytes, completion code = 0, Cycles = 44371, Thruput = 1769MB/s
SRIO test complete.

tscheck over 9 years ago

TI__Mastermind 23525 points

Xining,

The throughput test results will definitely depend on the method used to measure and the number of packets sent. I believe our throughput numbers show the HW capability because the actual number of packets sent is not high.

To answer your initial question, remember that for NWRITE packets, the LSU status register is set with the completion code (CC) as soon as the last packet of the transfer is moved from the logical layer TX buffers into the physical layer TX buffers. The packet has not actually been sent out on the pins yet. So essentially, you are reading a false finish time.

We used a doorbell as the mechanism to know that the packet was sent and had actually landed in the destination memory before measuring the transmission time. First we measure the time it takes to write the LSU for a doorbell and receive the doorbell response, and we save this measurement. Then, when we do the data transfers, we end the test by sending a doorbell; when we get the doorbell response, we mark the total time and subtract the time it took for a single doorbell.

Regards,

Travis

Peter010 over 9 years ago in reply to tscheck

Genius 4165 points

Thanks for your explanation.

I will try to measure the benchmark following your procedure.

Thanks
Xining
