66AK2H12: [SRIO] Question about write-operation throughput

Part Number: 66AK2H12

Hi,

This forum always helps me when I'm stuck on a problem. Thanks again.

I have some questions about write-operation throughput in SRIO.

Here is my environment:

 - Custom board (DSP: 66AK2H12, FPGA: Xilinx Virtex-7)
 - PDK: v4.0.4
 - CCS: v7.0.1
 - SRIO: 5 Gbps per lane * 4 lanes (20 Gbps in total)

When the DSP sends a write-operation packet with a 256 KB payload, the FPGA receives it, so I can measure the time difference between them.

Based on this data, I calculated the throughput: it shows almost 10 Gbps, which is not the maximum throughput.

I found that there are several factors that decrease the speed, such as 8b/10b encoding and the SerDes module. But 10 Gbps still does not seem like a reasonable throughput.

My questions are:

1. The test scenario sends 256,000 bytes from the DSP to the FPGA. There is only a single send operation on the DSP side (it takes 400 ns), and the SRIO IP splits the data into 256-byte packets, which is the maximum packet payload size. The signal diagram looks like this (sorry, I cannot upload an image due to the security policy at my workplace):

          ______________           ______________           ____
         |              |         |              |         |
_________|              |_________|              |_________|        .....
             #1 (145ns)   #2 (62ns)

These pulses show the packet-receive signal in the FPGA. During pulse #1 the FPGA is handling a packet. But during period #2 no packet is being sent from the DSP.
I assume packet handling occurs on the DSP side during that time (splitting into 256-byte packets, attaching the packet header, and so on).

  Is there any way to decrease the #2 period? I tried to find a register that controls it, but packet generation occurs in the PHY layer of the SRIO IP and is done automatically.

2. I measured the throughput of two write operations, NWRITE and SWRITE (streaming write), but there is no difference between them. According to the SRIO white paper, SWRITE has a shorter header format and lower overhead, so I expected it to affect the throughput, but nothing changed.

  Are there any results comparing the performance of NWRITE and SWRITE for the same transfer size?

I couldn't explain my problem in more detail due to my poor English. Sorry about that.

Thanks again.

Regards,

chanseok

  • Hi Chanseok,

    I've forwarded this to the design experts. Their feedback should be posted here.

    BR
    Tsvetolin Shulev
  • 1)  Have you seen table 18 in http://www.ti.com/lit/an/sprabk5b/sprabk5b.pdf

    The HW can support 13.4 Gbps in the scenario you are discussing.  The SRIO peripheral will burst out the packets as fast as possible, with virtually no inter-packet gap.  So I think one of two things is happening: either the FPGA is not receiving the packets fast enough and is issuing physical-layer retries, or you have contention for resources inside the K2H device from memory accesses.  A quick, simple test for this is to disable all other memory accesses and do only the SRIO TX operations.  What throughput do you get then?

    2) Virtually no difference in performance between SWRITE and NWRITE for the same size.

    Regards,

    Travis

  • Dear Travis,

    Thanks for the very detailed answer. I will try that test scenario.

    I have one more question, if you don't mind. Actually, I send SRIO packets through a DIO socket. In the driver code, there is no relationship between DIO and PKTDMA (QMSS + CPPI), I think.

    So I removed that code, and it works. Does disabling PKTDMA affect the overall performance of SRIO?

    (One thing makes me doubtful: when I ran the tput_benchmark SRIO example, it didn't work with my modified driver, but when I rolled back to the original driver, it worked. The only difference between them is QMSS initialization; my modified code does not contain QMSS initialization.)

    Thanks for your help.

    Regards,

    chanseok

  • Chanseok Kang said:
    I have one more question, if you don't mind. Actually, I send SRIO packets through a DIO socket. In the driver code, there is no relationship between DIO and PKTDMA (QMSS + CPPI), I think.

    Correct!

    Chanseok Kang said:

    So I removed that code, and it works. Does disabling PKTDMA affect the overall performance of SRIO?

    (One thing makes me doubtful: when I ran the tput_benchmark SRIO example, it didn't work with my modified driver, but when I rolled back to the original driver, it worked. The only difference between them is QMSS initialization; my modified code does not contain QMSS initialization.)

    You do not need any of the PKTDMA initialization when using DIO (LSU) mode.  I'm not certain what in the code is causing it not to work, unless you didn't enable the messaging tests or you removed too much of the initialization (maybe SerDes related), which prevented the DIO from working.

    Regards,

    Travis

  • Dear Travis,

    Finally, I've finished my test and found the cause.

    Actually, I had allocated the source buffer in the DDR area (it was defined as a global buffer, so it may be allocated in the heap section).

    After I changed its location from DDR to L2 SRAM, the release cycle (period #2 in my question) decreased to 25 ns (64 ns -> 25 ns).

    In this case, the result shows nearly the maximum throughput (13.1 Gbps). Here is the result (DIO_NWRITE):

    Bytes    Write API   FPGA receive   Release cycle   Overall (us)   Throughput (Mbps)
    256      414 ns      138 ns         -               1.214          1686.985173
    512      412 ns      137 ns         25 ns           1.37           2989.781022
    1024     415 ns      138 ns         26 ns           1.68           4876.190476
    2048     410 ns      138 ns         26 ns           2.285          7170.2407
    4096     410 ns      138 ns         25 ns           3.49           9389.111748
    8192     412 ns      137 ns         25 ns           5.93           11051.60202
    16384    412 ns      137 ns         25 ns           10.82          12113.86322
    32768    -           -              -               20.522         12773.80372
    65536    -           -              -               40.007         13104.90664
    131072

    But another problem occurred. When I send 128 KB of data (in L2 SRAM) from the DSP to the FPGA, the FPGA does not receive some of the data (the overall time ends at 46 us; when I test from DDR, all of the data is received).

    I thought it might be the FPGA's problem, but when the FPGA sends bulk data to the DSP with the same options, the DSP also misses some data.

    Here is my question. When the send API (srio_sockSend_DIO, defined in the LLD) is called, the data is loaded into the LSU shadow registers, split into 256-byte packets, and sent to the destination.

    So I guess the SRIO IP has an internal buffer. Is there any chance of an overflow in that internal buffer? (I mean, the data loading is too fast for the LSU registers to keep up.) Or is there a limitation when using an SRAM source buffer?

    thanks for your support as always.

    Best Regards,

    Chanseok.

  • Chanseok Kang said:
    So I guess the SRIO IP has an internal buffer. Is there any chance of an overflow in that internal buffer? (I mean, the data loading is too fast for the LSU registers to keep up.) Or is there a limitation when using an SRAM source buffer?

    No, there is no way to overflow the buffers.  What completion codes are you getting from the LSU?

    Regards,

    Travis

  • Dear Travis,

    Thanks for reply,
    I tested with a non-blocking socket, so I didn't check the completion code (I assumed the completion code is only used to signal transfer completion on a blocking socket). For a non-blocking socket, when can I check the completion code? Right after sending the packet?
    Anyway, I will check it again.

    Thanks for your help.

    Regards,
    Chanseok
    I checked the completion code and found its value was 0xB (0b1011). Then, after a few steps, it changed to 0xA (0b1010).
    According to the datasheet, this means a DMA data transfer error. I couldn't find any information about this DMA; maybe it is the EDMA, I guess. (There is no explanation for configuring this DMA in the datasheet or the example code.)
    How can I handle this?

    Regards,
    Chanseok.
    Finally found the reason. Actually, some modules were being initialized, like UIA, NIMU, and NDK. After disabling them, the problem is gone. Maybe one of those modules takes a resource or causes bus contention.
    Thank you very much for all the help.

    Best regards,
    Chanseok.