
A brief pitfall and frustration guide for PCIe users on C66

Hello!

In our system we are using a C6670 DSP and a Spartan 6 FPGA, connected over PCIe. I have reached a certain milestone in the development and collected some experience I'd like to share. I am a bit too lazy to reference all the useful contributions, but I certainly do acknowledge that input. The sad thing is that useful contributions come mostly from the community rather than from TI employees, and that is one of the reasons I am writing this review.

Like most developers, we referred to TI's examples to develop our own solution. One of the first things to remember is that TI's examples are written under the assumption of "TI DSP to TI DSP". That may not, and often will not, work if one of the parties is different.

It was advised to use TI's PCIe LLD for the development. While it looks quite powerful, one should be extremely careful using it. When it comes to powering up the module, local setup, and link setup and training, it is just fine to follow TI's example.

In a conventional system, like a PC, there should be a remote endpoint setup step after link setup. The procedure is ubiquitously referred to as enumeration, but it generally consists of two steps: remote endpoint discovery and their configuration. There is no enumeration example provided by TI, so one has to develop their own solution. One way to do that is looping through bus, device and function numbers and issuing configuration reads for the Vendor ID/Device ID registers. Once a configuration read request reaches an existing device, meaningful values are returned for those registers and one may conclude the device is present. This way one may discover all devices present. In small systems with a known and fixed configuration this step might be redundant, and the DSP may proceed with EP configuration immediately.
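
For illustration, a discovery loop could look like the sketch below. cfg_read() is a hypothetical helper (not part of TI's LLD) that issues a configuration read to the given bus/device/function, for example via the PCIESS outbound config access window; it is also assumed that unclaimed config reads come back as all ones, which is the usual behavior of a root complex.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: issues a configuration read and returns the 32-bit value. */
extern uint32_t cfg_read(uint8_t bus, uint8_t dev, uint8_t func, uint16_t offset);

void enumerate_bus(void)
{
    uint32_t bus, dev;

    for (bus = 0; bus < 2; bus++) {          /* small system: scan a couple of buses */
        for (dev = 0; dev < 32; dev++) {
            /* Offset 0x00 holds Device ID (bits 31:16) and Vendor ID (bits 15:0) */
            uint32_t id = cfg_read((uint8_t)bus, (uint8_t)dev, 0, 0x00);

            if ((id & 0xFFFF) != 0xFFFF) {   /* all-ones vendor ID = nobody answered */
                printf("found %04x:%04x at bus %u dev %u\n",
                       (unsigned)(id & 0xFFFF), (unsigned)(id >> 16),
                       (unsigned)bus, (unsigned)dev);
            }
        }
    }
}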

One of the very basic configuration steps is setting the BAR registers. The root complex needs to know the memory window size requested by the endpoint. To find it out, the root complex writes all ones to the remote BAR and reads it back. The low-order bits that come back as zero indicate the size of the requested window. Then the root complex is free to map the BAR to any available location obeying alignment. Simply put, if the EP requests 32KB of memory, its base should be aligned on a 32KB boundary.
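
A minimal BAR sizing sketch, under the same assumption of hypothetical cfg_read()/cfg_write() helpers for remote config accesses (32-bit memory BARs only):

#include <stdint.h>

extern uint32_t cfg_read(uint8_t bus, uint8_t dev, uint8_t func, uint16_t offset);
extern void     cfg_write(uint8_t bus, uint8_t dev, uint8_t func, uint16_t offset, uint32_t val);

/* Returns the window size in bytes requested by a 32-bit memory BAR, 0 if unimplemented */
uint32_t bar_size(uint8_t bus, uint8_t dev, uint8_t func, int barIdx)
{
    uint16_t off  = 0x10 + 4 * barIdx;                  /* BAR0 lives at offset 0x10 */
    uint32_t orig = cfg_read(bus, dev, func, off);

    cfg_write(bus, dev, func, off, 0xFFFFFFFF);         /* write all ones     */
    uint32_t val = cfg_read(bus, dev, func, off);       /* read back the mask */
    cfg_write(bus, dev, func, off, orig);               /* restore            */

    if (val == 0)
        return 0;

    val &= ~0xFu;            /* drop the memory type / prefetchable bits */
    return (~val) + 1;       /* two's complement of the mask is the window size */
}

After that the root complex simply writes a suitably aligned base address back into the same BAR.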

THERE IS NO such code in TI's example. Instead, both DSPs configure their own local registers. That is perfectly OK for DSP-to-DSP, but it will not work if the remote endpoint has no processor of its own.

Next, one may want to configure the remote endpoint's capabilities registers. At this point you'd better keep away from the LLD. The configuration space of both RC and EP consists of the PCI configuration space header (64B), the PCI-compatible config space (up to 256B) and the PCIe extended config space. Since the config header layout is mandated by the PCI/PCIe specs, it is compatible across all parties. For instance, the config space header of my FPGA does match the expectation of TI's LLD, so it is OK to use the LLD's functions to read the Vendor/Device ID and to read and write BARs.

But beyond that, the show begins. The capabilities feature in PCI/PCIe is built around a linked list concept. There is a CapPtr field at offset 0x34 in the configuration space header. This pointer is used to access the very first item in the list. System software is then expected to traverse the list in search of the desired record. Once again, record locations are not fixed and may differ between devices. TI's LLD does not follow this rule. Instead, when it accesses configuration space beyond the header, it assumes the layout defined in the LLD. Needless to say, that may not match the remote EP's layout. Be sure, the configuration space layout of the PCIe endpoint in a Xilinx FPGA does not match the definition in TI's LLD. So when you need to configure something like the MSI capability, DO NOT use TI's LLD. Instead, one may look inside TI\pdk_C6670_1_1_2_6\packages\ti\drv\pcie\src\, specifically pcieapp.c and pciecfg.c, to get an idea how to read and write remote registers, and then develop your own solution for your config space layout.
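
For reference, a spec-compliant capability search is just a short loop. The sketch below uses the same hypothetical cfg_read() helper and returns the offset of a given capability ID (0x05 for MSI), or 0 if it is not found:

#include <stdint.h>

extern uint32_t cfg_read(uint8_t bus, uint8_t dev, uint8_t func, uint16_t offset);

/* Walks the capability linked list of a remote function. */
uint16_t find_capability(uint8_t bus, uint8_t dev, uint8_t func, uint8_t capId)
{
    /* Status register (upper 16 bits of the dword at 0x04), bit 4 = capability list present */
    uint16_t status = (uint16_t)(cfg_read(bus, dev, func, 0x04) >> 16);
    if (!(status & 0x0010))
        return 0;

    /* CapPtr at offset 0x34; each entry: bits [7:0] = Cap ID, bits [15:8] = next pointer */
    uint8_t ptr = (uint8_t)(cfg_read(bus, dev, func, 0x34) & 0xFC);

    while (ptr != 0) {
        uint32_t hdr = cfg_read(bus, dev, func, ptr);
        if ((hdr & 0xFF) == capId)
            return ptr;
        ptr = (uint8_t)((hdr >> 8) & 0xFC);
    }
    return 0;
}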

One of the useful features is the MSI capability. In order to get it running, one has to disable legacy INTx emulation (bit 10 in the Command register), enable bus mastering (bit 2 in the same register) and set up the MSI capability. That consists of enabling the required number of MSI vectors and setting up the destination address for MSI writes.
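
Putting the pieces together, a sketch of the remote MSI setup could look as follows. It relies on the hypothetical cfg_read()/cfg_write() helpers and find_capability() from above; MSI_TARGET_ADDR is an assumption standing for the PCIe address at which an EP write lands in the DSP's MSI_IRQ register (it depends on your inbound translation setup), and Message Data programming is omitted for brevity.

#include <stdint.h>

extern uint32_t cfg_read(uint8_t bus, uint8_t dev, uint8_t func, uint16_t offset);
extern void     cfg_write(uint8_t bus, uint8_t dev, uint8_t func, uint16_t offset, uint32_t val);
extern uint16_t find_capability(uint8_t bus, uint8_t dev, uint8_t func, uint8_t capId);

#define MSI_TARGET_ADDR  0x00000000u   /* placeholder: set to your MSI_IRQ address as seen over PCIe */

void setup_remote_msi(uint8_t bus, uint8_t dev, uint8_t func)
{
    /* Command register = lower 16 bits of the dword at 0x04 (upper half is Status, W1C, so drop it).
     * Bit 10 = INTx disable, bit 2 = bus master enable. */
    uint32_t cmd = cfg_read(bus, dev, func, 0x04) & 0x0000FFFF;
    cmd |= (1u << 10) | (1u << 2);
    cfg_write(bus, dev, func, 0x04, cmd);

    uint16_t msi = find_capability(bus, dev, func, 0x05);
    if (msi == 0)
        return;                                  /* EP does not advertise MSI */

    /* Message Address (lower 32 bits) right after the control word.
     * Message Data follows at msi + 0x08 (or + 0x0C for a 64-bit capable capability). */
    cfg_write(bus, dev, func, msi + 0x04, MSI_TARGET_ADDR);

    /* Message Control sits in bits 31:16 of the first dword of the capability:
     * bit 16 = MSI enable, bits 22:20 = multiple message enable (log2 of vector count). */
    uint32_t ctl = cfg_read(bus, dev, func, msi);
    ctl &= ~(0x7u << 20);                        /* multiple message enable = 0 -> 1 vector */
    ctl |=  (1u << 16);                          /* MSI enable                              */
    cfg_write(bus, dev, func, msi, ctl);
}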

On the DSP side, MSI writes trigger events which can be a source for an HWI. In the case of C6670, all 32 MSI interrupts are split 8 per core, and groups of 4 vectors further share the same event number. Note that this is different on other C66 devices. There is a famous typo in the C6670 data manual about the events triggered by MSI vectors: in table 7-38, event number 18 corresponds to PCIEXpress_MSI_INTn+4, not PCIEXpress_MSI_INTn+1. So Core0 on C6670 receives MSI vectors 0-8-16-24 on event 17 and 4-12-20-28 on event 18. Do not confuse these with the PCIESS event IDs from the PCIe user manual; the above-mentioned interrupts correspond to PCIESS events 0x4 and 0x8 respectively.

The ISR for MSI should explicitly write to the EOI register to signal that the processing is over. DO NOT use the TI LLD for that. If you do, you will face interrupt double triggering, as many of us already did. Instead, one may write to the EOI register directly or pick a lower-level function like pcie_write_irqEOI_reg().

As 4 MSI vectors may trigger the same HWI, there is a question of how to process them. One should look at MSI_IRQ_STATUS in the corresponding MSIX_IRQ register array. One may loop through all 4 bits and service all the interrupts in one run, or one may process just one and bail out. In my experiment the interrupt was triggered again if unprocessed flags remained in MSI_IRQ_STATUS.
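
As an illustration, my MSI0 handler is roughly the sketch below. The register offsets and the bit-to-vector mapping are how I read the KeyStone PCIESS application register map (base 0x21800000 on C6670); please verify them against the PCIe user guide for your device before using them.

#include <stdint.h>

#define PCIE_APP_BASE          0x21800000u
#define PCIE_IRQ_EOI           (*(volatile uint32_t *)(PCIE_APP_BASE + 0x050))
/* MSIn_IRQ_STATUS, n = 0..7; bit k is assumed to correspond to MSI vector n + 8*k */
#define PCIE_MSI_IRQ_STATUS(n) (*(volatile uint32_t *)(PCIE_APP_BASE + 0x100 + (n) * 0x10 + 0x04))

/* HWI handler for the event carrying MSI0 (vectors 0/8/16/24 on Core0). */
void msi0_isr(void)
{
    uint32_t status = PCIE_MSI_IRQ_STATUS(0) & 0xF;
    uint32_t bit;

    for (bit = 0; bit < 4; bit++) {
        if (status & (1u << bit)) {
            /* service MSI vector 8*bit here ... */
            PCIE_MSI_IRQ_STATUS(0) = (1u << bit);   /* write 1 to clear the flag */
        }
    }

    /* Signal end of interrupt by writing the EOI vector directly:
     * legacy INTA-INTD use values 0-3, MSI0-MSI7 use 4-11, so MSI0 = 4. */
    PCIE_IRQ_EOI = 4;
}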

Another big issue is PCIe performance. One has to understand that the advertised top performance numbers can only be achieved under special circumstances. First, one should clearly understand that link utilization in PCIe depends on payload size. All PIO, i.e. CPU-initiated accesses, produce 1-DWORD reads or writes. There is at least 3 DW of TLP header overhead, plus 6 more bytes of overhead at the data link layer. So 1 DW = 4B of payload is accompanied by at least 18B of overhead, and the utilization ratio at the data link layer is 4/(4+18) ≈ 18%. Next, on a x1 Gen1 link, due to 8b/10b encoding at the physical layer, 2.5 Gbaud reduces to just 2 Gbps of raw usable bandwidth. With that, 0.18*2 = 0.36 Gbps, or about 45 MBps of bandwidth. One may observe similar numbers for sequential CPU writes to remote endpoints; in my case that was about 40 MBps.

When it comes to read performance, CPU reads are split (non-posted) transactions and performance degrades even more dramatically. In my test, subsequent read requests arrive approximately 110 TRN clocks apart, with the TRN clock running at 62.5 MHz. That is 1.76 us. In other words, one read transaction carrying 4B of payload completes in 1.76 us. That gives a PIO read performance of about 2.3 MBps and is a source of real frustration. I have a test setup available and can provide signal captures if someone is interested in investigating it.

Summing up, one should use multi-dword payloads for PCIe transactions in order to achieve any better performance. The DSP's DMA engine is capable of making multi-dword reads and writes. The important point is to make sure the remote party supports them too. If it is DSP-to-DSP, you're safe. When it is custom hardware like an FPGA, one should have some PCIe engine responsible for that. Long story short, there is no out-of-the-box support for multi-dword transfers in Xilinx IP cores; one has to find his own way. Users of embedded designs may try the available building blocks. As for me, I have used the xapp859 target reference design. It was written for a Virtex 5 device, so I spent several months fitting it to Spartan 6. Now I am able to get around 200 MBps transfer speed for blocks of 32KB and up.

Although the DSP is advertised to have 256B inbound / 128B outbound payload capability, it was found by another contributor that a 256B payload does not bring any advantage due to PCIe subsystem internal delays. So the best performance is achieved with a 128B payload. One has to program that capability into the remote device if it is about to use its own DMA engine instead of the DSP's EDMA. Also remember that the DSP can serve read requests of 256B at most, so a remote DMA cannot ask for bigger blocks at once.
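
Programming that capability into the remote device means writing its Device Control register inside the PCI Express capability (ID 0x10). A sketch with the same hypothetical config helpers, setting a 128B max payload and a 256B max read request size:

#include <stdint.h>

extern uint32_t cfg_read(uint8_t bus, uint8_t dev, uint8_t func, uint16_t offset);
extern void     cfg_write(uint8_t bus, uint8_t dev, uint8_t func, uint16_t offset, uint32_t val);
extern uint16_t find_capability(uint8_t bus, uint8_t dev, uint8_t func, uint8_t capId);

void set_remote_payload(uint8_t bus, uint8_t dev, uint8_t func)
{
    uint16_t pcie = find_capability(bus, dev, func, 0x10);  /* PCI Express capability */
    if (pcie == 0)
        return;

    /* Device Control = lower 16 bits of the dword at cap + 0x08 (upper half is Device Status, W1C).
     * Bits 7:5   = Max_Payload_Size      (000b = 128B)
     * Bits 14:12 = Max_Read_Request_Size (001b = 256B) */
    uint32_t devctl = cfg_read(bus, dev, func, pcie + 0x08) & 0x0000FFFF;
    devctl &= ~((0x7u << 5) | (0x7u << 12));
    devctl |=  (0x0u << 5) | (0x1u << 12);
    cfg_write(bus, dev, func, pcie + 0x08, devctl);
}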

Perhaps I could add forgotten details later.

Hope this review might help someone.