TDA4VH-Q1: PCIe BAR Space Access Performance Inquiry

Part Number: TDA4VH-Q1


Hi TI,

Problem Description: I've encountered a data-transfer performance issue while developing with a PCI Express (PCIe) device. Specifically, I've observed that at the 64KB transfer size, accesses to the Base Address Register (BAR) space of a PCIe Endpoint (EP) device run at roughly the same speed whether they use Direct Memory Access (DMA) or the CPU (via the memcpy_toio and copy_from_iter interfaces), and both are significantly slower than accesses to memory allocated through kmalloc or dma_alloc_coherent.
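
For reference, this is roughly the RC-side CPU access path described above. The following is only a minimal sketch, assuming the driver already holds the EP's struct pci_dev; the BAR index and function name are placeholders rather than my actual driver code:

    /*
     * Sketch: map the EP's BAR on the RC side and copy into it with the CPU
     * via memcpy_toio(). TEST_BAR and rc_cpu_write_bar() are illustrative
     * placeholder names.
     */
    #include <linux/io.h>
    #include <linux/pci.h>

    #define TEST_BAR 0  /* assumed BAR index */

    static int rc_cpu_write_bar(struct pci_dev *pdev, const void *buf, size_t len)
    {
        void __iomem *bar;

        bar = pci_iomap(pdev, TEST_BAR, len);   /* map (part of) the EP BAR */
        if (!bar)
            return -ENOMEM;

        memcpy_toio(bar, buf, len);             /* CPU copy into device memory */
        pci_iounmap(pdev, bar);
        return 0;
    }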

Attempted Solutions and Observations:

  1. DMA vs. CPU Access: I tried accessing the PCIe device's BAR space using both DMA and CPU methods and found that their transfer speeds do not significantly differ at the 64KB data size level.

  2. Memory Allocation Method: I changed the memory allocation method for the PCIe BAR space from dma_alloc_coherent to kzalloc, hoping to observe performance changes based on the memory allocation strategy. However, this modification did not improve access speed, which remained much slower than accessing memory allocated through kmalloc or dma_alloc_coherent.

  3. Synchronization Mechanism Impact Exclusion: To ensure that performance issues were not introduced by synchronization overhead, I implemented a producer-consumer model based on flag bits to control access to the BAR space. The results indicated that the access speed remains slow even when direct resource competition is avoided.

Questions:

  • I would like to understand why, even in the absence of direct resource competition, the speed of accessing the PCIe BAR space directly via DMA or CPU remains significantly lower than accessing standard memory.
  • Could the PCIe bus bandwidth or latency, hardware features, or driver implementation methods be causing this performance behavior? If so, are there recommended optimization strategies or configuration adjustments to improve this situation?
  • What could be the reason for the similar speeds of DMA and CPU access at the 64KB data size level? Does this imply that the fixed overhead of data transfer dominates at this data size level?

Expected Answers:

I am looking for detailed explanations about the reasons behind the observed performance behavior, along with possible optimization suggestions or solutions. Specifically, I am interested in learning if there are best practices for specific PCIe hardware and configurations that could enhance the data transfer efficiency of PCIe device BAR spaces.

  • Hi Qiang,

    If you are seeing roughly the same performance between DMA and CPU transfers over PCIe for smaller blocks of data, then this is most likely due to overhead from DMA initialization.

    For example, in our performance benchmarking with PCIe NVMe SSD cards, we test with different buffer sizes. A 4KB buffer size (the smallest we benchmark) has significantly lower throughput than the larger buffer sizes: https://software-dl.ti.com/jacinto7/esd/processor-sdk-linux-jacinto7/09_01_00_06/exports/docs/devices/J7_Family/linux/Release_Specific_Performance_Guide.html#pcie-nvme-ssd. 64KB falls between 4KB (the smallest benchmarked size) and 256KB (the next size up), so per-transfer overhead is most likely what makes the overall throughput look poor.
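
    As a rough illustration only (the overhead and bandwidth values below are placeholder assumptions, not measured TDA4VH or NVMe figures), effective throughput is approximately size / (fixed overhead + size / link bandwidth), so small buffers are dominated by the fixed cost:

        #include <stdio.h>

        /* Toy model of effective throughput vs. buffer size; the constants are
         * illustrative assumptions, not benchmark results. */
        int main(void)
        {
            const double overhead_us = 10.0;      /* assumed per-transfer setup cost */
            const double bandwidth_mbps = 4000.0; /* assumed raw link throughput, MB/s */
            const double sizes_kb[] = { 4, 64, 256, 1024, 4096 };

            for (unsigned int i = 0; i < sizeof(sizes_kb) / sizeof(sizes_kb[0]); i++) {
                double size_mb = sizes_kb[i] / 1024.0;
                double time_s = overhead_us / 1e6 + size_mb / bandwidth_mbps;
                printf("%6.0f KB: %7.1f MB/s effective\n", sizes_kb[i], size_mb / time_s);
            }
            return 0;
        }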

    Regards,

    Takuma

  • Hi Takuma,

    Thank you for your previous response. However, I feel there might have been a misunderstanding regarding my query. Let me illustrate my question with a specific example to clarify:

    Assume that on the EP side, within the Linux kernel, I allocate a regular memory buffer using tmp = kzalloc(size, GFP_KERNEL), and I also allocate memory for the EP BAR using the pci_epf_alloc_space() interface. My observation raises a question: why is the performance significantly lower when the EP side accesses its own allocated BAR space than when it accesses the tmp memory allocated via kzalloc? Moreover, the time it takes the EP side to access its BAR space is roughly equivalent to the time it takes the RC side to access the same BAR space using memcpy_toio, while the EP side's access to the kzalloc-allocated tmp memory is an order of magnitude faster than its access to the BAR space.
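
    To make the comparison concrete, below is a minimal sketch of the EP-side measurement I am describing. It assumes the five-argument pci_epf_alloc_space() from ti-linux-6.1.y; the BAR index, size, and function name are placeholders, and error handling is minimal:

        #include <linux/ktime.h>
        #include <linux/pci-epc.h>
        #include <linux/pci-epf.h>
        #include <linux/slab.h>
        #include <linux/string.h>

        /* Sketch: time an EP-side CPU read of the BAR backing memory vs. an
         * equally sized kzalloc buffer. */
        static void compare_bar_vs_kzalloc(struct pci_epf *epf, size_t size)
        {
            void *bar_buf, *tmp, *dst;
            ktime_t t0;
            s64 bar_ns, tmp_ns;

            bar_buf = pci_epf_alloc_space(epf, size, BAR_0, PAGE_SIZE,
                                          PRIMARY_INTERFACE);  /* BAR backing memory */
            tmp = kzalloc(size, GFP_KERNEL);                   /* ordinary kernel memory */
            dst = kzalloc(size, GFP_KERNEL);                   /* common destination */
            if (!bar_buf || !tmp || !dst)
                goto out;

            t0 = ktime_get();
            memcpy(dst, bar_buf, size);                        /* EP CPU reads its own BAR space */
            bar_ns = ktime_to_ns(ktime_sub(ktime_get(), t0));

            t0 = ktime_get();
            memcpy(dst, tmp, size);                            /* EP CPU reads kzalloc memory */
            tmp_ns = ktime_to_ns(ktime_sub(ktime_get(), t0));

            pr_info("BAR read: %lld ns, kzalloc read: %lld ns\n", bar_ns, tmp_ns);
        out:
            kfree(dst);
            kfree(tmp);
            if (bar_buf)
                pci_epf_free_space(epf, bar_buf, BAR_0, PRIMARY_INTERFACE);
        }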

    This leads me to wonder, does accessing the BAR space from the EP side also involve communication over the PCIe bus, resulting in similar performance characteristics to RC side accesses? I was under the impression that such internal accesses by the EP to its own BAR space should be more direct and thus faster compared to accessing memory over the PCIe bus. Could you please shed some light on whether accessing the BAR space from the EP side inherently involves PCIe bus transactions, or if there might be other factors at play here causing this unexpected performance behavior?

    I appreciate your insights on this matter.

    Best regards,

    qiang

  • Hi Qiang,

    There are several factors that may affect data transfer, such as prefetchable vs. non-prefetchable BARs and I/O space vs. memory space. So even with the RC not connected, these factors may cause some differences in speed.

    As for whether transactions occur from EP to RC when the EP writes to its own BAR - I am not quite sure. However, one experiment would be to check whether the mapped BAR space on the RC side receives the data you are writing on the EP side.
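
    For example, something along the following lines on the RC side (a sketch only; the function name and BAR index are placeholders) could check whether a known pattern written on the EP side shows up across the link:

        #include <linux/io.h>
        #include <linux/pci.h>
        #include <linux/slab.h>

        /* Sketch: read the EP's BAR back on the RC side and check for a pattern
         * that the EP wrote into its BAR backing memory. */
        static bool rc_bar_has_pattern(struct pci_dev *pdev, int bar, u8 pattern, size_t len)
        {
            void __iomem *base;
            bool match = true;
            u8 *copy;
            size_t i;

            base = pci_iomap(pdev, bar, len);
            if (!base)
                return false;

            copy = kmalloc(len, GFP_KERNEL);
            if (!copy) {
                pci_iounmap(pdev, base);
                return false;
            }

            memcpy_fromio(copy, base, len);   /* read the EP BAR over the PCIe link */
            for (i = 0; i < len; i++) {
                if (copy[i] != pattern) {
                    match = false;
                    break;
                }
            }

            kfree(copy);
            pci_iounmap(pdev, base);
            return match;
        }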

    Regards,

    Takuma

  • Hi Takuma,

    Thank you very much for your insights and the suggested experiment regarding the data transfer between the EP and RC sides. Your explanation helped clarify several aspects of PCIe data transfers for me. I understand that the specific details about the performance discrepancies when accessing BAR space, especially from the EP side, might not be readily available to you.

    Given the specialized nature of this query, could you possibly direct me to someone else or suggest another resource within TI who might have more detailed knowledge about PCIe BAR space access performance? I'm eager to delve deeper into the root causes of the observed performance issues and explore potential optimization strategies to improve access speeds.

    Any further guidance or recommendations on whom to contact for more specialized advice would be greatly appreciated.

    Thank you again for your time and assistance.

    Best regards,

    Qiang

  • Hi Qiang,

    Sure, I can check internally. However, please expect that this will take a significant amount of time, since our members are in different time zones and different teams.

    Is there a specific target for performance that you are desiring but not meeting in terms of PCIe transfer speeds or latency?

    Regards,

    Takuma

  • Hi Takuma,

    Thank you very much for your willingness to help and for checking this internally. I completely understand that coordinating across different time zones and teams may take some time, and I appreciate your efforts in advance.

    Regarding the specific performance targets, we are currently observing significantly lower speeds when accessing the BAR space from the EP side compared to regular memory access. While I don't have a precise number for the desired performance, our goal is to minimize this discrepancy to ensure our system operates efficiently. Ideally, we would like the access speed to the BAR space to be as close as possible to that of regular memory, understanding that some difference due to the nature of PCIe transactions is inevitable.

    If there are known benchmarks or expected performance metrics for PCIe BAR access that could guide our expectations, that information would be invaluable to us. Additionally, any insights or recommendations on optimization techniques or configurations that could improve our current performance would be greatly appreciated.

    Again, thank you for your assistance with this matter. I look forward to any information or guidance you and your team can provide.

    Best regards,

    Qiang

  • Hi Qiang,

    Not specific to the PCIe BAR (although it is one aspect of this example), but we do have a PCIe EP example that you could reference, if you have not already: https://software-dl.ti.com/jacinto7/esd/processor-sdk-linux-j784s4/09_01_00_06/exports/docs/linux/Foundational_Components/Kernel/Kernel_Drivers/PCIe/PCIe_End_Point.html

    Regards,

    Takuma

  • Hi Takuma,

    Thanks for your suggestion. I have already reviewed the PCIe EP example you mentioned. While it provides a good overview of accessing the BAR, my concern specifically lies with the performance aspect of BAR access from the EP side, which appears to be significantly slower than expected. I'm keen on understanding the underlying reasons for this slow performance and exploring possible optimizations. Any further insights or guidance on this would be highly appreciated.

    Best regards,

    Qiang

  • Hi Qiang,

    I understand your query, and I appreciate your continued patience on this matter.

    Regards,

    Takuma

  • Hi Takuma,

    Any progress on addressing this issue? Could you provide an update?

    Best regards,

    Qiang

  • Hi Qiang,

    In terms of physical memory, there is no difference between non-BAR and BAR space. Therefore, the differences most likely come from differences in memory attributes.

    Also, instead of using generic memory allocation, the recommendation would be to use the dmaengine APIs in your driver, similar to the EPF test example: https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/drivers/pci/endpoint/functions/pci-epf-test.c?h=ti-linux-6.1.y#n102.
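
    For reference, the memcpy path there looks roughly like the sketch below (simplified, with error handling trimmed; dma_dst/dma_src are assumed to be valid DMA addresses for your buffers):

        #include <linux/completion.h>
        #include <linux/dmaengine.h>
        #include <linux/err.h>

        static void dma_copy_done(void *arg)
        {
            complete(arg);
        }

        /* Sketch of a dmaengine memcpy in the spirit of the memcpy path in
         * pci-epf-test.c. */
        static int dma_copy(dma_addr_t dma_dst, dma_addr_t dma_src, size_t len)
        {
            DECLARE_COMPLETION_ONSTACK(done);
            struct dma_async_tx_descriptor *tx;
            struct dma_chan *chan;
            dma_cap_mask_t mask;
            dma_cookie_t cookie;
            int ret = 0;

            dma_cap_zero(mask);
            dma_cap_set(DMA_MEMCPY, mask);
            chan = dma_request_chan_by_mask(&mask);   /* any memcpy-capable channel */
            if (IS_ERR(chan))
                return PTR_ERR(chan);

            tx = dmaengine_prep_dma_memcpy(chan, dma_dst, dma_src, len,
                                           DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
            if (!tx) {
                ret = -EIO;
                goto release;
            }

            tx->callback = dma_copy_done;
            tx->callback_param = &done;
            cookie = dmaengine_submit(tx);            /* queue the descriptor */
            if (dma_submit_error(cookie)) {
                ret = -EIO;
                goto release;
            }

            dma_async_issue_pending(chan);            /* start the transfer */
            wait_for_completion(&done);               /* wait for the completion callback */
        release:
            dma_release_channel(chan);
            return ret;
        }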

    Regards,

    Takuma

  • Hi Takuma,

    Thank you for your response, but I have a few more questions. What exactly do you mean by 'differences are most likely coming from memory attribute differences'? In my driver, I am indeed using the 'pci_epf_test_alloc_space' interface from pci-epf-test.c to allocate the BAR space. However, transferring data from this BAR space to the upper-layer application using the 'copy_from_iter' interface is significantly slower than transferring data from 'kmalloc'-allocated memory to the upper-layer application using the same interface. What could be the reason for this?

    Best regards,

    Qiang

  • Hi Qiang,

    Depending on where the memory is allocated, its memory attributes will differ. Allocating BAR space via pci_epf_test_alloc_space and allocating elsewhere using kmalloc would most likely result in different memory attributes, which could affect performance.

    Regards,

    Takuma

  • Hi Takuma,

    pci_epf_test_alloc_space ultimately allocates its memory via dma_alloc_coherent. I attempted to replace the dma_alloc_coherent call with kzalloc, but found that the speed was still slower by an order of magnitude. As I understand it, there should not be any attribute differences between memory allocated through the modified interface and memory allocated directly with kzalloc. I'm quite puzzled by this.

    Best regards,

    Qiang

  • Hi Qiang,

    dma_alloc_coherent returns an address range whose memory attributes are already set, so caching is handled automatically; on our system, the MSMC L3 cache handles cache snooping. kzalloc'd memory would normally require extra cache clean and invalidate operations, but this is also handled by the MSMC.
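
    As a generic illustration of that difference (standard Linux DMA-API usage, not TDA4VH-specific code; 'dev' and the buffer size are placeholders, and error checks are omitted):

        #include <linux/dma-mapping.h>
        #include <linux/sizes.h>
        #include <linux/slab.h>

        #define BUF_SIZE SZ_64K   /* placeholder size */

        static void allocation_styles(struct device *dev)
        {
            dma_addr_t coherent_handle, streaming_handle;
            void *coherent_buf, *streaming_buf;

            /* Coherent allocation: the DMA API picks the mapping attributes, so
             * the driver does no explicit cache maintenance. */
            coherent_buf = dma_alloc_coherent(dev, BUF_SIZE, &coherent_handle, GFP_KERNEL);

            /* Ordinary cached memory plus a streaming mapping: on a non-snooping
             * system the driver owns cache maintenance via the dma_sync_* calls;
             * with MSMC snooping that maintenance is cheap, but the mapping
             * attributes of the two buffers still differ. */
            streaming_buf = kzalloc(BUF_SIZE, GFP_KERNEL);
            streaming_handle = dma_map_single(dev, streaming_buf, BUF_SIZE, DMA_BIDIRECTIONAL);

            /* ... use the buffers ... */

            dma_unmap_single(dev, streaming_handle, BUF_SIZE, DMA_BIDIRECTIONAL);
            kfree(streaming_buf);
            dma_free_coherent(dev, BUF_SIZE, coherent_buf, coherent_handle);
        }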

    Additionally, if outbound transactions are performed, the memory address needs to be translated to a PCIe address. On TI SoCs, this translation is done automatically by the iATU when writes are made to the PCIEX_DATX region.

    Regards,

    Takuma

  • Hi Takuma,

    I'm using pci_epf_test_alloc_space to allocate memory on the EP side, and then reading this memory on the EP side. Does this involve outbound transactions?

    Best regards,

    Qiang

  • Hi Qiang,

    For PCIe, there needs to be a translation from VBUSM to AXI between the device and the PCIe controller. This bridge is accessed through the PCIEX_DAT0 and DAT1/DAT2 regions. The reference below is taken from the TRM:

    The TRM also describes the difference between the inbound and outbound translation mechanisms, since PCIe-address-to-AXI-address translation (inbound) differs from VBUSM-to-PCIe-address translation (outbound).

    Regards,

    Takuma

  • Hi Takuma,

    Thank you very much for patiently answering my question. Based on your suggestions, I found that there is no "Local CPU Mem Access" data flow in the PCIe subsystem, as shown in the diagram below. Can I assume that the latency of the CPU's access to the EP-side BAR is not introduced by the PCIe subsystem?

    Best regards,

    Qiang

  • Hi Qiang,

    In practice, I think it is safe to say that reads and writes to BAR address spaces imply communication between PCIe devices. Therefore, when the local CPU accesses BAR space, it would only be for the purpose of communicating between PCIe devices.

    In our system, writes from the local CPU go through the VBUSM2AXI bridge, and transactions from a remote device go through the AXI2VBUSM bridge, as shown in the block diagram you shared. This differs from a normal CPU read/write to local memory that does not involve PCIe or other peripherals.

    Regards,

    Takuma

  • Hi Takuma,

    I've consulted the manual, and I understand your point. Typically, accessing a PCIe device follows the green path through the VBUSM2AXI bridge to the remote device, and the remote device then accesses its local BAR via the yellow path. However, my question is: when the CPU on the EP side accesses its own BAR, does it follow the red path? Is that similar to accessing regular memory?

    Best regards,

    Qiang

  • Hi Qiang,

    I am currently on business travel for a week, so my responses will be delayed.

    Regards,

    Takuma

  • Hi Takuma,

    Any progress on addressing this issue? Could you provide an update?

    Best regards,

    Qiang

  • Hi Qiang,

    I reviewed the diagram and arrows, and your understanding of the yellow and green paths matches mine.

    However, the BAR is only used for inbound address translation, while the iATU is used for all outbound translations, so there is no case in which the local EP accesses its own BAR space. For example, local writes to DAT0 or DAT1 go through the iATU path (green path), while writes from an external device go through the BAR0-7 path (yellow path).

    Regards,

    Takuma

  • Hi Takuma,

    Yes, I agree with your statement. Normally, the PCIe BAR is only used for inbound address translation. However, on our Endpoint (EP) side there is a local CPU, and the EP's BAR is emulated through the EP side's memory space. So the CPU only needs to know the memory address corresponding to this EP BAR to reach the memory backing the BAR via the red path, indirectly accessing the EP's BAR space. I've conducted experiments, and this is indeed the case. However, the access speed has consistently been very slow, and it's unclear why. As mentioned before, accessing the emulated EP BAR memory is significantly slower than accessing other memory.

    Best regards,

    Qiang

  • Hi Qiang,

    As of now, we unfortunately do not have a clear explanation for why you are seeing slower access to EP BAR memory, other than the additional address translation paths required, as shown in the diagrams.

    For memory transfers in general, there are many factors such as caching, memory attributes, and DMA channel type (normal vs. high capacity vs. ultra high capacity), but if you are not seeing any performance differences when varying these factors, it is unfortunately hard to root-cause the slow access speeds you are observing.

    Regards,

    Takuma