Hi,
I have a DaVinci DM8168 EVM board with a C6678EVM board connected via PCIe (using an AMC-to-PCIe adapter). I want to transfer data from the 8168 to the DDR memory of the C6678 as fast as possible. For now I want to do this programmatically from the A9 processor, in other words without using DMA. I am basing my code on the PCIe bootloader example which comes with the MCSDK for the C6678.
My code, like the example code, simply calls ioremap() to map the BAR corresponding to DDR memory on the C6678 into the Linux kernel address space. I then use memcpy() to copy data to the remote device's memory (C6678 DDR). Here are some performance measurement results:
using ioremap():
~ 1081 Mbps writing to 'remote' DDR
~ 50 Mbps reading from 'remote' DDR
I then added mmap support to my driver, which maps the device's BAR to usermode, via the following call:
remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT, vsize, pgprot_noncached(vma->vm_page_prot));
Copying data from usermode via memcpy, I obtained the following performance measurements:
write: ~ 295 Mbps writing to remote DDR
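For anyone reproducing these figures, here is a minimal sketch of how such a number can be derived. It assumes "Mbps" means 10^6 bits per second and that the copy is timed with a monotonic clock. The helper names `throughput_mbps` and `time_memcpy` are made up for illustration; they are not taken from the MCSDK example or from the driver discussed in this thread.

```c
#define _POSIX_C_SOURCE 200809L /* for clock_gettime() */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Convert a byte count and an elapsed time into megabits per second
 * (assumption: 1 Mbps = 10^6 bits/s). */
double throughput_mbps(size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes * 8.0 / seconds / 1e6;
}

/* Time `iterations` copies of `len` bytes and return the elapsed seconds.
 * In a real measurement `dst` (or `src`) would point into the mapped BAR;
 * here they are ordinary pointers so the sketch runs anywhere. */
double time_memcpy(void *dst, const void *src, size_t len, unsigned iterations)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (unsigned i = 0; i < iterations; i++)
        memcpy(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Usage would be along the lines of `throughput_mbps(len * iterations, time_memcpy(dst, src, len, iterations))`, with enough iterations that the elapsed time is well above the clock resolution.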
An additional question:
4) Why the huge difference between read/write performance? I realize that reading is a two-way transaction (read-response vs write), but why is reading an order of magnitude slower?
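One way to reason about this question: PCIe memory writes are posted (the requester does not wait for a reply), while memory reads are non-posted and complete only when the completion comes back. If the CPU issues one small read at a time to an uncached mapping and stalls until it completes, throughput is bounded by access size divided by round-trip time. The sketch below turns a measured rate into an implied time per access. The 4-byte access size and the reading of "Mbps" as 10^6 bits per second are assumptions, and `implied_ns_per_access` is a hypothetical helper name.

```c
#include <assert.h>

/* Given a measured throughput in Mbps (assumed 10^6 bits/s) and an assumed
 * number of bytes moved per bus access, return the implied time per access
 * in nanoseconds. */
double implied_ns_per_access(double mbps, unsigned bytes_per_access)
{
    assert(mbps > 0.0 && bytes_per_access > 0);

    double bytes_per_second = mbps * 1e6 / 8.0;
    double accesses_per_second = bytes_per_second / (double)bytes_per_access;

    return 1e9 / accesses_per_second;
}
```

Under these assumptions, 50 Mbps with 4-byte reads works out to 640 ns per read, while 1081 Mbps with 4-byte writes works out to roughly 30 ns per write. That gap is what one would expect if each read waits for a full round trip and writes do not, but it is an inference from the numbers, not a measurement.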
Joel,
Yes, the PCIe window needs to be mapped non-cacheable.
ioremap() would do that by default.
Regarding the lower write performance using CPU writes, your assumption makes sense. Do you have a protocol analyzer to verify whether that is indeed the case? Also, can you check the CPU usage during read/write? I assume you are not doing simultaneous reads and writes?
Thanks.
Hemant
Hi Hemant,
Thanks for the reply. Do you have any thoughts regarding #3? If ioremap() maps the PCIe window as non-cacheable, and remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT, vsize, pgprot_noncached(vma->vm_page_prot)) also maps it as non-cacheable, why is writing through the remap_pfn_range() mapping roughly 4x slower than writing through the ioremap() mapping? Any ideas here? Is "write-combining" something that could come into play?
Unfortunately I do not have access to a protocol analyzer, so I cannot verify TLP sizes and such.
Also, do you have any idea why reading is 20x slower than writing?
Thanks,
Joel
Regarding the performance difference between kernel and user memcpy over the PCIe window, my (wild) guess is that the kernel memcpy is more optimized. Another aspect that needs to be checked is whether scheduling allows the user process doing the copy to run uninterrupted. Can you monitor the CPU usage during the memcpy?
I suggest using profiling to see what exactly is eating time here. The same applies to the read performance difference.
Does your use case only involve CPU transfers from the DM8168 device, or do you (or do you intend to) perform similar transfers from the C6678 too? If yes, have you seen read/write numbers for such transfers?
Hi Hemant,
Thanks for the reply. I will look into the profiling/monitoring suggestions that you mentioned. Currently my use case only involves transfers to/from C6678 memory initiated by the DM8168 ARM, so I don't have any numbers for C6678 CPU-driven transfers.
-Joel
FYI, for anyone who may be looking at this thread in the future: I am now using EDMA to do the transfers to the DSP memory over PCIe and am seeing ~480MB/s throughput, which is fast enough for me. I think this is evidence that the speed differences may be due to transaction sizes (TLP sizes), as the EDMA controller would issue larger bus transactions.
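To illustrate why transaction size could matter, here is a rough sketch of link efficiency as a function of TLP payload size. The per-TLP overhead is passed in as a parameter because it depends on header format and link generation; a figure of about 20 bytes (framing, sequence number, 3-DW header, LCRC) is used below purely as an assumption. The function name `tlp_efficiency` is hypothetical.

```c
#include <assert.h>

/* Fraction of link bytes that carry payload when every TLP carries
 * `payload_bytes` of data plus `overhead_bytes` of framing, header and CRC.
 * The overhead value is an input: this sketch does not derive it. */
double tlp_efficiency(unsigned payload_bytes, unsigned overhead_bytes)
{
    assert(payload_bytes + overhead_bytes > 0);
    return (double)payload_bytes /
           (double)(payload_bytes + overhead_bytes);
}
```

With an assumed 20 bytes of overhead, a 4-byte payload gives about 17% efficiency and a 64-byte payload about 76%. This only models bytes on the wire; it says nothing about how quickly the CPU or the EDMA controller can issue transactions, which may well dominate in practice.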
I am getting exactly the same issue. Did you change anything on the Linux driver side (especially in ioremap function of the driver) for this performance improvement? I would greatly appreciate your help. Thanks in advance.
Hi Anthony,
I haven't worked on this part of our system in a while, but if I recall correctly, once I switched to using EDMA I didn't investigate CPU-copy based performance issues any further. Is using DMA a possibility for you? I will have to revisit the PCIe driver for my project soon, so I may discover more at that time. If I do, I'll post here.
Hi Joel and Hemant
I am having exactly the same problem. I am currently trying to use EDMA to do a data transfer between a DM8168EVM and a Xilinx 7 FPGA.
However, for some reason, the data does not get written into the FPGA. The EDMA itself is working, as I am able to DMA between two memory addresses.
For testing only, I set up the source of the EDMA as a memory on the DM8168 and the destination as BAR[2], which is block RAM on the FPGA.
Only the first address of BAR[2] gets written; the rest are all 0s.
A single memory write works, as I am able to change the value in the FPGA block RAM by using devmem.
Do you spot anything wrong?
Would you mind sharing your lspci dump, as well as some more details or code on how you used EDMA?
Thanks in advance.
Will
Will,
Can you provide info on the EDMA transfer you are doing? E.g., the following would be helpful:
1) SYNC mode -- A, B, AB?
2) A, B, C counts and increment values?
3) Source address, destination address, transfer size
Also, I suggest you try doing bursts of at least 16 bytes, using transfers that are multiples of 16 bytes (if required) and aligned to 16-byte address boundaries, to see if it works.
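The alignment part of this suggestion is easy to check before programming a transfer. A minimal sketch, assuming 32-bit bus addresses and a power-of-two burst size; `is_burst_aligned` is a hypothetical helper, not part of any EDMA driver API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return true when source address, destination address and length are all
 * multiples of `burst` bytes. `burst` must be a non-zero power of two. */
bool is_burst_aligned(uint32_t src, uint32_t dst, uint32_t len, uint32_t burst)
{
    assert(burst != 0 && (burst & (burst - 1)) == 0);

    uint32_t mask = burst - 1;

    /* OR-ing the three values lets one mask test cover all of them. */
    return ((src | dst | len) & mask) == 0;
}
```

A call such as `is_burst_aligned(src, dst, len, 16)` returning false would point to a setup problem before any PCIe-side debugging is needed.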
Hemant,
I have tried a couple of things.
First, A-SYNC, A count = 4, B count = 2048, C count = 1. The data is successfully transferred, but the speed is low.
Second, AB-SYNC, A count = 4, B count = 4, C count = 512. Is this a 16-byte burst? In this case, only the first 4 bytes of every 16 bytes get written into the FPGA block RAM. The rest are all zeros.
Source address: 0x86C98000 Destination address: 0x20020000. Transfer size is 8k bytes.
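As a sanity check on the two configurations above, the count arithmetic can be sketched as follows. In EDMA3 terms, an A-synchronized event moves one array of ACNT bytes, and an AB-synchronized event moves one frame of BCNT arrays. The struct and function names here are hypothetical and are not the Linux EDMA driver API.

```c
#include <assert.h>
#include <stdint.h>

enum edma_sync { EDMA_SYNC_A, EDMA_SYNC_AB };

/* Hypothetical container for the three EDMA3 count fields. */
struct edma_counts {
    uint32_t acnt; /* bytes per array   */
    uint32_t bcnt; /* arrays per frame  */
    uint32_t ccnt; /* frames per block  */
};

/* Total bytes moved by the whole transfer. */
uint64_t edma_total_bytes(struct edma_counts c)
{
    return (uint64_t)c.acnt * c.bcnt * c.ccnt;
}

/* Bytes moved in response to a single sync event. */
uint32_t edma_bytes_per_event(struct edma_counts c, enum edma_sync sync)
{
    assert(sync == EDMA_SYNC_A || sync == EDMA_SYNC_AB);
    return sync == EDMA_SYNC_A ? c.acnt : c.acnt * c.bcnt;
}

/* Number of sync events needed to complete the transfer. */
uint32_t edma_events_needed(struct edma_counts c, enum edma_sync sync)
{
    assert(sync == EDMA_SYNC_A || sync == EDMA_SYNC_AB);
    return sync == EDMA_SYNC_A ? c.bcnt * c.ccnt : c.ccnt;
}
```

Both configurations come to 8192 bytes in total. The first moves 4 bytes per event over 2048 events; the second moves 16 bytes per event over 512 events, but as four separate 4-byte arrays. Whether those 16 bytes leave the transfer controller as a single burst on the bus is a separate question that this arithmetic does not answer.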
I also tried A-SYNC with A count = 16. However, again, only the first 4 bytes of every 16 bytes get written into the FPGA block RAM.
Does this relate to PCIe settings on both DM8168 and FPGA?
Thanks a lot for your help.