PCIe throughput on DM8168EVM connected to C6678EVM

This question is answered
Posted by Joel Keller
on Dec 21 2011 14:20 PM
Expert1200 points

Hi,

I have a DaVinci DM8168 EVM board connected to a C6678 EVM board via PCIe (using an AMC-to-PCIe adapter).  I am interested in transferring data from the DM8168 to the DDR memory of the C6678 as fast as possible.  For now I want to do this programmatically from the A8 ARM processor, in other words without using DMA.  I am basing my code on the PCIe bootloader example that comes with the MCSDK for the C6678.

My code, like the example code, simply calls ioremap() to map the BAR corresponding to DDR memory on the C6678 into the Linux kernel address space.  I then use memcpy() to copy data to the remote device's memory (C6678 DDR).  Here are some performance measurement results:

using ioremap():

~ 1081 Mbps writing to 'remote' DDR

~ 50 Mbps reading from 'remote' DDR

I then added mmap support to my driver, which maps the device's BAR to usermode, via the following call:

remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT, vsize, pgprot_noncached(vma->vm_page_prot));

Copying data from user mode via memcpy(), I obtained the following performance measurements:

write: ~ 295 Mbps writing to remote DDR

read: ~ 36 Mbps reading from remote DDR

If I remove pgprot_noncached() from the remap_pfn_range() call, I see significantly faster rates, but I get cache coherency issues: writes do not actually reach the remote DDR, and reads are not fetched from it (verified by using JTAG to inspect the C6678 DDR).
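
For readers who want to reproduce numbers like the ones above, here is a minimal user-space sketch of one way to time a memcpy() and report the rate in Mbps. It is not the poster's measurement code, which is not shown in the thread. The function names are hypothetical, and timespec_get() is used only because it is standard C11; a monotonic clock would be a better choice where available.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Convert a byte count and elapsed time to megabits per second
 * (1 Mbit = 10^6 bits). */
double to_mbps(size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes * 8.0 / seconds / 1e6;
}

/* Wall-clock time in seconds, using the C11 timespec_get() call. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Copy len bytes from src to dst reps times and return the average rate.
 * In the scenario described above, dst (for writes) or src (for reads)
 * would point into the mapped BAR. */
double memcpy_rate_mbps(void *dst, const void *src, size_t len, unsigned reps)
{
    double start = now_seconds();
    for (unsigned i = 0; i < reps; i++)
        memcpy(dst, src, len);
    double elapsed = now_seconds() - start;
    if (elapsed <= 0.0)
        return 0.0; /* too fast to measure with this clock */
    return to_mbps(len * reps, elapsed);
}
```

Using a large enough len and reps keeps the timer resolution from dominating the result.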

I have the following questions:
1)  Since the DM8168 and C6678 are both PCIe Gen2 dual-lane (x2) devices, the theoretical bus bandwidth between the two should be 10 Gbps.  We can knock that down to 8 Gbps due to 8b/10b encoding overhead.  So 8 Gbps (1 GB/s) is my theoretical maximum throughput, but I am getting about 1 Gbps (via PIO on the ioremapped region).  Why?  I read elsewhere that doing PIO may limit the PCIe TLPs to 4-byte payloads.  Assuming ~18 bytes of overhead per TLP, I might be sending 22 bytes for every 4 bytes of payload.  Dividing 8 Gbps by 5.5 (22/4) gives about 1.45 Gbps, which is close to what I am measuring.  Is this what is going on here?
2)  How does ioremap() map the PCI address range with respect to caching?  Are the L1 and L2 caches completely bypassed, or is something more complicated going on?  If the caches are disabled for ioremapped addresses, then why is it significantly faster than remapping using pgprot_noncached()?  If caching is enabled, why do I not see cache coherency issues in this case?
3)  Why the drastic difference between ioremap() and remap_pfn_range() using pgprot_noncached()?  What is the difference here?
I'd be very grateful for any clarification.
Thanks,
Joel 
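
The estimate in question 1 can be written out as a small calculation. This is only a sketch of the poster's arithmetic: the 18-byte per-TLP overhead and the 4-byte payload are the poster's assumptions, not measured values, and the function names are hypothetical.

```c
#include <assert.h>

/* Fraction of the bytes on the link that carry payload when every TLP
 * has `payload` data bytes plus `overhead` bytes of header and framing. */
double tlp_efficiency(unsigned payload, unsigned overhead)
{
    assert(payload > 0);
    return (double)payload / (double)(payload + overhead);
}

/* Payload rate for a link whose post-encoding rate is link_gbps. */
double effective_gbps(double link_gbps, unsigned payload, unsigned overhead)
{
    return link_gbps * tlp_efficiency(payload, overhead);
}
```

With these assumptions, effective_gbps(8.0, 4, 18) comes to about 1.45 Gbps, while the same call with a 64-byte payload gives about 6.24 Gbps, which shows how strongly the payload size affects the achievable rate.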
All Replies
  • Joel Keller
    Posted by Joel Keller
    on Dec 21 2011 14:26 PM
    Expert1200 points

    An additional question:

    4)  Why the huge difference between read/write performance?  I realize that reading is a two-way transaction (read-response vs write), but why is reading an order of magnitude slower?
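
One way to look at question 4: PCIe memory writes are posted (the sender does not wait for a reply), while memory reads are non-posted and need a completion to come back. If the CPU issues one uncached read at a time and waits for each completion, the measured rate implies a per-read round-trip time. The sketch below computes it. The one-read-at-a-time behaviour and the 4-byte read size are assumptions, and the function name is hypothetical.

```c
#include <assert.h>

/* Time per read, in nanoseconds, implied by a measured read rate when
 * reads are issued strictly one after another and each returns
 * bytes_per_read bytes. */
double implied_ns_per_read(double measured_mbps, unsigned bytes_per_read)
{
    assert(measured_mbps > 0.0);
    double bits = 8.0 * (double)bytes_per_read;
    /* bits / (Mbit/s) gives microseconds; multiply by 1000 for ns. */
    return bits / measured_mbps * 1000.0;
}
```

Under these assumptions, the ~50 Mbps ioremap() read figure corresponds to implied_ns_per_read(50.0, 4), about 640 ns per 4-byte read.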

  • HemantPedanekar
    Posted by HemantPedanekar
    on Dec 26 2011 06:46 AM
    Expert5645 points

    Joel,

    Yes, the PCIe window needs to be mapped non-cacheable.

    ioremap() would do that by default.

    Regarding the lower write performance using CPU writes, your assumption makes sense. Do you have a protocol analyzer to verify whether that is indeed the case? Also, can you check the CPU usage during reads and writes? I assume you are not doing simultaneous reads and writes?

    Thanks.

       Hemant

  • Joel Keller
    Posted by Joel Keller
    on Dec 28 2011 08:59 AM
    Expert1200 points

    Hi Hemant,

    Thanks for the reply.  Do you have any thoughts regarding #3?  If ioremap() maps the PCIe window as non-cacheable, and remap_pfn_range(vma, vma->vm_start, physical >> PAGE_SHIFT, vsize, pgprot_noncached(vma->vm_page_prot)) also maps the PCIe window as non-cacheable, why is writing to the memory mapped via remap_pfn_range() 4x slower than writing to the ioremapped memory?  Any ideas here?  Is "write-combining" something that could come into play?

    Unfortunately I do not have access to a protocol analyzer, so I cannot verify TLP sizes and such.  

    Also, do you have any idea why reading is 20x slower than writing?

    Thanks,


    Joel

  • HemantPedanekar
    Posted by HemantPedanekar
    on Jan 09 2012 03:46 AM
    Verified Answer
    Verified by Joel Keller
    Expert5645 points

    Joel,

    Regarding the performance difference between kernel and user memcpy() over the PCIe window, my (wild) guess is that it could be due to the kernel memcpy() being more optimized. Another aspect to check is whether scheduling allows the user process doing the copy to run uninterrupted. Can you monitor the CPU usage during memcpy()?

    I suggest using profiling to see what exactly is eating time here. The same applies to the read performance difference.

    Does your use case only involve CPU transfers from the DM8168 device, or do you also do (or intend to do) similar transfers from the C6678? If so, have you seen read/write numbers for such transfers?

    Thanks.

       Hemant

  • Joel Keller
    Posted by Joel Keller
    on Jan 11 2012 09:12 AM
    Expert1200 points

    Hi Hemant,

    Thanks for the reply.  I will look into the profiling/monitoring suggestions that you mentioned.  Currently my use case only involves transfers to/from C6678 memory initiated by the DM8168 ARM, so I don't have any numbers for C6678 CPU-driven transfers.

    -Joel

  • Joel Keller
    Posted by Joel Keller
    on Mar 01 2012 09:21 AM
    Expert1200 points

    FYI, for anyone who may be looking at this thread in the future:  I am now using EDMA to do the transfers to the DSP memory over PCIe and am seeing ~480 MB/s throughput, which is fast enough for me.  I think this is evidence that the speed differences may be due to transaction sizes (TLP sizes), as the EDMA controller would issue larger bus transactions.
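
Note that the EDMA figure is quoted in megabytes per second while the earlier PIO figures are in megabits per second. A small sketch of the conversion, assuming 1 MB = 10^6 bytes and using hypothetical function names:

```c
#include <assert.h>

/* Convert megabytes per second to megabits per second. */
double mbytes_to_mbps(double mbytes_per_s)
{
    return mbytes_per_s * 8.0;
}

/* Ratio of two rates expressed in the same unit. */
double rate_ratio(double a, double b)
{
    assert(b > 0.0);
    return a / b;
}
```

Here mbytes_to_mbps(480.0) gives 3840 Mbps, which is roughly 3.6 times the ~1081 Mbps PIO write rate and a little under half of the 8 Gbps post-encoding link rate.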

     

  • Antony Thanesh
    Posted by Antony Thanesh
    on Jun 29 2012 16:12 PM
    Prodigy10 points

    I am getting exactly the same issue. Did you change anything on the Linux driver side (especially in ioremap function of the driver) for this performance improvement? I would greatly appreciate your help. Thanks in advance.

  • Joel Keller
    Posted by Joel Keller
    on Jul 03 2012 08:36 AM
    Expert1200 points

    Hi Antony,

    I haven't worked on this part of our system in a while, but if I recall correctly, once I switched to using EDMA I didn't investigate CPU-copy based performance issues any further. Is using DMA a possibility for you?  I will have to revisit the PCIe driver for my project soon, so I may discover more at that time.  If I do, I'll post here.

    -Joel

  • Qinhao Xu
    Posted by Qinhao Xu
    on Aug 30 2012 16:58 PM
    Prodigy30 points

    Hi Joel and Hemant

    I am having exactly the same problem. I am currently trying to use EDMA to do a data transfer between the DM8168 EVM and a Xilinx 7 FPGA.

    However, for some reason, the data does not get written into the FPGA. The EDMA itself is working, as I am able to DMA between two memory addresses.

    For testing only, I set up the EDMA source as a memory region on the DM8168 and the destination as BAR[2], which is block RAM on the FPGA.

    Only the first address of BAR[2] gets written; the rest are all 0s.

    A single memory write works, as I am able to change the value in the FPGA block RAM by using devmem.

    Do you spot anything wrong?

    Would you mind sharing your lspci dump, as well as some more details or code on how you used EDMA?

    Thanks in advance.

    Will

  • HemantPedanekar
    Posted by HemantPedanekar
    on Aug 30 2012 22:46 PM
    Expert5645 points

    Will,

    Can you provide info on the EDMA transfer you are doing? For example, the following would be helpful:

    1) SYNC mode -- A, B, AB?

    2) A, B, C counts and increment values?

    3) Source address, destination address, transfer size

    Also, I suggest you try bursts of at least 16 bytes, with transfer sizes in multiples of 16 bytes (if required) and aligned to 16-byte address boundaries, to see if that works.

    ---
       Hemant

     

  • Qinhao Xu
    Posted by Qinhao Xu
    on Sep 05 2012 07:58 AM
    Prodigy30 points

    Hemant,

    I have tried a couple of things.

    First, A-SYNC, A count = 4, B count = 2048, C count = 1. The data is successfully transferred, but the speed is low.

    Second, AB-SYNC, A count = 4, B count = 4, C count = 512. Is this a 16-byte burst? In this case, only the first 4 bytes of every 16 bytes are written into the FPGA block RAM. The rest are all zeros.

    Source address: 0x86C98000. Destination address: 0x20020000. Transfer size is 8 KB.

    I also tried A-SYNC with A count = 16. However, again, only the first 4 bytes of every 16 bytes are written into the FPGA block RAM.

    Does this relate to PCIe settings on both DM8168 and FPGA?

    Thanks a lot for your help.

    Will
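
The EDMA3 count fields discussed above can be sanity-checked with a little arithmetic. In EDMA3 terms, an A-synchronized transfer moves ACNT bytes per sync event and needs BCNT x CCNT events, while an AB-synchronized transfer moves ACNT x BCNT bytes per event and needs CCNT events. The sketch below encodes that and a simple alignment check. The type and function names are hypothetical and are not taken from any TI driver, and the code says nothing about how the transfer controller actually bursts data on the bus.

```c
#include <assert.h>
#include <stdint.h>

enum edma_sync { EDMA_A_SYNC, EDMA_AB_SYNC }; /* hypothetical names */

struct edma_xfer {
    uint32_t acnt; /* bytes per array */
    uint32_t bcnt; /* arrays per frame */
    uint32_t ccnt; /* frames per block */
    enum edma_sync sync;
};

/* Total bytes moved by the whole transfer. */
uint32_t edma_total_bytes(const struct edma_xfer *x)
{
    return x->acnt * x->bcnt * x->ccnt;
}

/* Bytes moved for each sync event. */
uint32_t edma_bytes_per_event(const struct edma_xfer *x)
{
    return x->sync == EDMA_A_SYNC ? x->acnt : x->acnt * x->bcnt;
}

/* Number of sync events needed to complete the transfer. */
uint32_t edma_events_needed(const struct edma_xfer *x)
{
    return x->sync == EDMA_A_SYNC ? x->bcnt * x->ccnt : x->ccnt;
}

/* Nonzero if addr is a multiple of boundary. */
int is_aligned(uint32_t addr, uint32_t boundary)
{
    assert(boundary != 0);
    return (addr % boundary) == 0;
}
```

Both configurations reported above total 8192 bytes. The first moves 4 bytes per event over 2048 events, and the second moves 16 bytes per event over 512 events. The source and destination addresses quoted above are both 16-byte aligned.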
