AM6442: How to improve PCIE DMA transfer performance

Part Number: AM6442


Hi,

I'm currently using the AM6442 in EP mode, with Windows as the RC. I have successfully run the PCIe EP Enumeration test case from the MCU+ SDK, linked below.

AM64x MCU+ SDK: PCIE EP Enumeration

However, in the DMA test, we found that the data copy speed between Bar0 and the DMA memory is extremely low (compared to the copy speed between two local memories).

Our DMA memory is mapped to the RC's DMA memory and accessed via a PCIe outbound mapping, and the documentation mentions that the outbound address must be within a specified region.

I'm not sure what is affecting this speed, and I wonder whether there is any method to optimize the performance of copying data from the DMA-mapped address memory to local memory.

Best regards,

  • Hi,

    Thanks for your query.

    Can you please refer to sections 4.3 and 4.4 of the documentation below and let me know if you need further help?

    Sitara™ AM64x/AM243x Benchmarks: Cortex-R5 Memory Access Latency (Rev. B)

    Regards

    Ashwani

  • Hi,

    The DMA test performed by the pcie_enumerate_ep example is DMA from the RC's perspective (i.e. the EP writes directly to RC memory, without involving the RC CPU), but on the AM64x the copying is performed by the CPU.

    If you want to transfer a larger amount of data, you should use the BCDMA peripheral on the AM64x, similar to what's shown in the appnote Ashwani linked.

    You'll have to combine these examples yourself, e.g. copy over the code to set up the BCDMA from another example and use that instead of the "memcpy" call in "copyLoop".
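    For illustration, here is a rough sketch of what that replacement could look like. The copy loop is paraphrased from the example, and App_bcdmaBlockCopy is a hypothetical wrapper whose body you would lift from one of the SDK's UDMA/BCDMA memcpy examples; the names and signature here are assumptions, not the actual SDK API.

    ```c
    #include <stdint.h>

    /* Hypothetical wrapper around a BCDMA block-copy channel. In a real port
     * the channel setup and transfer submission would be lifted from one of
     * the MCU+ SDK UDMA/BCDMA memcpy examples and hidden behind a helper like
     * this. Returns 0 on success. */
    extern int32_t App_bcdmaBlockCopy(void *dst, const void *src, uint32_t numBytes);

    /* Sketch of the example's copy loop with the CPU memcpy swapped out.
     * 'obMem' is the outbound-mapped window into RC memory (PCIE0_DAT0 region),
     * 'barMem' is the EP's Bar0 backing memory (MSRAM). */
    static void copyLoopBcdma(void *obMem, const void *barMem,
                              uint32_t numBytes, uint32_t iterations)
    {
        for (uint32_t i = 0U; i < iterations; i++)
        {
            /* Original example: memcpy(obMem, barMem, numBytes) done by the R5F.
             * Here the BCDMA moves the data, so the R5F only waits for completion
             * (or could do other work in the meantime). */
            if (App_bcdmaBlockCopy(obMem, barMem, numBytes) != 0)
            {
                break; /* report/handle the DMA error in a real application */
            }
        }
    }
    ```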

    Regards,

    Dominic

  • Thanks for your reply.

    I will try to use the BCDMA to copy between the outbound mapping address memory and the local Bar0. If there is any problem or progress, I will update you.

    But I still have one doubt. I've tested the copy performance between two local memories using the CPU, and the copy speed is much higher than that between the outbound mapping address memory and Bar0. Both kinds of copy use the CPU, so why is there such a significant difference in copy performance?

    Looking forward to your reply.

    Best regards,

  • But I still have one doubt. I've tested the copy performance between two local memories using the CPU, and the copy speed is much higher than that between the outbound mapping address memory and Bar0. Both kinds of copy use the CPU, so why is there such a significant difference in copy performance?

    The latency from the AM64x's R5f to the RC's memory is much higher. From what I've seen, you typically have ~2us of latency, but we've also had one system that was a lot worse at 3-4us (maybe even 5us?).

    The latency from the AM64x's R5f to its own memory is much smaller, see the appnote Ashwani linked (~60ns for MSRAM, ~280ns for DDR).

    That latency is also an issue for the BCDMA, because a single channel is limited by its FIFO size (192 bytes IIRC) and the latency: 192 bytes / 2us = 96,000,000 bytes/s. That roughly matches the numbers from the appnote for a single BCDMA channel (868 Mbit/s).
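    As a back-of-the-envelope check of that number (using the approximate figures from above, so treat them as rough estimates):

    ```c
    #include <stdio.h>

    int main(void)
    {
        /* Rough single-channel BCDMA throughput estimate: with only one FIFO's
         * worth of data in flight, each 192-byte chunk pays the full ~2 us
         * round-trip latency. */
        const double fifoBytes  = 192.0;    /* BCDMA channel FIFO size (approx.)   */
        const double latencySec = 2.0e-6;   /* R5F <-> RC memory latency (approx.) */

        double bytesPerSec = fifoBytes / latencySec;     /* ~96e6 bytes/s */
        double mbitPerSec  = bytesPerSec * 8.0 / 1.0e6;  /* ~768 Mbit/s   */

        printf("~%.0f bytes/s (~%.0f Mbit/s) per channel\n", bytesPerSec, mbitPerSec);
        return 0;
    }
    ```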

    Whether that becomes a problem depends on your application, and you could optimize further by using multiple BCDMA channels in parallel (see the appnote for how well that scales).

    Regards,

    Dominic

  • The latency from the AM64x's R5f to the RC's memory is much higher.

    I'm sorry, but this deviates from my previous understanding. I thought that in the DMA test the data would be transferred by PCIe DMA (although this did confuse me, since the manual says our PCIe controller has no built-in DMA). In other words, I assumed the data from the RC's memory was already in the R5f's PCIE0_DAT0 memory (where the EP DMA mapping is) after the PCIe DMA, and that the memcpy you mentioned just copies from the PCIE0_DAT0 memory to MSRAM (where Bar0 is) using the CPU.

    But as you said, the DMA test is just the RC copying data to PCIe without the CPU. That is to say, the difference between the DMA test and the Bar test is only in the way the RC does the copying. Is my understanding correct?

    Best regards,

  • I don't think I fully understand how you think this works.

    The "DMA test" implemented in pcie_enumerate_ep copies data from the AM64's Bar0 to the RC's host memory. From the RC's point of view this is "DMA". Some other busmaster (the EP) writes to the RC's host memory.

    The outbound mapping doesn't copy any data. It just provides a means for the AM64x to access "PCIe address space" via an address in the PCIE0_DAT0 region. The RC makes sure that there is "memory" behind those PCIe addresses. So accessing a location in the PCIE0_DAT0 region that has an outbound mapping configured means the R5f is sending a transaction via PCIe to the RC, which in turn reads/writes its own memory.
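    As a concrete illustration of what such an access looks like from the R5f side (the base address and offset below are placeholders, not values from the example; the actual outbound ATU setup is done by the pcie_enumerate_ep example / the SDK PCIe driver):

    ```c
    #include <stdint.h>

    /* Placeholder for the PCIE0_DAT0 window base on the AM64x; in the SDK this
     * comes from the SoC CSL headers / PCIe driver configuration (assumption,
     * check the TRM memory map). */
    #define APP_PCIE0_DAT0_BASE   (0x68000000UL)
    #define APP_OB_OFFSET         (0x0UL)   /* offset covered by the outbound mapping */

    static void pokeRcMemory(void)
    {
        /* For the R5f this is an ordinary pointer access, but because an outbound
         * mapping covers this address range, every read/write becomes a PCIe
         * transaction that the RC turns into an access to its own memory. */
        volatile uint32_t *rcWindow =
            (volatile uint32_t *)(APP_PCIE0_DAT0_BASE + APP_OB_OFFSET);

        rcWindow[0] = 0x12345678U;    /* write: travels EP -> RC                     */
        uint32_t val = rcWindow[0];   /* read: stalls until the RC's data comes back */
        (void)val;
    }
    ```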

    The Bar0 memory on the other hand is local memory in the AM64x (MSRAM). When the RC writes data there it needs to go across the PCIe bus and via an inbound mapping in the AM64x PCIe to the location inside the AM64x.

    Both directions (RC accessing something in the EP via a Bar, EP accessing something in the RC via an outbound mapping) have a latency of ~2us (or more).

    Depending on how the "remote" memory is mapped (via the MPU on the EP/R5f, via page tables and MTRRs on an x86 RC), writing remote memory might not be as bad, because writes can be "fire and forget". Reading always incurs this latency.

    The pcie_enumerate_ep example doesn't use a specific MPU mapping for the outbound DMA region, so it gets "strongly ordered" characteristics where the R5f waits for every write to complete. You should be able to optimize that to use "device" or maybe even "normal uncached", but I'm not sure how many outstanding transactions you'll be able to get, i.e. this will likely hit some other limit.
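    If it helps, this is roughly what giving the outbound window "device" attributes could look like with the MCU+ SDK DPL MPU API (normally you would do this via SysConfig; the header path, field names and enum values here are from memory, and the region number, base and size are placeholders, so verify everything against MpuP_armv7.h before using it):

    ```c
    #include <kernel/dpl/MpuP_armv7.h>   /* MCU+ SDK DPL MPU API (path assumed) */

    /* Placeholders: pick a free MPU region and the actual outbound window range. */
    #define APP_MPU_REGION_NUM   (6U)
    #define APP_OB_WINDOW_ADDR   ((void *)0x68000000UL)  /* PCIE0_DAT0, assumption */
    #define APP_OB_WINDOW_SIZE   (0x01000000UL)          /* must be a power of two */

    static void setupObWindowAsDevice(void)
    {
        MpuP_RegionAttrs attrs;

        MpuP_RegionAttrs_init(&attrs);
        attrs.isEnable       = 1;
        attrs.isCacheable    = 0;   /* TEX=0, C=0, B=1 => "device" memory instead */
        attrs.isBufferable   = 1;   /* of the default "strongly ordered" mapping  */
        attrs.tex            = 0;
        attrs.isSharable     = 1;
        attrs.isExecuteNever = 1;   /* no need to execute from the PCIe window    */
        attrs.accessPerm     = MpuP_AP_ALL_RW;

        MpuP_setRegion(APP_MPU_REGION_NUM, APP_OB_WINDOW_ADDR,
                       APP_OB_WINDOW_SIZE, &attrs);
    }
    ```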

    Regards,

    Dominic

  • Thanks for your explanation, it really helps me a lot. And yes, after testing, we found that the write speed is higher than the read speed.

    I've changed the memory type of the PCIE0_DAT0 region to "cached", and it did improve the transfer performance a lot.

    I will continue trying to copy data between the outbound DMA region and local memory using the BCDMA.

    Thanks again!

    Best regards,