AM6442: AM64x PKTDMA buffers and burst lengths

Part Number: AM6442

Dear TI team,

we have a question regarding the DMA subsystem of the TI AM64x.

Our question concerns the Packet DMA (PKTDMA) hardware of the AM64x, which is described in section 11.1.1.4 of the TRM (Rev. A).

We would like to know:

  • The maximum read (write) burst size per TX (RX) channel.
  • The size of the "per channel FIFO buffers".

For illustration purposes: The TRM of the AM65x (Rev. E) contains this information in section 10.2.3.1.1 for the UDMA-P controller.

– Provides per-channel buffering:
   • Provides 16 word deep × 128-bit Packet FIFO for each Tx channel
   • Provides 4 word deep Packet Info FIFO for each Rx channel
   • Provides 8 word deep × 128-bit Packet Data FIFO for each Rx channel
   • Supports up to 32 Protocol Specific words for Tx packets
   • Supports up to 32 Protocol Specific words for Rx packets

...

– Provides a memory read access unit
   • Supports read bursts up to 128 bytes (limited by Tx Per Channel FIFO depth for the channel)
– Provides a memory write access unit
   • Supports write bursts up to 128 bytes (limited by Tx Per Channel FIFO depth for the channel)

Can you please provide us with the same information for the AM64x's PKTDMA controller?

Regards,

Dominic

  • We've dug into this some more. Here's some background:

    Our actual problem is that we're trying to transfer larger amounts of memory between the AM6[45]x and an x86 via PCIe. When using UDMA for the transfer on the AM65x, we achieve only ~60 MB/s (x86->AM6x, i.e. DMA reads). We resorted to using the DRU, which achieves a lot more (~300 MB/s, iirc).

    We had to come back to this issue, and had to use UDMA again. This time, we were trying to understand why the UDMA performed so badly. We noticed that the UDMA only has 128 bytes of FIFO per channel, although we're not 100% sure about this yet (there are channels that claim to have 1024 bytes...). The limited buffer space combined with the high latency (~2 µs+) of accessing the RC's memory via PCIe explains the low throughput we're seeing (128 bytes / 0.000002 s = 64,000,000 bytes/s ≈ 64 MB/s).
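
    For reference, here is the back-of-the-envelope model behind that number, as a small sketch. The 128-byte FIFO size and the 2 µs round-trip read latency are our own assumptions/measurements, not confirmed values:

      #include <stdio.h>

      /* Latency-bound throughput estimate for a store-and-forward DMA channel:
       * only fifo_bytes can be in flight at once, so each batch costs one full
       * round-trip read latency across PCIe. */
      int main(void)
      {
          const double fifo_bytes = 128.0;   /* assumed per-channel FIFO size     */
          const double latency_s  = 2e-6;    /* assumed PCIe read round-trip time */

          double bytes_per_s = fifo_bytes / latency_s;
          printf("latency-bound max: %.1f MB/s\n", bytes_per_s / 1e6);  /* ~64 MB/s */
          return 0;
      }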

    Eventually, that software is supposed to run on the AM64x, so we started looking into the AM64x TRM in parallel. Unfortunately the AM64x TRM doesn't mention the size of these buffers (or I couldn't find it), but it does mention the underlying problem. In chapter "11.1.1.4.2 Channel Classes" the TRM explains the need for "High capacity" and "Ultra-High Capacity" channels. Unfortunately it seems that the AM64x offers neither of these; at least, the DMASS0_PKTDMA_0_CAP3 register reads 0 for both UCHAN_CNT and HCHAN_CNT.
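
    This is roughly how we looked at that register, in case it helps (a sketch only: the physical address is a placeholder to be taken from the TRM, and the HCHAN_CNT/UCHAN_CNT bit positions are our reading of the Linux k3-udma driver, so please double-check them against the TRM):

      #include <fcntl.h>
      #include <stdint.h>
      #include <stdio.h>
      #include <sys/mman.h>
      #include <unistd.h>

      /* Placeholder: fill in the real DMASS0_PKTDMA_0 CAP3 physical address
       * from the AM64x TRM register map -- this value is NOT verified. */
      #define CAP3_PHYS_ADDR  0x0UL

      int main(void)
      {
          int fd = open("/dev/mem", O_RDONLY | O_SYNC);
          if (fd < 0) { perror("open /dev/mem"); return 1; }

          unsigned long pagesz = (unsigned long)sysconf(_SC_PAGESIZE);
          unsigned long base   = CAP3_PHYS_ADDR & ~(pagesz - 1);
          unsigned long off    = CAP3_PHYS_ADDR - base;

          volatile uint32_t *map = mmap(NULL, pagesz, PROT_READ, MAP_SHARED,
                                        fd, (off_t)base);
          if (map == MAP_FAILED) { perror("mmap"); return 1; }

          uint32_t cap3 = map[off / 4];
          /* Assumed field layout (our reading of k3-udma):
           * HCHAN_CNT = bits [22:14], UCHAN_CNT = bits [31:23]. */
          printf("CAP3 = 0x%08x  HCHAN_CNT=%u  UCHAN_CNT=%u\n",
                 cap3, (cap3 >> 14) & 0x1ff, (cap3 >> 23) & 0x1ff);

          munmap((void *)map, pagesz);
          close(fd);
          return 0;
      }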

    • Can you tell us what size the AM64x PKTDMA and BCDMA FIFOs are?
    • Is it true (I've checked the TRM and had a brief look at the actual register contents) that the AM64x has zero high/ultra-high capacity DMA channels, and is thus not suitable for "moderate per-channel bandwidth with significantly increased latency"?
    • Are there any alternatives we haven't thought of?

    Regards,

    Dominic

  • although we're not 100% sure about this yet (there are channels that claim to have 1024 bytes...)

    Despite what the AM65x TRM says, it appears that there are a few high-capacity channels on the AM65x. Using those we're not limited by the DMA channel throughput anymore. The PDK UDMA driver appears to know about these, and has some entries with numTx/RxHcCh set to '2'. Unfortunately for the AM64x they're all set to '0'.

    Still hoping for my questions above to be answered.

    Regards,

    Dominic

  • Hi Dominic,

    it is true that AM64x does not support high/ultra-high capacity channels, as shown in the CAP3 register. The FIFO depths are defined in the TFIFO_DEPTH registers in both the PKTDMA and BCDMA modules, and since AM64x only supports normal channel capacity, the maximum value is 192 bytes. (Note that this should be reflected as the reset value in these registers, which should read 0xC0 = 192, not 192h as shown in the TRM. That is a typo which will be fixed.)

    The burst sizes are defined in TCFG.TX_BURST_SIZE and RCFG.RX_BURST_SIZE for both PKTDMA and BCDMA, and both have a maximum of 64 bytes, again reflected in the reset value of the bitfield.

    I don't see any alternative here at the moment.  AM65x and AM64x have completely different DMA architectures mainly for cost savings.  Is there a certain throughput you are looking to achieve?

    Regards,

    James 

  • Hello James,

    thanks for the fast response (even though the answer is not what I hoped for).

    We need to achieve 1 Gbit/s (~120 MB/s) in order to forward Ethernet traffic from ICSS-G to the x86 RC. With the 128-byte FIFO of the AM65x we never saw more than 60 MB/s, and only ~300 Mbit/s with the actual Ethernet->AM65x->PCIe->x86 setup. I don't think that's going to work with the 192-byte buffers.

    Regards,

    Dominic

  • Dominic,

    Unfortunately the FIFO depth of 192 bytes will limit the throughput of a single channel on the AM64x to ~96 MB/s with a read latency of 2 microseconds (192 B / 2 µs). With PCIe on AM64x we are able to support 4 channels in parallel, so if you can structure the data so that multiple channels can be used, that would allow meeting 1 Gbit/s with 2-microsecond latency.
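
    As a quick sanity check of those numbers, a minimal sketch using the same latency-bound model (the 2 µs read latency and the ~120 MB/s target are the figures already quoted in this thread):

      #include <math.h>
      #include <stdio.h>

      int main(void)
      {
          const double fifo_bytes = 192.0;   /* AM64x normal-capacity channel FIFO */
          const double latency_s  = 2e-6;    /* assumed PCIe read round-trip time  */
          const double target_Bps = 120e6;   /* ~1 Gbit/s of Ethernet payload      */

          double per_chan_Bps = fifo_bytes / latency_s;               /* ~96 MB/s   */
          int channels_needed = (int)ceil(target_Bps / per_chan_Bps); /* at least 2 */

          printf("per channel: %.0f MB/s, channels needed: %d\n",
                 per_chan_Bps / 1e6, channels_needed);
          return 0;
      }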

      Pekka

  • Hello Pekka,

    thanks for your confirmation.

    In the meantime we revisited the normal-capacity channels on the AM65x, and noticed that the FIFOs are configured to 128 bytes, while the TRM says the FIFOs are "16 word deep × 128-bit Packet FIFO", which would amount to 256 bytes. We increased the FIFO sizes, and up to 192 bytes we were seeing an increase in performance, up to about 580 Mbit/s (actual Ethernet traffic measured). At 128 bytes we were seeing about 400 Mbit/s of Ethernet traffic.

    What's really problematic is that we've seen huge differences in latencies between different x86 systems. With those small FIFOs the AM64x is very susceptible to this latency. That could prove to be a real problem for our application.

    With PCIe on AM64x we are able to support 4 channels in parallel

    could you elaborate on this some more? I'm not sure what the connection between "PCIe on AM64x" and "4 channels" is. With PCIe x1 Gen2 the AM64x's PCIe is limited to < 4 Gbit/s, which is roughly 5x what a single DMA channel could handle (theoretical throughput in the 2us example), but maybe there's some other limitation that restricts us to a maximum of 4 channels?
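
    For what it's worth, the "roughly 5x" comes from this simple comparison (again assuming the 2 µs latency figure; link overheads beyond 8b/10b encoding are ignored):

      #include <stdio.h>

      int main(void)
      {
          /* PCIe x1 Gen2: 5 GT/s with 8b/10b encoding -> ~4 Gbit/s on the link */
          const double pcie_bps = 5e9 * 8.0 / 10.0;
          /* single normal-capacity channel, latency-bound: 192 B / 2 us        */
          const double chan_bps = (192.0 / 2e-6) * 8.0;

          printf("link vs. single channel: %.1fx\n", pcie_bps / chan_bps);  /* ~5.2x */
          return 0;
      }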

    Regards,

    Dominic

  • With PCIe on AM64x we are able to support 4 channels in parallel

    could you elaborate on this some more? I'm not sure what the connection between "PCIe on AM64x" and "4 channels" is. With PCIe x1 Gen2 the AM64x's PCIe is limited to < 4 Gbit/s, which is roughly 5x what a single DMA channel could handle (theoretical throughput in the 2us example), but maybe there's some other limitation that restricts us to a maximum of 4 channels?

    For a single DMA channel the FIFO depth is the weakest link when trying to go across interfaces with multi-microsecond latency. On the path to PCIe there is a further restriction on the total number of transactions in flight across the PCIe link, but that limit is well above the three 64-byte TLPs that the 192-byte FIFO of a single DMA channel can have outstanding. When testing for the maximum bandwidth achievable with AM64x DMA over PCIe, we measured that 4 parallel DMA channels give the best throughput.

    Natively, Ethernet receive is serial: a frame arrives and DMA is triggered; even if that DMA has not finished when the next frame arrives, the DMA for the next frame is only triggered once the previous frame has landed in the destination memory (i.e. the remote PCIe side has acknowledged receiving it). In the opposite direction Ethernet uses one channel per queue (typically mapped to VLAN priority).

    If the use case can somehow leverage this, you can get to a bandwidth across PCIe that is higher than a single channel can achieve. Maybe something like bouncing: terminate the transfer first locally in SRAM, then trigger a round robin of, for example, 4 BCDMA channels to push the data across PCIe, and do the reverse in the other direction.
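
    A minimal sketch of that round-robin dispatch pattern (the submit_to_channel()/wait_channel_idle() helpers are hypothetical placeholders for whatever PKTDMA/BCDMA driver API is actually used, e.g. the PDK Udma driver or Linux dmaengine; only the scheduling idea is shown):

      #include <stddef.h>
      #include <stdint.h>

      #define NUM_CHANNELS 4   /* e.g. 4 BCDMA channels as suggested above */

      /* Hypothetical driver hooks -- replace with the real channel submit/
       * completion calls of the DMA driver in use. */
      extern int submit_to_channel(int ch, const void *src,
                                   uint64_t dst_pci_addr, size_t len);
      extern int wait_channel_idle(int ch);

      /* Spread frame-sized chunks (already bounced into local SRAM) across
       * several channels so that more PCIe transactions are in flight than a
       * single 192-byte channel FIFO allows. */
      int forward_frames(const void *sram_frames[], const size_t lens[],
                         size_t count, uint64_t dst_pci_addr)
      {
          int ch = 0;
          for (size_t i = 0; i < count; i++) {
              if (submit_to_channel(ch, sram_frames[i], dst_pci_addr, lens[i]) != 0)
                  return -1;
              dst_pci_addr += lens[i];
              ch = (ch + 1) % NUM_CHANNELS;   /* next chunk goes out on the next channel */
          }
          for (ch = 0; ch < NUM_CHANNELS; ch++)   /* drain before reusing the SRAM area */
              if (wait_channel_idle(ch) != 0)
                  return -1;
          return 0;
      }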

    Probably the most concise description of the DMA channels is in chapter 2 of https://software-dl.ti.com/processor-sdk-linux-rt/esd/AM64X/07_03_00_02/exports/docs/linux/Foundational_Components/Kernel/Kernel_Drivers/Network/CPSW3g.html#multi-port-switch-mode .