AM6442: Burst size limitations of the GPMC interface

Part Number: AM6442
Other Parts Discussed in Thread: TMDS64EVM

Tool/software:

Hello,

I'm about to try transferring data via the GPMC interface with bursts as long as possible, with the goal of maximizing throughput.

Currently, the GPMC is operated in 16-bit mode with a moderate 33.33 MHz clock.

There is a GPMC parameter termed ATTACHEDDEVICEPAGELENGTH in the TRM (bits 24:23 of CONFIG1_i). In conjunction with the standard omap-gpmc driver and the Linux device tree it is called gpmc,burst-length, with a slightly different notation: gpmc,burst-length specifies the maximum burst length in words rather than the encoded value that is written into ATTACHEDDEVICEPAGELENGTH. When gpmc,burst-length is set to 16, for instance, ATTACHEDDEVICEPAGELENGTH turns out to be 0b10, which corresponds to 16 words. So this seems just fine.
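
For reference, this mapping can be sketched as a small helper (illustrative only; the function name is mine, not from the driver):

        /* Map a gpmc,burst-length value in words (4/8/16/32) to the
           two-bit ATTACHEDDEVICEPAGELENGTH encoding (0b00..0b11). */
        static int page_length_encoding(unsigned int burst_words)
        {
                switch (burst_words) {
                case 4:  return 0; /* 0b00 */
                case 8:  return 1; /* 0b01 */
                case 16: return 2; /* 0b10 */
                case 32: return 3; /* 0b11 */
                default: return -1; /* not a valid burst length */
                }
        }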

One odd thing I found right away is that the SDK (I'm using 11.00.09.04 at the moment) does not support the maximum burst length of 32 words, which should be applicable in 16-bit mode according to the TRM. For testing purposes I patched the omap-gpmc driver a little in order to be able to set ATTACHEDDEVICEPAGELENGTH to 0b11 through the device tree settings.

Now, the corresponding GPMC address window can be mmapped by a Linux process and then accessed like normal memory. This works, and I see bursts being used during reads and writes.

In order to move data between main memory and a GPMC window, it seems sensible to use the memcpy() that comes with the SDK. Although I did not check this in detail, I assume the memcpy() implementation is highly optimized for the processor.

For instance, this piece of code moves 64 bytes of data from a buffer buffer1 in main memory to a pointer gpmc_space representing memory within the GPMC area, and then back into main memory at buffer2:

        /* main memory -> GPMC window */
        memcpy( (void*) gpmc_space, (void*) buffer1, 64);
        /* GPMC window -> main memory */
        memcpy( (void*) buffer2, (void*) gpmc_space, 64);

It should be noted that gpmc_space, as I used it for testing, pointed at the very beginning of a GPMC window - so it was well aligned. buffer1 and buffer2 were ordinary 64-bit integer arrays declared within the C function. AFAIK the compiler aligns such arrays to their element type, so they should be aligned to 64 bits here.
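
If one does not want to rely on the compiler, the alignment can also be forced explicitly (a C11 sketch; in my test the buffers were plain function-local arrays):

        #include <stdalign.h>
        #include <stdint.h>

        /* Two 64-byte buffers, explicitly aligned to the 64-byte cache
           line size instead of the natural 8-byte uint64_t alignment. */
        alignas(64) uint64_t buffer1[8];
        alignas(64) uint64_t buffer2[8];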

As it turns out, only bursts with a length of 8 words appear at the GPMC interface for writes, while reads use bursts with a length of 16 words.

A burst length setting of 16 instead of 32 has no effect on the actual burst lengths.

So by using the processor, only write bursts of 8 words (16 bytes here) can be generated, while read bursts of 16 words, i.e. 32 bytes, can be created.

Is this normal behavior, and are there tweaks to optimize it? The GPMC is certainly not attached to the cache coherency mechanism of the processor, but at least the cache line size is 64 bytes. I don't know how this is handled within the processor, but I think there is a good chance that 64 bytes can be transferred at once here.

Thanks,

Mario

  • Hello Mario,

    Please allow me one or two days to get back to you regarding the above queries.

    FYI, TI India will be on holiday on 15th August 2025.

    Regards,

    Anil.

  • Hi Anil,

    thanks for coming back on that - so I know someone is taking care of it.

    I just wrote down a few additional thoughts about the situation. Here they are:

    The TRM (SPRUIM2H) states the following:

    • "32-bit interconnect target interface which supports non-wrapping and wrapping burst of up to 16x32 bits." ("Section 12.3.3.1.1 GPMC Features")

    Furthermore it ("Section 12.3.3.1.1 GPMC Features") states that "The GPMC supports the following various access types" (among others):

    • Asynchronous read page access (4-8-16-32 Word16, 4-8-16 Word32)
    • Synchronous read/write burst access with/without wrap capability (4-8-16-32 Word16, 4-8-16 Word32)

    So generally speaking, the GPMC should be able to generate burst sizes of up to 64 bytes.

    Of course, this does not necessarily mean that the processor can cause them.

    There is also a section "12.3.3.4.9.5 System Burst vs External Device Burst Support" in the TRM. It states that:

    The device system can issue the following requests to the GPMC:

    • Incrementing fixed-length bursts of two, four, and eight words

    Questions here: 

    • What exactly is referred to as the "device system"?
    • What is referred to as "words"?

    As for the definition of "word", one could assume 64 bits here, as the AM64x is a 64-bit device. 8 × 64 bit would also match those 64 bytes again. And the "device system" is probably not just the actual processor but other units as well - namely DMA engines.

    In section "12.3.3.4.5 GPMC Interconnect Port Interface" the TRM states the following things:

    • The GPMC interconnect interface is a pipelined interface including a 16 × 32-bit word write buffer. (Note: Here we have 64 bytes again)
    • The device system can issue eight incrementing 32-bit interconnect accesses (read/write) (Note: among others - 8 is the largest one)
    • Only power-of-two-length precise bursts 2 × 32, 4 × 32, 8 × 32, and 16 × 32, with the burst base address aligned on the total
      burst size, are supported

    Here is a discrepancy: on the one hand it is stated that the "device system" can issue up to 8 incrementing 32-bit accesses through the system, while on the other hand bursts of up to 16 × 32 are mentioned. There is no mention of the 16-bit case, but one might assume that the burst lengths are simply doubled there.

    As can be seen from the default omap-gpmc driver for Linux, configuring the GPMC to generate burst lengths of 64 bytes (meaning 32 × 16 bit) is not supported, although the GPMC can at least be set to allow this (which means that ATTACHEDDEVICEPAGELENGTH has to be set to 3). I guess this limitation exists because the typical devices attached to the GPMC only support burst lengths of up to 16 words at 16 bit anyway. In the meantime I have tweaked the omap-gpmc driver a little so that it can also be set to allow burst lengths of up to 32. However, this has no effect, in particular on read operations, where I'm still observing burst lengths of 16.

    The fact that the processor cannot be used to generate burst lengths of 32 is not necessarily a surprise to me - although there might be ways to accomplish that which I'm just not aware of.

    But what about DMA, in particular the "BCDMA"? Would BCDMA be able to have the GPMC generate burst lengths of 32 @ 16 bit or 16 @ 32 bit while transferring blocks of data between main memory and the GPMC, hence maximizing throughput?

    Btw., maybe a bit out of scope, but the "Device Overview" of the TRM, namely section "1.3.17 General Purpose Memory Controller (GPMC)", mentions neither the capability of the GPMC to operate at 32 bit nor a 32-word burst capability at 16 bit. However, I guess this is some sort of copy&paste error and the text comes from an older microcontroller.

    Greetings,

    Mario

  • In the meantime, are there any ideas regarding this matter?

    Thanks,

    Mario

  • Hello Mario,

    I feel that, based on the above information, you are doing the testing on the A53 core and using Linux. Please confirm.

    The ATTACHEDDEVICEPAGELENGTH field in the TRM defines the maximum page/burst length that the GPMC can use, up to 32 words.

    Reads:

    • Since your GPMC is 16-bit (2 bytes/word), a 16-word burst = 32 bytes.

    Writes:

    • Since the GPMC is 16-bit (2 bytes/word), an 8-word burst = 16 bytes.

    I feel that even if you increase the burst size, you will not see any improvement; this comes down to the following analysis points.

    For reads, the CPU always issues half a cache line, i.e. 32 bytes, and not the full cache line length of 64 bytes.

    For writes, it is 1/4 of the cache line.

    The CPU fetches from memory in cache-line-sized bursts (64 bytes on A-class cores, 32 bytes on R5F cores).

    They don’t always push out an entire 64-byte line in one go; instead they flush in smaller chunks.

    My initial suspicion is with the MMU region settings for the GPMC memory.

    How did you do the MMU region settings for the GPMC memory?

    I am looking into the second set of queries and will reply soon.

    Regards,

    Anil.

  • Hello Mario,

    If you want to increase the throughput, why does the GPMC run at a lower frequency of 33.33 MHz rather than the 133 MHz it supports?

    Regards,

    Anil.

  • Hi Anil,

    thanks for providing some insight into that matter!

    To step through your questions/points:

    If you want to increase the throughput, why does the GPMC run at a lower frequency of 33.33 MHz rather than the 133 MHz it supports?

    Those 33 MHz were just for the first experiments. In the meantime I'm operating the GPMC at 100 MHz. 133 MHz is not an option since the target configuration is 32 bit; there, the limit is 100 MHz according to the data sheet. Altogether, the goal is to get as much bandwidth as possible out of the data transmission between main memory and the GPMC device, hence approaching the theoretical limit of the GPMC interface. This is certainly only possible with burst lengths as long as possible. Because of the additional overhead during read operations (propagating the read address down the pipeline and propagating read data up the pipeline), this is especially critical.

    I'm also considering some sort of direct FIFO implementation where the address is inherently known within the GPMC device, so the address provided by the AM6442 can be ignored. With this method at least some cycles could be saved per transaction. However, it would be a plus if the normal memory semantics could be kept.

    Btw., at 100 MHz the read burst length seems to be cut down even further, from 32 bytes to 16 bytes per burst. So there is also some aspect regarding the relation of the individual clocks.

    I feel that, based on the above information, you are doing the testing on the A53 core and using Linux. Please confirm.

    Yes, exactly.

    I feel that even if you increase the burst size, you will not see any improvement; this comes down to the following analysis points.

    Yes, that matches my observation. As I wrote, I even tweaked the omap-gpmc driver, which normally limits ATTACHEDDEVICEPAGELENGTH to 16 rather than 32 words for whatever reason. But this has no effect.

    The CPU fetches from memory in cache-line-sized bursts (64 bytes on A-class cores, 32 bytes on R5F cores).

    They don’t always push out an entire 64-byte line in one go; instead they flush in smaller chunks.

    If this is the case, then there seems to be some other factor introducing cuts here - especially for reads, where the GPMC burst lengths break from 32 bytes @ 33 MHz down to 16 bytes @ 100 MHz.

    One thing we also need to consider here is whether the cache line size really matters in this case. Maybe it is more a question of some sort of write buffer and prefetch buffer, because the GPMC memory is certainly non-cacheable. However, treating it as cacheable in conjunction with explicit software-controlled cache coherency, such as cache flushes and cache invalidations done by software, might be a trick here.

    My initial suspicion is with the MMU region settings for the GPMC memory.

    How did you do the MMU region settings for the GPMC memory?

    I did not do any specific MMU region settings here. To summarize, what I'm doing is:

    • Setting up the GPMC through the Linux device tree
    • Ensuring that the standard omap-gpmc driver loads and the GPMC module is configured properly
    • In the application process, using mmap() to map the physical address of the corresponding GPMC ChipSelect window from /dev/mem
    • Then simply accessing the mapped memory - when moving data between the GPMC device and main memory, ideally using memcpy(), which is optimized for copying

    So all the MMU-related settings are more or less the standard settings made by Linux itself.
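
    In code, the mapping step from the third bullet looks roughly like this (a sketch; the base address and size are placeholders for the actual ChipSelect window):

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Placeholders: physical base and size of the configured GPMC
       ChipSelect window. */
    #define GPMC_WINDOW_PHYS 0x50000000UL
    #define GPMC_WINDOW_SIZE 0x20000UL

    static volatile uint16_t *map_gpmc_window(void)
    {
            int fd = open("/dev/mem", O_RDWR | O_SYNC);
            if (fd < 0)
                    return NULL;
            void *p = mmap(NULL, GPMC_WINDOW_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, GPMC_WINDOW_PHYS);
            return (p == MAP_FAILED) ? NULL : (volatile uint16_t *)p;
    }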

    What could be done additionally here, and how could it be done? For instance, declaring the GPMC regions cacheable?

    Anyway, I suspect that the better way to go here is to make use of DMA, i.e. BCDMA. However, there are a few questions:

    1. Would BCDMA then make use of the full GPMC burst length of 64 bytes, i.e. 32 words @ 16 bit or 16 words @ 32 bit?
    2. Is there existing infrastructure in Linux to make use of BCDMA? Can it be used at all, or is it already blocked by other functionality that makes use of it?
    3. Apart from the TRM, is there additional documentation on how to make use of it, specifically under Linux? This will certainly be a complex matter and also involves other issues such as ensuring contiguous regions in main memory, etc.

    Thanks and greetings,

    Mario

  • Hello Mario,

    Based on your test results and our discussion, it seems that bursts on the wire sometimes fragment at higher GPMC frequencies due to interconnect latency or WAIT timing.

    As a next step, I suggest configuring the DMA TR as follows:
    • ICNT0 = 64 bytes 
    • ICNT1 = Total_Bytes / 64 (number of bursts required to complete the transfer)

    This setup will allow the DMA to push data in aligned bursts. Even if the bursts on the wire still fragment at higher frequencies, DMA will generally provide significantly higher throughput compared to CPU-driven memcpy because it can pipeline transfers more efficiently.
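
    In other words, the intended shape of the transfer is roughly the following (hypothetical field names for illustration only; the actual TR descriptor layout is defined in the TRM):

    #include <stdint.h>

    /* Illustration only - not the real BCDMA TR descriptor layout. */
    struct tr_sketch {
            uint32_t icnt0; /* inner count: bytes per burst */
            uint32_t icnt1; /* outer count: number of bursts */
    };

    /* For a transfer of total_bytes (assumed to be a multiple of 64): */
    static struct tr_sketch make_tr(uint32_t total_bytes)
    {
            struct tr_sketch tr = {
                    .icnt0 = 64,               /* one aligned 64-byte burst */
                    .icnt1 = total_bytes / 64, /* bursts to cover the data */
            };
            return tr;
    }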

    Please note that I am not a Linux expert, so I am routing this query to our Linux team for additional guidance on:
    • How BCDMA can be integrated with GPMC under Linux
    • Whether there is an existing framework or driver support available
    • Any required changes to device tree or DMAEngine configuration

    Thanks for your patience while we loop in the right expertise.

    Regards,

    Anil.

  • Hello Anil,

    yes, essentially I agree that DMA promises to be better here - provided the transferred data blocks are large enough. Additionally, DMA leaves the CPU free for other tasks.

    Regarding your suggestion for the ICNT0/ICNT1 settings, it surely makes sense to force that alignment. But of course this can be applied to DMA only.

    Btw., I mentioned the alternative of accessing the external memory not in an addressable RAM fashion but in a FIFO fashion, which allows cutting off a few cycles per transaction. I just tried this out. Some improvement can indeed be achieved; however, the gains are marginal. I'm achieving 62 MiB/s when copying from the GPMC into main memory (still with a 16-bit-wide interface). The system creates bursts with a length of 8 words, i.e. 16 bytes. Such a burst transaction takes 10 clock cycles plus one turnaround cycle, so 110 ns at 100 MHz. The average transmission time for a 16-byte block, however, is around 250 ns or 25 clock cycles (based on those 62 MiB/s). So the GPMC interface spends more time waiting than transferring data, which is also clearly visible on the oscilloscope. This is kind of disappointing, but probably one can't expect much more under the given circumstances.

    What do you think about the option of treating the GPMC memory areas as cacheable regions in conjunction with software-controlled cache coherency? Might that improve something? Would that be possible in theory at all?

    As for BCDMA in Linux, thanks for forwarding this to your Linux experts. In the meantime, could you also direct me to some sort of tutorial or examples that use BCDMA in a bare-metal fashion (if anything like that is available)?

    Thanks and greetings,

    Mario

  • Hello Anil,

    just some other notes and ideas...

    It came to my mind that it is indeed possible to benchmark the GPMC performance at 32 bit, even though this is not available with the eval kit (TMDS64EVM and TMDS64GPEVM). Of course, no valid data transfer can happen, but one can read and write anyway. So I configured one window of the GPMC for 32-bit width. I had to extend the default omap-gpmc driver for Linux to do this, as it only supports 8 or 16 bit; however, that's not a big deal. As a result, the burst lengths are halved in terms of words, which was to be expected because we now have 32 bit instead of 16 bit. But the performance figures are disappointing. For a transfer from main memory to the GPMC device the bandwidth is around 167 MiB/s. For a transfer from the GPMC device to main memory using the aforementioned FIFO semantics the bandwidth is around 72 MiB/s. Thereby both read and write bursts require in total 7 clocks, i.e. 70 ns at 100 MHz, for such a 16-byte burst.

    Apart from the possibility of declaring the GPMC windows as cacheable areas and driving some software-controlled cache coherency, another point came to my mind: what about the CPU frequency?

    It seems that I'm unable to find any information on the web (or within the kernel boot messages) about what frequency the A53 cores are actually running at. It seems that dynamic frequency scaling is not supported right now; I'm missing the corresponding entries under /sys/devices/system/cpu/, and there is also a posting here stating that it is not supported (https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1097572/am6442-cpu-frequency-scaling-support). However, dynamic frequency scaling and setting a fixed frequency are different matters. I'm using the recent SDK (11.01.05.03) and never specified a frequency the processors should run at, nor did I find any reference to what frequency is set by default. One would probably assume 1 GHz here, but perhaps that's wrong.

    There is also the odd observation that the AM6442 (i.e. the IC) is always almost cold on the eval kit. This is not bad per se, as it indicates low power consumption. However, it might also indicate that it is running far below its specified 1 GHz clock by default. There are also provisions to add a heatsink to the eval kit, and I have seen pictures of the kit with a heatsink mounted, so the AM6442 does seem to get hot under other circumstances. Just increasing the CPU frequency might not be the smartest option, as it will create other trouble (power consumption, heat generation, etc.), but it might lead to an improvement.

    Greetings,

    Mario

  • After further studies I learned about the k3conf tool. k3conf dump processor gives me:

    |---------------------------------------------------------------------------------------------------|
    | VERSION INFO                                                                                      |
    |---------------------------------------------------------------------------------------------------|
    | K3CONF           | (version 0.3-nogit built Thu Jun 26 21:17:32 UTC 2025)                         |
    | SoC              | AM64x SR2.0                                                                    |
    | SoC identifiers  | [0x328cd4e4] 0x19466 Func-Safe Secure 'S' Grade -40°C to 105°C  ALV Package  |
    | SYSFW            | ABI: 4.0 (firmware version 0x000b '11.1.2--v11.01.02 (Fancy Rat))')            |
    | F/w Capabilities | 0x1: GEN                                                                       |
    |---------------------------------------------------------------------------------------------------|

    |-------------------------------------------------------------------------------------|
    | Device ID | Processor ID | Processor Name   | Processor State | Processor Frequency |
    |-------------------------------------------------------------------------------------|
    |   135     |      32      | A53SS0_CORE_0    | DEVICE_STATE_ON | 1000000000          |
    |   136     |      33      | A53SS0_CORE_1    | DEVICE_STATE_ON | 1000000000          |
    |     9     |      24      | MCU_M4FSS0_CORE0 | DEVICE_STATE_ON | 400000000           |
    |   121     |       1      | R5FSS0_CORE0     | DEVICE_STATE_ON | 800000000           |
    |   122     |       2      | R5FSS0_CORE1     | DEVICE_STATE_ON | 800000000           |
    |   123     |       6      | R5FSS1_CORE0     | DEVICE_STATE_ON | 800000000           |
    |   124     |       7      | R5FSS1_CORE1     | DEVICE_STATE_ON | 800000000           |
    |-------------------------------------------------------------------------------------|

    So the main cores do indeed seem to be running at 1 GHz already. I have not yet figured out where this frequency is set, i.e. in order to see what happens with a bit of overclocking. However, overclocking in the final application is not really an option anyway...

  • Hi Mario,

    I am in the middle of debugging another problem and didn't get a chance to review this entire thread today. But to answer the questions in your last response - yes, the k3conf tool can be used to check the A53 running clock. The AM64x doesn't support CPU frequency scaling, so cpufreq doesn't exist in Linux sysfs. And overclocking is definitely not supported or recommended. The A53 frequency is configured in U-Boot.

  • Hi Bin!

    Yes, overclocking is definitely not an option. It was just a thought, in order to see whether it changes anything in the activity on the GPMC. But even if it did improve something, it couldn't be used in practice anyway...

    To summarize the things that should be considered more seriously:

    • Is there an option to enable caching for the GPMC address windows in conjunction with software-controlled coherency? If yes, how could this be realized in Linux? I have some hope that this could give a significant performance boost, because it probably leads to a situation where complete cache lines are always transferred, and the GPMC module would then probably generate those 64-byte bursts it essentially seems capable of generating.
    • Is there a positive perspective on overcoming the existing bandwidth limitations by use of BCDMA, possibly with existing benchmark figures from practical experiments, not necessarily made under Linux? The current sustained performance figures to beat are 167 MiB/s for copying data from main memory to the GPMC and 72 MiB/s for copying data from the GPMC to main memory, for a 32-bit-wide GPMC interface running at 100 MHz.
    • Is there existing infrastructure to make use of BCDMA under Linux, and if yes, how can it be used? Can BCDMA be used by an application under Linux at all, or are its resources already occupied by vital drivers?
    • Is there exemplary documentation or a tutorial available that shows the application of BCDMA? Not necessarily under Linux, but in general.

    Thanks and greetings,

    Mario

  • Hi Mario,

    Is there a positive perspective on overcoming the existing bandwidth limitations by use of BCDMA, possibly with existing benchmark figures from practical experiments, not necessarily made under Linux? The current sustained performance figures to beat are 167 MiB/s for copying data from main memory to the GPMC and 72 MiB/s for copying data from the GPMC to main memory, for a 32-bit-wide GPMC interface running at 100 MHz.

    Do you need the DMA to write to GPMC periodically in a fixed interval?

    If so, it would require BCDMA transfers in cyclic mode, which is not implemented in the Linux kernel;

    If not, you could use the kernel dmaengine API dmaengine_prep_dma_memcpy() to program the BCDMA to do the transfer. Please refer to the kernel driver drivers/spi/spi-cadence-quadspi.c for the API usage.
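
    Roughly, the flow with that API looks like this (a simplified kernel-side sketch; channel request and completion handling are omitted):

    #include <linux/dmaengine.h>

    /* One-shot memcpy through the dmaengine API; "chan" is a DMA
       channel previously obtained via dma_request_chan(). */
    static int bcdma_copy(struct dma_chan *chan, dma_addr_t dst,
                          dma_addr_t src, size_t len)
    {
            struct dma_async_tx_descriptor *tx;
            dma_cookie_t cookie;

            tx = dmaengine_prep_dma_memcpy(chan, dst, src, len,
                                           DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
            if (!tx)
                    return -EIO;

            cookie = dmaengine_submit(tx);
            if (dma_submit_error(cookie))
                    return -EIO;

            dma_async_issue_pending(chan);
            /* In real code, wait for the completion callback here. */
            return 0;
    }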

  • Hello Bin,

    Do you need the DMA to write to GPMC periodically in a fixed interval?

    If so, it would require BCDMA transfers in cyclic mode, which is not implemented in the Linux kernel;

    I don't understand exactly what "periodically in a fixed interval" means, but I think I do not need this. The application would work in such a way that there is a regular check whether a new block of data is available on the GPMC device, or whether the GPMC device can take up a new block of data (depending on the direction). If yes, the block would be copied from the GPMC device to main memory, or vice versa. So each transfer is a single-shot transfer, so to speak.

    If not, you could use the kernel dmaengine API dmaengine_prep_dma_memcpy() to program the BCDMA to do the transfer. Please refer to the kernel driver drivers/spi/spi-cadence-quadspi.c for the API usage.

    Ok, this seems to be a good starting point. Thanks for the hint! I'll dig into it and see what can be done here.

    As for the option with these caching matters, do you have any idea about that - at least whether it could be worthwhile at all?

    Greetings,

    Mario

  • Hi Mario,

    I don't understand exactly what "periodically in a fixed interval" means,

    One example of such a use case is audio applications, in which the data transfer happens at a fixed short interval with a fixed data length. The DMA channel is configured in cyclic mode at the beginning of the transfer. Every time one data block movement is completed by the DMA, the DMA channel is ready for the next data block without software having to re-configure the channel. But it sounds like you don't need this use case.

    As for the option with these caching matters, you do not have an idea - at least whether this could be worthwhile at all?

    I just talked to our GPMC module expert today; the GPMC itself doesn't have a mechanism to cache data.

  • Hi Bin,

    One example of such a use case is audio applications, in which the data transfer happens at a fixed short interval with a fixed data length. The DMA channel is configured in cyclic mode at the beginning of the transfer. Every time one data block movement is completed by the DMA, the DMA channel is ready for the next data block without software having to re-configure the channel. But it sounds like you don't need this use case.

    I suspected something like that. So this is merely for real-time streaming applications. Still, the good thing there is that the channel does not need to be re-configured for each transfer. Depending on how heavy such a reconfiguration is, this can be critical as well...

    I just talked to our GPMC module expert today; the GPMC itself doesn't have a mechanism to cache data.

    That's for sure. But I was not referring to a cache within the GPMC, but to the regular CPU cache. I have already done some homework on that matter...

    First of all, the driver behind /dev/mem generally treats the mapped memory as uncached. So even when mapping regular memory via /dev/mem, there would be no caching. The relevant part can be found in arch/arm64/mm/mmu.c:

    pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
                                  unsigned long size, pgprot_t vma_prot)
    {
            if (!pfn_is_map_memory(pfn))
                    return pgprot_noncached(vma_prot);
            else if (file->f_flags & O_SYNC)
                    return pgprot_writecombine(vma_prot);
            return vma_prot;
    }

    There is this call to pgprot_noncached(), or, when O_SYNC is set for the file handle, to pgprot_writecombine().

    pgprot_writecombine() is said to provide at least write combining. In fact, I have used that unintentionally since the beginning. However, I don't see a change in write performance regardless of whether O_SYNC is set or not.

    Those pgprot mapping functions are defined in arch/arm64/include/asm/pgtable.h. Among others, this looks as follows:

    /*
     * Mark the prot value as uncacheable and unbufferable.
     */
    #define pgprot_noncached(prot) \
            __pgprot_modify(prot, PTE_ATTRINDX_MASK, PTE_ATTRINDX(MT_DEVICE_nGnRnE) | PTE_PXN | PTE_UXN)
    #define pgprot_writecombine(prot) \
            __pgprot_modify(prot, PTE_ATTRINDX_MASK, PTE_ATTRINDX(MT_NORMAL_NC) | PTE_PXN | PTE_UXN)
    #define pgprot_device(prot) \
            __pgprot_modify(prot, PTE_ATTRINDX_MASK, PTE_ATTRINDX(MT_DEVICE_nGnRE) | PTE_PXN | PTE_UXN)
    #define pgprot_tagged(prot) \
            __pgprot_modify(prot, PTE_ATTRINDX_MASK, PTE_ATTRINDX(MT_NORMAL_TAGGED))
    #define pgprot_mhp      pgprot_tagged
    /*
     * DMA allocations for non-coherent devices use what the Arm architecture calls
     * "Normal non-cacheable" memory, which permits speculation, unaligned accesses
     * and merging of writes.  This is different from "Device-nGnR[nE]" memory which
     * is intended for MMIO and thus forbids speculation, preserves access size,
     * requires strict alignment and can also force write responses to come from the
     * endpoint.
     */
    #define pgprot_dmacoherent(prot) \
            __pgprot_modify(prot, PTE_ATTRINDX_MASK, \
                            PTE_ATTRINDX(MT_NORMAL_NC) | PTE_PXN | PTE_UXN)

    So from my perspective, what is needed here is the use of pgprot_dmacoherent() rather than pgprot_noncached() or pgprot_writecombine(). Probably the simplest way to do that is to duplicate the /dev/mem driver into some sort of /dev/mem2 driver and just replace this pgprot call there.

    Next there is the question of software-controlled coherency. Here I found the instructions DC CISW (cache line clean and invalidate), DC CSW (cache line clean), and possibly also DC ZVA (cache zero by virtual address). The latter would be useful to allocate a cache line prior to writing into a GPMC window, and would hence avoid the corresponding cache line first being read from the GPMC, which would be nonsense. However, the documentation I have found so far regarding DC ZVA is somewhat unclear; it might be that it does not touch the cache at all. If that is the case, one would have to accept the useless initial cache line fill during a write and then hope that the cache line remains allocated within the cache for subsequent writes. A DC CSW would be issued after writing 64 bytes into the GPMC in order to flush the corresponding cache line, hopefully creating a 64-byte write burst on the GPMC. Any read from the GPMC should cause a 64-byte cache line fill, hopefully resulting in a 64-byte GPMC read burst as well. Immediately after processing a read block of 64 bytes (or at the latest prior to the next read from that very same address), a DC CISW would need to be executed in order to ensure that no outdated data is processed.
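
    To make this concrete: the by-VA variants of these operations (DC CVAC, DC CIVAC) are the forms a user-space process could reach at all (the set/way forms are EL1-only whole-cache operations). They could be wrapped like this (GCC inline assembly sketch; whether EL0 may execute them depends on SCTLR_EL1.UCI):

    /* Sketch: by-VA cache maintenance wrappers (AArch64, GCC inline asm).
       Each operates on the cache line containing the given address. */
    static inline void cache_clean_line(const volatile void *p)
    {
            /* Clean to point of coherency: push a dirty line out,
               hopefully as one 64-byte GPMC write burst. */
            asm volatile("dc cvac, %0" : : "r"(p) : "memory");
    }

    static inline void cache_clean_invalidate_line(const volatile void *p)
    {
            /* Clean and invalidate: drop the line so that the next
               read fetches fresh data from the GPMC window. */
            asm volatile("dc civac, %0" : : "r"(p) : "memory");
    }

    static inline void cache_sync(void)
    {
            /* Ensure the maintenance operations have completed. */
            asm volatile("dsb sy" : : : "memory");
    }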

    Can you follow these ideas?

    Thanks and regards,

    Mario  

  • Hi Mario,

    I did talk to one of our senior developers about the questions.

    However, I don't see a change in write performance regardless of whether O_SYNC is set or not.

    This O_SYNC is about data caching in the kernel block device driver; it is not about general caching in memory.

    So from my perspective, what is needed here is the use of pgprot_dmacoherent() rather than pgprot_noncached() or pgprot_writecombine(). Probably the simplest way to do that is to duplicate the /dev/mem driver into some sort of /dev/mem2 driver and just replace this pgprot call there.

    The kernel already has a driver, "dma_buf", which can map memory regions to user space with caching enabled, but we are just not sure whether it will work for I/O memory regions such as the 0x50000000 GPMC data memory window. You might want to give it a try.

    Please check the kernel devicetree k3-am62a7-sk-edgeai.dtso; it has the following node in the &reserved-memory node:

    edgeai_shared_region: edgeai_shared-memories {
            compatible = "dma-heap-carveout";
            reg = <0x00 0xa3000000 0x00 0x0ac00000>;
    };

    You can change the location to your GPMC region and its size in the "reg" property above; then the kernel dma-buf driver should create an entry for it under /dev/dma_heap/. You can try to open & map it in your application to see if it improves the performance.

  • Hi Bin,

    This O_SYNC is about data caching in the kernel block device driver; it is not about general caching in memory.

    Yes, I agree. After looking more closely at the code I quoted above: when mapping I/O address space (rather than regular memory), pgprot_noncached(vma_prot) is always called. However, this seems not to be relevant here anyway, since mere write combining would not be enough.

    The kernel already has a driver, "dma_buf", which can map memory regions to user space with caching enabled, but we are just not sure whether it will work for I/O memory regions such as the 0x50000000 GPMC data memory window. You might want to give it a try.

    Please check the kernel devicetree k3-am62a7-sk-edgeai.dtso; it has the following node in the &reserved-memory node:

    edgeai_shared_region: edgeai_shared-memories {
            compatible = "dma-heap-carveout";
            reg = <0x00 0xa3000000 0x00 0x0ac00000>;
    };

    You can change the location to your GPMC region and its size in the "reg" property above; then the kernel dma-buf driver should create an entry for it under /dev/dma_heap/. You can try to open & map it in your application to see if it improves the performance.

    Aha, that's interesting. I tried it out, but it fails at first glance. However, the problem seems to be of a more general nature, and I believe something else needs to be done.

    For instance, I added the following section to the &reserved-memory node of the device tree:

                    fpga: fpga-memory@53000000 {
                            compatible = "dma-heap-carveout";
                            reg = <0x00 0x53000000 0x00 0x00020000>;
                            //reg = <0x00 0xC0000000 0x00 0x00020000>;
                    };

    This sets up a window of 128 KiB starting at one of the configured GPMC windows. A corresponding entry /dev/dma_heap/carveout_fpga-memory@53000000 appears.

    The file can be opened, for instance, with:

    fd2 = open("/dev/dma_heap/carveout_fpga-memory@53000000", O_RDWR );

    But when it is mmapped, for instance via:

    fpga_space = (volatile int32_t *) mmap(NULL, FPGA_RAM_SPACE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd2, 0);

    mmap() always fails with the error ENODEV, i.e. "No such device".

    There might be some risk that Linux, i.e. dma_buf, inherently allows such a region only within the actual physical main memory - you also wrote that you are not sure whether this works for I/O regions such as the GPMC. However, in that case I would expect some error message in dmesg and a missing entry under /dev/dma_heap/. In order to check that, I experimentally placed the window into the middle of the actual DDR main memory; see the commented-out reg settings in the device tree configuration above. This fails in exactly the same way.

    So this makes me believe that something else needs to be done before such a region can be mmapped. Do you have any ideas?

    Thanks,

    Mario

  • A little update on that:

    It seems that mmap() must not be called with the file handle from the opened /dev/dma_heap/carveout... file. Instead, an ioctl DMA_HEAP_IOCTL_ALLOC has to be executed on that file handle, which then creates another file handle to be used here. I don't really understand the sense behind that process, since the operating system actually already has all the information needed, but who knows. I extended the code as follows:

    Additional includes:

    #include <sys/ioctl.h>
    #include <linux/dma-heap.h>

    Declaration of a structure used for the IOCTL:

    struct dma_heap_allocation_data dma_heap1_config;

    Filling in the data structure for later use: 

    dma_heap1_config.len = 0x2000;
    dma_heap1_config.fd = 0;
    dma_heap1_config.fd_flags = O_RDWR;
    dma_heap1_config.heap_flags = 0;

    The ioctl, handing over the file handle of the previously opened /dev/dma_heap/carveout file as well as the filled-in data structure:

    ioctl(fd2, DMA_HEAP_IOCTL_ALLOC, &dma_heap1_config);

    The call to mmap(), using the file handle the ioctl left in the data structure:

    fpga_space = (volatile int32_t *) mmap(NULL, FPGA_RAM_SPACE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, dma_heap1_config.fd, 0);

    Indeed, the ioctl returns a file handle that looks reasonable, and the "No such device" error is gone. However, the error now is "Invalid argument". I don't know which argument this applies to and what can be wrong now.
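
    For completeness, here is the whole sequence in one piece, with error checks added (a sketch; one thing I still need to rule out is a mismatch between the allocated len of 0x2000 and FPGA_RAM_SPACE_SIZE, since mmapping more than the allocated dma-buf size would also yield "Invalid argument"):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/dma-heap.h>

    #define FPGA_RAM_SPACE_SIZE 0x2000 /* must match the allocated len */

    static volatile int32_t *map_fpga_heap(void)
    {
            int fd2 = open("/dev/dma_heap/carveout_fpga-memory@53000000",
                           O_RDWR);
            if (fd2 < 0) {
                    perror("open");
                    return NULL;
            }

            struct dma_heap_allocation_data cfg = {
                    .len = FPGA_RAM_SPACE_SIZE,
                    .fd = 0,
                    .fd_flags = O_RDWR,
                    .heap_flags = 0,
            };
            if (ioctl(fd2, DMA_HEAP_IOCTL_ALLOC, &cfg) < 0) {
                    perror("DMA_HEAP_IOCTL_ALLOC");
                    return NULL;
            }

            /* mmap() the dma-buf fd returned by the ioctl, not fd2 */
            void *p = mmap(NULL, FPGA_RAM_SPACE_SIZE,
                           PROT_READ | PROT_WRITE, MAP_SHARED, cfg.fd, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return NULL;
            }
            return (volatile int32_t *)p;
    }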

    Some other thing... I came across this discussion: TDA4VM: [E-mirror][sdk8.5][tda4vm]convert physical address to virtual address

    To quote what was written there:

    mmap operation is not permitted in dma-heap-carveout region because it is a reserved memory region for DMA (Direct Memory Access) operations.

    This memory region is used by the kernel to allocate memory for DMA operations, which are used by devices to directly access the system’s memory without involving the CPU.

    The CONFIG_STRICT_DEVMEM kernel configuration option restricts access to /dev/mem files, which provide direct access to the physical memory of the system. When this option is enabled, only privileged users can access these files. Since dma-heap-carveout is a reserved memory region for DMA operations, it is not accessible through mmap operation when CONFIG_STRICT_DEVMEM is enabled.

    I'm not sure whether "privileged users" refers to somebody like the user "root" or to a kernel driver. In fact, I found CONFIG_STRICT_DEVMEM to be enabled for the default kernel in the SDK. As a test I disabled it by adding CONFIG_STRICT_DEVMEM=n to board-support/ti-linux-kernel-6.12.35+git-ti/arch/arm64/configs/defconfig and rebuilt the kernel. This did not change anything, however.

  • Hi Mario,

    I will be out of office from later tomorrow for 1.5 weeks and have a few critical things to wrap up before I leave, so I didn't have enough time today to review your full update. But:

    I'm not sure whether "privileged users" refers to somebody like the user "root" or to a kernel driver.

    Yes, the privileged users are root and equivalent.

    As a test I disabled it by adding CONFIG_STRICT_DEVMEM=n to board-support/ti-linux-kernel-6.12.35+git-ti/arch/arm64/configs/defconfig

    This is not the right way to disable a kernel config option. (You won't find any such "=n" entries in defconfig files.)

    Instead, you need to add

    # CONFIG_STRICT_DEVMEM is not set