AM6442: Burst size limitations of the GPMC interface

Part Number: AM6442
Other Parts Discussed in Thread: TMDS64EVM

Tool/software:

Hello,

I'm about to try transferring data via the GPMC interface with bursts as long as possible, with the goal of maximizing throughput.

Currently, the GPMC is operated in 16-bit mode with a moderate 33.33 MHz clock.

There is a GPMC parameter termed ATTACHEDDEVICEPAGELENGTH in the TRM (bits 24:23 of CONFIG1_i). In conjunction with the standard omap-gpmc driver and the Linux device tree it is called gpmc,burst-length, with a slightly different notation: gpmc,burst-length specifies the maximum burst length in words rather than the encoded value that is written into ATTACHEDDEVICEPAGELENGTH. When gpmc,burst-length is set to 16, for instance, ATTACHEDDEVICEPAGELENGTH turns out to be 0b10, which corresponds to 16 words. So this seems just fine.
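
For reference, this mapping can be sketched as a small helper (illustrative only; the function name is mine, not from the driver):

        /* Map a gpmc,burst-length value in words (4/8/16/32) to the
           two-bit ATTACHEDDEVICEPAGELENGTH encoding (0b00..0b11). */
        static int page_length_encoding(unsigned int burst_words)
        {
                switch (burst_words) {
                case 4:  return 0; /* 0b00 */
                case 8:  return 1; /* 0b01 */
                case 16: return 2; /* 0b10 */
                case 32: return 3; /* 0b11 */
                default: return -1; /* not a valid burst length */
                }
        }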

One odd thing I found right away is that the SDK (I'm using 11.00.09.04 at the moment) does not support the maximum burst length of 32 words, which should be applicable in 16-bit mode according to the TRM. For testing purposes I patched the omap-gpmc driver a little in order to be able to set ATTACHEDDEVICEPAGELENGTH to 0b11 through the device tree settings.

Now, the corresponding GPMC address window can be mmapped by a Linux process and then accessed like normal memory. This works, and I see bursts being used during reads and writes.

In order to move data between main memory and a GPMC window, it seems sensible to use the memcpy() that comes with the SDK. Although I did not check this in detail, I assume the memcpy() implementation is highly optimized for the processor.

For instance, this piece of code moves 64 bytes of data from a buffer buffer1 in main memory to a pointer gpmc_space representing memory within the GPMC area, and then back into main memory at buffer2:

        /* main memory -> GPMC window */
        memcpy( (void*) gpmc_space, (void*) buffer1, 64);
        /* GPMC window -> main memory */
        memcpy( (void*) buffer2, (void*) gpmc_space, 64);

It should be noted that gpmc_space, as I used it for testing, pointed at the very beginning of a GPMC window - so it was well aligned. buffer1 and buffer2 were ordinary 64-bit integer arrays declared within the C function. AFAIK the compiler aligns such arrays to their element type, so they should be aligned to 64 bits here.
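
If one does not want to rely on the compiler, the alignment can also be forced explicitly (a C11 sketch; in my test the buffers were plain function-local arrays):

        #include <stdalign.h>
        #include <stdint.h>

        /* Two 64-byte buffers, explicitly aligned to the 64-byte cache
           line size instead of the natural 8-byte uint64_t alignment. */
        alignas(64) uint64_t buffer1[8];
        alignas(64) uint64_t buffer2[8];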

As it turns out, only bursts with a length of 8 words appear at the GPMC interface for writes, while reads use bursts with a length of 16 words.

A burst length setting of 16 instead of 32 has no effect on the actual burst lengths.

So by using the processor, only write bursts of 8 words (16 bytes here) can be generated, while read bursts of 16 words, i.e. 32 bytes, can be created.

Is this normal behavior, and are there tweaks to optimize it? The GPMC is certainly not attached to the cache coherency mechanism of the processor, but at least the cache line size is 64 bytes. I don't know how this is handled within the processor, but I think there is a good chance that 64 bytes can be transferred at once here.

Thanks,

Mario

  • Hello Mario,

    Please allow me one or two days to get back to you regarding the above queries.

    FYI, TI India will be on holiday on 15th August 2025.

    Regards,

    Anil.

  • Hi Anil,

    thanks for coming back on that - so I know someone is taking care of it.

    I just wrote down a few additional thoughts about the situation. Here they are:

    The TRM (SPRUIM2H) states the following:

    • "32-bit interconnect target interface which supports non-wrapping and wrapping burst of up to 16x32 bits." ("Section 12.3.3.1.1 GPMC Features")

    Furthermore it ("Section 12.3.3.1.1 GPMC Features") states that "The GPMC supports the following various access types" (among others):

    • Asynchronous read page access (4-8-16-32 Word16, 4-8-16 Word32)
    • Synchronous read/write burst access with/without wrap capability (4-8-16-32 Word16, 4-8-16 Word32)

    So generally speaking, the GPMC should be able to generate burst sizes of up to 64 bytes.

    Of course, this does not necessarily mean that the processor can cause them.

    There is also a section "12.3.3.4.9.5 System Burst vs External Device Burst Support" in the TRM. It states that:

    The device system can issue the following requests to the GPMC:

    • Incrementing fixed-length bursts of two, four, and eight words

    Questions here: 

    • What exactly is referred to as the "device system"?
    • What is referred to as "words"?

    As for the definition of "word", one could assume 64 bits here, as the AM64x is a 64-bit device. 8 × 64 bit would also match those 64 bytes again. And the "device system" is probably not just the actual processor but other units as well - namely DMA engines.

    In section "12.3.3.4.5 GPMC Interconnect Port Interface" the TRM states the following things:

    • The GPMC interconnect interface is a pipelined interface including a 16 × 32-bit word write buffer. (Note: Here we have 64 bytes again)
    • The device system can issue eight incrementing 32-bit interconnect accesses (read/write) (Note: among others - 8 is the largest one)
    • Only power-of-two-length precise bursts 2 × 32, 4 × 32, 8 × 32, and 16 × 32, with the burst base address aligned on the total
      burst size, are supported

    Here is a discrepancy: on the one hand it is stated that the "device system" can issue up to 8 incrementing 32-bit accesses through the system, while on the other hand bursts of up to 16 × 32 are mentioned. There is no mention of the 16-bit case, but one might assume that the burst lengths are simply doubled there.

    As can be seen from the default omap-gpmc driver for Linux, configuring the GPMC to generate burst lengths of 64 bytes (meaning 32 × 16 bit) is not supported, although the GPMC can at least be set to allow this (which means that ATTACHEDDEVICEPAGELENGTH has to be set to 3). I guess this limitation exists because the typical devices attached to the GPMC only support burst lengths of up to 16 words at 16 bit anyway. In the meantime I have tweaked the omap-gpmc driver a little so that it can also be set to allow burst lengths of up to 32. However, this has no effect, in particular on read operations, where I'm still observing burst lengths of 16.

    The fact that the processor cannot be used to generate burst lengths of 32 is not necessarily a surprise to me - although there might be ways to accomplish that which I'm just not aware of.

    But what about DMA, in particular the "BCDMA"? Would BCDMA be able to have the GPMC generate burst lengths of 32 @ 16 bit or 16 @ 32 bit while transferring blocks of data between main memory and the GPMC, hence maximizing throughput?

    Btw., maybe a bit out of scope, but the "Device Overview" of the TRM, namely section "1.3.17 General Purpose Memory Controller (GPMC)", mentions neither the capability of the GPMC to operate at 32 bit nor a 32-word burst capability at 16 bit. However, I guess this is some sort of copy&paste error and the text comes from an older microcontroller.

    Greetings,

    Mario

  • In the meantime, are there any ideas regarding this matter?

    Thanks,

    Mario

  • Hello Mario,

    I feel that, based on the above information, you are doing the testing on the A53 core and using Linux. Please confirm.

    The ATTACHEDDEVICEPAGELENGTH field in the TRM defines the maximum page/burst length that the GPMC can use, up to 32 words.

    Reads:

    • Since your GPMC is 16-bit (2 bytes/word), a 16-word burst = 32 bytes.

    Writes:

    • Since the GPMC is 16-bit (2 bytes/word), an 8-word burst = 16 bytes.

    I feel that even if you increase the burst size, you will not see any improvement; this comes down to the following analysis points.

    For reads, the CPU always issues half a cache line, i.e. 32 bytes, and not the full cache line length of 64 bytes.

    For writes, it is 1/4 of the cache line.

    The CPU fetches from memory in cache-line-sized bursts (64 bytes on A-class cores, 32 bytes on R5F cores).

    They don’t always push out an entire 64-byte line in one go; instead they flush in smaller chunks.

    My initial suspicion is with the MMU region settings for the GPMC memory.

    How did you do the MMU region settings for the GPMC memory?

    I am looking into the second set of queries and will reply soon.

    Regards,

    Anil.

  • Hello Mario,

    If you want to increase the throughput, why does the GPMC run at a lower frequency of 33.33 MHz rather than the 133 MHz it supports?

    Regards,

    Anil.

  • Hi Anil,

    thanks for providing some insight into that matter!

    To step through your questions/points:

    If you want to increase the throughput, why does the GPMC run at a lower frequency of 33.33 MHz rather than the 133 MHz it supports?

    Those 33 MHz were just for the first experiments. In the meantime I'm operating the GPMC at 100 MHz. 133 MHz is not an option since the target configuration is 32 bit; there, the limit is 100 MHz according to the data sheet. Altogether, the goal is to get as much bandwidth as possible out of the data transmission between main memory and the GPMC device, hence approaching the theoretical limit of the GPMC interface. This is certainly only possible with burst lengths as long as possible. Because of the additional overhead during read operations (propagating the read address down the pipeline and propagating read data up the pipeline), this is especially critical.

    I'm also considering some sort of direct FIFO implementation where the address is inherently known within the GPMC device, so the address provided by the AM6442 can be ignored. With this method at least some cycles could be saved per transaction. However, it would be a plus if the normal memory semantics could be kept.

    Btw., at 100 MHz the read burst length seems to be cut down even further, from 32 bytes to 16 bytes per burst. So there is also some aspect regarding the relation of the individual clocks.

    I feel that, based on the above information, you are doing the testing on the A53 core and using Linux. Please confirm.

    Yes, exactly.

    I feel that even if you increase the burst size, you will not see any improvement; this comes down to the following analysis points.

    Yes, that matches my observation. As I wrote, I even tweaked the omap-gpmc driver, which normally limits ATTACHEDDEVICEPAGELENGTH to 16 rather than 32 words for whatever reason. But this has no effect.

    The CPU fetches from memory in cache-line-sized bursts (64 bytes on A-class cores, 32 bytes on R5F cores).

    They don’t always push out an entire 64-byte line in one go; instead they flush in smaller chunks.

    If this is the case, then there seems to be some other factor introducing cuts here - especially for reads, where the GPMC burst lengths break from 32 bytes @ 33 MHz down to 16 bytes @ 100 MHz.

    One thing we also need to consider here is whether the cache line size really matters in this case. Maybe it is more a question of some sort of write buffer and prefetch buffer, because the GPMC memory is certainly non-cacheable. However, treating it as cacheable in conjunction with explicit software-controlled cache coherency, such as cache flushes and cache invalidations done by software, might be a trick here.

    My initial suspicion is with the MMU region settings for the GPMC memory.

    How did you do the MMU region settings for the GPMC memory?

    I did not do any specific MMU region settings here. To summarize, what I'm doing is:

    • Setting up the GPMC through the Linux device tree
    • Ensuring that the standard omap-gpmc driver loads and the GPMC module is configured properly
    • In the application process, using mmap() to map the physical address of the corresponding GPMC ChipSelect window from /dev/mem
    • Then simply accessing the mapped memory - when moving data between the GPMC device and main memory, ideally using memcpy(), which is optimized for copying

    So all the MMU-related settings are more or less the standard settings made by Linux itself.
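
    In code, the mapping step from the third bullet looks roughly like this (a sketch; the base address and size are placeholders for the actual ChipSelect window):

    #include <fcntl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Placeholders: physical base and size of the configured GPMC
       ChipSelect window. */
    #define GPMC_WINDOW_PHYS 0x50000000UL
    #define GPMC_WINDOW_SIZE 0x20000UL

    static volatile uint16_t *map_gpmc_window(void)
    {
            int fd = open("/dev/mem", O_RDWR | O_SYNC);
            if (fd < 0)
                    return NULL;
            void *p = mmap(NULL, GPMC_WINDOW_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, GPMC_WINDOW_PHYS);
            return (p == MAP_FAILED) ? NULL : (volatile uint16_t *)p;
    }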

    What could be done additionally here, and how could it be done? For instance, declaring the GPMC regions cacheable?

    Anyway, I suspect that the better way to go here is to make use of DMA, i.e. BCDMA. However, there are a few questions:

    1. Would BCDMA then make use of the full GPMC burst length of 64 bytes, i.e. 32 words @ 16 bit or 16 words @ 32 bit?
    2. Is there existing infrastructure in Linux to make use of BCDMA? Can it be used at all, or is it already blocked by other functionality that makes use of it?
    3. Apart from the TRM, is there additional documentation on how to make use of it, specifically under Linux? This will certainly be a complex matter and also involves other issues such as ensuring contiguous regions in main memory, etc.

    Thanks and greetings,

    Mario

  • Hello Mario,

    Based on your test results and our discussion, it seems that bursts on the wire sometimes fragment at higher GPMC frequencies due to interconnect latency or WAIT timing.

    As a next step, I suggest configuring the DMA TR as follows:
    • ICNT0 = 64 bytes 
    • ICNT1 = Total_Bytes / 64 (number of bursts required to complete the transfer)

    This setup will allow the DMA to push data in aligned bursts. Even if the bursts on the wire still fragment at higher frequencies, DMA will generally provide significantly higher throughput compared to CPU-driven memcpy because it can pipeline transfers more efficiently.
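
    In other words, the intended shape of the transfer is roughly the following (hypothetical field names for illustration only; the actual TR descriptor layout is defined in the TRM):

    #include <stdint.h>

    /* Illustration only - not the real BCDMA TR descriptor layout. */
    struct tr_sketch {
            uint32_t icnt0; /* inner count: bytes per burst */
            uint32_t icnt1; /* outer count: number of bursts */
    };

    /* For a transfer of total_bytes (assumed to be a multiple of 64): */
    static struct tr_sketch make_tr(uint32_t total_bytes)
    {
            struct tr_sketch tr = {
                    .icnt0 = 64,               /* one aligned 64-byte burst */
                    .icnt1 = total_bytes / 64, /* bursts to cover the data */
            };
            return tr;
    }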

    Please note that I am not a Linux expert, so I am routing this query to our Linux team for additional guidance on:
    • How BCDMA can be integrated with GPMC under Linux
    • Whether there is an existing framework or driver support available
    • Any required changes to device tree or DMAEngine configuration

    Thanks for your patience while we loop in the right expertise.

    Regards,

    Anil.

  • Hello Anil,

    yes, essentially I agree that DMA promises to be better here - provided the transferred data blocks are large enough. Additionally, DMA leaves the CPU free for other tasks.

    Regarding your suggestion for the ICNT0/ICNT1 settings, it surely makes sense to force that alignment. But of course this can be applied to DMA only.

    Btw., I mentioned the alternative of accessing the external memory not in an addressable RAM fashion but in a FIFO fashion, which allows cutting off a few cycles per transaction. I just tried this out. Some improvement can indeed be achieved; however, the gains are marginal. I'm achieving 62 MiB/s when copying from the GPMC into main memory (still with a 16-bit-wide interface). The system creates bursts with a length of 8 words, i.e. 16 bytes. Such a burst transaction takes 10 clock cycles plus one turnaround cycle, so 110 ns at 100 MHz. The average transmission time for a 16-byte block, however, is around 250 ns or 25 clock cycles (based on those 62 MiB/s). So the GPMC interface spends more time waiting than transferring data, which is also clearly visible on the oscilloscope. This is kind of disappointing, but probably one can't expect much more under the given circumstances.

    What do you think about the option of treating the GPMC memory areas as cacheable regions in conjunction with software-controlled cache coherency? Might that improve something? Would that be possible in theory at all?

    As for BCDMA in Linux, thanks for forwarding this to your Linux experts. In the meantime, could you also direct me to some sort of tutorial or examples that use BCDMA in a bare-metal fashion (if anything like that is available)?

    Thanks and greetings,

    Mario

  • Hello Anil,

    just some other notes and ideas...

    It came to my mind that it is indeed possible to benchmark the GPMC performance at 32 bit, even though this is not available with the eval kit (TMDS64EVM and TMDS64GPEVM). Of course, no valid data transfer can happen, but one can read and write anyway. So I configured one window of the GPMC for 32-bit width. I had to extend the default omap-gpmc driver for Linux to do this, as it only supports 8 or 16 bit; however, that's not a big deal. As a result, the burst lengths are halved in terms of words, which was to be expected because we now have 32 bit instead of 16 bit. But the performance figures are disappointing. For a transfer from main memory to the GPMC device the bandwidth is around 167 MiB/s. For a transfer from the GPMC device to main memory using the aforementioned FIFO semantics the bandwidth is around 72 MiB/s. Thereby both read and write bursts require in total 7 clocks, i.e. 70 ns at 100 MHz, for such a 16-byte burst.

    Apart from the possibility of declaring the GPMC windows as cacheable areas and driving some software-controlled cache coherency, another point came to my mind: what about the CPU frequency?

    It seems that I'm unable to find any information on the web (or within the kernel boot messages) about what frequency the A53 cores are actually running at. It seems that dynamic frequency scaling is not supported right now; I'm missing the corresponding entries under /sys/devices/system/cpu/, and there is also a posting here stating that it is not supported (https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1097572/am6442-cpu-frequency-scaling-support). However, dynamic frequency scaling and setting a fixed frequency are different matters. I'm using the recent SDK (11.01.05.03) and never specified a frequency the processors should run at, nor did I find any reference to what frequency is set by default. One would probably assume 1 GHz here, but perhaps that's wrong.

    There is also the odd observation that the AM6442 (i.e. the IC) is always almost cold on the eval kit. This is not bad per se, as it indicates low power consumption. However, it might also indicate that it is running far below its specified 1 GHz clock by default. There are also provisions to add a heatsink to the eval kit, and I have seen pictures of the kit with a heatsink mounted, so the AM6442 does seem to get hot under other circumstances. Just increasing the CPU frequency might not be the smartest option, as it will create other trouble (power consumption, heat generation, etc.), but it might lead to an improvement.

    Greetings,

    Mario

  • After further studies I learned about the k3conf tool. k3conf dump processor gives me:

    |---------------------------------------------------------------------------------------------------|
    | VERSION INFO                                                                                      |
    |---------------------------------------------------------------------------------------------------|
    | K3CONF           | (version 0.3-nogit built Thu Jun 26 21:17:32 UTC 2025)                         |
    | SoC              | AM64x SR2.0                                                                    |
    | SoC identifiers  | [0x328cd4e4] 0x19466 Func-Safe Secure 'S' Grade -40°C to 105°C  ALV Package  |
    | SYSFW            | ABI: 4.0 (firmware version 0x000b '11.1.2--v11.01.02 (Fancy Rat))')            |
    | F/w Capabilities | 0x1: GEN                                                                       |
    |---------------------------------------------------------------------------------------------------|

    |-------------------------------------------------------------------------------------|
    | Device ID | Processor ID | Processor Name   | Processor State | Processor Frequency |
    |-------------------------------------------------------------------------------------|
    |   135     |      32      | A53SS0_CORE_0    | DEVICE_STATE_ON | 1000000000          |
    |   136     |      33      | A53SS0_CORE_1    | DEVICE_STATE_ON | 1000000000          |
    |     9     |      24      | MCU_M4FSS0_CORE0 | DEVICE_STATE_ON | 400000000           |
    |   121     |       1      | R5FSS0_CORE0     | DEVICE_STATE_ON | 800000000           |
    |   122     |       2      | R5FSS0_CORE1     | DEVICE_STATE_ON | 800000000           |
    |   123     |       6      | R5FSS1_CORE0     | DEVICE_STATE_ON | 800000000           |
    |   124     |       7      | R5FSS1_CORE1     | DEVICE_STATE_ON | 800000000           |
    |-------------------------------------------------------------------------------------|

    So the main cores do indeed seem to be running at 1 GHz already. I have not yet figured out where this frequency is set, i.e. in order to see what happens with a bit of overclocking. However, overclocking in the final application is not really an option anyway...

  • Hi Mario,

    I am in the middle of debugging another problem and didn't get a chance to review this entire thread today. But to answer the questions in your last response - yes, the k3conf tool can be used to check the A53 running clock. The AM64x doesn't support CPU frequency scaling, so cpufreq doesn't exist in Linux sysfs. And overclocking is definitely not supported or recommended. The A53 frequency is configured in U-Boot.

  • Hi Bin!

    Yes, overclocking is definitely not an option. It was just a thought, in order to see whether it changes anything in the activity on the GPMC. But even if it did improve something, it couldn't be used in practice anyway...

    To summarize the things that should be considered more seriously:

    • Is there an option to enable caching for the GPMC address windows in conjunction with software-controlled coherency? If yes, how could this be realized in Linux? I have some hope that this could give a significant performance boost, because it probably leads to a situation where complete cache lines are always transferred, and the GPMC module would then probably generate those 64-byte bursts it essentially seems capable of generating.
    • Is there a positive perspective on overcoming the existing bandwidth limitations by use of BCDMA, possibly with existing benchmark figures from practical experiments, not necessarily made under Linux? The current sustained performance figures to beat are 167 MiB/s for copying data from main memory to the GPMC and 72 MiB/s for copying data from the GPMC to main memory, for a 32-bit-wide GPMC interface running at 100 MHz.
    • Is there existing infrastructure to make use of BCDMA under Linux, and if yes, how can it be used? Can BCDMA be used by an application under Linux at all, or are its resources already occupied by vital drivers?
    • Is there exemplary documentation or a tutorial available that shows the application of BCDMA? Not necessarily under Linux, but in general.

    Thanks and greetings,

    Mario

  • Hi Mario,

    Is there a positive perspective on overcoming the existing bandwidth limitations by use of BCDMA, possibly with existing benchmark figures from practical experiments, not necessarily made under Linux? The current sustained performance figures to beat are 167 MiB/s for copying data from main memory to the GPMC and 72 MiB/s for copying data from the GPMC to main memory, for a 32-bit-wide GPMC interface running at 100 MHz.

    Do you need the DMA to write to GPMC periodically in a fixed interval?

    If so, it would require BCDMA transfers in cyclic mode, which is not implemented in the Linux kernel;

    If not, you could use the kernel dmaengine API dmaengine_prep_dma_memcpy() to program the BCDMA to do the transfer. Please refer to the kernel driver drivers/spi/spi-cadence-quadspi.c for the API usage.
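
    Roughly, the flow with that API looks like this (a simplified kernel-side sketch; channel request and completion handling are omitted):

    #include <linux/dmaengine.h>

    /* One-shot memcpy through the dmaengine API; "chan" is a DMA
       channel previously obtained via dma_request_chan(). */
    static int bcdma_copy(struct dma_chan *chan, dma_addr_t dst,
                          dma_addr_t src, size_t len)
    {
            struct dma_async_tx_descriptor *tx;
            dma_cookie_t cookie;

            tx = dmaengine_prep_dma_memcpy(chan, dst, src, len,
                                           DMA_CTRL_ACK | DMA_PREP_INTERRUPT);
            if (!tx)
                    return -EIO;

            cookie = dmaengine_submit(tx);
            if (dma_submit_error(cookie))
                    return -EIO;

            dma_async_issue_pending(chan);
            /* In real code, wait for the completion callback here. */
            return 0;
    }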

  • Hello Bin,

    Do you need the DMA to write to GPMC periodically in a fixed interval?

    If so, it would require BCDMA transfers in cyclic mode, which is not implemented in the Linux kernel;

    I don't understand exactly what "periodically in a fixed interval" means, but I think I do not need this. The application would work in such a way that there is a regular check whether a new block of data is available on the GPMC device, or whether the GPMC device can take up a new block of data (depending on the direction). If yes, the block would be copied from the GPMC device to main memory, or vice versa. So each transfer is a single-shot transfer, so to speak.

    If not, you could use the kernel dmaengine API dmaengine_prep_dma_memcpy() to program the BCDMA to do the transfer. Please refer to the kernel driver drivers/spi/spi-cadence-quadspi.c for the API usage.

    Ok, this seems to be a good starting point. Thanks for the hint! I'll dig into it and see what can be done here.

    As for the option with these caching matters, do you have any idea about that - at least whether it could be worthwhile at all?

    Greetings,

    Mario

  • Hi Mario,

    I don't understand exactly what "periodically in a fixed interval" means,

    One example of such a use case is audio applications, in which the data transfer happens at a fixed short interval with a fixed data length. The DMA channel is configured in cyclic mode at the beginning of the transfer. Every time one data block movement is completed by the DMA, the DMA channel is ready for the next data block without software having to re-configure the channel. But it sounds like you don't need this use case.

    As for the option with these caching matters, you do not have an idea - at least whether this could be worthwhile at all?

    I just talked to our GPMC module expert today; the GPMC itself doesn't have a mechanism to cache data.

  • Hi Bin,

    One example of such a use case is audio applications, in which the data transfer happens at a fixed short interval with a fixed data length. The DMA channel is configured in cyclic mode at the beginning of the transfer. Every time one data block movement is completed by the DMA, the DMA channel is ready for the next data block without software having to re-configure the channel. But it sounds like you don't need this use case.

    I suspected something like that. So this is merely for real-time streaming applications. Still, the good thing there is that the channel does not need to be re-configured for each transfer. Depending on how heavy such a reconfiguration is, this can be critical as well...

    I just talked to our GPMC module expert today; the GPMC itself doesn't have a mechanism to cache data.

    That's for sure. But I was not referring to a cache within the GPMC, but to the regular CPU cache. I have already done some homework on that matter...

    First of all, the driver behind /dev/mem generally treats the mapped memory as uncached. So even when mapping regular memory via /dev/mem, there would be no caching. The relevant part can be found in arch/arm64/mm/mmu.c:

    pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
                                  unsigned long size, pgprot_t vma_prot)
    {
            if (!pfn_is_map_memory(pfn))
                    return pgprot_noncached(vma_prot);
            else if (file->f_flags & O_SYNC)
                    return pgprot_writecombine(vma_prot);
            return vma_prot;
    }

    There is this call to pgprot_noncached(), or, when O_SYNC is set for the file handle, to pgprot_writecombine().

    pgprot_writecombine() is said to provide at least write combining. In fact, I have used that unintentionally since the beginning. However, I don't see a change in write performance regardless of whether O_SYNC is set or not.

    Those pgprot mapping functions are defined in arch/arm64/include/asm/pgtable.h. Among others, this looks as follows:

    /*
     * Mark the prot value as uncacheable and unbufferable.
     */
    #define pgprot_noncached(prot) \
            __pgprot_modify(prot, PTE_ATTRINDX_MASK, PTE_ATTRINDX(MT_DEVICE_nGnRnE) | PTE_PXN | PTE_UXN)
    #define pgprot_writecombine(prot) \
            __pgprot_modify(prot, PTE_ATTRINDX_MASK, PTE_ATTRINDX(MT_NORMAL_NC) | PTE_PXN | PTE_UXN)
    #define pgprot_device(prot) \
            __pgprot_modify(prot, PTE_ATTRINDX_MASK, PTE_ATTRINDX(MT_DEVICE_nGnRE) | PTE_PXN | PTE_UXN)
    #define pgprot_tagged(prot) \
            __pgprot_modify(prot, PTE_ATTRINDX_MASK, PTE_ATTRINDX(MT_NORMAL_TAGGED))
    #define pgprot_mhp      pgprot_tagged
    /*
     * DMA allocations for non-coherent devices use what the Arm architecture calls
     * "Normal non-cacheable" memory, which permits speculation, unaligned accesses
     * and merging of writes.  This is different from "Device-nGnR[nE]" memory which
     * is intended for MMIO and thus forbids speculation, preserves access size,
     * requires strict alignment and can also force write responses to come from the
     * endpoint.
     */
    #define pgprot_dmacoherent(prot) \
            __pgprot_modify(prot, PTE_ATTRINDX_MASK, \
                            PTE_ATTRINDX(MT_NORMAL_NC) | PTE_PXN | PTE_UXN)

    So from my perspective, what is needed here is the use of pgprot_dmacoherent() rather than pgprot_noncached() or pgprot_writecombine(). Probably the simplest way to do that is to duplicate the /dev/mem driver into some sort of /dev/mem2 driver and just replace this pgprot call there.

    Next there is the question of software-controlled coherency. Here I found the instructions DC CISW (cache line clean and invalidate), DC CSW (cache line clean), and possibly also DC ZVA (cache zero by virtual address). The latter would be useful to allocate a cache line prior to writing into a GPMC window, and would hence avoid the corresponding cache line first being read from the GPMC, which would be nonsense. However, the documentation I have found so far regarding DC ZVA is somewhat unclear; it might be that it does not touch the cache at all. If that is the case, one would have to accept the useless initial cache line fill during a write and then hope that the cache line remains allocated within the cache for subsequent writes. A DC CSW would be issued after writing 64 bytes into the GPMC in order to flush the corresponding cache line, hopefully creating a 64-byte write burst on the GPMC. Any read from the GPMC should cause a 64-byte cache line fill, hopefully resulting in a 64-byte GPMC read burst as well. Immediately after processing a read block of 64 bytes (or at the latest prior to the next read from that very same address), a DC CISW would need to be executed in order to ensure that no outdated data is processed.
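
    To make this concrete: the by-VA variants of these operations (DC CVAC, DC CIVAC) are the forms a user-space process could reach at all (the set/way forms are EL1-only whole-cache operations). They could be wrapped like this (GCC inline assembly sketch; whether EL0 may execute them depends on SCTLR_EL1.UCI):

    /* Sketch: by-VA cache maintenance wrappers (AArch64, GCC inline asm).
       Each operates on the cache line containing the given address. */
    static inline void cache_clean_line(const volatile void *p)
    {
            /* Clean to point of coherency: push a dirty line out,
               hopefully as one 64-byte GPMC write burst. */
            asm volatile("dc cvac, %0" : : "r"(p) : "memory");
    }

    static inline void cache_clean_invalidate_line(const volatile void *p)
    {
            /* Clean and invalidate: drop the line so that the next
               read fetches fresh data from the GPMC window. */
            asm volatile("dc civac, %0" : : "r"(p) : "memory");
    }

    static inline void cache_sync(void)
    {
            /* Ensure the maintenance operations have completed. */
            asm volatile("dsb sy" : : : "memory");
    }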

    Can you follow these ideas?

    Thanks and regards,

    Mario  

  • Hi Mario,

    I did talk to one of our senior developers about the questions.

    However, I don't see a change in write performance regardless of whether O_SYNC is set or not.

    This O_SYNC is about data caching in the kernel block device driver; it is not about general caching in memory.

    So from my perspective, what is needed here is the use of pgprot_dmacoherent() rather than pgprot_noncached() or pgprot_writecombine(). Probably the simplest way to do that is to duplicate the /dev/mem driver into some sort of /dev/mem2 driver and just replace this pgprot call there.

    The kernel already has a driver, "dma_buf", which can map memory regions to user space with caching enabled, but we are just not sure whether it will work for I/O memory regions such as the 0x50000000 GPMC data memory window. You might want to give it a try.

    Please check the kernel devicetree k3-am62a7-sk-edgeai.dtso; it has the following node in the &reserved-memory node:

    edgeai_shared_region: edgeai_shared-memories {
            compatible = "dma-heap-carveout";
            reg = <0x00 0xa3000000 0x00 0x0ac00000>;
    };

    You can change the location to your GPMC region and its size in the "reg" property above; then the kernel dma-buf driver should create an entry for it under /dev/dma_heap/. You can try to open & map it in your application to see if it improves the performance.

  • Hi Bin,

    This O_SYNC is about data caching in the kernel block device driver; it is not about general caching in memory.

    Yes, I agree. After looking more closely at the code I quoted above: when mapping I/O address space (rather than regular memory), pgprot_noncached(vma_prot) is always called. However, this seems not to be relevant here anyway, since mere write combining would not be enough.

    The kernel already has a driver, "dma_buf", which can map memory regions to user space with caching enabled, but we are just not sure whether it will work for I/O memory regions such as the 0x50000000 GPMC data memory window. You might want to give it a try.

    Please check the kernel devicetree k3-am62a7-sk-edgeai.dtso; it has the following node in the &reserved-memory node:

    edgeai_shared_region: edgeai_shared-memories {
            compatible = "dma-heap-carveout";
            reg = <0x00 0xa3000000 0x00 0x0ac00000>;
    };

    You can change the location to your GPMC region and its size in the "reg" property above; then the kernel dma-buf driver should create an entry for it under /dev/dma_heap/. You can try to open & map it in your application to see if it improves the performance.

    Aha, that's interesting. I tried it out, but it fails at first glance. However, the problem seems to be of a more general nature, and I believe something else needs to be done.

    For instance, I added the following section to the &reserved-memory node of the device tree:

                    fpga: fpga-memory@53000000 {
                            compatible = "dma-heap-carveout";
                            reg = <0x00 0x53000000 0x00 0x00020000>;
                            //reg = <0x00 0xC0000000 0x00 0x00020000>;
                    };

    This sets up a window of 128 KiB starting at one of the configured GPMC windows. A corresponding entry /dev/dma_heap/carveout_fpga-memory@53000000 appears.

    The file can be opened, for instance, with:

    fd2 = open("/dev/dma_heap/carveout_fpga-memory@53000000", O_RDWR );

    But when it is mmapped, for instance via:

    fpga_space = (volatile int32_t *) mmap(NULL, FPGA_RAM_SPACE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd2, 0);

    mmap() always fails with the error ENODEV, i.e. "No such device".

    There might be some risk that Linux, i.e. dma_buf, inherently allows such a region only within the actual physical main memory - you also wrote that you are not sure whether this works for I/O regions such as the GPMC. However, in that case I would expect some error message in dmesg and a missing entry under /dev/dma_heap/. In order to check that, I experimentally placed the window into the middle of the actual DDR main memory; see the commented-out reg settings in the device tree configuration above. This fails in exactly the same way.

    So this makes me believe that something else needs to be done before such a region can be mmapped. Do you have any ideas?

    Thanks,

    Mario

  • A little update on that:

    It seems that mmap() must not be called with the file handle from the opened /dev/dma_heap/carveout... file. Instead, an ioctl DMA_HEAP_IOCTL_ALLOC has to be executed on that file handle, which then creates another file handle to be used here. I don't really understand the sense behind that process, since the operating system actually already has all the information needed, but who knows. I extended the code as follows:

    Additional includes:

    #include <sys/ioctl.h>
    #include <linux/dma-heap.h>

    Declaration of a structure used for the IOCTL:

    struct dma_heap_allocation_data dma_heap1_config;

    Filling in the data structure for later use: 

    dma_heap1_config.len = 0x2000;
    dma_heap1_config.fd = 0;
    dma_heap1_config.fd_flags = O_RDWR;
    dma_heap1_config.heap_flags = 0;

    The ioctl, handing over the file handle of the previously opened /dev/dma_heap/carveout file as well as the filled-in data structure:

    ioctl(fd2, DMA_HEAP_IOCTL_ALLOC, &dma_heap1_config);

    The call to mmap(), using the file handle the ioctl left in the data structure:

    fpga_space = (volatile int32_t *) mmap(NULL, FPGA_RAM_SPACE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, dma_heap1_config.fd, 0);

    Indeed, the ioctl returns a file handle that looks reasonable, and the "No such device" error is gone. However, the error now is "Invalid argument". I don't know which argument this applies to and what can be wrong now.
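
    For completeness, here is the whole sequence in one piece, with error checks added (a sketch; one thing I still need to rule out is a mismatch between the allocated len of 0x2000 and FPGA_RAM_SPACE_SIZE, since mmapping more than the allocated dma-buf size would also yield "Invalid argument"):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/dma-heap.h>

    #define FPGA_RAM_SPACE_SIZE 0x2000 /* must match the allocated len */

    static volatile int32_t *map_fpga_heap(void)
    {
            int fd2 = open("/dev/dma_heap/carveout_fpga-memory@53000000",
                           O_RDWR);
            if (fd2 < 0) {
                    perror("open");
                    return NULL;
            }

            struct dma_heap_allocation_data cfg = {
                    .len = FPGA_RAM_SPACE_SIZE,
                    .fd = 0,
                    .fd_flags = O_RDWR,
                    .heap_flags = 0,
            };
            if (ioctl(fd2, DMA_HEAP_IOCTL_ALLOC, &cfg) < 0) {
                    perror("DMA_HEAP_IOCTL_ALLOC");
                    return NULL;
            }

            /* mmap() the dma-buf fd returned by the ioctl, not fd2 */
            void *p = mmap(NULL, FPGA_RAM_SPACE_SIZE,
                           PROT_READ | PROT_WRITE, MAP_SHARED, cfg.fd, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return NULL;
            }
            return (volatile int32_t *)p;
    }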

    Some other thing... I came across this discussion: TDA4VM: [E-mirror][sdk8.5][tda4vm]convert physical address to virtual address

    To quote what was written there:

    mmap operation is not permitted in dma-heap-carveout region because it is a reserved memory region for DMA (Direct Memory Access) operations.

    This memory region is used by the kernel to allocate memory for DMA operations, which are used by devices to directly access the system’s memory without involving the CPU.

    The CONFIG_STRICT_DEVMEM kernel configuration option restricts access to /dev/mem files, which provide direct access to the physical memory of the system. When this option is enabled, only privileged users can access these files. Since dma-heap-carveout is a reserved memory region for DMA operations, it is not accessible through mmap operation when CONFIG_STRICT_DEVMEM is enabled.

    I'm not sure whether "privileged users" refers to somebody like the user "root" or to a kernel driver. In fact, I found CONFIG_STRICT_DEVMEM to be enabled for the default kernel in the SDK. As a test I disabled it by adding CONFIG_STRICT_DEVMEM=n to board-support/ti-linux-kernel-6.12.35+git-ti/arch/arm64/configs/defconfig and rebuilt the kernel. This did not change anything, however.

  • Hi Mario,

    I will be out of office from later tomorrow for 1.5 weeks and have a few critical things to wrap up before I leave, so I didn't have enough time today to review your full update. But:

    I'm not sure whether "privileged users" refers to somebody like the user "root" or to a kernel driver.

    Yes, the privileged users are root and equivalent.

    As a test I disabled it by adding CONFIG_STRICT_DEVMEM=n to board-support/ti-linux-kernel-6.12.35+git-ti/arch/arm64/configs/defconfig

    This is not the right way to disable a kernel config option. (You won't find any such "=n" entries in defconfig files.)

    Instead, you need to add

    # CONFIG_STRICT_DEVMEM is not set