AM6548: Cache coherence questions

Part Number: AM6548

I'm trying to understand the behaviour of our TI-RTOS (Processor SDK RTOS v06.01) application running on a single A53 core.

Background: By mistake our application called CacheP_Inv() on a buffer in DDR memory that had just been written by the CPU. The CPU then reads the same buffer and copies it to PRU memory. I would expect this sequence to go wrong almost all of the time, since invalidating the cache for these addresses should lead to old, stale data being read. On the R5F our application subsequently fails (force write-through is not set). On the A53, on the other hand, the application reliably works fine, which is what I don't understand.
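
For reference, the sequence is roughly the sketch below; the buffer, helper and PRU pointer names are placeholders, and it assumes the CacheP_Inv() prototype from the PDK OSAL (ti/osal/CacheP.h).

    #include <stdint.h>
    #include <string.h>
    #include <ti/osal/CacheP.h>

    extern void fill_buffer(uint8_t *buf, size_t len);   /* placeholder: CPU writes the data */
    extern volatile uint8_t *pru_data_ram;               /* placeholder: mapped PRU data RAM */

    static uint8_t ddr_buf[256];                         /* buffer in cacheable DDR */

    void send_to_pru(void)
    {
        fill_buffer(ddr_buf, sizeof(ddr_buf));           /* CPU writes the buffer            */
        CacheP_Inv(ddr_buf, (int32_t)sizeof(ddr_buf));   /* the mistaken invalidate          */
        memcpy((void *)pru_data_ram, ddr_buf,            /* CPU reads it back and copies it  */
               sizeof(ddr_buf));
    }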

The buffer's memory region is mapped with attribute index 7, which should mean normal, non-transient, inner and outer write-back cacheable memory. I verified the page tables and the result of the address translation and the memory is indeed mapped correctly.
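
(As a quick sanity check I also read back MAIR_EL1 and looked at the byte for attribute index 7. This is only a sketch: it assumes the code runs at EL1, and 0xFF is the usual encoding for normal, inner/outer write-back, non-transient, read/write-allocate memory.)

    #include <stdint.h>

    /* Sketch: extract the MAIR_EL1 attribute byte for a given attr index. */
    static uint8_t mair_attr(unsigned int idx)
    {
        uint64_t mair;
        __asm__ volatile("mrs %0, mair_el1" : "=r"(mair));
        return (uint8_t)(mair >> (idx * 8u));
    }

    /* e.g. mair_attr(7) -- expect 0xFF (or another normal write-back encoding) */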

I've looked at the TI-RTOS implementation of CacheP_Inv(), and that function calls Cache_inv() with Cache_Type_ALL. Cache_inv is implemented in bios.../family/arm/v8a/Cache.c and calls Cache_invL1p and Cache_invL1d, which are assembler functions that use "dc ivac" and "ic ivau".
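
For reference, the core of the data-cache invalidate looks roughly like the sketch below (my own reconstruction of an invalidate-by-VA loop, not the BIOS source verbatim):

    #include <stdint.h>

    #define A53_DCACHE_LINE 64u   /* A53 L1D cache line size in bytes */

    /* Sketch of an invalidate-by-VA-to-PoC loop over [start, end). */
    static void dcache_inv_range(uintptr_t start, uintptr_t end)
    {
        uintptr_t va;

        for (va = start & ~((uintptr_t)A53_DCACHE_LINE - 1u); va < end; va += A53_DCACHE_LINE) {
            /* "dc ivac": data cache invalidate by VA to PoC. Per A53 TRM 6.2.4 this
             * performs a clean first if the line is dirty within the cluster. */
            __asm__ volatile("dc ivac, %0" : : "r"(va) : "memory");
        }
        __asm__ volatile("dsb sy" : : : "memory");   /* wait for the maintenance ops to complete */
    }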

According to the A53 TRM:

"dc ivac" is "Data cache invalidate by VA to PoC", but the "point of coherence" is outside of the processor system and depends on the external memory system.

"ic ivau" is "Instruction cache invalidate by virtual address (VA) to PoU", and the "point of unification" apparently depends on a configuration signal BROADCASTINNER.

Later the TRM says "If the data is dirty within the cluster then a clean is performed before the invalidate.", which would explain why the sequence described above is working on the A53.

I believe the situation is further complicated by the MSMC, which is involved in coherency and might act as an L3 cache. The default configuration for the AM65x with TI-RTOS seems to use all of the MSMC SRAM as SRAM (no L3 cache), though.

  • Can someone tell me what the inner and outer shareability domains for the A53s in the AM65x are?
  • Where is the "point of coherence" and the "point of unification"?
  • Is the invalidate operation on the A53 really a clean-and-invalidate?
  • If the invalidate really cleans any dirty data, does that mean I have to invalidate twice in order to manually maintain coherency? Once before a buffer might be written by DMA (to clean any potentially dirty lines), and once before I try to read it (to ensure the latest data is fetched)?
  • Is there any configuration of the MSMC (or other parts of the memory subsystem) necessary to achieve coherency? There's an old thread (https://e2e.ti.com/support/processors/f/791/t/741291) that says "Yes enabling cache coherency [...] is supported". If it "can be enabled" does that also mean that it starts out disabled?
  • The MSMC apparently only takes care of coherency for DDR and MSMC_SRAM (?). How about data in MCU_SRAM or one of the PRU RAMs?

Regards,

Dominic

  • Dominic,

    I'll need to consult internally on these questions and get back with you.

    Regards,
    Frank

    • Can someone tell me what the inner and outer shareability domains for the A53s in the AM65x are?

    If it is shared it should be marked both inner and outer shared. In theory you could try to limit snoop overhead with a more elaborate scheme, using outer shared only for memory with IO access, but with the AM6548 you probably would not see a benefit from this optimization.

    • Where is the "point of coherence" and the "point of unification"?

    DDR and MSMC SRAM are the PoC and PoU for the A53s, including their caches. All other masters, including DMA and the Cortex-R, see DDR and MSMC SRAM as coherent with the A53s' caches. Nothing else on the chip is cache coherent.

    • Is the invalidate operation on the A53 really a clean-and-invalidate?

    The C in DC CIVAC is for clean; DC IVAC is just invalidate by virtual address. But note there is no guarantee that the A53 only write-allocates: it can write-stream, and there is no explicit control to guarantee data was not already written out to the memory backing the cache.

    • If the invalidate really cleans any dirty data, does that mean I have to invalidate twice in order to manually maintain coherency? Once before a buffer might be written by DMA (to clean any potentially dirty lines), and once before I try to read it (to ensure the latest data is fetched)?

    With the AM6548 the A53 is coherent with IO: DMA reads will snoop from the A53 caches, and writes will invalidate, as long as you use memory barriers (DMB is sufficient). SW-managed coherency will be significantly slower, and you will still need barriers to ensure all cache maintenance operations complete.
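
    For the cases that are not IO-coherent (for example the R5F, MCU SRAM or the PRU RAMs), the SW-managed pattern is roughly the sketch below. It assumes the PDK OSAL CacheP_wb()/CacheP_Inv() prototypes from ti/osal/CacheP.h and leaves the DMA/IPC start and completion calls out.

        #include <stdint.h>
        #include <ti/osal/CacheP.h>

        /* CPU produced data that a non-coherent master will read:
         * write back (clean) the lines so the latest bytes are in memory. */
        void prepare_buf_for_device_read(void *buf, int32_t len)
        {
            CacheP_wb(buf, len);
            /* If you hand-roll the maintenance ops instead of using the OSAL,
             * finish with a DSB before telling the other master to start. */
        }

        /* A non-coherent master wrote the buffer and the CPU will read it:
         * invalidate so the CPU does not see stale cached data. */
        void prepare_buf_for_cpu_read(void *buf, int32_t len)
        {
            CacheP_Inv(buf, len);
        }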

    • Is there any configuration of the MSMC (or other parts of the memory subsystem) necessary to achieve coherency? There's an old thread (https://e2e.ti.com/support/processors/f/791/t/741291) that says "Yes enabling cache coherency [...] is supported". If it "can be enabled" does that also mean that it starts out disabled?

    In the AM6548 the north bridge (NB) registers, named NAVSS_MEMATTR16M0_Y for DDR and NAVSS_MEMATTR64K_Y for SRAM, implement the role of an IOMMU: 16MB regions for DDR, 64kB regions for MSMC SRAM. The memory definitions in these registers need to be consistent with the A53 MMU settings. One possible configuration is to mark all DDR and MSMC SRAM memory as shared and normal cacheable.

    • The MSMC apparently only takes care of coherency for DDR and MSMC_SRAM (?). How about data in MCU_SRAM or one of the PRU RAMs?

    Correct. The A53s, MSMC SRAM and DDR are coherent for all A53-generated traffic and for all traffic from system masters like DMA to or from MSMC SRAM or DDR.

      Pekka

  • Hello Pekka,

    thanks for your replies.

    Pekka Varis said:
    The C in DC CIVAC is for clean, DC IVAC is just invalidate by virtual address. But note there is no guarantee that A53 only write allocates, it can write stream, there is no explicit control to guarantee data was not written out to memory backing up the cache.

    My issue is not that I wouldn't expect some data to be in main memory already; my issue is that, consistently, all of the data I invalidated was already in main memory. I was trying to rule out that there is anything wrong with my configuration that could lead to the caches operating in a write-through mode. On the R5F the FWT (force write-through) bit used to be set by previous versions of the PDK/BIOS, but I couldn't find anything similar for the A53.

    The only explanation that I found was that paragraph from the A53 TRM: "DCIMVAC operations in AArch32 and DC IVAC instructions in AArch64 perform an invalidate of the target address. If the data is dirty within the cluster then a clean is performed before the invalidate.".

    Pekka Varis said:
    SW managed coherency will be significantly slower, you will still need barriers to ensure all cache management operations complete.

    In general I agree with you, but if we're using MCU SRAM to exchange data between the A53 and the R5f then we need to manually manage coherency or make sure that the A53 maps that memory as uncacheable, right?

    Pekka Varis said:
    In AM6548 the north bridge (NB) registers, named NAVSS_MEMATTR16M0_Y for DDR, and NAVSS_MEMATTR64K_Y for SRAM, implement the role of IOMMU. 16MB regions for DDR, 64kB regions for MSMC SRAM. The memory definitions in these registers need to be consistent with the A53 MMU settings. One possible configuration is to mark all DDR and MSMC SRAM memory shared and normal cacheable.

    Ok, so the contents of these registers specify how DMA accesses to DDR or MSMC SRAM behave with regard to the A53's caches?

    Is this configuration something that PDK/BIOS handles for me, or do I have to configure these registers myself?

    Regards,

    Dominic

  • I checked the PDK sources regarding those MEMATTR registers, and while I see them declared I couldn't find any reference where they're accessed.

    Also, the TRM tells me where the bit fields are, but not the encoding of those bits. The headers in pdk.../packages/ti/csl/src/ip/navss/V0/cslr_nb.h likewise define only the locations/masks, but not the encoding. Is this documented somewhere?

    Regards,

    Dominic

  • Dominic,

    Sorry about that; clearly the TRM is missing key information. We need to update the TRM, and probably also the file cslr_nb.h, to define what the bit fields mean. Here is the definition of the fields; the bit mapping is based on the ARMv8 page table format. The cache allocation is a hint for the MSMC L3$ when it is configured; it will not allocate in the L2$.

    Bits   Field     Type  Reset  Description
    31:8   reserved  r/o   0      Always read as 0. Writes have no effect.
    7:6    memtype   r/w   1      Type of the memory:
                                  0 = Device, 1 = Writeback, 2 = Writethrough, 3 = Non-cacheable
    5:4    sdomain   r/w   1      Shareability domain of the memory:
                                  0 = Non-shared, 1 = Inner shared, 2 = Outer shared, 3 = System shared
    3:2    outer     r/w   0      Outer allocatability of the memory:
                                  0 = Non-allocatable, 1 = Writes allocate (reads do not),
                                  2 = Reads allocate (writes do not), 3 = Reads and writes allocate
    1:0    inner     r/w   0      Inner allocatability of the memory:
                                  0 = Non-allocatable, 1 = Writes allocate (reads do not),
                                  2 = Reads allocate (writes do not), 3 = Reads and writes allocate
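
    As a sketch of how these fields compose into a register value (the symbolic helper below is mine, not from cslr_nb.h):

        #include <stdint.h>

        /* memtype: 0=Device 1=Writeback 2=Writethrough 3=Non-cacheable
         * sdomain: 0=Non-shared 1=Inner 2=Outer 3=System
         * outer/inner allocation hints: 0=none 1=write 2=read 3=read+write */
        static inline uint32_t nb_memattr(uint32_t memtype, uint32_t sdomain,
                                          uint32_t outer, uint32_t inner)
        {
            return ((memtype & 3u) << 6) | ((sdomain & 3u) << 4) |
                   ((outer   & 3u) << 2) |  (inner   & 3u);
        }

        /* The reset values (memtype=1, sdomain=1, outer=0, inner=0) give
         * nb_memattr(1, 1, 0, 0) == 0x50: write-back, inner shared, no
         * L3$ allocation hints. */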

      Pekka

    I don't have a good explanation for why the invalidated data on the A53 is always already in the backing memory. I suspect a mismatch between the MMU settings and the NB MEMATTR could be one cause. But the use case of the A53 having normal cached memory while SW controls when writes reach the memory backing the cache is not explicitly defined.

    For the A53, the 512kB OCRAM close to the MCU is not cache coherent; MSMC SRAM is. The R5 reading and writing MSMC SRAM is cache coherent with the A53. My suggestion is to use the coherent SRAM for A53-to-R5 communication, with the R5 doing SW-managed cache coherency. Or: MSMC SRAM for R5-to-A53 communication, and R5 TCM for A53-to-R5 communication.

    Regarding the MEMATTR, see my other reply with the field definitions; my apologies that our TRM and cslr_nb.h are missing the key information.

      Pekka

  • Hello Pekka,

    thanks a lot for your replies.

    Pekka Varis said:
    The cache allocation is a hint for the MSMC L3$ when configured, it will not allocate in L2$.

    Ok, so the OUTER and INNER fields shouldn't matter as long as I don't have MSMC configured as L3 cache?

    According to the AM65x TRM and the table you posted above, the NB MEMATTR registers configure all of MSMC SRAM and DDR memory as write-back, inner-shareable memory with the allocation hints set to non-allocatable (0x50).

    Pekka Varis said:
    For the A53 invalidated data to always be at the backing memory I don't have a good explanation.

    Sorry if I keep pestering you about this. Don't you agree that the paragraph in section 6.2.4 of the A53 TRM (DDI 0500J) explains the behaviour I'm seeing? Or are you interpreting it differently? On the other hand, if this really is the result of a configuration mismatch between the MMU and the MEMATTR settings, then that would be exactly what I've been fearing.

    I'll try configuring the MEMATTR registers in our TI-RTOS A53 application according to our MMU setup. We map all DDR and MSMC SRAM memory as write-back inner/outer shareable, so that shouldn't be much of a problem. I agree that we need to get rid of all cache maintenance calls for data in these memories, since they're obviously not required.

    How about Linux? Do you know if there's code in the TI Linux kernel for the AM65x that configures these registers? I did a grep for MEMATTR and couldn't find anything with regard to the AM65x. Are there guarantees in the kernel that no mapping is ever created with different attributes? Is there any other code that configures the MEMATTR registers, maybe in U-Boot or ATF? Judging from the device tree files it looks like the TI BSP configures the MSMC as half L3 cache / half SRAM, so the allocation hints should matter here, too.

    Regarding A53/R5 communication, I was thinking of using MCU SRAM because the R5F has a very high latency when accessing MSMC SRAM. Since we might have to put time-critical parts, e.g. for EtherCAT communication, onto the R5F, it could hurt performance if memory accesses caused the R5F to stall for too long. Anyway, I guess I understand what our options are here.

    Regards,

    Dominic

    On second look I do read A53 TRM section 6.2.4, "If the data is dirty within the cluster then a clean is performed before the invalidate", as a good explanation; earlier I was maybe not paying enough attention. The typical pattern in ARM documentation is that the Architecture Reference Manual is logically complete but may leave some uncertainty, and the implementation-specific TRM, like the A53 TRM, then documents the implementation choice made. So it looks like an invalidate will always perform a clean on the A53, and DC IVAC and DC CIVAC behaviour would be identical.

    For sure, unpredictable behavior regarding coherency will happen if the NB0 and NB1 MEMATTR settings are not consistent with the A53 page tables; snoops will be filtered by the MSMC, etc.

    The value 0x50, mapping to WB-InnerShared with no read/write allocation hints, looks to me like the value we use for Linux. Everything is shared, and no larger shareability domain (outer or system) is used. It is not dynamically changed, so it won't show up in the normal IOMMU code location; I'm looking for a pointer to the code. Yes on the half-and-half MSMC split, but the static 0x50 setting is also true, so the default does not leverage the L3$ allocation.

    I feel your pain on the R5 <-> A53 interaction; the primary use case scenarios have both running fairly independently, not communicating regularly in a control loop. That is why my suggestion is a push model: to the R5 via TCM, to the A53 via IO-coherent MSMC SRAM.

  • Dominic,

    Confirming that 0x50, mapping to WB-InnerShared-noRWA, is set up during boot by the SPL running on the R5 and not touched by Linux.

      Pekka

  • Hello Pekka,

    Pekka Varis said:
    On second look I do read the A53 TRM section 6.2.4 " If the data is dirty within the cluster then a clean is performed before the invalidate " to be a good explanation.

    thanks for checking that part again. I guess we can assume that the original issue is solved.

    Pekka Varis said:
    The value 0x50, mapping to WB-InnerShared-noRWA for Inner looks to me like the value we use for Linux. Everything is shared and there is no larger shared domain of outer or even larger system.

    Ok, so "inner" with regard to the NBSS means "inside the AM65x"?

    Pekka Varis said:
    Confirming 0x50, mapping to WB-InnerShared-noRWA, is setup during boot by the SPL running on the R5 and not touched by Linux.

    Could you tell me where this happens within U-Boot SPL? I was hoping to find the code so that I could use it as a reference, because I'm still not 100% sure I know what registers to use:

    According to the AM65x TRM (SPRUID7E) there are two instances of each of the register sets (NB_MEMATTR64K_Y and NB_MEMATTR16M[01]_Y):

    • NAVSS0_NBSS_NB0_MEM_ATTR0_CFG @ 03820000h has NB_MEMATTR64K_Y
    • NAVSS0_NBSS_NB0_MEM_ATTR1_CFG @ 03828000h has NB_MEMATTR64K_Y
    • NAVSS0_NBSS_NB1_MEM_ATTR0_CFG @ 03840000h has NB_MEMATTR16M0_Y and NB_MEMATTR16M1_Y
    • NAVSS0_NBSS_NB1_MEM_ATTR1_CFG @ 03850000h has NB_MEMATTR16M0_Y and NB_MEMATTR16M1_Y

    NB0 is SRAM (hence 64K regions) and NB1 is DDR (hence 16M regions), but what do ATTR0 and ATTR1 refer to?

    I assume NB_MEMATTR16M0_Y vs. NB_MEMATTR16M1_Y refers to the lower (below 4GB) and upper DDR memory ranges.

    I was also looking into the AM75x TRM (SPRUIL1A), and that document doesn't even mention these registers. Is it because the AM75x has the VIRTSS with its SMMU for this purpose?

    Pekka Varis said:
    I feel your pain on the R5 <-> A53 interaction, primary use case scenarios have both running fairly independently, not communicating regularly in a control loop. that is why the push model of to R5 is TCM, to A53 is IO coherent MSMC SRAM is my suggestion.

    Okay, thanks for the explanation.

    Best Regards,

    Dominic

    Yes, with 0x50 the inner shared domain is the A53 clusters, DDR, and MSMC SRAM.

    MEM_ATTR0 and MEM_ATTR1 are not, I believe, the 32-bit vs. 64-bit memory map; it looks like the settings are for traffic with OrderID 0:7 and OrderID 8:15 respectively. The purpose of OrderID is to optionally isolate traffic for QoS purposes, as described in section 3.3.2 Quality of Service (QoS) in the TRM. Traffic on OrderID 0:7 arrives on the port covered by ATTR0, and traffic on OrderID 8:15 arrives on the port covered by ATTR1. I don't see why you would have the two sets of registers inconsistent, so I'd have the same values in both. Again, I apologize that the TRM is not sufficient on this. I'm looking for the bootloader code that sets these registers.
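
    As a sketch of what I mean by keeping both consistent (the base addresses are the ones you listed; the entry offset and count are placeholders that need to come from cslr_nb.h / the TRM register tables):

        #include <stdint.h>

        #define NB0_MEM_ATTR0_CFG_BASE  0x03820000u  /* NB_MEMATTR64K_Y, OrderID 0:7  */
        #define NB0_MEM_ATTR1_CFG_BASE  0x03828000u  /* NB_MEMATTR64K_Y, OrderID 8:15 */

        /* Placeholders: take the real array offset and the number of 64kB-region
         * entries from cslr_nb.h / the TRM. */
        #define NB_MEMATTR64K_OFFSET    0x0u
        #define NB_MEMATTR64K_ENTRIES   1u

        static void nb0_set_memattr_all(uint32_t value)
        {
            volatile uint32_t *attr0 =
                (volatile uint32_t *)(NB0_MEM_ATTR0_CFG_BASE + NB_MEMATTR64K_OFFSET);
            volatile uint32_t *attr1 =
                (volatile uint32_t *)(NB0_MEM_ATTR1_CFG_BASE + NB_MEMATTR64K_OFFSET);
            uint32_t i;

            for (i = 0u; i < NB_MEMATTR64K_ENTRIES; i++) {
                attr0[i] = value;   /* keep both OrderID groups consistent */
                attr1[i] = value;
            }
        }

        /* e.g. nb0_set_memattr_all(0x50u);  write-back, inner shared, no allocate */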

    The AM75x has an SMMU, which makes the memory region registers redundant; instead you describe the coherency with normal page tables.

    I'm still looking for the pointer to the code that sets up the MEM_ATTR registers, but I wanted to respond to the other questions already.

      Pekka

  • Dominic, Pekka,

    I haven't seen any activity on this thread for a while now, so I'll close the thread.

    Regards,
    Frank

  • Hello Frank,
    Hello Pekka,

    I was under the impression that Pekka wanted to look for the code that sets up the MEM_ATTR registers.

    Did you have any success looking for that code?

    The U-Boot code might of course simply rely on the reset default values, so maybe there isn't any code that shows how these registers are set up.

    Other than that, I agree that this issue is solved.

    Regards,

    Dominic

  • Dominic,

    Sorry it took a little while to close on this. A new revision of the TRM is in the pipeline; the key clarifications for _ATTR0 and _ATTR1 are below. It is a QoS separation based on the bus OrderID information, not related to 32-bit memory addressing (i.e. lower 4GB vs. upper). Having a different memory setting for the same region based on OrderID is probably a very unusual use case; having the same setting for both _ATTR0 and _ATTR1 is typical.

    10.2.9.3.3 NAVSS0_NBSS_NB0_MEM_ATTR0_CFG Registers

    Memory attributes for traffic using OrderID [0-7] on VBUSM0 slave interface.


    and

    10.2.9.3.4 NAVSS0_NBSS_NB0_MEM_ATTR1_CFG Registers

    Memory attributes for traffic using OrderID [8-15] on VBUSM1 slave interface.

    The OrderID is an internal bus QoS parameter used to select the route/bridge port to take. The intended use is, for example, to separate traffic so that everything going to MSMC SRAM uses a port that does not carry any traffic going to DDR. That way congestion or DDR-refresh-type events will not create timing burstiness on reads and writes to SRAM. This is relevant if the use case is concerned with 1-microsecond-level worst-case behavior.

    With regard to the values set in the registers and example code, 0x50 is the reset setting. Everything is marked as inner shared domain (no need to mark it outer shared, as outer is always a superset of inner). Adding L3$ cache allocation is probably the most relevant change to the reset values I'd think about; this should increase throughput in applications like networking.
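
    As a worked example against the field table earlier in this thread (whether the allocation hints pay off depends on how the MSMC L3$ is configured):

        /* Reset value: write-back (1 << 6) | inner shared (1 << 4) | no allocation = 0x50.
         * Turning on read+write allocation hints in both the outer and inner fields
         * gives 0x50 | (3u << 2) | 3u = 0x5F. */
        #define NB_MEMATTR_RESET     0x50u
        #define NB_MEMATTR_L3_ALLOC  (0x50u | (3u << 2) | 3u)   /* = 0x5F */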

    And yes, the AM75x uses an SMMU-type approach to control this information.

      Pekka

  • Hi Dominic,

    Sorry for the delayed response on this thread.

    Here is the latest AM65x TRM (published in 2019) that we have for external communication. I will get back to you with the status of the newer version.

    https://www.ti.com/lit/pdf/spruid7

    Thanks & Regards,

    Sunita.

  • Hello Sunita,

    I'm aware of the latest version of the TRM. This is what I've been using since December 2019.

    Pekka mentioned "A new revision of the TRM is in the pipeline" back in March, but apparently nothing has happened since then, or at least it isn't visible to users. It would be great if you could figure out when an update of the TRM could be expected.

    This isn't the only piece of information missing from the TRM, and it's hard to keep track of all the information provided only on the E2E forum.

    Regards,

    Dominic

  • Hi Dominic,

    Yes, the next version of the TRM, Version F, which addresses these cache coherency settings, is pending external publication.

    Our TRM team is still working on it. Can you please sign up for ti.com update auto-notifications, if you haven't already, so that you get notified as soon as the TRM is updated on ti.com?

    Thanks & Regards,

    Sunita.