
AM6548: Cache coherency between A53 and R5f for DDR / MSMC SRAM

Part Number: AM6548
Other Parts Discussed in Thread: SYSBIOS

Dear TI team,

We're trying to split our application to run timing-critical code on a dedicated R5f core. For this we're looking at our options for implementing inter-processor communication via shared memory.

I've always been under the impression that the A53 is cache-coherent with other bus masters when it comes to DDR and MSMC SRAM (see also related thread). This has for example been true for an external PCIe device writing via DMA into our DDR memory.

While trying to get the IPC driver code working in our own application we've come across issues relating to cache coherence. Unfortunately the TI IPC examples appear to configure the shared memory as uncacheable on the A53, so we can't look at those to figure out the right way to configure this.

Our setup looks like this:

  • A53 running TI-RTOS application
    • All of DDR memory mapped as normal, cacheable memory (MAIR 7)
  • R5f running bare-metal application
    • Most of DDR memory mapped as normal, WBWA memory via lower priority MPU entry
    • Part of DDR memory mapped as strongly ordered or device memory via higher priority MPU entry (tried several different mappings)

The A53 and R5f use the same physical memory range for the IPC/VRING stuff, but we've had issues with not all of the data being visible on the A53 after writes from the R5f.

To us it looks like the snooping into A53 caches is failing for accesses coming from the R5f. Since the DDR memory is coherent with the same settings for accesses coming from PCIe, for example, we're assuming that the A53 setup is fine.

Is there any example showing cache-coherent memory being shared between A53 and R5f?

How should we configure the R5f MPU to ensure that the MSMC snoops A53 caches for our R5f accesses?

Do we have to configure something within the MSMC / NBSS / ... for this to work?

Are there any recommendations from TI in order to debug this issue?

Unfortunately the TRM is rather sparse on details regarding cache coherence (see also related thread), and there still hasn't been an update to the TRM. The latest version is still from 2019, despite previous promises that a newer version was in the pipeline (March 2020) or pending external publication (January 2021). Is there any hope of that new TRM being released anytime soon? Is there any reliable schedule for this?

Best Regards,

Dominic

  • We've further debugged this issue.

    It seems that the debugger (XDS110) / CCS has been fooling us, i.e. the debug memory view can't be trusted with regard to memory coherence. To make things worse, we've repeatedly seen subsequent runs of the target fail when accessing the DDR memory if we had the "CS_DAP_0" opened via the Scripting Console before, and/or if we had an SoC analysis transaction trace running:

    • Debug the target, use "CS_DAP_0" to read/write memory and/or collect traces with transaction logging (Tools->Traffic Profiling).
    • Disconnect all processors in the debugger, leave the debug session running
    • Power-cycle target (power-off for several seconds)
    • Sometimes our bootloader would crash before the first UART output, sometimes it would crash somewhere within the bootloader, or fail to load the application

    We didn't see crashes if we didn't use the transaction log and didn't open "CS_DAP_0". It doesn't seem to matter whether we use the debugger to load applications to the A53 or R5f, as long as we disconnect the cores before power-cycling, and don't use "CS_DAP_0" and the transaction log.

    Our latest findings suggest that the A53's view of the memory gets updated, but it takes (a lot) longer than we expected.

    Our tests show that it doesn't matter (not sure if there are subtle differences) if the R5f memory is strongly ordered, normal+shared+uncached or normal+shared+wbwa memory, i.e. in all cases our writes from the R5f are immediately seen in the transaction log.

    Our R5f code does the following:

    • Wait on a strongly-ordered location in PSRAM for a "request" from the A53
    • Write a 64-byte buffer (strongly-ordered, or normal+shared+x) using 4-byte stores
    • Write strongly-ordered location in PSRAM to indicate it's done
    • Wait again (i.e. a while(1) loop)
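The R5f side of this boils down to something like the following sketch. The flag/buffer locations are passed in so the logic is testable off-target; on target these would be the fixed addresses (flags in MCU_PSRAM0_RAM at 0x40280000, buffer in DDR), and the function names and `dmb()` helper are mine, not from TI code:

```c
#include <stdint.h>

/* Minimal sketch of the R5f responder loop described above.
 * All names and the parameterisation are illustrative. */

static inline void dmb(void)
{
#if defined(__arm__) || defined(__aarch64__)
    __asm__ volatile("dmb sy" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");   /* host build: compiler barrier only */
#endif
}

/* One request/response round: wait for the request flag, write 64 bytes
 * as 4-byte stores, then acknowledge completion. */
void r5f_serve_one(volatile uint32_t *req, volatile uint32_t *ack,
                   volatile uint32_t *buf, uint32_t value)
{
    while (*req == 0) { }        /* wait on the strongly-ordered request flag */
    for (int i = 0; i < 16; i++) /* 64 bytes = 16 x 4-byte stores */
        buf[i] = value;
    dmb();                       /* order buffer writes before the ack flag */
    *req = 0;
    *ack = 1;                    /* signal completion to the A53 */
}
```

On target the same function would simply be called in a `while (1)` loop with the fixed PSRAM/DDR addresses.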

    Our A53 code does the following:

    • Runs a TI-RTOS application with lots of code, but in that case mostly idle
    • Receives a command via a UART command line interpreter
    • Writes a device+nG+nR+nE ("strongly ordered") location in PSRAM to request the R5f to start writing the buffer
    • Receives a command via a UART command line interpreter
    • Reads the device+nG+nR+nE ("strongly ordered") location in PSRAM to see if the R5f is already done (it always is, since the whole UART driver + cmdline interpreter takes way longer than the R5f to "see" the request and perform the 64 bytes write)
    • Receives a command via a UART command line interpreter
    • Reads the normal memory (MAIR 7) buffer that was supposedly written by the R5f (we always see the original content, not the newly written content)
    • Receives a command via a UART command line interpreter
    • Reads the normal memory (MAIR 7) buffer that was supposedly written by the R5f (we always see the original content, not the newly written content) <= we really do repeat that step
    • Waits on the UART command line driver for the next request

    If we try reading the buffer via our command line interpreter a few seconds later, we see the recently written new content. Sometimes it takes several retries, i.e. we see the old content for several attempts, each with roughly 1 second delay (command line interaction). Eventually we see the new content.

    We've also implemented another test case that automates the steps on the A53, with even more surprising results:

    • Runs a TI-RTOS application with lots of code, but in that case mostly idle
    • Receives a command via a UART command line interpreter
    • Reads the normal memory (MAIR 7) buffer to see the current content
    • Writes a device+nG+nR+nE ("strongly ordered") location in PSRAM to request the R5f to start writing the buffer
    • Spins in a loop reading the device+nG+nR+nE ("strongly ordered") location in PSRAM to wait for the R5f to be done
    • Spins in a loop reading the normal memory to wait for the new content from the R5f. If we don't see a change for 5+ seconds, we break the loop.

    Once we exit that loop, and re-read the normal memory using our command line interpreter, we eventually see the new value (same as above). But we could wait for a lot longer in the loop than we ever waited on the command line.

    We've also added memory barriers to the A53 code, i.e. write SO memory, DMB, spin on SO memory, DMB, spin on normal memory, but without success.
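That barrier placement looks roughly like the following sketch (host-testable; the pointer parameters and the `max_spins` timeout are illustrative stand-ins for our fixed addresses and the "few seconds" abort):

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of the A53 sequence: request via a Device-nGnRnE (SO) flag,
 * then poll the cacheable shared buffer, with DMBs in between. */

static inline void dmb(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dmb sy" ::: "memory");
#else
    __asm__ volatile("" ::: "memory");   /* host build: compiler barrier only */
#endif
}

bool a53_request_and_poll(volatile uint32_t *req, volatile uint32_t *ack,
                          volatile uint32_t *buf, uint32_t old_value,
                          unsigned max_spins)
{
    *req = 1;                      /* request: write to the SO location */
    dmb();
    while (*ack == 0) { }          /* ack via SO location: this part works */
    dmb();
    for (unsigned i = 0; i < max_spins; i++) {
        if (buf[0] != old_value)   /* cacheable read: stale unless snooped */
            return true;
        dmb();
    }
    return false;                  /* timed out: A53 still sees old data */
}
```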

     

    We haven't seen similar issues with DMA accesses from outside masters, although I'm not 100% sure that our tests would have detected this behavior, especially since I'm not sure what causes the R5f writes to eventually become visible.

     

    I've noticed that there is an erratum i2021 that might be related to our issue, but according to Errata Rev. E (June 2020) this only applies to SR1.0 hardware. We're using SR2.0 hardware. Reading that erratum and the proposed workaround, I'm not sure how I could ensure that "all DMA masters perform coherent transactions only to any memory that is cached locally".

    • Does that erratum apply to SR2.0 as well, or does it really only apply to SR1.0?
    • How can we ensure that our DMA masters perform coherent transactions?
    • How can we control whether our R5f transactions are coherent?

     

    Regards,

    Dominic

  • I've reduced this to two small test projects running on an AM65x IDK SR2.0:

    • A53 TI-RTOS application with a single thread
      • very basic UART based menu ('s'tart test, 'd'ump memory, change 'v'alue, 'f'lush caches)
      • actual test loop:
        • Disable interrupts
        • tell R5f to start via strongly-ordered memory location
        • wait for R5f to ack completion via strongly-ordered memory location (spin)
        • DMB
        • wait for shared memory (normal memory on A53) content to change (spin)
          • DMB between reads
          • abort if no change for a few seconds
        • Reenable interrupts

    • R5f bare-metal application
      • wait for start signal from A53 on strongly-ordered memory location
      • write 64 bytes of shared memory (strongly ordered on R5f) with new value
      • signal completion to A53 on strongly-ordered memory location

    The new content written by the R5f only becomes visible if I manually flush the caches on the A53. The test spins for a few seconds, then aborts the loop. Dumping the memory shows the previous content. After flushing the caches, the new content is visible.

    The previous behavior that showed the content changing "eventually" was probably due to the much larger A53 TI-RTOS application that eventually caused the cached data to be evicted (or at least I don't have another explanation right now).

    I've verified the A53 MMU settings, and the shared memory (like all of our RAM) uses attribute index 7 with shareability set to outer shareable. MAIR7 is set for write-back cacheable GRE normal memory.

    I've tried setting the BCM bit in MSMC_COHCTRL but that didn't have any effect.

    My current assumption is that the NAVSS North Bridge (NB) is the component that should control whether transactions coming from the R5f are coherent (see previous question), since that appears to be the bridge from the non-coherent VBUSM to the coherent VBUSM.C. According to the TRM it uses a "memory attribute lookup table to add the coherence memory attributes required for VBUSM.C for VBUSM commands missing them".

    The memory attribute tables that I could find are all configured for inner shareable (see related thread for how to determine that), and that should be all right.

    Maybe someone from TI is able to help me here?

    Regards,

    Dominic

  • Dominic,

    The A53 page table entry memory settings should match the NB settings for the same region. If I read the description correctly, the A53 page table entry for strongly ordered memory uses MAIR attributes of all zeroes, meaning Device-nGnRnE in AArch64. This sounds like it works in both directions.

    The R5F writes to what is configured as strongly ordered in its MPU (ARMv7); those writes show up in memory for the A53 via NB_MEMATTR64K_Y (or NB_MEMATTR16M0_Y for DDR) setting the attributes to something inner shareable. That something in the NB_MEMATTR register needs to be identical to what the A53 MMU sets for the page attributes covering this same region. I would suggest using inner shareable Write-Back Cacheable (transient or non-transient is a don't-care on AM6548) for the region the R5 is writing to.

    For coherency on AM6548 I always consider the entire coherent domain to be one, so inner and outer are the same (all A53's and the coherent DMA).

    According to the TRM it uses "memory attribute lookup table to add the coherence memory attributes required for VBUSM.C for VBUSM commands missing them".

    Correct, the MSMC snoop filter uses the NB_MEMATTR values to decide if the A53 L2 needs to be invalidated. Or snooped in case of a read.

    If I manually flush the caches

    flush here means "sw manage" equal to invalidate right?

      Pekka

  • Hello Pekka,

    thank you very much for getting back to me on this issue.

    For the strongly-ordered location I'm actually using MCU_PSRAM0_RAM @ 0x40280000. This is mapped with MAIR0 (http://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/sysbios/6_83_00_18/exports/bios_6_83_00_18/docs/cdoc/ti/sysbios/family/arm/v8a/Mmu.html#.M.A.I.R0) which is indeed all zeroes. The R5f has the MCU_PSRAM0_RAM mapped strongly-ordered. Communication via this memory is working just fine in both directions.

    I would like to map a part of DDR memory used for actual data as normal memory on the A53 (MMU), and access it as strongly-ordered (maybe normal-uncached with appropriate barriers) on the R5f (via MPU). This isn't working.

    The NAVSS0_NBSS_NB1_MEM_ATTR0_CFG.NB_MEMATTR16M0_Y and NAVSS0_NBSS_NB1_MEM_ATTR1_CFG.NB_MEMATTR16M0_Y registers are all set to 0x50. That should mean "inner shareable".

    flush here means "sw manage" equal to invalidate right?

    Yeah, I actually call Cache_wbInv() from SYS/BIOS. That code uses "dc      civac, x0" to clean and invalidate the cache. I understand that software-managing caches contradicts the hardware coherency and shouldn't be used, nor should it be necessary. My issue is that it appears to be necessary, even though everything else should be set up correctly, at least as far as I can tell. The problem manifests even if I never call that Cache_wbInv() function; I call it only in order to see the values previously written by the R5f.
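For reference, what Cache_wbInv() effectively does over a range is something like this sketch (64-byte line size assumed; on real hardware CTR_EL0 gives the actual line size, and the SYS/BIOS implementation differs in detail):

```c
#include <stdint.h>
#include <stddef.h>

/* Clean + invalidate the data cache by VA over [addr, addr+len),
 * line by line, then DSB. AArch64-only; no-op on host builds. */
#define CACHE_LINE 64u   /* assumed; read CTR_EL0 for the real value */

void cache_wbinv_range(uintptr_t addr, size_t len)
{
#if defined(__aarch64__)
    uintptr_t end = addr + len;
    addr &= ~(uintptr_t)(CACHE_LINE - 1);    /* align down to a line */
    for (; addr < end; addr += CACHE_LINE)
        __asm__ volatile("dc civac, %0" :: "r"(addr) : "memory");
    __asm__ volatile("dsb sy" ::: "memory"); /* complete maintenance ops */
#else
    (void)addr; (void)len;                   /* host build: nothing to do */
#endif
}
```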

    I could provide you with my sample code if that's any help. My setup consists of:

    • AM65x IDK SR2.0
    • SD-card with SBL from SDK 07.01 (the processor_sdk_rtos_am65xx_07_01_00_14\prebuilt-sdcards folder in that release contains outdated files. The pdk_am65xx_07_01_00_55\packages\ti\boot\sbl\binary contains files that match the source code from the release)
    • A dummy "stub" TI appimage that puts both A53 and R5f in an endless loop.
    • CCS to connect A53 and R5f and load the two test projects.

    Regards,

    Dominic

    Thanks, overall this makes sense.

    I've verified the A53 MMU settings, and the shared memory (like all of our RAM) use attribute index 7 with shareability set to outer shareable. MAIR7 is set for write-back cacheable GRE normal memory.

    My suspicion is still around a mismatch of the NB region attribute registers and the page table entry. The page table entries for the 16MB region (MAIR7) are something like 0b11001100 (looking at the ARMv8 architecture manual, D7.2.63 MAIR_EL1, Memory Attribute Indirection Register (EL1))? It should not matter, but are you using full AArch64 3-level page tables (or maybe 2 levels, with 2MB pages)?

  • Hello Pekka,

    this should give you an overview of how the A53 sees the shared memory (the normal memory, the one I'm interested in regarding coherency):

    TCR_EL1 is 0x0000000500002510

    As far as I understand this means I have a 48-bit address space with a 4KB granule.

    The page table built by SYS/BIOS has three levels for the 2MB block I'm accessing:

    TTBR0_EL1 = 0x70028000

    The address I'm trying to access is 0x9b800000:

    0x70028000[0] = 0x0000000070029003 (0-512 GB -> 0x70029000)

    0x70029000[2] = 0x000000007003f003 (3rd GB 0x80000000...0xc0000000)

    0x7003f000[220] = 0x004000009B80063D (2MB 0x9b800000...0x9ba00000)

    lower attributes are thus b0110001111:

    SH[1:0] is b10 == outer shareable

    The MAIR_EL1 register is 0xFFBB4F440C080400

    MAIR7 is thus 0xff, normal memory, outer and inner write-back non-transient read and write allocate.
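The walk above can be cross-checked with a small decoder for the fields in question (field positions per the ARMv8 stage-1 level-2 block descriptor format, 4KB granule; the struct and function names are just illustrative):

```c
#include <stdint.h>

/* Pull the discussed fields out of a stage-1 AArch64 level-2 block
 * descriptor: AttrIndx, SH, AF, and the 2MB-aligned output address. */
typedef struct {
    unsigned attr_indx;  /* bits [4:2]  -> index into MAIR_EL1 */
    unsigned sh;         /* bits [9:8]  -> 0b10 outer, 0b11 inner shareable */
    unsigned af;         /* bit  [10]   -> access flag */
    uint64_t pa;         /* bits [47:21] -> 2MB block base address */
} blk_desc_t;

blk_desc_t decode_l2_block(uint64_t desc)
{
    blk_desc_t d;
    d.attr_indx = (unsigned)((desc >> 2) & 0x7);
    d.sh        = (unsigned)((desc >> 8) & 0x3);
    d.af        = (unsigned)((desc >> 10) & 0x1);
    d.pa        = desc & 0x0000FFFFFFE00000ull;   /* mask to 2MB block base */
    return d;
}
```

Decoding 0x004000009B80063D this way gives AttrIndx 7, SH 0b10 (outer shareable), AF 1 and PA 0x9b800000, matching the values above.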

    I've worked on this issue some more, to verify my assumption that coherency is fine for other DMA masters, but fails for the R5f.

    For this I've set up the DRU to do a 2-dimensional block move (copy) from a buffer at 0x9ba00000 (also normal, cached memory) to my "shared memory" buffer at 0x9b800000. The "(i)o test" looks like this:

    • Disable interrupts
    • program DRU via DRU_SUBMIT_WORD0...11
    • DMB
    • wait for shared memory (normal memory on A53) content to change (spin)
      • DMB between reads
      • abort if no change for a few seconds
    • Reenable interrupts

    The memory copy from the DRU is always immediately visible, i.e. the DRU appears to be able to invalidate the A53's caches, while the R5f using strongly ordered (or any of the other memory types) doesn't seem to be able to do the same.

    Everything I see points at the R5f (or how it is connected to the MSMC) as the culprit, not the A53's mapping of the memory.

    Regards,

    Dominic

  • I've worked on this issue some more, to verify my assumption that coherency is fine for other DMA masters, but fails for the R5f.

    This would have been my exact next step to ask for. Unfortunately the DRU is the one DMA resource on AM6548 that lives in the coherent (VBUSM.C) side of the chip. DRU traffic does not go through the NB region based registers. All the other DMA's and masters go through NB. Would trying out UDMA with the same setup as DRU be possible?

    The one thing that does stick out to me is the MAIR7 as 0xff (I know this is from our sysbios init code). I believe it also means read and write allocate, bits 5:4 for outer and 1:0 for inner (in D7.2.63 MAIR_EL1, Memory Attribute Indirection Register (EL1): 0b11RW = Normal Memory, Outer Write-Back non-transient in the ARMv8 Arch, where R = read allocate, W = write allocate). So there is a mismatch: the NB MEMATTR sets no allocate (0x50) and the MMU entry says allocate. One would think allocate is only a hint, but maybe the mismatch is an issue. Could you try either 0xCC in MAIR7 (removes the allocate) or NB MEMATTR set to 0x5F (adds allocate to both inner and outer)?

    Sorry about this back and forth, bare-metal A53 is not a primary use case on AM6548. IO coherency is used in the Linux Ethernet/DMA so it should be working, but bare-metal A53 - R5 we just don't have an example.

  • Hello Pekka,

    I'm aware of the DRU being special in that regard. The DRU has its own set of MEM_ATTR registers. I've checked those, too, and they're all set to 0x50, just like the NB MEM_ATTR registers.

    I've used the DRU because the direct submit registers make it very easy to use. I won't be able to add a UDMA test, since that probably requires a lot more infrastructure OR the use of quite a lot of PDK code, which I tried to avoid for complexity reasons (the less code involved, the easier it is to pinpoint the source of the problem) and due to time constraints (I can't spend more time on this right now).

    Wouldn't dropping both allocation hints (neither RA nor WA) from the MAIR kind of defeat the purpose of having a cache, if the region isn't allowed to allocate in the caches? I'll give the changed NB MEMATTR a try, but I don't expect it to change anything.

    We haven't seen issues with other DMA masters snooping/invalidating the A53's caches - it is really only the R5f where I'm unable to use cached memory on the A53. We've used an external PCIe FPGA writing into our memory, we're using ICSSG dual-mac (via UDMA) to write into our memory, and we're using a DMA capable SD/MMC driver. Neither had any issues with coherency so far.

    At this point I'm pretty confident that there's an issue with how the R5f transactions reach the NB that prevents the A53's caches from being snooped.

    Regards,

    Dominic

  • We haven't seen issues with other DMA masters snooping/invalidating the A53's caches - it is really only the R5f where I'm unable to use cached memory on the A53. We've used an external PCIe FPGA writing into our memory, we're using ICSSG dual-mac (via UDMA) to write into our memory, and we're using a DMA capable SD/MMC driver. Neither had any issues with coherency so far.

    OK, especially with this datapoint that IO coherency from multiple DMA masters works, it is probably not those few bits. I will look into what could be different for the R5 on AM65x. The transactions from DMA or R5, once they show up at the NB, should not look any different. Two thoughts:

    1. The physical address the R5 uses: could you try using the high alias, 0x08 0000 0000 and up (from the R5, the address you use, 0x9b800000, would become 0x081b800000 after the RAT)? This is set in the RAT, see 6.3.3.5 MCU_ARMSS Region-Based Address Translation (RAT). I don't see this restriction documented in the TRM, but I'm fairly certain the IO coherency is designed for the "40-bit physical" memory map; if the R5 access (after RAT) uses the low 32 bits for DDR, that might be the issue. The alias mapping of the first 2GB of DDR into the low 32 bits of the SoC-level 40-bit memory map is something I'd generally not use; use 0x08 0000 0000 and up instead.

    2. The bit in 8.1.4.5 MSMC_COHCTRL Register (Offset = 2048h) [reset = 0h], which forces broadcasting of everything at the snoop filter. Looks like you already tried this, so not relevant.

  • Hello Pekka,

    Could you try either 0xCC in the MAIR7 (removes the allocate) or NB MEMATTR to 0x5F (adds allocate to both inner and outer).

    I tried setting the NB MEMATTR registers at 0x3840000 and 0x3850000 to 0x5f, but that didn't change the behavior at all. Transfers from the R5f are still not visible in the A53's caches unless I invalidate the A53 caches. Transfers from the DRU are immediately visible.

    1. The physical address R5 uses, could you try using the 0x08 0000 0000-> (From R5 the address you use 0x9b800000 becomes 0x081b800000) from the R5 after the RAT. This is set in the RAT 6.3.3.5 MCU_ARMSS Region-Based Address Translation (RAT). I don't see this restriction documented in the TRM, but I'm fairly certain the IO coherency is designed for the "40-bit physical" memory map, if R5 access (after RAT) uses the low 32-bit for DDR that might be the issue. The alias mapping of first 2GB DDR into the low 32-bit of the SoC level 40-bit memory map is something I'd generally not use, but use the 0x08 0000 0000 -> instead. 

    I've changed the code to use a buffer at +2GB in the DDR memory as the "normal" shared memory. The R5f maps that at 0xc0000000 via the RAT, R5f MPU sees it as strongly ordered. The A53 maps the memory 1:1 at 0x880000000. Unfortunately the behavior is the same. If I write the memory from the R5f I need to first invalidate the A53's caches before the change becomes visible to the A53.
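To make the remapping explicit, this is the translation the RAT performs as I understand it (a pure-logic sketch; the real RAT is programmed through per-region control/base/translate registers, and the struct layout here is my assumption, not taken from a TI header):

```c
#include <stdint.h>
#include <stdbool.h>

/* A 32-bit R5f output address falling inside an enabled region's window
 * is replaced by the region's 48-bit translated base plus the offset. */
typedef struct {
    bool     enable;
    unsigned log2_size;   /* window size = 2^log2_size bytes */
    uint32_t base;        /* R5f-side (input) address, size-aligned */
    uint64_t translated;  /* SoC-side (output) 48-bit address */
} rat_region_t;

bool rat_translate(const rat_region_t *r, uint32_t in, uint64_t *out)
{
    uint64_t size = 1ull << r->log2_size;
    if (!r->enable || in < r->base || in >= r->base + size)
        return false;                      /* address not in this window */
    *out = r->translated + (in - r->base); /* keep the offset */
    return true;
}
```

With a region of base 0xc0000000 translated to 0x8.8000.0000, an R5f access to 0xc0000010 comes out as 0x8.8000.0010, which is the setup described above.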

    2. 8.1.4.5 MSMC_COHCTRL Register (Offset = 2048h) [reset = 0h] bit which is a force to broadcast everything at the snoop filter. Looks like you already tried this. So not relevant.

    I've been experimenting with the broadcast mode for the snoop filter the whole time, but that didn't change anything either.

    Sorry about this back and forth, bare-metal A53 is not a primary use case on AM6548. IO coherency is used in the Linux Ethernet/DMA so it should be working, but bare-metal A53 - R5 we just don't have an example.

    I've looked into the TI linux kernel sources for the R5f remoteproc implementation to get an idea how this is handled in Linux, but to me it looks like the R5f rproc implementation maps the R5f memories as uncached:

    https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/drivers/remoteproc/ti_k3_r5_remoteproc.c?h=ti-linux-5.10.y#n928

            kproc->rmem[i].cpu_addr = ioremap_wc(rmem->base, rmem->size);

    Are you sure that the coherency between R5f and A53's caches was ever tested?

    Regards,

    Dominic

  • The R5f maps that at 0xc0000000 via the RAT, R5f MPU sees it as strongly ordered. The A53 maps the memory 1:1 at 0x880000000.

    Sorry if it was not clear. The address out of the R5 after RAT to DDR should be the high alias, 0x08 0000 0000 or larger to reach DDR with coherency logic. Coherency in general is tested with the 40bit memory map.

  • Sorry if it was not clear. The address out of the R5 after RAT to DDR should be the high alias, 0x08 0000 0000 or larger to reach DDR with coherency logic. Coherency in general is tested with the 40bit memory map.

    I guess it's just my description that wasn't clear.

    Physical DDR memory is at 0x8000.0000-0xffff.ffff (first 2 GB) then 0x8.8000.0000-0x8.ffff.ffff (2nd 2GB).

    My shared memory buffer starts at physical address 0x8.8000.0000.

    On the A53 this is accessed as it is. On the R5f, this is remapped to 0xc000.0000 in the R5f memory space via RAT. The address after the RAT is actually 0x8.8000.0000.

  • Part of DDR memory mapped as strongly ordered or device memory via higher priority MPU entry (tried several different mappings)

    I discussed with the design lead for MSMC. The R5 MPU settings are the one place where there could be a difference from, for example, PCIe traffic. The suggested MPU setting for the A53 IO-coherent memory in DDR from the R5 direction was normal-shared. Shared on R5 disables caching. Strongly ordered from R5 bypasses the NB memory attributes (sorry about that, I was not aware of this, but I guess mixing ARMv7 cores and ARMv8 memory views means there are some compromises to make). The design architecture assumes everybody maps strongly ordered as strongly ordered (Device-nGnRnE).

    Our tests show that it doesn't matter (not sure if there are subtle differences) if the R5f memory is strongly ordered, normal+shared+uncached or normal+shared+wbwa memory, i.e. in all cases our writes from the R5f are immediately seen in the transaction log.

    I can see that before narrowing down to the small example you had tried normal-shared from the R5. So I'm not clear whether this will solve the issue of the very long delay, but it should at least help for the smaller example, and clear up one item in this topic (R5 must not use device/strongly-ordered with IO coherency).

  • Hello Pekka,

    thanks for your continued support.

    You're right that I haven't tested with any normal memory mappings in the R5f MPU since I reduced the test case to the small example. I'll give this a try on Monday.

    Regards,

    Dominic

  • Hello Pekka,

    I just tested again with normal-shared configurations. I tried with region access control 0x324 (TEX=b100, CB=b00, S=1) and 0x30c (TEX=b001, CB=b00, S=1) - both should mean normal, outer and inner non-cacheable, shareable memory, and 0x32d (TEX=b101, CB=b01, S=1) - should mean normal, wb-wa-cacheable, shareable (thus actually uncached).
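Those access-control values follow directly from the ARMv7-R field positions (B bit 0, C bit 1, S bit 2, TEX bits 5:3, AP bits 10:8, XN bit 12). A small helper to pack them, with names of my own choosing:

```c
#include <stdint.h>

/* Pack the ARMv7-R MPU Region Access Control fields:
 * B bit 0, C bit 1, S bit 2, TEX bits 5:3, AP bits 10:8, XN bit 12. */
static uint32_t mpu_acr(unsigned tex, unsigned c, unsigned b,
                        unsigned s, unsigned ap, unsigned xn)
{
    return (b & 1u) | ((c & 1u) << 1) | ((s & 1u) << 2) |
           ((tex & 7u) << 3) | ((ap & 7u) << 8) | ((xn & 1u) << 12);
}
```

mpu_acr(4, 0, 0, 1, 3, 0) gives 0x324, mpu_acr(1, 0, 0, 1, 3, 0) gives 0x30c and mpu_acr(5, 0, 1, 1, 3, 0) gives 0x32d, i.e. the three values tested above with AP=b011 (full access).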

    Unfortunately neither configuration showed any difference in behavior, i.e. the data written by the R5f isn't visible to the A53's caches until the A53 manually invalidates its caches. Using the transaction log I can immediately see the writes from the R5f at the MAIN NAVDDR LO probe.

    Strongly ordered from R5 bypasses the NB memory attributes

    The TRM says the following in the NB chapter (10.2.9.2.7): "When a command without memory attributes, an atype = 0, is input to the M2C bridge, this table performs the lookup in parallel and returns the memory attributes to the M2C bridge in the next cycle"

    I couldn't find a specification for the CBA 4.0, nor are the details of the NB's VBUSM->VBUSM.C bridging specified, but I guess that's where things go wrong. Maybe accesses from the R5f carry enough information that the NB doesn't perform the MEM ATTR lookup (atype != 0), but of course not enough to handle coherency on its own?

    Is there any hope that coherency works between A53's caches and the R5f?

    How about the AM64x? Are the A53s coherent with anything (except the other A53) at all? The TRM doesn't mention a VBUSM.C interface, but it does mention the A53's ACE and ACP interfaces.

    Regards,

    Dominic

  • Maybe accesses from the R5f carry enough information that the NB doesn't perform the MEM ATTR lookup (atype != 0), but of course not enough to handle coherency on its own?

    Dominic, this is exactly the issue. For the R5 on AM654x devices the atype is set so that the NB memory attributes are bypassed, effectively resulting in IO coherency with the A53 not being supported for the R5. Sorry for all the effort you've put in, but on digging into this, the mismatch turns out to be specific to AM654x; on newer parts the atype for the R5 is configurable. So no IO coherency for R5 on AM654x devices.

  • Hello Pekka,

    thanks for confirming this.

    newer parts the atype for R5 is configurable

    Hmm, that still leaves room for interpretation. Right now I'm specifically interested in how this works for the AM64x. Is the AM64x coherent with other DMA masters (PCIe, UDMA)? How can I configure the atype for R5f on the AM64x? Should I create a separate thread for this?

    Regards,

    Dominic

    AM64x IO coherency is based on the ACP port, using an asel value to get there, not MSMC/NB. I think it would be better to cover this on a separate thread, as this one is already pretty deep into MSMC/NB details.