
SK-AM64B: RPmsg with TCM to speed up communication

Part Number: SK-AM64B



Hi,

I have a question related to RPmsg on the AM64. To get the best performance, I tried to change the rpmsg_char_zerocopy example to use TCM. The changed example is running fine: the data is inverted on the Cortex-R5 and validated in a userspace task on the Cortex-A53.

The example uses shared memory for the payload, and RPmsg just to signal that the data is available.

Is it enough to move the data that is exchanged between the cores into TCM?

The other thing that I do not understand:

If I look into the main domain memory map in the TRM, I can see TCM B for Cortex-R5 Core 0.

So I changed the shared memory in the device tree to the TCM address:

apps-shared-memory {
     compatible = "dma-heap-carveout";
     reg = <0x00 0x78100000 0x00 0x2000>;
     no-map;
};
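For context, a carveout like this normally sits inside the reserved-memory node of the device tree; a sketch of the assumed surrounding structure (the cell sizes and node placement shown here are illustrative, only the inner node is from my actual change):

```dts
reserved-memory {
    #address-cells = <2>;
    #size-cells = <2>;
    ranges;

    /* My actual carveout: 8 KB at the SoC-level TCM B address of R5F Core 0 */
    apps-shared-memory {
        compatible = "dma-heap-carveout";
        reg = <0x00 0x78100000 0x00 0x2000>;
        no-map;
    };
};
```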

If I print the address on the R5 with ibuf.addr, it is the right address. If I read out the data with devmem2 on the A53, I see the inverted data. All looks good!

But if I look into the memory map file for R5 Core 0, the address for TCM B is a different address than the one in ibuf.addr.

Could you please tell me how this can be?

Right now I'm using:

SDK 09_02_01_10

Thanks again!

  • Hello Chris,

    Part 1 - general memory usage guidance

    There are multiple ways we could consider memory usage here:

    1) The memory that is used to store instruction code

    2) The memory that is used to store variables, etc, needed to run the program

    3) The shared memory location itself

    Instruction memory 

    Local instruction memory provides single-cycle access - i.e., if all of your instruction code is stored in local instruction memory, then you will be able to execute an instruction every single clock cycle. If your program is too large to fit into the local instruction memory, this gets a lot more complicated. For now, I'll assume all of your instruction code is stored in the local instruction memory.

    Data memory 

    When the R5F needs to access a variable that it saved in memory, where does it go? Local memory accesses (e.g., TCM) will be faster than elsewhere in the chip (e.g., SRAM will take longer to access, DDR will take much longer to access). So in general, people tend to put data that is accessed a lot, or data that is accessed during time-critical applications, in the R5F's local memory. I am not entirely sure how to account for the R5F's cache in these calculations - we can pull in another team member if you need to do a bit of a deeper dive into the math.
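    As a concrete illustration of "putting hot data in the R5F's local memory": on the R5F side this is usually done by placing the buffer in a dedicated linker section that the linker command file maps into a TCM memory region. A minimal C sketch of the idea - the section name `.tcm_data` is an assumption here; in the MCU+ SDK the real section and memory region names come from the example's linker command file:

    ```c
    #include <stdint.h>
    #include <stdio.h>

    #define BUF_SIZE 64u

    /* Hypothetical section name: the linker command file must map
     * ".tcm_data" into the BTCM memory region for this buffer to
     * actually live in TCM on the R5F. */
    __attribute__((section(".tcm_data")))
    static uint8_t shared_buf[BUF_SIZE];

    /* Same kind of work the zerocopy example does: invert each byte in place. */
    static void invert_buffer(uint8_t *buf, uint32_t len)
    {
        for (uint32_t i = 0u; i < len; i++) {
            buf[i] = (uint8_t)~buf[i];
        }
    }

    int main(void)
    {
        for (uint32_t i = 0u; i < BUF_SIZE; i++) {
            shared_buf[i] = (uint8_t)i;
        }
        invert_buffer(shared_buf, BUF_SIZE);
        printf("buf[0]=0x%02X buf[1]=0x%02X\n", shared_buf[0], shared_buf[1]);
        return 0;
    }
    ```

    On a host build the section attribute compiles but the buffer just lands in normal RAM; only the R5F linker placement makes it actually live in TCM.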

    General "getting started" thoughts on how to check where your data is getting stored can be found in the AM64x academy, multicore module here:
     https://dev.ti.com/tirex/explore/node?a=7qm9DIS__LATEST&node=A__AdPavpRhU8yrU-EQk33UdQ__AM64-ACADEMY__WI1KRXP__LATEST

    Shared memory location 

    In general, if the shared memory is inside the processor (i.e., SRAM or TCM), both processor cores will be able to access it faster than if the shared memory was in DDR.

    Off the top of my head I'm not sure if memory accesses to all of these potential memory regions can be considered as 32 bit read/writes, or if some memories have other options that could lead to better data throughput, regardless of the access latency (e.g., 64 bit wide busses, or wider). Let me know if this is something you want us to look into.

    Part 2 - setting up the shared memory location 

    At this point I have not tried running the zerocopy example with the shared memory in TCM. Could you share your Linux devicetree modifications for that? I would be curious to see how you are defining the memory for Linux userspace.

    Part 3 - different addresses to access TCM memory 

    Different cores have different "views". From the perspective of the rest of the processor, the ATCM & BTCM will always be accessed by reading or writing to the same physical address. However, from the perspective of the R5F, it accesses the ATCM & BTCM by reading or writing to a different address range.

    The R5F's local address for ATCM & BTCM is actually programmable. From the TRM R5F chapter:

    6.2.3.2.2 Tightly-Coupled Memories (TCMs)

    TCMs are low-latency, tightly integrated memories for the R5F to use. Either TCM can be used for any
    combination of instruction and/or data. TCM performance is equal to performance on instructions/data that are
    in cache. However, TCMs have some additional advantages over cache. TCMs can be loaded with instructions
    that do not cache well (such as ISRs) or preloaded with code by an external source, before that code is needed,
    to save cache miss time. TCMs are also a good place for blocks of data for intense processing. They can be
    loaded (or pre-loaded by an external source) before the data is needed, saving cache miss time. The data can
    then be directly accessed by an external source, instead of needing to do cache evicts.

    As mentioned, TCMs can be accessed (either read or written) by an external source over the TCM VBUSM
    target interface. This allows instructions or data to be preloaded, or for data to be read out after the R5F has
    processed it. The VBUSM target has a lower priority to accessing TCMs than the R5F but care must be taken
    to keep an external source from reading or writing TCM data that the R5F is working on. This handshaking is
    external to any of the R5FSS hardware.

    ...

    If a TCM is not enabled, then it does not appear in the R5F’s memory view, but it can be accessed by an
    external source. If a TCM is enabled, then its place in the R5F memory map is determined by a combination
    of bootstrap signal and system register. If the CPUn_LOCZRAMA bootstrap signal is high, then the initial base
    address of ATCM is 0x0000_0000 and the initial address of BTCM is 20’h41010. If the CPUn_LOCZRAMA
    bootstrap signal is low, then the initial base address of BTCM is 0x0000_0000 and the initial base address of
    ATCM is 20’h41010.

    Note
    This base address of 0x41010 for ATCM/BTCM based on the CPUn_LOCZRAMA bootstrap only
    affects the R5F’s memory view. The SoC will see the ATCM/BTCM based on the TCM target interface
    regions, as defined in Section 6.2.3.3.2. The base address of either TCM may be overwritten via the
    ATCM or BTCM region register. Care must be taken not to move the base address of a TCM when it
    may be being accessed.

    That description aligns with the memory map information you observed above.
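    One practical consequence of the "handshaking is external to any of the R5FSS hardware" remark above: software has to signal "data is ready" before the other side touches the TCM - which in the zerocopy example is exactly what the RPmsg message does. A minimal host-side sketch of that pattern, using two threads and a C11 atomic flag to stand in for two cores and RPmsg (all names are illustrative):

    ```c
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Stands in for a TCM buffer shared between the R5F and an external
     * master. The "data ready" flag is the handshake the TRM says must be
     * provided by software - the R5FSS hardware does not arbitrate it. */
    #define BUF_SIZE 32u
    static uint8_t shared_buf[BUF_SIZE];
    static atomic_int data_ready = 0;

    /* "R5F" side: fill the buffer, then publish it with a release store. */
    static void *producer(void *arg)
    {
        (void)arg;
        for (uint32_t i = 0u; i < BUF_SIZE; i++) {
            shared_buf[i] = (uint8_t)~i;   /* inverted pattern */
        }
        atomic_store_explicit(&data_ready, 1, memory_order_release);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);

        /* "External" side: poll the flag before touching the buffer. */
        while (!atomic_load_explicit(&data_ready, memory_order_acquire)) {
            /* spin; a real implementation would use RPmsg as the doorbell */
        }
        printf("buf[0]=0x%02X\n", shared_buf[0]);

        pthread_join(t, NULL);
        return 0;
    }
    ```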

    So here's a fun thought experiment: what would happen if the R5F read from the address 0x7810_0000? My suspicion is that the read would still work, but it would take MUCH longer than a single clock cycle to complete. That's because when reading from 0x4101_0000 (assuming the address wasn't changed), the R5F core can directly access the TCM within the R5F subsystem. But if reading from 0x7810_0000, the signal would need to exit the R5F subsystem, go through the processor's bus infrastructure to access the TCM "from the outside" the same way an external core would, and then access the TCM from there.
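    The "same memory, two addresses" situation can be mimicked on any Linux host by mapping one backing store at two different virtual addresses - an analogy only, since mmap() does not model the very different access latencies of the two TCM paths:

    ```c
    #include <stdio.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Host-side analogy for the two views of one TCM: one backing store
     * mapped at two different addresses. On the AM64x the "two views" are
     * the R5F-local address (e.g. 0x4101_0000) and the SoC-level address
     * (e.g. 0x7810_0000); here they are simply two mmap()s. */
    static int demo(void)
    {
        const size_t len = 4096;
        FILE *f = tmpfile();
        if (f == NULL) return -1;
        int fd = fileno(f);
        if (ftruncate(fd, (off_t)len) != 0) return -1;

        uint8_t *view_a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        uint8_t *view_b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (view_a == MAP_FAILED || view_b == MAP_FAILED) return -1;

        view_a[0] = 0xAB;      /* write through one view...          */
        int seen = view_b[0];  /* ...read it back through the other  */

        munmap(view_a, len);
        munmap(view_b, len);
        fclose(f);
        return seen;
    }

    int main(void)
    {
        printf("value read through second view: 0x%02X\n", demo());
        return 0;
    }
    ```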

    Regards,

    Nick

  • Hi Nick,

    thank you so much for the detailed explanation. That helps me a lot. 

    My Device Tree entry is in my first post.

    Your "fun" thought experiment was absolutely right. The R5 core processes the data much faster if I access the data "directly" at 0x4101_0000. I did not realize that the R5 core can access the data via the processor's bus infrastructure, because the address that is communicated via RPmsg (0x7810_0000) isn't in the TRM memory map for the R5 cores.

    Regards,

    Chris