
TDA4VH-Q1: cache handling (L2, L3_msmc) - integration of our software stack

Part Number: TDA4VH-Q1


Hello,

Could you please provide more details concerning the L2 and L3 cache handling on the ARM Cortex-A72 compute cluster?

Our current setup is:

Board: J784S4 custom board
PDK 09.02.00.30
Linux
SPL Boot

MSMC: 6 MB of the MSMC memory is configured as L3 cache (board-cfg.yaml) in U-Boot:

    # msmc_cache_size calculation:
    # If the whole memory is X MB, the value you write to this field is n.
    # The value of n sets the cache size to n * X/64 MB. The value of n should
    # be given in steps of 4, which makes the cache size configurable in
    # steps of X/8 MB.
    # Simplified: n = Cache_in_MB * 8
    
    msmc:
        subhdr:
            magic: 0xA5C3
            size: 5
        # enable 6MB msmc cache
        msmc_cache_size : 0x30
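
(Worked out for our case using the simplified formula above: n = 6 * 8 = 48 = 0x30, which is the value we write.)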


The corresponding kernel config is:

&msmc_l3 {
	cache-size = <0x600000>;  // Set the L3 cache size to 6 MB
	cache-line-size = <128>;  // Cache line size is 128 bytes
	cache-sets = <2048>;      // Number of cache sets
};
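
As a cross-check, we read back the cache topology Linux actually reports from the cacheinfo sysfs nodes (the index number of the L3 node is an assumption and may differ on other setups):

    # Dump the properties of the L3 node as seen by the kernel (bash brace expansion)
    grep . /sys/devices/system/cpu/cpu0/cache/index3/{level,type,size,coherency_line_size,number_of_sets,ways_of_associativity}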


Output of lscpu:
lscpu
Architecture:            aarch64
  CPU op-mode(s):        32-bit, 64-bit
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0,1,4,5
  Off-line CPU(s) list:  2,3,6,7
Vendor ID:               ARM
  Model name:            Cortex-A72
    Model:               0
    Thread(s) per core:  1
    Core(s) per cluster: 4
    Socket(s):           -
    Cluster(s):          1
    Stepping:            r1p0
    BogoMIPS:            400.00
    Flags:               fp asimd aes pmull sha1 sha2 crc32 cpuid
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   192 KiB (4 instances)
  L2:                    4 MiB (2 instances)
  L3:                    6 MiB (1 instance)




Current software situation is:
On the first A72 cluster (cluster 0) we have a dotnet runtime that hosts a couple of C# applications, and on the
other cluster (cluster 1) a C++ real-time application is running.

Our problem is:
The dotnet runtime spawns one garbage-collection (GC) thread for each CPU core it runs on (cluster 0).
When the GC threads occasionally do their work, the threads of our real-time
application running on the second A72 cluster (cluster 1) see almost doubled CPU times.
This leads to missed RT deadlines in some situations, resulting in error conditions in our application.

Our theory is:
The garbage-collection work initiated by dotnet causes heavy memory transactions, resulting in (a) cache
invalidations, especially of the L3 cache shared between both A72 clusters, (b) DDR memory controller/bus saturation, or (c) both.


We have some questions to verify our theory:

1. How can we check our hypothesis, e.g. by measuring dedicated performance counters for cache misses, bus saturation, etc.? (A rough sketch of what we have in mind follows this list.)
2. Is there a possibility to separate/partition the L3 cache and assign the partitions to the two A72 clusters individually?
3. Are there options to increase the DDR memory bus throughput (if it proves to be the bottleneck)?
4. Are there other causes we did not think of?
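
For question 1, this is roughly the kind of measurement we have in mind with perf on the A72 PMUs (the core numbers are taken from the lscpu output above; the raw event numbers are our assumption based on the ARMv8/Cortex-A72 common PMU events and still need to be verified):

    # Count cache/bus events on the online cores of the RT cluster while the GC is active
    perf stat -a -C 4,5 -e cycles,instructions,cache-references,cache-misses,bus-cycles sleep 10

    # Raw Cortex-A72 PMU events, if the symbolic names are not available:
    #   0x13 MEM_ACCESS, 0x16 L2D_CACHE, 0x17 L2D_CACHE_REFILL, 0x19 BUS_ACCESS
    perf stat -a -C 4,5 -e r13,r16,r17,r19 sleep 10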

Kind Regards
Thomas Willetal

 




  • Hi,

    One confirmation on the hypothesis: have you already disabled the GC threads and verified that the RT deadlines are then met?

    Is there a possibility to separate/partition the L3 cache and assign these to the two A72 clusters individually?

    This is not possible. The L3 cache is shared by both clusters; we do not have a provision for customizing it per core.

    Best Regards,

    Keerthy 

  • Hi Keerthy,

    Thank you, and yes, we have already checked this with the GC threads disabled. We only observe longer CPU execution times when the GC threads are active.

    Kind Regards,
    Thomas


  • Hi,

    I have looped in our expert. We will get back to you on this. 

    Thanks, 

    Keerthy 

  • Hello,

    What is the magnitude of the execution jitter you are observing (5 µs, 5 ms, 50 ms, ...)? Often first-order effects come from SW and its scheduling. Second-order effects (which can be significant depending on your needs) can happen due to shared structures (coherent L1/L2 caches, shared L3, shared DDR). For harder RT needs, MSMC-SRAM would be the best choice.

    Are you running SMP Linux across the clusters but using affinity to pin your RT tasks? Or are you running AMP instances? Some descoping can be achieved by marking the AMP clusters' MMU mappings as 'non-shared'; this will reduce delays which may be coherency-snoop related.
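
    If you are not already pinning, a minimal sketch of the kind of isolation/pinning I mean (core numbers and binary names are only placeholders based on your lscpu output):

        # Keep general-purpose work off the RT cluster, e.g. via the kernel command line:
        #   isolcpus=4,5 nohz_full=4,5
        # Pin the RT application to the isolated cores with an RT scheduling class:
        taskset -c 4,5 chrt -f 80 ./rt_application
        # Keep the dotnet host (and its GC threads) on cluster 0:
        taskset -c 0,1 dotnet MyApp.dll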

    If your application can run on a TI J784S4 EVM, I find external hardware trace out of the MIPI-60 connector into a Lauterbach receiver to be very good. Both system and processor trace (along with PMU exports) provide a lot of HLOS, RTOS, and HW execution timing information.

    If your clusters can cross-communicate, perhaps you can signal the non-RT side when it is safe (or unsafe) to do GC operations. This level of coordination might help avoid the issue.
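
    Separately from that signalling idea, I believe the .NET server GC also exposes runtime knobs to cap and affinitize its heaps/threads rather than creating one per core; the exact variable names depend on the runtime version (older runtimes use the COMPlus_ prefix), so please verify them against the .NET documentation, but roughly:

        # Launch environment for the dotnet host (names/values to be verified for your runtime):
        export DOTNET_gcServer=1                  # use the server GC
        export DOTNET_GCHeapCount=2               # limit the number of GC heaps/threads
        export DOTNET_GCHeapAffinitizeRanges=0-1  # keep GC threads on cluster-0 cores 0-1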

    Regards,
    Richard W.
  • Hello Richard,

    Thank you for the answer. We have observed a jitter of 5 ms, and it affects all (SW) threads that we see executing in a kernel trace log (visualized in KernelShark).
    Our system is an SMP Linux system running on the two A72 clusters, and additional firmware components are running on the R5F cores, operated in split mode.

    The available 8 MB of MSMC memory is split into 6 MB of L3 cache, and the remaining 2 MB is used by our R5F firmwares.
    The different R5F applications use this memory partly as a fast text section. It is also used as shared memory for data exchange between the A72 Linux applications and the R5F firmware.

    In another forum thread we had the problem that code execution on an R5F stalled:
    TDA4VH-Q1: Stalls in R5F execution - Processors forum - Processors - TI E2E support forums

    The behavior could be improved by adapting the MPU table on the R5F side and using some of the system performance features, like MSMC QoS and DRAM CoS.
    The documentation we have found is https://www.ti.com/lit/an/spraci6/spraci6.pdf

    Our custom hardware does not have a MIPI-60 JTAG connector; our connector is the 14-pin TI connector with EMU0/1. We have used this for the R5F Lauterbach traces.


    Questions:
    1. Could it be that something similar to what happens on the R5F side also happens on the A72?
        Maybe our RT application stalls similarly to the R5F, particularly when the .NET GC threads are running: the GC work invalidates/evicts the cached data, so it has to be reloaded from the LPDDR4, and as a consequence the RT app stalls until the data is refetched.
    2. We also found something called MSMC Way Group Partitioning. Does this setting also have an influence on the MSMC L3 cache?
    3. The J784S4 SoC provides a lot of real-time optimization settings: MSMC QoS, DRAM CoS, MSMC Way Group Partitioning.
         Are there examples available with best-practice values for real-time applications?
    4. Are any registers available, like performance counters, to measure the effect of changes to these optimization parameters?
    5. Which counters have to be monitored/traced (in the TRACE32 Peripherals view) in order to draw conclusions about the cache and DDR status?

    Kind Regards
    Thomas

  • Hello Thomas,

    5 ms seems to be in the wrong scale range, too big for a single HW bubble, though depending on the total execution time of the RT routine, many HW events may be combining. I would guess SW is the first-order contributor, but certainly HW will play a part.

    Can you simplify the system and scale it down such that everything runs on one cluster? If you see the jitter with 4 cores vs. 8, it will be an easier issue to work on. Any other simplifying steps could help. Perhaps reduce the RT cores to simple flag setting (no real work) and see if GC still disrupts at the same magnitude.
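
    A quick way to run that experiment without rebuilding anything (core numbers assumed from your lscpu output) is to take the second cluster's cores offline, or to limit the boot CPU count:

        # Take the cluster-1 cores offline at runtime (2,3,6,7 are already offline in your dump):
        echo 0 > /sys/devices/system/cpu/cpu4/online
        echo 0 > /sys/devices/system/cpu/cpu5/online
        # ...or boot with "maxcpus=4" on the kernel command line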

    If, after checking one cluster, you find it is a cross-cluster issue, and some of the application's cross-cluster communication depends on underlying SEV/WFEs, then a change like the one I suggested in this other E2E thread would likely help your system. The current SW/FW images fail to link the clusters through the CLEC properly, which results in SEVs only working inside a cluster, not across clusters. It might be that your code has to wait for some other event to wake up, which takes more time. In the video I show how to use your TRACE32 to fix this; it would be a quick runtime test to make. Eventually this will be part of our base images after the JIRA is handled.

    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1440747/tda4vh-q1-event-communication-between-clusters-is-not-possible/5527113#5527113

    Each cluster can be set to a different priority, which can boost a cluster designated as RT. The cluster priority is located in each CPU's local address space. A quick test can be done using your debugger: set the core focus to a core in 1-4 and set one cluster priority, then set the focus to a core in 5-8 and set the other priority. Probably the best place to set this is in some ATF code before launch, but maybe some other way is acceptable. For cluster0 the priority is at ASD:0x6000020 and for cluster1 the priority is at 0x61000020.

    When touching QoS settings, I find it useful to use the cptracer probes which exist around the chip. By setting a transaction probe at an endpoint like an EMIF, you can see that the QoS settings are taking effect. A bandwidth or latency probe can be a good way to measure effects. In the example CMM projects I have some comments in the notes directory and also in the cptracer.cmm files. There are also some companion PPTs which you may have received from your TI field rep. I'll attach a short video of verifying a priority setting.

    Attached are the scripts used to make the video. The password for the zip is "T32". With tracers, using an address or routeid (which master) filter helps sort traffic. The script map_active_bus_routes.cmm will be helpful as it lists all the masters in one spot.

    For tracing, offchip trace is best, using the EMU pins; however, there are small onchip trace buffers which can be effective for statistics (like bandwidth) if the sample period is set high or if filtering is used. Your 14-pin connector will not help much for offchip trace, as it doesn't have enough trace pins, nor does it naturally route to an external trace receiver. With custom adaptor hacks you can get 1-bit trace offchip, which can give some clues but is not wide enough, so it will overflow without filter throttling. Prototyping on an EVM with full trace on the MIPI-60 is probably the fastest way to experiment.

    A rescan of my previous comments may be in order. Any memory which can be marked as MMU non-shared will reduce HW overhead and remove a possible jitter source. In systems which have final-stage independent streams coming from the C7x this is done to increase efficiency. SMP Linux will have everything shared by default, but you can make carveouts if they make sense. Also, using MSMC-SRAM instead of cache makes for something much more predictable than a shared LPDDR.

    Hopefully the above points provide some experimental angles. An E2E thread doesn't scale well for this topic.
    Regards,
    Richard W.