Tool/software:
Hello,
Could you please provide more details concerning the L2 and L3 cache handling on the ARM Cortex-A72 compute cluster?
Our current setup is:
Board: J784S4 custom board
PDK 09.02.00.30
Linux
SPL Boot
MSMC: 6 MB of this memory is configured as L3 cache at U-Boot via board-cfg.yaml:

# msmc_cache_size calculation:
# If the whole memory is X MB, the value you write to this field is n.
# The value of n sets the cache size as n * X/64. The value of n should
# be given in steps of 4, which makes the size of the cache configurable
# in steps of X/8 MB.
# Simplified: n = Cache_in_MB * 8
msmc:
  subhdr:
    magic: 0xA5C3
    size: 5
  # enable 6 MB MSMC cache
  msmc_cache_size: 0x30
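
For our 6 MB this gives n = 6 * 8 = 48 = 0x30, which can be sanity-checked in any shell:

    $ printf '0x%02x\n' $((6 * 8))
    0x30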
The corresponding kernel device tree node is:

&msmc_l3 {
	cache-size = <0x600000>;   // set the L3 cache size to 6 MB
	cache-line-size = <128>;   // cache line size is 128 bytes
	cache-sets = <2048>;       // number of cache sets
};
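
To double-check that the kernel actually picked up these values, the standard Linux cacheinfo sysfs nodes can be read back (a quick sketch; the paths assume the usual cacheinfo sysfs layout):

    # print level, type and size of every cache the kernel reports for CPU 0
    for d in /sys/devices/system/cpu/cpu0/cache/index*; do
        echo "$d: L$(cat "$d/level") $(cat "$d/type") $(cat "$d/size")"
    done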
Output of lscpu:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0,1,4,5
Off-line CPU(s) list: 2,3,6,7
Vendor ID: ARM
Model name: Cortex-A72
Model: 0
Thread(s) per core: 1
Core(s) per cluster: 4
Socket(s): -
Cluster(s): 1
Stepping: r1p0
BogoMIPS: 400.00
Flags: fp asimd aes pmull sha1 sha2 crc32 cpuid
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 192 KiB (4 instances)
L2: 4 MiB (2 instances)
L3: 6 MiB (1 instance)
Our current software situation is:
On the first A72 cluster (cluster 0) a dotnet runtime hosts a couple of C# applications, and on the other cluster (cluster 1) a C++ real-time application is running.
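
For completeness, the separation corresponds to plain CPU-affinity pinning along these lines (simplified sketch; the mapping CPUs 0,1 = cluster 0 and CPUs 4,5 = cluster 1 is inferred from the lscpu output above, and the binary names are placeholders):

    # dotnet host on cluster 0 (online CPUs 0,1)
    taskset -c 0,1 dotnet OurHost.dll &
    # real-time application with SCHED_FIFO priority 80 on cluster 1 (online CPUs 4,5)
    chrt -f 80 taskset -c 4,5 ./rt_application &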
Our problem is:
The dotnet runtime spawns one garbage collection (GC) thread for each CPU core it runs on (cluster 0). When the GC threads occasionally do their work, the threads of our real-time application on the second A72 cluster (cluster 1) suffer interference and take almost twice their usual CPU time. In some situations this leads to missed RT deadlines, resulting in error conditions in our application. Our theory is that the interference is caused by (a) cache invalidations, especially in the L3 cache shared between both A72 clusters, (b) DDR memory controller bus saturation, or (c) both phenomena.
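
As a side note: on the .NET side the number of server-GC heaps (and thus GC threads) can be bounded via documented runtime settings, independent of the TI-side questions below (a sketch; the DOTNET_ environment-variable names assume a reasonably recent runtime, and the values are read as hex):

    # limit server GC to 2 heaps/threads and affinitize them to CPUs 0,1 (mask 0x3)
    export DOTNET_gcServer=1
    export DOTNET_GCHeapCount=2
    export DOTNET_GCHeapAffinitizeMask=3
    dotnet OurHost.dll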
We have some questions to verify our theory:
1. How can we check our hypothesis, e.g. by measuring dedicated performance counters for cache misses, bus saturation, etc.? (A first sketch of what we have in mind follows after this list.)
2. Is there a possibility to separate/partition the L3 cache and assign the partitions to the two A72 clusters individually?
3. Are there options to increase the DDR memory bus throughput (should it prove to be the bottleneck)?
4. Are there other possible causes we did not think of?
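
Regarding question 1, this is the direction we have in mind (a sketch; the generic perf aliases may not all map onto the A72 PMU on this kernel, so raw Cortex-A72 PMU event numbers from the TRM are given as a fallback, and we are aware the core PMU cannot observe the shared MSMC/L3 directly):

    # sample the RT cores (cluster 1) system-wide for 10 s while the GC is active
    perf stat -C 4,5 -e cache-references,cache-misses,bus-cycles -- sleep 10

    # fallback with raw Cortex-A72 events: 0x13 MEM_ACCESS, 0x16 L2D_CACHE,
    # 0x17 L2D_CACHE_REFILL, 0x19 BUS_ACCESS
    perf stat -C 4,5 -e r13,r16,r17,r19 -- sleep 10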
Kind Regards
Thomas Willetal