All tests below were run under the following system configuration:
- 8 Cortex-A72 CPU cores (two clusters of four), 2 GHz per core.
- 16 GB DRAM at 4266 MT/s, about 13 GB allocated to the kernel.
- L1 instruction cache 48 KB, L1 data cache 32 KB, shared L2 cache 2 MB per cluster, L3 MSMC cache 1 MB.
- Linux kernel 5.10.120, 64K page size, Transparent Huge Pages (THP) set to `madvise`.
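For reference, these settings can be confirmed on the target. A quick sketch (the sysfs cacheinfo indices are an assumption; index0 is usually the L1 data cache and index2 the L2 on arm64):

$ getconf PAGESIZE                                    # expect 65536 for a 64K-page kernel
$ cat /sys/kernel/mm/transparent_hugepage/enabled     # expect: always [madvise] never
$ cat /sys/devices/system/cpu/cpu0/cache/index0/size  # L1 data cache, expect 32K
$ cat /sys/devices/system/cpu/cpu0/cache/index2/size  # L2 cache, expect 2048K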
The cost of crossing the cluster boundary can be demonstrated with hackbench: running it on 4 CPUs within a single cluster versus 4 CPUs spread across the two clusters shows a large contrast:
$ pwd
/run/media/mmcblk1p1/
# Within a cluster:
$ ./taskset.util-linux -c 0,1,2,3 ./hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each (== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 5.148
# Across clusters:
$ ./taskset.util-linux -c 0,1,6,7 ./hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each (== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 8.873
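The cluster boundary behind this CPU numbering can be double-checked from sysfs. A minimal sketch, assuming index2 is the per-cluster shared L2 (the expected outputs follow from the topology above, not from a captured log):

$ cat /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list   # expect: 0-3
$ cat /sys/devices/system/cpu/cpu6/cache/index2/shared_cpu_list   # expect: 4-7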
Below is some comparison data collected with the hackbench tool. We used the command below, varying the '-g' parameter; for each '-g' value we ran the command 5 times and averaged the times. hackbench reports the time needed to complete a given number of message transmissions between a given number of tasks, for example:
# One cluster:
$ ./taskset.util-linux -c 0,1,2,3 ./hackbench -p -T -l 20000 -g 2
Running in threaded mode with 2 groups using 40 file descriptors each (== 80 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 12.434
# Two Clusters (All):
$ ./taskset.util-linux -c 0,1,2,3,4,5,6,7 ./hackbench -p -T -l 20000 -g 2
Running in threaded mode with 2 groups using 40 file descriptors each (== 80 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 12.102
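The per-cell numbers in the tables below are 5-run averages. A minimal sketch of how such an average can be collected (the awk parsing of hackbench's "Time:" line is illustrative):

$ for i in 1 2 3 4 5; do
>   ./taskset.util-linux -c 0,1,2,3 ./hackbench -p -T -l 20000 -g 2
> done | awk '/^Time:/ { sum += $2; n++ } END { printf "avg: %.3f\n", sum / n }'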
Below are the full hackbench results (times in seconds; the Improvement row is relative to One Cluster, e.g. +2.67% = (12.434 - 12.102) / 12.434):

g= | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|
One Cluster | 12.434 | 23.152 | 34.147 | 44.810 | 54.718 | 64.253 | 74.181 |
Two Clusters (All) | 12.102 | 15.228 | 20.592 | 29.302 | 40.926 | 51.873 | 60.225 |
Improvement | +2.67% | +34.22% | +39.70% | +34.61% | +25.21% | +19.27% | +18.81% |
From the data above, we can see that the two-cluster mode improves performance greatly. This is because the two-cluster mode has more CPU cores (8 instead of 4), and the kernel's SCHED_MC sched domain scans within a cluster before scanning the whole LLC, trying to gather related tasks onto one cluster. But this performance is still extremely poor. Based on some simple tests, we speculate that the best performance is most likely obtained by using only 6 CPUs. Below are the hackbench results with 6 CPUs:
g= | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|
Two Clusters (All) | 12.102 | 15.228 | 20.592 | 29.302 | 40.926 | 51.873 | 60.225 |
0,1,2,3,4,5 Cores | 8.591 | 14.323 | 22.803 | 33.964 | 43.350 | 47.539 | 53.198 |
Improvement | +29.01% | +5.94% | -10.74% | -15.91% | -5.92% | +8.36% | +11.67% |

Here the Improvement row is relative to the Two Clusters (All) times.
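For reference, the "0,1,2,3,4,5 Cores" row corresponds to pinning hackbench to the first six cores in the same way as the earlier runs, e.g.:

$ ./taskset.util-linux -c 0,1,2,3,4,5 ./hackbench -p -T -l 20000 -g 2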