TDA4VH-Q1: TDA4VH Scheduler Performance Issue

Part Number: TDA4VH-Q1
Other Parts Discussed in Thread: TDA4VH

Hi experts,

All the tests below are under the following system configuration:

- 8 Cortex-A72 CPU cores, 2 GHz per CPU.
- 16 GB DRAM, 4266 MT/s; about 13 GB is allocated to the kernel.
- L1 instruction cache 48 KB, L1 data cache 32 KB, shared L2 cache 2 MB, L3 MSMC cache 1 MB.
- Linux kernel 5.10.120, 64K page size, Transparent Huge Pages (THP) set to `madvise` (see the quick check below).
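
The page size and THP settings can be double-checked on the target; a minimal sketch using the standard libc/sysfs interfaces (assuming your kernel exposes them):

 # Page size (expect 65536 on a 64K-page kernel):
 $ getconf PAGESIZE
 # Active THP mode is the bracketed entry, e.g. "always [madvise] never":
 $ cat /sys/kernel/mm/transparent_hugepage/enabled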

On TDA4VH, CPUs within one cluster can communicate with each other much faster than CPUs across different clusters. A simple hackbench run demonstrates this: hackbench pinned to 4 CPUs within a single cluster versus 4 CPUs split across the two clusters shows a large contrast:

 $ pwd
 /run/media/mmcblk1p1/
 
 # Within a cluster:
 $ ./taskset.util-linux -c 0,1,2,3 ./hackbench -p -T -l 20000 -g 1
 Running in threaded mode with 1 groups using 40 file descriptors each (== 40 tasks)
 Each sender will pass 20000 messages of 100 bytes
 Time: 5.148
 
 # Across clusters:
 $ ./taskset.util-linux -c 0,1,6,7 ./hackbench -p -T -l 20000 -g 1
 Running in threaded mode with 1 groups using 40 file descriptors each (== 40 tasks)
 Each sender will pass 20000 messages of 100 bytes
 Time: 8.873
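
For reference, the cluster boundaries can be read back from sysfs: the A72 L2 is shared per cluster, so the CPUs sharing an L2 form a cluster. A minimal sketch, assuming cacheinfo is populated and index2 is the unified L2 (which is the case on A72):

 $ for c in /sys/devices/system/cpu/cpu[0-7]; do
 >   echo "$c: $(cat $c/cache/index2/shared_cpu_list)"
 > done
 # expect 0-3 for cpu0..cpu3 and 4-7 for cpu4..cpu7 on a 4+4 layout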

Below is some comparison data gathered with hackbench. We used the command below, varying the '-g' parameter; for each '-g' value we ran the command 5 times and averaged the times (see the averaging sketch after the example output). Hackbench reports the time needed to complete a given number of message transmissions between a given number of tasks, for example:

 # One cluster:
 $ ./taskset.util-linux -c 0,1,2,3 ./hackbench -p -T -l 20000 -g 2
 Running in threaded mode with 2 groups using 40 file descriptors each (== 80 tasks)
 Each sender will pass 20000 messages of 100 bytes
 Time: 12.434
 
 # Two Clusters (All):
 $ ./taskset.util-linux -c 0,1,2,3,4,5,6,7 ./hackbench -p -T -l 20000 -g 2
 Running in threaded mode with 2 groups using 40 file descriptors each (== 80 tasks)
 Each sender will pass 20000 messages of 100 bytes
 Time: 12.102
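
The per-'-g' averaging was done along these lines; a minimal sketch (the loop and the awk one-liner are illustrative, not our exact script):

 $ for g in 2 3 4 5 6 7 8; do
 >   for i in 1 2 3 4 5; do
 >     ./taskset.util-linux -c 0,1,2,3 ./hackbench -p -T -l 20000 -g $g
 >   done | awk -v g=$g '/^Time:/ { sum += $2; n++ } END { printf "g=%d avg=%.3f\n", g, sum/n }'
 > done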

Below are the averaged hackbench results (time in seconds; lower is better):

g=                   2        3        4        5        6        7        8
One Cluster          12.434   23.152   34.147   44.810   54.718   64.253   74.181
Two Clusters (All)   12.102   15.228   20.592   29.302   40.926   51.873   60.225
Improvement          +2.67%   +34.22%  +39.70%  +34.61%  +25.21%  +19.27%  +18.81%

From the above data, we can see that the two-cluster mode improves the test performance greatly. This is because the two-cluster mode has more CPU cores (8 CPUs), and the kernel SCHED_MC sched domain scans a cluster before scanning the whole LLC, trying to gather related tasks in one cluster (the domain hierarchy can be inspected as sketched after the next table). But this performance is still extremely poor. Through some simple tests, we boldly speculate that the best performance is most likely obtained by using only 6 CPUs. Below is the hackbench result with 6 CPUs:

g=                   2        3        4        5        6        7        8
Two Clusters (All)   12.102   15.228   20.592   29.302   40.926   51.873   60.225
CPUs 0,1,2,3,4,5     8.591    14.323   22.803   33.964   43.350   47.539   53.198
Improvement          +29.01%  +5.94%   -10.74%  -15.91%  -5.92%   +8.36%   +11.67%

From the above data, the 6-CPU configuration improves performance greatly for groups 2, 3, 7, and 8, while performance drops significantly for groups 4, 5, and 6.
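
To confirm which domain levels the scheduler actually sees on this kernel, the domain names can be dumped; a minimal sketch, assuming CONFIG_SCHED_DEBUG is enabled (the /proc path is the mainline 5.10 one; vendor kernels may differ):

 $ grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
 # one name per level, innermost first (e.g. MC, DIE); a kernel with
 # cluster-aware scheduling shows an extra CLS level below MC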

Thanks 

QuanLi

  • Hi QuanLi,

    Can you confirm the below:

    • All tests are done on the TI EVM?
    • All tests are done using 8.6 Linux SDK?
    • Are any additional tools needed on top of the SDK for the above testing?

    - Keerthy

  • Hi Keerthy

    All tests are done on the TI EVM?

    No, on our own TDA4VH board (16 GB DRAM, 4266 MT/s; about 13 GB allocated to the kernel).

    All tests are done using 8.6 Linux SDK?

    Yes

    Are any additional tools needed on top of the SDK for the above testing?

    No additional tools; the tests only use hackbench.

    Thanks 

    QuanLi

  • Hello,

    I've not seen a detailed workload breakdown of hackbench, so I can't comment in depth. Workloads with smaller cache footprints tend to do better using all 8 cores; some things with large memory footprints do better with a 3+3 setup, as there is some bottlenecking in each cluster's L2FEQ. Your runs showed a 4+2 split was better. You might get more with 3+3; it depends.
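
    If it helps, a 3+3 pinning can be tried the same way as your other runs; a minimal sketch, assuming the clusters are CPUs 0-3 and 4-7 as your earlier tests imply (the exact CPU pick is illustrative):

     # 3 CPUs from each cluster:
     $ ./taskset.util-linux -c 0,1,2,4,5,6 ./hackbench -p -T -l 20000 -g 2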

    Do you have any concurrent work running on the C7x's and MMAs, or are you benchmarking just the A72 clusters against an otherwise quiet system? The TDA4VH's performance is set up so that it can run full-system concurrent use cases. The A72 clusters by themselves will not be able to max out the memory controllers; they can't issue and track enough transactions natively (even if they were allowed to run to their max). The max benchmarks we compare are multi-camera + AI + GPU + system use cases, and these need to run at target FPS. When looking at resource (like DDR) usage for max use cases, most of it is consumed. In an A72-only case, that will not be so. The targets are set to system use case + safety within a power budget, all at some $cost point.

    Regards,
    Richard W.