TDA4VH-Q1: TDA4VH Scheduler Performance Issue

Part Number: TDA4VH-Q1
Other Parts Discussed in Thread: TDA4VH

Hi experts,

All the tests below are under the following system configuration:

- 8 Cortex-A72 CPU cores, 2 GHz per CPU.
- 16 GB DRAM, 4266 MT/s; about 13 GB is allocated to the kernel.
- L1 instruction cache 48 KB, L1 data cache 32 KB, shared L2 cache 2 MB, L3 MSMC cache 1 MB.
- Linux kernel 5.10.120, 64K page size, Transparent Huge Pages (THP) set to `madvise` (see the quick check below).
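
The page size and THP settings can be double-checked on the target; a minimal sketch using the standard libc/sysfs interfaces (assuming your kernel exposes them):

 # Page size (expect 65536 on a 64K-page kernel):
 $ getconf PAGESIZE
 # Active THP mode is the bracketed entry, e.g. "always [madvise] never":
 $ cat /sys/kernel/mm/transparent_hugepage/enabled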

On TDA4VH, CPUs within one cluster can communicate with each other much faster than CPUs across different clusters. A simple hackbench run demonstrates this: hackbench pinned to 4 CPUs within a single cluster versus 4 CPUs split across the two clusters shows a large contrast:

 $ pwd
 /run/media/mmcblk1p1/
 
 # Within a cluster:
 $ ./taskset.util-linux -c 0,1,2,3 ./hackbench -p -T -l 20000 -g 1
 Running in threaded mode with 1 groups using 40 file descriptors each (== 40 tasks)
 Each sender will pass 20000 messages of 100 bytes
 Time: 5.148
 
 # Across clusters:
 $ ./taskset.util-linux -c 0,1,6,7 ./hackbench -p -T -l 20000 -g 1
 Running in threaded mode with 1 groups using 40 file descriptors each (== 40 tasks)
 Each sender will pass 20000 messages of 100 bytes
 Time: 8.873
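
For reference, the cluster boundaries can be read back from sysfs: the A72 L2 is shared per cluster, so the CPUs sharing an L2 form a cluster. A minimal sketch, assuming cacheinfo is populated and index2 is the unified L2 (which is the case on A72):

 $ for c in /sys/devices/system/cpu/cpu[0-7]; do
 >   echo "$c: $(cat $c/cache/index2/shared_cpu_list)"
 > done
 # expect 0-3 for cpu0..cpu3 and 4-7 for cpu4..cpu7 on a 4+4 layout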

Below is some comparison data gathered with hackbench. We used the command below, varying the '-g' parameter; for each '-g' value we ran the command 5 times and averaged the times (see the averaging sketch after the example output). Hackbench reports the time needed to complete a given number of message transmissions between a given number of tasks, for example:

 # One cluster:
 $ ./taskset.util-linux -c 0,1,2,3 ./hackbench -p -T -l 20000 -g 2
 Running in threaded mode with 2 groups using 40 file descriptors each (== 80 tasks)
 Each sender will pass 20000 messages of 100 bytes
 Time: 12.434
 
 # Two Clusters (All):
 $ ./taskset.util-linux -c 0,1,2,3,4,5,6,7 ./hackbench -p -T -l 20000 -g 2
 Running in threaded mode with 2 groups using 40 file descriptors each (== 80 tasks)
 Each sender will pass 20000 messages of 100 bytes
 Time: 12.102
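
The per-'-g' averaging was done along these lines; a minimal sketch (the loop and the awk one-liner are illustrative, not our exact script):

 $ for g in 2 3 4 5 6 7 8; do
 >   for i in 1 2 3 4 5; do
 >     ./taskset.util-linux -c 0,1,2,3 ./hackbench -p -T -l 20000 -g $g
 >   done | awk -v g=$g '/^Time:/ { sum += $2; n++ } END { printf "g=%d avg=%.3f\n", g, sum/n }'
 > done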

Below are the averaged hackbench results (time in seconds; lower is better):

g=                   2        3        4        5        6        7        8
One Cluster          12.434   23.152   34.147   44.810   54.718   64.253   74.181
Two Clusters (All)   12.102   15.228   20.592   29.302   40.926   51.873   60.225
Improvement          +2.67%   +34.22%  +39.70%  +34.61%  +25.21%  +19.27%  +18.81%

From the above data, we can see that the two-cluster mode improves the test performance greatly. This is because the two-cluster mode has more CPU cores (8 CPUs), and the kernel SCHED_MC sched domain scans a cluster before scanning the whole LLC, trying to gather related tasks in one cluster (the domain hierarchy can be inspected as sketched after the next table). But this performance is still extremely poor. Through some simple tests, we boldly speculate that the best performance is most likely obtained by using only 6 CPUs. Below is the hackbench result with 6 CPUs:

g=                   2        3        4        5        6        7        8
Two Clusters (All)   12.102   15.228   20.592   29.302   40.926   51.873   60.225
CPUs 0,1,2,3,4,5     8.591    14.323   22.803   33.964   43.350   47.539   53.198
Improvement          +29.01%  +5.94%   -10.74%  -15.91%  -5.92%   +8.36%   +11.67%

From the above data, the 6-CPU configuration improves performance greatly for groups 2, 3, 7, and 8, while performance drops significantly for groups 4, 5, and 6.
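
To confirm which domain levels the scheduler actually sees on this kernel, the domain names can be dumped; a minimal sketch, assuming CONFIG_SCHED_DEBUG is enabled (the /proc path is the mainline 5.10 one; vendor kernels may differ):

 $ grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
 # one name per level, innermost first (e.g. MC, DIE); a kernel with
 # cluster-aware scheduling shows an extra CLS level below MC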

Thanks 

QuanLi

  • Hi QuanLi,

    Can you confirm the below:

    • All tests are done on the TI EVM?
    • All tests are done using 8.6 Linux SDK?
    • Are any additional tools needed on top of the SDK for the above testing?

    - Keerthy

  • Hi Keerthy

    All tests are done on the TI EVM?

    No, on our own TDA4VH board (16 GB DRAM, 4266 MT/s; about 13 GB allocated to the kernel).

    All tests are done using 8.6 Linux SDK?

    Yes

    Are any additional tools needed on top of the SDK for the above testing?

    No additional tools; the tests only use hackbench.

    Thanks 

    QuanLi

  • Hello,

    I've not seen a detailed workload breakdown of hackbench, so I can't comment in depth. Workloads with smaller cache footprints tend to do better using all 8 cores; some things with large memory footprints do better with a 3+3 setup, as there is some bottlenecking in each cluster's L2FEQ. Your runs showed a 4+2 split was better. You might get more with 3+3; it depends.
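
    If it helps, a 3+3 pinning can be tried the same way as your other runs; a minimal sketch, assuming the clusters are CPUs 0-3 and 4-7 as your earlier tests imply (the exact CPU pick is illustrative):

     # 3 CPUs from each cluster:
     $ ./taskset.util-linux -c 0,1,2,4,5,6 ./hackbench -p -T -l 20000 -g 2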

    Do you have any concurrent work running on the C7x's and MMAs, or are you benchmarking just the A72 clusters against an otherwise quiet system? The TDA4VH's performance is set up so that it can run full-system concurrent use cases. The A72 clusters by themselves will not be able to max out the memory controllers; they can't issue and track enough transactions natively (even if they were allowed to run to their max). The max benchmarks we compare are multi-camera + AI + GPU + system use cases, and these need to run at target FPS. When looking at resource (like DDR) usage for max use cases, most of it is consumed. In an A72-only case, that will not be so. The targets are set to system use case + safety within a power budget, all at some $cost point.

    Regards,
    Richard W.