EVMK2H: Benchmark program between DSP and ARM

Chanh Nguyen64

Part Number: EVMK2H
Other Parts Discussed in Thread: MATHLIB, SYSBIOS, TEST2

Hi TI,

1. Do you have an example benchmarking program to compare the performance of the DSP and the ARM? We want to study which tasks should run on the DSP and which on the ARM.

2. I tried to implement a small program that involves mostly mathematical calculation. My expectation is that the DSP should be faster (as stated in TI DSP Benchmarking - SPRAC13). However, what I see is that the ARM execution time is shorter. My program just runs after loading the GEL file, with no specific settings for either the DSP or the ARM. What could be the reason for the observed performance?

3. Is there any reference or guideline on which programs should run on the DSP and which on the ARM?

Thanks a lot.

over 5 years ago

0 lding over 5 years ago

TI__Guru* 95265 points

Hi,

The document you mentioned, https://www.ti.com/lit/an/sprac13/sprac13.pdf, is our application note for A15 and C66x benchmarking. Table 1 summarizes the results, where the C66x shows advantages over the A15.

Typically, anything that involves heavy signal processing, matrix operations, or linear algebra is a good fit for the DSP. Programs with more control code are a good fit for the ARM processor.

>>> I tried to implement a small program that involves mostly mathematical calculation >>>> For any benchmarking, you need to set up the framework correctly: cache, cycle counter, memory placement, compiler/linker options, etc., so that the comparison between different processor architectures is fair. It also matters whether optimized library code (like TI MATHLIB) is used for better performance.

Regards, Eric

0 Chanh Nguyen64 over 5 years ago in reply to lding

Expert 1230 points

Hi Eric,

Thanks for your advice.

Do you have any example DSP and ARM projects that contain the proper settings for benchmarking?

0 lding over 5 years ago in reply to Chanh Nguyen64

TI__Guru* 95265 points

Hi,

The attached project is Dhrystone for the C66x. It should report a number like "Normalized MIPS/MHz = 0.8297". You can then replace the Dhrystone code with your own application.

For the A15, I will try to find one for you.

Regards, Eric

Dhrystone_C66.zip

0 lding over 5 years ago in reply to lding

TI__Guru* 95265 points

Hi,

C66x sample output:

[C66xx_0]
Dhrystone Benchmark, Version 2.1 (Language: C)

Program compiled without 'register' attribute

Please give the number of runs through the benchmark:
Execution starts, 1000000 runs through Dhrystone
Execution ends

Final values of the variables used in the benchmark:

Int_Glob: 5
should be: 5
Bool_Glob: 1
should be: 1
Ch_1_Glob: A
should be: A
Ch_2_Glob: B
should be: B
Arr_1_Glob[8]: 7
should be: 7
Arr_2_Glob[8][7]: 1000010
should be: Number_Of_Runs + 10
Ptr_Glob->
Ptr_Comp: 8435272
should be: (implementation-dependent)
Discr: 0
should be: 0
Enum_Comp: 2
should be: 2
Int_Comp: 17
should be: 17
Str_Comp: DHRYSTONE PROGRAM, SOME STRING
should be: DHRYSTONE PROGRAM, SOME STRING
Next_Ptr_Glob->
Ptr_Comp: 8435272
should be: (implementation-dependent), same as above
Discr: 0
should be: 0
Enum_Comp: 1
should be: 1
Int_Comp: 18
should be: 18
Str_Comp: DHRYSTONE PROGRAM, SOME STRING
should be: DHRYSTONE PROGRAM, SOME STRING
Int_1_Loc: 5
should be: 5
Int_2_Loc: 13
should be: 13
Int_3_Loc: 7
should be: 7
Enum_Loc: 1
should be: 1
Str_1_Loc: DHRYSTONE PROGRAM, 1'ST STRING
should be: DHRYSTONE PROGRAM, 1'ST STRING
Str_2_Loc: DHRYSTONE PROGRAM, 2'ND STRING
should be: DHRYSTONE PROGRAM, 2'ND STRING

Total 686000129 cycles spend for 1000000 iterations
Microseconds for one run through Dhrystone: 0.7
Dhrystones per Second: 1457725.8

Normalized MIPS/MHz = 0.8297

For the A15 core, we don't have an example on K2H, but we have one on other processors. I created a SYSBIOS one for the K2H A15 for your reference. The sample output:

Total 176305320 cycles spend for 1000000 iterations
Microseconds for one run through Dhrystone:
Dhrystones per Second:

Normalized MIPS/MHz =

For some reason, floating point didn't print properly under the SYSBIOS environment. You can do the rest of the calculation from the cycle count; it should be

Dhrystones per Second: 5671978.6

Normalized MIPS/MHz = 3.228
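For anyone hitting the same floating-point print problem, here is a minimal sketch that derives both figures from the cycle count with integer arithmetic only. It assumes the usual 1757 Dhrystones/s VAX 11/780 reference; the function names are made up for illustration and are not part of the attached projects, and plain printf stands in for whatever print routine your project uses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define VAX_DHRYSTONES_PER_SEC 1757ULL /* VAX 11/780, the 1 MIPS reference */

/* Unsigned division rounded to nearest. */
static uint64_t div_round(uint64_t num, uint64_t den)
{
    assert(den != 0);
    return (num + den / 2) / den;
}

/* Dhrystones per second, scaled by 10 (one decimal place). */
uint64_t dhry_per_sec_x10(uint64_t cycles, uint64_t runs, uint64_t clock_mhz)
{
    /* runs / (cycles / (clock_mhz * 1e6)) * 10 */
    return div_round(runs * clock_mhz * 10000000ULL, cycles);
}

/* Normalized MIPS/MHz, scaled by 10000 (four decimal places).
 * The clock cancels out, so it is not a parameter here. */
uint64_t mips_per_mhz_x10000(uint64_t cycles, uint64_t runs)
{
    return div_round(runs * 10000000000ULL, cycles * VAX_DHRYSTONES_PER_SEC);
}

/* Print both figures without any %f conversion. */
void print_dhrystone_report(uint64_t cycles, uint64_t runs, uint64_t clock_mhz)
{
    uint64_t dps  = dhry_per_sec_x10(cycles, runs, clock_mhz);
    uint64_t mips = mips_per_mhz_x10000(cycles, runs);

    printf("Dhrystones per Second: %lu.%lu\n",
           (unsigned long)(dps / 10), (unsigned long)(dps % 10));
    printf("Normalized MIPS/MHz = %lu.%04lu\n",
           (unsigned long)(mips / 10000), (unsigned long)(mips % 10000));
}
```

Calling print_dhrystone_report(176305320, 1000000, 1000) with the A15 cycle count above gives 5671978.6 Dhrystones per second and a normalized figure of 3.2282. The intermediate products stay below 2^64 for run counts and clocks of this size, but would overflow for much larger inputs.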

Hope both projects can be used for your bench-marking framework.

Regards, Eric

Dhrystone_A15.zip

0 Chanh Nguyen64 over 5 years ago in reply to lding

Expert 1230 points

Hi Eric,

I have tried the provided example and get results similar to yours.

Just a few questions about how the result is calculated:

Normalized MIPS/MHz = Dhrystones_Per_Second/1757.0/1000.0 --> Can I say that this calculation assumes a processor clock of 1000.0 MHz? So if I change the processor PLL I should change the number 1000.0 as well, am I correct? From my searching, the number 1757 comes from: "The industry has adopted the VAX 11/780 as the reference 1 MIP machine. The VAX 11/780 achieves 1757 Dhrystones per second".

Thanks a lot.

0 lding over 5 years ago in reply to Chanh Nguyen64

TI__Guru* 95265 points

Hi,

So if I change the processor PLL I should change the number 1000.0 as well, am I correct? >>> Correct, you need to change this number to match the PLL setting.

Microseconds and Dhrystones_Per_Second will depend on the processor clock, while Normalized MIPS/MHz should be constant (or vary only a little) regardless of the processor clock, am I correct? >>>>>>> The cycle count for an algorithm should be constant regardless of your CPU speed. Say that at 1000 MHz you can run N iterations per second. When you set the clock to half (500 MHz), you should be able to run only N/2 iterations per second. When you normalize by MHz, the result should be constant regardless of the CPU speed setting.
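To spell out why the clock drops out, here is a short derivation. The symbols are my own: N is the number of runs, C the measured cycle count, and f the CPU clock in Hz. It assumes C itself does not change with f, which may only hold approximately if memory outside the core runs on a separate clock.

```latex
\[
\text{Dhrystones/s} = \frac{N}{C / f} = \frac{N f}{C}
\]
\[
\text{MIPS/MHz}
  = \frac{N f / C}{1757 \cdot \left( f / 10^{6} \right)}
  = \frac{10^{6}\, N}{1757\, C}
\]
```

The final expression contains only N and C, so halving f halves Dhrystones/s but leaves the normalized figure unchanged.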

Regards, Eric

0 Chanh Nguyen64 over 5 years ago in reply to lding

Expert 1230 points

Got it. Thanks Eric.

0 Chanh Nguyen64 over 5 years ago in reply to Chanh Nguyen64

Expert 1230 points

Hi Eric,

The comments in the file performance_unit.s say that ARM_CCNT_Read returns the clock value divided by 64 cycles. Is that the case?

If yes, then the total cycles on the ARM should be multiplied by 64, is that correct?

Another issue: could you please explain more about the float printing issue? Is there any solution for it? We may need to print floating-point numbers for verification. My existing project can actually print floats, but I compared the two projects and could not find any difference in settings.

Thanks.

0 lding over 5 years ago in reply to Chanh Nguyen64

TI__Guru* 95265 points

Hi,

See this: https://developer.arm.com/documentation/ddi0438/c/performance-monitor-unit/pmu-register-descriptions/performance-monitor-control-register

Bit [3], D (clock divider):

0: When enabled, PMCCNTR counts every clock cycle. This is the reset value.

1: When enabled, PMCCNTR counts every 64 clock cycles.

This bit is read/write.

Please check bit 3, the D-bit setting, in the code: ORR R0, R0, #0x5. The D bit is 0, so the counter counts every cycle.

2) For me, floating point prints in some projects and not in others. I haven't tracked down why. I will ask my colleagues if they know. Good to know there is no such problem in your setup.
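The bit check can be sketched in C as below. The bit positions follow the PMCR description in the ARM document linked above; the helper names are hypothetical. Note that ORR only sets bits, so decoding the literal 0x5 assumes the D bit was not already set in the value read from the register. Reading PMCR back on the target is the definitive check.

```c
#include <stdbool.h>
#include <stdint.h>

/* PMCR bit fields (ARMv7 Performance Monitor Control Register). */
#define PMCR_E (1u << 0) /* enable all counters */
#define PMCR_P (1u << 1) /* reset event counters */
#define PMCR_C (1u << 2) /* reset cycle counter */
#define PMCR_D (1u << 3) /* clock divider: 1 = count every 64 cycles */

/* True when PMCCNTR increments once per 64 clock cycles. */
bool pmcr_divides_by_64(uint32_t pmcr)
{
    return (pmcr & PMCR_D) != 0;
}

/* Convert a raw PMCCNTR difference to CPU cycles for a given PMCR value. */
uint64_t pmccntr_to_cycles(uint32_t count, uint32_t pmcr)
{
    return pmcr_divides_by_64(pmcr) ? (uint64_t)count * 64u : (uint64_t)count;
}
```

With 0x5 only E and C are set, so pmccntr_to_cycles returns the raw count unchanged and no multiplication by 64 is needed.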

Regards, Eric

0 Chanh Nguyen64 over 5 years ago in reply to lding

Expert 1230 points

Hi Eric,

Thanks for your information.

I have added my test code to your sample projects. (For the DSP, since yours is not a BIOS project, I replaced it with my own BIOS project and ran the Dhrystone test; I get results similar to your project's.) My observations are:

DSP timing is more consistent than ARM timing.
DSP timing is longer than ARM timing, even though the test code involves only calculation.

Could you please advise how to explain this result, since as we all know the DSP's calculation speed should be better than the ARM's? For the ARM timing, is there any way to improve consistency for a real-time application?

My test result and test project are as below. Thanks a lot.

WS_66AK2_Calc_W_Drystone_200709.zip

0 lding over 5 years ago in reply to Chanh Nguyen64

TI__Guru* 95265 points

Hi,

Is it correct that the figures you showed above are the cycle counts for your calc_test() and calc_test2(), and not for Dhrystone?

Your test consists of data additions, multiplications, and math functions in generic C code. I'm not sure how the C66x and the A15 compare on such an algorithm. Maybe this is what you get, or you can check how many cycles each step spends to understand which one consumes the most. DSP code can be optimized with intrinsics, which greatly improves performance, but this needs expertise.

The ARM A15 is a superscalar, out-of-order core, so in-order instruction execution is not guaranteed. That may be the reason you see the fluctuation. You can add ARM barriers such as dsb, isb, or dmb to see if that helps.
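As a rough sketch of what adding barriers around a measurement could look like: the code below assumes a GCC-style compiler for the inline assembly, and the wrapper and stand-in names are hypothetical. On the target you would pass your real counter read (for example ARM_CCNT_Read from performance_unit.s) and your real test function. On a non-ARM host it falls back to a compiler barrier only, so the stand-ins at the bottom exist just to exercise the logic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__arm__) || defined(__aarch64__)
/* Wait for outstanding memory accesses (dsb), then flush the pipeline (isb). */
static inline void sync_barrier(void)
{
    __asm__ volatile("dsb sy\n\tisb" ::: "memory");
}
#else
/* Host fallback: only stops the compiler from reordering across this point. */
static inline void sync_barrier(void)
{
    __asm__ volatile("" ::: "memory");
}
#endif

typedef uint32_t (*counter_fn)(void);
typedef void (*work_fn)(void);

/* Time one call of work(), fencing the region so earlier or later
 * instructions are less likely to overlap the counter reads. */
uint32_t time_with_barriers(counter_fn read_counter, work_fn work)
{
    assert(read_counter != NULL && work != NULL);
    sync_barrier();
    uint32_t start = read_counter();
    sync_barrier();
    work();
    sync_barrier();
    uint32_t end = read_counter();
    return end - start; /* unsigned subtraction tolerates one wrap-around */
}

/* Host-only stand-ins for the hardware counter and the code under test. */
static uint32_t fake_ticks;
static uint32_t fake_counter(void) { return fake_ticks; }
static void fake_work(void) { fake_ticks += 250u; }
```

The barriers themselves cost cycles, so the absolute numbers shift a little; the point is only to see whether the run-to-run spread shrinks.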

Another angle is to check the A15 instruction cache and data cache usage, to see whether cache misses cause this. Your test code is small and should be fully cached, but I am not sure, as you didn't delete the Dhrystone code, which also runs.

Regards, Eric

0 Chanh Nguyen64 over 5 years ago in reply to lding

Expert 1230 points

Hi Eric,

Yes, they are the calculation function times, not Dhrystone.

Could you please elaborate on ARM barriers such as dsb, isb, and dmb, and on A15 instruction cache and data cache usage? Or is there any document I can refer to?

Thanks a lot.

0 lding over 5 years ago in reply to Chanh Nguyen64

TI__Guru* 95265 points

Hi,

Please see https://developer.arm.com/documentation/ddi0438/g/ and look at Chapter 11, Performance Monitor Unit. There are different events for checking I-cache and D-cache accesses. Also search for dsb/isb. I am not sure you need to go that far.

Regards, Eric

0 Chanh Nguyen64 over 5 years ago in reply to lding

Expert 1230 points

Hi Eric,

The two functions are already very simple (they contain operations like +-*/ and math functions like sin/cos/pow/sqrt), so I wonder whether breaking down the timing would help us understand more.

I searched online, but I still do not understand how to use the methods you mentioned (ARM barriers, I-cache and D-cache usage). May I know whether those techniques are used in your benchmark program between the DSP and the ARM?

Could you please help confirm whether my observation in the previous post is correct? I am very surprised, as it conflicts with what we know about the DSP and the ARM. Could you also provide an explanation of this result?

Furthermore, I see that some parts on the DSP can be improved using intrinsics (for example for loops and some addition/multiplication operations), but for the math functions, the only way to optimize is to use MathLib, am I correct? My DSP project should already use MathLib.

Thanks a lot.

0 Chanh Nguyen64 over 5 years ago in reply to Chanh Nguyen64

Expert 1230 points

Hi Eric,

Just to supplement: the most important topic is why, in my test result, the ARM is faster than the DSP. Is this expected? I would appreciate your help with a conclusion and an explanation on this topic.

The fluctuation on the ARM is also important, but not as high a priority as the performance comparison between the DSP and the ARM. Sorry if my previous post was not clear.

Thanks a lot.

0 lding over 5 years ago in reply to Chanh Nguyen64

TI__Guru* 95265 points

Hi,

Here is what I got on K2H DSP and ARM for your benchmarking algorithms:

Iteration	C66x rgu32TimeSpend	C66x rgu32TimeSpend2	A15 rgu32TimeSpend	A15 rgu32TimeSpend2
1	119	2092	79	1814
2	90	1651	33	338
3	90	1651	31	249
4	90	1708	32	254
5	97	1662	35	271
6	98	1632	31	252
7	90	1632	35	305
8	90	1699	31	260
9	90	1679	31	250
10	90	1623	31	272
11	90	1662	31	257
12	90	1710	31	245
13	90	1662	33	262
14	90	1660	31	269
15	90	1623	31	248
16	90	1623	33	279
17	90	1660	31	259
18	90	1671	32	261
19	90	1660	31	249
20	90	1660	32	263
21	97	1632	31	281
22	98	1688	31	260
23	90	1632	31	285
24	90	1739	113	356
25	90	1651	31	271
26	90	1808	33	319
27	90	1701	31	259
28	90	1767	31	267
29	90	1795	31	294
30	90	1758	32	259
31	90	1767	32	262
32	90	1758	31	265
33	90	1758	31	257
34	90	1808	33	280
35	90	1730	31	250
36	90	1730	35	268
37	97	1758	31	274
38	98	1769	31	341
39	90	1758	27	269
40	90	1769	31	297
41	90	1739	33	279
42	90	1769	31	264
43	90	1730	31	253
44	90	1739	32	268
45	90	1797	31	284
46	90	1769	31	260
47	90	1778	31	274
48	90	1778	31	275
49	90	1806	32	344
50	90	1797	31	286
51	90	1660	35	267
52	90	1758	31	313
53	97	1690	31	266
54	98	1662	31	274
55	90	1671	32	269
56	90	1690	31	276
57	90	1660	31	244
58	90	1671	32	273
59	90	1671	35	275
60	90	1690	31	252
61	90	1623	32	261
62	90	1688	31	280
63	90	1671	31	274
64	90	1690	31	268
65	90	1662	31	255
66	90	1651	32	282
67	90	1710	31	254
68	90	1651	31	260
69	97	1632	31	255
70	98	1651	32	257
71	90	1671	31	281
72	90	1671	31	254
73	90	1651	31	291
74	90	1690	31	289
75	90	1671	38	358
76	90	1739	31	313
77	90	1778	31	280
78	90	1767	31	265
79	90	1797	31	256
80	90	1739	31	280
81	90	1769	31	283
82	90	1739	33	252
83	90	1769	31	258
84	90	1767	32	261
85	97	1769	31	257
86	98	1806	31	250
87	90	1730	33	249
88	90	1730	31	251
89	90	1786	35	264
90	90	1639	32	238
91	90	1678	33	389
92	90	1639	31	218
93	90	1697	31	193
94	90	1706	32	216
95	90	1706	31	186
96	90	1686	32	284
97	90	1667	31	198
98	90	1678	33	186
99	90	1717	33	197
100	90	1667	30	183

Yes, with conventional C code, the C66x is slower than the A15. (My C66x rgu32TimeSpend2 is even slower than yours for some reason, and the C66x also showed fluctuation.)
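When comparing columns like these, it can help to drop the first iteration (which looks like a cold-cache run) and summarize the rest. A minimal sketch follows, using the first ten A15 rgu32TimeSpend2 values from the table as sample data; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct timing_summary {
    uint32_t min;
    uint32_t max;
    uint64_t sum;
    size_t   count;
};

/* Summarize samples[skip..n-1]; skip lets you discard warm-up iterations. */
struct timing_summary summarize(const uint32_t *samples, size_t n, size_t skip)
{
    assert(samples != NULL && n > skip);
    struct timing_summary s = { samples[skip], samples[skip], 0, n - skip };
    for (size_t i = skip; i < n; i++) {
        if (samples[i] < s.min) s.min = samples[i];
        if (samples[i] > s.max) s.max = samples[i];
        s.sum += samples[i];
    }
    return s;
}

/* Iterations 1..10 of the A15 rgu32TimeSpend2 column above. */
static const uint32_t a15_time_spend2[10] = {
    1814, 338, 249, 254, 271, 252, 305, 260, 250, 272
};
```

For this sample, summarize(a15_time_spend2, 10, 1) gives min 249, max 338, and a sum of 2451 over 9 runs (mean about 272), against 1814 for the discarded first run. Reporting min and max alongside the mean keeps the fluctuation visible.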

Regards, Eric

0 Chanh Nguyen64 over 5 years ago in reply to lding

Expert 1230 points

Hi Eric,

Can we have a clear explanation of this result?

Or can you share the source of a benchmarking program that shows the DSP is faster than the ARM, together with your test results (e.g. the program used in document SPRAC13, or anything similar)?

Thanks a lot.

0 lding over 5 years ago in reply to Chanh Nguyen64

TI__Guru* 95265 points

Hi,

The test code we both tried is generic C code. To improve the performance, we need to rewrite it with C66x intrinsics and use MATHLIB for the math functions.

When we use #include <math.h>, those functions come from the C66x run-time library (rts6600_elf.lib), not from MATHLIB.

I am checking if we have the code for SPRAC13.

Regards, Eric

0 Chanh Nguyen64 over 5 years ago in reply to lding

Expert 1230 points

Hi Eric,

I rebuilt MathLib with the OVERRIDE_RTS flag and put mathlib.ae66 above lib.a, so when a function with the same interface as the normal function is called, it is still replaced by the corresponding function from MathLib. We can verify this in the map file.

Thanks.

0 lding over 5 years ago in reply to Chanh Nguyen64

TI__Guru* 95265 points

Hi,

For the C66x code used in SPRAC13, you can try mathlib_c66x_3_1_2_4\packages\ti\mathlib\src. It contains CCS projects you can import, build, and run. For several math functions referred to in SPRAC13 and used in your test cases, I put the numbers below; they should be better than the ARM A15 implementation.

--------------------------------------------------------------------------------

Verification Results: rsqrtSP
--------------------------------------------------------------------------------
Pre-defined Data: Passed
Special Case Data: Passed
Extended Range Data: Passed
Random Data (seed = 7878): Passed
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Cycle Profile: rsqrtSP
--------------------------------------------------------------------------------
RTS: 180 cycles
ASM: 81 cycles
C: 81 cycles
Inline: 129 cycles
Vector: 6 cycles
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Memory Profile: rsqrtSP
--------------------------------------------------------------------------------
ASM: 0 bytes
C: 256 bytes
Vector: 256 bytes
--------------------------------------------------------------------------------

-------------------------------------------------------------------------------

Verification Results: atan2SP
--------------------------------------------------------------------------------
Pre-defined Data: Passed
Special Case Data: Passed
Extended Range Data: Passed
Random Data (seed = 7878): Passed
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Cycle Profile: atan2SP
--------------------------------------------------------------------------------
RTS: 351 cycles
ASM: 114 cycles
C: 111 cycles
Inline: 132 cycles
Vector: 22 cycles
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Memory Profile: atan2SP
--------------------------------------------------------------------------------
ASM: 0 bytes
C: 896 bytes
Vector: 2688 bytes
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Verification Results: log10SP
--------------------------------------------------------------------------------
Pre-defined Data: Passed
Special Case Data: Passed
Extended Range Data: Passed
Random Data (seed = 7878): Passed
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Cycle Profile: log10SP
--------------------------------------------------------------------------------
RTS: 166 cycles
ASM: 89 cycles
C: 89 cycles
Inline: 257 cycles
Vector: 12 cycles
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Memory Profile: log10SP
--------------------------------------------------------------------------------
ASM: 0 bytes
C: 576 bytes
Vector: 1312 bytes
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Verification Results: cosSP
--------------------------------------------------------------------------------
Pre-defined Data: Passed
Special Case Data: Passed
Extended Range Data: Passed
Random Data (seed = 7878): Passed
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Cycle Profile: cosSP
--------------------------------------------------------------------------------
RTS: 175 cycles
ASM: 101 cycles
C: 106 cycles
Inline: 97 cycles
Vector: 10 cycles
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Memory Profile: cosSP
--------------------------------------------------------------------------------
ASM: 0 bytes
C: 576 bytes
Vector: 1760 bytes
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Verification Results: sinSP
--------------------------------------------------------------------------------
Pre-defined Data: Passed
Special Case Data: Passed
Extended Range Data: Passed
Random Data (seed = 7878): Passed
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Cycle Profile: sinSP
--------------------------------------------------------------------------------
RTS: 164 cycles
ASM: 95 cycles
C: 95 cycles
Inline: 74 cycles
Vector: 10 cycles
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Memory Profile: sinSP
--------------------------------------------------------------------------------
ASM: 0 bytes
C: 448 bytes
Vector: 1376 bytes
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Verification Results: powSP
--------------------------------------------------------------------------------
Pre-defined Data: Passed
Special Case Data: Passed
Extended Range Data: Passed
Random Data (seed = 7878): Passed
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Cycle Profile: powSP
--------------------------------------------------------------------------------
RTS: 685 cycles
ASM: 167 cycles
C: 167 cycles
Inline: 573 cycles
Vector: 53 cycles
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
Memory Profile: powSP
--------------------------------------------------------------------------------
ASM: 0 bytes
C: 1408 bytes
Vector: 2816 bytes
--------------------------------------------------------------------------------
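To put the cycle profiles above side by side, here is a small sketch that computes the RTS-to-variant ratios from the reported numbers. The table layout and function names are my own, and it assumes the Vector figure is a per-element cost, which is how I read the profile output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct cycle_profile {
    const char *name;
    unsigned rts, asm_opt, c_opt, inline_opt, vector;
};

/* Cycle counts copied from the profiles reported above. */
static const struct cycle_profile profiles[] = {
    { "rsqrtSP", 180,  81,  81, 129,  6 },
    { "atan2SP", 351, 114, 111, 132, 22 },
    { "log10SP", 166,  89,  89, 257, 12 },
    { "cosSP",   175, 101, 106,  97, 10 },
    { "sinSP",   164,  95,  95,  74, 10 },
    { "powSP",   685, 167, 167, 573, 53 },
};

/* How many times faster the optimized variant is than the baseline. */
double speedup(unsigned baseline_cycles, unsigned optimized_cycles)
{
    assert(optimized_cycles != 0);
    return (double)baseline_cycles / (double)optimized_cycles;
}

/* Print RTS/C and RTS/Vector ratios for every profiled function. */
void print_speedups(void)
{
    size_t count = sizeof profiles / sizeof profiles[0];
    for (size_t i = 0; i < count; i++) {
        printf("%-8s RTS/C = %5.2f   RTS/Vector = %5.2f\n",
               profiles[i].name,
               speedup(profiles[i].rts, profiles[i].c_opt),
               speedup(profiles[i].rts, profiles[i].vector));
    }
}
```

On these numbers the scalar C variants come out roughly 1.7x to 4x faster than RTS, while the Vector variants are roughly 13x to 30x faster, so the largest gain comes from processing arrays rather than calling the function one value at a time.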

Regards, Eric

0 Chanh Nguyen64 over 5 years ago in reply to lding

Expert 1230 points

Hi Eric,

Following your suggestion, I took the Filter test code from DSPLIB and tested it on both the DSP and the ARM.

Thanks.
