EVMK2H: Benchmark program between DSP and ARM

Part Number: EVMK2H

Hi TI,

1. Do you have an example benchmarking program to compare the performance of the DSP and the ARM? We want to study which tasks should run on the DSP and which on the ARM.

2. I implemented a small program that consists mostly of mathematical calculation. My expectation was that the DSP would be faster (as stated in TI DSP Benchmarking, SPRAC13). However, what I see is that the ARM executes it faster. My program simply runs after loading the GEL file, with no specific settings for either the DSP or the ARM. What could be the reason for the observed performance?

3. Is there any reference or guideline on which programs should run on the DSP and which on the ARM?

Thanks a lot.

  • Hi,

    The https://www.ti.com/lit/an/sprac13/sprac13.pdf you mentioned is our application note for A15 and C66x benchmarking. Table 1 summarizes the results, where the C66x shows advantages over the A15.

    Typically, anything that involves heavy signal processing, matrix operations, or linear algebra is a good fit for the DSP. Programs with more control code are a good fit for the ARM processor.

    >>> I implemented a small program that consists mostly of mathematical calculation >>> For any benchmarking, you need to set up the framework correctly before comparing different processor architectures: cache, cycle counter, memory placement, compiler/linker options, etc. It also matters whether optimized library code (like TI MATHLIB) is used for better performance.

    Regards, Eric  
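
    The framework points above (cycle counter, warm caches, identical measurement on both cores) can be sketched as a small timing harness. This is only an illustration, not the code from SPRAC13 or the attached projects: `bench_run`, `kernel_fn`, and `bench_demo_kernel` are hypothetical names, the C66x branch assumes the TI compiler's `_TMS320C6X` macro and the `TSCL` register from `c6x.h`, and every other target falls back to `clock()`, whose ticks are not CPU cycles.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(_TMS320C6X)
#include <c6x.h>   /* TI C6000 compiler: declares the TSCL time-stamp counter */
static uint32_t read_ticks(void) { return TSCL; }
static void start_ticks(void) { TSCL = 0; }  /* any write starts the free-running counter */
#else
#include <time.h>
/* Fallback for other targets: units are clock() ticks, not CPU cycles. */
static uint32_t read_ticks(void) { return (uint32_t)clock(); }
static void start_ticks(void) { }
#endif

typedef void (*kernel_fn)(void);

/* Hypothetical stand-in for the code under test; it only counts its calls. */
volatile unsigned bench_demo_calls = 0;
void bench_demo_kernel(void) { bench_demo_calls++; }

/* Calls fn() `warmup` times untimed so caches are filled, then times `runs`
 * calls one by one and stores each elapsed tick count in samples[]. */
void bench_run(kernel_fn fn, size_t warmup, size_t runs, uint32_t *samples)
{
    size_t i;

    start_ticks();
    for (i = 0; i < warmup; i++) {
        fn();
    }
    for (i = 0; i < runs; i++) {
        uint32_t t0 = read_ticks();
        fn();
        samples[i] = read_ticks() - t0;  /* unsigned subtraction tolerates one wrap-around */
    }
}
```

    Building the same harness for both cores, with the same optimization level and with code and data placed in comparable memory, keeps the comparison about the processors rather than about the setup.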

  • Hi Eric,

    Thanks for your advice.

    Do you have example DSP and ARM projects that contain the proper settings for benchmarking?

  • Hi,

    The attached project is Dhrystone for the C66x. It should report a number like "Normalized MIPS/MHz =   0.8297". You can then replace the Dhrystone code with your own application.

    For the A15, I will try to find one for you.

    Regards, Eric

    Dhrystone_C66.zip

  • Hi,

    C66x sample output:

    [C66xx_0]
    Dhrystone Benchmark, Version 2.1 (Language: C)

    Program compiled without 'register' attribute

    Please give the number of runs through the benchmark:
    Execution starts, 1000000 runs through Dhrystone
    Execution ends

    Final values of the variables used in the benchmark:

    Int_Glob: 5
    should be: 5
    Bool_Glob: 1
    should be: 1
    Ch_1_Glob: A
    should be: A
    Ch_2_Glob: B
    should be: B
    Arr_1_Glob[8]: 7
    should be: 7
    Arr_2_Glob[8][7]: 1000010
    should be: Number_Of_Runs + 10
    Ptr_Glob->
    Ptr_Comp: 8435272
    should be: (implementation-dependent)
    Discr: 0
    should be: 0
    Enum_Comp: 2
    should be: 2
    Int_Comp: 17
    should be: 17
    Str_Comp: DHRYSTONE PROGRAM, SOME STRING
    should be: DHRYSTONE PROGRAM, SOME STRING
    Next_Ptr_Glob->
    Ptr_Comp: 8435272
    should be: (implementation-dependent), same as above
    Discr: 0
    should be: 0
    Enum_Comp: 1
    should be: 1
    Int_Comp: 18
    should be: 18
    Str_Comp: DHRYSTONE PROGRAM, SOME STRING
    should be: DHRYSTONE PROGRAM, SOME STRING
    Int_1_Loc: 5
    should be: 5
    Int_2_Loc: 13
    should be: 13
    Int_3_Loc: 7
    should be: 7
    Enum_Loc: 1
    should be: 1
    Str_1_Loc: DHRYSTONE PROGRAM, 1'ST STRING
    should be: DHRYSTONE PROGRAM, 1'ST STRING
    Str_2_Loc: DHRYSTONE PROGRAM, 2'ND STRING
    should be: DHRYSTONE PROGRAM, 2'ND STRING

    Total 686000129 cycles spend for 1000000 iterations
    Microseconds for one run through Dhrystone: 0.7
    Dhrystones per Second: 1457725.8

    Normalized MIPS/MHz = 0.8297

    For the A15 core, we don't have an example on K2H, but we have one on other processors. I created a SYSBIOS one on the K2H A15 for your reference. The sample output:

    Total 176305320 cycles spend for 1000000 iterations
    Microseconds for one run through Dhrystone:
    Dhrystones per Second:

    Normalized MIPS/MHz =

    For some reason, floating-point values didn't print properly under the SYSBIOS environment. You can do the remaining calculation yourself; it should be

    Dhrystones per Second: 5671978.6

    Normalized MIPS/MHz = 3.228

    Hope both projects can be used for your benchmarking framework.

    Regards, Eric

    Dhrystone_A15.zip

  • Hi Eric,

    I have tried the provided examples and get results similar to yours.

    Just a few questions about how the result is calculated:

    Normalized MIPS/MHz = Dhrystones_Per_Second/1757.0/1000.0 --> Can I say that this calculation assumes a processor clock of 1000.0 MHz? So if I change the processor PLL, I should change the number 1000.0 as well. Am I correct? From my searching, the number 1757 comes from: "The industry has adopted the VAX 11/780 as the reference 1 MIP machine. The VAX 11/780 achieves 1757 Dhrystones per second".

    Microseconds and Dhrystones_Per_Second will depend on the processor clock, while Normalized MIPS/MHz should be constant (or vary only slightly) regardless of the processor clock. Am I correct?

    Thanks a lot.

  • Hi,

    So if I change the processor PLL I should change the number 1000.0 as well, am I correct? >>> Correct, you need to change this number to match the PLL setting.

    Microseconds, Dhrystones_Per_Second will depend on the processor clock, while Normalized MIPS/MHz should be constant (or could vary a little bit) regardless of the processor clock, am I correct? >>> The cycle count for an algorithm should be constant regardless of your CPU speed. Say that at 1000 MHz you can run N iterations per second. When you halve the clock (500 MHz), you should be able to run only N/2 iterations per second. When you normalize by MHz, the result should be constant regardless of the CPU speed setting.

    Regards, Eric
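
    The arithmetic discussed above can be written out as a short sketch. The function and macro names are made up for illustration; only the formula Dhrystones_Per_Second/1757.0/clock_MHz and the cycle counts come from this thread.

```c
#include <stdint.h>

/* Dhrystones per second of the VAX 11/780, the reference "1 MIPS" machine. */
#define VAX_11_780_DHRYSTONES_PER_SEC 1757.0

/* Dhrystones per second from a measured cycle count at a given core clock. */
double dhrystones_per_second(uint64_t cycles, uint32_t runs, double clock_mhz)
{
    double seconds = (double)cycles / (clock_mhz * 1.0e6);
    return (double)runs / seconds;
}

/* Normalized MIPS/MHz. The clock appears once in the numerator (through
 * Dhrystones per second) and once in the denominator, so it cancels and the
 * result depends only on the ratio runs / cycles. */
double dmips_per_mhz(uint64_t cycles, uint32_t runs, double clock_mhz)
{
    return dhrystones_per_second(cycles, runs, clock_mhz)
           / VAX_11_780_DHRYSTONES_PER_SEC / clock_mhz;
}
```

    With the C66x figures from the sample output (686000129 cycles for 1000000 runs), `dmips_per_mhz` gives about 0.8297 whether `clock_mhz` is 1000.0 or 500.0, provided the clock value passed in matches the clock the cycles were measured at.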

  • Got it. Thanks Eric.

  • Hi Eric,

    The comments in the file performance_unit.s say that ARM_CCNT_Read returns the clock value divided by 64 cycles. Is that the case?

    If yes, then the total cycles on the ARM should be multiplied by 64. Is that correct?

    Another issue: could you please explain more about the float printing issue? Is there any solution for it? We may need to print floating-point numbers for verification. My existing project can actually print floats, but I compared the two projects and cannot find any difference in the settings.

    Thanks.

  • Hi,

    See this: https://developer.arm.com/documentation/ddi0438/c/performance-monitor-unit/pmu-register-descriptions/performance-monitor-control-register

    Bit [3], D (clock divider):

    0 = When enabled, PMCCNTR counts every clock cycle. This is the reset value.
    1 = When enabled, PMCCNTR counts every 64 clock cycles.

    This bit is read/write.

    Please check the bit 3 (D-bit) setting in the code: ORR R0, R0, #0x5 sets only bits 0 and 2 and leaves bit 3 untouched, so the D-bit stays at its reset value of 0 ========> the counter counts every cycle.

    2) For me, floating point prints in some projects and not in others. I didn't track down why. I will ask my colleagues if they know. Good to know there is no such problem in your setup.

    Regards, Eric
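
    The D-bit check above can also be expressed in C. This is a minimal sketch that only decodes a PMCR value and scales a raw counter difference; the helper names are hypothetical, and the bit positions follow the PMCR description linked above (E is bit 0, C is bit 2, D is bit 3).

```c
#include <stdint.h>

#define PMCR_E (1u << 0)  /* enable all counters */
#define PMCR_C (1u << 2)  /* cycle counter reset */
#define PMCR_D (1u << 3)  /* clock divider: PMCCNTR counts every 64th cycle */

/* Number of CPU cycles represented by one PMCCNTR tick for a given PMCR value. */
uint32_t pmccntr_cycles_per_tick(uint32_t pmcr)
{
    return (pmcr & PMCR_D) ? 64u : 1u;
}

/* Converts a raw PMCCNTR difference into CPU cycles. */
uint64_t pmccntr_to_cycles(uint32_t raw_ticks, uint32_t pmcr)
{
    return (uint64_t)raw_ticks * pmccntr_cycles_per_tick(pmcr);
}
```

    With a PMCR value of 0x5 (E and C set, D clear), the scale factor is 1, so the reported count is used as-is. Only if the D-bit were set would the multiplication by 64 apply.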

  • Hi Eric,

    Thanks for your information.

    I have added my test code to your sample projects. (For the DSP, since yours is not a BIOS project, I replaced it with my BIOS project and ran the Dhrystone test; I get results similar to your project.) My observations are:

    • DSP timing is more consistent than ARM timing.
    • DSP timing is larger than ARM timing, even though the test code only involves calculation.

    Could you please advise how to explain this result, given that DSP calculation speed is expected to be better than ARM? For the ARM timing, is there any way to improve the consistency for a real-time application?

    My test results and test project are below. Thanks a lot.

    WS_66AK2_Calc_W_Drystone_200709.zip

  • Hi,

    Is it correct that the figures you showed above are the cycles for your calc_test() and calc_test2(), not for Dhrystone?

    Your test consists of data additions, multiplications, and math functions in generic C code. I'm not sure how the C66x and the A15 compare on such an algorithm. Maybe this is what you can get, or you can check how many cycles each step spends to understand which one consumes the most. DSP code can be optimized with intrinsics, which greatly improves performance, but this requires expertise.

    The ARM A15 is a superscalar core, and in-order instruction execution is not guaranteed. That may be the reason you see the fluctuation. You can add ARM barriers such as dsb, isb, and dmb to see if that helps.

    Another angle is to check the A15 instruction cache and data cache usage, to see whether cache misses cause this. Your test code is small and should be fully cached, but I am not sure, as you didn't delete the Dhrystone code, which also runs.

    Regards, Eric          
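
    One way the barrier suggestion could look in code is sketched below. It is an assumption-laden illustration, not the code from the attached project: it assumes a GCC-style compiler targeting ARMv7-A, privileged access to the PMU, and a cycle counter that is already enabled. `barrier_sync`, `read_pmccntr`, `timed_call`, and `barrier_demo_kernel` are hypothetical names, and non-ARM builds fall back to `clock()` with no barriers.

```c
#include <stdint.h>

#if defined(__arm__) && defined(__GNUC__)
/* DSB waits until earlier memory accesses have completed; ISB then flushes the
 * pipeline so that later instructions are fetched only after the barrier. */
static inline void barrier_sync(void)
{
    __asm__ volatile("dsb\n\tisb" ::: "memory");
}

/* Reads PMCCNTR, the PMU cycle counter, through CP15 (c9, c13, 0). */
static inline uint32_t read_pmccntr(void)
{
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
}
#else
#include <time.h>
/* Fallback for other targets: no barriers, and clock() ticks instead of cycles. */
static inline void barrier_sync(void) { }
static inline uint32_t read_pmccntr(void) { return (uint32_t)clock(); }
#endif

/* Hypothetical stand-in for the code under test; it only counts its calls. */
volatile unsigned barrier_demo_calls = 0;
void barrier_demo_kernel(void) { barrier_demo_calls++; }

/* Times one call of fn(), with barriers so that the counter reads cannot be
 * reordered around the measured code. */
uint32_t timed_call(void (*fn)(void))
{
    uint32_t t0, t1;

    barrier_sync();
    t0 = read_pmccntr();
    fn();
    barrier_sync();
    t1 = read_pmccntr();
    return t1 - t0;
}
```

    The barriers pin down where the counter is sampled. They do not remove variation caused by cache misses or interrupts, so some scatter may remain.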

  • Hi Eric,

    Yes, they are the calculation function times, not Dhrystone.

    Could you please elaborate on the ARM barriers (dsb, isb, dmb) and on the A15 instruction cache and data cache usage? Or is there any document I can refer to?

    Thanks a lot.

  • Hi,

    Please see https://developer.arm.com/documentation/ddi0438/g/ and look at Chapter 11, Performance Monitor Unit. There are different events for checking I-cache and D-cache accesses. Also search for dsb/isb. I am not sure if you need to go that far.

    Regards, Eric

  • Hi Eric,

    The two functions are already very simple (they contain operations like +, -, *, / and math functions like sin/cos/pow/sqrt), so I wonder whether breaking down the timing would help much.

    I searched online, but I still do not understand how to apply the methods you mentioned (ARM barriers, I-cache and D-cache usage). May I know if those techniques are used in your benchmark program between the DSP and the ARM?

    Could you please help confirm whether my observation in the previous post is correct? I am very surprised, as this conflicts with what we know about the DSP and the ARM. Could you also provide an explanation for this result?

    Furthermore, I see that some parts on the DSP can be improved using intrinsics (for example the for loop and some addition/multiplication operations), but for the math functions the only way to optimize is to use MathLib, am I correct? My DSP project should already be using MathLib.

    Thanks a lot.

  • Hi Eric,

    Just to supplement: the most important topic is why, in my test result, the ARM is faster than the DSP. Is this expected? I would appreciate your help with a conclusion and an explanation on this topic.

    The fluctuation on the ARM is also important, but not as high a priority as the performance comparison between the DSP and the ARM. Sorry if my previous post was not clear.

    Thanks a lot.

  • Hi,

    Here is what I got on K2H DSP and ARM for your benchmarking algorithms:

    Columns: Iteration, C66x rgu32TimeSpend, C66x rgu32TimeSpend2, A15 rgu32TimeSpend, A15 rgu32TimeSpend2
    1 119 2092 79 1814
    2 90 1651 33 338
    3 90 1651 31 249
    4 90 1708 32 254
    5 97 1662 35 271
    6 98 1632 31 252
    7 90 1632 35 305
    8 90 1699 31 260
    9 90 1679 31 250
    10 90 1623 31 272
    11 90 1662 31 257
    12 90 1710 31 245
    13 90 1662 33 262
    14 90 1660 31 269
    15 90 1623 31 248
    16 90 1623 33 279
    17 90 1660 31 259
    18 90 1671 32 261
    19 90 1660 31 249
    20 90 1660 32 263
    21 97 1632 31 281
    22 98 1688 31 260
    23 90 1632 31 285
    24 90 1739 113 356
    25 90 1651 31 271
    26 90 1808 33 319
    27 90 1701 31 259
    28 90 1767 31 267
    29 90 1795 31 294
    30 90 1758 32 259
    31 90 1767 32 262
    32 90 1758 31 265
    33 90 1758 31 257
    34 90 1808 33 280
    35 90 1730 31 250
    36 90 1730 35 268
    37 97 1758 31 274
    38 98 1769 31 341
    39 90 1758 27 269
    40 90 1769 31 297
    41 90 1739 33 279
    42 90 1769 31 264
    43 90 1730 31 253
    44 90 1739 32 268
    45 90 1797 31 284
    46 90 1769 31 260
    47 90 1778 31 274
    48 90 1778 31 275
    49 90 1806 32 344
    50 90 1797 31 286
    51 90 1660 35 267
    52 90 1758 31 313
    53 97 1690 31 266
    54 98 1662 31 274
    55 90 1671 32 269
    56 90 1690 31 276
    57 90 1660 31 244
    58 90 1671 32 273
    59 90 1671 35 275
    60 90 1690 31 252
    61 90 1623 32 261
    62 90 1688 31 280
    63 90 1671 31 274
    64 90 1690 31 268
    65 90 1662 31 255
    66 90 1651 32 282
    67 90 1710 31 254
    68 90 1651 31 260
    69 97 1632 31 255
    70 98 1651 32 257
    71 90 1671 31 281
    72 90 1671 31 254
    73 90 1651 31 291
    74 90 1690 31 289
    75 90 1671 38 358
    76 90 1739 31 313
    77 90 1778 31 280
    78 90 1767 31 265
    79 90 1797 31 256
    80 90 1739 31 280
    81 90 1769 31 283
    82 90 1739 33 252
    83 90 1769 31 258
    84 90 1767 32 261
    85 97 1769 31 257
    86 98 1806 31 250
    87 90 1730 33 249
    88 90 1730 31 251
    89 90 1786 35 264
    90 90 1639 32 238
    91 90 1678 33 389
    92 90 1639 31 218
    93 90 1697 31 193
    94 90 1706 32 216
    95 90 1706 31 186
    96 90 1686 32 284
    97 90 1667 31 198
    98 90 1678 33 186
    99 90 1717 33 197
    100 90 1667 30 183

    Yes, with conventional C code, the C66x is slower than the A15. (My C66x result for rgu32TimeSpend2 is even slower than yours for some reason, and the C66x also showed fluctuation.)

    Regards, Eric
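
    To compare columns like the ones above without the first iteration dominating, the samples can be reduced to a minimum, maximum, and mean after dropping warm-up runs. The sketch below is one possible way to do that; `sample_stats` and `summarize_samples` are hypothetical names, not part of either attached project.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t min;
    uint32_t max;
    double   mean;
} sample_stats;

/* Summarizes samples[skip] .. samples[n - 1]; `skip` drops warm-up iterations.
 * Returns 0 on success, or -1 if no samples remain after skipping. */
int summarize_samples(const uint32_t *samples, size_t n, size_t skip,
                      sample_stats *out)
{
    size_t i;
    uint64_t sum = 0;

    if (skip >= n) {
        return -1;
    }
    out->min = samples[skip];
    out->max = samples[skip];
    for (i = skip; i < n; i++) {
        if (samples[i] < out->min) out->min = samples[i];
        if (samples[i] > out->max) out->max = samples[i];
        sum += samples[i];
    }
    out->mean = (double)sum / (double)(n - skip);
    return 0;
}
```

    For example, the first five A15 rgu32TimeSpend2 values are 1814, 338, 249, 254, and 271. Skipping the first one, which is likely a cold-cache run, gives a minimum of 249, a maximum of 338, and a mean of 278.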

  • Hi Eric,

    Can we have a clear explanation of this result?

    Or can you share the source of a benchmarking program that shows the DSP is faster than the ARM, together with your test results (e.g. the program used in document SPRAC13, or anything similar)?

    Thanks a lot.

  • Hi,

    The test code we both tried is generic C code. To improve the performance, we need to rewrite it with C66x intrinsics and use MATHLIB for the math functions.

    When we use #include <math.h>, those functions come from the C66x run-time library (rts6600_elf.lib), not from MATHLIB.

    I am checking if we have the code for SPRAC13.

    Regards, Eric 

  • Hi Eric,

    I rebuilt MathLib with the OVERRIDE_RTS flag and put mathlib.ae66 above lib.a, so when a function with the same interface as the normal one is called, it is still replaced by the corresponding function from MathLib. We can verify this in the map file.

    Thanks.

  • Hi,

    For the C66x code used in SPRAC13, you can try mathlib_c66x_3_1_2_4\packages\ti\mathlib\src. It contains CCS projects for all the functions, which you can import, build, and run. For several math functions referred to in SPRAC13 and used in your test cases, I put the numbers below; they should be better than the ARM A15 implementation.

    --------------------------------------------------------------------------------
    Verification Results: rsqrtSP
    --------------------------------------------------------------------------------
    Pre-defined Data: Passed
    Special Case Data: Passed
    Extended Range Data: Passed
    Random Data (seed = 7878): Passed
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Cycle Profile: rsqrtSP
    --------------------------------------------------------------------------------
    RTS: 180 cycles
    ASM: 81 cycles
    C: 81 cycles
    Inline: 129 cycles
    Vector: 6 cycles
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Memory Profile: rsqrtSP
    --------------------------------------------------------------------------------
    ASM: 0 bytes
    C: 256 bytes
    Vector: 256 bytes
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Verification Results: atan2SP
    --------------------------------------------------------------------------------
    Pre-defined Data: Passed
    Special Case Data: Passed
    Extended Range Data: Passed
    Random Data (seed = 7878): Passed
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Cycle Profile: atan2SP
    --------------------------------------------------------------------------------
    RTS: 351 cycles
    ASM: 114 cycles
    C: 111 cycles
    Inline: 132 cycles
    Vector: 22 cycles
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Memory Profile: atan2SP
    --------------------------------------------------------------------------------
    ASM: 0 bytes
    C: 896 bytes
    Vector: 2688 bytes
    --------------------------------------------------------------------------------


    --------------------------------------------------------------------------------
    Verification Results: log10SP
    --------------------------------------------------------------------------------
    Pre-defined Data: Passed
    Special Case Data: Passed
    Extended Range Data: Passed
    Random Data (seed = 7878): Passed
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Cycle Profile: log10SP
    --------------------------------------------------------------------------------
    RTS: 166 cycles
    ASM: 89 cycles
    C: 89 cycles
    Inline: 257 cycles
    Vector: 12 cycles
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Memory Profile: log10SP
    --------------------------------------------------------------------------------
    ASM: 0 bytes
    C: 576 bytes
    Vector: 1312 bytes
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Verification Results: cosSP
    --------------------------------------------------------------------------------
    Pre-defined Data: Passed
    Special Case Data: Passed
    Extended Range Data: Passed
    Random Data (seed = 7878): Passed
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Cycle Profile: cosSP
    --------------------------------------------------------------------------------
    RTS: 175 cycles
    ASM: 101 cycles
    C: 106 cycles
    Inline: 97 cycles
    Vector: 10 cycles
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Memory Profile: cosSP
    --------------------------------------------------------------------------------
    ASM: 0 bytes
    C: 576 bytes
    Vector: 1760 bytes
    --------------------------------------------------------------------------------


    --------------------------------------------------------------------------------
    Verification Results: sinSP
    --------------------------------------------------------------------------------
    Pre-defined Data: Passed
    Special Case Data: Passed
    Extended Range Data: Passed
    Random Data (seed = 7878): Passed
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Cycle Profile: sinSP
    --------------------------------------------------------------------------------
    RTS: 164 cycles
    ASM: 95 cycles
    C: 95 cycles
    Inline: 74 cycles
    Vector: 10 cycles
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Memory Profile: sinSP
    --------------------------------------------------------------------------------
    ASM: 0 bytes
    C: 448 bytes
    Vector: 1376 bytes
    --------------------------------------------------------------------------------


    --------------------------------------------------------------------------------
    Verification Results: powSP
    --------------------------------------------------------------------------------
    Pre-defined Data: Passed
    Special Case Data: Passed
    Extended Range Data: Passed
    Random Data (seed = 7878): Passed
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Cycle Profile: powSP
    --------------------------------------------------------------------------------
    RTS: 685 cycles
    ASM: 167 cycles
    C: 167 cycles
    Inline: 573 cycles
    Vector: 53 cycles
    --------------------------------------------------------------------------------

    --------------------------------------------------------------------------------
    Memory Profile: powSP
    --------------------------------------------------------------------------------
    ASM: 0 bytes
    C: 1408 bytes
    Vector: 2816 bytes
    --------------------------------------------------------------------------------

    Regards, Eric

  • Hi Eric,

    Following your suggestion, I took the filter test code from DSPLIB and tested it on both the DSP and the ARM.

    Thanks.