Hi TI,
I want to compare the C66 and C71 cores in terms of computational performance. To do so, I use the DSP_fir_r8_66_LE_ELF example from dsplib_c66x_3_4_0_0. I booted the TDA4 Eval Board in No-Boot mode and applied the corresponding launch.js script to be able to connect to the C66 and/or C71 core. After each run on a core, I cold-reset the board.
The steps I did for C66 are the following:
- Import the above-mentioned project in CCS and build it (since it is a C66 project, no further steps are required)
- Additional info: The whole application is located in L2SRAM, and I build it in release mode (optimization level -O3).
- To measure CPU cycles, I use the TSCH and TSCL registers
This produces the following output (only showing the last lines, where the number of output samples (nr) is biggest):
DSP_fir_r8 Iter#: 121 Result Successful (r_i) NR = 124 NH = 8 natC: 466 optC: 443
DSP_fir_r8 Iter#: 122 Result Successful (r_i) NR = 124 NH = 16 natC: 719 optC: 569
DSP_fir_r8 Iter#: 123 Result Successful (r_i) NR = 124 NH = 24 natC: 987 optC: 695
DSP_fir_r8 Iter#: 124 Result Successful (r_i) NR = 124 NH = 32 natC: 1255 optC: 821
DSP_fir_r8 Iter#: 125 Result Successful (r_i) NR = 128 NH = 8 natC: 429 optC: 444
DSP_fir_r8 Iter#: 126 Result Successful (r_i) NR = 128 NH = 16 natC: 701 optC: 572
DSP_fir_r8 Iter#: 127 Result Successful (r_i) NR = 128 NH = 24 natC: 973 optC: 700
DSP_fir_r8 Iter#: 128 Result Successful (r_i) NR = 128 NH = 32 natC: 1245 optC: 828
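For context, a minimal sketch of how a TSCL/TSCH cycle measurement on the C66 can look. This is an illustration under assumptions, not necessarily the exact code in the dsplib test driver: the helper names tsc_combine, tsc_elapsed and tsc_read are hypothetical, and the register access is only compiled when the C6000 compiler's _TMS320C6X macro is defined.

```c
#include <stdint.h>

#if defined(_TMS320C6X)
#include <c6x.h>   /* declares the TSCL and TSCH control registers */
#endif

/* Combine the two 32-bit halves of the time-stamp counter into one value. */
static uint64_t tsc_combine(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}

/* Cycles spent in the measured region, minus the cost of the two reads
 * (overhead is measured once by timing an empty region). */
static uint64_t tsc_elapsed(uint64_t start, uint64_t stop, uint64_t overhead)
{
    uint64_t raw = stop - start;
    return (raw > overhead) ? (raw - overhead) : 0;
}

#if defined(_TMS320C6X)
/* Target-only: the counter starts after the first write to TSCL (TSCL = 0;).
 * Reading TSCL latches the upper half, so TSCL must be read before TSCH. */
static uint64_t tsc_read(void)
{
    uint32_t lo = TSCL;
    uint32_t hi = TSCH;
    return tsc_combine(hi, lo);
}
#endif
```

On the target, the measured region would be bracketed by two tsc_read() calls and the difference passed to tsc_elapsed().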
The steps I did for C71 are the following:
- Create a new CCS Project for C71 using the corresponding C7000 compiler (v2.1.1 from PSDK RTOS v08.02).
- Copy DSP_fir_r8_d.c to the project to be used as main.c
- Copy the corresponding .c and .h files, which contain the FIR calculation functions (DSP_fir_r8_cn.c/.h, DSP_fir_r8.c/.h)
- Remove include of c6x.h and use c7x.h instead
- Add c6x_migration.h to migrate C66 code to C71 code
- Add a linker command file that also places the whole application in L2 memory (L2RAM_C7x_1, located at 0x64800000)
- Build in release mode, i.e. with optimization level -O3 and -mf5 (optimize for speed)
- To measure CPU cycles, I use the __TSC register
This produces the following output (only showing the last lines, where the number of output samples (nr) is biggest):
DSP_fir_r8 Iter#: 121 Result Successful (r_i) NR = 124 NH = 8 natC: 2046 optC: 2055
DSP_fir_r8 Iter#: 122 Result Successful (r_i) NR = 124 NH = 16 natC: 2046 optC: 2954
DSP_fir_r8 Iter#: 123 Result Successful (r_i) NR = 124 NH = 24 natC: 4020 optC: 3865
DSP_fir_r8 Iter#: 124 Result Successful (r_i) NR = 124 NH = 32 natC: 4020 optC: 4768
DSP_fir_r8 Iter#: 125 Result Successful (r_i) NR = 128 NH = 8 natC: 2108 optC: 1784
DSP_fir_r8 Iter#: 126 Result Successful (r_i) NR = 128 NH = 16 natC: 2108 optC: 2694
DSP_fir_r8 Iter#: 127 Result Successful (r_i) NR = 128 NH = 24 natC: 4147 optC: 3586
DSP_fir_r8 Iter#: 128 Result Successful (r_i) NR = 128 NH = 32 natC: 4147 optC: 4482
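A minimal sketch of how a __TSC-based measurement on the C71 can be structured, in case the measurement itself matters for the numbers above. It assumes that c7x.h exposes __TSC as a single 64-bit counter and that the C7000 compiler defines __C7000__; the names read_cycles, measure_cycles and count_calls are hypothetical, and the host branch is only a stand-in so the sketch compiles off-target.

```c
#include <stdint.h>

#if defined(__C7000__)
#include <c7x.h>   /* assumed to declare the 64-bit __TSC control register */
static uint64_t read_cycles(void) { return __TSC; }
#else
/* Host stand-in: advances by 10 "cycles" on every read. */
static uint64_t fake_counter;
static uint64_t read_cycles(void) { fake_counter += 10; return fake_counter; }
#endif

typedef void (*kernel_fn)(void *ctx);

/* Time one call of fn(ctx); the result includes the cost of the counter reads. */
static uint64_t measure_cycles(kernel_fn fn, void *ctx)
{
    uint64_t start = read_cycles();
    fn(ctx);
    uint64_t stop = read_cycles();
    return stop - start;
}

/* Hypothetical stand-in for the FIR kernel: just counts how often it ran. */
static void count_calls(void *ctx) { ++*(int *)ctx; }
```

Unlike the TSCL/TSCH pair on the C66, no combining of two halves is needed here because the counter is read in one access.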
Q1: Why does the C71 need so many more CPU cycles than the C66 core? Is there maybe some optimization missing in the compiler settings? I tried to be as fair as possible (placing the application in L2 memory on both DSPs, using the same optimization level, using the same code).
Q2: On C71 side, why does natC code perform (in most cases) better than the optimized one (optC)?
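To put numbers on both questions, here is a small sketch that normalizes the cycle counts from the logs above. The helper names cycles_per_mac and cycle_ratio are hypothetical; the only inputs are the posted figures, and one multiply-accumulate per output sample per tap (NR * NH in total) is assumed.

```c
/* Cycles per multiply-accumulate, assuming NR * NH MACs per call. */
static double cycles_per_mac(unsigned cycles, unsigned nr, unsigned nh)
{
    return (double)cycles / ((double)nr * (double)nh);
}

/* How many times more cycles one measurement took than another. */
static double cycle_ratio(unsigned cycles_a, unsigned cycles_b)
{
    return (double)cycles_a / (double)cycles_b;
}
```

For NR = 128, NH = 32 this gives about 0.20 cycles per MAC for optC on the C66 (828 cycles) versus about 1.09 on the C71 (4482 cycles), a ratio of roughly 5.4. On the C71, optC takes about 1.08 times as many cycles as natC for the same case (4482 vs. 4147).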
Thanks for your help and best regards,
Felix