TDA4VM: Performance comparison between C66 and C71

Part Number: TDA4VM

Hi TI,

I want to compare the computational performance of the C66 and C71 cores. To do so, I use the DSP_fir_r8_66_LE_ELF example from dsplib_c66x_3_4_0_0. I boot the TDA4 evaluation board in No-Boot mode and run the corresponding launch.js script so that I can connect to the C66 and/or C71 core. After each run on a core, I cold-reset the board.
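For context, the kernel being timed is a direct-form FIR filter on 16-bit data. The following is a minimal sketch of what such a natural C kernel typically looks like, assuming Q15 coefficients and a right shift by 15 on the accumulated sum. The name fir_ref is hypothetical and the code is not copied from DSPLIB.

```c
/* Hypothetical reference kernel (not the DSPLIB source).
 * Direct-form FIR on Q15 data:
 *   r[j] = (sum over i of x[i + j] * h[i]) >> 15
 * x must hold nr + nh - 1 samples, h holds nh coefficients,
 * r receives nr output samples. */
void fir_ref(const short *x, const short *h, short *r, int nh, int nr)
{
    int i, j;

    for (j = 0; j < nr; j++) {
        int sum = 0;                    /* 32-bit accumulator */
        for (i = 0; i < nh; i++)
            sum += x[i + j] * h[i];     /* multiply-accumulate */
        r[j] = (short)(sum >> 15);      /* scale back to 16 bits */
    }
}
```

With NR output samples and NH coefficients, the kernel performs NR × NH multiply-accumulates, which is the workload behind the cycle counts reported for each (NR, NH) pair.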

For C66, I did the following:

  • Import the above-mentioned project into CCS and build it (it is already a C66 project, so no further steps are required).
  • Additional info: the whole application is located in L2SRAM, and I build in release mode (optimization level -O3).
  • To measure CPU cycles, I use the TSCH and TSCL registers.
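The cycle measurement with TSCL/TSCH can be sketched as follows. This is an illustration, not the code from the example project: the helper names (tsc_start, tsc_join, tsc_read, tsc_elapsed) are hypothetical. It relies on two properties of the C66x time-stamp counter: the counter starts on the first write to TSCL, and reading TSCL latches the upper half into TSCH, so TSCL must be read first. Outside the C6000 compiler the read is stubbed out so that the arithmetic can be checked on a host.

```c
#include <stdint.h>

#ifdef _TMS320C6X
#include <c6x.h>    /* declares the TSCL and TSCH control registers */
#endif

/* Start the free-running counter (only has an effect on the C66x). */
void tsc_start(void)
{
#ifdef _TMS320C6X
    TSCL = 0;       /* any write starts the counter; the value is ignored */
#endif
}

/* Join the two 32-bit register halves into one 64-bit cycle count. */
uint64_t tsc_join(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}

/* Read the counter. Host builds return 0 (stub, assumption for testing). */
uint64_t tsc_read(void)
{
#ifdef _TMS320C6X
    uint32_t lo = TSCL;     /* read low half first: this latches TSCH */
    uint32_t hi = TSCH;
    return tsc_join(hi, lo);
#else
    return 0;
#endif
}

/* Cycles between two readings, minus the cost of taking the readings.
 * Clamps at 0 if the overhead estimate exceeds the raw difference. */
uint64_t tsc_elapsed(uint64_t start, uint64_t stop, uint64_t overhead)
{
    uint64_t raw = stop - start;
    return raw > overhead ? raw - overhead : 0;
}
```

A typical use is to call tsc_start() once, estimate the overhead from two back-to-back tsc_read() calls, and then wrap the kernel call between two further reads.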

This produces the following output (only the last lines are shown, where the number of output samples, NR, is largest):

DSP_fir_r8	Iter#: 121	Result Successful (r_i)	NR = 124	NH = 8	natC: 466	optC: 443
DSP_fir_r8	Iter#: 122	Result Successful (r_i)	NR = 124	NH = 16	natC: 719	optC: 569
DSP_fir_r8	Iter#: 123	Result Successful (r_i)	NR = 124	NH = 24	natC: 987	optC: 695
DSP_fir_r8	Iter#: 124	Result Successful (r_i)	NR = 124	NH = 32	natC: 1255	optC: 821
DSP_fir_r8	Iter#: 125	Result Successful (r_i)	NR = 128	NH = 8	natC: 429	optC: 444
DSP_fir_r8	Iter#: 126	Result Successful (r_i)	NR = 128	NH = 16	natC: 701	optC: 572
DSP_fir_r8	Iter#: 127	Result Successful (r_i)	NR = 128	NH = 24	natC: 973	optC: 700
DSP_fir_r8	Iter#: 128	Result Successful (r_i)	NR = 128	NH = 32	natC: 1245	optC: 828

For C71, I did the following:

  • Create a new CCS project for C71 using the corresponding C7000 compiler (v2.1.1 from PSDK RTOS v08.02).
  • Copy DSP_fir_r8_d.c into the project and use it as main.c.
  • Copy the .c and .h files that contain the FIR calculation functions (DSP_fir_r8_cn.c/.h, DSP_fir_r8.c/.h).
  • Remove the include of c6x.h and use c7x.h instead.
  • Add c6x_migration.h to migrate the C66 code to C71.
  • Add a linker command file that likewise places the whole application in L2 memory (L2RAM_C7x_1, located at 0x64800000).
  • Build in release mode, i.e. with optimization level -O3 and -mf5, which optimizes for speed.
  • To measure CPU cycles, I use the __TSC register.
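On the C71 the counter is read as a single 64-bit value, so no joining of register halves is needed. A minimal sketch, assuming __TSC is made available by c7x.h as in the steps above; the helper names c7x_cycles_now and c7x_cycles_between are hypothetical, and host builds use a stub that returns 0.

```c
#include <stdint.h>

#ifdef __C7000__
#include <c7x.h>    /* assumed to expose the 64-bit __TSC control register */
#endif

/* Current cycle count. Host builds return 0 (stub, assumption for testing). */
uint64_t c7x_cycles_now(void)
{
#ifdef __C7000__
    return (uint64_t)__TSC;
#else
    return 0;
#endif
}

/* Cycles between two readings. Unsigned subtraction stays correct even
 * if the counter wrapped once between the two reads. */
uint64_t c7x_cycles_between(uint64_t start, uint64_t stop)
{
    return stop - start;
}
```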

This produces the following output (only the last lines are shown, where the number of output samples, NR, is largest):

DSP_fir_r8	Iter#: 121	Result Successful (r_i)	NR = 124	NH = 8	natC: 2046	optC: 2055
DSP_fir_r8	Iter#: 122	Result Successful (r_i)	NR = 124	NH = 16	natC: 2046	optC: 2954
DSP_fir_r8	Iter#: 123	Result Successful (r_i)	NR = 124	NH = 24	natC: 4020	optC: 3865
DSP_fir_r8	Iter#: 124	Result Successful (r_i)	NR = 124	NH = 32	natC: 4020	optC: 4768
DSP_fir_r8	Iter#: 125	Result Successful (r_i)	NR = 128	NH = 8	natC: 2108	optC: 1784
DSP_fir_r8	Iter#: 126	Result Successful (r_i)	NR = 128	NH = 16	natC: 2108	optC: 2694
DSP_fir_r8	Iter#: 127	Result Successful (r_i)	NR = 128	NH = 24	natC: 4147	optC: 3586
DSP_fir_r8	Iter#: 128	Result Successful (r_i)	NR = 128	NH = 32	natC: 4147	optC: 4482
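To put a number on the gap, the optC columns of the two listings can be compared row by row. The sketch below uses only the cycle counts quoted above; the struct and function names are hypothetical. For these eight rows the C71 takes roughly 4 to 6 times as many cycles as the C66.

```c
#include <stdio.h>

/* One row per (NR, NH) pair, optC cycle counts taken from the listings above. */
struct fir_row {
    int nr;
    int nh;
    int c66_opt;
    int c71_opt;
};

const struct fir_row fir_rows[] = {
    {124,  8, 443, 2055},
    {124, 16, 569, 2954},
    {124, 24, 695, 3865},
    {124, 32, 821, 4768},
    {128,  8, 444, 1784},
    {128, 16, 572, 2694},
    {128, 24, 700, 3586},
    {128, 32, 828, 4482},
};

/* Slowdown factor of C71 relative to C66, scaled by 100 and truncated,
 * so that integer arithmetic is enough (e.g. 250 means 2.50x). */
int slowdown_x100(int c71_cycles, int c66_cycles)
{
    return (c71_cycles * 100) / c66_cycles;
}

/* Print one line per row with the C71/C66 ratio. */
void print_slowdowns(void)
{
    size_t i;
    size_t n = sizeof fir_rows / sizeof fir_rows[0];

    for (i = 0; i < n; i++) {
        int s = slowdown_x100(fir_rows[i].c71_opt, fir_rows[i].c66_opt);
        printf("NR=%d NH=%d  C71/C66 = %d.%02dx\n",
               fir_rows[i].nr, fir_rows[i].nh, s / 100, s % 100);
    }
}
```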

Q1: Why does the C71 need so many more CPU cycles than the C66? Is some optimization missing from my compiler settings? I tried to keep the comparison as fair as possible (application in L2 memory on both DSPs, same optimization level, same code).

Q2: On the C71, why does the natural C code (natC) outperform the optimized version (optC) in most cases?

Thanks for your help and best regards,

Felix