Good day!
I am using the EVMC6678L.
I am having trouble with the performance of complex multiplication (working with float).
For example:
#define N_ 2048
#define M_ 192
#pragma DATA_ALIGN(bufA, 8);
#pragma DATA_SECTION(bufA, "SHARED_MEM");
float bufA[M_][N_*complex_size];
#pragma DATA_ALIGN(bufC, 8);
#pragma DATA_SECTION(bufC,"SHARED_MEM");
float bufC[M_][M_*complex_size];
......
DSPF_sp_dotp_cplx(&bufA[0][0], &bufA[0][0], N_, &bufC[0][0], &bufC[0][1]);
This call takes about 1.04 cycles per iteration (one complex multiply per iteration) when I use the precompiled library. The expected performance is about 0.75-0.8 cycles per iteration (from the TI test benchmark).
However, to solve my task I had to change the source code of this function slightly (replacing _complex_mpysp with _complex_conjugate_mpysp).
When built from source (not from the library), it takes 2.56 cycles per iteration (with optimization -O2 and -O3). This result was obtained with the original source code of DSPF_sp_dotp_cplx.
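To make the change concrete, here is a plain C sketch of what I expect the modified function to compute. It is only a reference for checking results, not the optimized DSPLIB source. The name ref_dotp_cplx_conj is made up for illustration. I am assuming interleaved storage (real part at even indices, imaginary part at odd indices) and that the conjugate is applied to the second operand; which operand _complex_conjugate_mpysp actually conjugates should be checked against the compiler documentation.

```c
/* Plain C reference for a conjugate complex dot product.
 * Hypothetical helper, NOT the optimized DSPLIB kernel.
 * Assumptions: x and y hold nx complex values stored as
 * interleaved {re, im} pairs, and the conjugate is taken
 * on the second operand y. */
void ref_dotp_cplx_conj(const float *x, const float *y, int nx,
                        float *re, float *im)
{
    float real = 0.0f;
    float imag = 0.0f;
    int i;

    for (i = 0; i < nx; i++) {
        float xr = x[2 * i];
        float xi = x[2 * i + 1];
        float yr = y[2 * i];
        float yi = y[2 * i + 1];

        /* (xr + j*xi) * (yr - j*yi) */
        real += xr * yr + xi * yi;
        imag += xi * yr - xr * yi;
    }

    *re = real;
    *im = imag;
}
```

For example, calling it with x = y = {1, 2, 3, 4} and nx = 2 (that is, 1+2j and 3+4j) should give re = 30 and im = 0, since a vector dotted with its own conjugate is the sum of squared magnitudes.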
So, my main question: could you explain how to build the optimized code correctly? Another option would be to rebuild the whole DSPLIB with my changes, but I don't know how to do this.
Thanks!