I am benchmarking some applications on the C6678 and have written some vector math routines that DSPLIB does not provide, to speed things up. One routine is a complex vector multiply. Unfortunately, my multiply is 40% slower than DSPLIB's FFT, and I don't understand how that can be, given that the multiply is 6N flops and the FFT is 5N log N flops. Is the following code the best you can do for a complex vector multiply (where storage is re/im/re/im, not im/re/im/re)?
#pragma MUST_ITERATE(2,,2)
#pragma UNROLL(2)
for (i=0; i<n; i++) {
    __float2_t dv = _complex_mpysp(_amemd8(&da[i]),_amemd8(&db[i]));
    _amemd8(&dc[i]) = _ftod(_lof(dv),-_hif(dv));
}
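For checking the intrinsic loop's output, and to make the 6N flop count concrete, a plain C reference can be sketched as below. The name cvmul_ref and its signature are hypothetical (not part of DSPLIB), and it assumes the same interleaved re/im float storage; it is only a baseline to compare results against, not a tuned C66x routine.

```c
#include <stddef.h>

/* Hypothetical reference routine (name and signature are illustrative only):
   c[k] = a[k] * b[k] for n complex values stored as interleaved re/im floats. */
void cvmul_ref(const float *a, const float *b, float *c, size_t n)
{
    size_t k;
    for (k = 0; k < n; k++) {
        float ar = a[2 * k], ai = a[2 * k + 1];
        float br = b[2 * k], bi = b[2 * k + 1];
        /* 4 multiplies + 2 adds/subtracts = 6 flops per complex element */
        c[2 * k]     = ar * br - ai * bi;  /* real part */
        c[2 * k + 1] = ar * bi + ai * br;  /* imaginary part */
    }
}
```

Calling cvmul_ref on the same input buffers and comparing against the intrinsic version's output is one way to confirm the re/im ordering and sign handling before looking at cycle counts.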
Here is my compile line, with some extraneous stuff cut out:
"C:/ti/C6000 Code Generation Tools 7.4.2/bin/cl6x" -mv6600 -c -mv6600 --abi=eabi -k -O3 --define=C66_PLATFORMS --display_error_number --diag_warning=225 -Dxdc_target_types__="ti/targets/elf/std.h" -Dxdc_target_name__=C66 ../../source/ti_c6678/e_cvmul.c --output_file=e_cvmul.obj
And here is the ASM that is generated: