Speed of vector complex multiply using Intrinsics

James Steed

Expert 1905 points

I am benchmarking some applications on C6678 and have written some vector math routines not provided by DSPLIB to speed up performance. One routine is a complex vector multiply. Unfortunately my code for the multiply is slower than DSPLIB's FFT by 40%, and I don't understand how that could be given the multiply is 6N flops and the FFT is 5NlogN flops. Is the following code the best you can do for a complex vector multiply (where storage is re/im/re/im not im/re/im/re)?

#pragma MUST_ITERATE(2,,2)
#pragma UNROLL(2)
for (i=0; i<n; i++) {
    __float2_t dv = _complex_mpysp(_amemd8(&da[i]),_amemd8(&db[i]));
    _amemd8(&dc[i]) = _ftod(_lof(dv),-_hif(dv));
}
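For checking results from a loop like the one above, a plain C reference for the same operation can be handy. The sketch below assumes interleaved re/im float storage as described in the question. The name cvmul_ref is hypothetical, and it uses no TI intrinsics, so it is a correctness baseline rather than an optimized routine. It also makes the 6-flops-per-element count visible: four multiplies and two add/subtracts.

```c
#include <stddef.h>

/* Hypothetical reference routine (not from DSPLIB): c[i] = a[i] * b[i]
 * for n complex values stored interleaved as re, im, re, im, ... */
void cvmul_ref(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Load both operands before storing, so c may also be a or b. */
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];

        c[2 * i]     = ar * br - ai * bi;  /* real: 2 multiplies, 1 subtract */
        c[2 * i + 1] = ar * bi + ai * br;  /* imag: 2 multiplies, 1 add */
    }
}
```

Comparing the intrinsic loop's output against this routine on a small input is a quick way to confirm that the swap and sign handling matches the intended storage order.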

Here is my compile line with some extraneous stuff cut out

"C:/ti/C6000 Code Generation Tools 7.4.2/bin/cl6x" -mv6600 -c -mv6600 --abi=eabi -k -O3 --define=C66_PLATFORMS --display_error_number --diag_warning=225 -Dxdc_target_types__="ti/targets/elf/std.h" -Dxdc_target_name__=C66 ../../source/ti_c6678/e_cvmul.c --output_file=e_cvmul.obj

And here is the ASM which is generated

5722.e_cvmul.asm

over 10 years ago

0 Alberto Chessa over 10 years ago

Mastermind 6650 points

Hi,

With a similar routine, I measured 216 ticks to multiply two vectors of 100 complex values. My build generates a software-pipelined loop with 8 iterations in parallel, while yours gives only 2.

Try adding the compilation options "--opt_for_speed=5" and "--speculate_unknown_loads" (this generates a warning).

My test routine is:

unsigned int v_complex_mul_test(int n, const my_complex_t* const restrict a, const my_complex_t* const restrict b, my_complex_t* const restrict r1)
{
    unsigned int start=TSCL;
    _nassert((n % 2) == 0);
    _nassert((n>=2));
    _nassert((int)(a) % 8 == 0);
    _nassert((int)(b) % 8 == 0);
    _nassert((int)(r1) % 8 == 0);

    for(int i=0; i<n; ++i)
    {
        double res=_complex_mpysp(*(double*)&a[i], *(double*)&b[i]);
        r1[i].im=_hif(res);
        r1[i].re=_lof(res);
    }
    unsigned int t=TSCL-start;
    return t;
}
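The _nassert calls in the routine above are promises to the optimizer rather than run-time checks, so a caller that breaks them can get wrong results without any diagnostic. A minimal sketch of a debug-time check for the same conditions (even count, at least two elements, 8-byte alignment) follows in portable C. The helper name cvmul_args_ok is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when the arguments satisfy the conditions
 * asserted to the compiler in the routine above, 0 otherwise. */
int cvmul_args_ok(int n, const void *a, const void *b, const void *r)
{
    if (n < 2 || (n % 2) != 0)
        return 0;                       /* count must be even and >= 2 */
    if ((uintptr_t)a % 8 != 0 ||
        (uintptr_t)b % 8 != 0 ||
        (uintptr_t)r % 8 != 0)
        return 0;                       /* each pointer must be 8-byte aligned */
    return 1;
}
```

Calling such a check in debug builds before the timed routine keeps the optimized loop free of extra work while still catching misaligned or odd-length inputs during testing.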

0 Asheesh Bhardwaj over 10 years ago in reply to Alberto Chessa

TI__Expert 4680 points

Please refer to the two posts below, which will help you understand how to handle complex arithmetic. The FFT library takes data in Re/Im form, and all the DSP library functions use that as the standard format.

http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/p/325684/1135801.aspx#1135801

http://e2e.ti.com/support/embedded/tirtos/f/355/p/305811/1082832.aspx#1082832

In little-endian mode, the _complex_mpysp intrinsic requires a swap and negate to work with the real/imag format. All the DSPLIB functions handle this, so you may want to refer to them. When using little-endian mode, try to hide that cost behind other operations.

It is not clear from the code whether you are using restrict pointers to help the compiler optimize the memory accesses. Refer to the optimization guide for simple optimization tricks.

Your for loop should contain at least two cmpysp instructions in order to get a schedule of ii = 3 and better compiler pipelining.

Regards,

Asheesh.

0 James Steed over 10 years ago in reply to Asheesh Bhardwaj

Expert 1905 points

Thanks to you both for your responses. It seems that I wasn't getting the loop unrolling I need (as Alberto noticed) without using restrict pointers (as Asheesh recommended).

One question about restrict

If my new function is e_cvmul(complex * restrict a,complex * restrict b, complex * restrict c,int n) that does c[i] = a[i] * b[i], does restrict also tell the compiler that a != c? I want to write a function that can be invoked for the cases where a==c, b==c, or a==b and the documentation on restrict makes me think I am telling the compiler that cannot be true. Do I have to write different loops for the different cases? Or if the memory access is linear and simple, do I not have to worry about such things and it should work?

0 Asheesh Bhardwaj over 10 years ago in reply to James Steed

TI__Expert 4680 points

Refer to section 6.5.5 of spru187u for details of the restrict keyword. The restrict qualifier tells the compiler that the memory regions the pointers point to never overlap.

You can also refer to the optimization guides SPRABG7 and SPRABF2 for restrict pointer usage.

Regards,

Asheesh
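To make the aliasing question concrete: under the C99 rules, calling a function whose parameters are all restrict-qualified with c equal to a or b breaks the promise, because the elements being stored are also read through another restrict pointer, and the behavior is then undefined. Two read-only inputs that alias each other (a == b) are fine. One option, sketched below in portable C with hypothetical names, is to keep a restrict version for distinct buffers and route the in-place cases to a variant without restrict. How well the TI compiler pipelines the non-restrict variant would have to be checked in the generated assembly.

```c
#include <stddef.h>

typedef struct { float re, im; } cplx_t;   /* hypothetical re/im layout */

/* In-place-safe variant: no restrict, so c may equal a and/or b. */
void cvmul_may_alias(const cplx_t *a, const cplx_t *b, cplx_t *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        cplx_t x = a[i], y = b[i];          /* read both inputs first */
        c[i].re = x.re * y.re - x.im * y.im;
        c[i].im = x.re * y.im + x.im * y.re;
    }
}

/* Restrict variant: the caller promises c overlaps neither a nor b. */
void cvmul_no_alias(const cplx_t *restrict a, const cplx_t *restrict b,
                    cplx_t *restrict c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        c[i].re = a[i].re * b[i].re - a[i].im * b[i].im;
        c[i].im = a[i].re * b[i].im + a[i].im * b[i].re;
    }
}

/* Hypothetical dispatcher. Assumes buffers are either identical or fully
 * disjoint; partially overlapping buffers are not handled. */
void cvmul_dispatch(const cplx_t *a, const cplx_t *b, cplx_t *c, size_t n)
{
    if (c == a || c == b)
        cvmul_may_alias(a, b, c, n);
    else
        cvmul_no_alias(a, b, c, n);
}
```

Because each element is fully loaded before it is stored, the non-restrict variant gives correct results for a == c, b == c, and a == b, at the possible cost of a less aggressive schedule.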
