Speed of vector complex multiply using Intrinsics

I am benchmarking some applications on C6678 and have written some vector math routines not provided by DSPLIB to speed up performance. One routine is a complex vector multiply. Unfortunately my code for the multiply is slower than DSPLIB's FFT by 40%, and I don't understand how that could be given the multiply is 6N flops and the FFT is 5NlogN flops. Is the following code the best you can do for a complex vector multiply (where storage is re/im/re/im not im/re/im/re)?

#pragma MUST_ITERATE(2,,2)
#pragma UNROLL(2)
for (i=0; i<n; i++) {
  __float2_t dv = _complex_mpysp(_amemd8(&da[i]),_amemd8(&db[i]));
  _amemd8(&dc[i]) = _ftod(_lof(dv),-_hif(dv));
}

Here is my compile line, with some extraneous stuff cut out:

"C:/ti/C6000 Code Generation Tools 7.4.2/bin/cl6x" -mv6600 -c  -mv6600 --abi=eabi -k -O3 --define=C66_PLATFORMS --display_error_number --diag_warning=225  -Dxdc_target_types__="ti/targets/elf/std.h" -Dxdc_target_name__=C66 ../../source/ti_c6678/e_cvmul.c --output_file=e_cvmul.obj

And here is the ASM that is generated:

5722.e_cvmul.asm
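For a sense of scale, the two operation counts can be compared at a sample size. Taking N = 1024 and a base-2 logarithm as illustrative assumptions:

```latex
\[
6N = 6 \cdot 1024 = 6144,
\qquad
5N\log_2 N = 5 \cdot 1024 \cdot 10 = 51200,
\qquad
\frac{6N}{5N\log_2 N} = \frac{6}{50} = 0.12
\]
```

On arithmetic alone the multiply is about 12% of the FFT at this size, which suggests the measured gap comes from how the loop is scheduled or how memory is accessed rather than from the flop count.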

  • Hi,

    With a similar routine, I measured 216 ticks to multiply two vectors of 100 complex values. My build generates a software-pipelined loop with 8 iterations in parallel, while yours gives only 2.

    Try adding the compilation options "-opt_for_speed=5" and "-speculate_unknown_loads" (this generates a warning).

    My test routine is:

    unsigned int v_complex_mul_test(int n, const my_complex_t* const restrict a, const my_complex_t* const restrict b, my_complex_t* const restrict r1)
    {
      unsigned int start=TSCL;
      _nassert((n % 2) == 0);
      _nassert((n>=2));
      _nassert((int)(a) % 8 == 0);
      _nassert((int)(b) % 8 == 0);
      _nassert((int)(r1) % 8 == 0);

      for(int i=0; i<n; ++i)
      {
        double res=_complex_mpysp(*(double*)&a[i], *(double*)&b[i]);
        r1[i].im=_hif(res);
        r1[i].re=_lof(res);
      }
      unsigned int t=TSCL-start;
      return t;
    }

  • Please refer to the two posts below, which should help you understand how to handle complex arithmetic. The FFT library takes data in Re/Im form, and all the DSP library functions use that as the standard format.

    http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/p/325684/1135801.aspx#1135801

    http://e2e.ti.com/support/embedded/tirtos/f/355/p/305811/1082832.aspx#1082832

    In little-endian mode, the _complex_mpysp intrinsic requires a swap and negate to work with Real/Imag format. All the DSPLIB functions handle this, so you may want to refer to them. Try to hide that cost behind other operations when using little-endian mode.

    It is not clear from the code whether you are using restrict pointers to help the compiler optimize the memory accesses. Refer to the optimization guide for simple optimization tricks.

    Your for loop should contain at least two cmpysp instructions in order to get a schedule of ii = 3 and better software pipelining from the compiler.

    Regards,

    Asheesh.
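When reworking the intrinsic loop (swap, negate, two multiplies per iteration), a plain C reference is useful for checking the output. The sketch below is not DSPLIB code and uses no TI intrinsics. The names `cplx_t` and `cvmul_ref` are hypothetical, and it assumes re/im storage with an even element count. It handles two elements per iteration only to mirror the shape suggested above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: one complex sample stored as re, im (re at the lower address). */
typedef struct { float re; float im; } cplx_t;

/* Reference complex vector multiply in plain C: c[i] = a[i] * b[i].
 * Two elements are processed per iteration, so n must be even.
 * The restrict qualifiers promise that c does not overlap a or b. */
void cvmul_ref(const cplx_t *restrict a, const cplx_t *restrict b,
               cplx_t *restrict c, size_t n)
{
    assert(n % 2 == 0);
    for (size_t i = 0; i < n; i += 2) {
        /* element i: (ar + j*ai) * (br + j*bi) */
        c[i].re = a[i].re * b[i].re - a[i].im * b[i].im;
        c[i].im = a[i].re * b[i].im + a[i].im * b[i].re;

        /* element i + 1 */
        c[i + 1].re = a[i + 1].re * b[i + 1].re - a[i + 1].im * b[i + 1].im;
        c[i + 1].im = a[i + 1].re * b[i + 1].im + a[i + 1].im * b[i + 1].re;
    }
}
```

Running the intrinsic version and this reference on the same input and comparing the results element by element should show quickly whether the swap and negate ended up on the right halves.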

  • Thanks both for your responses. It seems that I wasn't getting the loop unrolling I need (as Alberto noticed) without using restrict pointers (as Asheesh recommended).

    One question about restrict:

    If my new function is e_cvmul(complex * restrict a,complex * restrict b, complex * restrict c,int n) that does c[i] = a[i] * b[i], does restrict also tell the compiler that a != c? I want to write a function that can be invoked for the cases where a==c, b==c, or a==b and the documentation on restrict makes me think I am telling the compiler that cannot be true. Do I have to write different loops for the different cases? Or if the memory access is linear and simple, do I not have to worry about such things and it should work?

  • Refer to section 6.5.5 of spru187u for details on the restrict keyword. The restrict qualifier tells the compiler that the memory regions accessed through the qualified pointers never overlap.

    You can also refer to the optimization guides SPRABG7 and SPRABF2 for examples of restrict pointer usage.

    Regards,

    Asheesh
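One way to support the in-place cases asked about above is to keep two loops and choose between them at run time. The following is a minimal sketch in portable C, not TI-specific code. All names (`cplx_t`, `cvmul`, `cvmul_noalias`, `cvmul_general`) are hypothetical, and it assumes the arrays are either identical or fully disjoint, never partially overlapping. As I read the C99 rules, restrict only matters for objects that are written, so a == b (both read-only) is fine on the restrict path, while c == a or c == b is not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: one complex sample stored as re, im. */
typedef struct { float re; float im; } cplx_t;

/* Fast path: c is promised not to overlap a or b. a and b are only read,
 * so they may refer to the same array. */
static void cvmul_noalias(const cplx_t *restrict a, const cplx_t *restrict b,
                          cplx_t *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        c[i].re = a[i].re * b[i].re - a[i].im * b[i].im;
        c[i].im = a[i].re * b[i].im + a[i].im * b[i].re;
    }
}

/* General path: no restrict, so c may be the same array as a or b.
 * Both inputs are copied to locals before anything is stored. */
static void cvmul_general(const cplx_t *a, const cplx_t *b,
                          cplx_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const cplx_t x = a[i];
        const cplx_t y = b[i];
        c[i].re = x.re * y.re - x.im * y.im;
        c[i].im = x.re * y.im + x.im * y.re;
    }
}

/* Dispatcher: use the restrict loop only when the output is a separate array. */
void cvmul(const cplx_t *a, const cplx_t *b, cplx_t *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    if (c == a || c == b)
        cvmul_general(a, b, c, n);
    else
        cvmul_noalias(a, b, c, n);
}
```

The pointer comparison costs little next to the loop itself, and the common out-of-place case still gets the aliasing promise the compiler needs for software pipelining. Whether the general loop pipelines well on a given compiler would have to be checked in the generated assembly.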