Complex multiplication and intrinsics



Hi,

I'm trying to see the benefits of the instruction set of the C6678.

I took an example: I want to multiply two complex numbers a and b.

My first version is basic:

c.re = a.re*b.re-a.im*b.im;

c.im = a.re*b.im+b.re*a.im;

This code takes 29 CPU cycles.

So... I wanted to use the intrinsics; I thought it would be quicker:

prod = _cmpysp(_ftof2(a.re,a.im),_ftof2(b.re,b.im));

c.re = _hif(_hid128(prod)) + _hif(_lod128(prod));

c.im = _lof(_hid128(prod)) + _lof(_lod128(prod));

And this code takes 43 CPU cycles.

So... Do I use the intrinsics the wrong way? Or does it depend on what we want to do?

Thanks,

Alex

  • Please see this post. I recommend you do something similar, except look for the intrinsic _cmpysp.

    Thanks and regards,

    -George

  • Hi,

    It seems to me that your use of the intrinsic is OK, but for this kind of operation I measure very different performance:

    • non-intrinsic: 18 cycles
    • intrinsic: 17 cycles
    • intrinsic with _complex_mpysp: 17 cycles

    The measurements were taken on the EVM L2RAM (the simulator gives the same results), and they include the register load and store operations. Note that you have to execute twice and take the last result; otherwise you could also be measuring the caching time.

    Have you activated all the optimizations? I mean --opt_level=3, and also --opt_for_speed=5 and --optimize_with_debug.

    How do you measure the cycles? If you use the CCS clock tool with breakpoints, note that you have a penalty due to the CPU pipeline (up to 10 cycles).

  • Hi,

    Thank you both for your answers. I prefer your answer, Alberto, because it seems like we "have to" write very long code to get good performance.

    So... The optimizations were activated, except for --opt_for_speed=5.

    I measure the cycles with the timer described in sprabf2.

    I don't understand now: I activated the three optimizations you mentioned, and now it tells me 0 cycles... (both ways, with intrinsics or without). Do you know why?

  • Alexandre NGUYEN said:

    I don't understand now: I activated the three optimizations you mentioned, and now it tells me 0 cycles... (both ways, with intrinsics or without). Do you know why?

    I suppose your code doesn't use the results: the compiler is very smart, and when it detects useless code it removes it completely.

    To force the compiler to generate the code, use a global (non-static) variable as the destination.

    I use a function like this:

    unsigned int complex_simple_test(const my_complex_t a, const my_complex_t b, my_complex_t& r1)
    {
      unsigned int start = TSCL;  /* TSCL: free-running time-stamp counter, low word */
      r1.re = a.re*b.re - a.im*b.im;
      r1.im = a.re*b.im + b.re*a.im;
      unsigned int t = TSCL - start;  /* cycles elapsed across the multiply */
      return t;
    }

    I don't calibrate, but with a high optimization level the TSCL loads can be placed in parallel with other operations (looking at the asm code, I can see that my overhead should be 0).

  • Thank you for your answer. I declared my variable "c" as a global variable and I got 14 cycles with intrinsics and 15 cycles without.

    One last question... With this little test we see that the multiplication with intrinsics is quicker than without.

    So, if I change all my complex multiplications in a bigger program, I should get better performance, right?

    That seems logical to me, but I did it in a function in another program, and it's quicker without the intrinsics. Is it because of some loads and stores, or maybe something else?

    Thanks

  • Alexandre NGUYEN said:

    One last question... With this little test we see that the multiplication with intrinsics is quicker than without.

    So, if I change all my complex multiplications in a bigger program, I should get better performance, right?

    Yes, in this case I think so, but since there is a very small difference (only 1 cycle), maybe you won't be able to appreciate the gain, for instance if the data is never ready in the cache.

    If you simply put the intrinsic version in a loop, you can appreciate the difference. On vectors of 100 complex numbers:

    • non-intrinsic: 515 cycles (5.1 cycles per multiplication)
    • intrinsic: 417 cycles (4.1 cycles per multiplication; still only 1 cycle faster)

    Anyway, in general I find the compiler very smart, so sometimes it is easy to write an intrinsic version that is slower than the non-intrinsic one. As you say, there is some dependency on the data format, memory alignment, usage of the delay slots, and so on.

  • Well, thank you. In fact, I was testing an LU decomposition algorithm on a 100x100 matrix.

    There are two loops in it. In the first loop there is a division, and in the second there's a multiplication.

    By replacing the non-intrinsic version of my function with the intrinsic version, I get a difference of 1.5 million cycles, which is quite large. That's why I posted the topic in the first place. I think I'll have to use the non-intrinsic version to get better performance; otherwise it means I didn't code properly (I think).

    Meanwhile, I can't explain in my report (I'm doing an internship) why I get this result, despite the fact that the little test shows it's quicker with intrinsics.