Complex multiplication and intrinsics



Hi,

I'm trying to see the benefits of the instruction set of the C6678.

I took an example: I want to multiply two complex numbers a and b.

My first version is basic:

c.re = a.re*b.re-a.im*b.im;

c.im = a.re*b.im+b.re*a.im;

This code takes 29 CPU cycles.

So... I wanted to use the intrinsics; I thought it would be quicker:

prod = _cmpysp(_ftof2(a.re,a.im),_ftof2(b.re,b.im));

c.re = _hif(_hid128(prod)) + _hif(_lod128(prod));

c.im = _lof(_hid128(prod)) + _lof(_lod128(prod));

And this code takes 43 CPU cycles.

So... Do I use the intrinsics the wrong way? Or does it depend on what we want to do?

Thanks,

Alex

  • Please see this post. I recommend you do something similar, except look for the intrinsic _cmpysp.

    Thanks and regards,

    -George

  • Hi,

    It seems to me that your use of the intrinsic is OK, but for this kind of operation I measure very different performance:

    • non-intrinsic: 18 cycles
    • intrinsic: 17 cycles
    • intrinsic with _complex_mpysp: 17 cycles

    The measurements were taken on the EVM L2RAM (the simulator gives the same results), and they include the register load and store operations. Note that you have to execute twice and take the last result; otherwise you could also be measuring the caching time.

    Have you activated all the optimizations? I mean --opt_level=3, and also --opt_for_speed=5 and --optimize_with_debug.

    How do you measure the cycles? If you use the CCS clock tool with breakpoints, note that you have a penalty due to the CPU pipeline (up to 10 cycles).

  • Hi,

    Thank you both for your answers. I prefer your answer, Alberto, because it seems like we "have to" write very long code to get good performance.

    So... The optimizations were activated, except for --opt_for_speed=5.

    I measure the cycles with the timer described in sprabf2.

    I don't understand now: I activated the three optimizations you mentioned, and now it tells me 0 cycles... (both ways, with intrinsics or without). Do you know why?

  • Alexandre NGUYEN said:

    I don't understand now: I activated the three optimizations you mentioned, and now it tells me 0 cycles... (both ways, with intrinsics or without). Do you know why?

    I suppose your code doesn't use the results: the compiler is very smart, and when it detects useless code it removes it completely.

    To force the compiler to generate the code, use a global (non-static) variable as the destination.

    I use a function like this:

    unsigned int complex_simple_test(const my_complex_t a, const my_complex_t b, my_complex_t& r1)
    {
      unsigned int start = TSCL;  /* TSCL: free-running time-stamp counter, low word */
      r1.re = a.re*b.re - a.im*b.im;
      r1.im = a.re*b.im + b.re*a.im;
      unsigned int t = TSCL - start;  /* cycles elapsed across the multiply */
      return t;
    }

    I don't calibrate, but with a high optimization level the TSCL loads can be placed in parallel with other operations (looking at the asm code, I can see that my overhead should be 0).

  • Thank you for your answer. I declared my variable "c" as a global variable and I got 14 cycles with intrinsics and 15 cycles without.

    One last question... With this little test we see that the multiplication with intrinsics is quicker than without.

    So, if I change all my complex multiplications in a bigger program, I should get better performance, right?

    That seems logical to me, but I did it in a function in another program, and it's quicker without the intrinsics. Is it because of some loads and stores, or maybe something else?

    Thanks

  • Alexandre NGUYEN said:

    One last question... With this little test we see that the multiplication with intrinsics is quicker than without.

    So, if I change all my complex multiplications in a bigger program, I should get better performance, right?

    Yes, in this case I think so, but since there is a very small difference (only 1 cycle), maybe you won't be able to appreciate the gain, for instance if the data is never ready in the cache.

    If you simply put the intrinsic version in a loop, you can appreciate the difference. On vectors of 100 complex numbers:

    • non-intrinsic: 515 cycles (5.1 cycles per multiplication)
    • intrinsic: 417 cycles (4.1 cycles per multiplication; still only 1 cycle faster)

    Anyway, in general I find the compiler very smart, so sometimes it is easy to write an intrinsic version that is slower than the non-intrinsic one. As you say, there is some dependency on the data format, memory alignment, usage of the delay slots, and so on.

  • Well, thank you. In fact, I was testing an LU decomposition algorithm on a 100x100 matrix.

    There are two loops in it. In the first loop there is a division, and in the second there's a multiplication.

    By replacing the non-intrinsic version of my function with the intrinsic version, I get a difference of 1.5 million cycles, which is quite large. That's why I posted the topic in the first place. I think I'll have to use the non-intrinsic version to get better performance; otherwise it means I didn't code properly (I think).

    Meanwhile, I can't explain in my report (I'm doing an internship) why I get this result, despite the fact that the little test shows it's quicker with intrinsics.