OMAPL138 DSP vs ARM performance - complex multiplication <floating point>

Matt M


Hello,

I have a small loop that I am using to compare performance between the ARM and the DSP on the OMAPL138. The complex number class is std::complex from the C++ standard library. The final code is meant to be portable, so I don't want to specialize much (ideally any) of the code for the DSP.

   const unsigned int NUM_SIGNAL_SAMPLES = 500;
   const unsigned int NUM_WEIGHTS = 31;
   const unsigned int NUM_ITERATIONS = 1000;

   std::complex<float> signal[NUM_SIGNAL_SAMPLES];
   std::complex<float> result[NUM_SIGNAL_SAMPLES - NUM_WEIGHTS + 1];
   float w[NUM_WEIGHTS];

   for (unsigned int j = 0; j < NUM_SIGNAL_SAMPLES; j++)
       signal[j] = std::complex<float>(j % 100, -float((j * 2) % 100));

   for (unsigned int k = 0; k < NUM_WEIGHTS; k++)
       w[k] = k * 1.25209;

   for (unsigned int i = 0; i < NUM_ITERATIONS; i++)
   {
       for (unsigned int j = 0; j < NUM_SIGNAL_SAMPLES - NUM_WEIGHTS + 1; j++)
       {
           std::complex<float> temp(0);
           for (unsigned int k = 0; k < NUM_WEIGHTS; k++)
               temp += w[k] * signal[j + k];
           result[j] = temp;
       }
   }

When I run said code on the ARM it takes about 10 seconds to complete, while when I run it on the DSP it takes about 5.5 seconds to complete. If I use doubles instead of floats it takes the ARM 17 seconds, and the DSP 6.5 seconds.

My main question is: does this performance make sense? I was expecting the DSP to be much faster than the ARM for computation like this because of its hardware floating-point support.
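For scale, the work in the loop can be counted directly from the constants above. The sketch below does that arithmetic. It assumes each `w[k] * signal[j + k]` plus the accumulate costs four floating-point operations (two multiplies and two adds), and the helper names `inner_iterations` and `mflops` are made up for illustration.

```cpp
#include <cassert>

// Number of inner-loop iterations in the benchmark:
// iterations * (samples - weights + 1) * weights.
unsigned long inner_iterations(unsigned long iterations,
                               unsigned long samples,
                               unsigned long weights)
{
    assert(samples >= weights);  // otherwise the output length is not defined
    return iterations * (samples - weights + 1) * weights;
}

// Average rate in MFLOP/s for a measured run time.
// flops_per_iteration is an assumption supplied by the caller
// (4.0 here: 2 multiplies + 2 adds per real-by-complex multiply-accumulate).
double mflops(unsigned long iters, double flops_per_iteration, double seconds)
{
    return iters * flops_per_iteration / seconds / 1.0e6;
}
```

With the float timings above, `inner_iterations(1000, 500, 31)` is 14,570,000, and under the four-operation assumption `mflops` gives roughly 10.6 MFLOP/s for the DSP (5.5 s) and 5.8 MFLOP/s for the ARM (10 s).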

If the performance of the DSP should be better, what should I do to increase it? I have gone through the cache settings and believe them to be correct (MAR bits set, cache areas set in the TCF file), and I have set compiler switches to increase performance (-mv6740 -O3 --opt_for_speed=5 --auto_inline=1000 --single_inline), but I hope the DSP can still run faster!

Thanks,

Matt

I have gone through the .asm file and I see that only the middle loop qualifies for software pipelining. Is this going to have a large effect on my run time?

over 12 years ago

Rahul Prabhu (TI Guru), over 12 years ago

Please refer to this application note for some easy tips to optimize your code.

http://www.ti.com/lit/an/sprabf2/sprabf2.pdf

Some key C674x Benchmarks are published here.

http://www.ti.com/lsds/ti/dsp/c6000_dsp/c674x/benchmarks.page

Have you set the language option in the C6000 compiler settings in CCS to indicate that you are compiling C++ code? Have you tried writing this in C to see whether performance improves? I recommend this because the BIOS code is primarily C based, so by introducing C++ you are essentially mixing code written in two different languages when you compile. Please refer to this wiki article for the performance penalties and for how the compiler expects C++ code to be used in a BIOS environment:

http://processors.wiki.ti.com/index.php/Overview_of_C%2B%2B_Support_in_TI_Compilers#BIOS_and_C.2B.2B
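One way to try the C-style suggestion while staying portable is to write the kernel over plain float arrays and compare it against the std::complex version. The following is only a sketch under assumptions: the function names `fir_complex` and `fir_split` and the split real/imaginary layout are hypothetical, and whether this form runs faster on the C674x has to be measured.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Reference: one pass of the loop from the question, using std::complex.
void fir_complex(const std::complex<float>* signal, const float* w,
                 std::complex<float>* result,
                 std::size_t num_samples, std::size_t num_weights)
{
    assert(num_weights > 0 && num_samples >= num_weights);
    for (std::size_t j = 0; j + num_weights <= num_samples; j++)
    {
        std::complex<float> temp(0);
        for (std::size_t k = 0; k < num_weights; k++)
            temp += w[k] * signal[j + k];
        result[j] = temp;
    }
}

// C-style variant: real and imaginary parts live in separate float arrays
// and are accumulated in two scalar floats, so no class operators are involved.
void fir_split(const float* sig_re, const float* sig_im, const float* w,
               float* out_re, float* out_im,
               std::size_t num_samples, std::size_t num_weights)
{
    assert(num_weights > 0 && num_samples >= num_weights);
    for (std::size_t j = 0; j + num_weights <= num_samples; j++)
    {
        float acc_re = 0.0f;
        float acc_im = 0.0f;
        for (std::size_t k = 0; k < num_weights; k++)
        {
            acc_re += w[k] * sig_re[j + k];
            acc_im += w[k] * sig_im[j + k];
        }
        out_re[j] = acc_re;
        out_im[j] = acc_im;
    }
}
```

Both functions write `num_samples - num_weights + 1` outputs, so timing each with the same inputs gives a like-for-like comparison of the two styles.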

Regards,

Rahul
