OMAPL138 DSP vs ARM performance - complex multiplication <floating point>

Hello,

I have a small loop that I am using to compare the performance of the ARM and the DSP on the OMAPL138. The complex number class is std::complex from the C++ standard library. The final code is meant to be portable, so I don't want to specialize much (ideally any) of the code for the DSP.

    const unsigned int NUM_SIGNAL_SAMPLES = 500;
    const unsigned int NUM_WEIGHTS = 31;
    const unsigned int NUM_ITERATIONS = 1000;

    std::complex<float> signal[NUM_SIGNAL_SAMPLES];
    std::complex<float> result[NUM_SIGNAL_SAMPLES - NUM_WEIGHTS + 1];
    float w[NUM_WEIGHTS];

    for (unsigned int j = 0; j < NUM_SIGNAL_SAMPLES; j++)
        signal[j] = std::complex<float>(j % 100, -float((j * 2) % 100));

    for (unsigned int k = 0; k < NUM_WEIGHTS; k++)
        w[k] = k * 1.25209;

    for (unsigned int i = 0; i < NUM_ITERATIONS; i++)
    {
        for (unsigned int j = 0; j < NUM_SIGNAL_SAMPLES - NUM_WEIGHTS + 1; j++)
        {
            std::complex<float> temp(0);
            for (unsigned int k = 0; k < NUM_WEIGHTS; k++)
                temp += w[k] * signal[j + k];
            result[j] = temp;
        }
    }
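
For reference, each `w[k] * signal[j + k]` in the inner loop is a real-by-complex product, so one step amounts to two real multiplies and two real adds. The sketch below writes the two inner loops out with separate real and imaginary accumulators. It is still plain portable C++ and should compute the same sums as the loop above. The function name `fir_real_weights` and its parameters are hypothetical, introduced only for illustration, and nothing here has been measured on the OMAPL138.

```cpp
#include <complex>
#include <cstddef>

// Hypothetical helper (not part of the original benchmark): computes
// result[j] = sum over k of w[k] * signal[j + k], with the real and
// imaginary parts accumulated separately.
void fir_real_weights(const std::complex<float>* signal, std::size_t num_samples,
                      const float* w, std::size_t num_weights,
                      std::complex<float>* result)
{
    // Nothing to compute if the window does not fit in the signal.
    if (num_weights == 0 || num_samples < num_weights)
        return;

    const std::size_t num_outputs = num_samples - num_weights + 1;
    for (std::size_t j = 0; j < num_outputs; ++j)
    {
        float acc_re = 0.0f;
        float acc_im = 0.0f;
        for (std::size_t k = 0; k < num_weights; ++k)
        {
            // A real weight scales both components independently.
            acc_re += w[k] * signal[j + k].real();
            acc_im += w[k] * signal[j + k].imag();
        }
        result[j] = std::complex<float>(acc_re, acc_im);
    }
}
```

With the arrays from the benchmark, the call would be `fir_real_weights(signal, NUM_SIGNAL_SAMPLES, w, NUM_WEIGHTS, result);` inside the outer iteration loop.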

When I run this code on the ARM, it takes about 10 seconds to complete; on the DSP it takes about 5.5 seconds. If I use doubles instead of floats, the ARM takes 17 seconds and the DSP takes 6.5 seconds.
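
To put those timings in per-operation terms, the loop executes 1000 × 470 × 31 = 14,570,000 inner steps, so 10 seconds works out to roughly 686 ns per step and 5.5 seconds to roughly 377 ns per step. A small sketch of that arithmetic follows; the helper names `total_inner_steps` and `ns_per_step` are made up for this example.

```cpp
#include <cstdio>

// Sizes copied from the benchmark above.
const unsigned int NUM_SIGNAL_SAMPLES = 500;
const unsigned int NUM_WEIGHTS = 31;
const unsigned int NUM_ITERATIONS = 1000;

// Total number of inner-loop steps; each step is one
// float * complex<float> multiply-accumulate.
unsigned long total_inner_steps()
{
    const unsigned long outputs = NUM_SIGNAL_SAMPLES - NUM_WEIGHTS + 1; // 470
    return static_cast<unsigned long>(NUM_ITERATIONS) * outputs * NUM_WEIGHTS;
}

// Average time per inner-loop step, in nanoseconds, for a measured
// wall-clock time given in seconds.
double ns_per_step(double seconds)
{
    return seconds * 1e9 / static_cast<double>(total_inner_steps());
}

// Prints the per-step cost for the four timings quoted in the post.
void print_report()
{
    std::printf("inner steps:       %lu\n", total_inner_steps());
    std::printf("ARM float  (10 s):  %.1f ns/step\n", ns_per_step(10.0));
    std::printf("DSP float  (5.5 s): %.1f ns/step\n", ns_per_step(5.5));
    std::printf("ARM double (17 s):  %.1f ns/step\n", ns_per_step(17.0));
    std::printf("DSP double (6.5 s): %.1f ns/step\n", ns_per_step(6.5));
}
```

Multiplying the per-step time by the core clock frequency gives an approximate cycle count per multiply-accumulate, which is the figure to compare against what the hardware should be capable of.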

My main question is: does this performance make sense? I was expecting the DSP to be much faster than the ARM for computation like this because of its hardware floating-point support.

If the DSP should be performing better, what should I do to speed it up? I have gone through the cache settings and believe them to be correct (MAR bits set, cache areas set in the TCF file), and I have set compiler switches to increase performance (-mv6740 -O3 --opt_for_speed=5 --auto_inline=1000 --single_inline), but I hope the DSP can still run faster!

Thanks,

Matt

I have gone through the .asm file and I see that only the middle loop is being qualified for software pipelining. Is this going to have a large effect on my run time?