TMS320C6748: DSPF_sp_dotprod from c6748 library takes more cycles.

AAAA PPPP

Part Number: TMS320C6748

Hello,

I wrote a small program that computes the dot product of two 100-element arrays using DSPF_sp_dotprod. According to the programmer's guide at the following link

page 4-55, each call to the function should take only 75 cycles. In my program, however, it takes about 2488 cycles with optimization off and 135 cycles at optimization level 3. When I halve or double the array length, the result is still about 133 cycles at level 3. None of my cycle counts match the figure given in the manual. My program is given below. I set one breakpoint at dummy=0; and another at dummy++; and watched the profile clock to measure the cycles between them.

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <limits.h>
#include "dsplibc674x.h"

/* ======================================================================= */
/* Interface header files for the natural C and optimized C code */
/* ======================================================================= */
#define notaps 100
float aarr[notaps];
float barr[notaps];
float res;
volatile int dummy=0;

#pragma DATA_ALIGN(aarr, 8);
#pragma DATA_ALIGN(barr, 8);

int main()
{

float in;

int i;

for(i=0;i<notaps;i++)
{
aarr[i]=1.0;
barr[i]=100.0;
}

dummy=0;
res=DSPF_sp_dotprod(aarr,barr,notaps);
dummy++;

return (1);
}

Thanks in advance

With Regards

Shalini

over 6 years ago

0 Victor Kazmirenko over 6 years ago

Guru 13042 points

Hello!

It's not easy to guess why the performance numbers differ from what you expected. However, let me point out a suspicious place.

You mentioned that your cycle count differs depending on the optimization level. That sounds strange. The library function was optimized when the library was compiled, and its performance should not depend on the optimization level you apply to your application program.

Secondly, these numbers depend heavily on where the code and data are placed in memory and on caching.

Out of curiosity, I ran the function in question on my C6670. As a reference, I copied the optimized C function from the sources and tested against it. I placed the arrays and code in L2SRAM. Even with this precaution, the counts on the first run were large; on the second and subsequent runs they were much smaller. For instance, for 128 numbers I saw 114 cycles for DSPLIB versus 172 for the handmade copy. With 256 numbers I saw 136 versus 201. Notice that doubling the array size does not double the execution time.

If I speculate about the absolute values, I would cry. Just imagine: a C66 core is supposed to perform up to 8 SP multiplies per cycle. With that, should I expect 128/8 = 16 cycles versus 114 in the actual measurement? To my knowledge, there is function call overhead, there may be prologue/epilogue stages, and so on. And again, please compare 136 with 114: the extra 128 numbers in the array add just 22 cycles.

I guess, compiler and device experts could explain better, but I just want to illustrate that simple math does not work very well.

0 RandyP over 6 years ago

TI__Guru* 84110 points

Shalini,

Try putting the program and data in L1P/L1D as SRAM. This will give you the best possible memory performance. My guess is that is where the biggest variation would come from. But if your application requires the program and data to be somewhere else and to be cached, then you should benchmark the results you get in your configuration, accept that you are getting the best possible results although not the same as our documentation, and move forward.

You will have a few cycles of variation because of storing the value res= and the overhead of running from dummy=0 to dummy++. But that will only be a small number of cycles and will not explain the large discrepancy unless the optimization makes a big difference there. As I read rrlagic's explanation, you also have overhead in the function call itself, which may be reduced with optimization. Again, that would only be a few cycles.

The library performance numbers are valid, but we are not always very good about showing exactly how to duplicate those results. We have started to do a better job with the C66x-based devices, though.

Regards,
RandyP

0 AAAA PPPP over 6 years ago in reply to RandyP

Expert 1410 points

Hello,

Thanks for all replies.

I tried to execute the program from SHDSPL1DRAM and DSPL1DRAM. In both cases the program does not even halt at the starting point for me to click the resume button. Why does this happen?

When I ran the program from SHDSPL2RAM and DSPL2RAM, I got 195 cycles for a 104-term dot product. So do you mean this is the best performance I can get from this function?

Thanks in advance

With Regards
Shalini K

0 Victor Kazmirenko over 6 years ago in reply to AAAA PPPP

Guru 13042 points

Hello!

I'm not sure how to place program and data in the L1 memories, so my method was to put them in L2SRAM and run the measured sequence repeatedly. The first run gets both program and data cached, so the second and subsequent runs give you the best performance.

0 AAAA PPPP over 6 years ago in reply to Victor Kazmirenko

Expert 1410 points

Thank you sir

0 Rahul Prabhu over 6 years ago in reply to AAAA PPPP

TI__Guru** 114410 points

Shalini,

I would like to point you to the Introduction to DSP Optimization which can be found from the link below:
www.ti.com/.../sprabf2.pdf

It contains simple techniques for optimizing code using compiler options and by aligning and placing data in on-chip memory sections. Please review the techniques and check whether you see any improvement.

I agree with the suggestions here that placing code in L2 and using the compiler's optimization options are the simplest ways to improve performance without spending additional time on assembly coding.

Regards,
Rahul
