C6Run: bad DSP performance

Anton Kapustin

Other Parts Discussed in Thread: OMAP3530, OMAP-L138

Hi!

I recently started using the C6Run utility and got very strange results when I built the included examples. The ARM core appears to be much faster than the DSP core, around 10 times faster for the Fourier transform. How can this happen? I want to understand what I am doing wrong. Where is the bottleneck? Should I modify some configuration files, or use particular options during compilation? Maybe increase the memory size for the DSP?

I use the latest C6Run, as well as the other recommended dependent tools. Processor: OMAP3530. Linux distribution: Angstrom, on a Tsunami board from TechNexion.
Below are the results of running the example (here c6runlib; c6runapp gives essentially the same results).

root@taodemo:~/TAO_INSTALL/examples/c6runlib/emqbit# ./cfft_arm
N=16,nTimes=100: 0.001342 s
N=32,nTimes=100: 0.002167 s
N=64,nTimes=100: 0.005249 s
N=128,nTimes=100: 0.012237 s
N=256,nTimes=100: 0.027558 s
N=512,nTimes=100: 0.062409 s
N=1024,nTimes=100: 0.138458 s
N=2048,nTimes=100: 0.307709 s
N=4096,nTimes=100: 0.675507 s
N=8192,nTimes=100: 1.4874 s
N=16384,nTimes=100: 3.2832 s
root@taodemo:~/TAO_INSTALL/examples/c6runlib/emqbit# ./cfft_dsp
N=16,nTimes=100: 0.084748 s
N=32,nTimes=100: 0.096069 s
N=64,nTimes=100: 0.120972 s
N=128,nTimes=100: 0.180298 s
N=256,nTimes=100: 0.317017 s
N=512,nTimes=100: 0.622894 s
N=1024,nTimes=100: 1.30252 s
N=2048,nTimes=100: 2.79202 s
N=4096,nTimes=100: 6.03702 s
N=8192,nTimes=100: 13.1281 s
N=16384,nTimes=100: 28.6032 s
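
The "around 10 times" figure can be checked directly against the two runs above. The following is a minimal sketch that computes the DSP/ARM ratio for each size; the helper names `slowdown` and `dsp_ms_per_call` are illustrative only and are not part of C6Run.

```c
#include <stddef.h>

/* Timings copied from the two runs above, in seconds per 100 transforms.
   Index 0 is N=16, index 10 is N=16384 (N doubles at each step). */
static const double arm_s[] = {
    0.001342, 0.002167, 0.005249, 0.012237, 0.027558, 0.062409,
    0.138458, 0.307709, 0.675507, 1.4874, 3.2832
};
static const double dsp_s[] = {
    0.084748, 0.096069, 0.120972, 0.180298, 0.317017, 0.622894,
    1.30252, 2.79202, 6.03702, 13.1281, 28.6032
};

#define N_RUNS (sizeof arm_s / sizeof arm_s[0])

/* DSP time divided by ARM time for run k (larger means the DSP is slower). */
double slowdown(size_t k)
{
    return dsp_s[k] / arm_s[k];
}

/* Average DSP time for one transform in run k, in milliseconds (nTimes = 100). */
double dsp_ms_per_call(size_t k)
{
    return dsp_s[k] / 100.0 * 1000.0;
}
```

Calling `slowdown(k)` for each `k` below `N_RUNS` shows that the ratio is not constant: it is far larger for the smallest sizes than for the largest ones, where it settles a little below 10. That pattern may point to a fixed per-call cost on the DSP side in addition to the cost of the arithmetic itself, though nothing in this thread measures that separately.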

over 12 years ago

0 Daniel Allred over 12 years ago

TI__Genius 17355 points

The issue here is that the example is not particularly suited for the OMAP3530 device, since this is a floating point benchmark. The DSP on the OMAP3530 is the C64x+, which is fixed point only, while the Cortex A8 ARM core does have floating point capabilities. This example was originally created for the OMAP-L138 part, which has the reverse - a fixed-point ARM and a floating-point DSP.

Regards, Daniel

0 Anton Kapustin over 12 years ago in reply to Daniel Allred

Prodigy 120 points

Thank you, Daniel!

Is there any way to increase the speed of floating-point math operations on the DSP core in the OMAP3530?

0 Daniel Allred over 12 years ago in reply to Anton Kapustin

TI__Genius 17355 points

The floating point operations on the fixed-point C64x+ core are emulated in software as part of the run-time support (RTS) libraries. One possibility for improving the performance is to use the FastRTS library described here. It would replace the functions in the standard RTS library with versions that are faster, but possibly less accurate.

I should point out that all current and future ARM+C6000 DSP devices will contain DSP cores that are both floating and fixed point capable (either C674x or C66x or later). The OMAP3530 was likely the last ARM+DSP part with an older fixed-point only C6000 core.
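
To make the emulation cost concrete: on a fixed-point-only core, a `float` multiply becomes a call into an RTS routine, while the same product in a fixed-point format is plain integer arithmetic. The following is a minimal sketch, assuming the data can be scaled into Q15 (values in [-1, 1) stored as 16-bit integers). The names `q15_t`, `q15_from_double`, `q15_to_double` and `q15_mul` are hypothetical helpers, not part of C6Run or FastRTS, and converting the FFT example to fixed point is not something tested in this thread.

```c
#include <stdint.h>

/* Q15 format: a value x in [-1, 1) is stored as (int16_t)(x * 32768). */
typedef int16_t q15_t;

/* Convert a double to Q15 with rounding and saturation.
   Intended for preparing test data, not for use in the inner loop. */
q15_t q15_from_double(double x)
{
    double scaled = x * 32768.0;
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;   /* round to nearest */
    if (scaled > 32767.0)  return INT16_MAX;
    if (scaled < -32768.0) return INT16_MIN;
    return (q15_t)scaled;
}

/* Convert a Q15 value back to double. */
double q15_to_double(q15_t x)
{
    return x / 32768.0;
}

/* Multiply two Q15 values using only integer operations.
   The 16x16 product is in Q30; adding 2^14 rounds it, and shifting
   right by 15 returns to Q15. Assumes the compiler implements >> on
   negative values as an arithmetic shift, which is implementation-defined in C. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    p = (p + (1 << 14)) >> 15;
    if (p > INT16_MAX) {
        p = INT16_MAX;   /* only (-1) * (-1) can overflow */
    }
    return (q15_t)p;
}
```

For example, `q15_mul(q15_from_double(0.5), q15_from_double(0.5))` yields the Q15 encoding of 0.25. The trade-off is reduced dynamic range and the need to manage scaling by hand, so this is only an option where the algorithm tolerates it.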

Regards, Daniel

0 giorgos tsoumplekas over 12 years ago in reply to Daniel Allred

Expert 1840 points

Hello,

I have an OMAP3530 board and have implemented some algorithms that run on the DSP side. The ARM and DSP communicate using DSPLink and shared memory, but the performance is not good. For example, motion estimation for a 320x240 frame with a search window of [-8, 8] takes 12 seconds to finish. I believe this is a memory problem, because the frame is stored in external memory: the data is copied from shared memory to external memory and then processed there. Is there a way to transfer the data to local memory, or any other way to improve performance? The DSP/BIOS version is 5.x. Could anyone help me?

This is the code:

#pragma DATA_SECTION( FrmBuf, "mySection" );
struct Image{
    unsigned char r,g,b;
    unsigned char y,u,v;
    unsigned char Dy;//for every block//
}FrmBuf[344064];
#pragma DATA_SECTION( blk, "mySection" );
struct Block {
    unsigned char r,g,b;
    unsigned char y,u,v;

}blk[256];//define the max block which examine for fire 16*16//

int SAD(int offset_hor,int offset_vert,int mvx,int mvy,struct Block *blk,struct Image *img,unsigned int BLOCK_X,unsigned int BLOCK_Y,int imgWidth){
    int sum=0,i=0,j=0;
    int val1=0,val2=0,diff=0;
    for(j=0;j<BLOCK_Y;j++){
        for(i=0;i<BLOCK_X;i++) {
            val1=blk[j*BLOCK_X+i].y;
            val2=img[(offset_vert+(j+mvy)*imgWidth)+i+offset_hor+mvx].y;
            // printf("reference FrmBuf[%d]->%d\n",(offset_vert+(j+mvy)*imgWidth)+i+offset_hor+mvx,FrmBuf[(offset_vert+(j+mvy)*imgWidth)+i+offset_hor+mvx].y);
            // printf("current blk[%d]->%d\n",j*BLOCK_X+i,blk[j*BLOCK_X+i].y);
            //printf("values::[%d %d]][%d %d]\n",j*BLOCK_X+i,(offset_vert+(j+mvy)*imgWidth)+i+offset_hor+mvx,val1,val2);
            diff=(val1-val2);
            diff=(diff ^ (diff>>31)) - (diff>>31);

            sum+=diff ;
        }

    }
    return sum;
}

void   MotionEstimation(int *min_mvy,int *min_mvx,int offset_hor,int offset_vert,struct Block *blk,struct Image *img,unsigned int BLOCK_X,unsigned int BLOCK_Y,int imgWidth)
{
    //
     int temp_SAD=0;
     int mvy=0;
     int mvx=0;
     int min_SAD=65281;//the max value for diff +1 (255-0)*256
     int Y=(Y_BLOCK<<1);
     int X=(X_BLOCK<<1);
     int diff1=0;
     int diff2=0;
     for(mvy=-Y;mvy<Y;mvy++){
         for(mvx=-X;mvx<X;mvx++){
             //   printf("MotionVector[%d %d] :\n",mvy,mvx);
             temp_SAD=SAD(offset_hor,offset_vert,mvx,mvy,blk,img,BLOCK_X,BLOCK_Y,imgWidth);
             diff2=(temp_SAD-min_SAD)>>31;//all ones if temp_SAD<min_SAD, else 0 (assumes arithmetic right shift)
             diff1=~diff2;//all ones if the current minimum is kept
             //full-width masks preserve negative motion vectors; a 0xFFFF mask would truncate their sign
             min_SAD=((min_SAD&diff1)|(temp_SAD&diff2));

             *min_mvx=((*min_mvx&diff1)|(mvx&diff2));
             *min_mvy=((*min_mvy&diff1)|(mvy&diff2));
         }

     }

   // printf("MV::[%d %d]\n",*min_mvy,*min_mvx);
}
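
One detail of the code above is that `SAD` reads only the `y` member of each 7-byte `struct Image` element, so consecutive luma samples sit 7 bytes apart in memory. The following is a minimal sketch of the same full search over contiguous luma planes, assuming the `y` values are copied into plain `unsigned char` arrays first. The names `sad_luma` and `search_luma` and their parameters are hypothetical, the bounds check is an addition, and whether this layout removes the 12-second bottleneck on this board has not been measured here.

```c
#include <limits.h>
#include <stdlib.h>

/* Sum of absolute differences between a bw x bh block and one candidate
   window. cur holds the block row by row (bw samples per row); ref points
   at the top-left sample of the window inside a plane that is stride wide. */
int sad_luma(const unsigned char *cur, const unsigned char *ref,
             int bw, int bh, int stride)
{
    int i, j;
    int sum = 0;
    for (j = 0; j < bh; j++) {
        for (i = 0; i < bw; i++) {
            sum += abs((int)cur[j * bw + i] - (int)ref[j * stride + i]);
        }
    }
    return sum;
}

/* Full search over motion vectors in [-range, range] around (x0, y0).
   Candidates that fall outside the width x height plane are skipped.
   Returns the smallest SAD and writes its vector to best_mvx, best_mvy;
   returns INT_MAX with a (0, 0) vector if no candidate was inside the plane. */
int search_luma(const unsigned char *cur, const unsigned char *plane,
                int width, int height, int x0, int y0,
                int bw, int bh, int range, int *best_mvx, int *best_mvy)
{
    int mvx, mvy;
    int best = INT_MAX;
    *best_mvx = 0;
    *best_mvy = 0;
    for (mvy = -range; mvy <= range; mvy++) {
        for (mvx = -range; mvx <= range; mvx++) {
            int x = x0 + mvx;
            int y = y0 + mvy;
            int s;
            if (x < 0 || y < 0 || x + bw > width || y + bh > height) {
                continue;
            }
            s = sad_luma(cur, plane + y * width + x, bw, bh, width);
            if (s < best) {
                best = s;
                *best_mvx = mvx;
                *best_mvy = mvy;
            }
        }
    }
    return best;
}
```

With a 16x16 block and a 320-wide plane, each window row is 16 contiguous bytes, which is a more regular access pattern than stepping through structs. A plain comparison is used for the minimum here instead of the mask arithmetic, since it is easier to verify.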
