DSPLink performance + Cache influence

Almohanad Fayez

Other Parts Discussed in Thread: OMAP3530

I remember reading in some documentation that the DSPLink transfer rate from GPP->DSP is faster than from DSP->GPP because the GPP->DSP transfer is cacheable while the DSP->GPP transfer is not, but I don't remember where. Is that because the DSP program probably resides in the DSP's cache and the GPP can see the DSP's memory, so the transfer can occur between the GPP and the DSP cache, whereas the DSP's MMU can't see the GPP's memory, so the DSP has to write the data back to the DSPLink memory region and the GPP then has to read it from external memory?

If my summary above is correct, is it possible to map the GPP buffers so that the DSP MMU can see them, and if so, would that result in DSP->GPP buffer transfer performance similar to that of GPP->DSP?

over 14 years ago

0 zaheer sheriff over 14 years ago

Intellectual 410 points

Fayez,

I need to know what the data transfer rate between the ARM and DSP cores depends on when using DSPLink. Please guide me on how to customise the rate according to our needs.

0 giorgos tsoumplekas over 13 years ago in reply to zaheer sheriff

Expert 1840 points

Hello,

I have an OMAP3530 board and I have implemented some algorithms that run on the DSP side. The ARM and the DSP communicate using DSPLink and shared memory, but the performance is not good. For example, for motion estimation on a 320x240 frame with a search window of [-8 8], the algorithm takes 12 (!!) seconds to finish. I believe the problem is with memory, because the frame is stored in external memory: the data are copied from the shared memory to external memory and then processed. Is there any way to transfer the data to a local memory, or any other way to improve performance? The DSP/BIOS version is 5.x. Could anyone help me?

This is the code

#pragma DATA_SECTION( FrmBuf, "mySection" );

struct Image{
    unsigned char r,g,b;
    unsigned char y,u,v;
    unsigned char Dy;//for every block//
}FrmBuf[344064];

#pragma DATA_SECTION( blk, "mySection" );
struct Block {
    unsigned char r,g,b;
    unsigned char y,u,v;

}blk[256];//define the max block which examine for fire 16*16//

int SAD(int offset_hor,int offset_vert,int mvx,int mvy,struct Block *blk,struct Image *img,unsigned int BLOCK_X,unsigned int BLOCK_Y,int imgWidth){
    int sum=0,i=0,j=0;
    int val1=0,val2=0,diff=0;
    for(j=0;j<BLOCK_Y;j++){
   for(i=0;i<BLOCK_X;i++) {
       val1=blk[j*BLOCK_X+i].y;
       val2=img[(offset_vert+(j+mvy)*imgWidth)+i+offset_hor+mvx].y;
   // printf("reference FrmBuf[%d]->%d\n",(offset_vert+(j+mvy)*imgWidth)+i+offset_hor+mvx,FrmBuf[(offset_vert+(j+mvy)*imgWidth)+i+offset_hor+mvx].y);
   // printf("current blk[%d]->%d\n",j*BLOCK_X+i,blk[j*BLOCK_X+i].y);
       //printf("values::[%d %d]][%d %d]\n",j*BLOCK_X+i,(offset_vert+(j+mvy)*imgWidth)+i+offset_hor+mvx,val1,val2);
       diff=(val1-val2);
       diff=(diff ^ (diff>>31)) - (diff>>31);

       sum+=diff ;
   }

    }
    return sum;
}

void   MotionEstimation(int *min_mvy,int *min_mvx,int offset_hor,int offset_vert,struct Block *blk,struct Image *img,unsigned int BLOCK_X,unsigned int BLOCK_Y,int imgWidth)
{
    //
     int temp_SAD=0;
     int mvy=0;
     int mvx=0;
     int min_SAD=65281;//the max value for diff +1 (255-0)*256
     int Y=(Y_BLOCK<<1);
     int X=(X_BLOCK<<1);
     int diff1=0;
     int diff2=0;
     for(mvy=-Y;mvy<Y;mvy++){
   for(mvx=-X;mvx<X;mvx++){
       //   printf("MotionVector[%d %d] :\n",mvy,mvx);
          temp_SAD=SAD(offset_hor,offset_vert,mvx,mvy,blk,img,BLOCK_X,BLOCK_Y,imgWidth);
          diff1=(temp_SAD-min_SAD)>>31;//all ones when temp_SAD<min_SAD (arithmetic shift), else 0
          diff2=~diff1;//complementary full-width mask

          min_SAD=((min_SAD&diff2)|(temp_SAD&diff1));//keep the smaller SAD

          *min_mvx=((*min_mvx&diff2)|(mvx&diff1));//full-width masks keep the sign of negative vectors
          *min_mvy=((*min_mvy&diff2)|(mvy&diff1));
   }

    }

   // printf("MV::[%d %d]\n",*min_mvy,*min_mvx);
}

Best Regards

Giorgos Tsoumplekas

Postgraduate student, University of Athens

0 Almohanad Fayez over 13 years ago in reply to giorgos tsoumplekas

Intellectual 510 points

I haven't found a good answer on how to improve data rate performance using DSPLink. I'm starting to believe that DSPLink simply has too much overhead for signal processing applications with real-time constraints. I've been meaning to look into CMEM or to dig up code for OMAP3530 VirtIO support. If you go the CMEM route, be aware that CMEM doesn't handle cache coherence for you the way DSPLink does, so you'll need to do that manually. I've scoured the Internet and the TI blogs for far too long and that's my conclusion, but I might be wrong and may have missed some document or post that deals with this issue.

0 Almohanad Fayez over 13 years ago in reply to giorgos tsoumplekas

Intellectual 510 points

Giorgos, I just skimmed your SAD code. While C code will compile on the DSP and is a good starting point, you really need to make use of some of the specialized instructions on the C64x+ processor to see maximal gain. In addition, you should avoid indexing arrays as arr[index]; you should use

temp_ptr = arr;

temp_ptr = temp_ptr + 1;

instead. You should download the C64x+ DSPLib library and look at it as a reference point. You also need to look into compiler optimizations:

http://www.ti.com/general/docs/lit/getliterature.tsp?literatureNumber=spru187o&fileType=pdf

C code is just the starting point; the real fun/pain is what follows.

If you do all of the above and you're still not happy with the performance then DSPLink buffers might be the issue.
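As a rough illustration of the pointer-walking suggestion above, here is a minimal sketch of a SAD routine that advances two pointers instead of recomputing an index for every pixel. The name sad_ptr, its parameters, and the assumption that both blocks are plain arrays of luma bytes are hypothetical and not taken from the posted code; whether this actually helps depends on the compiler and its optimization settings.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: sum of absolute differences between a bw x bh
 * block (stored contiguously in cur) and a reference region that starts
 * at ref and has ref_stride bytes between the starts of successive rows. */
static int sad_ptr(const unsigned char *cur, const unsigned char *ref,
                   int bw, int bh, int ref_stride)
{
    int sum = 0;
    int i, j;

    assert(bw > 0 && bh > 0 && ref_stride >= bw);

    for (j = 0; j < bh; j++) {
        /* Row start pointers are computed once per row. */
        const unsigned char *c = cur + j * bw;
        const unsigned char *r = ref + j * ref_stride;

        /* The inner loop only dereferences and increments. */
        for (i = 0; i < bw; i++)
            sum += abs(*c++ - *r++);
    }
    return sum;
}
```

For a 16x16 block inside a 320-pixel-wide luma plane, a call would look like sad_ptr(block, plane + y * 320 + x, 16, 16, 320).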

0 giorgos tsoumplekas over 13 years ago in reply to Almohanad Fayez

Expert 1840 points

Thanks for the answer. I used some of the special instructions, but I have not seen any difference!

Best Regards

Giorgos

0 Almohanad Fayez over 13 years ago in reply to giorgos tsoumplekas

Intellectual 510 points

Look up the compiler optimization document. It shows you how to make the compiler report how your functional units are being utilized (I think that's what TI calls them). If memory serves me correctly, the C64x+ has eight of them, four on each side of the datapath, and your goal is to keep as many of them busy in parallel as you can. You can do that by loading multiple data points at a time using some of the specialized memory load functions.

Setup the output, read it, and make sure you're truly utilizing your C64x+ to its full capacity.
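To make the "load multiple data points at a time" idea concrete, here is a portable sketch. It assumes the Image layout posted earlier in the thread, where the y values sit 7 bytes apart, and first copies them into a contiguous luma plane so that four pixels can be fetched with one 32-bit load. The helper names extract_luma, sad4 and sad_row are hypothetical. On the C64x+, the sad4 step is the kind of operation the packed-byte intrinsics in the TI compiler guide (for example _subabs4 and _dotpu4) are meant for, but check the guide for exact signatures before relying on them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same field layout as the Image struct posted earlier in the thread. */
struct Image {
    unsigned char r, g, b;
    unsigned char y, u, v;
    unsigned char Dy;
};

/* Hypothetical helper: gather the strided y fields into a contiguous plane. */
static void extract_luma(const struct Image *img, unsigned char *luma, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        luma[i] = img[i].y;
}

/* Portable stand-in for a packed 4-byte SAD step: sums |a_k - b_k|
 * over the four byte lanes of two 32-bit words. */
static unsigned sad4(uint32_t a, uint32_t b)
{
    unsigned sum = 0;
    int k;
    for (k = 0; k < 4; k++) {
        int x = (int)((a >> (8 * k)) & 0xFF);
        int y = (int)((b >> (8 * k)) & 0xFF);
        sum += (unsigned)(x > y ? x - y : y - x);
    }
    return sum;
}

/* SAD of one row, four pixels per load. width must be a multiple of 4. */
static unsigned sad_row(const unsigned char *cur, const unsigned char *ref,
                        int width)
{
    unsigned sum = 0;
    int i;

    assert(width > 0 && width % 4 == 0);

    for (i = 0; i < width; i += 4) {
        uint32_t a, b;
        memcpy(&a, cur + i, 4); /* alignment-safe 32-bit load */
        memcpy(&b, ref + i, 4);
        sum += sad4(a, b);
    }
    return sum;
}
```

The byte order of the 32-bit loads does not matter here because the same lanes are compared in both words. The luma extraction costs one extra pass per frame, but it only has to be done once before the whole motion search.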
