
How can I write high-performance code for image binarization?



1) I copy a 320*256 image to DDR2 and process it with an image binarization function. This processing takes 12 ms.

while(1)

{

in = MEM_alloc(DDR2,320*256,4);

/*here I use CCS to load data to DDR2*/

TIME_DECODE = 0;
preTime=CLK_getltime();

/* image binarization; thi is the threshold */
test_binary_image(in,320,256,thi);
curTime=CLK_getltime();
TIME_DECODE=curTime-preTime;

}

2) I copy the same image to L2 SRAM and process it with the same function. It also takes 12 ms.

while(1)

{

in = MEM_alloc(IRAM,320*256,4);

/*here I use CCS to load image data to IRAM*/

TIME_DECODE = 0;
preTime=CLK_getltime();

/* image binarization; thi is the threshold */
test_binary_image(in,320,256,thi);
curTime=CLK_getltime();
TIME_DECODE=curTime-preTime;

}

I expected method (2), with the data in L2, to take much less time, but it does not.

I am using a 6431 (297 MHz). Processing a 320*256 image takes 12 ms, which is too slow for my application.

Have I made a mistake or overlooked some detail?

/*test_binary_image()function code*/

void test_binary_image(Uint8 *restrict pimg,short w,short h,Uint8 th)
{
 int w_h = w * h;
 while(w_h--)
 {
  if(*pimg > th)
   *pimg = 0xff;
  else
   *pimg = 0;
  pimg++;
 }
}

thx

teddy

  • The 6431 has a maximum of 64 KB of L2 RAM, so I don't see how you can allocate 320*256 bytes (about 80 KB) in L2.  Try reducing your image size and see if that helps at all.  Also, remember that data has to get into L1D to be used by the core, and you will still get cache miss penalties moving data from L2 SRAM into L1D.

    I can give you 2 or 3 suggestions to speed up your processing time.  First, you can configure L1D as 32k cache and exploit miss pipelining by "touching" the array and working in 32k or smaller chunks.  I would read up on the cache to understand this process.  Second, you could go whole hog and make L1D more RAM heavy, then just EDMA blocks in & out of DDR2 & L1D.  Third, and this is outside of my experience, the compiler might not be optimizing your loop as much as possible.  Right now I would bet that your bottleneck is data starvation, so I would invest my time understanding cache & dma before I began reading up on compiler optimization.
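    For scale, 12 ms at 297 MHz is about 3.56 million cycles, or roughly 43 cycles per byte for the 81,920-byte image, which fits the guess that the loop is waiting on memory. The first suggestion above (touch a chunk, then process it) could be sketched roughly as follows. This is only an illustration in portable C, not the original poster's code or a TI library routine: the names `touch_chunk`, `binarize_span`, and `binarize_chunked` are hypothetical, and the 64-byte line size and 16 KB chunk size are assumptions that should be checked against the device's cache documentation. The inner loop is also written without a branch, which may give the compiler an easier loop to pipeline.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values, not taken from the thread: check them for your device. */
#define CACHE_LINE_BYTES 64u            /* assumed L1D line size */
#define CHUNK_BYTES      (16u * 1024u)  /* assumed chunk that fits in L1D cache */

/* Read one byte per cache line so the misses for a chunk are issued
 * back to back rather than being spread through the compute loop. */
static void touch_chunk(const uint8_t *p, size_t n)
{
    const volatile uint8_t *vp = p;     /* volatile keeps the reads alive */
    size_t i;
    for (i = 0; i < n; i += CACHE_LINE_BYTES)
        (void)vp[i];
}

/* Branch-free threshold: 0xFF where the pixel is above th, else 0. */
static void binarize_span(uint8_t *restrict p, size_t n, uint8_t th)
{
    size_t i;
    for (i = 0; i < n; i++)
        p[i] = (uint8_t)-(p[i] > th);
}

/* Walk the image in chunks: touch each chunk, then binarize it in place. */
void binarize_chunked(uint8_t *img, size_t n, uint8_t th)
{
    size_t off = 0;
    while (off < n) {
        size_t len = n - off;
        if (len > CHUNK_BYTES)
            len = CHUNK_BYTES;
        touch_chunk(img + off, len);
        binarize_span(img + off, len, th);
        off += len;
    }
}
```

    A call such as `binarize_chunked(in, 320 * 256, thi);` would replace the call to `test_binary_image` in the timing loop. Passing the pixel count as `size_t` also avoids squeezing 81,920 into a `short`. Whether this helps depends on how L1D is configured, so it is worth timing both versions.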

  • I have checked the 6431 datasheet and you are right: there is only 64 KB of L2 RAM. But this confuses me, because "in = MEM_alloc(IRAM,320*256,4);" returns a valid address, 0x10810000, and I can load a 320*256 image into this region with CCS.

    After that, I use graph->image to view the image and it is perfect, with no missing data. Maybe the 6431 has 128 KB of L2 RAM. You can try that.

  • I would be wary of using internal memory beyond the datasheet specification for a device like this. It is specified for 64 KB, so anything beyond that in the internal memory space is not guaranteed to exist or function properly.

  • OK, I don't have a 6431, but the obvious question is: are you running on the actual hardware or on a 643x simulator?  If it is the actual hardware, you'll have to get one of the TI guys to explain that one.