problem with double memory access

sergey manuhin

I need to write a function whose input arguments are an image and a mask. The function must copy only the unmasked pixels to the output. Something like this:

for (i = 0; i < 1024; i++) if (mask[i]) *output++ = input[i];
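For reference, the loop above can be written as a small portable C function that is easy to check off target. This is only a sketch: the name copy_masked, the one-byte-per-pixel mask type and the returned count are assumptions for illustration, not part of the original code.

```c
#include <stddef.h>

/* Hypothetical helper: copy input[i] to output whenever mask[i] is nonzero,
 * packing the kept pixels contiguously.
 * Returns the number of pixels written.
 * Assumption: the mask holds one byte per pixel. */
size_t copy_masked(const unsigned short *input,
                   const unsigned char *mask,
                   unsigned short *output,
                   size_t count)
{
    size_t written = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (mask[i]) {
            output[written++] = input[i];
        }
    }
    return written;
}
```

The output buffer must have room for count pixels, since in the worst case every pixel is kept.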

The bottleneck of this function is memory access.

I have optimized this function:

void copy_image(
    unsigned short *restrict input_image,
    unsigned short *restrict ouput_image1,
    unsigned short *restrict ouput_image2,
    unsigned short *restrict _input_map)
{
    int i;
    double pix1234, pix5678;
    unsigned short pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pixel7, pixel8;
    const double *restrict part1 = (const double *)&input_image[0];
    const double *restrict part2 = (const double *)&input_image[1024];
    unsigned short map;

    /* Tell the compiler that all buffers are 8-byte aligned. */
    _nassert(((unsigned)input_image) % 8 == 0);
    _nassert(((unsigned)part1) % 8 == 0);
    _nassert(((unsigned)part2) % 8 == 0);
    _nassert(((unsigned)ouput_image1) % 8 == 0);
    _nassert(((unsigned)ouput_image2) % 8 == 0);

    for (i = 0; i < 1024; i += 1)
    {
        map = *_input_map++;

        /* One aligned 64-bit load fetches four 16-bit pixels. */
        pix1234 = _amemd8_const((void *)&(part1[i]));
        pixel1 = _extu(_hi(pix1234), 0, 16);   /* upper 16 bits of the high word */
        pixel2 = _extu(_hi(pix1234), 16, 16);  /* lower 16 bits of the high word */
        pixel3 = _extu(_lo(pix1234), 0, 16);   /* upper 16 bits of the low word */
        pixel4 = _extu(_lo(pix1234), 16, 16);  /* lower 16 bits of the low word */

        /* A cleared map bit means the pixel is copied. */
        if ((map & 0x1) == 0) *ouput_image1++ = pixel1;
        if ((map & 0x2) == 0) *ouput_image2++ = pixel2;
        if ((map & 0x4) == 0) *ouput_image1++ = pixel3;
        if ((map & 0x8) == 0) *ouput_image2++ = pixel4;

        pix5678 = _amemd8_const((void *)&(part2[i]));
        pixel5 = _extu(_hi(pix5678), 0, 16);
        pixel6 = _extu(_hi(pix5678), 16, 16);
        pixel7 = _extu(_lo(pix5678), 0, 16);
        pixel8 = _extu(_lo(pix5678), 16, 16);

        if ((map & 16) == 0) *ouput_image1++ = pixel5;
        if ((map & 32) == 0) *ouput_image2++ = pixel6;
        if ((map & 64) == 0) *ouput_image1++ = pixel7;
        if ((map & 128) == 0) *ouput_image2++ = pixel8;
    }
}

The compiler's analysis file (*.asm) says that the processor will spend five cycles on each pixel.

But when I profile the code, I see that the function took 21700 cycles: 6*1024 as expected, plus about 15000 reported as "L1D.Stall.write_buf_full".

Then I modified the code and removed the second output buffer:

if ((map & 0x1) == 0) *ouput_image1++ = pixel1;
if ((map & 0x2) == 0) *ouput_image1++ = pixel2;
if ((map & 0x4) == 0) *ouput_image1++ = pixel3;
if ((map & 0x8) == 0) *ouput_image1++ = pixel4;
if ((map & 16) == 0) *ouput_image1++ = pixel5;
if ((map & 32) == 0) *ouput_image1++ = pixel6;
if ((map & 64) == 0) *ouput_image1++ = pixel7;
if ((map & 128) == 0) *ouput_image1++ = pixel8;
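To check the mask logic of this single-output variant without the target hardware, the same idea can be sketched in portable C. This is a simplification under stated assumptions: the function name copy_by_bitmap is hypothetical, each map entry is assumed to control 8 consecutive pixels taken in plain array order, and the TI intrinsics (_amemd8_const, _hi, _lo, _extu) are replaced by ordinary indexing, so it says nothing about cycle counts or stalls.

```c
#include <stddef.h>

/* Hypothetical portable sketch of the single-output variant.
 * Bit k of map[g] controls pixel input[g * 8 + k]; as in the code above,
 * a cleared bit means the pixel is copied.
 * Returns the number of pixels written. */
size_t copy_by_bitmap(const unsigned short *input,
                      const unsigned short *map,
                      unsigned short *output,
                      size_t groups)
{
    unsigned short *out = output;
    size_t g;
    unsigned bit;

    for (g = 0; g < groups; g++) {
        unsigned m = map[g];
        for (bit = 0; bit < 8; bit++) {
            if ((m & (1u << bit)) == 0) {
                *out++ = input[g * 8 + bit];
            }
        }
    }
    return (size_t)(out - output);
}
```

Comparing its output with the optimized routine on the same data is one way to confirm that removing the second output buffer did not change which pixels are kept.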

Now the loop takes 8 cycles per pixel and there are no dual (parallel) stores to memory, but the number of bytes written is the same. So I expected 8*1024 cycles plus the 15000 cycles of L1D.Stall.write_buf_full.

But when I profile it, the function takes only 8*1024 cycles, and I don't understand why. Can you help me? How can I use double access without a huge write buffer stall?

I could use a prefetch, but that would waste memory, because I don't know in advance how many pixels will be masked.

over 14 years ago

Алексей over 14 years ago

TI__Prodigy 50 points

Sergey,
the following wiki pages provide more information and a test example showing how to optimize C code:
http://wiki.davincidsp.com/index.php/C6000_CGT_Optimization_Lab_-_1
http://wiki.davincidsp.com/index.php/Category:Compiler
http://processors.wiki.ti.com/index.php/Optimization_Techniques_for_the_TI_C6000_Compiler
These examples may be useful to you in this particular case, and probably in the future as well.
