I need to write a function that takes an image and a mask as input and copies only the unmasked pixels to the output. Something like this:
for(i=0;i<1024;i++) if(!mask[i]) *output++=input[i];
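For reference, the one-line loop above can be written out as a small, unoptimized function. This is only a sketch in portable C: the name `copy_unmasked`, the `unsigned char` mask type and the `count` parameter are assumptions for illustration, and it assumes that a nonzero mask entry marks a pixel to skip, in line with the bit tests in the full function below.

```c
#include <stddef.h>

/* Reference (unoptimized) masked copy.
 * Assumption: mask[i] != 0 means pixel i is masked and must be skipped.
 * Returns the number of pixels written to output. */
size_t copy_unmasked(const unsigned short *input,
                     const unsigned char *mask,
                     unsigned short *output,
                     size_t count)
{
    size_t written = 0;
    for (size_t i = 0; i < count; i++) {
        if (!mask[i]) {
            output[written++] = input[i];   /* keep unmasked pixel */
        }
    }
    return written;
}
```

The return value gives the number of valid pixels in the output buffer, which the caller otherwise cannot know.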
The bottleneck of this function is memory access.
I have optimized this function:
void copy_image(
    unsigned short *restrict input_image,
    unsigned short *restrict ouput_image1,
    unsigned short *restrict ouput_image2,
    unsigned short *restrict _input_map)
{
    int i;
    double pix1234, pix5678;
    unsigned short pixel1, pixel2, pixel3, pixel4, pixel5, pixel6, pixel7, pixel8;
    const double *restrict part1 = (const double *)&input_image[0];
    const double *restrict part2 = (const double *)&input_image[1024];
    unsigned short map;

    /* Tell the compiler that all buffers are 8-byte aligned. */
    _nassert(((unsigned)input_image) % 8 == 0);
    _nassert(((unsigned)part1) % 8 == 0);
    _nassert(((unsigned)part2) % 8 == 0);
    _nassert(((unsigned)ouput_image1) % 8 == 0);
    _nassert(((unsigned)ouput_image2) % 8 == 0);

    for (i = 0; i < 1024; i += 1)
    {
        /* One map word per iteration; its low 8 bits mask 8 pixels. */
        map = *_input_map++;

        /* Aligned 64-bit load of four 16-bit pixels from the first part. */
        pix1234 = _amemd8_const((void *)&(part1[i]));
        pixel1 = _extu(_hi(pix1234), 0, 16);
        pixel2 = _extu(_hi(pix1234), 16, 16);
        pixel3 = _extu(_lo(pix1234), 0, 16);
        pixel4 = _extu(_lo(pix1234), 16, 16);

        /* Store a pixel only if its mask bit is clear; the two outputs alternate. */
        if ((map & 0x1) == 0) *ouput_image1++ = pixel1;
        if ((map & 0x2) == 0) *ouput_image2++ = pixel2;
        if ((map & 0x4) == 0) *ouput_image1++ = pixel3;
        if ((map & 0x8) == 0) *ouput_image2++ = pixel4;

        /* Same for four pixels from the second part. */
        pix5678 = _amemd8_const((void *)&(part2[i]));
        pixel5 = _extu(_hi(pix5678), 0, 16);
        pixel6 = _extu(_hi(pix5678), 16, 16);
        pixel7 = _extu(_lo(pix5678), 0, 16);
        pixel8 = _extu(_lo(pix5678), 16, 16);

        if ((map & 16) == 0) *ouput_image1++ = pixel5;
        if ((map & 32) == 0) *ouput_image2++ = pixel6;
        if ((map & 64) == 0) *ouput_image1++ = pixel7;
        if ((map & 128) == 0) *ouput_image2++ = pixel8;
    }
}
The compiler's analysis in the generated *.asm file says that the processor will spend five cycles per loop iteration.
But when I profile it, the function takes 21700 cycles: about 6*1024 cycles that I expected, plus about 15000 cycles of “L1D.Stall.write_buf_full”.
Then I modified the code and removed the second output buffer:
if ((map & 0x1) == 0) *ouput_image1++ = pixel1;
if ((map & 0x2) == 0) *ouput_image1++ = pixel2;
if ((map & 0x4) == 0) *ouput_image1++ = pixel3;
if ((map & 0x8) == 0) *ouput_image1++ = pixel4;
if ((map & 16) == 0) *ouput_image1++ = pixel5;
if ((map & 32) == 0) *ouput_image1++ = pixel6;
if ((map & 64) == 0) *ouput_image1++ = pixel7;
if ((map & 128) == 0) *ouput_image1++ = pixel8;
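Both variants issue the same stores and differ only in which output stream receives them. A host-side model can make that concrete by counting the 16-bit stores per stream. This is a minimal sketch in portable C, not the DSP code: `store_counts` and `count_stores` are hypothetical names, the TI intrinsics are left out, and only the low 8 bits of each map word are examined, as in the loop above. It says nothing about cache or write-buffer timing.

```c
#include <stddef.h>

/* Hypothetical helper type: stores issued to each output stream. */
typedef struct {
    size_t stream1;
    size_t stream2;
} store_counts;

/* Count the 16-bit stores the loop would issue for n map words.
 * A pixel is stored when its mask bit is 0.
 * two_outputs != 0: even bits go to stream 1, odd bits to stream 2
 *                   (first variant).
 * two_outputs == 0: every store goes to stream 1 (second variant). */
store_counts count_stores(const unsigned short *map, size_t n, int two_outputs)
{
    store_counts c = {0, 0};
    for (size_t i = 0; i < n; i++) {
        for (unsigned bit = 0; bit < 8; bit++) {
            if ((map[i] >> bit) & 1u)
                continue;                 /* masked pixel: no store */
            if (two_outputs && (bit & 1u))
                c.stream2++;
            else
                c.stream1++;
        }
    }
    return c;
}
```

For any map, `stream1 + stream2` of the first variant equals `stream1` of the second, so the difference between the two profiles has to come from how the stores are distributed, not from how many there are.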
Now the loop takes 8 cycles per iteration and there are no double writes to memory. But the number of bytes written is the same, so I expected 8*1024 + 15000 cycles (the extra 15000 being L1D.Stall.write_buf_full).
But when I profile it, the function takes only 8*1024 cycles, and I don’t understand why. Can you help me? How can I use double memory access without the huge write-buffer-full stall?
I could do a prefetch, but that would waste memory because I don’t know in advance how many pixels will be masked.
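The number of pixels that survive the mask can be computed from the map alone. The following is a minimal sketch in portable C, assuming (as in the loop above) that only the low 8 bits of each map word are used and that a set bit means "masked". `count_unmasked` is a hypothetical helper, and whether an extra pass over the map is affordable is a separate question.

```c
#include <stddef.h>

/* Count how many pixels the copy loop would write for n map words.
 * Assumption: each map word masks 8 pixels through its low 8 bits,
 * and a set bit means the pixel is skipped. */
size_t count_unmasked(const unsigned short *map, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned m = map[i] & 0xFFu;
        unsigned set = 0;
        while (m) {            /* clear the lowest set bit each pass */
            m &= m - 1u;
            set++;
        }
        kept += 8u - set;      /* unmasked pixels in this group of 8 */
    }
    return kept;
}
```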