Hello friends!
I am using the following function to compute the sum of absolute differences between two blocks, both of which are in cacheable external memory.
int sad_16x16(uint8_t * restrict pix1, int i_stride_pix1, uint8_t * restrict pix2, int i_stride_pix2)
{
    int i_sum = 0;
    int x, y;
    for( y = 0; y < 16; y++ )
    {
        for( x = 0; x < 16; x += 4)
        {
            i_sum += _dotpu4(_subabs4(_mem4(&pix1[x]), _mem4(&pix2[x])), 0x01010101);
        }
        pix1 += i_stride_pix1;
        pix2 += i_stride_pix2;
    }
    return i_sum;
}
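For comparison on a host PC, here is a minimal portable sketch of what the intrinsic version computes: `_mem4` is an unaligned 4-byte load, `_subabs4` takes the per-byte absolute difference, and `_dotpu4` with 0x01010101 sums the four bytes. The name `sad_16x16_ref` is hypothetical and is not part of any TI library.

```c
#include <stdint.h>
#include <stdlib.h>

/* Portable scalar reference for a 16x16 sum of absolute differences.
 * Hypothetical helper for checking results off-target; it processes
 * one byte at a time instead of four bytes per intrinsic call. */
int sad_16x16_ref(const uint8_t *pix1, int i_stride_pix1,
                  const uint8_t *pix2, int i_stride_pix2)
{
    int i_sum = 0;
    int x, y;

    for (y = 0; y < 16; y++) {
        for (x = 0; x < 16; x++) {
            /* Widen to int so the subtraction cannot wrap around. */
            i_sum += abs((int)pix1[x] - (int)pix2[x]);
        }
        /* Advance each pointer by its own row stride. */
        pix1 += i_stride_pix1;
        pix2 += i_stride_pix2;
    }
    return i_sum;
}
```

It should return the same value as the intrinsic version for the same inputs, which makes it usable as a correctness check while changing the loop bounds.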
However, it runs over 100 times slower than the same function with the inner loop restricted to for( x = 0; x < 8; x += 4) or for( x = 8; x < 16; x += 4). The parameters are the same in all three cases: pix1 = 0xE4A81ECC, i_stride_pix1 = 32, pix2 = 0xE4A81D0C, i_stride_pix2 = 16.
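As a quick sanity check on whether the three loop ranges touch different amounts of memory, the sketch below counts the distinct cache lines covered by both blocks for a given x range. It assumes a 64-byte L1D line, which is my understanding of the C64x+ L1D and should be treated as an assumption. `count_lines_touched` and `add_line` are hypothetical host-side helpers, not TI APIs.

```c
#include <stdint.h>

#define L1D_LINE_BYTES 64u   /* assumption: L1D cache line size in bytes */
#define MAX_LINES      512   /* 16 rows * 16 bytes * 2 blocks, worst case */

/* Append 'line' to the list if it is not already present; return new count. */
static int add_line(uint32_t *lines, int n, uint32_t line)
{
    int i;
    for (i = 0; i < n; i++) {
        if (lines[i] == line) {
            return n;
        }
    }
    if (n < MAX_LINES) {
        lines[n++] = line;
    }
    return n;
}

/* Count distinct cache lines read from both blocks over 16 rows,
 * for byte columns x_begin (inclusive) to x_end (exclusive). */
int count_lines_touched(uint32_t pix1, int stride1,
                        uint32_t pix2, int stride2,
                        int x_begin, int x_end)
{
    uint32_t lines[MAX_LINES];
    int n = 0;
    int x, y;

    for (y = 0; y < 16; y++) {
        for (x = x_begin; x < x_end; x++) {
            uint32_t a1 = pix1 + (uint32_t)(y * stride1 + x);
            uint32_t a2 = pix2 + (uint32_t)(y * stride2 + x);
            n = add_line(lines, n, a1 / L1D_LINE_BYTES);
            n = add_line(lines, n, a2 / L1D_LINE_BYTES);
        }
    }
    return n;
}
```

With the parameters above, calling it for the x ranges 0 to 16, 0 to 8, and 8 to 16 gives 13 lines in each case, so under this assumption the memory footprint alone does not appear to differ between the fast and slow variants.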
The profiler shows many L1D read misses when x runs from 0 to 15, but none when x runs from 0 to 7 or from 8 to 15.
The same happens with the TI IMG Library functions: IMG_sad_16x16 causes many L1D read misses, while IMG_sad_8x8 causes none.
Why does the 16-wide loop cause so many L1D read misses, and how can I avoid them?
Thanks a lot!
DSP C6455 1GHz
L1P 32KB
L1D 32KB
L2 256KB
DDRII address range from 0xE0000000 to 0xEFFFFFFF
Code Composer Studio 3.3.82.13
Integrated Development 5.98.0.393
BIOS 5.31.02
Code Generation Tools v6.0.8
Board Revision (00.00.608)
Target Silicon Revision (00.00.01)