Hi Champs,
The code below runs very slowly in shared memory:
unsigned short *IN1,*IN2,*IN3,*IN4,*IN5,*IN6,*IN7,*IN8,*IN9,*IN10,*IN11;
IN1=imgin_ptr;
IN2=IN1+640;
IN3=IN2+640;
IN4=IN3+640;
...
IN11=IN10+640;
OUT=imgout_ptr;
for (i=0;i<256;i++)
{
    for(j=0;j<640;j++)
    {
        sum=(IN1[0]+IN2[0]+...+IN11[0])/11;
        *OUT++=sum;
        IN1++;
        IN2++;
        ...
        IN11++;
    }
}
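For anyone who wants to reproduce this, the loop above can be written as a self-contained function. This is only a sketch: the name vmean11 and the parameters width and out_rows are made up for illustration, and it assumes a row-major 16-bit image whose input buffer holds at least out_rows + 10 rows. It is meant to show the access pattern (11 reads, each one row apart, per output pixel), not an optimized version.

```c
#include <stddef.h>
#include <stdint.h>

#define TAPS 11  /* vertical window height, as in the loop above */

/* Hypothetical helper: 11x1 vertical mean over a row-major 16-bit image.
   'in' must hold at least (out_rows + TAPS - 1) * width pixels. */
void vmean11(const unsigned short *in, unsigned short *out,
             size_t width, size_t out_rows)
{
    size_t i, j, k;
    for (i = 0; i < out_rows; i++) {
        for (j = 0; j < width; j++) {
            uint32_t sum = 0;  /* 11 * 65535 fits easily in 32 bits */
            for (k = 0; k < TAPS; k++)
                sum += in[(i + k) * width + j];  /* same column, next row */
            out[i * width + j] = (unsigned short)(sum / TAPS);
        }
    }
}
```

With the numbers from my code this would be called as vmean11(imgin_ptr, imgout_ptr, 640, 256).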
1. Only Core0 runs the code.
2. Cache is enabled: L1P, L1D 32K, L2 cache 128K.
3. The code is placed in LL2.
4. Compiled with -O3.
Processing a 640x512 image with the 11x1 window takes about 3 ms; the 1x11 case takes less time.
How can I optimize the code for better performance?
Thanks.
Rgds
Shine