
Cache optimization on DM6446

My question is about cache behaviour on the DM6446.
The same code takes 2 ms on the DM642 platform but 6-7 ms on the DM6446.
// Example code
int len = 352*288;
char * restrict img1 = malloc(len);
char * restrict img2 = malloc(len);
char * restrict img3 = malloc(len);
char * restrict img4 = malloc(len);
char * restrict img5 = malloc(len);
char * restrict img6 = malloc(len);
char * restrict img7 = malloc(len);
char * restrict img8 = malloc(len);

for (int i = 0; i < len; i += 4)
{
    /* some code */

    /* advance every image pointer by the 4 bytes just processed */
    img1 += 4;
    img2 += 4;
    img3 += 4;
    img4 += 4;
    img5 += 4;
    img6 += 4;
    img7 += 4;
    img8 += 4;
}

I checked the generated assembly, and the software pipelining is OK.
Based on spru862b.pdf, I think there are many cache misses of the "capacity miss" kind.
As I understand it, when an L1 miss occurs, a 128-byte line is read from the L2 cache, and the L2 cache reads that line from DDR.

I changed L1D and L2 to smaller sizes, but the execution time did not change.

I do not know how to make better use of the larger L2 cache. If the data needed were already in the L2 cache every time an L1 read miss occurred, the code should run more efficiently,
but I don't know how to achieve that.
Because there are many buffers to process, I don't think a ping-pong buffer scheme is a good fit.
Can somebody help me, or is there sample code for this kind of problem?
I only want the code to run as fast on the DM6446 as it does on the DM642.

  • The DM642 has 256K L2 Memory/Cache whereas the DM6446 has 64K L2 Memory/Cache.

    Memory intensive algorithms will be slower on DM6446.

    Please check whether you can reorganize your data structures to reduce the capacity misses.
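As a rough check on the capacity-miss concern, the working set of the loop in the original post can be compared with the two L2 sizes quoted above. This is a minimal host-side sketch in plain C, not code from the thread. The helper name `working_set_ratio` is hypothetical, and it assumes that all eight 352*288-byte buffers are streamed through once per call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many times larger the working set is than a cache.
 * Assumes n_buffers equally sized buffers are each traversed once. */
double working_set_ratio(size_t n_buffers, size_t bytes_per_buffer,
                         size_t cache_bytes)
{
    assert(cache_bytes > 0);
    return (double)(n_buffers * bytes_per_buffer) / (double)cache_bytes;
}
```

Under these assumptions, `working_set_ratio(8, 352*288, 64*1024)` is about 12.4 and `working_set_ratio(8, 352*288, 256*1024)` is about 3.1, so the eight images do not fit in L2 on either device.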

     

  • Thanks, Cesar.

    I changed the L2 cache from 64K to 32K and the execution time did not change, so I think the cache size is not the key factor.

    I have not found a more effective way to use the L2 cache.

    When an L1 read miss occurs, the data is not in the L2 cache either, so an L2 read miss occurs as well.

    If more than 128 bytes were loaded into the L2 cache each time a 128-byte line is loaded into the L1 cache, the number of L2 read misses would go down.

    Is this a basic property of the L2 cache that we can configure?

     

  • The L1D cache line size is 64 bytes and the L2 line size is 128 bytes. These values are fixed and cannot be changed. The L2 cache is 4-way set associative.

    If possible, please modify your application to optimize its cache usage.
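To see how buffers land in a set-associative cache like the one described above, the set index of an address can be computed on a host PC. The sketch below assumes the textbook mapping (line number modulo number of sets). The function name `cache_set_index` is hypothetical, and the real hardware mapping should be checked against the cache user's guide.

```c
#include <assert.h>
#include <stddef.h>

/* Set index of a byte address in a set-associative cache.
 * Assumes the textbook mapping: set = (address / line size) mod number of sets. */
size_t cache_set_index(size_t addr, size_t line_bytes,
                       size_t cache_bytes, size_t ways)
{
    assert(line_bytes > 0 && ways > 0 && cache_bytes >= line_bytes * ways);
    size_t n_sets = cache_bytes / (line_bytes * ways);
    return (addr / line_bytes) % n_sets;
}
```

For a 64K, 4-way L2 with 128-byte lines this gives 128 sets. If eight buffers are laid out contiguously, 352*288 bytes apart, from a 128-byte-aligned base, their starting offsets map to sets 0, 24, 48, 72, 96, 120, 16 and 40. These are all different, so under these assumptions the eight streams would not compete for the same L2 set.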

     

  • Thanks, Cesar.

    I optimized my code as shown below and ran a simple test.

    The eight images are stored interleaved, 64 bytes from each image in turn:

    |--64 bytes img0--||--64 bytes img1--||--64 bytes img2--| ... |--64 bytes img7--|
     ptotal = (unsigned char *)memalign(128, 352*288*8); /* memalign takes the alignment first, then the size */

    void testspeed(unsigned char * _pbgTotal)
    {
     uchar * restrict pimg0 = _pbgTotal;
     uchar * restrict pimg1 = pimg0+64;
     uchar * restrict pimg2 = pimg0+64*2;
     uchar * restrict pimg3 = pimg0+64*3;

     uchar * restrict pimg4 = pimg0+64*4;
     uchar * restrict pimg5 = pimg0+64*5;
     uchar * restrict pimg6 = pimg0+64*6;
     uchar * restrict pimg7 = pimg0+64*7;

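     int i,j;
     int len = 352*288;       /* bytes per image */
     int steplen = 64;        /* bytes of one image in each interleaved chunk */
     int deltastep = 64*7;    /* skip over the chunks of the other seven images */
     int addone = 1<<24 | 1<<16 | 1<<8 | 1;  /* 0x01010101: adds 1 to each packed byte */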
     
     for(j=0;j<(len/steplen);j++)
     {
      for(i = 0;i<steplen;i+=4)
      {
       _amem4(pimg0) = _add4(_amem4_const(pimg0),addone);
       _amem4(pimg1) = _add4(_amem4_const(pimg1),addone);
       _amem4(pimg2) = _add4(_amem4_const(pimg2),addone);
       _amem4(pimg3) = _add4(_amem4_const(pimg3),addone);
     
       _amem4(pimg4) = _add4(_amem4_const(pimg4),addone);
       _amem4(pimg5) = _add4(_amem4_const(pimg5),addone);
       _amem4(pimg6) = _add4(_amem4_const(pimg6),addone);
       _amem4(pimg7) = _add4(_amem4_const(pimg7),addone);

       pimg0+=4;pimg1+=4;pimg2+=4;pimg3+=4;
       pimg4+=4;pimg5+=4;pimg6+=4;pimg7+=4;
      }
      pimg0+=deltastep;pimg1+=deltastep;pimg2+=deltastep;pimg3+=deltastep;
      pimg4+=deltastep;pimg5+=deltastep;pimg6+=deltastep;pimg7+=deltastep;
     }
    }
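The pointer arithmetic in testspeed can be cross-checked on a host PC with a plain-C index calculation. This is only a sketch of the interleaved layout as I read it from the code above. The names `interleaved_offset`, `N_IMAGES` and `CHUNK_BYTES` are hypothetical and do not appear in the original code.

```c
#include <assert.h>
#include <stddef.h>

enum { N_IMAGES = 8, CHUNK_BYTES = 64 };

/* Offset, within the interleaved buffer, of byte k of image img.
 * Layout assumed: groups of N_IMAGES chunks, CHUNK_BYTES from each image in turn. */
size_t interleaved_offset(size_t img, size_t k)
{
    assert(img < N_IMAGES);
    size_t group  = k / CHUNK_BYTES;   /* which 512-byte group of chunks */
    size_t within = k % CHUNK_BYTES;   /* position inside the 64-byte chunk */
    return group * (N_IMAGES * CHUNK_BYTES) + img * CHUNK_BYTES + within;
}
```

Each group of eight chunks spans 512 bytes, so after a 64-byte inner loop every pointer has to jump a further 64*7 bytes, which is the `deltastep` used above.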

    I think conflict misses and capacity misses no longer occur with this layout, but compulsory misses cannot be avoided.

    The total size of the eight images is 352*288*8 bytes, so there will be (352*288*8)/128 compulsory misses.
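The miss count above can be worked out with a small plain-C helper. This is a sketch for checking the arithmetic on a host PC. The function name `compulsory_line_fills` is hypothetical, and it assumes every line of the buffer is touched exactly once and starts out uncached.

```c
#include <assert.h>
#include <stddef.h>

/* Number of cache lines needed to cover total_bytes, rounded up.
 * Each line touched for the first time costs one compulsory miss. */
size_t compulsory_line_fills(size_t total_bytes, size_t line_bytes)
{
    assert(line_bytes > 0);
    return (total_bytes + line_bytes - 1) / line_bytes;
}
```

With 128-byte L2 lines, `compulsory_line_fills(352*288*8, 128)` gives 6336. With 64-byte L1D lines the same data needs 12672 line fills.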

    The original code of my algorithm looks like testspeed2 below. I think the two functions, "testspeed" and "testspeed2", do similar work.

    I tested both functions on the DM642 and the DM6446, and the result is interesting.

                  DM642     DM6446
    testspeed     4-5 ms    4-5 ms
    testspeed2    1 ms      4-5 ms

    So why does testspeed2 run so much faster on the DM642?

    void testspeed2(unsigned char * _pbgTotal)
    {
     int len = 352*288;
     uchar * restrict pimg0 = _pbgTotal;
     uchar * restrict pimg1 = pimg0+len;
     uchar * restrict pimg2 = pimg0+len*2;
     uchar * restrict pimg3 = pimg0+len*3;

     uchar * restrict pimg4 = pimg0+len*4;
     uchar * restrict pimg5 = pimg0+len*5;
     uchar * restrict pimg6 = pimg0+len*6;
     uchar * restrict pimg7 = pimg0+len*7;

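     int i,j;

     int steplen = 64;        /* inner-loop length, kept the same as in testspeed */
     int deltastep = 64*7;    /* unused here: each image is contiguous */

     int addone = 1<<24 | 1<<16 | 1<<8 | 1;  /* 0x01010101: adds 1 to each packed byte */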
     
     for(j=0;j<(len/steplen);j++)
     {
      for(i = 0;i<steplen;i+=4)
      {
       _amem4(pimg0) = _add4(_amem4_const(pimg0),addone);
       _amem4(pimg1) = _add4(_amem4_const(pimg1),addone);
       _amem4(pimg2) = _add4(_amem4_const(pimg2),addone);
       _amem4(pimg3) = _add4(_amem4_const(pimg3),addone);
     
       _amem4(pimg4) = _add4(_amem4_const(pimg4),addone);
       _amem4(pimg5) = _add4(_amem4_const(pimg5),addone);
       _amem4(pimg6) = _add4(_amem4_const(pimg6),addone);
       _amem4(pimg7) = _add4(_amem4_const(pimg7),addone);

       pimg0+=4;
       pimg1+=4;pimg2+=4;pimg3+=4;
       pimg4+=4;pimg5+=4;pimg6+=4;pimg7+=4;
      }
     }
    }
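To try either test function on a host PC, the packed add can be emulated in plain C. The sketch below is a portable stand-in for what `_add4` does as I understand it: four independent 8-bit additions with no carry between bytes. It is not the TI intrinsic itself, and the names `add4_portable` and `ADD_ONE_PER_BYTE` are hypothetical.

```c
#include <stdint.h>

/* Same value as `addone` in the code above: a 1 in each of the four bytes. */
#define ADD_ONE_PER_BYTE ((uint32_t)(1u << 24 | 1u << 16 | 1u << 8 | 1u))

/* Portable stand-in for a packed 8-bit add: four independent byte sums,
 * each wrapping modulo 256, with no carry into the neighbouring byte. */
uint32_t add4_portable(uint32_t a, uint32_t b)
{
    uint32_t low7 = (a & 0x7f7f7f7fu) + (b & 0x7f7f7f7fu); /* add the low 7 bits of each byte */
    return low7 ^ ((a ^ b) & 0x80808080u);                 /* add the top bits without carrying out */
}
```

For example, adding `ADD_ONE_PER_BYTE` to a word whose bytes are 0x01, 0xFF, 0x7F and 0x00 yields bytes 0x02, 0x00, 0x80 and 0x01: the 0xFF byte wraps to 0x00 without disturbing its neighbour.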

     

  • Hello,
    Did you find out how to enable the cache and fix this? I am facing the same problem.
    Best regards
  • The cache works fine.
    The DM642 has a bug in its timer, so the time measured on it was not correct.
  • Thanks for replying.

    I am asking about enabling the cache on the DM6446.

    The L2 is set to 64k in the tcf file.

    Is that enough to enable the cache, or should I do something else?

    Regards