dm642 questions about cache

Here are two simple functions that I use for testing.

The test results are very interesting.

             dm642     dm6446
testspeed    4-5ms     4-5ms
testspeed2   1ms       4-5ms

I don't know why testspeed2 is so fast on dm642, or how I can achieve the same speed on dm6446. I used the -o3 option.

ptotal = (unsigned char *)memalign(128, 352*288*8);  /* memalign(alignment, size) */
typedef unsigned char uchar;

/* Interleaved layout: each 512-byte group holds one 64-byte row from each of the eight images. */
void testspeed(unsigned char * _pbgTotal)
{
    uchar * restrict pimg0 = _pbgTotal;
    uchar * restrict pimg1 = pimg0+64;
    uchar * restrict pimg2 = pimg0+64*2;
    uchar * restrict pimg3 = pimg0+64*3;

    uchar * restrict pimg4 = pimg0+64*4;
    uchar * restrict pimg5 = pimg0+64*5;
    uchar * restrict pimg6 = pimg0+64*6;
    uchar * restrict pimg7 = pimg0+64*7;

    int i,j;
    int len = 352*288;
    int steplen = 64;
    int deltastep = 64*7;
    int addone = 1<<24 | 1<<16 | 1<<8 | 1;   /* 0x01010101: adds 1 to each packed byte */

    for(j=0;j<(len/steplen);j++)
    {
        for(i = 0;i<steplen;i+=4)
        {
            _amem4(pimg0) = _add4(_amem4_const(pimg0),addone);
            _amem4(pimg1) = _add4(_amem4_const(pimg1),addone);
            _amem4(pimg2) = _add4(_amem4_const(pimg2),addone);
            _amem4(pimg3) = _add4(_amem4_const(pimg3),addone);

            _amem4(pimg4) = _add4(_amem4_const(pimg4),addone);
            _amem4(pimg5) = _add4(_amem4_const(pimg5),addone);
            _amem4(pimg6) = _add4(_amem4_const(pimg6),addone);
            _amem4(pimg7) = _add4(_amem4_const(pimg7),addone);

            pimg0+=4;pimg1+=4;pimg2+=4;pimg3+=4;
            pimg4+=4;pimg5+=4;pimg6+=4;pimg7+=4;
        }
        /* skip the seven rows that belong to the other images */
        pimg0+=deltastep;pimg1+=deltastep;pimg2+=deltastep;pimg3+=deltastep;
        pimg4+=deltastep;pimg5+=deltastep;pimg6+=deltastep;pimg7+=deltastep;
    }
}

///////////////////////////////////////////////////////////////////////////////////////
/* Planar layout: eight full 352*288 images stored one after another. */
void testspeed2(unsigned char * _pbgTotal)
{
    int len = 352*288;
    uchar * restrict pimg0 = _pbgTotal;
    uchar * restrict pimg1 = pimg0+len;
    uchar * restrict pimg2 = pimg0+len*2;
    uchar * restrict pimg3 = pimg0+len*3;

    uchar * restrict pimg4 = pimg0+len*4;
    uchar * restrict pimg5 = pimg0+len*5;
    uchar * restrict pimg6 = pimg0+len*6;
    uchar * restrict pimg7 = pimg0+len*7;

    int i,j;
    int steplen = 64;
    int deltastep = 64*7;   /* unused here: nothing runs between the inner and outer loop */

    int addone = 1<<24 | 1<<16 | 1<<8 | 1;   /* 0x01010101: adds 1 to each packed byte */

    for(j=0;j<(len/steplen);j++)
    {
        for(i = 0;i<steplen;i+=4)
        {
            _amem4(pimg0) = _add4(_amem4_const(pimg0),addone);
            _amem4(pimg1) = _add4(_amem4_const(pimg1),addone);
            _amem4(pimg2) = _add4(_amem4_const(pimg2),addone);
            _amem4(pimg3) = _add4(_amem4_const(pimg3),addone);

            _amem4(pimg4) = _add4(_amem4_const(pimg4),addone);
            _amem4(pimg5) = _add4(_amem4_const(pimg5),addone);
            _amem4(pimg6) = _add4(_amem4_const(pimg6),addone);
            _amem4(pimg7) = _add4(_amem4_const(pimg7),addone);

            pimg0+=4;
            pimg1+=4;pimg2+=4;pimg3+=4;
            pimg4+=4;pimg5+=4;pimg6+=4;pimg7+=4;
        }
    }
}
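
For readers who want to try the loops above on a host PC, here is a minimal sketch of a stand-in for the `_add4` intrinsic. It assumes `_add4` performs four independent 8-bit additions with no carry between bytes and no saturation. The names `add4_emul` and `add_one_to_bytes` are hypothetical and are not part of the TI toolchain.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Host-side stand-in for _add4 (assumed semantics: four independent
 * 8-bit additions, no carry between bytes, no saturation). */
uint32_t add4_emul(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every byte; these sums cannot carry out of a byte. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Fold in the top bit of every byte with XOR, which discards the carry-out. */
    return low7 ^ ((a ^ b) & 0x80808080u);
}

/* Adds 1 to every byte of buf, four bytes at a time, like the loops above. */
void add_one_to_bytes(uint8_t *buf, size_t len)
{
    assert(len % 4 == 0);
    for (size_t i = 0; i < len; i += 4) {
        uint32_t w;
        memcpy(&w, buf + i, 4);          /* portable replacement for _amem4_const */
        w = add4_emul(w, 0x01010101u);
        memcpy(buf + i, &w, 4);          /* portable replacement for _amem4 */
    }
}
```

This only reproduces the arithmetic; it says nothing about timing on either DSP.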

  • What compiler version are you using? Can you send the generated ASM files for the two devices (-k -mw compiler options)?
    How is cache enabled? Are you using BIOS for that?
    Do you have an example project that you can provide?
    Where are these buffers, internal or external memory?

    Regards, Srirami.

  • Thanks Srirami. I list the differences below and attach the asm file.

                        dm642                    dm6446
    compiler version    6.0.8                    6.0.14 and 6.0.21
    L2                  128k                     64k
    cache external      128to159 = 0x0000ffff    128to159 = 0x0000ffff
    project             BIOS videocopy           Codec Engine

    These buffers are in external memory.

    If you can test it, you can simply add these two functions to your project.

    regards, balance

     

    asm.rar
  • Can anybody help me?

    Is it a hardware limitation of dm6446?

  • The ASM code shows no performance loss. The kernels generated for the two devices are similar. The only difference that can affect performance is the cache sizes.
    Can you make the cache sizes the same so that an apples-to-apples comparison can be done?
    • Make L1D : 32KB
    • Make L2: 64KB

    Also, how are you measuring the time? Are you using DSP-BIOS APIs?

    Regards, Srirami.

  • Thanks, Srirami.

    I made these changes on dm642:

    L1D: 16k

    L2: 64k, 32k

    When I reduce the L2 cache size, the time changes only a little. testspeed2() on dm642 is still fast.

    So the L2 cache size is not the key.

    I measure the time on dm642 with CLK_getltime(), and on dm6446 by measuring how long videnc_process() takes in the ARM Linux app.

    Could anybody test it on a board?

     Regards, balance. 

     

  • Balance,

    Is this issue still open, or were you able to fix it?

    Regards, Srirami.

     

  • Srirami,

    It's still open. I'm not sure whether it is caused by a hardware defect.

    Regards, Balance.

  • Balance, from your response it looks like you are benchmarking on DM642 by calling the BIOS CLK API, whereas on DM6446 you seem to be benchmarking from the ARM application. In that case, won't the DM6446 benchmark also include the overhead of calling the function from the ARM?
    Is it possible to benchmark on DM6446 using the CLK API on the DSP as well?


    Gagan

     

  • Thanks Gagan,

    When I used scratch memory in the L1 data cache shared with h264enc, the trace module could not print the debug info that I added, so I can't do the test on the DSP.

    The overhead of calling the function from the ARM is less than 1ms, so the results I reported already have 1ms subtracted from the time measured in the ARM application.

    Balance. 

  • Balance, I looked at the results and the code more carefully. I can't understand how the DM642 results for testspeed2 get better (and by 4-5x at that). If anything, the cache activity in testspeed2 seems to be more demanding. I think the DM6446 results are more understandable. Having looked at the cache architecture, and based on your results with different cache sizes, I think this is not related to the memory architecture of the two devices.
    The only option I have is to run the code on DM642 and try to recreate your benchmark. Would it be possible for you to provide your test project for DM642? If not, I'll go ahead and write my own, but it would be better if you could provide yours. That way, I will be sure to see what you are observing.

    Sorry that this is taking so long to close. We will try to resolve it soon.

    Regards,
    Gagan

  • Thanks Gagan, you are so kind.

    The test project includes some libraries provided by another company, so it can't run on your device.

    Regards,
    Balance

  • Balance, I created a project to benchmark the code on the hardware. I think I see the issue.
    Can you recheck whether your code is correct? In testspeed2, aren't you missing the step that recalculates the pointers after the inner loop is done?

          pimg0+=deltastep;pimg1+=deltastep;pimg2+=deltastep;pimg3+=deltastep;
          pimg4+=deltastep;pimg5+=deltastep;pimg6+=deltastep;pimg7+=deltastep;

    Regards,
    Gagan

  • Gagan Maur, the testspeed2 code is right.

    testspeed and testspeed2 do the same thing.

    In testspeed

    the data is arranged line by line, like this:

    (pimg0  0byte----63byte) (pimg1  64byte----127byte) (pimg2  128byte----191byte)  .........................

    (pimg0  0byte+deltastep----63byte+deltastep) (pimg1  64byte+deltastep----127byte+deltastep) (pimg2  128byte+deltastep----191byte+deltastep)  .........................

     

    but in testspeed2, the data is arranged frame by frame, like this:

    len = 352*288;

    (pimg0  0byte----(len-1)byte) (pimg1  len byte----(2*len-1)byte) (pimg2  2*len byte----(3*len-1)byte) .........

    Maybe it will be clearer if I write the testspeed2 function simply, like this:

    for(j=0;j<(len);j+=4)
    {
       _amem4(pimg0) = _add4(_amem4_const(pimg0),addone);
       _amem4(pimg1) = _add4(_amem4_const(pimg1),addone);
       _amem4(pimg2) = _add4(_amem4_const(pimg2),addone);
       _amem4(pimg3) = _add4(_amem4_const(pimg3),addone);

       _amem4(pimg4) = _add4(_amem4_const(pimg4),addone);
       _amem4(pimg5) = _add4(_amem4_const(pimg5),addone);
       _amem4(pimg6) = _add4(_amem4_const(pimg6),addone);
       _amem4(pimg7) = _add4(_amem4_const(pimg7),addone);

       pimg0+=4;pimg1+=4;pimg2+=4;pimg3+=4;
       pimg4+=4;pimg5+=4;pimg6+=4;pimg7+=4;
    }

    These two functions do the same thing, but testspeed2 on DM642 is so fast.
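
The two layouts described above can be written as offset formulas. The following is a minimal sketch, assuming 8 images of 352*288 bytes and 64-byte rows as in the posted functions. The helper names `interleaved_offset` and `planar_offset` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { IMAGES = 8, ROW = 64, LEN = 352 * 288 };

/* testspeed layout: byte k of image `img` lives in a 512-byte group
 * that holds one 64-byte row from each of the eight images. */
size_t interleaved_offset(int img, size_t k)
{
    assert(img >= 0 && img < IMAGES && k < (size_t)LEN);
    size_t group = k / ROW;              /* which 512-byte group */
    size_t col   = k % ROW;              /* position inside the 64-byte row */
    return group * (IMAGES * ROW) + (size_t)img * ROW + col;
}

/* testspeed2 layout: the eight images are stored one after another. */
size_t planar_offset(int img, size_t k)
{
    assert(img >= 0 && img < IMAGES && k < (size_t)LEN);
    return (size_t)img * LEN + k;
}
```

Both formulas cover the same 352*288*8 bytes exactly once, so the two functions touch the same amount of memory and differ only in access order.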

    Regards,

    Balance.

  • Balance, I understand. Let me try now on DM6446 and see the performance I get. By the way, I do see the performance you report on DM642.

    Regards,
    Gagan

  • Hello Balance, I finally figured out the issue. Sorry it took some time. I have been working for many years now on C6000 cores that are later generations than the C64x core. So I couldn't immediately catch the issue that now seems so obvious. Please let me explain what is going on.

    In the older C6000 cores like C64x, the compiler would disable interrupts across tightly scheduled kernels. This was done because in tight kernels it was not 'safe' to take interrupts. In tight loops, CPU instructions are in different stages of the pipeline and can't be abruptly interrupted. If you are not clear on this behavior, please let me know and I will provide more information.

    In the newer cores like C64x+, the core supports the SPLOOP hardware buffer. The SPLOOP buffer allows newer cores to be interrupted even during tight loops. Please look up the information on SPLOOP in the CPU User's Guide. If you have questions, please let me know and I can help answer them.

    The BIOS CLK module measures time by triggering periodic interrupts. The period is configurable and is generally 1 msec. The CLK functions that return CPU time compute it from the number of interrupts that have been serviced and the current value of the timer counter. Thus, time = number of interrupts * interrupt period + current count of timer counter register. As you can imagine, for the BIOS CLK APIs to work correctly, the timer interrupts need to be serviced on time.
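
The formula above can be sketched as a small model. This is only an illustration of the arithmetic, not the DSP/BIOS source. The struct `clk_model` and the function `clk_model_time` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the state a CLK-style time function relies on. */
typedef struct {
    uint32_t ticks_serviced;   /* timer interrupts actually serviced so far */
    uint32_t counts_per_tick;  /* timer counts in one interrupt period */
    uint32_t current_count;    /* current value of the timer counter register */
} clk_model;

/* time = number of interrupts * interrupt period + current timer count */
uint64_t clk_model_time(const clk_model *c)
{
    assert(c->current_count < c->counts_per_tick);
    return (uint64_t)c->ticks_serviced * c->counts_per_tick + c->current_count;
}
```

If some interrupts are never serviced, `ticks_serviced` is too small and the reported time comes out short by one full period for every missed tick.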

    So now you can understand that on C64x, if a tight loop runs for a very long time, the BIOS CLK interrupts may be prevented from happening. This can result in incorrect timing measurements. The test that you are performing runs for many msecs. In the case of the testspeed function, there is some code that is executed between the inner loop and the outer loop.
    ...
          }
          pimg0+=deltastep;pimg1+=deltastep;pimg2+=deltastep;pimg3+=deltastep;
          pimg4+=deltastep;pimg5+=deltastep;pimg6+=deltastep;pimg7+=deltastep;
        }
    ...

    The code generated by the compiler enables interrupts between the inner loop and the outer loop. This gives a window for interrupts to happen, so the numbers for testspeed are correct.

    In the case of testspeed2, there is no code between the inner loop and the outer loop. This allows the compiler to optimize the nested loops as a single tight kernel. As part of this optimization, the compiler doesn't enable interrupts at any point during the entire processing! The processing runs for many msec, so many timer interrupts are lost. This results in incorrect timing measurements for testspeed2 on DM642. On DM6446, thanks to SPLOOP, the interrupts are not lost and hence the timing measurements are correct.
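
The effect described above can be illustrated with a toy simulation. It assumes, as a simplification, that the hardware latches at most one pending timer interrupt while interrupts are masked, so further timer events in the same masked stretch are lost. The function `ticks_counted` and its parameters are hypothetical and only model the idea.

```c
#include <assert.h>

/* Returns how many timer ticks the handler gets to count while total_us
 * of work runs. Interrupts are masked during each uninterrupted chunk of
 * chunk_us and are briefly enabled between chunks. */
unsigned ticks_counted(unsigned total_us, unsigned chunk_us, unsigned period_us)
{
    assert(chunk_us > 0 && period_us > 0);
    unsigned counted = 0;
    unsigned pending = 0;                 /* single latched interrupt flag */
    for (unsigned t = 1; t <= total_us; t++) {
        if (t % period_us == 0)
            pending = 1;                  /* timer event; lost if already pending */
        if (t % chunk_us == 0 || t == total_us) {
            counted += pending;           /* interrupt window: service the flag */
            pending = 0;
        }
    }
    return counted;
}
```

With a 1000 us period and 5000 us of work, `ticks_counted(5000, 10, 1000)` counts all 5 ticks, while `ticks_counted(5000, 5000, 1000)` counts only 1. That is in line with the 4-5ms and 1ms readings reported on DM642, although the thread does not confirm this exact mechanism.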

    There are many solutions. The first and easiest is to use a new function that I wrote:

    void testspeed3 (
                        unsigned char * restrict _pbgTotal,
                        int len,
                        int deltastep,
                        int countOuter,
                        int countInner
                     )
    {
        unsigned char * restrict pimg0 = _pbgTotal;
        unsigned char * restrict pimg1 = pimg0+len;
        unsigned char * restrict pimg2 = pimg0+len*2;
        unsigned char * restrict pimg3 = pimg0+len*3;
        unsigned char * restrict pimg4 = pimg0+len*4;
        unsigned char * restrict pimg5 = pimg0+len*5;
        unsigned char * restrict pimg6 = pimg0+len*6;
        unsigned char * restrict pimg7 = pimg0+len*7;

        int i,j;

        int addone = 1<<24 | 1<<16 | 1<<8 | 1;

        for(j=0;j<(countOuter);j++)
        {
          for(i = 0;i<countInner;i+=4)
          {
           _amem4(pimg0) = _add4(_amem4_const(pimg0),addone);
           _amem4(pimg1) = _add4(_amem4_const(pimg1),addone);
           _amem4(pimg2) = _add4(_amem4_const(pimg2),addone);
           _amem4(pimg3) = _add4(_amem4_const(pimg3),addone);
           _amem4(pimg4) = _add4(_amem4_const(pimg4),addone);
           _amem4(pimg5) = _add4(_amem4_const(pimg5),addone);
           _amem4(pimg6) = _add4(_amem4_const(pimg6),addone);
           _amem4(pimg7) = _add4(_amem4_const(pimg7),addone);

           pimg0+=4;
           pimg1+=4;pimg2+=4;pimg3+=4;
           pimg4+=4;pimg5+=4;pimg6+=4;pimg7+=4;
          }

          pimg0+=deltastep;pimg1+=deltastep;pimg2+=deltastep;pimg3+=deltastep;
          pimg4+=deltastep;pimg5+=deltastep;pimg6+=deltastep;pimg7+=deltastep;
        }
    }

    Then you can make calls like this:

        /* 1584 = 352*288/64 outer iterations of 64 bytes each */
        start = CLK_gethtime ();
        testspeed3 (ptotal, 64, 64*7, 1584, 64);
        end = CLK_gethtime ();
        total1 = end - start;
        LOG_printf (&trace, "CPU Clocks testspeed3 mimicking testspeed 1 = %d", total1);

        start = CLK_gethtime ();
        testspeed3 (ptotal, 352*288, 0, 1584, 64);
        end = CLK_gethtime ();
        total2 = end - start;
        LOG_printf (&trace, "CPU Clocks testspeed3 mimicking testspeed 2 = %d", total2);

    I think the above solution will work for you, so I won't list the others.
    Using the above method you can see correct timings on DM642 as well.

    I know it has taken a while to resolve this issue. Thank you for your patience during this time.
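
To compare such raw count differences with the millisecond figures quoted earlier in the thread, the counts need to be scaled. The following is a minimal sketch with hypothetical helpers `elapsed_counts` and `counts_to_ms`. On the target, the counts-per-millisecond value would presumably come from the DSP/BIOS call CLK_countspms(); that is an assumption here, so the sketch takes it as a plain parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Difference of two 32-bit timer readings. Unsigned subtraction is
 * modulo 2^32, so a single counter wrap between the readings is handled. */
uint32_t elapsed_counts(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Converts a count difference to milliseconds. */
double counts_to_ms(uint32_t counts, uint32_t counts_per_ms)
{
    assert(counts_per_ms > 0);
    return (double)counts / (double)counts_per_ms;
}
```

For example, with 75000 counts per millisecond, a difference of 150000 counts corresponds to 2.0 ms.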

    Regards,
    Gagan

     

  • Gagan, I think you are right. Thank you for your work.

    I found that interrupts are disabled in some of the tight loops.

    Regards,
    Balance.