dm642 questions about cache

balanceren

Here are two simple test functions. The timing results are very interesting:

	dm642	dm6446
testspeed	4-5ms	4-5ms
testspeed2	1ms	4-5ms

I don't know why testspeed2 is so fast on the dm642, or how I can achieve the same speed on the dm6446. I compiled with the -o3 option.

/* memalign(alignment, size): 128-byte aligned buffer holding 8 frames of 352*288 bytes */
ptotal = (unsigned char *)memalign(128, 352*288*8);
/* uchar is assumed to be a typedef for unsigned char */
void testspeed(unsigned char * _pbgTotal)
{
    /* eight 64-byte rows, interleaved line by line */
    uchar * restrict pimg0 = _pbgTotal;
    uchar * restrict pimg1 = pimg0+64;
    uchar * restrict pimg2 = pimg0+64*2;
    uchar * restrict pimg3 = pimg0+64*3;
    uchar * restrict pimg4 = pimg0+64*4;
    uchar * restrict pimg5 = pimg0+64*5;
    uchar * restrict pimg6 = pimg0+64*6;
    uchar * restrict pimg7 = pimg0+64*7;
    int i,j;
    int len = 352*288;
    int steplen = 64;
    int deltastep = 64*7;
    int addone = 1<<24 | 1<<16 | 1<<8 | 1;

    for(j=0;j<(len/steplen);j++)
    {
        for(i = 0;i<steplen;i+=4)
        {
            /* add 1 to each of four packed bytes */
            _amem4(pimg0) = _add4(_amem4_const(pimg0),addone);
            _amem4(pimg1) = _add4(_amem4_const(pimg1),addone);
            _amem4(pimg2) = _add4(_amem4_const(pimg2),addone);
            _amem4(pimg3) = _add4(_amem4_const(pimg3),addone);

            _amem4(pimg4) = _add4(_amem4_const(pimg4),addone);
            _amem4(pimg5) = _add4(_amem4_const(pimg5),addone);
            _amem4(pimg6) = _add4(_amem4_const(pimg6),addone);
            _amem4(pimg7) = _add4(_amem4_const(pimg7),addone);
            pimg0+=4;pimg1+=4;pimg2+=4;pimg3+=4;
            pimg4+=4;pimg5+=4;pimg6+=4;pimg7+=4;
        }
        /* skip the other seven rows to reach each pointer's next row */
        pimg0+=deltastep;pimg1+=deltastep;pimg2+=deltastep;pimg3+=deltastep;
        pimg4+=deltastep;pimg5+=deltastep;pimg6+=deltastep;pimg7+=deltastep;
    }
}

///////////////////////////////////////////////////////////////////////////////////////
void testspeed2(unsigned char * _pbgTotal)
{
    /* eight full frames of len bytes, stored one after another */
    int len = 352*288;
    uchar * restrict pimg0 = _pbgTotal;
    uchar * restrict pimg1 = pimg0+len;
    uchar * restrict pimg2 = pimg0+len*2;
    uchar * restrict pimg3 = pimg0+len*3;
    uchar * restrict pimg4 = pimg0+len*4;
    uchar * restrict pimg5 = pimg0+len*5;
    uchar * restrict pimg6 = pimg0+len*6;
    uchar * restrict pimg7 = pimg0+len*7;
    int i,j;
    int steplen = 64;
    int deltastep = 64*7;   /* unused in this function */
    int addone = 1<<24 | 1<<16 | 1<<8 | 1;

    for(j=0;j<(len/steplen);j++)
    {
        for(i = 0;i<steplen;i+=4)
        {
            _amem4(pimg0) = _add4(_amem4_const(pimg0),addone);
            _amem4(pimg1) = _add4(_amem4_const(pimg1),addone);
            _amem4(pimg2) = _add4(_amem4_const(pimg2),addone);
            _amem4(pimg3) = _add4(_amem4_const(pimg3),addone);

            _amem4(pimg4) = _add4(_amem4_const(pimg4),addone);
            _amem4(pimg5) = _add4(_amem4_const(pimg5),addone);
            _amem4(pimg6) = _add4(_amem4_const(pimg6),addone);
            _amem4(pimg7) = _add4(_amem4_const(pimg7),addone);
            pimg0+=4;
            pimg1+=4;pimg2+=4;pimg3+=4;
            pimg4+=4;pimg5+=4;pimg6+=4;pimg7+=4;
        }
        /* no pointer adjustment here: each frame is contiguous */
    }
}
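For anyone who wants to try the arithmetic of these loops on a host PC rather than on the DSP, here is a minimal sketch of what `_add4` does: four independent 8-bit additions packed into one 32-bit word, with no carry between bytes and wraparound on overflow. The name `add4_emul` is hypothetical; it is only a stand-in for the intrinsic and says nothing about timing.

```c
#include <stdint.h>

/* Hypothetical host-side stand-in for the C6000 _add4 intrinsic.
 * Adds four packed bytes lane by lane; a carry out of one byte
 * never reaches the next byte, and each lane wraps modulo 256. */
static uint32_t add4_emul(uint32_t a, uint32_t b)
{
    /* Sum the low 7 bits of every lane; this cannot overflow a lane. */
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Fold in the top bit of each lane with XOR so no carry escapes. */
    return low ^ ((a ^ b) & 0x80808080u);
}
```

With `addone` equal to 0x01010101, `add4_emul(0x00FF7F10, addone)` gives 0x01008011: the 0xFF lane wraps to 0x00 without disturbing its neighbour.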

over 15 years ago

0 srirami over 15 years ago

TI__Expert 8260 points

What compiler version are you using? Can you send the generated ASM files for the two devices (-k -mw compiler options)?
How is cache enabled? Are you using BIOS for that?
Do you have an example project that you can provide?
Where are these buffers, internal or external memory?

Regards, Srirami.

0 balanceren over 15 years ago in reply to srirami

Prodigy 240 points

Thanks, Srirami. I have listed the differences below and attached the asm files.

	dm642	dm6446
compiler version	6.0.8	6.0.14 and 6.0.21
L2	128k	64k
cache external	128to159 = 0x0000ffff	128to159 = 0x0000ffff
project	on BIOS	videocopy project in Codec Engine

The buffers are in external memory.

If you can test it, you can simply add these two functions to your project.

Regards, balance

asm.rar

0 balanceren over 15 years ago in reply to balanceren

Prodigy 240 points

Can anybody help me?

Is it a hardware limitation of the dm6446?

0 srirami over 15 years ago in reply to balanceren

TI__Expert 8260 points

The ASM code shows no performance loss. The kernels generated for the two devices are similar. The only difference that can impact performance is the cache sizes.
Can you make cache sizes similar so that an apples to apples comparison can be done?
• Make L1D : 32KB
• Make L2: 64KB

Also, how are you measuring the time? Are you using DSP-BIOS APIs?

Regards, Srirami.

0 balanceren over 15 years ago in reply to srirami

Prodigy 240 points

thanks, Srirami

I made these changes on the dm642:

L1D: 16k

L2: 64k, 32k

When I reduce the L2 cache size, the time changes only a little, and testspeed2() on the dm642 is still fast.

So the L2 cache size is not the key.

I measure the time on the dm642 with CLK_getltime(), and on the dm6446 by timing how long videnc_process() takes in the ARM Linux app.

Could anybody test it on a board?

Regards, balance.

0 srirami over 15 years ago in reply to balanceren

TI__Expert 8260 points

Balance,

Is this issue still open, or were you able to fix it?

Regards, Srirami.

0 balanceren over 15 years ago in reply to srirami

Prodigy 240 points

Srirami,

It's still open. I'm not sure whether it is caused by a hardware defect.

Regards, Balance.

0 Gagan Maur over 14 years ago in reply to balanceren

TI__Expert 8150 points

Balance, from your response it looks like you are benchmarking on DM642 by calling the BIOS CLK API, whereas on DM6446 you seem to be benchmarking from the ARM application. In that case, won't the DM6446 benchmark also include the overhead of calling the function from the ARM?
Is it possible to benchmark on DM6446 also using the CLK API on the DSP?

Gagan

0 balanceren over 14 years ago in reply to Gagan Maur

Prodigy 240 points

Thanks Gagan,

When I used scratch memory in the L1 data cache shared with h264enc, the trace module couldn't print the debug info that I added, so I can't run the test on the DSP.

The overhead of calling the function from the ARM is less than 1ms, so the results I reported are the times measured in the ARM application minus 1ms.

Balance.

0 Gagan Maur over 14 years ago in reply to balanceren

TI__Expert 8150 points

Balance, I looked at the results and the code more carefully. I can't understand how the DM642 results for testspeed2 get better (and by 4-5x at that). As a matter of fact, the cache activity in testspeed2 seems to be more demanding. I think the DM6446 results are more understandable. Having looked at the cache architecture, and based on your results from trying different cache sizes, I think this is not related to the memory architecture of the two devices.
The only option I have is to run the code on DM642 and try to recreate your benchmark. Would it be possible for you to provide your test project for DM642? If not, I'll go ahead and write my own, but it would be better if you could provide yours. That way, I will be sure to see what you are observing.

Sorry that this is taking so long to close. We will try to resolve it soon.

Regards,
Gagan

0 balanceren over 14 years ago in reply to Gagan Maur

Prodigy 240 points

Thanks Gagan, you are so kind.

The test project includes some libraries provided by another company, so it can't run on your device.

Regards,
Balance

0 Gagan Maur over 14 years ago in reply to balanceren

TI__Expert 8150 points

Balance, I created a project to benchmark the code on the hardware. I think I see the issue.
Can you recheck whether your code is correct? In testspeed2, aren't you missing the step that recalculates the pointers after the inner loop is done?

pimg0+=deltastep;pimg1+=deltastep;pimg2+=deltastep;pimg3+=deltastep;
pimg4+=deltastep;pimg5+=deltastep;pimg6+=deltastep;pimg7+=deltastep;

Regards,
Gagan

0 balanceren over 14 years ago in reply to Gagan Maur

Prodigy 240 points

Gagan Maur, the testspeed2 code is correct.

testspeed and testspeed2 do the same thing.

In testspeed, the data is arranged line by line, like this (each pointer advances by steplen + deltastep = 512 bytes per row):

(pimg0 0byte----63byte) (pimg1 64byte----127byte) (pimg2 128byte----191byte) .........................

(pimg0 512byte----575byte) (pimg1 576byte----639byte) (pimg2 640byte----703byte) .........................

But in testspeed2, the data is arranged frame by frame, like this:

len = 352*288;

(pimg0 0byte----(len-1)byte) (pimg1 len byte----(2*len-1)byte) (pimg2 2*len byte----(3*len-1)byte).........

Maybe it is clearer if I write the testspeed2 loop simply like this:

for(j=0;j<(len);j+=4)
{
    _amem4(pimg0) = _add4(_amem4_const(pimg0),addone);
    _amem4(pimg1) = _add4(_amem4_const(pimg1),addone);
    _amem4(pimg2) = _add4(_amem4_const(pimg2),addone);
    _amem4(pimg3) = _add4(_amem4_const(pimg3),addone);

    _amem4(pimg4) = _add4(_amem4_const(pimg4),addone);
    _amem4(pimg5) = _add4(_amem4_const(pimg5),addone);
    _amem4(pimg6) = _add4(_amem4_const(pimg6),addone);
    _amem4(pimg7) = _add4(_amem4_const(pimg7),addone);
    pimg0+=4;pimg1+=4;pimg2+=4;pimg3+=4;
    pimg4+=4;pimg5+=4;pimg6+=4;pimg7+=4;
}

These two functions do the same amount of work, but testspeed2 on DM642 is much faster.

Regards,

Balance.
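The claim that both layouts cover the same amount of data can be checked off-target. The sketch below is a plain C model, not the DSP code: it replaces the intrinsics with byte increments, visits the bytes in a simpler order than the original eight-pointer loops, and uses hypothetical names (`touch_interleaved`, `touch_planar`, `layouts_touch_same_bytes`). It only confirms that each traversal increments every byte of the 8*len buffer exactly once; it says nothing about cache behaviour or speed.

```c
#include <stddef.h>
#include <stdlib.h>

enum { LEN = 352 * 288, STEP = 64, PLANES = 8 };

/* testspeed layout: row j of plane p starts at j*512 + p*64. */
static void touch_interleaved(unsigned char *buf)
{
    size_t j, p, i;
    for (j = 0; j < LEN / STEP; j++)
        for (p = 0; p < PLANES; p++)
            for (i = 0; i < STEP; i++)
                buf[j * STEP * PLANES + p * STEP + i] += 1;
}

/* testspeed2 layout: plane p is one contiguous frame at p*LEN. */
static void touch_planar(unsigned char *buf)
{
    size_t p, i;
    for (p = 0; p < PLANES; p++)
        for (i = 0; i < LEN; i++)
            buf[p * LEN + i] += 1;
}

/* Returns 1 if both traversals increment every byte exactly once. */
static int layouts_touch_same_bytes(void)
{
    size_t n = (size_t)LEN * PLANES, k;
    unsigned char *a = calloc(n, 1);
    unsigned char *b = calloc(n, 1);
    int same = (a != NULL && b != NULL);

    if (same) {
        touch_interleaved(a);
        touch_planar(b);
        for (k = 0; k < n; k++) {
            if (a[k] != 1 || b[k] != 1) {
                same = 0;
                break;
            }
        }
    }
    free(a);
    free(b);
    return same;
}
```

Since the work is identical, a 4-5x timing difference between the two functions points at the measurement or the scheduling rather than at the amount of data processed.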

0 Gagan Maur over 14 years ago in reply to balanceren

TI__Expert 8150 points

Balance, I understand. Let me try now on DM6446 and see the performance I get. By the way, I do see the performance you report on DM642.

Regards,
Gagan

0 Gagan Maur over 14 years ago in reply to Gagan Maur

TI__Expert 8150 points

Hello Balance, I finally figured out the issue. Sorry it took some time. I have been working for many years now on C6000 cores that are later generations than the C64x core, so I couldn't immediately catch an issue that now seems obvious. Please let me explain what is going on.

On the older C6000 cores like C64x, the compiler disables interrupts across tightly scheduled (software-pipelined) kernels. This is done because it is not 'safe' to take interrupts in tight kernels: CPU instructions are in different stages of the pipeline and can't be abruptly interrupted. If you are not clear on this behavior, please let me know and I will provide more information.

On the newer cores like C64x+, the core supports the SPLOOP hardware buffer. The SPLOOP buffer enables the newer cores to be interrupted even during tight loops. Please look up SPLOOP in the CPU User's Guide. If you have questions, please let me know and I can help answer them.

The BIOS CLK module measures time by triggering periodic interrupts. The period is configurable and is generally 1 msec. The CLK functions that return CPU time calculate it from the number of interrupts that have been triggered and the current value of the timer counter. Thus, time = number of interrupts * interrupt period + current count of the timer counter register. As you can imagine, for the BIOS CLK APIs to work correctly, the timer interrupts need to be serviced on time.
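The formula above can be written down as a small model. This is only an illustration of the arithmetic, not the DSP/BIOS implementation; the function name `model_time_counts` and the example figure of 75000 timer counts per tick are assumptions chosen for the example.

```c
#include <stdint.h>

/* Hypothetical model of:
 *   time = interrupts serviced * interrupt period + current timer count
 * All values are in timer counts. */
static uint64_t model_time_counts(uint32_t ticks_serviced,
                                  uint32_t counts_per_tick,
                                  uint32_t timer_count)
{
    return (uint64_t)ticks_serviced * counts_per_tick + timer_count;
}
```

With 75000 counts per 1 msec tick, four serviced ticks plus 37500 counts gives 337500 counts, that is 4.5 msec. If only one of those four ticks had been serviced, the same formula would report 112500 counts, or 1.5 msec, even though the same real time had passed.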

So now you can understand that on C64x, if a tight loop runs for a very long time, the BIOS CLK interrupts may be prevented from being serviced. This can result in incorrect timing measurements. The test that you are performing runs for many msec. In the case of the testspeed function, there is some code that is executed between the inner loop and the outer loop.
...
      }
      pimg0+=deltastep;pimg1+=deltastep;pimg2+=deltastep;pimg3+=deltastep;
      pimg4+=deltastep;pimg5+=deltastep;pimg6+=deltastep;pimg7+=deltastep;
    }
...

The tight loop generated by the compiler re-enables interrupts between the inner loop and the outer loop. This gives a window for interrupts to be serviced, so the numbers for testspeed are correct.

In the case of testspeed2, there is no code between the inner loop and the outer loop. This lets the compiler optimize the two loops together. As part of this optimization, the compiler does not enable interrupts at any point during the entire processing, which lasts many msec. Thus, many timer interrupts are lost, and this results in incorrect timing measurements for testspeed2 on DM642. On DM6446, thanks to SPLOOP, the interrupts are not lost and hence the timing measurements are correct.
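To make the lost-interrupt effect concrete, here is a simplified host-side model. It assumes (and this is an assumption of the sketch, not a statement about the exact hardware or BIOS behaviour) that the timer interrupt has a single pending flag, so any number of timer expirations that occur while interrupts are masked collapse into one serviced tick. All names (`clk_model`, `run_masked`, `open_window`, `measured_ticks`) and the numbers used are hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint32_t period;   /* timer counts per tick */
    uint32_t counter;  /* current count within the period */
    int      pending;  /* single-bit interrupt flag */
    uint32_t ticks;    /* ticks that software actually serviced */
} clk_model;

/* Run 'counts' timer counts with interrupts masked. Extra expirations
 * cannot queue up: they all land on the same pending flag. */
static void run_masked(clk_model *m, uint32_t counts)
{
    uint32_t total = m->counter + counts;
    if (total / m->period > 0)
        m->pending = 1;
    m->counter = total % m->period;
}

/* Interrupts briefly enabled: at most one pending tick is serviced. */
static void open_window(clk_model *m)
{
    if (m->pending) {
        m->ticks++;
        m->pending = 0;
    }
}

/* Split 'total' counts of work into 'chunks' masked sections, each
 * followed by an interrupt window; return the ticks software saw. */
static uint32_t measured_ticks(uint32_t period, uint32_t total,
                               uint32_t chunks)
{
    clk_model m = { period, 0, 0, 0 };
    uint32_t i, per = total / chunks;

    for (i = 0; i < chunks; i++) {
        run_masked(&m, per);
        open_window(&m);
    }
    return m.ticks;
}
```

With a period of 100000 counts and 475200 counts of work, 1584 windows (one per outer-loop iteration, as in testspeed) yield 4 ticks, while a single masked section (as described for testspeed2) yields only 1 tick. That matches the shape of the reported 4-5ms versus 1ms results, although the real numbers depend on the device and BIOS configuration.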

There are many solutions. The first and easiest is to use a new function that I wrote:

void testspeed3 (
                    unsigned char * restrict _pbgTotal,
                    int len,
                    int deltastep,
                    int countOuter,
                    int countInner
                 )
{
    unsigned char * restrict pimg0 = _pbgTotal;
    unsigned char * restrict pimg1 = pimg0+len;
    unsigned char * restrict pimg2 = pimg0+len*2;
    unsigned char * restrict pimg3 = pimg0+len*3;
    unsigned char * restrict pimg4 = pimg0+len*4;
    unsigned char * restrict pimg5 = pimg0+len*5;
    unsigned char * restrict pimg6 = pimg0+len*6;
    unsigned char * restrict pimg7 = pimg0+len*7;

    int i,j;

    int addone = 1<<24 | 1<<16 | 1<<8 | 1;

    for(j=0;j<(countOuter);j++)
    {
        for(i = 0;i<countInner;i+=4)
        {
            _amem4(pimg0) = _add4(_amem4_const(pimg0),addone);
            _amem4(pimg1) = _add4(_amem4_const(pimg1),addone);
            _amem4(pimg2) = _add4(_amem4_const(pimg2),addone);
            _amem4(pimg3) = _add4(_amem4_const(pimg3),addone);
            _amem4(pimg4) = _add4(_amem4_const(pimg4),addone);
            _amem4(pimg5) = _add4(_amem4_const(pimg5),addone);
            _amem4(pimg6) = _add4(_amem4_const(pimg6),addone);
            _amem4(pimg7) = _add4(_amem4_const(pimg7),addone);

            pimg0+=4;
            pimg1+=4;pimg2+=4;pimg3+=4;
            pimg4+=4;pimg5+=4;pimg6+=4;pimg7+=4;
        }

        /* code between the inner and outer loop; pass deltastep = 0 for the frame-by-frame layout */
        pimg0+=deltastep;pimg1+=deltastep;pimg2+=deltastep;pimg3+=deltastep;
        pimg4+=deltastep;pimg5+=deltastep;pimg6+=deltastep;pimg7+=deltastep;
    }
}

Then you can make calls like this:

    start = CLK_gethtime ();
    testspeed3 (ptotal, 64, 64*7, 1584, 64);
    end = CLK_gethtime ();
    total1 = end- start;
    LOG_printf (&trace, "CPU Clocks testspeed3 mimicking testspeed 1 = %d", total1);

    start = CLK_gethtime ();
    testspeed3 (ptotal, 352*288, 0, 1584, 64);
    end = CLK_gethtime ();
    total2 = end- start;
    LOG_printf (&trace, "CPU Clocks testspeed3 mimicking testspeed 2 = %d", total2);
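The values logged above are raw high-resolution counts. A minimal sketch of turning such a difference into microseconds follows; the helper names `elapsed_counts` and `counts_to_us` are hypothetical, and the counts-per-millisecond figure is passed in by the caller. On DSP/BIOS that figure would typically come from CLK_countspms(), which is worth checking against the API reference for your BIOS version.

```c
#include <stdint.h>

/* Difference of two 32-bit timestamps. Unsigned subtraction stays
 * correct across a single wraparound of the counter. */
static uint32_t elapsed_counts(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert counts to microseconds, given counts per millisecond.
 * The intermediate product is widened to avoid 32-bit overflow. */
static uint32_t counts_to_us(uint32_t counts, uint32_t counts_per_ms)
{
    return (uint32_t)(((uint64_t)counts * 1000u) / counts_per_ms);
}
```

For example, with an assumed 75000 counts per millisecond, a difference of 337500 counts corresponds to 4500 microseconds.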

I think the above solution will work for you, so I won't list the others.
Using the above method, you can see correct timings on DM642 as well.

I know this issue has taken a while to resolve. Thank you for your patience during this time.

Regards,
Gagan

0 balanceren over 14 years ago in reply to Gagan Maur

Prodigy 240 points

Gagan, I think you are right. Thank you for your work.

I found that interrupts are disabled in some of the tight loops.

Regards,
Balance.
