Starterware/AM3359: L1 and L2 cache performance problem

Samir Mistry

Part Number: AM3359

Tool/software: Starterware

Hi,

I am using TI StarterWare with the ICE board.

I created a bare-metal application project in CCS 7. While debugging my project, it shows 31 clock cycles to execute a single instruction.

I have enabled the MMU and cache using:

int main(void) {
    int success = 0;

    MMUConfigAndEnable();
    CACHEEnable(CACHE_IDCACHE, CACHE_INNER_OUTER);
    CACHEEnable(CACHE_ICACHE, CACHE_INNER_OUTER);

    printf("Platform Initialization !!! \n");

    volatile long a, b, c, count = 0;

    a = 0;
    b = 0;

    for (count = 0; count <= 100; count++)
    {
        a = b + c;
    }
}

I would like to know whether anything else needs to be done to enable the cache and MMU.

over 7 years ago

0 Lalindra Jayatilleke over 7 years ago

TI__Mastermind 30365 points

Samir,
We are looking into this and shall get back to you.

Lali

0 Lalindra Jayatilleke over 7 years ago

TI__Mastermind 30365 points

Samir,

Could you please try to benchmark the below code and let me know your number?

volatile long a[100];

int main(void) {
    int success = 0;

    MMUConfigAndEnable();
    CACHEEnable(CACHE_IDCACHE, CACHE_INNER_OUTER);
    CACHEEnable(CACHE_ICACHE, CACHE_INNER_OUTER);

    printf("Platform Initialization !!! \n");
    printf("size of long  %d\n", (int)sizeof(long));

    long b[100], c[100], count = 0;

    /* The arrays hold 100 elements, so the loops stop at index 99. */
    for (count = 0; count < 100; count++)
    {
        a[count] = count;
    }

    for (count = 0; count < 100; count++)
    {
        b[count] = a[count] + 3;
    }

    for (count = 0; count < 100; count++)
    {
        c[count] = b[count] - 1;
    }

    for (count = 0; count < 100; count++)
    {
        a[count] = b[count] + c[count];
    }
}

I wonder whether, when you declare something as volatile, the compiler invalidates the cache before reading the value from memory.
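For reference, here is a minimal sketch of what volatile changes at the C language level, assuming a standard-conforming compiler. The qualifier obliges the compiler to perform every read and write of the object as written instead of keeping the value in a register; on its own it does not ask for any cache maintenance operation. The function names sum_plain and sum_volatile are illustrative only and are not part of StarterWare.

```c
#include <stddef.h>

/* Plain accumulator: the compiler is free to keep `sum` in a register
 * for the whole loop. */
long sum_plain(const long *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

/* Volatile accumulator: every `sum += ...` must be emitted as a real
 * load and store of `sum`. The result is the same; only the number of
 * memory accesses the compiler must generate differs. */
long sum_volatile(const long *data, size_t n)
{
    volatile long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}
```

Comparing the disassembly of the two functions at the same optimization level is one way to see the extra loads and stores.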

Lali

0 Samir Mistry over 7 years ago in reply to Lalindra Jayatilleke

Prodigy 30 points

Hi Lalindra,
Thanks for the reply, but the issue is the same, with no improvement. An instruction like cmp r3,#0x64 still takes 31 cycles.

0 Lalindra Jayatilleke over 7 years ago in reply to Samir Mistry

TI__Mastermind 30365 points

Samir,

Could you please post the CCS example project you are trying on this thread? Thanks.

Lali

0 Samir Mistry over 7 years ago in reply to Lalindra Jayatilleke

Prodigy 30 points

1884.gpio_test.zip

Hi,
It is the GPIO LED blink project, which I imported from TI StarterWare.

0 Lalindra Jayatilleke over 7 years ago in reply to Samir Mistry

TI__Mastermind 30365 points

Hi Samir,

The 31 cycles observed is a result of putting hardware breakpoints between assembly instructions. This isn’t a good way to benchmark instruction cycle count because of tool and pipeline overhead. Let’s try to illustrate why this isn’t the actual cycle count of an instruction.

Let’s take the below code example that you were trying. I also based this on the gpio example in the Starterware package.

int main()
{
    MMUConfigAndEnable();
    CACHEEnable(CACHE_IDCACHE, CACHE_INNER_OUTER);
    CACHEEnable(CACHE_ICACHE, CACHE_INNER_OUTER);

    int success = 0;
    volatile long a,b,c, count=0;
    a=0;
    b=0;

    for(count=0;count <=100;count++)
    {
        a=b+c;
    }
}

The disassembly for this looks like this:

A breakpoint was put before and after the FOR loop. The resulting count of 1515 cycles covers the 10 instructions (between the highlighted lines) run for 100 iterations, which works out to about 15 cycles per iteration, or roughly 1.5 cycles per instruction.

Now if you increase the FOR loop to 100,000 iterations, the average cycle count per iteration will go down. This is expected: the fixed overhead of the measurement is spread over more iterations as the iteration count grows.
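One plausible way to picture this is a simple model in which the measured total is a fixed one-time cost plus a constant cost per iteration. The sketch below is only that model; the function name and the numbers in the usage note are hypothetical and are not measurements from the board.

```c
/* Average cycles per iteration when a fixed one-time cost is spread
 * over n iterations: (fixed_overhead + per_iter * n) / n.
 * Both inputs are hypothetical model parameters, not measured values. */
double avg_cycles_per_iteration(double fixed_overhead,
                                double per_iter,
                                unsigned long n)
{
    return (fixed_overhead + per_iter * (double)n) / (double)n;
}
```

With an invented fixed cost of 500 cycles and 10 cycles per iteration, the model gives 15 cycles per iteration at n = 100 and 10.005 at n = 100000, so the average approaches the true per-iteration cost as n grows.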

If the cache is disabled altogether in the code, the cycle count for the same 100 iterations of the FOR loop increases greatly, to 78494, which is about 52 times the 1515 cycles measured with the cache enabled. This shows that enabling the cache does have an effect on performance.
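The arithmetic behind these figures can be sketched as follows. The cycle totals and the 10-instruction loop body are the ones quoted in this thread; the macro and function names are made up for illustration. The iteration count is taken as 101 on the assumption that count runs from 0 to 100 inclusive, which the thread rounds to 100.

```c
/* Figures quoted in the thread. */
#define CYCLES_CACHE_ON    1515UL
#define CYCLES_CACHE_OFF   78494UL
#define LOOP_INSTRUCTIONS  10UL
#define LOOP_ITERATIONS    101UL   /* assumption: count = 0..100 inclusive */

/* Average cycles spent on one pass through the loop. */
double cycles_per_iteration(unsigned long total_cycles,
                            unsigned long iterations)
{
    return (double)total_cycles / (double)iterations;
}

/* Average cycles per instruction inside the loop body. */
double cycles_per_instruction(unsigned long total_cycles,
                              unsigned long iterations,
                              unsigned long instructions)
{
    return cycles_per_iteration(total_cycles, iterations)
           / (double)instructions;
}

/* How many times slower the loop runs with the cache disabled. */
double cache_slowdown(unsigned long cycles_off, unsigned long cycles_on)
{
    return (double)cycles_off / (double)cycles_on;
}
```

With the cache enabled this gives 15 cycles per iteration and 1.5 cycles per instruction; the cache-disabled run is roughly 52 times slower.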

Setting hardware breakpoints to profile each instruction isn’t a robust way to check instruction efficiency because of pipeline and emulation tool overhead. Also, declaring the variables volatile will further degrade performance, because every access must then go to memory instead of staying in a register, which is something else to keep in mind.
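As an alternative to breakpoints, one option is to time the whole loop from inside the program with the ARMv7-A performance monitor cycle counter. The sketch below is not taken from StarterWare: the helper names and the USE_PMU_CYCLE_COUNTER switch are hypothetical, the inline assembly assumes a GNU-style compiler running in a privileged mode, and on any other build it falls back to clock() so that it still compiles (the fallback counts clock ticks, not CPU cycles).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical switch: define USE_PMU_CYCLE_COUNTER only when building
 * for a bare-metal ARMv7-A target in a privileged mode. */
#if defined(__arm__) && defined(USE_PMU_CYCLE_COUNTER)
static void counter_init(void)
{
    uint32_t pmcr;
    /* PMCR: set E (bit 0, enable counters) and C (bit 2, reset PMCCNTR). */
    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
    pmcr |= (1u << 0) | (1u << 2);
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(pmcr));
    /* PMCNTENSET: bit 31 enables the cycle counter. */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(1u << 31));
}

static uint32_t counter_read(void)
{
    uint32_t value;
    /* PMCCNTR: current cycle count. */
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
}
#else
/* Host fallback so the sketch builds anywhere: clock() ticks, not cycles. */
static void counter_init(void) { }
static uint32_t counter_read(void) { return (uint32_t)clock(); }
#endif

/* Time `iterations` passes of a = b + c and return the elapsed counter
 * ticks. The final value of `a` is stored through `result`. */
uint32_t time_add_loop(unsigned long iterations, long *result)
{
    volatile long a = 0, b = 1, c = 2;

    assert(result != NULL);
    counter_init();

    uint32_t start = counter_read();
    for (unsigned long i = 0; i < iterations; i++) {
        a = b + c;
    }
    uint32_t end = counter_read();

    *result = a;
    return end - start;   /* unsigned subtraction tolerates one wrap */
}
```

Dividing the returned value by the iteration count gives an average per-iteration figure, and because the core is never halted inside the loop, the debugger does not disturb the measurement between instructions.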

Hope this clarifies.

Lali
