Starterware/AM3359: L1 and L2 cache performance problem

Part Number: AM3359

Tool/software: Starterware

Hi,

I am using TI StarterWare with the ICE board.

I created a bare-metal application project in CCS7. While debugging my project, it shows 31 clock cycles to execute a single instruction.

I have enabled the MMU and cache using:

int main(void) {

    int success = 0;

    MMUConfigAndEnable();
    CACHEEnable(CACHE_IDCACHE, CACHE_INNER_OUTER);
    CACHEEnable(CACHE_ICACHE, CACHE_INNER_OUTER);

    printf("Platform Initialization !!! \n");

    volatile long a, b, c, count = 0;

    a = 0;
    b = 0;

    for (count = 0; count <= 100; count++)
        a = b + c;

}

I want to know whether anything else needs to be done to enable the cache and MMU.

  • Samir,
    We are looking into this and shall get back to you.

    Lali
  • Samir,

    Could you please try to benchmark the below code and let me know your number?

    volatile long a[100];

    int main(void) {

        int success = 0;

        MMUConfigAndEnable();
        CACHEEnable(CACHE_IDCACHE, CACHE_INNER_OUTER);
        CACHEEnable(CACHE_ICACHE, CACHE_INNER_OUTER);

        printf("Platform Initialization !!! \n");
        printf("size of long  %d\n", (int)sizeof(long));

        long b[100], c[100], count = 0;

        /* Each array holds 100 elements, so the loops stop at index 99. */
        for (count = 0; count < 100; count++)
        {
            a[count] = count;
        }

        for (count = 0; count < 100; count++)
        {
            b[count] = a[count] + 3;
        }

        for (count = 0; count < 100; count++)
        {
            c[count] = b[count] - 1;
        }

        for (count = 0; count < 100; count++)
        {
            a[count] = b[count] + c[count];
        }

    }

    I wonder whether, when you define something as volatile, the compiler invalidates the cache before reading the value from memory.

    Lali
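
On the volatile question above: as far as the C language is concerned, volatile only obliges the compiler to perform every read and write of the object exactly as written; it does not make the compiler emit cache maintenance operations. The sketch below is a host-runnable illustration, not StarterWare code, and the function names are made up for this example. Both functions return the same value; the difference is that the volatile version must load b, c and count and store a on every iteration, while the plain version may keep everything in registers or fold the loop away.

```c
#include <assert.h>

/* Hypothetical helper: plain locals, so the compiler is free to keep
 * a, b and c in registers or to optimise the loop away entirely. */
long sum_plain(long b, long c, long iterations)
{
    long a = 0;
    long count;

    assert(iterations >= 0);
    for (count = 0; count < iterations; count++) {
        a = b + c;
    }
    return a;
}

/* Hypothetical helper: volatile locals, so every iteration performs
 * real loads of b, c and count and real stores to a and count. */
long sum_volatile(long b_in, long c_in, long iterations)
{
    volatile long a = 0, b = b_in, c = c_in, count;

    assert(iterations >= 0);
    for (count = 0; count < iterations; count++) {
        a = b + c;
    }
    return a;
}
```

Comparing the disassembly of the two functions at the same optimisation level should show the extra load and store instructions in the volatile version and no cache invalidate instructions in either.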

  • Hi Lalindra,
    Thanks for the reply, but the issue is the same, with no improvement. An instruction such as cmp r3,#0x64 takes 31 cycles.
  • Samir,

    Could you please post on this thread the CCS example project you are trying? Thanks.

    Lali
  • 1884.gpio_test.zip

    Hi,
    It is the GPIO LED blink project, which I imported from TI StarterWare.

  • Hi Samir,

    The 31 cycles observed is a result of putting hardware breakpoints between assembly instructions. This isn’t a good way to benchmark instruction cycle count, due to tool and pipeline overhead. Let’s try to illustrate why this isn’t the actual cycle count of an instruction.

    Let’s take the below code example that you were trying. I also based this on the gpio example in the Starterware package.

    int main()
    {
        MMUConfigAndEnable();
        CACHEEnable(CACHE_IDCACHE, CACHE_INNER_OUTER);
        CACHEEnable(CACHE_ICACHE, CACHE_INNER_OUTER);
    
        int success = 0;
        volatile long a,b,c, count=0;
        a=0;
        b=0;
    
        for(count=0;count <=100;count++)
        {
            a=b+c;
        }
    }

    The disassembly for this looks like this:

    A breakpoint was placed before and after the for loop. The 1515-cycle count therefore covers the 10 instructions between the highlighted lines, run for 100 iterations, which works out to about 15 cycles per iteration for all 10 instructions together, well below 31 cycles per instruction.

    Now, if you increase the for loop to 100,000 iterations, the average cycle count per iteration goes down further. This is expected as the number of iterations grows, because the fixed tool and pipeline overhead is spread over more iterations.

    If the Cache is DISABLED altogether in the code, then the cycle count to run the same 100 iterations of the FOR loop will increase greatly to 78494. This shows that indeed enabling the cache has an effect on performance.
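
The arithmetic behind these figures can be written down as a small sketch. The helper names below are made up for illustration; the inputs are the numbers quoted in this thread (1515 cycles with the cache enabled, 78494 with it disabled, 100 iterations, 10 instructions per iteration).

```c
#include <assert.h>

/* Average cycles spent per loop iteration. */
double cycles_per_iteration(unsigned long total_cycles,
                            unsigned long iterations)
{
    assert(iterations > 0);
    return (double)total_cycles / (double)iterations;
}

/* Average cycles per instruction inside the loop body. */
double cycles_per_instruction(unsigned long total_cycles,
                              unsigned long iterations,
                              unsigned long instructions_per_iteration)
{
    assert(instructions_per_iteration > 0);
    return cycles_per_iteration(total_cycles, iterations)
           / (double)instructions_per_iteration;
}

/* How many times slower the run with the cache disabled is. */
double slowdown_factor(unsigned long uncached_cycles,
                       unsigned long cached_cycles)
{
    assert(cached_cycles > 0);
    return (double)uncached_cycles / (double)cached_cycles;
}
```

With the thread's figures, cycles_per_iteration(1515, 100) is about 15.15, cycles_per_instruction(1515, 100, 10) is about 1.5, and slowdown_factor(78494, 1515) is roughly 52, so the same loop runs about 52 times slower with the cache disabled.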

    Setting hardware breakpoints to profile each instruction isn’t a robust way to check instruction efficiency due to pipeline and emulation tool overhead. Also, using volatile for the variables will further degrade performance, which is something else to keep in mind.

    Hope this clarifies.

    Lali
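
One alternative to breakpoint-based timing on a Cortex-A8 is to read the core's PMU cycle counter once before and once after the whole loop. The following is a minimal sketch under stated assumptions, not StarterWare code: the function names and the USE_ARM_PMU build flag are made up for illustration, it assumes GCC-style inline assembly, and the CP15 accesses assume the code runs in a privileged mode on the target. When the flag is not defined, it falls back to clock() so the sketch still builds and runs off-target.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* USE_ARM_PMU is a hypothetical build flag: define it only for a
 * bare-metal ARMv7-A target running in a privileged mode. */
#if defined(USE_ARM_PMU) && defined(__arm__)

static void cycle_counter_init(void)
{
    uint32_t pmcr;

    /* PMCR: set E (bit 0, enable) and C (bit 2, reset cycle counter),
     * clear D (bit 3) so the counter ticks on every CPU cycle. */
    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
    pmcr |= (1u << 0) | (1u << 2);
    pmcr &= ~(1u << 3);
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(pmcr));

    /* PMCNTENSET: bit 31 enables the cycle counter. */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(1u << 31));
}

static uint32_t cycle_counter_read(void)
{
    uint32_t cycles;

    /* PMCCNTR: current value of the cycle counter. */
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(cycles));
    return cycles;
}

#else /* host fallback: clock ticks, not CPU cycles */

static void cycle_counter_init(void)
{
}

static uint32_t cycle_counter_read(void)
{
    return (uint32_t)clock();
}

#endif

/* Times `iterations` passes of the a = b + c loop from this thread.
 * Returns the elapsed counter ticks and stores the last sum in *result. */
uint32_t measure_loop(long iterations, long *result)
{
    volatile long a = 0, b = 2, c = 3, count;
    uint32_t start, end;

    assert(result != NULL);

    cycle_counter_init();
    start = cycle_counter_read();
    for (count = 0; count < iterations; count++) {
        a = b + c;
    }
    end = cycle_counter_read();

    *result = a;
    return end - start; /* unsigned subtraction tolerates a single wrap */
}
```

Dividing the returned value by the iteration count gives an average cost per iteration. Note that the loop in this thread uses count <= 100 starting from 0, so it actually runs 101 times, and that the measured value also includes the cost of the two counter reads, which can be estimated by calling measure_loop with zero iterations.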