Understanding of Cache

Hello All,

I was profiling on the TMS320C6670 processor. I first configured L1D entirely as SRAM and measured the number of cycles it takes to write to and read from a memory location. Next, I configured 4K of L1D as cache and measured the cycles again. The cycle count dropped when part of L1D was configured as cache.

I would like to know why this happened, since my data was in L1D itself in both tests.

Could anybody please guide me?

Regards,

Denzil.

  • Hello Atsushi,

    I tried the simulation on the simulator version of CCSv5.2.



    Regards,


    Denzil.


  • Denzil,

    Thank you for the information.  I will try to reproduce it myself tomorrow.  (Some experts in the U.S. time zone may resolve it before then. :)

    Regards,
    Atsushi

  • Denzil,

    I will upload a template project to evaluate L1D SRAM configuration for C6670 which was generated by CCS 5.2.0.

    Could you let me know exactly which performance figures you want to compare?  For example,

    • "Memory-mapped L1D" vs. "memory-mapped L2 with L1D cache enabled"
    • Memory-mapped L1D with L1D cache "enabled" vs. "disabled"

    If you can provide a simplified project and code, I will evaluate it on my side.

    I will attach two files.

    • cache_L1D.zip: A template project for memory-mapped L1D evaluation
    • cache_L1D_4k.zip: An RTSC Platform for the project (simply assigned (32 - 4 = 28) KB for memory mapped L1D)

    After you unzip cache_L1D_4k.zip, if you see, for example (on Windows XP),

    C:\Documents and Settings\Administrator\myRepository\packages\cache_L1D_4k\package.xdc

    the following image shows how to choose the RTSC platform.



    Regards,
    Atsushi

    3683.cache_L1D.zip
    4643.cache_L1D_4k.zip

  • Hello Atsushi San,

    I was just enabling and disabling the cache and measuring the performance, to understand what really happens when a cache is enabled.

    The code I used was generic: I ran a for loop 1024 times and wrote to an address in L1D, with the cache disabled and then enabled, configuring the L1D cache with different sizes.

    I used an RTSC project for this.

    Regards,

    Denzil.

  • Denzil,

    OK, thank you for your explanation.

    I understood that you were turning on/off L1D cache and comparing the performances.  But once again, could you please let me know the following?

    Which memory area are you evaluating?  With the default RTSC configuration, the C6000 linker (i.e. the CCS software development environment) tries to place all variables (and arrays) in L2 memory (i.e. 0x00800000 - 0x008fffff).  If L1D cache is enabled (4K, 8K, 16K or 32K), reads from (and writes to) L2 memory are generally cached by L1D, so those accesses are fast.  If the whole L1D area is memory-mapped (0K cache), accesses to L2 are not cached and performance will be lower than in the former case.
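
    To make the size arithmetic concrete, here is a minimal C sketch (not TI code) of how much L1D stays memory-mapped for each cache size, matching the 32 - 4 = 28 KB case of the RTSC platform above.  The function names are hypothetical, and the mode-number encoding (0 to 4 selecting 0K, 4K, 8K, 16K, 32K) is an assumption for illustration; please check it against the L1DCFG description in the CorePac user guide before relying on it.

```c
#include <assert.h>
#include <stdint.h>

/* Total L1D size per core, taken from the "32 - 4 = 28 KB" split in this thread. */
#define L1D_TOTAL_KB 32u

/* ASSUMPTION: mode values 0..4 select 0, 4, 8, 16, 32 KB of cache;
 * larger values are treated as the maximum (32 KB). */
unsigned l1d_cache_kb(uint32_t l1dcfg)
{
    static const unsigned kb[] = { 0u, 4u, 8u, 16u, 32u };
    uint32_t mode = l1dcfg & 0x7u;   /* low three bits hold the mode */
    return (mode < 5u) ? kb[mode] : 32u;
}

/* Whatever is not used as cache remains available as memory-mapped SRAM. */
unsigned l1d_sram_kb(uint32_t l1dcfg)
{
    unsigned cache = l1d_cache_kb(l1dcfg);
    assert(cache <= L1D_TOTAL_KB);
    return L1D_TOTAL_KB - cache;
}
```

    Under that assumed encoding, l1d_sram_kb(1) gives 28 (the 4K cache case) and l1d_sram_kb(0) gives 32 (no cache, all of L1D memory-mapped).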

    Does this answer your question?

    The following documentation may be a good starting point for understanding the C66x cache.  (One difference between C64x and C66x is that the latter can memory-map part of L1D.  In the C64x era, L1D was dedicated to cache and we could not use it as regular memory.)

    • http://www.ti.com/lit/an/spra756/spra756.pdf

    Chapter 3 of the following document describes the C66x L1D cache controller in detail.

    • http://www.ti.com/lit/ug/sprugw0b/sprugw0b.pdf

    In addition, the following presentation covers some advanced information and may be helpful.

    • http://learningmedia.ti.com/public/hpmp/KeyStone/02_Memory/index.html

    Please don't hesitate to ask us when you have additional questions.

    Regards,
    Atsushi

  • Hello Atsushi San,

    I was writing the data to the CorePac0 L1D memory, i.e. initializing a pointer to an L1D memory address and writing through it directly.



    Regards,


    Denzil

  • Denzil,

    I understood you were writing and reading L1D directly.

    May I ask you to provide the actual code?  I think we can resolve the problem faster if we see the actual code.

    Regards,

    Atsushi

  • Hello Atsushi San,

    I used the program below.

    #include <stdio.h>
    #include <stdlib.h>

    #include "cycle_measure.h"

    int t1, t2;
    int tsc_overhead, Cold_cycles;

    int main()
    {
        int *p, *q;
        int i;

        p = NULL;
        q = NULL;

        /* Configure the L1D cache through the register at 0x01840040. */
        q = (int *)0x01840040;
        *q = 0x00000003;

        /* Point p at the CorePac0 L1D memory. */
        p = (int *)0x10f00000;

        t1 = ranClock();
        for (i = 0; i < 1024; i++)
        {
            p++;
            *p = *p + i;
        }
        t2 = ranClock();

        /* tsc_overhead is a global and is never assigned here, so it is 0. */
        Cold_cycles = t2 - t1 - tsc_overhead;
        printf("\n No of Cycles=%d", Cold_cycles);
        return 0;
    }

    Regards,

    Denzil.

  • Denzil,

    Thank you for providing the actual code.  I have reproduced it and understood (1) what you intended and (2) why the cycle count dropped when part of L1D was assigned as cache.

    First, please let me confirm what you intended.  (Please correct me if my understanding is incorrect.)

    • The array (pointed to by the variable 'p') that is read and written is located in L1D.
    • You want to compare read/write access cycles between "L1D cache enabled" and "L1D cache disabled."
    • We understand that the data in the array (in L1D) should NOT be cached by L1D.

    You are correct that data in L1D memory is NOT cached by L1D cache.

    So why are the cycle counts different between the two test cases?  The reason is that, even though the array is placed in L1D, other working variables (such as the variable 'i') are still placed in L2 (the default memory area used by the C6000 linker).

    To see this, we can generate the intermediate assembly source by passing the --keep_asm option to the compiler.  The following assembly code corresponds to the loop body.  (The line numbers in the "; |nn|" comments at the right may vary.)

        .dwpsn    file "../main.c",line 31,column 3,is_stmt,isa 0
               LDW     .D2T2   *+SP(12),B4       ; |31|
               NOP             4
               ADD     .L2     4,B4,B4           ; |31|
               STW     .D2T2   B4,*+SP(12)       ; |31|
        .dwpsn    file "../main.c",line 32,column 3,is_stmt,isa 0

               MV      .L2     B4,B5
    ||         LDW     .D2T2   *+SP(20),B4       ; |32|

               LDW     .D2T2   *B5,B6            ; |32|
               NOP             4
               ADD     .L2     B4,B6,B4          ; |32|
               STW     .D2T2   B4,*B5            ; |32|

    Here, register B5 corresponds to the variable 'p' and holds an address in the L1D area.  However, SP (the stack pointer) still points to an address in L2, so the loads and stores that go through SP are affected by whether L1D cache is available.

    It is also helpful to trace the code with "Assembly Step Into" in the debugger.
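
    Separately, since the whole comparison depends on the cycle measurement, here is a hedged sketch of measuring the timer-read overhead instead of leaving it at zero.  It is plain C with a pluggable clock; all names (ts_reader_t, timer_overhead, measure_cycles, fake_now, fake_work) are hypothetical, and fake_now is only a stand-in so the logic can be checked on a host PC.  On the target you would pass your own counter-reading function instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*ts_reader_t)(void);   /* returns the current cycle count */
typedef void (*work_fn_t)(void *ctx);    /* the code being measured */

/* Cost of the measurement itself: two back-to-back counter reads. */
uint32_t timer_overhead(ts_reader_t now)
{
    assert(now != NULL);
    uint32_t t1 = now();
    uint32_t t2 = now();
    return t2 - t1;   /* unsigned subtraction tolerates one counter wrap */
}

/* Cycles spent in work(), with the read overhead subtracted. */
uint32_t measure_cycles(ts_reader_t now, work_fn_t work, void *ctx)
{
    uint32_t overhead = timer_overhead(now);
    uint32_t t1 = now();
    work(ctx);
    uint32_t t2 = now();
    uint32_t elapsed = t2 - t1;
    return (elapsed > overhead) ? elapsed - overhead : 0u;
}

/* Hypothetical host-side stand-ins: every clock read "costs" 2 cycles
 * and the workload "costs" 10 cycles per element. */
static uint32_t fake_cycles;

uint32_t fake_now(void)
{
    fake_cycles += 2u;
    return fake_cycles;
}

void fake_work(void *ctx)
{
    fake_cycles += 10u * *(const unsigned *)ctx;
}
```

    With the stand-in clock, measuring fake_work over 3 elements reports 30 cycles, i.e. the workload cost without the 2-cycle read overhead.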

    I hope it helps.

    Regards,
    Atsushi

  • Hello Atsushi San,

    Thank you very much for the explanation. But if the stack pointer points to an address in L2, how is the cycle count affected by the L1D cache?

    I do not see how.

    Please guide me.

    Regards,

    Denzil.

  • Denzil,

    First, please remember that local variables in C functions are usually stored in the stack area.  (With some optimization options, the compiler tries to avoid using the stack as much as possible.)  The stack area is located in L2 memory by default, and the stack pointer always points to an address in the stack area.

    When L1D cache is disabled, every core access to the stack area in L2 memory incurs some access overhead, because L2 memory is "further" from the core than L1D.

    On the other hand, when L1D cache is enabled, once the core reads data (i.e. a local variable) in the stack area, the data is copied (cached) into L1D.  From then on, whenever the core needs to update the data at the same address, it simply updates the cached copy in L1D without accessing L2.  This greatly reduces the access overhead.  As a result, the CPU cycle count is reduced.
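
    The effect can be illustrated with a toy model in C.  This is only a sketch under stated assumptions, not a model of the real C66x cache controller: the latencies (1 cycle for L1D, 5 cycles for L2), the 64-byte line size, the direct-mapped organization, the stack address 0x00810000, and the rule that every miss allocates a line are all made up for illustration.  It replays the five memory accesses per iteration seen in the loop listing earlier in this thread (three through SP, two through the array pointer).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u   /* ASSUMPTION: cache line size */
#define NUM_LINES  64u   /* toy 4 KB direct-mapped cache */
#define HIT_CYCLES 1u    /* hypothetical cost of an L1D access */
#define L2_CYCLES  5u    /* hypothetical cost of an uncached L2 access */

typedef struct {
    int      enabled;
    uint8_t  valid[NUM_LINES];
    uint32_t tag[NUM_LINES];
} toy_cache_t;

/* Memory-mapped L1D range: base from the test code, 32 KB length assumed. */
static int in_l1d_sram(uint32_t addr)
{
    return addr >= 0x10f00000u && addr < 0x10f08000u;
}

/* Cost of one 32-bit load or store at addr. */
unsigned access_cost(toy_cache_t *c, uint32_t addr)
{
    if (in_l1d_sram(addr))
        return HIT_CYCLES;            /* L1D memory is never cached, always fast */
    if (!c->enabled)
        return L2_CYCLES;             /* no cache: every access goes to L2 */
    uint32_t line = addr / LINE_BYTES;
    uint32_t idx  = line % NUM_LINES;
    if (c->valid[idx] && c->tag[idx] == line)
        return HIT_CYCLES;            /* hit: served from the cached copy */
    c->valid[idx] = 1;                /* miss: fetch from L2 and keep the line */
    c->tag[idx]   = line;
    return L2_CYCLES;
}

/* Replay the loop's memory traffic and return the modeled cycle total. */
unsigned loop_cost(int cache_enabled, unsigned iterations)
{
    toy_cache_t c;
    uint32_t sp = 0x00810000u;        /* hypothetical stack address in L2 */
    uint32_t p  = 0x10f00000u;        /* array base in L1D */
    unsigned total = 0, i;

    assert(iterations <= 1024u);      /* keeps p inside the assumed L1D range */
    memset(&c, 0, sizeof c);
    c.enabled = cache_enabled;

    for (i = 0; i < iterations; i++) {
        p += 4;
        total += access_cost(&c, sp + 12);  /* LDW *+SP(12): load p  */
        total += access_cost(&c, sp + 12);  /* STW *+SP(12): store p */
        total += access_cost(&c, sp + 20);  /* LDW *+SP(20): load i  */
        total += access_cost(&c, p);        /* LDW *B5: read *p      */
        total += access_cost(&c, p);        /* STW *B5: write *p     */
    }
    return total;
}
```

    With these made-up numbers, loop_cost(0, 1024) comes to 17408 modeled cycles and loop_cost(1, 1024) to 5124: the array accesses cost the same in both cases, and the whole difference comes from the stack accesses, which miss once and then hit in the cache.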

    Does this answer your question?

    Regards,
    Atsushi

  • When cache is turned off, every access through the stack pointer has to go to L2SRAM, which slows your loop down, even though accesses to L1D memory itself perform the same either way.

    My tests have shown that when the stack is located in L2SRAM, at least 4KB (better 8KB) of L1D cache is a must.

    lg

  • Hello Atsushi San,

    Thank you very much, it is very clear now.

    Regards,

    Denzil.

  • Dear Atsushi,

    That explanation makes perfect sense. However, shouldn't stack accesses show up as memory reads and writes in Trace? They don't seem to, which is what threw me off for some time on this issue. Stack accesses seem to show up as "Load Address", similar to code access. That led me to think that stack accesses would go through L1P cache, not L1D cache, but that's not the case. Can you confirm my understanding?

    Thanks,
    Manu 

  • Hi Manu,

    I had almost forgotten what we discussed in this thread.  Sorry.

    Simply put, a stack access is a data access, not a program fetch.  What exact problem do you have?  Could you please elaborate?

    Regards,
    Atsushi

  • Hi Atsushi,

    My question is this: if stack accesses are data accesses, why are they not shown in Memory Read/Memory Write in the output of Trace when using a hardware target?

    Manu

  • Hi Manu,

    It should be in the output of Trace.  Did I say it would not be?  (I'm afraid I may have misunderstood something in the previous post.)

    Regards,
    Atsushi