
L1D Stalls


Hi,

I've written a C implementation of an algorithm on the C6678. The inputs and outputs are allocated in L1D (L1D cache is set to 0K). However, I'm seeing 9000 cycles of L1D stalls, which represents 25% of the total cycle count.

Are stalls expected when using L1D SRAM, or am I missing something?

Compiler options: optimization level 3; optimize for speed 5.
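
For reference, the buffers are placed roughly like this (a sketch; the section name and linker line follow the pattern from my project):

    #define N 1024                       /* 1024 doubles = 8 KB per buffer */

    #pragma DATA_SECTION(inp, ".lData")  /* in the .cmd file: .lData: load >> L1DSRAM */
    #pragma DATA_ALIGN(inp, 8)
    double inp[N];

    #pragma DATA_SECTION(outp, ".lData")
    #pragma DATA_ALIGN(outp, 8)
    double outp[N];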

What could be the problem? Any help would be appreciated.

Thanks !

  • Nios Ensa,

    Welcome to the TI E2E forum. I hope you will find many good answers here and in the TI.com documents and in the TI Wiki Pages.

    In what context and with what tool are you determining that you have 9000 L1D stalls?

    From which core are you running this test? What address are you using for the L1D buffer?

    Regards,
    RandyP

  • Thanks RandyP for your reply,

I ran the test on core 0; the same results are obtained with the other cores. There are two buffers: "inp" of 8 KB and "outp" of 8 KB.

    "inp" starting at address 0x00F0000 and "outp" starting at address 0x00F02000 ..

I measured the L1D stalls on CCStudio v5.1 / CGT 7.3.2, using the "Clock" in the "Run" menu. The measured event is CPU.stall.mem.L1D. I made sure that no code other than the function of interest was included in the measurement.

    Thanks

  • What code are you running?

    Most likely, you are not getting an accurate answer from the Profiler Clock feature. Please also specify exactly what your clock setup is.
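
If you want to cross-check the Profiler Clock, you can read the CPU's free-running time-stamp counter directly from C. A minimal sketch (TSCL is declared by the compiler's c6x.h; the kernel name here is just a placeholder):

    #include <c6x.h>                     /* declares the TSCL time-stamp register */

    extern void kernel_under_test(double *src, int lgth);  /* placeholder name */

    unsigned int bench(double *buf, int n)
    {
        unsigned int t0, t1;
        TSCL = 0;                        /* any write enables the counter; it then free-runs */
        t0 = TSCL;
        kernel_under_test(buf, n);
        t1 = TSCL;
        return t1 - t0;                  /* raw CPU cycles, including the call overhead */
    }

That gives you total cycles; the stall-event counts still have to come from the profiler, but at least the cycle totals can be confirmed independently.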

But if your readings are correct, there could be reasons for stalls other than normal access wait states. I will not be able to teach you about the internal architecture of the device beyond pointing you to the CorePac and Cache documentation, but I am happy to try to help you with your application.

    In this case, are you benchmarking the performance of the device, or is your application one in which the majority of the data will reside in the 32KB of L1D?

    I ask this because it is very rare that the best performance setting for an application is to have L1D set to 0K cache. We use that internally to get top performance numbers, but we try to make that clear in our benchmark results since it may not be practical for our clients' applications.
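
If you do end up switching the L1D configuration from software rather than from the platform GEL file, the KeyStone CSL has cache-setup calls along these lines. This is a sketch from memory, so verify the function and enum names against csl_cacheAux.h in your PDK:

    #include <ti/csl/csl_cacheAux.h>

    void enable_l1d_cache(void)
    {
        /* Hand all 32KB of L1D back to the cache controller
           (sketch: CACHE_setL1DSize/CACHE_L1_32KCACHE per my memory of the CSL). */
        CACHE_setL1DSize(CACHE_L1_32KCACHE);
    }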

    Regards,
    RandyP

  • Thanks RandyP,

I've done several experiments, including assembly tests such as this one:

   SPLOOP 1
   LDDW  .D1  *A6++[16],A9:A8   ; 16-doubleword (128-byte) stride
|| LDDW  .D2  *B6++[16],B9:B8
   SPKERNEL 0,0
   ; fetches one element from each L1D line

Even though that code operates entirely on L1D data, it takes twice as many cycles as expected. Can you confirm that if the core accesses, in a single cycle, two data items from different lines of L1D, there will be a stall of one cycle?

The following code produces no L1D stalls:

   SPLOOP 1
   LDDW  .D1  *A6++[2],A9:A8    ; 2-doubleword (16-byte) stride
|| LDDW  .D2  *B6++[2],B9:B8
   SPKERNEL 0,0
   ; fetches two adjacent elements each cycle

That explains the additional latency in my assembly version of the algorithm I wanted to implement; it must be the same reason for the 9000 L1D stalls in the C version.

The algorithm of interest is a modified version of an FFT. We have decided to keep the 32K cache enabled permanently; however, I first experiment with my functions in a closer memory.

Processing two butterflies at once seems, then, to be the only way to avoid this.

However, I couldn't find this information stated clearly in the user guides. Can you confirm it?

    Best Regards

  • Nios Ensa,

If you want to understand the public portions of the L1D memory architecture, the CorePac User's Guide Chapter 3 will help you a lot. In particular, look at the discussions about banks in the L1D memory. Bank stalls may be what you are running into. This is also discussed in the OP6000 C6000 Optimization Workshop, material for which you can find on the TI Wiki Pages.
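
As a rough illustration, assuming the 8-bank-by-32-bit L1D organization described in that chapter (the bank() formula below is my shorthand, not something from the manual), here is why two parallel LDDWs whose pointers sit, say, 64 bytes apart conflict every cycle, while an 8-byte offset would not:

    #include <stdio.h>

    /* With 8 banks of 32 bits each, the bank index comes from address
       bits [4:2].  A 64-bit LDDW touches two adjacent banks. */
    static unsigned bank(unsigned addr) { return (addr >> 2) & 7u; }

    int main(void)
    {
        unsigned a = 0x00F00000;   /* .D1 pointer                 */
        unsigned b = a + 64;       /* .D2 pointer, 64 bytes ahead */
        int i;

        for (i = 0; i < 4; i++) {
            printf("iter %d: .D1 banks %u,%u  .D2 banks %u,%u\n",
                   i, bank(a), bank(a + 4), bank(b), bank(b + 4));
            a += 128;              /* both pointers stride 128 bytes */
            b += 128;
        }
        /* Both units hit banks 0 and 1 on every iteration: one bank-conflict
           stall per cycle.  An 8-byte offset would put the second LDDW in
           banks 2 and 3, so there would be no conflict. */
        return 0;
    }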

    Are you running the loop 9000 times? What are the addresses in the two pointer registers?

    Regards,
    RandyP

     

If you need more help, please reply back. If this answers the question, please click Verify Answer below.

For example, as a test, take this assembly code:

        .global _miss_pipe
_miss_pipe:

; A4 = pointer
; B4 = number of iterations
; B6 = second pointer

   MVC    .S2   B4,ILC            ; SPLOOP iteration count
   MV     .S2X  A4,B6
   ADDAD  .D2   B6,8,B6           ; B6 = A4 + 8 doublewords (64 bytes)
   NOP    1

   SPLOOP 1
   LDDW   .D1   *A4++[16],A9:A8   ; both pointers advance 128 bytes per cycle,
|| LDDW   .D2   *B6++[16],B9:B8   ; staying 64 bytes apart
   SPKERNEL 0,0

   B      .S2   B3
   NOP    5
which I call from C like this:
void _miss_pipe(double *src, int lgth);

#define N 1024

#pragma DATA_SECTION(inp, ".lData")  /* in the .cmd file: .lData: load >> L1DSRAM */
#pragma DATA_ALIGN(inp, 8)           /* pragmas must precede the declaration */
double inp[N];

/* inside the calling function: */
int j;
for (j = 0; j < 1000; j++)
    _miss_pipe(inp, N/16);           /* N/16 = 64 loop iterations per call */
    
That takes 152,018 cycles, with a profiling result of 64,000 L1D stalls (1000 calls × 64 iterations each, i.e., one stall per loop iteration).
    
    
    Regards
I get it now: it is due to accesses to the same L1D bank. Bank selection by the low-order address bits is unfamiliar to me, though.

    Thanks RandyP for your help

    Regards