How does the cache miss pipeline work?

Hi All

      Table 1-5 of the TMS320C66x DSP Cache User Guide gives performance data for L1D cache misses.

      I wrote a program to test the miss pipeline.

      First, I put a 4 KB data buffer in DDR3 (unsigned char buffer[0x1000];) and ran this loop:

       for(i = 1; i < 0x1000; i++)

             buffer[i] = buffer[i-1] + i & 0xfe;

      Second, before this for loop, I call the touch function to get parallel read misses.

           touch(buffer, 0x1000);

            for(i = 1; i < 0x1000; i++)

               buffer[i] = buffer[i-1] + i & 0xfe;
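For readers without the attachment, here is a minimal sketch of what the two pieces do. It is not the touch routine from the thread: `touch_sketch` and `dependent_loop` are hypothetical names, and the 64-byte line size is an assumption (the L1D line size on C66x). The real routine is typically hand-optimized so that consecutive misses overlap in the miss pipeline, which a plain C loop does not guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed cache line size in bytes (64 for C66x L1D). */
#define LINE_SIZE 64u

/* Hypothetical stand-in for touch(): read one byte from every cache line
 * in [buf, buf + size) so that each line gets allocated in the cache.
 * Returns the number of lines that were read. */
size_t touch_sketch(const volatile unsigned char *buf, size_t size)
{
    size_t lines = 0;
    unsigned char sink = 0;

    assert(buf != NULL);
    for (size_t off = 0; off < size; off += LINE_SIZE) {
        sink ^= buf[off];   /* volatile read, cannot be optimized away */
        lines++;
    }
    (void)sink;
    return lines;
}

/* The loop from the post. '+' binds tighter than '&' in C, so the original
 * expression already means (buffer[i-1] + i) & 0xfe; the parentheses here
 * only make that explicit. */
void dependent_loop(unsigned char *buffer, size_t n)
{
    assert(buffer != NULL);
    for (size_t i = 1; i < n; i++)
        buffer[i] = (unsigned char)((buffer[i - 1] + i) & 0xfe);
}
```

Note that every iteration of the loop depends on the previous element, so in an unoptimized build the arithmetic and loop overhead can dominate the cycle count regardless of how the cache behaves.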

      The second version takes more cycles than the first.

      How can I use the cache miss pipeline to reduce miss stalls?

      Thank you!

  

    

     

  • Hi Eric,

    did you make DDR3 cacheable?

    You could e.g. use a CSL function for that:

    CACHE_enableCaching(128);

    Kind regards,

    one and zero

  • Hi one and zero

       Thank you for your reply.

        The MAR bits for the whole DDR3 range are all set to 1. In my main function, the first step is to make DDR3 cacheable.

        I configured L1P, L1D, and L2 entirely as cache.

     

     3487.touchTest.rar

     Eric

  • Hi Eric,

    I ran your test on a C6678 EVM and I see a cycle count decrease. The decrease is only about 5% (including the touch loop cycles).

    You can improve this number in two ways:

    1. Put only the buffer you're measuring into DDR3 and the rest of the sections in L2 (configure only half of L2 as cache)

    2. Enable optimization. Use the "release" settings (you can also increase optimization to -o3, no symbolic debug)

    This will reduce the non-memory-access portion of the cycle count in your test case and therefore give a better speedup ratio ...

    I hope this helps ...

    Kind regards,

    one and zero

  • Hi one and zero

     I ran the test program on my EVM board, and timeBuf[1] + timeBuf[2] is larger than timeBuf[0]. Your test shows a decrease of about 5%. Neither result reaches the performance described in the cache user guide.

      

  • Hi Eric,

    as I already described in my previous post, your test case does not only measure the load/store performance of the buffer you're accessing.

    If you apply my proposed changes you get a much better speedup, but you still include cycles not related to load and store operations. That means your comparison includes a constant number of cycles that have nothing to do with caching effects.

    Kind regards,

    one and zero

     

  • Hi one and zero

        I ran the program as you suggested, but timeBuf[1] + timeBuf[2] is still larger than timeBuf[0].

        These are the three cases I tested.

        case 1:  L2 0K SRAM/512K cache       all sections in DDR3                              -g (no optimization)

        case 2:  L2 256K SRAM/256K cache     all sections in L2 except the buffer section      -g (no optimization)

        case 3:  L2 256K SRAM/256K cache     all sections in L2 except the buffer section      -O3 (no symbolic debug)

                     timeBuf[0]     timeBuf[1]     timeBuf[2]     timeBuf[1] + timeBuf[2]

         case 1:     99056          1908           98374          100282
         case 2:     98871          1713           98304          100017
         case 3:     29498          1696           28669          30365

     

     Eric

  • Hi Eric,

    here are my results:

    with no optimization and symbolic debug

    Cycles cold:   107121

    Cycles touch: 106482

    with -o3 and symbolic debug off:

    Cycles cold:   6808

    Cycles touch: 6161

    The touch loop consumes 65 cycles in both cases.
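To put the four numbers above side by side, here is a small sketch that computes the absolute and relative saving of the touched run over the cold run. The helper names are hypothetical and the arithmetic is plain integer math, truncated toward zero.

```c
#include <assert.h>

/* Cycles saved by the touched run relative to the cold run. */
long cycles_saved(long cold, long touched)
{
    return cold - touched;
}

/* Relative saving in hundredths of a percent (basis points),
 * computed with integer arithmetic and truncated. */
long saving_basis_points(long cold, long touched)
{
    assert(cold > 0);
    return (cold - touched) * 10000 / cold;
}
```

With the figures from this post, the absolute saving is almost the same in both builds (639 cycles unoptimized, 647 cycles optimized), while the relative saving goes from roughly 0.6% to roughly 9.5%. This is consistent with the earlier point that a large constant share of non-memory cycles hides the caching effect in the unoptimized build.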

    I attached the project so I hope you can reproduce.

    Kind regards,

    one and zero

    test.zip
  • Hi one and zero

          Thank you for your reply!

          I read your program and now understand why the touch case took more cycles than the cold case in my test program: I did the cache invalidate and writeback before the touch function.

          Now the touch case takes fewer cycles than the cold case.

     

       Eric