How does the cache miss pipeline work?

Eric Mao

Genius 4330 points

Hi All

Table 1-5 of the TMS320C66x DSP Cache User Guide gives performance data for L1D cache misses.

I wrote a program to test the miss pipeline.

First, I put a 4 KB data buffer in DDR3: unsigned char buffer[0x1000];

for (i = 1; i < 0x1000; i++)
    buffer[i] = (buffer[i-1] + i) & 0xfe;

Second, I call the touch function before this for loop to get parallel read misses.

touch(buffer, 0x1000);

for (i = 1; i < 0x1000; i++)
    buffer[i] = (buffer[i-1] + i) & 0xfe;
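For readers who have not seen a touch routine, here is a minimal sketch of the idea in plain C: read one byte from every cache line the buffer covers, so that the line fills are requested back to back before the data is actually used. The name touch_sketch and the 64-byte line size are assumptions for illustration. The touch() used in this test is a separate routine, and a C compiler is not guaranteed to schedule these loads so that the misses really overlap.

```c
#include <stddef.h>

#define L1D_LINE_BYTES 64u  /* assumed L1D line size */

/* Hypothetical stand-in for touch(): read one byte per cache line of
 * buf[0 .. bytes-1]. Returns the number of line-strided reads issued. */
size_t touch_sketch(const volatile unsigned char *buf, size_t bytes)
{
    size_t lines = 0;
    size_t off;
    unsigned char sink = 0;

    if (bytes == 0)
        return 0;

    /* volatile keeps the compiler from removing the otherwise unused loads */
    for (off = 0; off < bytes; off += L1D_LINE_BYTES) {
        sink ^= buf[off];
        lines++;
    }

    /* one extra read of the last byte covers a buffer that is not line-aligned */
    sink ^= buf[bytes - 1];
    (void)sink;

    return lines;
}
```

For the 4 KB buffer above, touch_sketch(buffer, 0x1000) issues 64 line-strided reads plus the final read of the last byte.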

The second version takes more cycles to execute than the first.

How can I use the cache miss pipeline to reduce the miss stalls?

Thank you!

over 13 years ago

0 one and zero over 13 years ago

TI__Mastermind 18146 points

Hi Eric,

did you make DDR3 cacheable?

You could e.g. use a CSL function for that:

CACHE_enableCaching(128);
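As a rough guide to which MAR number belongs to which address: each MAR covers a 16 MB window, so the index is the top 8 bits of the address, and MAR 128 covers 0x80000000 to 0x80FFFFFF. The helper names below are made up for illustration and are not part of CSL.

```c
#include <stdint.h>

/* Each MAR covers 16 MB, so the MAR index is the top 8 bits of the address. */
unsigned mar_index(uint32_t addr)
{
    return (unsigned)(addr >> 24);
}

/* Number of MARs spanned by the range [base, base + bytes). */
unsigned mar_count(uint32_t base, uint32_t bytes)
{
    if (bytes == 0)
        return 0;
    return mar_index(base + (bytes - 1u)) - mar_index(base) + 1u;
}

/* On the target, a loop along these lines would then enable each window
 * (sketch only, assuming the CSL cache header is included):
 *
 *   for (m = mar_index(base); m < mar_index(base) + mar_count(base, bytes); m++)
 *       CACHE_enableCaching(m);
 */
```

A buffer that lies entirely inside the first 16 MB of DDR3 needs only MAR 128; anything placed beyond that needs the following MARs enabled as well.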

Kind regards,

one and zero

0 Eric Mao over 13 years ago in reply to one and zero

Genius 4330 points

Hi one and zero

Thank you for your reply.

The MAR bits for the whole DDR3 range are all set to 1. The first thing my main function does is make DDR3 cacheable.

I configured L1P, L1D and L2 entirely as cache.

3487.touchTest.rar

Eric

0 one and zero over 13 years ago in reply to Eric Mao

TI__Mastermind 18146 points

Hi Eric,

I ran your test on a C6678 EVM and I see a cycle count decrease. The decrease is only about 5% (including the touch loop cycles).

You can improve this number in two ways:

1. Put only the buffer you're measuring into DDR3 and the rest of the sections in L2 (configure only half of L2 as cache)

2. Enable optimization. Use the "release" settings (you can also increase optimization to -o3, no symbolic debug)

This will reduce the portion of the cycle count in your test case that is not related to memory accesses, and therefore give a better speedup ratio ...

I hope this helps ...

Kind regards,

one and zero

0 Eric Mao over 13 years ago in reply to one and zero

Genius 4330 points

Hi one and zero

I ran the test program on my EVM board, and timeBuf[1] + timeBuf[2] is larger than timeBuf[0]. Your test shows a decrease of about 5%. Neither result reaches the performance described in the cache user guide.

0 one and zero over 13 years ago in reply to Eric Mao

TI__Mastermind 18146 points

Hi Eric,

as I already described in my previous post, your test case does not only measure the load/store performance of the buffer you're accessing.

If you apply my proposed changes you get a much better speedup, but you still include cycles not related to load and store operations. That means your comparison includes a constant number of cycles that are unrelated to caching effects.

Kind regards,

one and zero

0 Eric Mao over 13 years ago in reply to one and zero

Genius 4330 points

Hi one and zero

I ran the program as you suggested, but timeBuf[1] + timeBuf[2] is still larger than timeBuf[0].

These are the three cases I tested:

case 1: L2 0K SRAM / 512K cache, all sections in DDR3, -g (no optimization)

case 2: L2 256K SRAM / 256K cache, all sections in L2 except the buffer section, -g (no optimization)

case 3: L2 256K SRAM / 256K cache, all sections in L2 except the buffer section, -O3 (no symbolic debug)

           timeBuf[0]   timeBuf[1]   timeBuf[2]   timeBuf[1] + timeBuf[2]
case 1:         99056         1908        98374                    100282
case 2:         98871         1713        98304                    100017
case 3:         29498         1696        28669                     30365

Eric

0 one and zero over 13 years ago in reply to Eric Mao

TI__Mastermind 18146 points

Hi Eric,

here are my results:

with no optimization and symbolic debug

Cycles cold: 107121

Cycles touch: 106482

with -o3 and symbolic debug off:

Cycles cold: 6808

Cycles touch: 6161

The touch loop consumes 65 cycles in both cases.
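To put these numbers side by side: the absolute saving is almost the same in both builds (639 cycles without optimization, 647 with it). Only the ratio changes, because the unoptimized build carries roughly 100,000 cycles that have nothing to do with cache misses. A small sketch of that arithmetic follows. The function names are made up for illustration, and whether "Cycles touch" already includes the 65 touch-loop cycles is not stated here.

```c
/* Cycles saved by the touch version relative to the cold run. */
long cycles_saved(long cold, long touched)
{
    return cold - touched;
}

/* Saving in tenths of a percent of the cold run, truncated toward zero. */
long saving_per_mille(long cold, long touched)
{
    return (cold - touched) * 1000L / cold;
}
```

With the figures above, saving_per_mille(107121, 106482) gives 5 (about 0.6%) and saving_per_mille(6808, 6161) gives 95 (about 9.5%).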

I attached the project so I hope you can reproduce.

Kind regards,

one and zero

test.zip

0 Eric Mao over 13 years ago in reply to one and zero

Genius 4330 points

Hi one and zero

Thank you for your reply!

I read your program and now understand why, in my test program, the cycle count with touch is larger than the cold one. I do the cache invalidate and writeback before the touch function.

Now the cycle count with touch is smaller than the cold one.

Eric
