Hi,
I'm working on a performance-critical project on the 6678, and here is what I observed. I am computing an array dot product (array size 2048) on a single core, and the code has been optimized. With the array data cached (i.e. in L1D; I disabled the L2 cache), the operation took 551 cycles, which was what I expected. However, when I put the array data in L2 SRAM, the same operation took 1228 cycles (roughly two times the former case). The C66x core's data bus is 128 bits wide (64 bits per side), and the optimized code reads 128 bits per cycle to do four multiplies per cycle. So I take it that the *bandwidth* of L1D is at least 128 bits per cycle while the *bandwidth* of L2 is only 64 bits per cycle, so that only half the throughput is achieved. Is that so?
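The arithmetic behind this guess can be sketched as follows. This is only an illustration of the cycle counting, not the optimized kernel itself: the helper names `ideal_cycles` and `overhead_cycles` are made up for this post, and the figure of two multiplies per cycle for a 64-bit path is an assumption derived from my guess, not a documented number.

```c
/* Hypothetical helpers for the cycle arithmetic; not part of any TI library. */

/* Ideal loop cycles for n multiplies at a given number of multiplies per
 * cycle, rounded up and ignoring all setup and memory stalls. */
static int ideal_cycles(int n, int multiplies_per_cycle)
{
    return (n + multiplies_per_cycle - 1) / multiplies_per_cycle;
}

/* Cycles measured beyond the ideal count (loop setup, stalls, and so on). */
static int overhead_cycles(int measured, int ideal)
{
    return measured - ideal;
}
```

With n = 2048, `ideal_cycles(2048, 4)` gives 512, so the 551 cycles measured from L1D leave 39 cycles of overhead. If the L2 path really supplied only 64 bits per cycle (assumed here to mean two multiplies per cycle), `ideal_cycles(2048, 2)` gives 1024, and the 1228 cycles measured from L2 SRAM would leave 204 cycles unexplained by bandwidth alone.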
PS: Where can I find more details about the differences between L1 and L2 SRAM?
Thanks,
Roy