C6657 CPU stall

Kzk

Hello,

I'd like to know the specification for CPU stall cycles. The CPU stalls for 10.5 cycles when it reads L2 SRAM after an L1D cache miss. How many stall cycles does the CPU incur when it reads L2 SRAM directly with the L1D cache disabled (L1D configured as SRAM)?

C66x Cache User Guide (SPRUGY8): Table 3-2 L1D Performance Parameters (Number of Stalls)
e2e.ti.com/.../462966

Regards,
Kazu

over 9 years ago

0 Raja over 9 years ago

TI__Guru* 81335 points

Hi Kazu Kon,

Please refer to the Memory Read Performance numbers in section 3.1 of the Throughput Performance Guide for C66x KeyStone Devices (SPRABK5).

http://www.ti.com/lit/sprabk5

Thank you

0 Kzk over 9 years ago in reply to Raja

Genius 5785 points

Hello Rajasekaran,

Thank you for the information. But I'm a little confused because the two documents give different figures for the DSP stall. The case is a single read with L2 RAM as the source and an L1D cache miss. Could you explain the difference?

1) Cache User Guide (SPRUGY8): 10.5 cycles (10 or 11)
2) Throughput application note (SPRABK5A1): 7 cycles

Actually, the case I care about is not an L1D miss but L1D disabled. I measured both cases, and I think L1D disabled is faster than an L1D miss. I want accurate figures for the L1D-disabled case.

Regards,
Kazu

0 Kzk over 9 years ago in reply to Kzk

Genius 5785 points

Hello Rajasekaran,

Could you please give me an answer?

Regards,
Kazu

0 ran35366 over 9 years ago in reply to Kzk

TI__Genius 12805 points

Kazu

I think you are right. An L1D miss may require loading the new value into L1D and evicting the previous value, which might take more time. You do not need to disable L1D; all you need to do is disable the cache in L1D (I assume this is what you mean) and use L1D as SRAM. In that case, the core can access vectors stored in L1D with 0 wait states.

Does that make sense? If this is enough for you, please close the thread.

Ran

0 Kzk over 9 years ago in reply to ran35366

Genius 5785 points

Hello Ran,

Thank you for your reply. I'm sorry for asking the same question so many times. Which CPU stall cycle count is correct for the case of an L1D cache miss followed by a read from L2 SRAM?

1) Cache User Guide (SPRUGY8): 10.5 cycles (10 or 11)

2) Throughput application note (SPRABK5A1): 7 cycles

I'd also like to know the accurate CPU stall cycles for reading from L2 SRAM directly. In this case, L1D is configured as SRAM (not cache).

Regards, Kazu

0 ran35366 over 9 years ago in reply to Kzk

TI__Genius 12805 points

In general, the Throughput Performance numbers are measured while the numbers in the user’s guide are theoretical, but I am not sure if this is the case here.

What I suggest is that you test it. Write a small program that reads data from L2 as follows:

1. First define a volatile float global variable, say XX
2. Next declare a float pointer p
3. Next set L2 to SRAM only (no cache) and point p at the beginning of L2 memory
4. Invalidate the L1 cache
5. Record the timer time stamp (TSCL and TSCH)
6. Run a loop 1024 times. Read 1024 values from L2 through the pointer, advancing the pointer by 32 values each time (that is 128 bytes, since a float is 4 bytes). This ensures that every value is in a new cache line (for L1 and for L2 as well), so each access looks like a single read. Add the value read to XX, that is, do something like the following

XX = 0.0f;
for (i = 0; i < 1024; i++)
{
    XX = XX + *p;
    p = p + 32;
}

7. At the end of the loop, read the TSCL/TSCH values again, then printf the value of XX and the time difference between the start and the end of the loop

8. Compile the code with full optimization and make sure that each loop iteration takes only a single cycle, so that all the delay comes from reading the value from L2. If it takes more than one cycle, understand why and improve your code (it should not take more than 1 cycle)

9. This will give you a real indication of the delay in reading from L2 when there is an L1 miss
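
Putting the steps above together, a minimal sketch of such a test might look like the following. Several things here are assumptions rather than part of the description above: the names l2_buf, read_cycles, start_cycle_counter and measure_strided_reads are made up for illustration, the buffer still has to be placed in L2 SRAM through your linker command file, and the L2 configuration and L1D invalidate must be done by the caller before the measurement. The sketch also accumulates into a local variable and stores to the volatile XX once at the end, which differs from the loop shown above. When it is not built with the TI C6000 compiler it falls back to clock(), only so the logic can be checked on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef _TMS320C6X
#include <c6x.h>                      /* declares the TSCL/TSCH registers */
static void start_cycle_counter(void) { TSCL = 0; }  /* any write starts the counter */
static uint64_t read_cycles(void)
{
    uint32_t lo = TSCL;               /* reading TSCL latches the upper half into TSCH */
    uint32_t hi = TSCH;
    return ((uint64_t)hi << 32) | lo;
}
#else
#include <time.h>                     /* host fallback: clock ticks, not CPU cycles */
static void start_cycle_counter(void) { }
static uint64_t read_cycles(void) { return (uint64_t)clock(); }
#endif

#define NUM_READS      1024
#define STRIDE_FLOATS  32             /* 32 * sizeof(float) = 128 bytes per step */

/* Assumption: the linker command file places this buffer in L2 SRAM. */
float l2_buf[NUM_READS * STRIDE_FLOATS];

volatile float XX;                    /* keeps the compiler from removing the reads */

typedef struct {
    uint64_t cycles;                  /* elapsed counter ticks for the whole loop */
    float    sum;                     /* sum of the values read */
} stall_result;

/* Reads `reads` floats spaced `stride` floats apart, starting at `base`.
 * Configure L2 as SRAM and invalidate L1D before calling this. */
stall_result measure_strided_reads(const float *base, size_t reads, size_t stride)
{
    stall_result r;
    const float *p = base;
    float acc = 0.0f;
    uint64_t start, end;
    size_t i;

    assert(base != NULL && stride > 0);

    start_cycle_counter();
    start = read_cycles();
    for (i = 0; i < reads; i++) {
        acc += *p;                    /* one read per 128-byte step */
        p += stride;
    }
    XX = acc;                         /* single volatile store after the loop */
    end = read_cycles();

    r.cycles = end - start;
    r.sum = acc;
    return r;
}
```

Calling measure_strided_reads(l2_buf, NUM_READS, STRIDE_FLOATS) and dividing the returned cycles by NUM_READS gives an average cost per read. That figure still includes the loop overhead, so check the generated assembly before attributing all of it to stalls.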

Ran
