C6657 CPU stall

Hello,

I'd like to know the specification for CPU stall cycles. The CPU stalls for 10.5 cycles when it reads L2 SRAM and the access misses in the L1D cache. How many cycles does the CPU stall when it reads L2 SRAM directly with the L1D cache disabled (L1D configured as SRAM)?

C66x Cache User Guide (SPRUGY8): Table 3-2 L1D Performance Parameters (Number of Stalls)
e2e.ti.com/.../462966

Regards,
Kazu

  • Hi Kazu Kon,

    Please refer to the numbers in section 3.1, Memory Read Performance, of the Throughput Performance Guide for C66x KeyStone Devices (SPRABK5).

    http://www.ti.com/lit/sprabk5

    Thank you

  • Hello Rajasekaran,

    Thank you for the information. But I'm a little confused, because the two documents give different numbers for the DSP stall. The case is a single read, with L2 SRAM as the source and an L1D cache miss. Could you explain the difference?

    1) Cache User Guide (SPRUGY8): 10.5 cycles (10 or 11)
    2) Throughput application note (SPRABK5A1): 7 cycles

    Actually, the information I want is not for an L1D miss but for L1D cache disabled. I measured both cases, and it appears that reading with L1D cache disabled is faster than reading with an L1D miss. I would like accurate figures for the L1D-disabled case.

    Regards,
    Kazu

  • Hello Rajasekaran,

    Could you please give me an answer?

    Regards,
    Kazu

  • Kazu

    I think you are right. An L1D miss may require loading a new line into L1D and evicting the previous one, which might take more time. You do not need to disable L1D; all you need to do is disable the cache in L1D (I assume this is what you mean) and use L1D as SRAM. In that case, the core can access vectors stored in L1D with 0 wait states.

    Does that make sense? If this is enough for you, please close the thread.

    Ran

  • Hello Ran,

    Thank you for your reply. I'm sorry for asking the same question many times. Which CPU stall cycle count is correct? The case is an L1D cache miss with a read from L2 SRAM.

    1) Cache User Guide (SPRUGY8): 10.5 cycles (10 or 11)

    2) Throughput application note (SPRABK5A1): 7 cycles

    I'd also like to know the accurate CPU stall cycles when reading from L2 SRAM directly. In this case, L1D is configured as SRAM (not cache).

    Regards, Kazu

  • In general, the numbers in the Throughput Performance guide are measured, while the numbers in the user's guide are theoretical, but I am not sure if this is the case here.

    What I suggest is that you test it. Write a small program that reads data from L2 as follows:

    1. First define a volatile float global variable, say XX
    2. Next have a float pointer P
    3. Next set the L2 to SRAM only (no cache) and point P to the beginning of the L2 memory you will read
    4. Invalidate the L1 cache
    5. Record the timer time stamp (TSCL and TSCH)
    6. Run a loop 1024 times. In each iteration, read one value from L2 through the pointer, then advance the pointer by 32 values (that is, 128 bytes, since a float is 4 bytes). This ensures that every value is in a new cache line (for L1 and for L2 as well), so each access looks like a single read. Add the value read to XX, that is, do something like the following:

    XX = 0.0;
    for (i = 0; i < 1024; i++)
    {
        XX = XX + *P;   /* one read per iteration */
        P = P + 32;     /* advance 32 floats so the next read hits a new cache line */
    }

    7. At the end of the loop, read the TSCL/TSCH values again, then printf the value of XX and the time difference between the start and the end of the loop

    8. Compile the code with full optimization and make sure that the loop itself takes only a single cycle per iteration, so that all of the delay comes from reading the value from L2. If it takes more than one cycle, understand why and improve your code (it should not take more than 1 cycle)

    9. This will give you a real indication of the delay in reading from L2 when there is an L1 miss
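
    The steps above can be put together roughly as in the sketch below. It is only an illustration under several assumptions, not a verified benchmark: the names `read_cycles`, `time_strided_reads`, `run_measurement` and `buf` are made up for this example, the buffer is assumed to be placed in L2 SRAM by the linker command file, and the cache configuration and L1D invalidation (steps 3 and 4) are left as a comment because they depend on the cache API you use. When built with a compiler other than the TI C6000 one, it falls back to `clock()` so the logic can be checked on a host. It also accumulates into a local variable and stores to the volatile XX once at the end, which differs slightly from the steps as written.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#ifdef _TMS320C6X
#include <c6x.h>                 /* TI compiler header that declares TSCL and TSCH */
static uint64_t read_cycles(void)
{
    uint32_t lo = TSCL;          /* reading TSCL latches the upper half into TSCH */
    uint32_t hi = TSCH;
    return ((uint64_t)hi << 32) | lo;
}
#else
#include <time.h>
/* Host fallback (assumption for illustration): clock ticks, not CPU cycles. */
static uint64_t read_cycles(void) { return (uint64_t)clock(); }
#endif

#define NUM_READS 1024
#define STRIDE    32             /* floats per step: 32 * 4 bytes = 128 bytes */

/* Assumed to be linked into L2 SRAM on the target (placement not shown). */
static float buf[NUM_READS * STRIDE];

volatile float XX;               /* keeps the result observable to the compiler */

/* Times NUM_READS reads, each STRIDE floats apart; returns elapsed ticks. */
static uint64_t time_strided_reads(void)
{
    const float *p = buf;
    float sum = 0.0f;
    uint64_t start, end;
    int i;

    /* On the target: configure L2 as SRAM and invalidate L1D here. */

    start = read_cycles();
    for (i = 0; i < NUM_READS; i++) {
        sum += *p;               /* one read per iteration */
        p += STRIDE;             /* jump to a new cache line */
    }
    end = read_cycles();

    XX = sum;                    /* use the result so the loop is not removed */
    return end - start;
}

/* Fills the buffer, runs the timed loop once, prints and returns the sum. */
static float run_measurement(void)
{
    uint64_t ticks;
    int i;

    assert(STRIDE * sizeof(float) == 128);   /* stride in bytes */

    for (i = 0; i < NUM_READS * STRIDE; i++)
        buf[i] = 1.0f;

#ifdef _TMS320C6X
    TSCL = 0;                    /* any write to TSCL starts the counter */
#endif

    ticks = time_strided_reads();
    printf("XX = %f, elapsed = %llu ticks, %.2f ticks per read\n",
           (double)XX, (unsigned long long)ticks,
           (double)ticks / NUM_READS);
    return XX;
}
```

    Calling `run_measurement()` from `main` prints the sum and the elapsed count. The per-read figure still includes the cost of the loop itself, so one option is to run the same loop on data that is already in L1D and subtract that baseline.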

    Ran