
A basic question about cache-enabled vs. cache-disabled memory access



Hi, my name is Peterson. I have a basic question about how the processor accesses memory elements in cache-enabled versus cache-disabled scenarios.

In SPRU656A, the TMS320C6000 DSP Cache User's Guide, Chapter 1, page 1-4, the cache-hit/miss scenario is described.

It explains that when the cache is enabled, the DSP brings data from slower off-chip memory into the faster on-chip L1 or L2 cache, based on the principle of locality. This gives us faster access to the data we need.
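To make the locality idea concrete, here is a minimal C sketch (my own illustration, not from the cache guide; the 64-byte line size is an assumption typical of C64x L1D):

```c
#include <stddef.h>

/* Hypothetical sketch of spatial locality: summing an array in order.
   The first access to buf[0] misses and pulls a whole L1D cache line
   (assumed 64 bytes here) into on-chip memory; the next accesses that
   fall within the same line then hit in the fast L1D cache. */
int sum_sequential(const int *buf, size_t n)
{
    int sum = 0;
    size_t i;
    for (i = 0; i < n; i++)
        sum += buf[i];   /* roughly one miss per cache line, hits otherwise */
    return sum;
}
```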

However, when we review the assembly (.asm) file that Code Composer Studio generates from our C/C++ code (using the -k option), we see that all memory reads/writes through LDW, LDDW, STW, LDBU etc, still take 4-cycles to complete.

Shouldn't they take fewer cycles to read the required data, since it is now in on-chip memory?

Moreover, if the cache is not enabled, these memory accesses still take 4 cycles to complete. Similarly, when we write our own assembly code, we always have to wait 4 cycles before the required data is available in the desired CPU register.

So what makes the difference between cached and non-cached memory accesses? Cache-enabled applications are clearly much faster and deliver more performance than non-cached applications.

What is the magic behind it?

  • Yes, the cache is by design the fastest memory available to the DSP core. On C6x devices, the L1P and L1D caches take only one cycle per access. It is a bit tricky to measure cache performance on a real device because you need to set up the right conditions to verify the results: the program size, its alignment in memory, whether it is in the cache at the point of measurement, and so on.

    It is easier to study cache performance with a device simulator and the profiler (both come with CCS), and the new "Cache Tune" tool greatly helps reduce the cache-related performance impact on an algorithm or application.

    For a good intro to simulation, refer to http://processors.wiki.ti.com/index.php/Category:Simulation and, specifically for cache analysis, http://processors.wiki.ti.com/index.php/Cache_Analysis_Using_Simulator

    The "Cache Tune" tutorial is available in the SPRAA01 application note.

  • Peterson,

    I would like to add some clarification for one subtle part of your question:

    Bilal_BumbleBee said:
    we see that all memory reads/writes through LDW, LDDW, STW, LDBU etc, still take 4-cycles to complete.

    Although STW probably should not be in this list, my assumption is that you are referring to the 4 delay slots required after each of these instructions before the result of the LDx lands in the target register. This 4-cycle delay is a requirement of the DSP's pipeline architecture and is independent of the delay in accessing the memory component. If L1D is being accessed or an L1D cache hit occurs, then the 4-cycle delay is all that is required. However, if the source address is farther away than L1D, or a cache miss occurs, there will be a processor stall until the memory results are delivered to the DSP's pipeline.
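    As a hypothetical illustration of why those 4 delay slots need not cost throughput (this is my own sketch, not from the pipeline guide):

```c
/* In hand-written C6000 assembly, the four delay slots after an LDx can
   be filled with independent instructions; in C, the compiler does the
   equivalent by software-pipelining loops like this one, overlapping
   the loads for iteration i+1 with the multiply-accumulate on
   iteration i, so the 4-cycle load latency is hidden on cache hits. */
int dot(const short *x, const short *y, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];  /* each LDH result is usable 4 cycles
                                later; the pipelined loop keeps the
                                functional units busy in the meantime */
    return sum;
}
```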

    You can read more about the pipeline architecture in the CPU & Instruction Set Guide for the DSP you are using. There may be additional information in other documents, such as the Cache User's Guide or Two-Level Memory User's Guide.

    Regards,
    RandyP

     

    If Loc's reply above answered your question, or this one did, please click  Verify Answer  on that post. If we have not answered your question, please tell us more.

  • Hi RandyP,

    Yes, you are right; the DSP does need four cycles to get a memory location read into a CPU register. I also tried wide memory accesses, reading four bytes all at once instead of getting them as single bytes, making the reads smarter and activating all the BEx byte-enable signals simultaneously. However, I think the real solution is to do some rework at the PCB level, which will free future modifications from having to take these considerations into account. I hope to complete it soon; thanks again for the insight.
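    The wide-access idea can be sketched in C like this (my own hypothetical illustration; on C6000, each aligned 32-bit transfer typically compiles to a single LDW/STW pair instead of four LDBU/STB pairs):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch: copy n bytes using word-wide transfers for the bulk of the
   data, falling back to byte copies for any tail. memcpy of a fixed
   4-byte chunk lets the compiler emit a single 32-bit load/store
   without violating alignment or aliasing rules. */
void copy_wide(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    /* word-wide loop: one 32-bit access moves four bytes */
    for (; i + 4 <= n; i += 4) {
        uint32_t w;
        memcpy(&w, src + i, 4);
        memcpy(dst + i, &w, 4);
    }
    /* tail: remaining bytes one at a time */
    for (; i < n; i++)
        dst[i] = src[i];
}
```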

    Regards,

    Bilal