
About C6474 multicore cache considerations



I read in the 'Multicore programming guide' application report that there is no cache coherency between L1/L2 on one core and L1/L2 on another core. So if Core 0 reads from or writes to L1D SRAM in Core 1, cache coherency must be maintained by software, am I right? It seems really ridiculous that the cache is not coherent when reading or writing L1D SRAM.

To avoid the cache coherency problem, suppose I turn off cache mapping for the other core's L1/L2 SRAM. I'd like to know the performance of reading the other core's L1D/L2 SRAM, i.e. the cycles needed to read/write one byte from/to the other core's L1D/L2. Could it be the same speed as the core reading external memory such as SDRAM?

Thanks for any reply.

  • touse said:

    I read in the 'Multicore programming guide' application report that there is no cache coherency between L1/L2 on one core and L1/L2 on another core. So if Core 0 reads from or writes to L1D SRAM in Core 1, cache coherency must be maintained by software, am I right? It seems really ridiculous that the cache is not coherent when reading or writing L1D SRAM.

    <AVM> Yes, the software developer has to manage cache coherency. This gives the developer the ability to handle cache coherency efficiently and save power at the same time.

    touse said:
    To avoid the cache coherency problem, suppose I turn off cache mapping for the other core's L1/L2 SRAM. I'd like to know the performance of reading the other core's L1D/L2 SRAM, i.e. the cycles needed to read/write one byte from/to the other core's L1D/L2. Could it be the same speed as the core reading external memory such as SDRAM?

    <AVM> I don't have the exact numbers, but I think it would be much faster than SDRAM, since the L1/L2 memories are internal to the megamodule and we can access any core's L1/L2 through their global addresses.

    touse said:
    Thanks for any reply.

     

  • My answers are above, marked with <AVM> tags.

  • touse said:
    So if Core 0 reads from or writes to L1D SRAM in Core 1, cache coherency must be maintained by software, am I right? It seems really ridiculous that the cache is not coherent when reading or writing L1D SRAM.

    You probably already understand this quite well, but I want to be clear about what is non-coherent. When Core 0 reads from a global memory address that is Core1's L1D, Core 0 goes out through its Master Direct Memory Access (MDMA) port to an SCR to a bridge to another SCR and to the Slave Direct Memory Access (SDMA) port which goes inside Core1 to get the data. This is a communications path and not a direct memory bus. Inside each core, there are direct memory bus paths to their local L1 and L2 memories, but to get to another core's memory requires a pretty long logical path.
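
    As a concrete illustration (a sketch only; the 0x10000000 + core * 0x01000000 alias layout matches the MAR16/17/18 ranges discussed later in this thread, but verify it against the C6474 data manual), computing the global alias of another core's internal memory looks like this in C:

        /* Map a local L1D/L2 address on core 'core' (0-2) to the global
         * alias that other masters must use. Assumes each core's 16 MB
         * alias page starts at 0x10000000 + core * 0x01000000. */
        #include <c6x.h>   /* DNUM cregister: the executing core's id (TI compiler) */

        #define C6474_GLOBAL_BASE 0x10000000u
        #define CORE_ALIAS_SPAN   0x01000000u   /* one MAR page (16 MB) per core */

        static inline void *global_addr(unsigned core, void *local)
        {
            return (void *)(C6474_GLOBAL_BASE + core * CORE_ALIAS_SPAN
                            + ((unsigned)local & (CORE_ALIAS_SPAN - 1u)));
        }

        /* Example: Core0 reaching a buffer Core1 keeps at local L2 address
         * 0x00800100 (address purely illustrative). DNUM tells a core its
         * own id, so shared code can compute its peers' aliases. */
        volatile unsigned *remote =
            (volatile unsigned *)0x11800100;  /* == global_addr(1, (void *)0x00800100) */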

    When Core0 reads Core1's L1D, Core0 makes a cache line in Core0's L1D and/or L2 with that data. The next time Core0 reads that same or a nearby location from Core1's L1D, Core0 does not have to go out through that long communication path but can just read from Core0's local cache copy. If Core1 writes to its own L1D at that same location, there is no direct memory path or snoop signalling to go tell Core0 and Core2 that something was written; this would be required to maintain coherency in this memory architecture. The converse is that if Core0 writes to its locally cached copy of what was in Core1's L1D, Core0 does not forward that data to Core1 immediately but just works with its local cached copy until that cache line gets evicted.
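
    To make "maintained by software" concrete, here is a minimal producer/consumer sketch using the C64x+ CSL block cache operations; the header, function names, and CACHE_WAIT flag are from CSL 3.x and should be checked against your CSL release:

        #include <csl_cacheAux.h>   /* may also need csl.h/soc.h, depending on release */

        /* Core1 side: the buffer lives in Core1's L2 SRAM, so Core1's own
         * L1D cache may hold dirty copies. Write them back before telling
         * Core0 the data is ready. */
        void producer_on_core1(unsigned *buf, unsigned nwords)
        {
            /* ... fill buf ... */
            CACHE_wbL1d(buf, nwords * sizeof(*buf), CACHE_WAIT);
        }

        /* Core0 side: 'remote' is the global alias of Core1's buffer. If
         * Core0 read it before, stale copies may sit in Core0's L1D/L2
         * cache; invalidate them so the read goes out over the SDMA path.
         * (L2 block operations also remove L1 copies; use CACHE_invL1d
         * instead if your L2 is configured as all SRAM.) */
        unsigned consumer_on_core0(volatile unsigned *remote, unsigned nwords)
        {
            unsigned i, sum = 0;
            CACHE_invL2((void *)remote, nwords * sizeof(*remote), CACHE_WAIT);
            for (i = 0; i < nwords; i++)
                sum += remote[i];
            return sum;
        }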

    These L1D memories have <1ns access times. They are very fast, very expensive, and have to be physically very close to their respective cores. Adding more logic paths to broadcast coherency information everywhere would add complexity and hurt performance, and we think you will prefer the performance side of this particular tradeoff. At least, we have found that many critical functions for many customers benefit from it.

    touse said:
    I'd like to know the performance of reading the other core's L1D/L2 SRAM, i.e. the cycles needed to read/write one byte from/to the other core's L1D/L2. Could it be the same speed as the core reading external memory such as SDRAM?

    I thought the same thing as ArunMani about the relative performance of using another core's memory compared to external DDR2 memory. To get the true answer for whatever you exactly intend, you will want to run benchmark testing; the results will vary depending on cache size, cache enabling, and the amount of data being transferred. Because of the long communication path that I listed above for accessing another core's memory, and because of the shorter logical path to reach the DDR EMIF (just one SCR), the access speeds can be very comparable and some test cases will actually find DDR accesses to be faster than those to another core's memory.
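
    If you want to measure it yourself, a rough skeleton using the C64x+ free-running time stamp counter (TSCL via c6x.h with the TI compiler) could look like the following; the two buffer addresses are purely illustrative, and the results will swing with cache and MAR settings, so time each configuration you care about:

        #include <c6x.h>
        #include <stdio.h>

        #define WORDS 256

        static unsigned time_read(volatile unsigned *src)
        {
            volatile unsigned sink;
            unsigned i, t0, t1;
            t0 = TSCL;                    /* CPU cycle counter, lower 32 bits */
            for (i = 0; i < WORDS; i++)
                sink = src[i];
            t1 = TSCL;
            (void)sink;
            return t1 - t0;               /* cycles for WORDS word reads */
        }

        void bench(void)
        {
            /* Illustrative addresses: Core1's L2 via its global alias, and a
             * DDR2 buffer; substitute buffers your application actually owns. */
            volatile unsigned *remote = (volatile unsigned *)0x11800000;
            volatile unsigned *ddr    = (volatile unsigned *)0x80000000;

            TSCL = 0;   /* any write starts the (free-running) counter */
            printf("remote L2: %u cycles\n", time_read(remote));
            printf("DDR2:      %u cycles\n", time_read(ddr));
        }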

    Again, tradeoffs were made to optimize this high-performance device for being able to get large amounts of data (which means DDR) and process that as fast as possible (copied or cached locally) rather than pushing data from one core to another internally.

    In every case, though, the best solution is to use EDMA resources to move data between one core's memory and whatever other resource needs to be accessed. For only a few words of data, direct CPU accesses are best, but for any medium to large buffers you will want to use EDMA, sometimes even if the CPU has to wait for the transfer to complete.
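
    For reference, a bare-bones sketch of a one-shot, CPU-triggered EDMA3 copy follows. The PaRAM layout and channel-register offsets are the standard EDMA3 ones, but the controller base address used here (0x02A00000) and the free channel number are assumptions to verify against the C6474 data manual, and a real system should allocate channels through its OS or driver framework rather than writing registers directly:

        #include <stdint.h>

        #define EDMA3CC_BASE 0x02A00000u   /* assumption - check the data manual */
        #define PARAM(ch)    ((volatile uint32_t *)(EDMA3CC_BASE + 0x4000u + 32u * (ch)))
        #define ESR  (*(volatile uint32_t *)(EDMA3CC_BASE + 0x1010u))  /* event set   */
        #define IPR  (*(volatile uint32_t *)(EDMA3CC_BASE + 0x1068u))  /* int pending */
        #define ICR  (*(volatile uint32_t *)(EDMA3CC_BASE + 0x1070u))  /* int clear   */

        /* One A-synchronized transfer of 'bytes' bytes on DMA channel ch (< 32).
         * src/dst must be *global* addresses, even for a core's own L1/L2. */
        void edma_copy(unsigned ch, void *dst, const void *src, uint16_t bytes)
        {
            volatile uint32_t *p = PARAM(ch);
            p[0] = (1u << 20) | (ch << 12) | (1u << 3); /* OPT: TCINTEN, TCC=ch, STATIC */
            p[1] = (uint32_t)src;                       /* SRC                          */
            p[2] = (1u << 16) | bytes;                  /* BCNT=1 | ACNT=bytes          */
            p[3] = (uint32_t)dst;                       /* DST                          */
            p[4] = 0;                                   /* DSTBIDX | SRCBIDX            */
            p[5] = 0xFFFFu;                             /* BCNTRLD | LINK=null          */
            p[6] = 0;                                   /* DSTCIDX | SRCCIDX            */
            p[7] = 1;                                   /* CCNT                         */

            ESR = 1u << ch;                /* manual trigger              */
            while (!(IPR & (1u << ch)))    /* spin until the TCC lands    */
                ;
            ICR = 1u << ch;                /* clear the completion flag   */
        }

    The usual cache rules still apply around the transfer: write back the source's cached lines before triggering, and invalidate the destination's lines before the CPU reads the result.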

  • Thanks, Randy. As you said, when Core0 reads Core1's L1D, Core0 makes a cache line in Core0's L1D and/or L2 with that data. I wonder if that is always the case. From the C6474 datasheet, the MAR registers control whether an area of memory is cached or not. So if I set MAR16/17/18 to zero, which means the other cores' L1D/L2 memory isn't cacheable, then I think no cache line is allocated for the other core's L1/L2 memory, is that correct? Because if this area is cached, cache coherency has to be maintained even if only a few words are read from another core.

     

  • touse said:
    So if I set MAR16/17/18 to zero, which means the other cores' L1D/L2 memory isn't cacheable, then I think no cache line is allocated for the other core's L1/L2 memory, is that correct?

    Yes, that is correct. If Core0's MAR17=0, then Core1's L1D/L2 memories are not cached when accessed by Core0.
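
    For completeness, clearing those bits is just a register write; each core has its own MAR file, and on the C64x+ megamodule it starts at address 0x01848000 (MARn covering the 16 MB page at n * 0x01000000 - verify against the C6474 data manual). Writeback-invalidate anything already cached from those ranges before you clear the bits:

        #define MAR ((volatile unsigned int *)0x01848000)

        /* Make all three cores' internal-RAM alias pages non-cacheable
         * from the core running this code. */
        void disable_remote_caching(void)
        {
            MAR[16] = 0;   /* 0x10000000-0x10FFFFFF: Core0 aliases */
            MAR[17] = 0;   /* 0x11000000-0x11FFFFFF: Core1 aliases */
            MAR[18] = 0;   /* 0x12000000-0x12FFFFFF: Core2 aliases */
        }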

    touse said:
    if this area is cached, cache coherency has to be maintained even if only a few words are read from another core.

    This is also correct. If cache coherency is an issue, then it must be maintained manually by the caching core.

    If you are only reading a few words, then turning off the cache may be a very helpful thing to do. When caching is enabled, data is cached in CACHE_LINE-sized chunks. If data within that cache line is not directly accessed on the caching side but does get modified on the distant core's side, there can be additional problems with maintaining coherency.
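
    One common defense against that last problem is to size and align shared buffers to whole cache lines, so a block writeback or invalidate can never clip an unrelated neighbour. A sketch with the TI compiler's DATA_ALIGN pragma (L2 lines are 128 bytes on C64x+, L1D lines 64):

        #define L2_LINE 128

        /* Start on a line boundary and round the size up to whole lines so
         * no other object shares this buffer's first or last cache line. */
        #pragma DATA_ALIGN(shared_buf, L2_LINE)
        unsigned char shared_buf[(600 + L2_LINE - 1) & ~(L2_LINE - 1)];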