best way to allocate RAM vs Cache on DM6437



I was wondering if someone could shed some light on how to decide the allocation of L1P/L1D/L2 memory between cache and SRAM.  What are the tradeoffs between cache and SRAM?  What types of use cases would want to maximize one or the other?  If I have a high-speed external memory like DDR2, should I just use the DMA to move data back and forth to SRAM, ping-pong buffer style?  Where should I look for the speed bottleneck in the various configurations?  A little design-philosophy insight would be appreciated.

Matt

  • You might want to start by putting everything in external memory and maximizing the cache sizes.  This can give you a baseline performance estimate.

    Once you have your baseline there are several different routes you could take:

    • put the most-used data (smaller than 48KB) in L1D SRAM
    • use L1D as "scratch" and use the DMA to bring in one chunk of data while you are processing another.  You would then ping-pong between the two buffers in this manner.

    I think the only time I would actually recommend reducing the size of the cache in order to have more SRAM would be for cases where you know with certainty that some particular code/data gets used MUCH more than anything else.  In that case it can make sense to reduce the performance of everything else by reducing the cache, such that you maximize the performance of your algorithm that you know takes up most of the processing.

    Of course, this is all highly dependent on your application.  If you describe in more detail what your application is and what kind of algorithms/operations you perform most frequently then perhaps we can give you some more tailored advice.
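    The scratch/ping-pong scheme from the second bullet can be sketched off-target like this (memcpy stands in for the EDMA3 transfers that would run concurrently on the DM6437, the two scratch buffers would be placed in L1D SRAM via the linker command file, and the names and doubling kernel are purely illustrative):

    ```c
    #include <string.h>

    #define CHUNK 256  /* words per transfer; tune to your working set */

    /* On target these would be placed in L1D SRAM via the linker command file. */
    static int ping[CHUNK], pong[CHUNK];

    /* Placeholder for the real processing kernel. */
    static void process(const int *in, int n, int *out)
    {
        int i;
        for (i = 0; i < n; i++)
            out[i] = in[i] * 2;
    }

    /* Process `total` words (assumed a multiple of CHUNK) from src to dst,
     * transferring one chunk while the other is being processed.  memcpy
     * stands in for an EDMA transfer kicked off asynchronously. */
    void pingpong_process(const int *src, int *dst, int total)
    {
        int *bufs[2] = { ping, pong };
        int idx = 0, off;

        memcpy(bufs[idx], src, CHUNK * sizeof(int));  /* prime the first chunk */

        for (off = 0; off < total; off += CHUNK) {
            if (off + CHUNK < total)  /* start the "DMA" of the next chunk */
                memcpy(bufs[idx ^ 1], src + off + CHUNK, CHUNK * sizeof(int));
            process(bufs[idx], CHUNK, dst + off);     /* crunch the current chunk */
            idx ^= 1;                                 /* swap ping and pong */
        }
    }
    ```

    With real EDMA you would kick off the next transfer, process the current buffer, then wait on the transfer-complete flag before swapping.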

  • Alright, after some thought I have a more specific situation.

    In one of my image processing steps, I want to apply an x & y pixel offset correction to an image frame (1280 by 300 pixels, too big to fit entirely in memory anywhere on my chip) using a lookup table (1280 by 300 by 2) to translate each pixel from uncorrected source to corrected destination buffers.  Assume that the image source, destination, and correction tables are in DDR2. 
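    As a rough sketch of that translation step (written as a gather, assuming the table stores a signed (dx, dy) int16 pair per pixel; all names and the layout are illustrative, not the actual code):

    ```c
    /* Correct one image row by looking up a per-pixel (dx, dy) offset pair.
     * lut points at the 2*width int16 entries for this row; offsets are
     * assumed small enough that (row + dy, x + dx) stays inside the image. */
    void correct_row(const unsigned char *src, unsigned char *dst,
                     const short *lut, int row, int width)
    {
        int x;
        for (x = 0; x < width; x++) {
            short dx = lut[2 * x];       /* x offset for this pixel */
            short dy = lut[2 * x + 1];   /* y offset for this pixel */
            dst[x] = src[(row + dy) * width + (x + dx)];
        }
    }
    ```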

    Because I will be accessing the lookup table sequentially, but only accessing each entry once, my thought was to ping-pong DMA the lookup table into L1D RAM.  There would not be much advantage to caching this data since it is never reused.

    Because the offsets change slowly, I thought that allocating a "cloud" of the source buffer into L1D cache would be the best approach.  For example, if my "average" table y offset was -5 for the current row and the spread was +-2, I could "touch" lines -3 through -7 into the cache.  I would then already have much of the data cached for the next row's offset, and the touch operation would only stall for the few lines newly in range.  Because I know that much of the data will be reused but can't cleanly separate it into blocks, I opted for cache instead of DMA + RAM.
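    That "touch" pass could look something like this off-target sketch (one load per 64-byte L1D line, the C64x+ line size; returning a checksum just keeps the compiler from discarding the loads, and the names are illustrative):

    ```c
    /* Pre-load rows [lo, hi] of the source image into L1D cache by reading
     * one byte per 64-byte cache line.  Each load miss pulls in a full line,
     * so subsequent accesses to those rows hit in cache. */
    unsigned touch_rows(const unsigned char *src, int width, int lo, int hi)
    {
        unsigned sum = 0;
        int r, b;
        for (r = lo; r <= hi; r++)
            for (b = 0; b < width; b += 64)
                sum += src[r * width + b];
        return sum;
    }
    ```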

    Now I am not really sure where to put my destination buffer.  I will be writing to the entries sequentially and only once.  If I had no need to reuse this buffer for now and wanted it to be stored in DDR2, how should I handle declaring the output?  Directly in external memory, L1D RAM sharing time with the DMA for export, L2 cache, L2 RAM, something else?

    I think I am slowly starting to wrap my head around the memory architecture with these devices... complicated to say the least. =)


  • MattLipsey said:

    Now I am not really sure where to put my destination buffer.  I will be writing to the entries sequentially and only once.  If I had no need to reuse this buffer for now and wanted it to be stored in DDR2, how should I handle declaring the output?  Directly in external memory, L1D RAM sharing time with the DMA for export, L2 cache, L2 RAM, something else?

    I would recommend putting the destination buffer in DDR.  Note, however, that the data may "land" in L2 cache, so you would need to do a block write-back on that buffer before any other DMAs/masters touch that memory in DDR.  I don't see any benefit to putting that data in internal memory, since writes are "fire and forget".  That is, whether it takes one cycle or 100 cycles for the data to land in memory doesn't matter to your algorithm.  You just do the write and move on.  Therefore writing directly to DDR would be more efficient.
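    A sketch of that sequence, with a no-op stub standing in for the block write-back (on the DM6437 this would be something along the lines of CACHE_wbL2() from the CSL or BCACHE_wb() under DSP/BIOS; check your CSL version for the exact name and signature, as these names are assumptions here):

    ```c
    #include <string.h>

    /* Stub for the block write-back.  On hardware this flushes the given
     * address range from L2 cache out to DDR2 so the EDMA (or any other
     * master) sees the data the CPU just wrote. */
    static void cache_wb_block(void *addr, unsigned bytes)
    {
        (void)addr;
        (void)bytes;   /* no-op when run off-target */
    }

    /* Write results straight to the DDR2 destination buffer, then write
     * back the cached range before handing it to another master. */
    void finish_block(unsigned char *ddr_dst, const unsigned char *results,
                      unsigned nbytes)
    {
        memcpy(ddr_dst, results, nbytes);   /* CPU writes may land in L2 cache */
        cache_wb_block(ddr_dst, nbytes);    /* flush to DDR before a DMA reads it */
    }
    ```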

    Brad