Copying Data in external memory

 

Hi all,


    I have some questions about copying data in external memory on the DM6437.

    Let's assume we have two buffers, A and B, placed in external memory,
    and I want to copy the data in buffer A to buffer B.

    I have assumed that the buffer sizes are multiples of the cache line size and that
    the buffers are aligned to the cache line size, and I have made external memory cacheable.

    I have written an assembly routine that copies 8 bytes of data every CPU cycle.
    So on flat memory, copying a 25,600-byte buffer would require only 3,200 cycles.
    But on the board, I measure around 73,300 cycles for the same transfer.
    I assume this is because of data cache misses. If not, please let me know what the reason could be.

    How could I bring this measurement closer to the flat-memory case?
    (Should the code flow be changed, or how can I minimize the cache misses?)
    Or would a DMA transfer yield better performance?
    
    Waiting for your replies,
    
Regards,
Sandeep
    
    
    
    
   

  • Even if the cache is enabled, if you have never accessed buffer A before (or if you have accessed enough other data for buffer A to be evicted from the cache), then there will still be an initial cache miss for each line as the data is read in from DDR. That is likely the delay you are seeing, or at least part of it.

    If your goal is to get buffer A to buffer B in external memory, then DMA will be the most efficient way to do it. DMA allows more efficient burst memory accesses than a multitude of individual CPU transfers, and it lets the CPU do other work while the transfer is in progress. Of course, with DMA you also have to worry about cache coherency, so you need to invalidate the buffer B region before reading it from the CPU.
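As a rough sanity check on the cache-miss explanation, the figures quoted in the question can be turned into a slowdown factor and an average stall per cache line. This is only a back-of-the-envelope sketch. The 64-byte line size is an assumption (the thread does not state it), and the names `CopyStats` and `copy_stats` are made up for illustration. The whole gap is also attributed to misses on the source lines, even though writes to buffer B generate external traffic too.

```c
#include <stddef.h>

/* Summary of a measured copy versus the ideal flat-memory figure. */
typedef struct {
    double slowdown;        /* measured cycles / ideal cycles */
    double bytes_per_cycle; /* effective throughput of the measured copy */
    double stall_per_line;  /* extra cycles per cache line of source data */
} CopyStats;

/* line_bytes is an assumed cache line size, not a value from the thread. */
CopyStats copy_stats(size_t bytes, double ideal_cycles,
                     double measured_cycles, size_t line_bytes)
{
    CopyStats s;
    double lines = (double)bytes / (double)line_bytes;

    s.slowdown = measured_cycles / ideal_cycles;
    s.bytes_per_cycle = (double)bytes / measured_cycles;
    s.stall_per_line = (measured_cycles - ideal_cycles) / lines;
    return s;
}
```

With the numbers from the question, `copy_stats(25600, 3200.0, 73300.0, 64)` gives a slowdown of about 22.9x, roughly 0.35 bytes per cycle, and about 175 extra cycles per assumed 64-byte line.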

  • Questions for you:

    1. Which parts of the device will access buffer A and when?  For example, is it written there by the video capture interface?  Does the CPU directly operate on that data?
    2. Same question but for buffer B.

    There are lots of cache coherence issues "lurking" in this simple scenario, so if you can give me specific usage info, I can suggest specific cache operations to perform.

    No matter how you write your code, you'll never be able to move 8 bytes per CPU cycle to or from external memory.  There are several factors affecting the speed at which you can transfer data in external memory:

    • External memory runs at a slower clock speed than the CPU.
    • The external bus is narrower than the internal buses, so you can transfer fewer bytes per DDR cycle.  (i.e. a STDW instruction will translate to multiple DDR stores)
    • DDR is a type of DRAM, which inherently requires operations like refreshes, page open, page close, etc. that further degrade overall throughput.
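The first two factors can be put into a quick upper-bound estimate. The sketch below is illustrative only: the function names are hypothetical, and the clock rates and bus width used in the usage note are placeholder values, not DM6437 datasheet figures. It also ignores the DRAM overheads from the third bullet, so real throughput will be lower.

```c
/* Peak external-memory bytes available per CPU cycle.
 * DDR transfers data on both clock edges, hence the factor of 2. */
double ddr_peak_bytes_per_cpu_cycle(double cpu_mhz, double ddr_mhz,
                                    unsigned bus_bytes)
{
    return (2.0 * ddr_mhz * (double)bus_bytes) / cpu_mhz;
}

/* A memory-to-memory copy reads and writes external memory, so every
 * copied byte crosses the external bus twice. */
double copy_peak_bytes_per_cpu_cycle(double cpu_mhz, double ddr_mhz,
                                     unsigned bus_bytes)
{
    return ddr_peak_bytes_per_cpu_cycle(cpu_mhz, ddr_mhz, bus_bytes) / 2.0;
}
```

For example, with a hypothetical 600 MHz CPU, a 150 MHz DDR clock, and a 4-byte-wide bus, the bus can deliver at most 2 bytes per CPU cycle, so a copy could move at most 1 byte per CPU cycle, well short of the 8 bytes per cycle the routine achieves on flat memory.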

    Brad

  • Hi Bernie and Brad,

     

    Thanks for your replies.

     

    Considering the video decoder process: buffer A is my reference buffer, which holds the reference frame (I or P frames) and is operated on only by the CPU. Buffer B is my output buffer. For B frames it is filled directly by the decoding process, while for I and P frames it is copied from the corresponding reference buffer (buffer A). Buffer B is later required by the peripheral for display.

    Waiting for your replies,

    Regards,

    Sandeep

     

  • Ideally, you would avoid the copy altogether and just point the display driver at the buffer that currently contains the frame you want to display. That is, if buffer A contains the frame you want to display, give the video driver a pointer to buffer A and have your video decoder process fill a different buffer.
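One way to picture the no-copy approach is a small pool of frame buffers whose roles rotate, so that only pointers change hands. This is a minimal sketch under stated assumptions: `FramePool`, `pool_init`, `pool_decode_target`, and `pool_flip` are hypothetical names, three buffers is an arbitrary choice, and a real decoder would also have to keep a buffer out of rotation for as long as it is still needed as a reference frame.

```c
#include <stddef.h>

#define NUM_FRAMES 3 /* arbitrary pool size for illustration */

typedef struct {
    unsigned char *frames[NUM_FRAMES];
    int display_idx; /* frame currently handed to the display, -1 if none */
    int decode_idx;  /* frame the decoder is currently filling */
} FramePool;

void pool_init(FramePool *p, unsigned char *bufs[NUM_FRAMES])
{
    int i;
    for (i = 0; i < NUM_FRAMES; i++) {
        p->frames[i] = bufs[i];
    }
    p->display_idx = -1;
    p->decode_idx = 0;
}

/* Buffer the decoder should write the next frame into. */
unsigned char *pool_decode_target(const FramePool *p)
{
    return p->frames[p->decode_idx];
}

/* Call when a frame is fully decoded: the finished buffer becomes the
 * display buffer and the decoder moves on to the next one. No pixel
 * data is copied. Returns the pointer to give to the display driver. */
unsigned char *pool_flip(FramePool *p)
{
    p->display_idx = p->decode_idx;
    p->decode_idx = (p->decode_idx + 1) % NUM_FRAMES;
    return p->frames[p->display_idx];
}
```

The pointer returned by `pool_flip` is what would be passed to the display driver in place of a copied buffer B.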

    If you did want to keep the copy, then the fastest way to get from A to B would be to use a DMA transfer, something like:

    1. Perform a cache writeback on the buffer A region (maybe BCACHE_wb(...); this ensures that the latest data is in external memory)
    2. Configure a DMA transfer from A to B and start it from the CPU with a completion interrupt (or you could poll on it, but this would waste CPU cycles)
    3. Wait for the DMA to complete (block the thread and do something else, or poll)
    4. In the completion interrupt, unblock the thread and then pass buffer B to the display driver

    This should get the data from A to B in the fastest way possible since the DMA allows for bursting memory accesses in addition to allowing your CPU to keep processing while the transfer goes on.

    EDIT: This all assumes that the data is being generated by the CPU, since you mention this is a video decoder (H.264?). It also assumes that buffer B is only used to display video, so that the CPU never needs to read back from buffer B (if it did, you would need to perform a cache invalidate on buffer B first).
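The reason for the writeback in step 1 can be shown with a toy host-side model. Nothing here is target code: `ToyBuffer` and the `toy_*` functions are hypothetical stand-ins that use `memcpy` to imitate a cache writeback, a cache invalidate, and a DMA engine that only sees external memory. On the device, the writeback stand-in would correspond to a call such as the BCACHE_wb(...) mentioned above, and the DMA stand-in to an actual DMA transfer with a completion interrupt.

```c
#include <string.h>

#define FRAME_BYTES 256 /* small size so the model runs anywhere */

/* Toy model: 'cached' is what the CPU sees, 'ddr' is what DMA sees. */
typedef struct {
    unsigned char cached[FRAME_BYTES];
    unsigned char ddr[FRAME_BYTES];
} ToyBuffer;

/* CPU writes land in the cache first; external memory stays stale. */
void toy_cpu_fill(ToyBuffer *b, unsigned char value)
{
    memset(b->cached, value, FRAME_BYTES);
}

/* Stand-in for a cache writeback: push dirty data out to DDR. */
void toy_cache_wb(ToyBuffer *b)
{
    memcpy(b->ddr, b->cached, FRAME_BYTES);
}

/* Stand-in for a cache invalidate: the CPU re-reads from DDR. */
void toy_cache_inv(ToyBuffer *b)
{
    memcpy(b->cached, b->ddr, FRAME_BYTES);
}

/* Stand-in for the DMA engine: it bypasses the cache entirely. */
void toy_dma_copy(ToyBuffer *dst, const ToyBuffer *src)
{
    memcpy(dst->ddr, src->ddr, FRAME_BYTES);
}

/* Copies src to dst via "DMA". Returns 1 if the display-visible data in
 * dst matches what the CPU wrote to src, 0 if stale data was copied. */
int toy_copy_frame(ToyBuffer *dst, ToyBuffer *src, int do_writeback)
{
    if (do_writeback) {
        toy_cache_wb(src);  /* step 1: make DDR hold the latest frame */
    }
    toy_dma_copy(dst, src); /* steps 2-3: transfer and wait */
    return memcmp(dst->ddr, src->cached, FRAME_BYTES) == 0;
}
```

In this model, skipping the writeback makes `toy_copy_frame` return 0 because the DMA copies stale external memory, while doing the writeback first returns 1. `toy_cache_inv` covers the case from the EDIT note, where the CPU needs to read buffer B back after the DMA has written it.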