C674x SDRAM to IRAM transfers

Hello,

I need to transfer chunks of data from external SDRAM to on-chip internal memory for fast processing. The chunks are around 1 kB. What is the most efficient way to do it? Can I use a DMA transfer for that purpose? I am afraid a C loop will burn too many cycles because of cache misses. I am not a big expert on TI architecture, but I would appreciate it if you could point me toward the right solution.

Thanks.

  • Yes, you can use DMA to transfer between external memory and internal RAM.  If you're going to go to that much effort then you should do the transfers to L1D for processing. In order to be really efficient you would need to pipeline the transfers such that the next buffer is being transferred while you process the previous.

    Before you go through this effort I HIGHLY advise that you try out your algorithm directly on the external memory to see how the performance is.  The cache generally works wonders with most algorithms.  Only algorithms that randomly access huge buffers of data and with little/no re-use are NOT good for cache.  So check out this wiki page first and hopefully you won't even need to bother with DMA.  In general I find most customers are very pleased with the performance of their algorithms just compiling straight C code and utilizing the cache.  Hopefully that will save you lots of time.

  • Thanks,

    The cache is not working well for my app, but it helps. I did some simple prototyping by preloading some data into IRAM (the twiddle coefficients for my FFTs, for example), and that helped a lot.

    Why are you saying that I should transfer to the L1D instead of just transferring into a scratchpad in IRAM? I am still planning to have an L1D for other, not-so-critical data.

    Yes I will pipeline the transfers. Is there a tutorial somewhere to get me started?

  • Actually, a better solution would be to prefetch the data into cache, but I do not think the C674x has a prefetch instruction.

  • pascal said:
    Why are you saying that I should transfer to the L1D instead of just transferring into a scratchpad in IRAM?

    If you're going to go through the effort of having the DMA transfer the data into faster memory, you should put it into the fastest memory.  Otherwise you're only getting half of the benefit!

    pascal said:
    I am still planning to have an L1D for other, not-so-critical data.

    I'm a little confused why you would put your less critical data into L1D.  The L1D memory is the fastest and most "precious" memory in the device.  You should use this memory either as cache or for your most critical data (or often a mix).

    pascal said:
    Yes I will pipeline the transfers. Is there a tutorial somewhere to get me started?

    Sorry, not that I know of.
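
    For readers looking for a starting point, the pipelining described earlier in the thread (transfer the next buffer while processing the previous one) is commonly done with two "ping-pong" buffers. The following is only a minimal sketch in plain C, not TI code: `dma_start()` and `dma_wait()` are hypothetical placeholders (here `dma_start()` simply calls `memcpy` so the control flow can be run anywhere), and the buffer names and sizes are assumptions. On a real C674x the placeholders would be replaced by the device's DMA setup and completion check, and `ping_pong` would be placed in internal RAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_BYTES 1024u  /* roughly the 1 kB chunk size from the question */
#define NUM_CHUNKS  4u     /* arbitrary, for illustration only */

/* Stand-ins for external SDRAM and for two on-chip working buffers. */
static uint8_t ext_mem[NUM_CHUNKS * CHUNK_BYTES];
static uint8_t ping_pong[2][CHUNK_BYTES];

/* Hypothetical helper: a real build would program the DMA controller here. */
static void dma_start(void *dst, const void *src, size_t n)
{
    assert(n <= CHUNK_BYTES);
    memcpy(dst, src, n);   /* synchronous placeholder for an async transfer */
}

/* Hypothetical helper: a real build would wait for transfer completion. */
static void dma_wait(void)
{
}

/* Example workload: sum the bytes of one chunk. */
static uint32_t process(const uint8_t *buf, size_t n)
{
    uint32_t sum = 0;
    size_t i;
    for (i = 0; i < n; i++)
        sum += buf[i];
    return sum;
}

uint32_t run_pipeline(void)
{
    uint32_t total = 0;
    size_t i;

    /* Prime the pipeline: fetch chunk 0 into buffer 0. */
    dma_start(ping_pong[0], ext_mem, CHUNK_BYTES);

    for (i = 0; i < NUM_CHUNKS; i++) {
        dma_wait();                       /* chunk i is now in buffer i & 1 */

        if (i + 1 < NUM_CHUNKS)           /* kick off chunk i + 1 into the other buffer */
            dma_start(ping_pong[(i + 1) & 1],
                      ext_mem + (i + 1) * CHUNK_BYTES, CHUNK_BYTES);

        total += process(ping_pong[i & 1], CHUNK_BYTES);  /* work on chunk i */
    }
    return total;
}
```

    With `ext_mem` filled with 1s, `run_pipeline()` returns 4096 (4 chunks of 1024 bytes). Because the placeholder copy is synchronous, the sketch shows only the ordering of the calls, not any real overlap between transfer and processing.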

  • pascal said:

    Actually, a better solution would be to prefetch the data into cache, but I do not think the C674x has a prefetch instruction.

    There's no pre-fetch instruction so effectively you will be doing the pre-fetch with DMA.

  • I agree with everything you say, and I prefer to let the L2/L1 caches do their job. I looked at the cache documentation for the C674x, and it talks about the 'touch' loop to bring data into cache. But I am not clear why that works. For example, why is:

       touch(data_in, sizeof(data_in));

       process(data_out, data_in, sizeof(data_in));

    faster than just a call to "process" without the "touch" call? If data_in is not in cache, isn't "touch" going to stall on cache misses the same way "process" would?

  • pascal said:

    I agree with everything you say, and I prefer to let the L2/L1 caches do their job. I looked at the cache documentation for the C674x, and it talks about the 'touch' loop to bring data into cache. But I am not clear why that works. For example, why is:

       touch(data_in, sizeof(data_in));

       process(data_out, data_in, sizeof(data_in));

    faster than just a call to "process" without the "touch" call? If data_in is not in cache, isn't "touch" going to stall on cache misses the same way "process" would?

    When you call touch() the data is accessed in a very specific way.  Only a single byte is accessed per cache line.  Furthermore, two cache lines are accessed in parallel and new cache lines are "touched" on every single instruction.  This allows you to exploit "cache miss pipelining": you pay a larger penalty for the first cache miss, but subsequent cache misses overlap the first, so the overall cache miss penalty is much lower.  If you were to execute your code without touch() then you would instead have a single cache miss, then a bunch of hits (assuming sequential data access), then another miss 128 bytes later.  Since these misses are so far apart in time they would not overlap and you would not get this benefit.
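
    To make the access pattern concrete, here is a rough C sketch of a touch-style loop that reads one byte per cache line. It is not TI's routine (which, per this thread, is an assembly file that also touches two lines in parallel), and the name `touch_sketch` and the 64-byte line size are assumptions for illustration; check the cache documentation for the line size of the cache level you are targeting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* assumed line size, adjust for your device */

/*
 * Read one byte from every cache line covered by [buf, buf + n).
 * The loads go through a volatile pointer and are summed into the
 * return value so the compiler cannot discard them.
 */
uint32_t touch_sketch(const void *buf, size_t n)
{
    const volatile uint8_t *p = (const volatile uint8_t *)buf;
    uint32_t acc = 0;
    size_t i;

    assert(buf != NULL || n == 0);

    for (i = 0; i < n; i += CACHE_LINE_BYTES)
        acc += p[i];          /* one access per line: each is a potential miss */

    if (n > 0)
        acc += p[n - 1];      /* covers the final line if buf is not line-aligned */

    return acc;
}
```

    It would be called the same way as in the question, for example `touch_sketch(data_in, sizeof(data_in));` before `process(...)`. Whether a compiler schedules these loads tightly enough for the misses to overlap is not guaranteed, which is presumably why the documented version is hand-written assembly.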

    So did you add the assembly file to your function and try it out?  If so, please report back the numbers for the 2 cases.  I've not actually tried the touch loop so I'm interested to hear your results.

  • I have definitely seen improvements when 'touching'. I will need to experiment a little more and will get back to you.

    Thanks,

    Pascal

  • Pascal,

    Were you ever able to quantify your improvement?  Just curious.