
How to optimize accesses to external memory

Other Parts Discussed in Thread: OMAP-L137

Hi,

I am currently working on the optimization of a specific routine on the OMAP-L137 evaluation board. All the optimization hints I found in the optimization tutorials have helped a lot in reducing complexity without switching from C to assembler.

However, one problem remains that I could not work around so far: in my algorithm I read entries from a huge matrix. Due to its size, this matrix is located in external memory. Reading one entry takes an average of 27 cycles and slows down the overall performance. I understand that the cache should be used to speed up processing, but how can I actively influence the use of the cache? I read through the cache-related documents from TI and learned what to do in principle, but I am still missing concrete strategies.

What should I do?

1) What I saw in the documents: if working on buffers located in external memory, work on segments "locally" so that memory segments can be read from the cache on the second/third access. This does not work in my case since I always read all elements one after the other.

Is there a recipe for what to do in this case? Should I use DMA transfers to move content from external to internal memory in parallel with other processing tasks? This would make the programming of my routine significantly more complex.

Thank you for any assistance in advance,

best regards,

HK

  • Since you are looking for ways to optimize performance based on memory types/cache, I think the folks in the device forum might have better advice, so I will move your post to the device forum.

  • Hi HK,

    That's a great question. Though a lot of tutorials talk about system-level and memory optimization, many of them don't put a strategy around these optimization steps. This is generally because different levels of optimization require different amounts of effort, and different applications work well with different strategies. For example, in an image processing algorithm with block-based neighborhood processing, a DMA transfer makes more sense than a cache-based implementation, since the entire image is not required by the processing function, and block sizes are generally small enough that they can be moved to internal memory. Moreover, processor MIPS are freed because the CPU no longer needs to fetch the data itself. We have tried to demonstrate this in the article here:

    http://processors.wiki.ti.com/index.php/C64x%2B_iUniversal_Codec_Creation_-_from_memcpy_to_Canny_Edge_Detector

    This may not be true in an audio algorithm where you have to apply a filter to audio data and the audio samples are stored in external memory. In this case you would configure part of internal memory as data, both for the filter taps that remain constant over the processing cycle and for the buffer holding windowed audio data, while the rest of internal memory would be configured as cache for all the code that is in external memory. When the filter has to be applied, the samples of windowed audio data are moved from external to internal memory using a DMA so that the DSP can apply the filter to the signal locally.
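    To make the block-based pattern concrete, here is a minimal sketch in plain C. Note that memcpy only stands in for the EDMA transfer, and the buffer and function names are made up for illustration; in a real system the internal buffer would be placed in L2/L1D RAM and filled by the DMA:

```c
#include <string.h>

#define BLK 64              /* block size that fits in internal RAM */

/* Stand-in for a buffer placed in internal memory (L2/L1D RAM). */
static short internal_buf[BLK];

/* Placeholder "processing": here just a sum over the block. */
static long process_block(const short *blk, int n)
{
    long acc = 0;
    int i;
    for (i = 0; i < n; i++)
        acc += blk[i];
    return acc;
}

long process_all(const short *ext_data, int total)
{
    long acc = 0;
    int off;
    for (off = 0; off < total; off += BLK) {
        int n = (total - off < BLK) ? total - off : BLK;
        /* 1) "DMA" the block from external to internal memory
              (memcpy models the transfer) */
        memcpy(internal_buf, ext_data + off, n * sizeof(short));
        /* 2) process the block locally, at internal-memory speed */
        acc += process_block(internal_buf, n);
    }
    return acc;
}
```

    The point is that the processing function only ever touches internal memory; the external accesses are batched into block-sized transfers.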

    I do not have a flow chart for you regarding how you should go about your system level optimization but in theory I agree with the flow chart that is mentioned here:

    http://www.eetimes.com/design/signal-processing-dsp/4017594/DSP-performance-tuning-part-1-Cache-DMA-and-frameworks?pageNumber=3

    I generally refer to that article when I am faced with a similar question. Many times it is also a question of managing the performance expectation against the effort required, which is something the flowchart captures accurately.

    If you are looking for specific examples on OMAP-L137, you will find both an EDMA and a cache-based example in the QuickStart rCSL package.

    http://processors.wiki.ti.com/index.php/QuickStartOMAPL1x_rCSL

    Good Luck

    Regards,

    Rahul

  • Dear Rahul,

    Thank you for the valuable answer! Unfortunately the first reference ("article here") does not work; could you please verify it?

    To become a little more detailed: regarding the use of a DMA transfer to copy data from external memory to internal memory, is the following what you would propose to speed up an algorithm?

    1) Start with the algorithm using external memory and find that the processing load is too high.

    2) To speed up, I start the DMA transfer BEFORE the moment at which I would actually use the transferred data in my algorithm. Ideally, the start would be SIGNIFICANTLY ahead of the moment when I use the data.

    3) Before actually using the data that was transferred from external to internal memory, I would have to make sure that the DMA transfer is complete, by means of, say, polling?

    The benefit would be that I can transfer memory and process in parallel, but the efficiency depends on how well I can predict, at one moment in time, which data will be required for processing at a later moment as my algorithm evolves.
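    In pseudo-C, the flow I have in mind would be a ping-pong double buffer roughly like this. dma_submit/dma_wait are placeholders I made up for the real EDMA calls; here memcpy models the transfer synchronously, so the overlap is structural only:

```c
#include <string.h>

#define BLK 32

/* Two internal-memory buffers: while the CPU processes one,
   the DMA fills the other. */
static int ping[BLK], pong[BLK];

static void dma_submit(int *dst, const int *src, int n)
{
    memcpy(dst, src, n * sizeof(int));  /* real code: trigger an EDMA PaRAM set */
}

static void dma_wait(void)
{
    /* real code: poll the EDMA completion bit, or sleep until the ISR fires */
}

long sum_external(const int *ext, int nblocks)
{
    long acc = 0;
    int b, i;
    int *cur = ping, *nxt = pong;

    dma_submit(cur, ext, BLK);          /* step 2: prefetch block 0 early */
    for (b = 0; b < nblocks; b++) {
        dma_wait();                     /* step 3: block b has landed */
        if (b + 1 < nblocks)            /* kick off block b+1 before using b */
            dma_submit(nxt, ext + (b + 1) * BLK, BLK);
        for (i = 0; i < BLK; i++)       /* process block b "in parallel" */
            acc += cur[i];
        { int *t = cur; cur = nxt; nxt = t; }  /* swap ping/pong */
    }
    return acc;
}
```

    Is this, schematically, the structure you had in mind?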

    So this will reduce the portability of my algorithm, since it is very platform specific, right? But I think that would be acceptable.

    From your point of view, do you recommend using the LLC library or the ACPY3 library? I had the feeling that these are a bit of overkill in most cases. I implemented DMA transfers to feed data to and from hardware (McASP) by programming the PaRAM structs and registers directly. Is the DMA transfer part of the XDAIS programming model?

    Thank you very much!

    Best regards

    HK

  • Hi HK,

    I think I messed up my hyperlink while pointing you to the article. I have edited the post and put the link to the article in explicitly. Yes, DMA transfer is part of the XDAIS programming model, and the allocation of internal memory, scratch memory, and DMA channels is all part of the source code distributed from the Canny edge detection wiki link I provided.

    As far as LLC vs. ACPY3 is concerned, if you are running your algorithm on the DSP using multimedia framework components like Codec Engine, then ACPY3 might be the better option. ACPY3 abstracts the configuration of the EDMA registers and the LLD (low-level driver for EDMA) functionality, but using the LLD directly might give you greater flexibility.

    The development flow you have described seems okay to me. Just make sure cache is turned on in step 1. For step 2 you may need to partition DSP internal memory into data and cache. To avoid data corruption you need to ensure that the two regions don't overlap. For step 3 you can use DMA in either polled or interrupt mode; I don't remember which options are available in the ACPY3 interface.
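    For step 2, the partitioning could look like the following hypothetical linker command file fragment. The base address and the sizes are placeholders only; check them against the memory map in the OMAP-L137 datasheet before use:

```
/* Hypothetical linker .cmd fragment: split DSP L2 RAM into a data
   region and a region left for the cache controller.  Addresses and
   lengths are illustrative placeholders. */
MEMORY
{
    L2_DATA  : origin = 0x11800000, length = 0x00020000  /* data  */
    L2_CACHE : origin = 0x11820000, length = 0x00020000  /* cache */
}

SECTIONS
{
    .dma_buffers > L2_DATA   /* ping/pong buffers, filter taps, etc. */
    /* nothing is linked into L2_CACHE; that range is handed to the
       cache controller at startup (L2 cache size configuration) */
}
```

    Because nothing is linked into the cache range, the two regions cannot overlap, which addresses the data-corruption concern above.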

    Regards,

    Rahul