Hi,
I am currently working on the optimization of a specific routine on the OMAP-L137 evaluation board. All the optimization hints that I found in the optimization tutorials have really helped a lot to reduce complexity without switching from C to assembler.
However, one problem remains that I have not been able to work around so far: in my algorithm I read entries from a huge matrix. Because of its size, this matrix is located in external memory. Reading one entry seems to take 27 cycles on average, which slows down the overall performance. I understand that the cache should be used to speed up processing, but how can I actively influence the use of the cache? I read through the cache-related documents from TI and learned what to do in principle, but I am still missing concrete strategies.
What should I do?
1) What I saw in the documents: when working on buffers located in external memory, work on segments "locally" so that the data can be served from the cache on the second or third read access. This does not work in my case, since I always read all elements one after the other.
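To make the access pattern concrete, here is a simplified sketch. It is not my actual routine: the function name stream_sum, the float element type and the summation are placeholders, and it assumes that each entry is needed only once per pass.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the routine (names and element type are
 * placeholders): the matrix is walked in address order and, by
 * assumption, each entry is used only once, so there is no second or
 * third access that could be served from the cache. */
float stream_sum(const float *matrix, size_t rows, size_t cols)
{
    float acc = 0.0f;
    size_t n = rows * cols;
    size_t i;

    assert(matrix != NULL || n == 0);

    for (i = 0; i < n; ++i) {
        acc += matrix[i];   /* single sequential pass over all entries */
    }
    return acc;
}
```

With this pattern, working "locally" on segments does not add any reuse, which is why the advice from the documents does not seem to apply directly.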
Is there a recipe for this case? Should I use DMA transfers to move data from external to internal memory in parallel with other processing tasks? This would make the programming of my routine significantly more complex.
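For reference, this is roughly the structure I imagine for the DMA variant: two internal buffers used in ping-pong fashion, so that one is being filled while the other is being processed. This is only a sketch under assumptions. dma_start and dma_wait are hypothetical placeholders and not a TI API; here they are backed by a blocking memcpy so that the code runs on any host, which means nothing actually overlaps in this version. CHUNK_LEN and pingpong_sum are made-up names as well.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_LEN 4  /* entries per internal buffer; placeholder size */

/* Hypothetical placeholder for submitting a DMA transfer. Here it is a
 * blocking memcpy; on the target it would only start the transfer. */
static void dma_start(float *dst, const float *src, size_t count)
{
    memcpy(dst, src, count * sizeof *dst);
}

/* Hypothetical placeholder for waiting on completion. Nothing to do
 * here because the stand-in copy above is synchronous. */
static void dma_wait(void)
{
}

/* Sums n entries from "external" memory through two small buffers.
 * On the target the buffers would have to be placed in internal RAM. */
float pingpong_sum(const float *ext, size_t n)
{
    static float buf[2][CHUNK_LEN];
    float acc = 0.0f;
    size_t done = 0;
    size_t cur_len = (n < CHUNK_LEN) ? n : CHUNK_LEN;
    int cur = 0;

    assert(ext != NULL || n == 0);
    if (cur_len == 0) {
        return 0.0f;
    }

    dma_start(buf[cur], ext, cur_len);          /* prefetch first chunk */

    while (done < n) {
        size_t next_off = done + cur_len;
        size_t next_len = 0;
        size_t i;

        dma_wait();                             /* current chunk is ready */

        if (next_off < n) {                     /* start fetching the next one */
            next_len = n - next_off;
            if (next_len > CHUNK_LEN) {
                next_len = CHUNK_LEN;
            }
            dma_start(buf[cur ^ 1], ext + next_off, next_len);
        }

        for (i = 0; i < cur_len; ++i) {         /* process the current chunk */
            acc += buf[cur][i];
        }

        done = next_off;
        cur ^= 1;                               /* swap ping and pong */
        cur_len = next_len;
    }
    return acc;
}
```

Even in this reduced form, the bookkeeping for offsets, partial last chunks and buffer swapping is noticeably more involved than a plain loop, which is the added complexity I would like to avoid if there is a simpler cache-based approach.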
Thank you in advance for any assistance,
best regards,
HK