AM335x caching slows down execution

I'm using a BeagleBone Black with StarterWare and need to enable the L1 and L2 caches for performance reasons. When I do so using CacheEnable(CACHE_ALL), with the MMU configured as in the uartEdma_Cache.c example, performance drops significantly.

I get a lot of interrupts from the DMA controller (McASP and Ethernet), so there are many unpredictable branches, which I guess hurt the instruction cache (is a factor of 10 realistic?). I also tried enabling only the instruction cache and only the data cache, but I get about the same results. This even happens with the L2 cache disabled and only the L1 instruction cache enabled, which doesn't make any sense to me. How can loading single instructions from DDR be that much more efficient than the L1 instruction cache, even with all those cache misses?

That's where I begin to doubt the assumption that the DMA interrupts are the source of the problem. Unfortunately, the debugger doesn't help because you cannot inspect the cache. I've run out of ideas about what to test to find the source of the problem.

Has anybody seen similar problems using the cache on AM335x and found a way to fix it, or any ideas what I could try next?

Best Regards!

For reference: I asked this question in the Sitara Forum and was asked to ask in the StarterWare forum.

e2e.ti.com/.../1924182

  • For my application I moved the Ethernet packet buffers and CPPI structures to non-cached memory. I did the same for the McASP audio buffers. Unless you're doing a lot of processing in place, there's no reason to have any DMA source or sink in cached memory. This also eliminates those hard-to-debug cache coherency glitches. Once the DMA data is no longer thrashing the cache you should see a big performance improvement, since the stack and global variables are much more likely to remain in the data cache.

    I think your instincts about the interrupts are correct. I'm using polling to service the Ethernet and McASP with good results. You might wish to try this if your application allows.
  • Thank you for your reply, James.

    I split the DDR into two halves in the linker command file and set one to cached and one to non-cached. I moved everything to the non-cached memory, except the ISRs and the McASP/Ethernet buffers, because there are many calculations to do on the audio buffer. There are only a handful of cache-specific function calls (invalidate and write), and things are working pretty well now. I still don't know for sure at what point the cache fails in a way that would explain such a large slow-down, but I don't have the time to investigate further because it's running smoothly now.

    For now my conclusion is: When there are many interrupts and I see a big slow-down when activating the instruction cache, I either move the ISRs or everything else to a non-cached part of the memory. It's too bad that you cannot lock parts of the cache on this architecture.

    Best Regards.
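
For anyone landing here later: the cached/non-cached DDR split described above comes down to giving the two halves different memory attributes in the MMU translation table. Below is a minimal sketch of how the first-level section descriptors could be built by hand. It assumes the ARMv7-A short-descriptor format with 1 MiB sections, TEX remap disabled, domain 0 with access enabled in the DACR, and a BeagleBone Black layout of 512 MiB of DDR starting at 0x80000000. The names section_descriptor and map_ddr_split are hypothetical and are not StarterWare functions; StarterWare's own MMU helpers are not used here.

```c
#include <assert.h>
#include <stdint.h>

/* ARMv7-A short-descriptor first-level "section" entry (1 MiB per entry). */
#define SECTION_SIZE      0x00100000u
#define SECTION_TYPE      0x2u                    /* bits[1:0] = 0b10 marks a section */
#define SECTION_B         (1u << 2)               /* B bit */
#define SECTION_C         (1u << 3)               /* C bit */
#define SECTION_AP_RW     (3u << 10)              /* AP[1:0] = 0b11: full read/write */
#define SECTION_TEX(x)    ((uint32_t)(x) << 12)   /* TEX[2:0] in bits [14:12] */

/* Normal memory, write-back, write-allocate: TEX=001, C=1, B=1 */
#define ATTR_CACHED_WB_WA (SECTION_TEX(1) | SECTION_C | SECTION_B)
/* Normal memory, non-cacheable: TEX=001, C=0, B=0 */
#define ATTR_NON_CACHED   (SECTION_TEX(1))

/* Build one section descriptor for a 1 MiB-aligned physical base (hypothetical helper). */
static uint32_t section_descriptor(uint32_t phys_base, uint32_t attr)
{
    assert((phys_base & (SECTION_SIZE - 1u)) == 0u);  /* must be 1 MiB aligned */
    return phys_base | attr | SECTION_AP_RW | SECTION_TYPE;
}

/*
 * Identity-map DDR so that the first cached_size bytes are cacheable and the
 * remainder is non-cacheable. table must hold 4096 entries (one per MiB of
 * the 4 GiB address space) and be 16 KiB aligned when used by the MMU.
 */
static void map_ddr_split(uint32_t *table, uint32_t ddr_base,
                          uint32_t ddr_size, uint32_t cached_size)
{
    uint32_t offset;

    for (offset = 0u; offset < ddr_size; offset += SECTION_SIZE) {
        uint32_t addr = ddr_base + offset;
        uint32_t attr = (offset < cached_size) ? ATTR_CACHED_WB_WA
                                               : ATTR_NON_CACHED;
        table[addr >> 20] = section_descriptor(addr, attr);
    }
}
```

With map_ddr_split(table, 0x80000000u, 0x20000000u, 0x10000000u), the lower 256 MiB of DDR would be cacheable and the upper 256 MiB non-cacheable. The linker command file then has to place code, stack and ordinary data in one half and DMA buffers (or whatever should bypass the cache) in the other. Any buffer that stays in the cached half and is touched by DMA still needs explicit clean/invalidate calls around each transfer.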