Experiences executing from RAM

Using a TM4C123 and running at 80MHz I have a performance bottleneck.

 

I am, of course, doing all the usual things (moving work out of the bottleneck, changing algorithms, etc.), but I am also looking at the possibility of moving the code into on-board RAM.

 

The flash is noted as running at 40MHz (1/2 clock speed above 40MHz) but with 64 bit access so that it fetches 2 instructions per cycle with branch prediction and speculative pre-fetch.  The RAM is zero wait access if sequential fetches are to alternate banks.

I can easily envision a range of results from moving a particular piece of code to RAM (including a slowdown), so I was wondering whether anyone has done this and what results they saw.

 

Robert

  • The flash is noted as running at 40MHz (1/2 clock speed above 40MHz) but with 64 bit access so that it fetches 2 instructions per cycle with branch prediction and speculative pre-fetch.

    A 64-bit fetch does not correspond to two Thumb-2 instructions. Most Thumb-2 instructions are 16 bits wide and only a few are 32 bits, so one fetch usually holds more than two. That means one prefetch can usually refill the pipeline. Only branches are fully affected by the wait states.
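
    As a rough check of that fetch-width argument, here is a minimal sketch that computes how many instructions a flash port can deliver per core clock. The function name and parameters are made up for illustration; the numbers plugged in (64-bit fetch, 40MHz flash, 80MHz core) are the ones quoted in this thread, and the model ignores branches and prefetch misses entirely.

    ```c
    /* Instructions a flash read port can deliver per core clock cycle,
     * assuming purely sequential fetches (no branches, no stalls).
     * Hypothetical helper for back-of-envelope estimates only. */
    static double instr_per_core_cycle(unsigned fetch_bits, unsigned instr_bits,
                                       unsigned core_hz, unsigned flash_hz)
    {
        double instr_per_fetch = (double)fetch_bits / (double)instr_bits;
        double fetches_per_core_cycle = (double)flash_hz / (double)core_hz;
        return instr_per_fetch * fetches_per_core_cycle;
    }
    ```

    With 64-bit fetches at 40MHz and an 80MHz core, this gives 2 instructions per core cycle for 16-bit encodings and 1 per core cycle for 32-bit encodings, so under these assumptions straight-line code should not starve.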

    The RAM is zero wait access if sequential fetches are to alternate banks.

    You might want to look at the memory map and internal bus layout of your MCU, but "zero wait state" does not necessarily hold. DMA accesses could steal instruction fetch cycles from you asynchronously.

    Executing code from RAM is more popular on a competitor's MCU for two reasons:

    • It has "CCM" (core coupled memory), which is inaccessible to DMA.
    • The Flash has just one bank, and execution from Flash stalls completely when using IAP.

    Depending on your code, you might need to move the vector table, too.

  • I keep forgetting these are running Thumb code, but that just fits in with my thought that straight-line code will not be sped up in RAM.  That just leaves open the question of whether branches could be faster or slower in RAM.

    Should not need to move the vector table to RAM.  This is within the highest priority I/O vector but it is not occurring at a high enough frequency for that initial indirection to be my worry. 

    I measure the inner loop as taking around 144 instruction cycles and I count roughly 100 instructions, so that's in ballpark agreement, especially considering there are a couple of divides in the mix.  I suspect I would be doing well to gain a few percent.
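
    For reference, the arithmetic behind that comparison can be sketched as below. This assumes the cycle count comes from two reads of a free-running 32-bit cycle counter (for example the Cortex-M4 DWT cycle counter); how the measurement was actually taken is not stated here, and both helper names are hypothetical.

    ```c
    #include <stdint.h>

    /* Elapsed cycles between two reads of a free-running 32-bit counter.
     * Unsigned subtraction stays correct across a single wraparound. */
    static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
    {
        return end - start;
    }

    /* Average cycles per instruction for a measured code section. */
    static double cycles_per_instruction(uint32_t cycles, uint32_t instructions)
    {
        return (double)cycles / (double)instructions;
    }
    ```

    For 144 cycles over roughly 100 instructions this works out to about 1.44 cycles per instruction, which leaves limited room for fetch stalls once the divides are accounted for.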

    I did get a few percent by refactoring, also better code as a result.

    No DMA yet.  The I/O overhead is about 5 to 10% so it's not a large factor either.  I am seeing something unexpected in the I/O reading but that is for a separate thread. I'll need to read up on how this micro's DMA works before making use of it.

     

    Robert

  • Well, that straightens up your ideas behind that question.

    As you might know, the Flash interface is not part of the ARM package, meaning the connection to the IBus is vendor-dependent. ST, for instance, has a small cache of (AFAIK) four 128-bit words, called ART, to "hide" the slowness of their Flash (which already needs wait states at >25MHz). Actually, I don't know many details about such mechanisms in Tiva MCUs. But with SRAM running at zero wait states, no such thing is necessary, and no branch delay would be introduced.

    If you do not move interrupt vector routines directly, there is no need to relocate the vector table. But you will still need to compile that code as PIC (position independent), and add some indirection in the places where the relocated code is called.
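
    The indirection part can be sketched as follows. Note that this is a common alternative to PIC rather than PIC itself: it assumes a GCC-style toolchain that links the function at its RAM run address, with startup code copying it out of Flash. The `USE_RAMFUNC` build flag, the `.ramfunc` section name, and the function names are all hypothetical, and the linker script would have to define that section. Without the flag, the code builds as an ordinary function.

    ```c
    #include <stdint.h>

    #ifdef USE_RAMFUNC /* hypothetical flag set only by the embedded build */
    /* Assumed section name; the linker script must place it in SRAM and the
     * startup code must copy its contents from Flash before first use. */
    #define RAMFUNC __attribute__((section(".ramfunc"), noinline))
    #else
    #define RAMFUNC
    #endif

    /* Stand-in for the time-critical inner loop. */
    RAMFUNC uint32_t hot_loop(const uint16_t *samples, uint32_t n)
    {
        uint32_t acc = 0;
        for (uint32_t i = 0; i < n; i++) {
            acc += samples[i];
        }
        return acc;
    }

    /* Calling through a pointer gives the indirection explicitly: a direct
     * branch-with-link from Flash cannot reach the SRAM address range. */
    typedef uint32_t (*hot_loop_fn)(const uint16_t *, uint32_t);
    static volatile hot_loop_fn hot_loop_ptr = hot_loop;

    uint32_t run_hot_loop(const uint16_t *samples, uint32_t n)
    {
        return hot_loop_ptr(samples, n);
    }
    ```

    Callers in Flash would use `run_hot_loop()`; whether the result is faster still has to be measured on the target.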

    But if you run DMA, it will share bandwidth with the instruction fetches on the DBus. More importantly, it will kick in asynchronously, introducing significant jitter.

    Not that I want to steer you away from Tiva MCUs, but competitors have M4F MCUs running at up to about 200MHz, leaving you some more cycles available. One competitor has an asymmetric dual-core MCU with an M0 dedicated to I/O, which could be of interest to you.