Flash memory cache possibly affecting CPU performance

Other Parts Discussed in Thread: TM4C129XNCZAD

Hi TI community. This is my first time posting here.

I have been investigating Cortex-M4 ARM uCs that are available for immediate purchase in my country. I have narrowed the choice down to two uCs:

  • TM4C129XNCZAD
  • STM32F407VGT6 (in further text - STM or STM MCU)

While I mostly prefer the TI uC (it is superior in almost every aspect), I am having second thoughts when considering code execution performance.

Regarding the flash memory cache, I have looked at several other TI uCs and can't understand why the cache is so poor (in my opinion).

Having only a 4x256-bit cache (32 instructions) will most likely cause misses in almost all loops and repeated function calls (in contrast with the STM cache organization, which will most likely NOT cause a miss in these situations). Literal use degrades the cache even more. The suggestion that the developer should align branch destinations to 8 words is next to impossible to follow when not writing code in assembler and thoroughly analyzing it. Even though the STM would also (for best performance) require code to be aligned to 4 words, this could be overcome if the branch code path is taken again before that code is evicted from the cache. With only 4 entries in the cache, this probability is smaller.

On the other hand, the STM cache is 64x128 bits for instructions and 8x128 bits for data (literals). Apart from being obviously 8 times bigger (256 instructions), this cache also takes branches into consideration (I am not sure how, since it is not described in the datasheet). In the case of an interrupt (since interrupt routines are usually tens of instructions long), there is a high probability that the interrupted code is still in the cache on return. I think that in practice this cache would perform significantly better.
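
As a sanity check on the sizes quoted above, here is a minimal C sketch. It assumes every instruction occupies one 32-bit word; Thumb-2 code mixes 16-bit and 32-bit encodings, so the real instruction counts can be higher. The helper name `cache_words` is made up for this illustration.

```c
#include <assert.h>

/* Capacity of a flash prefetch buffer or cache, in 32-bit words.
 * Assumption: one instruction == one 32-bit word. */
static unsigned cache_words(unsigned lines, unsigned bits_per_line)
{
    /* Line widths discussed in this thread are multiples of 32 bits. */
    assert(bits_per_line % 32u == 0u);
    return lines * (bits_per_line / 32u);
}

/* How many times larger the second cache is than the first. */
static unsigned size_ratio(unsigned small_words, unsigned large_words)
{
    assert(small_words != 0u);
    return large_words / small_words;
}
```

With the figures from this post, `cache_words(4, 256)` gives 32 words for the TM4C129 and `cache_words(64, 128)` gives 256 words for the STM32F407 instruction cache, so `size_ratio(32, 256)` is 8.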

Is there any TI MCU in this product range with a better cache? And why hasn't the cache been made better in this particular MCU? In my opinion this significantly slows down the CPU. What is your opinion on this subject?

Above, I have expressed my opinion and point of view; if something isn't correct, please state that in your answer.

Looking forward to your answers. Sincerely,

Filip.

  • Quite good for any, "first time post" I'd say.  Well thought, points detailed & justified - Bravo!

    Several here have requested a near, "Gold Standard" for such MCU performance comparisons - i.e. Benchmarks.  Note that there are many makers of Cortex M4 - and many such devices have been, "benchmarked."  Parts here - to my best/current knowledge - have escaped such benchmarking. 

    We happen to use MCUs from many vendors. (best/brightest devices - not always, at all times, resident w/in one house.)

    Perhaps your purchase of an Eval board from both vendors - and your creation of a test program which best duplicates the bulk of your code requirement - proves the best/fastest means to acquire, "real" insight...

  • One option to get around the cache issue altogether that may be worthwhile for the TM4C129 is to execute code from RAM.  Since it has 256 KB of RAM there is often plenty of room for select functions and features.  Reserve this treatment for the parts of the code that are either very time critical or are called often.  For example, init code is probably not worth putting in SRAM.  A data analysis function that does some heavy number crunching and is called once every 10 milliseconds would be a good candidate.

     

    See http://www.ti.com/lit/an/spraa46a/spraa46a.pdf for a Code Composer application note on how to make the linker do most of the work for you.  The linker understands how to store something in flash and execute it from RAM.  It then generates copy tables that you pass to a copy routine during startup in main.  After that it works like normal code: you call the function as usual, but it executes from RAM.  This app note is a bit long and scary for beginners.

    The outline is as follows:

    1) In the linker file, make a new section.  Use the parameters run=SRAM, load=FLASH, table

    .cpy_this: load=FLASH, align = 32, run=SRAM, table( _my_copy_table )

     

    2) Use a #pragma to place the functions of interest into this new section.  The section name must match the one in the linker file.

    #pragma CODE_SECTION(my_fast_function, ".cpy_this")

    Define my_fast_function here, immediately after the pragma.

     

    3) #include <cpy_tbl.h>, then in main call the copy function with the copy table generated by the linker, which holds the flash and SRAM addresses.

    extern COPY_TABLE _my_copy_table;

    copy_in(&_my_copy_table);

     

    4) The rest of the code calls my_fast_function without change and is ignorant of the physical location of the function.
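
    To make step 3 less of a black box, here is a host-side sketch of what a copy-table routine conceptually does.  The names below (sim_copy_record, sim_copy_table, sim_copy_in) are hypothetical and are not the definitions from cpy_tbl.h; on a PC, plain byte arrays would stand in for flash and SRAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: where a section is stored and where it must run. */
typedef struct {
    const uint8_t *load_addr; /* source, e.g. the image kept in flash */
    uint8_t       *run_addr;  /* destination, e.g. the run address in SRAM */
    size_t         size;      /* number of bytes to copy */
} sim_copy_record;

/* Hypothetical table: a record count plus a pointer to the records. */
typedef struct {
    size_t                 num_recs;
    const sim_copy_record *recs;
} sim_copy_table;

/* Copy every record from its load address to its run address. */
static void sim_copy_in(const sim_copy_table *table)
{
    assert(table != NULL);
    for (size_t i = 0; i < table->num_recs; ++i) {
        const sim_copy_record *r = &table->recs[i];
        memcpy(r->run_addr, r->load_addr, r->size);
    }
}
```

    On the real part the linker fills in the table and the startup code makes the one copy_in call; after that, calls to the function land on the SRAM copy.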

     

     

  • Thanks cb1.

    Dexter, your suggestion solves my problem altogether, thank you very much.

    Filip.

  • Do consider that minus proper benchmark comparisons - you've selected a, "premium priced, single sourced" device.  And - should availability and/or other issues arrive - what then?

    The "tightly focused" solution proposed evades/avoids your real issue (flash memory cache - does it not?) while restricting your freedom of choice and proper comparison.   Might it be reasonable to ask, "Why is that?"

    And - if the issue reduces to, "Flee from Flash" - cannot "any" TM4C/other M4 serve - not just the, "high-priced" spread?

    Sometimes such, "quick answer" may not prove, "best answer..."  (especially when your original concern was so well voiced/presented - and has been (here) rendered silent...)

    Properly devised/implemented benchmarks - kicked to the curb here - remain (I believe) your real/critical (and unmet) issue!

  • Hello Dexter. This is a great post, however, I have found some issues.

    INTEGRIS Dexter said:

    The outline is as follows:

    1) in the linker file make a new section.  use the parameters  run=SRAM, load=FLASH, table

    .cpy_this: load=FLASH, align = 32, run=SRAM, table( _my_copy_table )

    When I modify the linker file I get this warning:

    .cpy_this: load=FLASH, align = 32, run=SRAM, table( _my_copy_table )

    Description: #10199-D copy table operator (_my_copy_table) ignored for ".cpy_this": copy table operator cannot be associated with empty output section
    Resource: TEST_prog_ccs.cmd
    Path: /TEST_prog_ccs
    Location: line 94
    Type: C/C++ Problem

    Logically, an error is thrown later as cpy_this is not found.

    I am defining .cpy_this just below .init_array. Is it ok there?

    Thank you

  • PAk said:
    #10199-D copy table operator (_my_copy_table) ignored for ".cpy_this": copy table operator cannot be associated with empty output section

    Did you use #pragma CODE_SECTION(<func_name>, "<sect_name>") to force a function into the .cpy_this section? The section will be empty if you haven't, and that'll be why the copy table isn't being created.

    If you have put a function into .cpy_this, it's possible that the compiler is helpfully "optimising" your code by inlining the function at the call site (i.e. putting it in flash, where you don't want it). If that happens, you'll need to use #pragma FUNC_CANNOT_INLINE to prevent the compiler from inlining the function that you want to place into RAM.
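
    Put together, the two pragmas might look like the sketch below.  The function name my_fast_function and its body are placeholders, and the __TI_COMPILER_VERSION__ guard is only there so the snippet also builds with other compilers, which would otherwise warn about unknown pragmas.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifdef __TI_COMPILER_VERSION__
/* Place the function in the section named .cpy_this (load=FLASH, run=SRAM). */
#pragma CODE_SECTION(my_fast_function, ".cpy_this")
/* Keep the compiler from inlining it into a caller that lives in flash. */
#pragma FUNC_CANNOT_INLINE(my_fast_function)
#endif
uint32_t my_fast_function(const uint32_t *samples, size_t count)
{
    uint32_t sum = 0;

    /* Placeholder workload: sum the samples. */
    assert(samples != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        sum += samples[i];
    }
    return sum;
}
```

    Both pragmas have to appear before the function is declared or defined, and the section name has to match the one used in the linker command file.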