TMS570LS3137 - Dhrystone Low Score

Other Parts Discussed in Thread: HALCOGEN, RM57L843, TMS570LS3137

hello everybody,

I am using the TMS570LS3137 HDK with CCS version 6.0.1.00040 and HALCoGen 04.01.00.

I want to run Dhrystone at 180MHz, but the best score I manage to get is 121.6 DMIPS (0.68 DMIPS/MHz) with the ARM-recommended compilation flags.
As you can see, it's far from the 1.66 DMIPS/MHz of the datasheet.
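For reference, a minimal sketch of how these figures are usually derived. It assumes the conventional Dhrystone normalization, in which 1 DMIPS corresponds to 1757 Dhrystones per second (the VAX 11/780 score). The function names are hypothetical and are not part of the Dhrystone sources.

```c
/* Conventional reference score: a VAX 11/780 runs 1757 Dhrystones/s = 1 DMIPS. */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Convert a measured Dhrystone rate (runs per second) to DMIPS. */
double dmips_from_rate(double dhrystones_per_sec)
{
    return dhrystones_per_sec / VAX_DHRYSTONES_PER_SEC;
}

/* Normalize a DMIPS score by the CPU clock in MHz. */
double dmips_per_mhz(double dmips, double cpu_mhz)
{
    return dmips / cpu_mhz;
}
```

With these helpers, 121.6 DMIPS at 180 MHz gives about 0.68 DMIPS/MHz, the figure quoted above.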

setup :

  • I generated the project files with HALCoGen using the basic configuration: 180MHz, all drivers deactivated.
  • The Dhrystone code comes directly from the official repository.
  • I use the PMU for time measurement, with overflow management. This is switched on by the first #define in dhry.h, and it gives nearly the same results as the native time().
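As an illustration of the overflow management mentioned in the last bullet, here is a minimal sketch. It assumes a 32-bit free-running cycle counter and a separately maintained count of how many times it wrapped between the two readings. The function names are hypothetical; reading the actual PMU registers is device-specific and is left out.

```c
#include <stdint.h>

/* Total cycles between two readings of a 32-bit cycle counter.
 * ASSUMPTION: 'wraps' is the number of counter overflows observed
 * between the 'start' and 'end' readings (for example, counted from
 * the PMU overflow flag). */
uint64_t elapsed_cycles(uint32_t start, uint32_t end, uint32_t wraps)
{
    return ((uint64_t)wraps << 32) + end - start;
}

/* Convert a cycle count to microseconds for a CPU clock given in MHz. */
double cycles_to_us(uint64_t cycles, double cpu_mhz)
{
    return (double)cycles / cpu_mhz;
}
```

At 180 MHz a 32-bit cycle counter wraps roughly every 23.9 seconds, so a long benchmark run needs the wrap count to produce a correct total.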

Tries :

  • At first I thought this was because I run the code in CCS debug mode (to have the printf), so I managed to redirect the printf to the UART and monitor the output with the terminal out of the CCS debug session. There was no difference.
  • Lower optimization levels gave me 60 DMIPS at 180MHz.
  • I also tried to reproduce this guy's setup, without better results.

Other Bugs :

  • My Dhrystone is called inside a while(1) loop, but it crashes after 16 runs through the loop.

I'm kinda stuck here... I don't understand what I'm missing.

please find attached here my project files with the HalCoGen configuration files : 6724.Dhrystone_simple.zip

  • Hi Benjamin,

    I'll take a look. But I think it's been pretty difficult in the past to hit the CPU core # on real silicon.
    Still as you noted you are pretty far off.

    -Anthony
  • Hi Benjamin,

    I looked at the other post - and tried to recreate w. the RM57L843. I can get 330DMIPS out of it by turning on the optimizations.

    Make sure you try the 16-bit mode instead of 32-bit mode when you compile, as generally the Thumb2 instruction set performs better especially on the devices with flash and no instruction cache .. because the Flash wrapper is sized for 16-bit opcodes.

    You may also need to try running from SRAM on that device to get the highest number (at high frequency).

    Back to the RM57L843 - I hooked up a trace analyzer (XDSPROTRACE) and it's giving a pretty high number for the strcmp function from the runtime library. It's a bit hard to read all the output to be honest, because with the heavy amount of optimization things are scrambled, but it may be that the runtime library has to be tweaked to make this particular benchmark score higher.

    But unless you are doing a lot of string comparisons, you might consider just using this benchmark as a relative comparison for say 'code running from RAM v.s. code running from EMIF...' rather than worrying about hitting the ARM #.

    -Anthony
  • Just found these links googling:
    dell.docjava.com/courses/cr346/.../DhrystoneMIPS-CriticismbyARM.pdf
    blog.riscv.org/.../
    They make me think that most of the difference IS probably in the string functions.

    I stepped through our optimized string function and it was four instructions, a 'load byte', a compare, branch, store.
    The reason it's simple is that you don't know how a generic string is aligned in memory and you don't know the size of the string, it's null terminated.
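A minimal sketch of what such a generic byte-at-a-time routine looks like in C, shown here as a compare. This is an illustration only, not the source of the TI runtime library, and the name strcmp_bytes is hypothetical.

```c
/* Generic string compare: one byte per iteration.
 * No assumption is made about alignment or length; the loop stops at
 * the first differing byte or at the null terminator. */
int strcmp_bytes(const char *a, const char *b)
{
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    /* Compare as unsigned char, as the standard strcmp does. */
    return (unsigned char)*a - (unsigned char)*b;
}
```

Each iteration costs a byte load, a compare and a branch per character, which is why the routine is robust but slow on long strings.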

    Given that these references show a big change if you tweak the string function - and given that the trace data shows 30% of the PC trace within the strcmp runtime function - I'm inclined to conclude that if you want to produce the 'datasheet' number you have to put a lot of effort into optimizing the string functions - and that's probably not a useful exercise unless string functions happen to be what you plan to use the device for.

    There's another mention of divide being a big one in terms of performance - but the Cortex R has a hardware divide so I assume the compiler is using it. It probably is worth making sure that this is happening.

    Anyway - unless you really need to go through the exercise - I would stick with the generic string functions as they're going to be more robust the way they are - and they won't really impact performance if you're not doing a lot of string processing.
  • hello Anthony,

    Thank you for all your advice. The lack of performance is mainly due to the memory map choice, as you can see below (load : FLASH | run : RAM).

    This is not shown below, but running the benchmark from RAM without the proper ARM string.h is also a speed killer.

    Benchmark: Dhrystone, chip: TMS570LS3137, bench runs: 1 000 000 (same for every row).

    Freq (MHz)  Tested area                Parameters                                                  Opt. level  opt_for_speed  DMIPS   DMIPS/MHz  µs/Dhry      Dhry/s
    180         Default linker parameters  HALCoGen out-of-the-box                                     0           0              59,15   0,329      9,62218028   103926,55
    180         Default linker parameters  HALCoGen out-of-the-box                                     3           0              80,92   0,450      7,033514132  142176,44
    180         Default linker parameters  HALCoGen out-of-the-box                                     3           5              121,67  0,676      4,677833185  213774,19
    180         Default linker parameters  Addition of #include <string.h>                             3           5              132,7   0,737      4,289012536  233153,9
    160         load : FLASH | run : RAM   Addition of #include <string.h> | code state : Thumb 16bit  3           5              216,3   1,352      2,6313082    380039,1
    160         load : FLASH | run : RAM   Addition of #include <string.h> | code state : 32bit        3           5              208,9   1,306      2,724518734  367037,3

    best regards,

    Benjamin.

  • Hi Benjamin,

    Super. It looks like you are getting about as good a result as can be expected without really tweaking the string routines.
    To go from 208 to 160*1.6 = 256 you would probably need to mess with the string copy, and even then I'm not sure you would hit the 256, because you're sharing the RAM between instruction and data fetches, while in theory the CPU can fetch an instruction on TCM A (flash) and data from TCM B (RAM) in parallel. It's just that the wait states of the flash have some impact on performance at 160MHz, and it turns out you get better performance at this frequency running from RAM.

    If you wanted to see what the CPU can do at its max, you'd probably want to set the flash for zero wait states, run your code from flash and keep data in TCM. This would limit you to running with the pipeline mode of the flash disabled and at 45MHz, but you could get the DMIPS/MHz that way if you wanted.

    I was testing out on the RM57L843 first myself - because these issues are largely taken care of by the large instruction and data caches on that device.

    Curious what your take is on the importance of string processing. I don't think it's worth trying to come up with optimized string processing functions just for the sake of gaming dhrystone .. but on the other hand if your application does a lot of string processing then having optimized string processing functions - even if they required alignment of strings in memory - might be worthwhile.

    Not exactly sure how these would be written, but the USUB8 instruction looks like it could check 4 bytes at a time for the null termination, for example. Then if the null isn't there you could do a 4x4 compare, while if the null is there you could do a serial compare (for the last iteration of the compare loop). Just food for thought.
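A rough sketch of that idea in portable C, under stated assumptions. Instead of USUB8 it uses the well-known bit trick for detecting a zero byte in a 32-bit word, so it is a swapped-in technique and not the instruction sequence suggested above. It assumes both buffers are padded to a multiple of 4 bytes, so that whole-word loads never read past the end of the allocation. The names has_zero_byte and strcmp_words are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the four bytes of w is 0x00. */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0;
}

/* Word-at-a-time string compare.
 * ASSUMPTION: both buffers are padded to a multiple of 4 bytes, so a
 * 4-byte load starting inside the string stays inside the buffer. */
int strcmp_words(const char *a, const char *b)
{
    for (;;) {
        uint32_t wa, wb;
        memcpy(&wa, a, sizeof wa);   /* 4 bytes of a */
        memcpy(&wb, b, sizeof wb);   /* 4 bytes of b */
        if (wa != wb || has_zero_byte(wa)) {
            break;                   /* difference or terminator in this word */
        }
        a += 4;
        b += 4;
    }
    /* Serial compare for the last word only. */
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return (unsigned char)*a - (unsigned char)*b;
}
```

The fast path handles four characters per iteration, and the byte loop runs at most four times at the end. Whether this beats the generic routine on a given device would have to be measured; the padding requirement is the price paid for the speed-up.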