TMS570LS3137 - Dhrystone Low Score

Benjamin GREFFE

Other Parts Discussed in Thread: HALCOGEN, RM57L843, TMS570LS3137

Hello everybody,

I am using the TMS570LS3137 HDK with CCS Version: 6.0.1.00040 and HALCoGen 04.01.00.

I want to run Dhrystone at 180 MHz, but the best score I manage to get is 121.6 DMIPS (0.68 DMIPS/MHz) with the ARM-recommended compilation flags.
As you can see, that is far from the 1.66 DMIPS/MHz of the datasheet.

Setup:

I generated the project files with HALCoGen using the basic configuration: 180 MHz, all drivers disabled.
The Dhrystone code comes directly from the official repository.
I use the PMU for time measurement, with overflow management. This is switched on by the first #define in dhry.h (results are roughly the same as with the native time()).
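The overflow management mentioned above can be sketched as follows. This is only an illustration, not the code from the attached project: the helper names `elapsed_cycles` and `cycles_to_seconds` are hypothetical, and it assumes the 32-bit cycle counter is read at the start and end of the benchmark while an overflow handler counts how many times the counter wrapped in between.

```c
#include <stdint.h>

/* Hypothetical helper: total cycles between two raw 32-bit counter reads.
 * 'overflows' is the number of times the counter wrapped from 0xFFFFFFFF
 * to 0 between the two reads (assumed to be counted by an overflow handler). */
static uint64_t elapsed_cycles(uint32_t start, uint32_t end, uint32_t overflows)
{
    /* Each wrap adds 2^32 cycles; the arithmetic is done in 64 bits so
     * that end < start is handled correctly when a wrap occurred. */
    return (((uint64_t)overflows << 32) + end) - start;
}

/* Hypothetical helper: convert a cycle count to seconds for a given CPU clock. */
static double cycles_to_seconds(uint64_t cycles, double cpu_hz)
{
    return (double)cycles / cpu_hz;
}
```

At 180 MHz a 32-bit cycle counter wraps roughly every 23.9 seconds, so a run of one million Dhrystone iterations stays within one wrap, but counting overflows keeps longer runs correct.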

Attempts:

At first I thought this was because I ran the code in CCS debug mode (to get the printf output), so I redirected printf to the UART and monitored the output in a terminal, outside of the CCS debug session. There was no difference.
Lower optimization levels gave me 60 DMIPS at 180 MHz.
I also tried to reproduce this guy's setup, without better results.

Other bugs:

My Dhrystone is called inside a while(1) loop, but it crashes after 16 runs through the loop.

I'm kinda stuck here... I don't understand what I'm missing.

Please find attached my project files with the HALCoGen configuration files: 6724.Dhrystone_simple.zip

over 9 years ago

0 Anthony F. Seely over 9 years ago

TI__Guru 68940 points

Hi Benjamin,

I'll take a look. But I think it's been pretty difficult in the past to hit the CPU core # on real silicon.
Still as you noted you are pretty far off.

-Anthony

0 Anthony F. Seely over 9 years ago in reply to Anthony F. Seely

TI__Guru 68940 points

Hi Benjamin,

I looked at the other post and tried to recreate it with the RM57L843. I can get 330 DMIPS out of it by turning on the optimizations.

Make sure you try the 16-bit mode instead of the 32-bit mode when you compile. Generally the Thumb2 instruction set performs better, especially on devices with flash and no instruction cache, because the Flash wrapper is sized for 16-bit opcodes.

You may also need to try running from SRAM on that device to get the highest number (at high frequency).

Back to the RM57L843 - I hooked up a trace analyzer (XDSPROTRACE) and it's giving a pretty high number for the strcmp function from the runtime library. It's a bit hard to read all the output, to be honest, because with the heavy amount of optimization things are scrambled, but it may be that the runtime library has to be tweaked to make this particular benchmark score higher.

But unless you are doing a lot of string comparisons, you might consider just using this benchmark as a relative comparison for, say, 'code running from RAM vs. code running from EMIF...' rather than worrying about hitting the ARM #.

-Anthony

0 Anthony F. Seely over 9 years ago in reply to Anthony F. Seely

TI__Guru 68940 points

Just found these links googling:
dell.docjava.com/courses/cr346/.../DhrystoneMIPS-CriticismbyARM.pdf
blog.riscv.org/.../
They make me think that most of the difference IS probably in the string functions.

I stepped through our optimized string function and it was four instructions, a 'load byte', a compare, branch, store.
The reason it's simple is that you don't know how a generic string is aligned in memory and you don't know the size of the string, it's null terminated.
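As an illustration of why such a routine stays that small, here is a minimal sketch of a generic byte-at-a-time copy. It is not the TI runtime library source; the name `copy_bytewise` is hypothetical, and it assumes the function stepped through was a copy routine, since the sequence described includes a store. With no knowledge of alignment or length, each iteration can only load one byte, store it, compare it with the terminator, and branch.

```c
/* Hypothetical sketch of a generic null-terminated string copy.
 * Nothing is assumed about alignment or length, so the loop can
 * only move one byte per iteration. */
static char *copy_bytewise(char *dst, const char *src)
{
    char *d = dst;
    char c;

    do {
        c = *src++;        /* load byte */
        *d++ = c;          /* store byte (the terminator is copied too) */
    } while (c != '\0');   /* compare with the terminator, branch */

    return dst;
}
```

Dhrystone copies and compares 30-character strings on every pass, so a one-byte-per-iteration loop like this runs many iterations per benchmark run.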

Given that these references show a big change if you tweak the string function - and given that the trace data shows 30% of the PC trace within the strcmp runtime function - I'm inclined to conclude that if you want to produce the 'datasheet' number you have to put a lot of effort into optimizing the string functions - and that's probably not a useful exercise unless string functions happen to be what you plan to use the device for.

There's another mention of divide being a big one in terms of performance - but the Cortex R has a hardware divide so I assume the compiler is using it. It probably is worth making sure that this is happening.

Anyway - unless you really need to go through the exercise - I would stick with the generic string functions as they're going to be more robust the way they are - and they won't really impact performance if you're not doing a lot of string processing.

0 Benjamin GREFFE over 9 years ago in reply to Anthony F. Seely

Prodigy 185 points

Hello Anthony,

Thank you for all your advice. The lack of performance is mainly due to the memory map choice, as you can see in the table below (rows with load: FLASH | run: RAM).

It is not shown below, but running the benchmark from RAM without including the proper ARM string.h is also a speed killer.

| Benchmark | Chip | Freq (MHz) | Tested Area | Parameters | Optimisation Level | opt_for_speed | Bench runs | DMIPS | DMIPS/MHz | µs/Dhry | Dhry/s |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dhrystone | TMS570LS3137 | 180 | Default linker parameters | HALCoGen out-of-the-box | 0 | 0 | 1 000 000 | 59.15 | 0.329 | 9.62218028 | 103926.55 |
| Dhrystone | TMS570LS3137 | 180 | Default linker parameters | HALCoGen out-of-the-box | 3 | 0 | 1 000 000 | 80.92 | 0.450 | 7.033514132 | 142176.44 |
| Dhrystone | TMS570LS3137 | 180 | Default linker parameters | HALCoGen out-of-the-box | 3 | 5 | 1 000 000 | 121.67 | 0.676 | 4.677833185 | 213774.19 |
| Dhrystone | TMS570LS3137 | 180 | Default linker parameters | Addition of #include <string.h> | 3 | 5 | 1 000 000 | 132.7 | 0.737 | 4.289012536 | 233153.9 |
| Dhrystone | TMS570LS3137 | 160 | load : FLASH \| run : RAM | Addition of #include <string.h> \| code state : Thumb 16bit | 3 | 5 | 1 000 000 | 216.3 | 1.352 | 2.6313082 | 380039.1 |
| Dhrystone | TMS570LS3137 | 160 | load : FLASH \| run : RAM | Addition of #include <string.h> \| code state : 32bit | 3 | 5 | 1 000 000 | 208.9 | 1.306 | 2.724518734 | 367037.3 |
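For reference, the relationship between the last four columns can be sketched in a few lines of C. The helper names are hypothetical and only illustrate the arithmetic; the constant 1757 is the Dhrystones-per-second score of the VAX 11/780 reference machine, which defines 1 DMIPS.

```c
#include <stdint.h>

/* Dhrystones per second of the VAX 11/780 reference machine (1 DMIPS). */
#define VAX_DHRYSTONES_PER_SEC 1757.0

/* Hypothetical helper: Dhrystones per second from run count and elapsed time. */
static double dhrystones_per_sec(uint32_t runs, double elapsed_s)
{
    return (double)runs / elapsed_s;
}

/* Hypothetical helper: DMIPS from Dhrystones per second. */
static double dmips(double dhry_per_sec)
{
    return dhry_per_sec / VAX_DHRYSTONES_PER_SEC;
}

/* Hypothetical helper: DMIPS normalised by the CPU clock in MHz. */
static double dmips_per_mhz(double dmips_value, double cpu_mhz)
{
    return dmips_value / cpu_mhz;
}
```

For the third row, 1 000 000 runs at 4.677833185 µs each give about 213774 Dhry/s, which is 121.67 DMIPS, or 0.676 DMIPS/MHz at 180 MHz.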

best regards,

Benjamin.

0 Anthony F. Seely over 9 years ago in reply to Benjamin GREFFE

TI__Guru 68940 points

Hi Benjamin,

Super. It looks like you are getting about as good a result as can be expected without really tweaking the string routines.
To go from 208 to 160*1.6 = 256 you would probably need to mess with the string copy, and even then I'm not sure you would hit 256, because you're sharing the RAM between instruction and data fetches, while in theory the CPU can fetch an instruction on TCM A (flash) and data from TCM B (RAM) in parallel. It's just that the wait states of the flash have some impact on performance at 160 MHz, and it turns out you get better performance at this frequency running from RAM.

If you wanted to see what the CPU can do at its max, you'd probably want to set the flash for zero wait states, run your code from flash and keep data in TCM. This would limit you to running with the pipeline mode of the flash disabled and at 45 MHz, but you could get the DMIPS/MHz that way if you wanted.

I was testing out on the RM57L843 first myself - because these issues are largely taken care of by the large instruction and data caches on that device.

Curious what your take is on the importance of string processing. I don't think it's worth trying to come up with optimized string processing functions just for the sake of gaming dhrystone .. but on the other hand if your application does a lot of string processing then having optimized string processing functions - even if they required alignment of strings in memory - might be worthwhile.

Not exactly sure how these would be written, but the USUB8 instruction looks like it could check 4 bytes at a time for the null termination, for example. Then, if the null isn't there, you could do a 4x4 compare, while if the null is there, you could do a serial compare (for the last iteration of the compare loop). Just fodder for thought.
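A rough sketch of that idea in portable C might look like the following. It does not use USUB8 itself; it swaps in the well-known word-at-a-time bit trick for detecting a zero byte, which plays the same role. The names `has_zero_byte` and `strcmp_aligned4` are hypothetical, and the sketch assumes both strings sit in buffers padded to a multiple of 4 bytes so that reading a whole word never runs past the end of the buffer.

```c
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the four bytes in w is 0x00 (classic bit trick,
 * used here in place of the USUB8 instruction). */
static int has_zero_byte(uint32_t w)
{
    return ((w - 0x01010101u) & ~w & 0x80808080u) != 0u;
}

/* Hypothetical word-at-a-time compare. ASSUMPTION: both buffers are
 * padded to a multiple of 4 bytes, so each 4-byte read stays in bounds. */
static int strcmp_aligned4(const char *a, const char *b)
{
    for (;;) {
        uint32_t wa, wb;

        memcpy(&wa, a, sizeof wa);   /* load 4 bytes of each string */
        memcpy(&wb, b, sizeof wb);

        /* Stop at the first word that differs or contains the terminator. */
        if (wa != wb || has_zero_byte(wa)) {
            break;
        }
        a += 4;
        b += 4;
    }

    /* Serial compare of the last word only (at most 4 bytes). */
    while (*a != '\0' && *a == *b) {
        a++;
        b++;
    }
    return (int)(unsigned char)*a - (int)(unsigned char)*b;
}
```

The trade-off is the one raised above: the fast path compares four bytes per iteration, but only for strings that meet the padding assumption, which is why a generic runtime library routine cannot simply do this.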
