I am trying to obtain repeatable benchmarks

Paul Lauzon

I ran several benchmarks (math, memory access, branching) and used the performance counter to get the timing results.

I disabled interrupts before running the tests.

The problem is that I get different results each time I run the benchmarks, so I am wondering what could cause the difference.

I use the RM48L950 USB development board and I am running those benchmarks within Code Composer Studio 4.2.4 with all the recent updates.

Is it possible that something interrupts the benchmark?

Code Composer Studio is fetching data to update the GUI using JTAG or other means

There are DMA accesses coming from USB or other peripherals accessing the bus

Or could it be related to the following?

The branch prediction is not 100% deterministic and depends on previously run code

The source code or data alignment affects the timing (if I add a benchmark, it could move everything)

Any other explanations would be helpful.

over 12 years ago

0 Anthony F. Seely over 12 years ago

TI__Guru 68930 points

Branch prediction is the most likely cause if the benchmark results stabilize after 3-4 runs. It takes about that long to train the branch predictor. What I've seen in the past is that runs 1-2 show a big improvement, run 3 a small one, and run 4 is almost the same as run 3.
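One way to check for this warm-up pattern is to record the cycle count of each run, discard the first few, and look at the spread of the rest. The following is only a minimal sketch: `steady_state_spread` and the warm-up count are hypothetical names chosen for illustration, not part of any TI or ARM library.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns (max - min) of the cycle counts that remain
 * after skipping the first `warmup` runs. Returns 0 if no runs remain. */
uint32_t steady_state_spread(const uint32_t *cycles, size_t n, size_t warmup)
{
    if (warmup >= n) {
        return 0;
    }

    uint32_t lo = cycles[warmup];
    uint32_t hi = cycles[warmup];

    for (size_t i = warmup + 1; i < n; i++) {
        if (cycles[i] < lo) {
            lo = cycles[i];
        }
        if (cycles[i] > hi) {
            hi = cycles[i];
        }
    }
    return hi - lo;
}
```

If the spread with the first three or four runs skipped is much smaller than the spread over all runs, predictor training is a plausible explanation.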

You can also disable branch prediction to see if this stabilizes the results.

EDIT:

The branch prediction logic can be controlled through three fields in the System Control Coprocessor, Auxiliary Control Register.

MRC/MCR p15, 0, <Rd>, c1, c0, 1

See the ARM Cortex-R4 and R4F Technical Reference Manual (available at infocenter.arm.com).

0 Paul Lauzon over 12 years ago in reply to Anthony F. Seely

Intellectual 255 points

I forgot to mention that most of the benchmarks are loops performing something for over 1 million iterations.

0 Anthony F. Seely over 12 years ago in reply to Paul Lauzon

Paul,

I'd check the branch prediction anyway.

- You can do this simply by poking values into the Aux. Control Register (A) in the screen capture below.

- Don't confuse this with (B) :) Easy to do if the column isn't fully expanded.

- Item (C) is the continuous refresh button. If you have this on, CCS might try to refresh while running.

- Last - do you have any IO functions like file IO or console IO (fopen(), printf(), etc.) in your benchmark?

(That is, inside the loop that you measure with the PMU?)

If so, CCS implements these using breakpoints, and you would definitely run into variability due to USB.

Otherwise, I can't think of a reason why you'd see an impact from USB.

0 Paul Lauzon over 12 years ago in reply to Anthony F. Seely

Thanks for your help. Since I am trying to benchmark the MCU, my goal is to get the best repeatable results possible, so I guess disabling the branch prediction would make it slower on average.

Is there some alignment consideration for optimizing the device? If so, does it have to be done manually, or does the compiler do it automatically?

In the Technical Reference Manual I saw these bits:

[bit 21] DEOLP Disable end of loop prediction:

0 = Enable loop prediction. This is the reset value.

1 = Disable loop prediction.

[bit 20] DBHE Disable Branch History (BH) extension:

0 = Enable the extension. This is the reset value.

1 = Disable the extension.

[bits 16:15] BP This field controls the branch prediction policy:

b00 = Normal operation. This is the reset value.

b01 = Branch always taken.

b10 = Branch always not taken.

b11 = Reserved. Behavior is Unpredictable if this field is set to b11.

So I could set bits 20 and 21 to 1 and bits 16:15 to either 01 or 10?
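To make the bit arithmetic concrete, here is a sketch that computes the new Auxiliary Control Register value from the fields quoted above. The macro and function names are made up for illustration. The value would still have to be read and written back with the MRC/MCR p15, 0, <Rd>, c1, c0, 1 access mentioned earlier, which is not shown here because the inline assembly syntax depends on the compiler.

```c
#include <stdint.h>

/* Field positions as quoted from the TRM excerpt above. */
#define ACTLR_DEOLP     (1u << 21)             /* 1 = disable end-of-loop prediction */
#define ACTLR_DBHE      (1u << 20)             /* 1 = disable branch history extension */
#define ACTLR_BP_SHIFT  15
#define ACTLR_BP_MASK   (3u << ACTLR_BP_SHIFT) /* bits 16:15, prediction policy */

/* Hypothetical names for the three documented BP encodings (b11 is reserved). */
enum bp_policy {
    BP_NORMAL           = 0,
    BP_ALWAYS_TAKEN     = 1,
    BP_ALWAYS_NOT_TAKEN = 2
};

/* Returns `actlr` with loop prediction and the BH extension disabled and the
 * BP field replaced by `policy`. All other bits are left unchanged. */
uint32_t actlr_static_prediction(uint32_t actlr, enum bp_policy policy)
{
    actlr |= ACTLR_DEOLP | ACTLR_DBHE;
    actlr &= ~ACTLR_BP_MASK;
    actlr |= ((uint32_t)policy << ACTLR_BP_SHIFT) & ACTLR_BP_MASK;
    return actlr;
}
```

Doing this as a read-modify-write keeps the unrelated control bits in the register as they were.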

0 Paul Lauzon over 12 years ago in reply to Anthony F. Seely

No, there is no file or communication I/O in the benchmark.

Just computations and algorithms.

I read the PMU, perform iterations, read the PMU again and compute the difference.
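The difference step can be sketched as below, assuming the counter being read is a free-running 32-bit cycle counter. `pmu_elapsed` is a hypothetical name, and the actual counter read is not shown. With unsigned arithmetic the subtraction stays correct even if the counter wraps once between the two reads.

```c
#include <stdint.h>

/* Hypothetical helper: cycles elapsed between two reads of a free-running
 * 32-bit counter. Unsigned subtraction is modulo 2^32, so a single wrap
 * between `start` and `end` still yields the right difference. */
uint32_t pmu_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

This only holds if the measured region is shorter than one full counter period; a longer run would need an overflow flag or a counter divider.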

0 Anthony F. Seely over 12 years ago in reply to Paul Lauzon

Yes. You might try 'branch always not taken' since the flash wrapper is optimized for sequential accesses.

Regarding alignment - are you using double precision float?

One thing I should state that may not be obvious: if you are benchmarking, the algorithm you benchmark *must* be compiled without debug enabled.

If you enable the debug option with the TI ARM compiler, it constantly pushes/pops to the stack rather than optimizing variables to registers; this makes debug easier but kills performance. You should be at -O2 or -O3 at least.

You can still compile your benchmark 'wrapper' code for debug, but the code inside the PMU-measured region needs to be compiled optimized to get a decent result.

EDIT: Also regarding *code* alignment - the major improvement you will see in performance comes in a very special case. If you can optimize a loop kernel so that it completely fits on one line of flash (128 bits wide) then you'll see a major lift vs. if it is unaligned and split across flash lines. But you can only fit 8 16-bit or 4 32-bit instructions on one flash line, so the loop kernel has to be pretty tight for this to matter. This isn't the easiest to do through the compiler, but if you think you have any candidates we can talk about how to get it done. I've done it before by inserting NOPs in the optimized assembly versions of functions, but I'm assuming you're compiling....
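To judge whether a loop kernel is a candidate, you could take its start address and size from the map file or disassembly and check whether it falls within one 128-bit line. This is only a sketch: the function name is hypothetical, and it assumes flash lines start on 16-byte boundaries.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLASH_LINE_BYTES 16u /* 128 bits, per the description above */

/* Hypothetical helper: true if the byte range [start_addr, start_addr + size_bytes)
 * lies entirely within one flash line. Assumes lines are 16-byte aligned. */
bool fits_in_one_flash_line(uint32_t start_addr, uint32_t size_bytes)
{
    if (size_bytes == 0u || size_bytes > FLASH_LINE_BYTES) {
        return false;
    }

    uint32_t first_line = start_addr / FLASH_LINE_BYTES;
    uint32_t last_line = (start_addr + size_bytes - 1u) / FLASH_LINE_BYTES;

    return first_line == last_line;
}
```

For example, a 16-byte kernel at 0x1000 fits, while the same kernel at 0x1004 straddles two lines and would need padding to move it.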

EDIT 2: BTW, which compiler *version* are you using? The TI ARM version 5.x series greatly outperforms the 4.x series on optimization for our product.

0 Paul Lauzon over 12 years ago in reply to Anthony F. Seely

Yes, some of the tests are using double precision float.

Thanks for all the information.

I had tried -O4 but got some random results so I was sticking with -fm5.

I am using Code Composer Studio 4.2.4. I tried to install version 5.4, but network errors are displayed.

0 Anthony F. Seely over 12 years ago in reply to Paul Lauzon

Paul,

OK. Double precision float is going to give poor results (on most embedded processors, in fact), so please keep that in mind. The FPU MAC is optimized for SP; it will run DP, but that takes multiple cycles (usually around 4x as long as SP). I would just ask if you really need DP in your application. Very few embedded applications do; mainly high-quality audio apps.

What network errors are you getting when you try to update?

Best Regards,

Anthony