Code execution speed

Martin Beaucage

Other Parts Discussed in Thread: TMS570LS20216

Hello,

I am working with a TMS570LS20216ASPGE USB stick development board, CCS5 and compiler v4.9.0.

I am having problems with execution speed: my software seems to run slower than expected. I attached a simple example program that illustrates the problem.

In my main I call the following function:

void DoMainLoop( void )
{
    volatile unsigned long i = 0;

    while( 1 )
    {
        gioSetBit( gioPORTA, 0, 1 );
        i++;
        /* ... i++; repeated 48 more times ... */
        i++;
        gioSetBit( gioPORTA, 0, 0 );
        i--;
        /* ... i--; repeated 48 more times ... */
        i--;
    }
}

In my example I toggle GIO0 to 1, execute 50 times i++, toggle GIO0 to 0, execute 50 times i-- and start again.

Because of the volatile, i++ and i-- are not optimized away, and each i++ and i-- translates in assembly to LDR, ADD and STR.

This makes 50*3 = 150 instructions and my CPU runs at 140MHz so I would expect GIO0 to toggle every 1us (150/140e6) or less.

In fact with my oscilloscope it toggles every 3.5us (you can try it with the example program). I tried a lot of stuff like changing wait states, setting optimization level, running the code from RAM and nothing changes the timing significantly.

Can anyone tell me if this is normal and how to raise the execution speed?

I expected this DSP to execute code around 4 times faster.

2766.speedTest.zip

over 14 years ago

0 Michael Sherman over 14 years ago

TI__Prodigy 400 points

Martin,

I am forwarding your question and hope to have a response ASAP.

Regards,

Michael Sherman

0 steveg over 14 years ago

Intellectual 270 points

gioSetBit( gioPORTA, 0, 1 ); takes 40+ cycles, so you cannot use GPIO for accurate timing at this scale (it's good for the msec range). Use an internal counter instead.
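
Editor's sketch of the "internal counter" approach: read a free-running counter before and after the code under test and subtract. The helper names below (`elapsed_ticks`, `ticks_to_ns`) are hypothetical and not part of any TI driver. How the counter is actually read on the TMS570 (RTI, performance monitor, ...) is not shown here, and the counter is assumed to be a 32-bit up-counter.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed ticks between two reads of a free-running 32-bit up-counter.
 * Unsigned subtraction wraps modulo 2^32, so the result is still correct
 * if the counter rolled over once between the two reads. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a tick count to nanoseconds for a counter clocked at counter_hz. */
uint64_t ticks_to_ns(uint32_t ticks, uint32_t counter_hz)
{
    assert(counter_hz != 0);
    return ((uint64_t)ticks * 1000000000ull) / counter_hz;
}
```

Taking two back-to-back reads with nothing in between gives the overhead of the measurement itself, which can then be subtracted from the result.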

0 Martin Beaucage over 14 years ago in reply to steveg

Intellectual 860 points

Hello,

I followed your advice and did the measurements differently.

First, I measured the time it takes gioSetBit( gioPORTA, 0, 1 ) to change the output value, using a loop that only changes the output state. The output changed state every 348ns, so I think the gioSetBit function delay is negligible.

Then I raised the number of i++ from 50 to 500 to minimise the effect of the gioSetBit function. So this should translate to 500*3=1500 instructions and the output changed every 32.6us which is close to 10 times the time for 50 i++.

Then I used the RTI module to measure how much time the 500 i++ take. I got 1124 ticks with a clock of 37.5MHz, giving 29.97us. This is fairly consistent with what I measured with the oscilloscope.

Then I used the performance monitoring module of the DSP to measure the number of instructions executed for the 500 i++. I got 1505 executed instructions, which is consistent with the 1500 instructions expected.

Then, using the performance monitoring module, I measured the number of cycles required to execute the 500 i++. I got 4504 cycles, and 4504 cycles / 140MHz = 32.2us.

So from these experiments, I find that it takes 3 clock cycles to execute an instruction, whereas I was under the impression that the TMS570 could execute one or more instructions per cycle. Is this caused by the 3 data wait states configured for FLASH access? If so, why are all those numbers the same when executing this code from RAM, which has no wait states? Do you have an explanation?
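
Editor's sketch of the arithmetic behind these figures, using only the numbers quoted in this post (1124 RTI ticks at 37.5MHz, 1505 instructions, 4504 cycles at 140MHz). The function names are hypothetical helpers, not part of any TI or ARM API.

```c
#include <assert.h>

/* Average cycles per instruction from two performance-monitor counts. */
double cycles_per_instruction(unsigned long instructions, unsigned long cycles)
{
    assert(instructions != 0);
    return (double)cycles / (double)instructions;
}

/* Convert a count of clock periods into microseconds.
 * Works for CPU cycles (clock_hz = CPU clock) and for
 * RTI ticks (clock_hz = RTI counter clock) alike. */
double count_to_us(unsigned long count, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)count / clock_hz * 1e6;
}
```

With the values above, `cycles_per_instruction(1505, 4504)` is just under 3, `count_to_us(4504, 140e6)` is about 32.2us and `count_to_us(1124, 37.5e6)` is about 29.97us, so the three measurements agree with each other to within roughly 10%.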

Regards,

Martin B.

0 steveg over 14 years ago in reply to Martin Beaucage

Intellectual 270 points

Martin,

I am running the TMDX570LS20SUSB at 100 MHz. I guess I need to look at the GPIO timing again.

The only thing I would suggest is to unroll your loop so it is doing a very low percentage of branches. I'm guessing that with a deep 8-stage pipeline the Cortex R4 suffers greatly from incorrect branch predictions. This is not dependent on wait states, which may be why it makes no difference whether you run from FLASH or RAM. I am able to get near 1 cycle/instruction on unrolled loops while running in FLASH (register and memory based instructions). When my code accesses FLASH data it slows down considerably. A lot of things have to be done correctly to get more than one instruction/cycle.

steve

0 Martin Beaucage over 14 years ago in reply to steveg

Intellectual 860 points

Steve,

In the example program I am using, I am not doing a loop, I put 500 lines with i++; so there are no branches. The assembly code only does LDR, ADD, STR 500 times.

Could you send me a code example that runs at near 1 cycle/instruction? I believe I may have a configuration problem somewhere.

Thanks

Martin

0 steveg over 14 years ago in reply to Martin Beaucage

Intellectual 270 points

Martin

The assembly below runs at 1 cycle/instruction on my board. Include it as inline assembly or create a .asm file.

steve

MOV R0, #0
MOV R1, #0
; Copy the block below 12 times
MOV R2, #12443
MOV R3, #56342
SMLALD R0, R1, R2, R3
MOV R2, #2534
MOV R3, #2843
SMLALD R0, R1, R2, R3
MOV R2, #13
MOV R3, #44
SMLALD R0, R1, R2, R3
MOV R2, #023
MOV R3, #23
SMLALD R0, R1, R2, R3
; end of repeated block

BX LR

0 Anthony F. Seely over 14 years ago in reply to steveg

TI__Guru 68930 points

Hi Martin, Steve,

I think one assumption that needs to be re-examined is that all instructions will execute in a single cycle.

The best place to get the gory details on instruction execution timing is from ARM's documentation for the Cortex R4F CPU.

You're looking for the document: ARM DDI 0363E hopefully this link to PDF takes you there directly... http://infocenter.arm.com/help/topic/com.arm.doc.ddi0363e/DDI0363E_cortexr4_r1p3_trm.pdf

If the above link doesn't work, then start at the main page http://infocenter.arm.com/help/index.jsp and navigate down in the left panel into the "Cortex-R series processors" section ... or try searching by their literature number 'ddi0363'.

There is a chapter called "Cycle Timings and Interlock Behavior" in this manual, and it explains how many cycles instructions take to execute.

I think you probably have some data dependencies / register interlocks that are adding cycles in the repeated LDR, ADD, STR sequence.

I might be misusing the terminology - but I believe the 'Cycles' column is basically how many cycles you would see just to issue the instruction, assuming there are no interlocks on its result. So if you had a load instruction but didn't need the result in the very next couple of cycles, you should see it just take '1' cycle (or appear to).

But there's also 'latency' involved - this is how long it takes to actually get the result of the load instruction so you can use it in another operation. You'll see that this is more than one cycle depending on factors like alignment and what type of indexing is going on. Then there are some sort of what I think you can call 'correction factors' depending on whether the next instruction needs the result earlier or later than 'normal' - I think this is what the early and late reg mean.
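
Editor's sketch of the issue-cycles versus result-latency distinction described above, as a toy single-issue, in-order model. Everything here is an assumption made for illustration: the `Insn` structure, the `count_cycles` function and in particular the latency values (2 for the load, 1 for everything else) are hypothetical and are not taken from the ARM manual. The model ignores early/late register timing, memory effects and dual issue.

```c
#include <assert.h>

#define NUM_REGS 16
#define NO_REG   (-1)

typedef struct {
    int dst;      /* register written, or NO_REG */
    int src;      /* register read, or NO_REG */
    int latency;  /* cycles from issue until dst can be consumed (assumed) */
} Insn;

/* Each instruction needs one issue slot, but it cannot issue before
 * its source register has been produced by an earlier instruction. */
int count_cycles(const Insn *prog, int n)
{
    int ready[NUM_REGS] = {0};  /* cycle at which each register is usable */
    int next_slot = 0;          /* earliest cycle the next instruction may issue */

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int issue = next_slot;
        if (prog[i].src != NO_REG && ready[prog[i].src] > issue)
            issue = ready[prog[i].src];          /* interlock: wait for the result */
        if (prog[i].dst != NO_REG)
            ready[prog[i].dst] = issue + prog[i].latency;
        next_slot = issue + 1;
    }
    return next_slot;
}

/* One i++ on a volatile: every instruction needs the previous result. */
const Insn dependent_chain[3] = {
    { 0,      NO_REG, 2 },  /* LDR r0, [sp]   (assumed 2-cycle load-use latency) */
    { 0,      0,      1 },  /* ADD r0, r0, #1 */
    { NO_REG, 0,      1 },  /* STR r0, [sp]   */
};

/* Three instructions with no dependencies between them. */
const Insn independent_ops[3] = {
    { 1, NO_REG, 1 },
    { 2, NO_REG, 1 },
    { 3, NO_REG, 1 },
};
```

With these assumed latencies, `count_cycles(dependent_chain, 3)` gives 4 cycles while `count_cycles(independent_ops, 3)` gives 3. The model only shows the mechanism. It does not reproduce the roughly 3 cycles per instruction measured in this thread, for which the real figures in the ARM cycle timing chapter are needed.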

The best place to get a crisp answer for your question is probably ARM - if you really want to get a 100% understanding of the pipeline in this particular example. They have a support forum - and I'd encourage you to post the pipeline question there to get the most accurate answer. Also their docs might need some adjustment because Cache and TCM RAM aren't exactly the same (docs mention Cache but we've got TCM RAM on TMS570LS20216). So that could be good fodder for an ARM forum post.

At least I hope this gives some understanding of what's going on.... I'm speculating a bit because I haven't actually been able to rebuild your project to check out the actual assembly code that's generated - (installing CCSv5 now to do this...)

Also, I wouldn't say that there is 'no' impact to running out of flash, but our flash on the TMS570 is very wide (256 bits) so many instructions can be fetched at once, and there's also some pipelining that helps when you've just got a linear sequence of code. So I doubt you're seeing much delay at all in this example. RAM itself is single cycle, but that doesn't mean you avoid the latency of the CPU doing a LDR; it just means that the RAM doesn't add any extra cycles of its own to that latency by being slow.

Last - for 'normal' code the compiler understands the architecture / pipeline and should be trying to hide as much of these latencies as it can - that is its job. But in this case you're actually going out of your way a bit to create a delay and forcing the compiler to read back what it writes each time (volatile) while there's nothing else it can really do at the same time. So while you've got a good delay loop - for real code you should see much better performance.
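
Editor's sketch of what the volatile qualifier changes. Both functions below are hypothetical and return the same value; the difference is in what the compiler is allowed to generate, which depends on the compiler and optimization level and is not shown here.

```c
#include <assert.h>

/* volatile: every i++ must be a real load, add and store, in order.
 * The compiler may not merge them or keep i in a register. */
unsigned long bump_volatile(unsigned long start)
{
    volatile unsigned long i = start;
    i++;
    i++;
    i++;
    i++;
    return i;
}

/* non-volatile: the compiler is free to keep i in a register and,
 * with optimization enabled, to fold the four increments into one add. */
unsigned long bump_plain(unsigned long start)
{
    unsigned long i = start;
    i++;
    i++;
    i++;
    i++;
    return i;
}
```

Comparing the generated assembly for the two functions is a quick way to see how much of the delay in the test program comes from the volatile accesses themselves.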

Last Last - you were asking about issuing more than one instruction per cycle. There's a description in the same ARM doc explaining what the Cortex R4F is capable of issuing in parallel in section 14.23 if you want to get an idea of when dual-issue can happen.

Best Regards,

Anthony

0 steveg over 14 years ago in reply to Anthony F. Seely

Intellectual 270 points

Thanks Anthony, lots of good info.

My assembly example used only single cycle instructions, so it can and does run at 1 cycle/instruction. SMLALD does have a latency of 2 cycles, but the results are not used for 2 cycles, so no stalls are issued by the CPU. I assume since the FLASH has 256 bit bus, that is why the FLASH wait states are not slowing it down.

The dual issue feature of this chip is very limited.

This article on how to optimize Cortex R4 code is very good:

http://www.eetimes.com/design/signal-processing-dsp/4017562/Using-the-ARM-Cortex-R4-for-DSP-part-2-Software-optimization?pageNumber=1

0 Martin Beaucage over 14 years ago in reply to steveg

Intellectual 860 points

Hi,

Thanks guys for all the info, I will need a few days to analyse all that info.

I will let you know my conclusion.

Thanks,

Martin B.

0 Anthony F. Seely over 14 years ago in reply to steveg

TI__Guru 68930 points

Thanks Steve! I hadn't seen that article. Definitely need to spend some time with it.

I do see that this article is for the fixed point R4, whereas the TMS570LSxxxx devices with the R4F CPU should offer some additional dual issue capability since the FPU has its own pipeline. You also pick up additional working registers with the FPU which can help improve performance.

0 Christophe Beausoleil over 14 years ago in reply to Anthony F. Seely

Prodigy 245 points

Hi Martin,

In my experiments, you can reach 1 ARM instruction/cycle and 2 Thumb instructions/cycle @160MHz. I will also experiment at 36MHz without any wait state (I hope to reach 2 ARM instructions/cycle...)

But in your case, you have to consider the latency due to register dependency: when you load a value into a register, you cannot use that register immediately!

As Anthony said, carefully read §14 of the Cortex-R4F Reference Manual (ARM DDI0363E) for a complete understanding...

Regards

Christophe
