Tiva instruction rate very slow?

Richard Bland

Expert 2050 points

Hi,

I have a Tiva TM4C123GH6PGE clocked by an external 25MHz clock module.

I am initialising the clock with...

ROM_SysCtlClockSet(SYSCTL_SYSDIV_4 | SYSCTL_USE_PLL | SYSCTL_OSC_MAIN | SYSCTL_XTAL_25MHZ);

If I run...

int speed;

speed = ROM_SysCtlClockGet();

speed is set to 50,000,000 as expected, so I was hoping for approximately 50 MIPS

so...

If I then time how long it takes to do 20 integer increments...

int number;

pinset; //for scope

        number++;
        number++;
        number++; etc etc

pinclear;

These 20 increments take approx. 4 µs, so each increment takes 200 ns, which works out to 5 MIPS.

There seem to be no references to clock cycles per instruction in the documentation.

What have I done wrong?

Thanks, Richard

over 12 years ago

0 cb1_mobile over 12 years ago

Guru 117855 points

I believe that ARM's documentation and Joseph Yiu's book, "The Definitive Guide to the ARM Cortex-M3" (indeed, it targeted the parts that are now NRND here), provide such detail.

Pro/serious IDEs (such as IAR) provide a "live" cycle counter - if you wish I'll repeat your test and report...
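Where no IDE cycle counter is at hand, the Cortex-M4 core's own DWT cycle counter can give a similar measurement from code. The following is a minimal sketch, assuming the DWT unit is implemented and accessible on the part in use. The register addresses and bit positions are the standard ARMv7-M ones; the `cycle_counter_t` struct and the function names are hypothetical, chosen so the logic can also be exercised against plain variables on a host.

```c
#include <assert.h>
#include <stdint.h>

/* Standard ARMv7-M debug register addresses (only meaningful on the target). */
#define DEMCR_ADDR       0xE000EDFCu  /* Debug Exception and Monitor Control */
#define DWT_CTRL_ADDR    0xE0001000u  /* DWT control register */
#define DWT_CYCCNT_ADDR  0xE0001004u  /* DWT cycle count register */

#define DEMCR_TRCENA     (1u << 24)   /* enables the DWT unit */
#define DWT_CYCCNTENA    (1u << 0)    /* starts the cycle counter */

/* Hypothetical wrapper: pointers to the three registers.
 * On the target these would be initialised from the addresses above, e.g.
 *   cycle_counter_t cc = { (volatile uint32_t *)DEMCR_ADDR,
 *                          (volatile uint32_t *)DWT_CTRL_ADDR,
 *                          (volatile uint32_t *)DWT_CYCCNT_ADDR };
 */
typedef struct {
    volatile uint32_t *demcr;
    volatile uint32_t *ctrl;
    volatile uint32_t *cyccnt;
} cycle_counter_t;

/* Enable the DWT unit, zero the counter, then start it counting. */
void cycle_counter_start(const cycle_counter_t *cc)
{
    assert(cc && cc->demcr && cc->ctrl && cc->cyccnt);
    *cc->demcr |= DEMCR_TRCENA;
    *cc->cyccnt = 0u;
    *cc->ctrl  |= DWT_CYCCNTENA;
}

/* Cycles between two counter readings. Unsigned subtraction gives the
 * right answer even if the 32-bit counter wrapped once in between. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On the target, one would read `*cc.cyccnt` before and after the code under test and pass both readings to `cycles_elapsed()`. The readings themselves cost a few cycles, so very short sequences are best measured in bulk.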

0 Richard Bland over 12 years ago in reply to cb1_mobile

Expert 2050 points

Hi cb1, thanks for your reply - I'm using Code Composer Studio. Is 200nS per instruction what you would expect? I have just measured again with a "delay loop" macro...

#define delay1ms for(delay = 0; delay < 2500; delay++) ;

1 ms / 2500 = 400 ns for a compare and an increment, so this confirms the performance I am getting, which is very disappointing. I would be grateful if you could have a quick test yourself and let me know.

Thanks, Richard
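As a rough sanity check on the timings quoted so far, a measured duration can be converted into core clock cycles. This is a small sketch only; the function name `ns_to_cycles` is made up for illustration, and it assumes the 50 MHz system clock reported by ROM_SysCtlClockGet().

```c
#include <assert.h>
#include <stdint.h>

/* Convert a measured duration in nanoseconds to core clock cycles,
 * rounded to the nearest cycle. A 64-bit intermediate avoids overflow. */
uint32_t ns_to_cycles(uint32_t duration_ns, uint32_t clock_hz)
{
    assert(clock_hz > 0u);
    uint64_t scaled = (uint64_t)duration_ns * clock_hz;
    return (uint32_t)((scaled + 500000000u) / 1000000000u);
}
```

With a 50 MHz clock, 200 ns per increment comes to 10 cycles and 400 ns per loop pass comes to 20 cycles. That suggests each C statement is compiling to several instructions rather than one slow one.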

0 cb1_mobile over 12 years ago in reply to Richard Bland

Guru 117855 points

Hi back Richard,

This is from Joseph's book: "Many instructions [cb1: but not all] are single cycle!" This includes multiply.

Have client lunch now - but shall return w/test results...

0 Richard Bland over 12 years ago in reply to cb1_mobile

Expert 2050 points

Hi cb1, many thanks. Seems like I should expect instructions to take 20nS then.

I have just repeated my test on a Launchpad with this code...

#include "stdint.h"
#include "stdbool.h"
#include "inc/hw_memmap.h"      // GPIO_PORTF_BASE
#include "driverlib/gpio.h"     // GPIO_PIN_1
#include "driverlib/sysctl.h"
#include "driverlib/rom.h"

#define redLedTog   ROM_GPIOPinWrite(GPIO_PORTF_BASE, GPIO_PIN_1, (ROM_GPIOPinRead(GPIO_PORTF_BASE, GPIO_PIN_1) ^GPIO_PIN_1))
#define   delay1ms   for(delay = 0; delay < 2500; delay++) ;

int speed, delay;

int main(void) {
    ROM_SysCtlClockSet(SYSCTL_SYSDIV_4 | SYSCTL_USE_PLL | SYSCTL_OSC_MAIN | SYSCTL_XTAL_16MHZ);
    ROM_SysCtlPeripheralReset(SYSCTL_PERIPH_GPIOF);
    ROM_SysCtlPeripheralEnable(SYSCTL_PERIPH_GPIOF);
    ROM_GPIOPinTypeGPIOOutput(GPIO_PORTF_BASE, GPIO_PIN_1);

    speed = ROM_SysCtlClockGet();

    while(1) {
        redLedTog;
        delay1ms;
    }
}

with the same results... speed is set to 50,000,000 but to get 1mS between led toggles I need 2500 inc/compares

hope someone can shed some light on this!

it's beer-time over here! try again tomorrow

cheers

Richard

0 Chester Gillon over 12 years ago in reply to Richard Bland

Guru 92251 points

Richard Bland said:
delay1ms;

You haven't stated what the compiler optimization level has been set to, or what the generated assembler instructions were. I haven't currently got a LaunchPad to try measuring the timings, but using CCS 5.5 I ticked the "Keep the generated assembly language (.asm) file" option under project properties -> Build -> ARM Compiler -> Advanced Options -> Assembler Options.

With the Optimization Level set to Off the assembler generated for the delay1ms macro was:

	.dwpsn	file "../project.c",line 45,column 9,is_stmt,isa 1
        LDR       A3, $C$CON8           ; [DPU_3_PIPE] |45| 
        LDR       A1, $C$CON8           ; [DPU_3_PIPE] |45| 
        MOVS      A2, #0                ; [DPU_3_PIPE] |45| 
        STR       A2, [A3, #0]          ; [DPU_3_PIPE] |45| 
        LDR       A1, [A1, #0]          ; [DPU_3_PIPE] |45| 
        MOV       A2, #2500             ; [DPU_3_PIPE] |45| 
        CMP       A2, A1                ; [DPU_3_PIPE] |45| 
        BLE       ||$C$L1||             ; [DPU_3_PIPE] |45| 
        ; BRANCHCC OCCURS {||$C$L1||}    ; [] |45| 
;* --------------------------------------------------------------------------*
;*   BEGIN LOOP ||$C$L2||
;*
;*   Loop source line                : 45
;*   Loop closing brace source line  : 45
;*   Known Minimum Trip Count        : 1
;*   Known Maximum Trip Count        : 4294967295
;*   Known Max Trip Count Factor     : 1
;* --------------------------------------------------------------------------*
||$C$L2||:    
        LDR       A2, $C$CON8           ; [DPU_3_PIPE] |45| 
        LDR       A3, $C$CON8           ; [DPU_3_PIPE] |45| 
        LDR       A1, [A2, #0]          ; [DPU_3_PIPE] |45| 
        ADDS      A1, A1, #1            ; [DPU_3_PIPE] |45| 
        STR       A1, [A2, #0]          ; [DPU_3_PIPE] |45| 
        LDR       A1, [A3, #0]          ; [DPU_3_PIPE] |45| 
        MOV       A2, #2500             ; [DPU_3_PIPE] |45| 
        CMP       A2, A1                ; [DPU_3_PIPE] |45| 
        BGT       ||$C$L2||             ; [DPU_3_PIPE] |45| 
        ; BRANCHCC OCCURS {||$C$L2||}    ; [] |45|

Whereas when the optimization level was set to 4 (the maximum) the generated assembler for the delay1ms macro was:

        MOV       V3, #2500             ; [DPU_3_PIPE] |45| 
        MOVS      A1, #0                ; [DPU_3_PIPE] |45| 
;* --------------------------------------------------------------------------*
;*   BEGIN LOOP ||$C$L3||
;*
;*   Loop source line                : 45
;*   Loop closing brace source line  : 45
;*   Known Minimum Trip Count        : 2500
;*   Known Maximum Trip Count        : 2500
;*   Known Max Trip Count Factor     : 2500
;* --------------------------------------------------------------------------*
||$C$L3||:    
        ADDS      A1, A1, #1            ; [DPU_3_PIPE] |45| 
        CMP       V3, A1                ; [DPU_3_PIPE] |45| 
        BGT       ||$C$L3||             ; [DPU_3_PIPE] |45| 
        ; BRANCHCC OCCURS {||$C$L3||}    ; [] |45| 
;* --------------------------------------------------------------------------*
        STR       A1, [V1, #4]          ; [DPU_3_PIPE]

The point I am trying to make is that it is difficult to know how many instruction cycles will be generated for a given set of C statements.

0 Chester Gillon over 12 years ago in reply to Chester Gillon

Guru 92251 points

Chester Gillon said:
The point I am trying to make is that it is difficult to know how many instruction cycles will be generated for a given set of C statements

Even if you code timing loops in assembler, flash wait states can make the instruction timing slower than expected - see the thread "Inconsistent delay with SysCtlDelay() compared to ROM_SysCtlDelay()" for details.
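For a delay that does not depend on the optimization level, one option is to size a call to ROM_SysCtlDelay() from the clock rate. The sketch below assumes the figure given in the TivaWare documentation of 3 cycles per loop for SysCtlDelay(); the helper name `delay_loops_for_us` is hypothetical, and the result should still be checked on a scope.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles per loop iteration, as documented by TivaWare for SysCtlDelay(). */
#define SYSCTL_DELAY_CYCLES_PER_LOOP 3u

/* Loop count to pass to SysCtlDelay() for a delay of delay_us microseconds
 * at a system clock of clock_hz. Rounds down, so the delay is never longer
 * than requested by more than the call overhead. */
uint32_t delay_loops_for_us(uint32_t clock_hz, uint32_t delay_us)
{
    assert(clock_hz > 0u);
    uint64_t cycles = (uint64_t)clock_hz * delay_us / 1000000u;
    return (uint32_t)(cycles / SYSCTL_DELAY_CYCLES_PER_LOOP);
}

/* On the target (not compiled here), a 1 ms delay would then be:
 *   ROM_SysCtlDelay(delay_loops_for_us(ROM_SysCtlClockGet(), 1000u));
 */
```

At 50 MHz this gives 16666 loops for 1 ms. The ROM copy is used in the comment because, as noted above, the flash copy can run slower than 3 cycles per loop.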

0 cb1_mobile over 12 years ago in reply to Chester Gillon

Guru 117855 points

Beyond any limits imposed by any specific (thus limited) function - there is the timing uncertainty introduced when the MCU's system clock exceeds 40MHz and the code is executing from the MCU's flash memory.

The IAR compiler - under brief testing thus far - is yielding far faster program execution.

0 Richard Bland over 12 years ago in reply to Chester Gillon

Expert 2050 points

Thank you all very much for your input here - I had completely forgotten about Optimisation settings - I always debug with them off, otherwise it can be very confusing when single-stepping. However having now tried my simple delay loop with the compiler Optimisation level at 4 and the Optimise for speed set to 5, I find I need to loop 33000 times to get 1mS delay. This is 13.2 times faster! WOW. Amazing. I come from a PIC assembler background so compiler settings are still a bit mysterious to me. Thank you very much for the support.

Cheers, Richard
