How fast a loop can execute?

c_builder

Other Parts Discussed in Thread: MSP430F5528, MSP430G2553, MSP430G2452, MSP430F5529, MSP430F5229, MSP430F5510, MSP430F5438A

I've been running some tests on an MSP430F5528. I set it to run off the DCO at 8 MHz, then flip a bit in a while loop. On the MSP430F5528 the signal generated by the bit flipping is around 523 kHz, while I can see the SMCLK output at 8.5 MHz. So my question is: how do I know what's taking up the extra processing time? Is there an easy way to find out?

I've run a similar test on a LaunchPad MSP430 (MSP430G2553) and I'm able to flip the bits on the pin and see it actually run at 8 MHz. So it's behaving differently (running the while loop much faster). What's the F5528 doing to make it run the loop slower?

Thanks.

Below is the example code I'm using on the MSP430F5528.

#include <msp430.h>

int main(void)
{
    volatile unsigned int i;

    WDTCTL = WDTPW+WDTHOLD;                   // Stop WDT

    P1DIR |= BIT1;                            // P1.1 output
    P1DIR |= BIT7;                            // P1.7 output

    P1DIR |= BIT0;                            // ACLK set out to pins
    P1SEL |= BIT0;
    P2DIR |= BIT2;                            // SMCLK set out to pins
    P2SEL |= BIT2;
    //P7DIR |= BIT7;                          // MCLK set out to pins ... this doesn't work on the MSP430F5528
    //P7SEL |= BIT7;

    UCSCTL3 = SELREF_2;                       // Set DCO FLL reference = REFO
    UCSCTL4 |= SELA_2;                        // Set ACLK = REFO
    UCSCTL0 = 0x0000;                         // Set lowest possible DCOx, MODx

    // Loop until XT1,XT2 & DCO stabilizes - In this case only DCO has to stabilize
    do
    {
        UCSCTL7 &= ~(XT2OFFG + XT1LFOFFG + DCOFFG);
                                              // Clear XT2,XT1,DCO fault flags
        SFRIFG1 &= ~OFIFG;                    // Clear fault flags
    }while (SFRIFG1&OFIFG);                   // Test oscillator fault flag

    __bis_SR_register(SCG0);                  // Disable the FLL control loop
    UCSCTL1 = DCORSEL_5;                      // Select DCO range 16MHz operation
    UCSCTL2 |= 249;                           // Set DCO Multiplier for 8MHz
                                              // (N + 1) * FLLRef = Fdco
                                              // (249 + 1) * 32768 = 8MHz
    __bic_SR_register(SCG0);                  // Enable the FLL control loop

    // Worst-case settling time for the DCO when the DCO range bits have been
    // changed is n x 32 x 32 x f_MCLK / f_FLL_reference. See UCS chapter in 5xx
    // UG for optimization.
    // 32 x 32 x 8 MHz / 32,768 Hz = 250000 = MCLK cycles for DCO to settle
    __delay_cycles(250000);

    while(1)
    {
        P1OUT ^= BIT7;                        // Toggle P1.7
        //__delay_cycles(600000);             // Delay
    }
}

over 12 years ago

0 zrno soli over 12 years ago

Guru 34853 points

If you want to analyze something regarding timing, use XT1/XT2 (for example 8 MHz), and not the DCO. On MSP430F552x you have an MCLK output pin, so you can measure MCLK. Your code is in C; did you check whether the final assembler code produced for MSP430G2/MSP430F5 is the same? Anyway, I found that sometimes older MSP430x2x devices are faster than new MSP430F5xx with the same assembler code (and the same MCLK).

http://forum.43oh.com/topic/2972-sbw-msp430f550x-based-programmer/?p=32639

0 c_builder over 12 years ago in reply to zrno soli

Prodigy 170 points

Looks like the MSP430F5528 does not have an MCLK out. The 5529 and 5527 do.

I will try to use XT2 rather than the DCO and check the results. I'm testing using the target board http://www.ti.com/tool/msp-ts430rgc64usb . So it has the built-in 4 MHz crystal.

0 c_builder over 12 years ago in reply to c_builder

Prodigy 170 points

I tried using the 4 MHz crystal on the board. I got a much steadier signal, but it's still running the while loop slowly. At 4 MHz the while loop outputs a 250 kHz signal, which makes sense, because when I ran the board at 8 MHz (DCO) I was getting a signal of around 523 kHz.

So now my question is, what can I do to make the while loop run faster?

0 Robert Cowsill over 12 years ago

Guru 16361 points

c_builder said:
I've run a similar test on a LaunchPad MSP430 (MSP430G2553) and I'm able to flip the bits on the pin and see it actually run at 8 MHz.

Do you mean that you're setting the MSP430G2553's DCO to 8 MHz and are getting an 8 MHz toggle output? If so, I can't see how that's possible. Jump instructions take two cycles and the pin toggle is going to take 4 or 5 cycles depending on which bit you're flipping. So that's at least 12 MCLK cycles for a full output cycle (toggle on and toggle off).

0 c_builder over 12 years ago in reply to Robert Cowsill

Prodigy 170 points

Ya, I was wondering this too. And actually I see why now. I was doing a pin toggle on P1.4 which was also the SMCLK output. So really it was the SMCLK that I was seeing on the scope and not the while loop doing the pin flipping. Thanks for clearing that part up.

So this means the fastest I can do a manual flipping of a bit in a while loop would be around 1MHz if I was running the board at 16MHz. Correct?

0 Jason Work over 12 years ago in reply to c_builder

TI__Genius 10830 points

You can also use the clock feature in Code Composer Studio to take a measurement or make a comparison.

Run --> Clock --> Enable

Here is a simple example with a breakpoint to show that this particular loop takes six clock ticks.

0 c_builder over 12 years ago in reply to Jason Work

Prodigy 170 points

Thanks Jason for showing me this. I enabled it and it's saying that it's 5 cycles per loop. Running the processor at 16 MHz, I'm seeing a 1 MHz output signal from the pin. That means there are 8 cycles per loop, so I'm not sure this is working right. Are there 3 CPU cycles that are not accounted for?

0 Jason Work over 12 years ago in reply to c_builder

TI__Genius 10830 points

My results show this:

Six clock counts on MSP430G2452 (using MSP-EXP430G2) as reported above
Four clock counts on MSP430F5529 (using MSP-EXP430F5529) as reported below

0 c_builder over 12 years ago in reply to Jason Work

Prodigy 170 points

Jason, sorry perhaps I didn't give enough details. I'm flipping P1.7 which turns out to be 5 cycles according to the CCS v5.

But what I'm seeing on the scope is different. It seems to me that if I'm running at 16 MHz and it takes 5 cycles per loop, that would be 10 cycles for a complete pulse up/down. So the pulse coming out of P1.7 should be 1.6 MHz, right? But instead I'm seeing 1.0 MHz on P1.7. Is it possible something else is going on in the background? An interrupt routine? Even though I didn't enable any. There are 3 cycles per loop that I can't account for.

0 Jason Work over 12 years ago in reply to c_builder

TI__Genius 10830 points

I see what you mean. Here are my results from my code above using a logic analyzer.

MSP430G2452 frequency ~89.5 kHz
MSP430F5529 frequency ~76.5 kHz

This has the DCO running at the default frequency of ~1 MHz, so that would work out to roughly 11 to 13 clock ticks per two loops, or something like 5 to 7 clock ticks per loop. I will need to investigate further.

0 Jason Work over 12 years ago in reply to Jason Work

TI__Genius 10830 points

I ran the test again, but with a DCO of 8 MHz.

Ignore the clock dither. We are approaching the Nyquist sampling rate (24 MHz sampling rate on an 8 MHz clock).

Device        MCLK (Hz)   f_toggle (Hz)   ticks/period                      ticks/toggle
MSP430G2452   1.00E+06    8.95E+04        11.2                              5.6
MSP430F5529   1.00E+06    7.65E+04        13.1                              6.5
MSP430G2452   8.00E+06    6.49E+05        12.3                              6.2
MSP430F5529   8.00E+06    5.33E+05        16 by counting (or 15.0 by 1/f)   8 by counting (or 7.5 by 1/f)

The results for the MSP430G2452 are close to what I would expect, six clock ticks per loop. The variation is within experimenter error since I didn't actually measure the MCLK frequency of the MSP430G2452.

The results for the MSP430F5529 are unexpected. I expected the loop time to be faster, five clock ticks per loop. (Code Composer Studio measured 4 clock cycles for the loop above, but five clock cycles for the while loop tested.) Instead I measured eight clock ticks. This is consistent with the three clock cycle discrepancy from the Code Composer Studio results that c_builder observed.

More investigation is required.

0 Jason Work over 12 years ago in reply to Jason Work

TI__Genius 10830 points

0602.loopAt1MHz.zip

Attached are results and code running the DCO and MCLK at 1 MHz. Code Composer Studio measures a five clock tick loop time. My logic analyzer shows an eight clock tick loop time. Why this difference?

Here are some details about my setup.

Code Composer Studio version 5.4.0
Hardware: MSP-EXP430F5529
MSP430 example starting point: MSP430F55xx_UCS_02
Code changes: updated MSP430F55xx_UCS_02.c attached in form of .txt file

0 Jason Work over 12 years ago in reply to Jason Work

TI__Genius 10830 points

I tested the effect of adding four delay cycles. In both cases, it added four clock cycles to the loop time. For the sake of better resolution on the logic analyzer, I used a 1 MHz DCO.

Code Composer Studio loop time before: 5 clock ticks
Code Composer Studio loop time with 4 delay cycles: 9 clock ticks

Logic analyzer loop time before: 8 clock ticks
Logic analyzer loop time with 4 delay cycles: 12 clock ticks


There is still a three clock tick discrepancy between the number of clock ticks Code Composer Studio measures and what I measure using the logic analyzer.

0 c_builder over 12 years ago in reply to Jason Work

Prodigy 170 points

Thanks for all the work on this Jason. I'm very curious as to why this is happening, and where those mystery clock cycles are going.

Cheers.

0 zrno soli over 12 years ago in reply to c_builder

Guru 34853 points

c_builder said:

Looks like the MSP430F5528 does not have a MCLK out. The 5529 and 5527 do.

MSP430F5xx devices have MCLK out at P4 (PM_MCLK).

c_builder said:

So this means the fastest I can do a manual flipping of a bit in a while loop would be around 1MHz if I was running the board at 16MHz. Correct?

So now my question is, what can I do to make the while loop run faster?

If you just need port flipping, it can be done at MCLK frequency using the MCLK pin output. Also, Timer outputs can be used for this at MCLK/2 frequency. If you want to send information on a port output, this can be done by using DMA (copying a pattern from memory to POUT) at MCLK/2 frequency.

If you want to analyze instruction timing, this must be done on final assembler code. Here is number of cycles for execution on MSP430F5x......

xor.b R1, &P1OUT   ; 4
xor.b #BIT0, &P1OUT   ; 4
xor.b #BIT7, &P1OUT   ; 5

If you have some questions regarding the number of cycles / execution time, I can help, but point me to the assembler code lines, because I don't want to lose time on the CCS/C combination (by thinking about what is behind it).

0 Jason Work over 12 years ago in reply to c_builder

TI__Genius 10830 points

c_builder said:

... what can I do to make the while loop run faster?

You could run the LED toggle while loop at 2.7 MHz on the MSP430F5529 on P6.7 by increasing the DCO frequency to near the maximum system frequency.

DCORSEL = 6 (for max resolution up to 25 MHz)
N = FLLN = 657 (takes into account +3.5% tolerance for REFO and S_DCO = the max of 1.12 to still not exceed 25 MHz system frequency)
Result is MCLK of 21.56 MHz (+/- 3.4 MHz)
Loop frequency of 2.7 MHz (with eight MCLK cycles per loop)

0 zrno soli over 12 years ago

Guru 34853 points

c_builder said:

I've been running some tests on an MSP430F5528. I set it to run off the DCO at 8 MHz, then flip a bit in a while loop. On the MSP430F5528 the signal generated by the bit flipping is around 523 kHz, while I can see the SMCLK output at 8.5 MHz. So my question is: how do I know what's taking up the extra processing time? Is there an easy way to find out?

I've run a similar test on a LaunchPad MSP430 (MSP430G2553) and I'm able to flip the bits on the pin and see it actually run at 8 MHz. So it's behaving differently (running the while loop much faster). What's the F5528 doing to make it run the loop slower?

I checked it on MSP430F5510 with 8 MHz XT2, and yes, you are right.

Loop1 xor.b #BIT0, &P1OUT ; 4 cycles
jmp Loop1 ; 2 cycles

[P1.0] 571510.0 Hz [PM_MCLK] 8001138.5 Hz

Loop2 xor.b #BIT7, &P1OUT ; 5 cycles
jmp Loop2 ; 2 cycles

[P1.7] 500072.0 Hz [PM_MCLK] 8001146.0 Hz

The first loop should give a frequency at P1.0 of 8000000 / (2 * (4 + 2)) = 8000000 / 12 = 667 kHz, while the measured frequency is 571 kHz.

The second loop should give a frequency at P1.7 of 8000000 / (2 * (5 + 2)) = 8000000 / 14 = 571 kHz, while the measured frequency is 500 kHz.

In both cases the CPU takes 1 extra cycle per loop. This is due to CPUX, and there is no workaround. I had found an example where MSP430x2xx executes faster (with the same assembler code) than MSP430x5xx, but now I see there are also other examples (like this one noticed by you) that show the same thing (the old MSP430x2xx is faster than the new MSP430x5xx in some cases).

http://forum.43oh.com/topic/2972-sbw-msp430f550x-based-programmer/?p=32639

0 Jason Work over 12 years ago in reply to zrno soli

TI__Genius 10830 points

Insightful results zrno soli. This suggests that CCS is making a mistake counting cycles for the MSP430F552x, but not for the MSP430G2xxxx. The missing three cycles would be accounted for by the jmp (2 cycles) and CPUX (1 cycle). This is worth verifying between the CCS team and MSP430 team.

0 Jens-Michael Gross over 12 years ago in reply to Jason Work

Guru 227245 points

Jason Work said:
The missing three cycles would be accounted for by the jmp (2 cycles) and CPUX (1 cycle).

The two cycles for the JMP are obvious. But I don't see why CPUX should be responsible for the additional 1 cycle.
XOR isn't an MSP430X instruction (with additional header byte), nor is the JMP. A JMP is basically an ADD instruction with the value hardcoded into the instruction word. So only one cycle is needed for reading the instruction and the second is to invalidate the already fetched next instruction and load the one at the target address instead. No need for a 3rd cycle here either.

The only reason for an additional cycle would be the use of an XORX instruction with a 20 bit address of the port register (when using large data model). Which is stupid if the compiler knows that the target address is a constant and in lower 64k.
But if the compiler doesn't know (because the register address is unknown at compile time and added by the linker script later), then this is a true waste of code space and CPU time.

To eliminate the JMP, one could put a series of toggles into the loop (instead of just one toggle per loop). Depending on how many toggles are in the loop, this reduces the influence of the JMP until it is negligible. This should give more clues for where the additional cycle comes from.

0 zrno soli over 12 years ago in reply to Jens-Michael Gross

Guru 34853 points

Jens-Michael Gross said:

The only reason for an additional cycle would be the use of an XORX instruction with a 20 bit address of the port register (when using large data model). Which is stupid if the compiler knows that the target address is a constant and in lower 64k.
But if the compiler doesn't know (because the register address is unknown at compile time and added by the linker script later), then this is a true waste of code space and CPU time.

To eliminate the JMP, one could put a series of toggles into the loop (instead of just one toggle per loop). Depending on how many toggles are in the loop, this reduces the influence of the JMP until it is negligible. This should give more clues for where the additional cycle comes from.

It is not related to a compiler problem, because I don't use compilers. I am using the IAR assembler, and there is a switch for X / non-X mode (-v1). If X mode is not used and X instructions are used in the source code, then the assembler will report an error. If X instructions are not used, the assembler will produce the same target code for MSP430x2xx / MSP430x5xx. I also noted the same thing (MSP430x2xx faster than MSP430x5xx), for sure with completely identical binary target code, without JMP instructions. It is related to CPUX caching (fetching two words of code at once), but I don't know how it is implemented at a low level. Anyway, it is not "fair" that there are no remarks about this in the CPUX datasheet.

0 Jens-Michael Gross over 12 years ago in reply to zrno soli

Guru 227245 points

zrno soli said:
It is related to CPUX caching (fetching two words of code at once)

Sorry, that's nonsense. The CPU doesn't cache anything. It makes a fetch from the memory bus: 16 bits for instructions, 8 or 16 bits for data read/write. The CPU doesn't even know whether it is fetching its instructions from flash, RAM or maybe module registers or even vacant memory.

In the 5x family, flash is 32 bits wide and the flash controller does indeed do some 'caching', or rather, it reads 32 bits and returns the lower or upper word. However, this doesn't introduce an additional wait state of any kind. This caching is not meant to speed things up, but to reduce power consumption by using the 32-bit internal flash structure to reduce the number of actual flash accesses (as you can see, reading/executing from flash takes more energy than from RAM).

BTW: on FRAM devices, there are indeed FRAM wait states when the CPU is faster than 8 MHz.

However, it would be nice to hear some explanation from the 'internals'. There's definitely a bug somewhere in the documentation. The question is: where?

0 Jason Work over 12 years ago in reply to Jason Work

TI__Genius 10830 points

Jason Work said:

For a simple while loop program running on a MSP430F5529, why does the CCS cycle counter "clock," measure three fewer MCLK cycles than I measure using a logic analyzer?

Stefan said:

In the condition [you describe], does this consider only one iteration through the loop, or do you also see this behavior when you iterate the loop several times without stopping in between? What difference do you get then? Again 3, x times 3, .....?

Yes, this condition occurs both when there is one iteration, and when I iterate the loop several times without stopping. In both cases, Code Composer Studio measures three fewer clock cycles than I measure on MCLK using a logic analyzer.

This time I tried the MSP-EXP430F5438 populated with a MSP430F5438A using the following program

code said:

#include <msp430.h>

int main(void)
{
    WDTCTL = WDTPW + WDTHOLD;                 // Stop watchdog timer
    P5DIR |= BIT0;                            // Set P5.0 to output direction
    P11DIR |= BIT1;                           // Set P11.1 to output direction
    P11SEL = BIT1;                            // MCLK on P11.1

    while (1)
    {
        P5OUT ^= 0x01;                        // Toggle P5.0 using exclusive-OR
    }
}

Code Composer Studio measures four clock cycles per loop, regardless of one breakpoint for every loop or one breakpoint for every three loops. The logic analyzer measures seven clock cycles for every loop (toggle).

You can see the test and results in this video.

0 Jason Work over 12 years ago in reply to Jason Work

TI__Genius 10830 points

Stefan said:

The behavior you see comes from the pipelined CPU and the requirement to flush the pipe each time the CPU gets stopped. This results in a few cycles of uncertainty when starting and stopping the device. Therefore, when measuring cycles and an accurate number is required, it is recommended to run over a longer time or several iterations so that the flushing of the pipe does not influence the measurement to that degree. When performing function improvements this normally is not required, as the full function will be measured and the delta from code change to code change is the interesting information (the few cycles of variance from the flush of the pipe do not matter).

Note: even when using the skip Trigger for some triggers it will also stop and restart - just in the background.

Why is this different for the MSP430G2xx parts? I do not observe the same behaviour.

Katie said:

The 2xx devices don't use pipelining - they have a different and simpler MSP430 core than 5xx.

Thanks for the insight. That's helpful. I changed my code, and my observations still do not match the explanation.

code said:

#include <msp430.h>

int main(void)
{
    WDTCTL = WDTPW + WDTHOLD;                 // Stop watchdog timer
    P5DIR |= BIT0;                            // Set P5.0 to output direction
    P11DIR |= BIT1;                           // Set P11.1 to output direction
    P11SEL = BIT1;                            // MCLK on P11.1

    volatile unsigned int i;

    while (1)
    {
        P5OUT ^= 0x01;                        // Toggle P5.0 using exclusive-OR
        i++;
    }
}

This time I measure 11 clock cycles per toggle when measuring using the logic analyzer.

I did not use a breakpoint in Code Composer Studio to measure the cycles. Instead, I stepped through the code to this line.

P5OUT ^= 0x01;

Then I enabled the CPU clock counter, and set i to zero in the watch expression window. I resumed and suspended the program and hoped that it would suspend on the same line of code again. (When it didn't, I would start over again.) When it did, I looked at the value of i to see how many loops had occurred and also recorded the clock counter. I managed to capture this once for six loops and once for seven loops. Six loops measured 36 CPU cycles. Seven loops measured 42 CPU cycles. This means one loop is measured to be 42-36 = 6 CPU cycles by Code Composer Studio.

I'm puzzled why my results do not match your explanation. I see 11 cycles using a logic analyzer where I measure 6 cycles using Code Composer Studio. This is after executing multiple loops without breakpoints. What else may be flawed with my test method, or what gap do I have in my understanding of the explanation of this behaviour?

0 zrno soli over 12 years ago in reply to Jason Work

Guru 34853 points

Jason Work said:

I'm puzzled why my results do not match your explanation. I see 11 cycles using a logic analyzer where I measure 6 cycles using Code Composer Studio. This is after executing multiple loops without breakpoints. What else may be flawed with my test method, or what gap do I have in my understanding of the explanation of this behaviour?

The problem is not with you or your results; the problem is with the explanation.

Stefan said:

The behavior you see comes from the pipelined CPU and the requirement to flush the pipe each time the CPU gets stopped. This results in a few cycles of uncertainty when starting and stopping the device. Therefore, when measuring cycles and an accurate number is required, it is recommended to run over a longer time or several iterations so that the flushing of the pipe does not influence the measurement to that degree. When performing function improvements this normally is not required, as the full function will be measured and the delta from code change to code change is the interesting information (the few cycles of variance from the flush of the pipe do not matter).

Note: even when using the skip Trigger for some triggers it will also stop and restart - just in the background.

I am using only assembler with logging inlined in the code, without any breakpoints, debugging, or anything else. Can you point to where the CPU is stopped/started in this part of the source? BTW, no matter the NOP/RRA order, on MSP430x2xx it always takes 24 cycles. Just to be clear, the firmware binary is completely the same on MSP430x2xx / MSP430F5xx. On MSP430F5xx the number of cycles changes with different instruction order. R5 points to non-USB RAM, USB is not used, and if R5, R6, R7 are used instead of only R5, the result is the same.

    rra.b @R5         ; 3        rra.b @R5         ; 3
    nop               ; 1        rra.b @R5         ; 3
    rra.b @R5         ; 3        rra.b @R5         ; 3
    nop               ; 1        rra.b @R5         ; 3
    rra.b @R5         ; 3        rra.b @R5         ; 3
    nop               ; 1        rra.b @R5         ; 3
    rra.b @R5         ; 3        nop               ; 1
    nop               ; 1        nop               ; 1
    rra.b @R5         ; 3        nop               ; 1
    nop               ; 1        nop               ; 1
    rra.b @R5         ; 3        nop               ; 1
    nop               ; 1        nop               ; 1
-------------------------    -------------------------
Total number of cycles 24    Total number of cycles 27

0 Mike Mitchell1 over 10 years ago in reply to zrno soli

Intellectual 970 points

I'm guessing here, but is it possible the speed is being reduced by the FRAM controller wait states? Maybe if you execute out of RAM...

0 zrno soli over 10 years ago in reply to Mike Mitchell1

Guru 34853 points

No, it is not (only) FRAM. It is related to any MSP430F5xx device, executed from RAM or flash, on any MCLK.
