M4F cycles

Brian McCarthy59237

Other Parts Discussed in Thread: TM4C129XNCZAD

We are running the TIVA EVM board. I wrote a simple assembly file which does a "sub r1,r1,r2". There are 100 of these in a loop which is called 1000 times. Looking at the documentation, this is a single-cycle instruction, so this loop should execute in about 100,000 cycles. I'm measuring 200,000. Is this really a 2-cycle instruction, possibly due to a program read from memory? I think there is a cache. Any insights would be helpful. Thanks.

over 11 years ago
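
The expected figure in the question is simple arithmetic: 100 instructions times 1000 passes at one cycle each. A minimal sketch of that calculation in C, assuming the measured count covers only the straight-line instructions (loop and call overhead are ignored; the function and parameter names are illustrative, not from the thread):

```c
#include <assert.h>
#include <stdint.h>

/* Average cycles per instruction for a straight-line block that is
 * executed `passes` times. Loop and call overhead are NOT modelled. */
double cycles_per_instruction(uint32_t measured_cycles,
                              uint32_t instructions_per_pass,
                              uint32_t passes)
{
    assert(instructions_per_pass > 0 && passes > 0);

    /* Widen before multiplying so large test sizes cannot overflow. */
    uint64_t executed = (uint64_t)instructions_per_pass * passes;
    return (double)measured_cycles / (double)executed;
}
```

With the numbers from the question, `cycles_per_instruction(200000, 100, 1000)` comes out at 2.0, which is the "2-cycle" figure being asked about.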

0 Amit Ashara over 11 years ago

TI__Guru**** 244400 points

Hello Brian,

Which TIVA device is this, and what frequency is the core running at?

Regards

Amit

0 Brian McCarthy59237 over 11 years ago in reply to Amit Ashara

Prodigy 130 points

Hi Amit

Here is the data for 120 MHz.

board:

Tiva™ TM4C129X Development Board

Microcontroller:

Tiva TM4C129XNCZAD, 1024-KB flash memory, 256-KB SRAM, 120-MHz operation

configure timer

   SysCtlClockFreqSet((SYSCTL_XTAL_25MHZ | SYSCTL_OSC_MAIN | SYSCTL_USE_PLL | SYSCTL_CFG_VCO_480), 120000000);
   SysCtlPeripheralEnable(SYSCTL_PERIPH_TIMER0);
   TimerConfigure(TIMER0_BASE, TIMER_CFG_PERIODIC);
   TimerLoadSet(TIMER0_BASE, TIMER_A, 0xFFFFFFFF);
   TimerEnable(TIMER0_BASE, TIMER_A);

read timer

TimerValueGet(TIMER0_BASE, TIMER_A);

Thanks!

Brian
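
One thing worth checking in a measurement like this: as far as I know, a timer set up with `TIMER_CFG_PERIODIC` counts down from the load value, so the elapsed tick count is the earlier reading minus the later one. A minimal sketch of that subtraction in plain C, assuming a full-width 32-bit down-counter loaded with 0xFFFFFFFF as in the configuration above (both helper names are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed ticks between two reads of a free-running 32-bit DOWN-counter
 * that reloads with 0xFFFFFFFF. Unsigned subtraction wraps modulo 2^32,
 * so the result stays correct across a single reload. */
uint32_t elapsed_down_ticks(uint32_t first_read, uint32_t second_read)
{
    return first_read - second_read;
}

/* Convert a tick count to microseconds for a given timer clock. */
double ticks_to_microseconds(uint32_t ticks, uint32_t clock_hz)
{
    assert(clock_hz > 0);
    return (double)ticks * 1e6 / (double)clock_hz;
}
```

Taking `TimerValueGet()` before and after the code under test and passing the two values in that order gives the tick count; subtracting the other way round yields a huge wrapped number rather than a small negative one.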

0 Amit Ashara over 11 years ago in reply to Brian McCarthy59237

TI__Guru**** 244400 points

Hello Brian,

At 16MHz and above on TM4C129 devices, the wait states are introduced for Flash access. You may be seeing the effects of double fetch cycles because of that.

Change the System Clock to run at 16MHz or below and the behavior would not be seen.

Regards

Amit

0 Brian McCarthy59237 over 11 years ago in reply to Amit Ashara

Prodigy 130 points

Hi Amit

Are you saying to change the clock from 120 MHz down to 16 MHz? I don't see how I can run my whole app at 16 MHz.

Also I think this dummy program is running out of SRAM. Wouldn't that be independent of system clock?

Thanks again,

Brian

0 cb1_mobile over 11 years ago in reply to Amit Ashara

Guru 117855 points

Amit Ashara said:
At 16MHz and above on TM4C129 devices, the wait states are introduced for Flash access

Hi Amit,

Are you quite certain of the "16MHz" value? Our older, less capable (and less costly) LX4F and basic TM4C usually do not encounter such, "wait-state imposition" until System Clock rises above 40MHz! Thus the, "more capable, 129" may operate under a (serious) disadvantage...

Is it not normal that such parts, "improve" w/passage of time and cost increase? Indeed the 129's peripheral inventory has expanded - but if the 16MHz proves true - there exists a (limited) "System Clock frequency band" (16-40 MHz) in which the older, less costly parts may exceed the 129's speed of execution... (this assumes that > 1 wait state must be introduced when the 129's System Clock is ordered beyond ~80MHz - this fact not yet revealed/expanded - this thread)

0 Amit Ashara over 11 years ago in reply to cb1_mobile

TI__Guru**** 244400 points

Hello cb1,

This has been part of the data sheet, in Table 8-1 of the Internal Memory section. I have pasted the same in the post. It may not be as serious a disadvantage, as the prefetch buffer size has been increased to hold a lot more of the Flash data.

Regards

Amit

0 cb1_mobile over 11 years ago in reply to Amit Ashara

Guru 117855 points

@Amit,

Thank you - appreciated. (this guy - many others)

"May not be a serious disadvantage" would be better supported by a tightly controlled, "benchmark" comparison - would it not? (i.e. old/new MCUs running the identical code (where appropriate) at a variety of System Clock settings - then charted - to far more effectively illustrate the impact (if any) of such change upon program execution speed...)

0 SourceTwo over 11 years ago in reply to cb1_mobile

Expert 2355 points

Hence my interest/concern: http://e2e.ti.com/support/microcontrollers/tiva_arm/f/908/t/299147.aspx

No further response received from Sue;-(

Regards,

Dave

0 SourceTwo over 11 years ago in reply to Amit Ashara

Expert 2355 points

Hi Amit,

I suspect the poster will still experience a 2-cycle pace at lower frequencies as the results from the previous instruction must be written back to RAM before the fetch for the next instruction can be made. This is why optimizing compilers generally shuffle instructions around to avoid delays due to data dependencies.

Regards,

Dave

0 SourceTwo over 11 years ago in reply to SourceTwo

Expert 2355 points

Actually, I must correct myself - the delay is not due to RAM, as the code is only using registers. The general comment about data dependencies is true, but may no longer apply.

An interesting test would be to replace the group of sub r2,r1,r1 with a half-length group of interleaved pairs of sub r2,r1,r2 & sub r4,r3,r3. This would be the same total number of instructions but without possibility of delays for result storing.

Regards,

Dave

0 Amit Ashara over 11 years ago in reply to SourceTwo

TI__Guru**** 244400 points

Hello SourceTwo,

SRAM execution has one wait cycle, which may explain the 2x execution time. It is not clear, however, whether this is the only instruction or not.

Regards

Amit

0 cb1_mobile over 11 years ago in reply to Amit Ashara

Guru 117855 points

And these "details" have provided the needed "cover" so that the necessary benchmark issue may be escaped... (sigh)

(of course - it's duly noted...)

0 Brian McCarthy59237 over 11 years ago in reply to Amit Ashara

Prodigy 130 points

Hi Amit

I looked at the ARM website and this instruction should execute in 1 cycle. Is the 2-cycle issue that you mention above due to TI's implementation of the ARM Core?

This is just a dummy program. I originally wrote a floating point dot product and noticed every instruction executed in 2 cycles. These were loads (ldr, which I wouldn't be surprised to see at 2 cycles), vmul.f32, vadd.f32, etc.

Thanks!

Brian

0 Brian McCarthy59237 over 11 years ago in reply to SourceTwo

Prodigy 130 points

Hi Dave

The dummy program executing out of SRAM would have to grab the program line out of the RAM but I don't see any writes back to RAM. Even though the program has about 100 lines in the loop I imagine the program would execute out of the program cache after the 1st loop.

Thanks,

brian

0 SourceTwo over 11 years ago in reply to Brian McCarthy59237

Expert 2355 points

Hi Brian,

Time permitting, it may prove "instructive" if you try:

1000 * 50 of the two-line sequence {sub r1,r1,r2 <NL> sub r3,r3,r4}

This combination will execute at a 1-cycle pace unless there really is a memory latency issue. The difference is that the latter sequence does not have an adjacent instruction dependency (no common register).

If you try this, please let us all know your findings. I am certain this will run at 1-cycle on the TM4C123 at 80 MHz and I REALLY HOPE it will on the TM4C129 at 120 MHz too. Otherwise, all hope is lost for this family...

Regards,

Dave

0 Robert Adsett over 11 years ago in reply to Amit Ashara

Guru 27665 points

Amit Ashara said:
It may not be as serious disadvantage as the prefetch buffer size has been increased to

Presumably the prefetch will be flushed by the function call.

The presence of the prefetch raises more questions

Can it keep up with linear 1 cycle instructions (If it can why is it needed, is it just converting wide memory accesses to narrow)?
What is the penalty if the prefetch contents do not contain the next needed instruction?
Just how big is it anyway?
Does the prefetch do some sort of predictive fetching in the presence of conditional branches?

Robert

0 Brian McCarthy59237 over 11 years ago in reply to SourceTwo

Prodigy 130 points

Hi Dave

OK, I re-ran the program as you suggested and did not experience any difference. I ran this at 120 MHz. Maybe I should drop the clock speed to 80 MHz. In my original post I listed how I set up the timers. Maybe I made a mistake?

Tiva TM4C129XNCZAD, 1024-KB flash memory, 256-KB SRAM, 120-MHz operation

configure timer

read timer

TimerValueGet(TIMER0_BASE, TIMER_A);

Thanks again.

0 cb1 over 11 years ago in reply to Brian McCarthy59237

Guru 47900 points

Might it be "logical" to run at vendor's suggested 16MHz - and see if you can "ever" achieve cyclic improvement?

Are we all quite sure the design of the test/monitor (via calls to start timer & then read) is entirely correct/proper?

0 Robert Adsett over 11 years ago in reply to Brian McCarthy59237

Guru 27665 points

For this sort of test I'd do a pin toggle rather than use the timers. Just run the test continuously. It should give you an idea of the jitter as well although I'd be a little surprised if there was enough jitter to measure on something this simple.

Simple and straightforward.

Robert

0 Amit Ashara over 11 years ago in reply to Robert Adsett

TI__Guru**** 244400 points

Hello Robert,

1. It is converting wide memory access to narrow

2. If the prefetch buffer does not contain the next instruction, then it will be flushed

3. The spec mentions it to be 4 deep 256 wide for the whole flash bank-EVEN and bank-ODD.

4. No it does not do predictive fetch.

Regards

Amit
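
To put the "4 deep, 256 wide" figure in perspective, the arithmetic below works out how many instructions such a buffer could hold. This is only a sketch: it takes the depth and width from the post above at face value, and the 2-byte and 4-byte instruction sizes are the standard Thumb-2 encoding widths, not something stated in the thread. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Total capacity of a prefetch buffer in bytes. */
uint32_t prefetch_bytes(uint32_t entries, uint32_t bits_per_entry)
{
    assert(bits_per_entry % 8u == 0u);
    return entries * (bits_per_entry / 8u);
}

/* How many instructions of one fixed encoding width fit in the buffer.
 * Thumb-2 instructions are either 2 or 4 bytes long. */
uint32_t instructions_that_fit(uint32_t buffer_bytes,
                               uint32_t bytes_per_instruction)
{
    assert(bytes_per_instruction == 2u || bytes_per_instruction == 4u);
    return buffer_bytes / bytes_per_instruction;
}
```

Four entries of 256 bits come to 128 bytes, so somewhere between 32 and 64 instructions depending on the encoding mix. If the buffer does retain a short backward loop, a 10-instruction loop would fit while a loop of about 100 instructions would not.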

0 Brian McCarthy59237 over 11 years ago in reply to cb1

Prodigy 130 points

I dropped the clock to 16 MHz and it still comes out at 2 cycles. I think I must be measuring this incorrectly.

At 16 MHz, program in SRAM, no memory accesses (other than program lines), I'm pretty sure this should run at 1 cycle.

Thanks,

brian

0 Amit Ashara over 11 years ago in reply to Brian McCarthy59237

TI__Guru**** 244400 points

Hello Brian,

As I mentioned earlier, "SRAM execution has one wait cycle, which may explain the 2x execution time". Also, is the test loop something like the following in assembly?

sub r1, r1, r2

cbz r1, <sub line>

All,

If the program is executed from SRAM, then irrespective of the System Clock it is always the same number of wait cycles unlike the Flash.

Regards

Amit

0 Brian McCarthy59237 over 11 years ago in reply to Amit Ashara

Prodigy 130 points

Hi Amit

I guess I'm not clear. I thought SRAM would be faster than flash. Where does the code execute in a single cycle? Cache?

Here is the code I'm testing with. I cut this loop down to 10 instructions. It measures about 26,000 cycles.

   .syntax unified
    .type   Dummy, %function
    .text
    .align 2
    .global Dummy

Dummy:

   mov r2,#1000
Loop:

   sub r0,r1,r1
   sub r3,r3,r4
   sub r0,r1,r1
   sub r3,r3,r4
   sub r0,r1,r1
   sub r3,r3,r4
   sub r0,r1,r1
   sub r3,r3,r4
   sub r0,r1,r1
   sub r3,r3,r4


   subs r2,r2,#1
   bne Loop

   bx lr
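
For reference, a back-of-the-envelope estimate of what this loop should cost if every instruction could be fetched without stalls. The sketch assumes the figures from ARM's Cortex-M4 instruction timing table (1 cycle for a data-processing instruction, 1 + P cycles for a taken branch, with P between 1 and 3 for the pipeline refill). It leaves out the `mov`, the `bx lr` and the call itself, and the function and parameter names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Ideal cycle count for a loop of single-cycle instructions closed by
 * "subs" + "bne", assuming zero fetch wait states.
 * refill_cycles is P, the pipeline refill cost of a taken branch (1..3). */
uint32_t estimate_loop_cycles(uint32_t body_instructions,
                              uint32_t iterations,
                              uint32_t refill_cycles)
{
    assert(iterations >= 1u);
    assert(refill_cycles >= 1u && refill_cycles <= 3u);

    /* Per iteration: body + subs (1 cycle) + taken bne (1 + P cycles). */
    uint32_t per_iteration = body_instructions + 1u + 1u + refill_cycles;

    /* The final bne falls through, so it costs 1 cycle instead of 1 + P. */
    return per_iteration * iterations - refill_cycles;
}
```

For 10 body instructions and 1000 iterations this gives between 12,999 (P = 1) and 14,997 (P = 3) cycles, so a reading of about 26,000 is roughly double the ideal figure.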

0 Petrei over 11 years ago in reply to Brian McCarthy59237

Guru 26105 points

Hi,

Brian,

I just checked your code. First, it is missing an important directive in the .asm file: add .thumb as well, otherwise it assembles as something else. I ran this code on a TM4C123GXL, and the number of cycles shown is 13007 for 1000 iterations at 80 MHz.

Attached is the source file to check (please ignore all the rubbish).

5025.uart_echo.c

//*****************************************************************************
//
// uart_echo.c - Example for reading data from and writing data to the UART in
//               an interrupt driven fashion.
//
// Copyright (c) 2012-2014 Texas Instruments Incorporated.  All rights reserved.
// Software License Agreement
// 
// Texas Instruments (TI) is supplying this software for use solely and
// exclusively on TI's microcontroller products. The software is owned by
// TI and/or its suppliers, and is protected under applicable copyright
// laws. You may not combine this software with "viral" open-source
// software in order to form a larger program.
// 
// THIS SOFTWARE IS PROVIDED "AS IS" AND WITH ALL FAULTS.
// NO WARRANTIES, WHETHER EXPRESS, IMPLIED OR STATUTORY, INCLUDING, BUT
// NOT LIMITED TO, IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
// A PARTICULAR PURPOSE APPLY TO THIS SOFTWARE. TI SHALL NOT, UNDER ANY
// CIRCUMSTANCES, BE LIABLE FOR SPECIAL, INCIDENTAL, OR CONSEQUENTIAL
// DAMAGES, FOR ANY REASON WHATSOEVER.
// 
// This is part of revision 2.1.0.12573 of the EK-TM4C123GXL Firmware Package.
//
//*****************************************************************************

#include <stdint.h>
#include <stdbool.h>
#include "inc/hw_ints.h"
#include "inc/hw_types.h"
#include "inc/hw_memmap.h"
#include "inc/hw_ssi.h"
#include "driverlib/debug.h"
#include "driverlib/fpu.h"
#include "driverlib/gpio.h"
#include "driverlib/interrupt.h"
#include "driverlib/pin_map.h"
#include "driverlib/rom.h"
#include "driverlib/ssi.h"
#include "driverlib/sysctl.h"
#include "driverlib/uart.h"
//
#include "inc/hw_nvic.h"          /* for definition of NVIC_DBG_INT */
#include "inc/hw_memmap.h" /* for definition of DWT_BASE */
#include "inc/hw_types.h"       /* for definition of HWREG */

#define DWT_O_CYCCNT 0x00000004
static  uint32_t cstart, cstop;
volatile uint32_t exe_time;


extern void Dummy(void);
void EnableTiming(void);

//*****************************************************************************
//
//! \addtogroup example_list
//! <h1>UART Echo (uart_echo)</h1>
//!
//! This example application utilizes the UART to echo text.  The first UART
//! (connected to the USB debug virtual serial port on the evaluation board)
//! will be configured in 115,200 baud, 8-n-1 mode.  All characters received on
//! the UART are transmitted back to the UART.
//
//*****************************************************************************

//*****************************************************************************
//
// The error routine that is called if the driver library encounters an error.
//
//*****************************************************************************
#ifdef DEBUG
void
__error__(char *pcFilename, uint32_t ui32Line)
{
}
#endif

uint32_t ui32sysclock;
uint32_t flag;
uint32_t ssirecv;

void SSI0IntHandler(void)
{
    static unsigned long ulStatus;

    ulStatus = SSIIntStatus(SSI0_BASE, true);
    if (ulStatus == 0x08) flag = 1;
    SysCtlDelay(30);
    SSIIntClear(SSI0_BASE, ulStatus);
    SysCtlDelay(100);

}

void initSSI()
{
    SysCtlPeripheralEnable(SYSCTL_PERIPH_SSI0);

    SysCtlPeripheralEnable(SYSCTL_PERIPH_GPIOA);
    GPIOPinConfigure(GPIO_PA2_SSI0CLK);
    GPIOPinConfigure(GPIO_PA3_SSI0FSS);
    GPIOPinConfigure(GPIO_PA4_SSI0RX);
    GPIOPinConfigure(GPIO_PA5_SSI0TX);

    GPIOPinTypeSSI(GPIO_PORTA_BASE, GPIO_PIN_2|GPIO_PIN_3|GPIO_PIN_4|GPIO_PIN_5);

    SSIConfigSetExpClk(SSI0_BASE, SysCtlClockGet(), SSI_FRF_MOTO_MODE_3, SSI_MODE_MASTER, SysCtlClockGet()/8, 8);

    HWREG(SSI0_BASE + SSI_O_CR1) = 0x00000010; //EOT = 1
    SSIEnable(SSI0_BASE);

    IntEnable(INT_SSI0);
    SSIIntEnable(SSI0_BASE, SSI_TXFF); //  | SSI_RXFF | SSI_RXTO | SSI_RXOR

}

int main(void)
{
    FPULazyStackingEnable();
    SysCtlClockSet(SYSCTL_SYSDIV_20 | SYSCTL_USE_PLL | SYSCTL_XTAL_16MHZ | SYSCTL_OSC_MAIN); //10 MHz
    ui32sysclock = SysCtlClockGet();
    initSSI();
    EnableTiming();
    IntMasterEnable();

    while(1){

        for(int i = 0; i < 4; i++){
            SSIDataPutNonBlocking(SSI0_BASE, 0xF0);
            SSIDataGetNonBlocking(SSI0_BASE, &ssirecv);
        }
        SysCtlDelay(10000);
        // timing measurement
        cstart = HWREG(DWT_BASE + DWT_O_CYCCNT);
        Dummy();
        cstop = HWREG(DWT_BASE + DWT_O_CYCCNT);
        exe_time = cstop - cstart;
    } //  set here a breakpoint
}

void EnableTiming(void)
{
    static int enabled = 0;

    if (!enabled) {
        HWREG(NVIC_DBG_INT) |= 0x01000000;  /* enable TRCENA bit in NVIC_DBG_INT */
        HWREG(DWT_BASE + DWT_O_CYCCNT) = 0; /* reset the counter */
        HWREG(DWT_BASE) |= 0x01;            /* enable the counter */
        enabled = 1;
    }
}

Petrei
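
The `cstop - cstart` measurement in the attachment also counts the call and return around `Dummy()` and the counter reads themselves. One way to estimate that fixed overhead is to time an empty function the same way and subtract it. A minimal sketch in portable C, with the counter read passed in as a function pointer so the logic can be tried on a PC; on the target, the reader would wrap the `HWREG(DWT_BASE + DWT_O_CYCCNT)` read from the file above. Every name here is hypothetical, and the `fake_*` functions are PC-only stand-ins, not real hardware behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*read_cycles_fn)(void);
typedef void (*work_fn)(void);

/* Calibration target: costs only the call and return. */
static void empty_work(void) {}

/* Raw cycle count around one call to work(). Unsigned subtraction is
 * modulo 2^32, so one rollover of a 32-bit up-counter is handled. */
uint32_t time_region(read_cycles_fn read_cycles, work_fn work)
{
    assert(read_cycles != NULL && work != NULL);
    uint32_t start = read_cycles();
    work();
    uint32_t stop = read_cycles();
    return stop - start;
}

/* Same measurement with the fixed overhead of an empty call removed. */
uint32_t time_region_net(read_cycles_fn read_cycles, work_fn work)
{
    uint32_t overhead = time_region(read_cycles, empty_work);
    uint32_t gross = time_region(read_cycles, work);
    return (gross > overhead) ? (gross - overhead) : 0u;
}

/* ---- PC-only stand-ins for trying the logic off-target ---- */

/* Starts just below the 32-bit limit so the rollover path is exercised. */
static uint32_t fake_counter = 0xFFFFFFF0u;

/* Pretend each counter read itself costs 4 cycles. */
uint32_t fake_read(void)
{
    uint32_t value = fake_counter;
    fake_counter += 4u;
    return value;
}

/* Pretend the code under test takes exactly 13000 cycles. */
void fake_work(void)
{
    fake_counter += 13000u;
}
```

On the PC stand-ins, `time_region(fake_read, fake_work)` includes the 4-cycle read cost, while `time_region_net(fake_read, fake_work)` removes it. On real hardware the overhead varies slightly from run to run, so the net figure is an estimate rather than an exact count.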

0 Robert Adsett over 11 years ago in reply to Brian McCarthy59237

Guru 27665 points

If it is like the 123 there is a prefetch buffer on flash only. So executing from RAM will run without the prefetch buffer for good and for ill.

The prefetch buffer on the 123 series does recognize branches and stops fetching, allowing for the possibility of a backwards branch staying in the buffer, so it has some cache-like behaviour. Amit's description suggests the 129 does not do this, but I suspect it does. This only applies to flash, though.

Note that in flash your bne at the bottom of the loop could result in a wait as the prefetch buffer's first entry is read (loading the first instruction again) if the loop does not fit into the prefetch or if branches unconditionally flush the prefetch queue.

Robert

0 Brian McCarthy59237 over 11 years ago in reply to Petrei

Prodigy 130 points

Petrei

According to a post by Amit the SRAM on the 129 takes a wait state. Did you run this code on the 123 out of Flash or SRAM? Does the SRAM have the same penalty as it does on the 129?

I will try using your .c file to make the measurements.

Thanks,

Brian

0 Petrei over 11 years ago in reply to Brian McCarthy59237

Guru 26105 points

Hi,

On 123 I am running on flash. I will try also on a 129, also from flash, I am curious...

Petrei

0 SourceTwo over 11 years ago in reply to Petrei

Expert 2355 points

Hi Petrei,

Your cycle count on the '123 is as expected - I am waiting with bated breath to see if the '129 does as well. I really hope so, as future plans need more performance (much greater than 120MHz would be nice) and I sure don't want to port code to another vendor;-(

Thank you for checking this on the '129!

Regards,

Dave

0 Petrei over 11 years ago in reply to SourceTwo

Guru 26105 points

Hi,

I have checked today also on TM4C129 - the result was almost the same - 13012 cycles.

Please run the same to check/verify before jumping to possibly wrong conclusions.

Petrei

0 SourceTwo over 11 years ago in reply to Petrei

Expert 2355 points

Thank you!

I am grateful for the first '129 performance data that I have seen.

Regards,

Dave

0 Brian McCarthy59237 over 11 years ago in reply to Petrei

Prodigy 130 points

Hi Petrei/Dave/Amit/cb_1

Thank you very much to all for your help. I too confirm that I get single cycle when running out of Flash at 120 MHz on the 129. Being mostly a DSPer, I'm a little surprised/baffled that I can get better performance from the flash than I can get out of the SRAM. In all sincerity, if someone knows what advantages the SRAM brings, I'd be interested to hear what they are so I will know how to leverage it in the future.

Thanks Again!

0 Petrei over 11 years ago in reply to Brian McCarthy59237

Guru 26105 points

Hi,

Brian, since you are interested, I attach below a document from ARM describing DSP capabilities of Cortex-M4.

8662.Developing_Advanced_Signal_Processing_Software_on_the_Cortex-M4_Processor.pdf

Enjoy! (and please mark this thread as answered).

Petrei

0 Brian McCarthy59237 over 11 years ago in reply to Petrei

Prodigy 130 points

Thank you again.

0 Amit Ashara over 11 years ago in reply to Brian McCarthy59237

TI__Guru**** 244400 points

Hello Brian

For a small loop, code in Flash would be read from the prefetch buffer, giving single-cycle latency, but larger code may cause the prefetch buffer to flush and refill too often. From SRAM this would always be a two-clock-cycle latency, irrespective of the system clock frequency.

Regards

Amit

0 SourceTwo over 11 years ago in reply to Amit Ashara

Expert 2355 points

Hi Amit/Brian,

Amit - I am surprised by the claim of a two-cycle latency on the SRAM, both because it would be unique to the '129 and because the data sheet claims "256 KB single-cycle System SRAM". Every M3 & M4 I have looked at from every vendor has single-cycle (no-wait-state) SRAM. In Brian's case, with no data read/write occurring to SRAM, I would have expected the best possible performance. The only explanation that makes sense to me is that the SRAM may not have the best bus-connection (i-fetch optimized) to the core?

Brian - when you successfully obtained the 130xx-cycle count out of flash, were you using the same code sequence you tried earlier from SRAM? Is your SRAM vs. Flash measurement apples-to-apples?

Amit - Please note that "8.2.5 Bus Matrix Memory Accesses" in the data sheet indicates that the CPU Data Bus cannot access SRAM - certainly a documentation error...

Regards,

Dave

0 Brian McCarthy59237 over 11 years ago in reply to SourceTwo

Prodigy 130 points

Hi Dave

Yes, exactly the same code.

Thanks!

brian
