clock timing

Rob Barton

Other Parts Discussed in Thread: LAUNCHXL-F28069M, TMS320F28335, TMS320F28069, CONTROLSUITE, TMS320F28069M

Dear Code composer studio,

I am noticing that when I use the clock option found under run, clock, enable that the clock (cycles) that appear in the bottom right corner do not consistently report the same number.

For example, when I single step through every machine level command or if I just press F6 and do a C/C++ level single step, I get two different answers.

The correct answer only seems to occur if I step through every machine instruction (ctrl-shift-f6). When I do a single step at the C/C++ level I see significantly more cycles occurring.

What is the cause of this, and is there any way to get a consistent measure regardless of how I step through the code?

The other option of setting two break points, and letting it run at full speed between the breakpoints also produces a HIGH cycle count compared to what I get if I painstakingly step through every machine cycle.

Thanks.

over 9 years ago

Chester Gillon over 9 years ago

Guru 92251 points

Rob Barton said:
I am noticing that when I use the clock option found under run, clock, enable that the clock (cycles) that appear in the bottom right corner do not consistently report the same number.

Which device are you using, since the implementation of the cycle counter is device family specific?

Also, which version of CCS and which debug probe are you using?

Rob Barton over 9 years ago in reply to Chester Gillon

Expert 2220 points

I am using a c28375D.

We have the XDS200 USB probe.

the version of ccs is

6.1.3.00033

we also have the development kit which is a XDS100v2 built in USB probe on the c28377D (F2837x_180controlcard_R1_1) which I noticed the same issue on, but the main concern is on the final development platform of the c28375D.

Thanks.

Chester Gillon over 9 years ago in reply to Rob Barton

Guru 92251 points

Rob Barton said:
we also have the development kit which is a XDS100v2 built in USB probe on the c28377D (F2837x_180controlcard_R1_1) which I noticed the same issue on, but the main concern is on the final development platform of the c28375D.

I don't have that hardware, but investigated using a LAUNCHXL-F28069M using the built-in XDS100v2 and CCS 6.2.0.00050.

The following simple test program was used:

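/*
 * main.c
 *
 * Minimal test: time a single call to an assembly delay loop.
 */

/* Delay loop written in assembly. The loop counts down in ACC. */
asm("       .def _DSP28x_usDelay");
asm("        .global  _DSP28x_usDelay");
asm("_DSP28x_usDelay:");
asm("        SUB    ACC,#1");
asm("        BF     _DSP28x_usDelay,GEQ    ;; Loop if ACC >= 0");
asm("        LRETR");

extern void DSP28x_usDelay(long LoopCount);

int main(void)
{
    DSP28x_usDelay(10);    /* the call whose Cycle Count is measured */

    return 0;
}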

Where the DSP28x_usDelay assembler function should take 5 cycles per iteration plus 9/10 cycles overhead per call when run in zero wait-state RAM. Tested the Cycle Count displayed for the DSP28x_usDelay (10) call in the following combinations:

Test number	Code running from	"Allow software breakpoints to be used" debug project property	Step mode	Cycle Count	Comment on Cycle Count
1	RAM	Ticked	Step Over	65	Reasonable (10 loop iterations of 5 cycles each plus 15 cycles of overhead which includes setting up the LoopCount argument)
2	RAM	Ticked	Assembly Step Into	64	Reasonable
3	RAM	NOT Ticked	Step Over	1	Bogus - too short
4	RAM	NOT Ticked	Assembly Step Into	64	Reasonable
5	FLASH	Ticked	Step Over	16	Bogus - too short
6	FLASH	Ticked	Assembly Step Into	606	This is 542 more cycles compared to the corresponding test 2 when the code was running from RAM. The test involves 34 reads from flash where each flash read is 15 wait-states, so running from flash would add at least 510 cycles. i.e. this result looks reasonable.
7	FLASH	NOT Ticked	Step Over	16	Bogus - too short
8	FLASH	NOT Ticked	Assembly Step Into	606	This is the same as test 6, and is reasonable

The summary from the above test combinations is:

a) The Cycle Count when single stepping instructions is reasonable regardless of whether Software or Hardware breakpoints are used.

b) The Cycle Count when using Step Over is reasonable when Software Breakpoints are used, but bogus if Hardware Breakpoints are used.

When the code is running in RAM, I believe the "Allow software breakpoints to be used" debug Project Property controls whether CCS uses a Software or Hardware breakpoint, whereas when the code is running in FLASH CCS has to use a Hardware breakpoint.

Not sure if the observed behavior of the Cycle Count being bogus when using Step Over with Hardware Breakpoints is a limitation of the C28xx emulation logic, or a bug in the CCS debugger.
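The arithmetic behind the RAM and FLASH comments in the table can be checked with a few lines of C. This is only a sketch: the function names are made up for illustration, and the figures fed into them (5 cycles per iteration, 15 cycles of overhead, 34 flash reads, 15 wait-states per read) are the ones quoted in the table comments, not independently measured values.

```c
/* Estimated cycle count for a delay loop running from zero wait-state RAM.
 * All three inputs are assumptions supplied by the caller. */
unsigned long ram_cycle_estimate(unsigned long iterations,
                                 unsigned long cycles_per_iteration,
                                 unsigned long overhead_cycles)
{
    return iterations * cycles_per_iteration + overhead_cycles;
}

/* Lower bound on the cycle count when the same code runs from flash:
 * the RAM count plus one wait-state penalty per flash read. */
unsigned long flash_cycle_lower_bound(unsigned long ram_cycles,
                                      unsigned long flash_reads,
                                      unsigned long wait_states_per_read)
{
    return ram_cycles + flash_reads * wait_states_per_read;
}
```

With the table's figures, ram_cycle_estimate(10, 5, 15) gives 65 and flash_cycle_lower_bound(64, 34, 15) gives 574, which sits below the 606 cycles reported when single stepping from FLASH.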

Chester Gillon over 9 years ago in reply to Chester Gillon

Guru 92251 points

Chester Gillon said:
I don't have that hardware, but investigated using a LAUNCHXL-F28069M using the built-in XDS100v2 and CCS 6.2.0.00050.

I repeated the same 8 test combinations, but using a TMS320F28335 device and Blackhawk USB2000 emulator. Got exactly the same number of clock cycles for each of the 8 test combinations as using the TMS320F28069 device and XDS100v2 emulator.

i.e. the problem of the "bogus" clock cycle values isn't specific to a particular device or emulator.

Rob Barton over 9 years ago in reply to Chester Gillon

Expert 2220 points

So the general conclusion is you are seeing a difference when you step over an instruction vs stepping into a function.

I tried your exact same test on my setup for comparison, because I see a difference when I step over a function running in RAM with a software breakpoint. For example, I both stepped over and stepped through:
ui32Space = USBBufferSpaceAvailable((tUSBBuffer *)&g_sTxBuffer)/2;
Stepping over gave a larger count, around 400 cycles; stepping through it gave 70 cycles.

In general this is what I'm seeing throughout my timings on various functions: stepping over gives significantly larger results than stepping into them and running each instruction.
I am using software breakpoints.

It should be noted that when I do the exact same delay function as you are using to test, I do see that the STEP INTO and STEP OUT give the same result, just as you do, but only on that particular function. Perhaps you need to try other functions (I don't know why) but I don't observe the same (Running in ram, with software breakpoints) on all functions, but I do agree with your observation on the simple delay loop function, my code composer is generating the same results you see on THAT test case...

In general I am running in RAM, and using sw breakpoints. I can of course do whatever you instruct me to do, but the problem is, we are very interested in profiling, or knowing how long functions take, and I can't always single step through them all to get an accurate timing, but having observed that I always see a larger value when I step over, sometimes by a HUGE amount (900 cycles more) this means I can't trust the TOOL to take a measurement, but I have no other tool right now to 'profile' or time functions in terms of cycles.

What would be involved in determining the cause of this? How can we resolve this problem? (Aside from single stepping through every line of code to get accurate timing)

So the question is:
1. What is the next step? While s/w breakpoints running in ram stepping over or into works on the delay test case, it doesn't work in all cases.

Thanks very much.

Chester Gillon over 9 years ago in reply to Rob Barton

Guru 92251 points

Rob Barton said:
In general I am running in RAM, and using sw breakpoints. I can of course do whatever you instruct me to do, but the problem is, we are very interested in profiling, or knowing how long functions take, and I can't always single step through them all to get an accurate timing, but having observed that I always see a larger value when I step over, sometimes by a HUGE amount (900 cycles more) this means I can't trust the TOOL to take a measurement, but I have no other tool right now to 'profile' or time functions in terms of cycles.

On the Profiling on C28x Targets Wiki page there is the FAQ entry "Q: Cycle counts for 'step-by-step' and 'run to line' do not match?", which could explain why you see a larger value when using step over. That is, the Wiki page suggests that single stepping the code one instruction at a time can report a lower cycle count because the effect of pipeline stalls is not measured.

I guess a test to validate the clock cycles reported by the tool would be to instrument a section of code by setting a GPIO output at the beginning and clearing it at the end of the section. Measuring the duration of the GPIO signal with external test equipment while stepping over the code would allow the real duration to be compared against the number of clock cycles reported by CCS.
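A minimal sketch of that GPIO bracketing idea follows. It is written so it builds on a host PC: fake_gpio_data stands in for the device's memory-mapped GPIO register, and every name in it (fake_gpio_data, TIMING_PIN_MASK, run_bracketed, sample_workload) is hypothetical. On a real target the two pin helpers would write the device's own GPIO set and clear registers instead.

```c
#include <stdint.h>

/* Stand-in for a memory-mapped GPIO data register so the sketch builds on
 * a host PC. The name and the bit position are hypothetical. */
volatile uint32_t fake_gpio_data = 0;
#define TIMING_PIN_MASK ((uint32_t)1u << 0)

void timing_pin_high(void) { fake_gpio_data |= TIMING_PIN_MASK; }
void timing_pin_low(void)  { fake_gpio_data &= ~TIMING_PIN_MASK; }
int  timing_pin_is_high(void) { return (fake_gpio_data & TIMING_PIN_MASK) != 0; }

/* Run the code under test with the timing pin held high for its duration. */
void run_bracketed(void (*code_under_test)(void))
{
    timing_pin_high();    /* rising edge: start of the timed section */
    code_under_test();
    timing_pin_low();     /* falling edge: end of the timed section */
}

/* Placeholder workload: records whether the pin was high while it ran. */
int pin_was_high_during_workload = 0;
void sample_workload(void)
{
    pin_was_high_during_workload = timing_pin_is_high();
}
```

Calling run_bracketed(sample_workload) produces one pulse on the timing pin. The pulse width seen on external equipment includes the time taken to set and clear the pin, so it is a slight overestimate of the code under test.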

Chester Gillon over 9 years ago in reply to Rob Barton

Guru 92251 points

Rob Barton said:
It should be noted that when I do the exact same delay function as you are using to test, I do see that the STEP INTO and STEP OUT give the same result, just as you do, but only on that particular function. Perhaps you need to try other functions (I don't know why) but I don't observe the same (Running in ram, with software breakpoints) on all functions,

I changed my test to measure both:

a) The ControlSuite DELAY_US delay macro, which uses the previous delay function.

b) A function which performs floating point multiplies, where each element in an output array is the square of the corresponding element in an input array.

The Cycle Counter value for each piece of code tested was measured using both:

a) Step over, i.e. code run at full speed. A GPIO was set at the beginning and cleared at the end, and used a LSA to measure the actual elapsed time.

b) Single stepping each instruction. To automate the single stepping of instructions a GEL script was used to repeatedly call GEL_AsmStepInto() until the final instruction was reached. The GEL script also counted the number of instructions executed.

The target device was a TMS320F28069M with the ControlSuite startup code used to set the clock to the maximum 90 MHz supported by the device. The internal oscillator was used. The code was running in RAM and software breakpoints were used.

The test results are:

Code timed	Clock Cycles from Step Over	LSA measured duration of Step Over	Clock Cycles from Single Stepping instructions	Num instructions from single stepping	Comment
DELAY_US (100)	9016	99.800 us	9011	3608	The cycle count from single stepping is only 5 cycles less than from stepping over. Converting from cycle counts to time, using the 90MHz CPU clock, the step over equates to 100.2 us and the single step equates to 100.1 us. The code tested should generate a fixed 100us delay and the LSA measurement and cycle counts from the step over are consistent with the expected delay. i.e. this is a sanity check of the timing mechanism.
Square of 10 element vector	272	2.950 us	197	131	The cycle count from single stepping is 75 cycles less than from stepping over. Converting from cycle counts to time the step over equates to 3.022 us and the single step equates to 2.19 us.
Square of 100 element vector	2520	27.840 us	1815	1209	The cycle count from single stepping is 705 cycles less than from stepping over. Converting from cycle counts to time the step over equates to 28.0 us and the single step equates to 20.17 us.
Square of 1000 element vector	25021	277.020 us	18016	12009	The cycle count from single stepping is 7005 cycles less than from stepping over. Converting from cycle counts to time the step over equates to 278.01 us and the single step equates to 200.17 us.

The conclusions for the above are:

1) With code which performs floating point operations as well as reading and writing memory, I have reproduced your observation that the cycle count reported when stepping each instruction is significantly less than the cycle count when stepping over.

2) When the cycle counts are compared against the actual elapsed time measured externally to the device, the cycle count from stepping over matches the external measurement but the cycle count from single stepping under-reads.

In summary, suggest you can trust the tool to report a valid cycle count when you step over, such that the code is running at full speed.

As per the Q: Cycle counts for 'step-by-step' and 'run to line' do not match? Wiki page when single stepping instructions the impact of pipeline stalls is not measured.
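The conversion from cycle counts to time used in the table is simply cycles divided by the CPU clock. A small sketch of that conversion, plus a percentage comparison against an external measurement, is shown below. The function names are illustrative only, and the 90 MHz clock is the nominal value stated above; the internal oscillator's real frequency may differ slightly.

```c
/* Convert a profile-clock cycle count to microseconds for a given CPU
 * clock in Hz. cpu_hz is assumed to be the nominal clock, e.g. 90.0e6. */
double cycles_to_us(unsigned long cycles, double cpu_hz)
{
    return (double)cycles * 1.0e6 / cpu_hz;
}

/* Difference between a cycle-derived time and an externally measured
 * time, as a percentage of the measured time. Negative means the cycle
 * count under-reads. */
double percent_error(double derived_us, double measured_us)
{
    return (derived_us - measured_us) * 100.0 / measured_us;
}
```

For the 1000 element row, the step over count of 25021 cycles converts to about 278.0 us, well under 1% away from the 277.020 us LSA measurement, while the single stepping count of 18016 cycles converts to about 200.2 us, roughly 28% short of it.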

For reference, the test project is attached 1513.Example_2806xGpioSetup.zip. The zip file also contains the modified f28029.gel GEL script used. Under the "Profile Clock Tests" menu there are functions used to run the code. I performed the test by:

- Manually run the code to the start of a test block in the main function (run to the line which sets the GPIO output). Manually reset the profile clock.

- Execute one of the GEL script menu options to run the code to the end of the test block. Used one of the "*_STEP_OVER" options to step over and one of the "*_STEP_TO_END" options to single step.

- Wait for the code to run to the end point for the test. This is "instant" with "*_STEP_OVER" option but can take several seconds with a "*_STEP_TO_END" option. When the "*_STEP_TO_END" option has finished the GEL script reports the number of instructions stepped to the console. Manually read the profile clock to get the number of cycles from the tested code.

[I did try and fully automate the test, but couldn't find a way to read the profile clock in GEL. Believe an automated test would be possible with Debug Server Scripting.]

Rob Barton over 9 years ago in reply to Chester Gillon

Expert 2220 points

I tested this on some assembly functions written here, for which I manually counted how many clock cycles the routine is supposed to take, since the reason for writing them was to optimize it to a very small tight loop of 13(N-1)+30 clock cycles. I measure exactly that when I single step through the assembly instructions one by one, but when I step over it I see the same growth seen in your experiments.

So to say that I should trust the step over result because stepping into it is undermeasuring it, is confusing, because in this hand crafted test case, it SHOULD take the measured 13*(N-1)+30 clock cycles to run the routine, not excessively more.

So that is partly where the question stemmed from, and I am noticing it elsewhere too. In the hand crafted assembly routine, where I've made the effort to count and know the cycles, the count matches single stepping through it, not stepping over it.

Any thoughts?

There should not be any pipeline issues in my hand crafted assembly assuming I did it correctly, and since it's a fairly simple loop.

Any thoughts?
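The hand count quoted above, 13(N-1)+30 cycles, can be turned into a small helper for comparing against whatever the profile clock reports. This is only a sketch: the function names are invented for illustration, N is assumed to be at least 1, and the thread gives no measured step over figure for this routine, so any value passed as reported_cycles is the caller's own reading.

```c
#include <assert.h>

/* Hand-counted cycle estimate from the thread: 13*(N-1)+30, for N >= 1. */
unsigned long expected_loop_cycles(unsigned long n)
{
    assert(n >= 1UL);    /* the formula is only meaningful for N >= 1 */
    return 13UL * (n - 1UL) + 30UL;
}

/* Cycles reported by the profile clock beyond the hand count.
 * A positive result means the tool reported more than the hand count. */
long excess_cycles(unsigned long reported_cycles, unsigned long n)
{
    return (long)reported_cycles - (long)expected_loop_cycles(n);
}
```

For example, expected_loop_cycles(10) evaluates to 147. Dividing the excess by the expected count for several values of N would show whether the extra cycles grow with the loop length or stay a fixed per-call overhead.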

Chester Gillon over 9 years ago in reply to Rob Barton

Guru 92251 points

Rob Barton said:
but when I step over it I see the same growth seen in your experiments.

In your environment are you able to measure the actual execution time, and then compare the actual execution time against the result of (CPU cycles / CPU frequency) ?

On my example code the actual execution time matched that derived from the number of CPU cycles when stepping over, which led me to conclude that the number of CPU cycles when stepping over was reporting the "correct" value.

Rob Barton said:
There should not be any pipeline issues in my hand crafted assembly assuming I did it correctly, and since it's a fairly simple loop.

Unfortunately I have never had to hand craft C28xx assembler, and so can't offer advice on pipeline issues. Suggest you will get a better answer by asking the device experts on the C2000 forum.
