Hello TI E2E Community:
While optimizing time-critical functions for speed by ordering structure variables, I came across a situation in which a reduction in function instructions can evidently increase the number of cycles it takes the function to execute. Cycles are measured using Timestamp_get32(), which is calibrated before each elapsed time measurement, and “Project/Clean…” is used after code changes with optimization off and prefetch disabled to ensure consistent results. Note that the difference in measurements observed are not the result of the compiler utilizing parallel instructions during delay cycles. In the example below, the function call with it’s 9 instructions consume 61 cycles, but the function call with it’s 10 instructions consume only 53 cycles, which is 8 cycles less for the additional instruction.
The situation appeared when timing a function which accepts a pointer to the structure below which contains the sine and cosine variables, which are used for the sincos() function, and the unused variables a, b, c, and d, which consume 8 words.
//Structure with sine, cosine and unused variables
typedef struct
{
float sine, cosine;
float a, b, c, d;
}SICOSTRUCT;
Since the situation occurred only when sincos() was called, an assembly function identical to sincos(), named sico(), was created in an effort to locate at which line(s) the additional cycles were consumed. It was observed that when the sico() function was simply returned from, the issue remained. The assembly code for sico() follows:
_sico
LRETR
The test function SiCoTest() is called and timed in main.c as highlighted below. Note that the Timestamp_get32() function is calibrated each time the SiCoTest() elapsed time is measured:
SICOSTRUCT sc;
etcount = etsum = 0;
outfile = fopen("c:\\TDB\\SiCoTest.csv", "w");
for (test=0; test<1000; ++test)
{
cal = Timestamp_get32();
cal = Timestamp_get32() - cal;
et = Timestamp_get32();
SiCoTest(&sc);
et = Timestamp_get32() - et - cal;
fprintf(outfile, "Sine, %f, Cosine, %f, ET, %ld\n", sc.sine, sc.cosine, et);
etsum += et;
++etcount;
}
fprintf(outfile, "Sine, %f, Cosine, %f, ET, %ld, Count: %ld, Sum: %ld, Mean: %ld\n",
sc.sine, sc.cosine, et, etcount, etsum, etsum/etcount);
fclose(outfile);
The assembly code generated by the compiler for measuring and calling the SiCoTest() function is given below:
;----------------------------------------------------------------------
; 203 | et = Timestamp_get32();
;----------------------------------------------------------------------
LCR #_xdc_runtime_Timestamp_get32__E ; [CPU_] |203|
MOVW DP,#_et ; [CPU_U]
MOVL @_et,ACC ; [CPU_] |203|
;----------------------------------------------------------------------
; 204 | SiCoTest(&sc);
;----------------------------------------------------------------------
MOVL XAR4,#_sc ; [CPU_U] |204|
LCR #_SiCoTest ; [CPU_] |204|
;----------------------------------------------------------------------
; 205 | et = Timestamp_get32() - et - cal;
;----------------------------------------------------------------------
LCR #_xdc_runtime_Timestamp_get32__E ; [CPU_] |205|
When the sine and cosine variables are given first in the structure passed to SiCoTest(), the 9 assembly code instructions below are generated by the compiler for SiCoTest() and takes 61 cycles to execute:
_SiCoTest:
ADDB SP,#2 ; [CPU_U]
MOVL *-SP[2],XAR4 ; [CPU_] |162|
;----------------------------------------------------------------------
; 166 | sico(1, &sc->sine, &sc->cosine); //sine,cosine first=61 a,b,c,d first=53
;----------------------------------------------------------------------
MOVL XAR5,*-SP[2] ; [CPU_] |166|
MOVL XAR4,*-SP[2] ; [CPU_] |166|
MOVIZ R0H,#16256 ; [CPU_] |166|
ADDB XAR5,#2 ; [CPU_] |166|
LCR #_sico ; [CPU_] |166|
SUBB SP,#2 ; [CPU_U]
LRETR ; [CPU_]
When the unused variables a, b, c, and d are given first in the structure passed to SiCoTest(), the 10 assembly code instructions below are generated by the compiler for SiCoTest(), which includes the additional instruction highlighted. However, with the ADDITIONAL instruction this code executes in 53 cycles, 8 cycles LESS than that above which executes in 61 cycles having one less instruction:
_SiCoTest:
ADDB SP,#2 ; [CPU_U]
MOVL *-SP[2],XAR4 ; [CPU_] |162|
;----------------------------------------------------------------------
; 166 | sico(1, &sc->sine, &sc->cosine); //sine,cosine first=61 a,b,c,d first=53
;----------------------------------------------------------------------
MOVL XAR5,*-SP[2] ; [CPU_] |166|
MOVL XAR4,*-SP[2] ; [CPU_] |166|
MOVIZ R0H,#16256 ; [CPU_] |166|
ADDB XAR5,#10 ; [CPU_] |166|
ADDB XAR4,#8 ; [CPU_] |166|
LCR #_sico ; [CPU_] |166|
SUBB SP,#2 ; [CPU_U]
LRETR ; [CPU_]
The additional instruction above is the result of the sine variable no longer being the first variable in the structure and, thus, an offset to sine must be added to XAR4. However, when sine is first, there is no need for the additional instruction to add an offset to XAR4 for the pointer to sine passed to sico() and originally sincos().
It is worthy to note that when executing sincos() the timing results were 192 cycles when sine and cosine were given first and 184 cycles when a, b, c, and d were given first, which is consistent with the odd results above of fewer cycles with more instructions.
Also, when sincos() and sico() were not executed and sine and cosine were simply set to 1 and 2 in SiCoTest(), 1 less cycle was consumed when the 1 less instruction was executed as expected.
Note that when prefetch was enabled, the results were as expected with 1 less (not 8 more) cycle when 1 less assembly instruction existed (i.e. when sine and cosine are located before a, b, c, d in the structure passed to SiCoTest()). Also, with prefetch enabled, there were approximately 100 cycles saved when executing sincos() which is consistent with the results reported at http://e2e.ti.com/support/microcontrollers/c2000/f/171/t/201078.aspx.
HOWEVER with prefetch enabled, adding other variables to a similar structure in which those added variables are used, also generated unexpected timing results. For example, additional cycles were consumed when used variables (residing in the first 8 words of the structure and can be accessed by a single instruction instead of two) are placed before the unused variables.
I would like to know how it is possible that code which contains an ADDITIONAL instruction can execute in 8 FEWER cycles. Note that these results are very consistent and repeatable, even after recycling power to the entire system and “cleaning” the code after code changes.
Is it possible that Timestamp_get32() is the root of the problem even though it is calibrated each time before the elapsed time is measured?
Thank you,
Tim Ball
TDB Consulting