This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

Reducing Assembly Instructions Increases Cycles

Other Parts Discussed in Thread: SYSBIOS

Hello TI E2E Community:

While optimizing time-critical functions for speed by ordering structure variables, I came across a situation in which a reduction in function instructions can evidently increase the number of cycles it takes the function to execute. Cycles are measured using Timestamp_get32(), which is calibrated before each elapsed time measurement, and “Project/Clean…” is used after code changes with optimization off and prefetch disabled to ensure consistent results. Note that the difference in measurements observed are not the result of the compiler utilizing parallel instructions during delay cycles. In the example below, the function call with it’s 9 instructions consume 61 cycles, but the function call with it’s 10 instructions consume only 53 cycles, which is 8 cycles less for the additional instruction.

 

The situation appeared when timing a function which accepts a pointer to the structure below which contains the sine and cosine variables, which are used for the sincos() function, and the unused variables a, b, c, and d, which consume 8 words.

//Structure with sine, cosine and unused variables

 typedef struct

{

   float sine, cosine;

   float a, b, c, d;

}SICOSTRUCT;

 

Since the situation occurred only when sincos() was called, an assembly function identical to sincos(), named sico(), was created in an effort to locate at which line(s) the additional cycles were consumed. It was observed that when the sico() function was simply returned from, the issue remained. The assembly code for sico() follows:

_sico

 LRETR

 

The test function SiCoTest() is called and timed in main.c as highlighted below. Note that the Timestamp_get32() function is calibrated each time the SiCoTest() elapsed time is measured:

SICOSTRUCT sc;

etcount = etsum = 0;

outfile = fopen("c:\\TDB\\SiCoTest.csv", "w");

for (test=0; test<1000; ++test)

{

   cal = Timestamp_get32();

   cal = Timestamp_get32() - cal;

   et = Timestamp_get32();

   SiCoTest(&sc);

   et = Timestamp_get32() - et - cal;

   fprintf(outfile, "Sine, %f, Cosine, %f, ET, %ld\n", sc.sine,  sc.cosine, et);

   etsum += et;

   ++etcount;

}

fprintf(outfile, "Sine, %f, Cosine, %f, ET, %ld, Count: %ld, Sum: %ld, Mean: %ld\n",

        sc.sine, sc.cosine, et, etcount, etsum, etsum/etcount);

fclose(outfile);

 

The assembly code generated by the compiler for measuring and calling the SiCoTest() function is given below:

;----------------------------------------------------------------------

; 203 | et = Timestamp_get32();                                               

;----------------------------------------------------------------------

        LCR       #_xdc_runtime_Timestamp_get32__E ; [CPU_] |203|

        MOVW      DP,#_et               ; [CPU_U]

        MOVL      @_et,ACC              ; [CPU_] |203|

;----------------------------------------------------------------------

; 204 | SiCoTest(&sc);                                                        

;----------------------------------------------------------------------

        MOVL      XAR4,#_sc             ; [CPU_U] |204|

        LCR       #_SiCoTest            ; [CPU_] |204|

;----------------------------------------------------------------------

; 205 | et = Timestamp_get32() - et - cal;                                    

;----------------------------------------------------------------------

        LCR       #_xdc_runtime_Timestamp_get32__E ; [CPU_] |205|

 

When the sine and cosine variables are given first in the structure passed to SiCoTest(), the 9 assembly code instructions below are generated by the compiler for SiCoTest() and takes 61 cycles to execute:

_SiCoTest:

   ADDB      SP,#2                 ; [CPU_U]

   MOVL      *-SP[2],XAR4          ; [CPU_] |162|

;----------------------------------------------------------------------

; 166 | sico(1, &sc->sine, &sc->cosine);   //sine,cosine first=61  a,b,c,d first=53                                                   

;----------------------------------------------------------------------

   MOVL      XAR5,*-SP[2]          ; [CPU_] |166|

   MOVL      XAR4,*-SP[2]          ; [CPU_] |166|

   MOVIZ     R0H,#16256            ; [CPU_] |166|

   ADDB      XAR5,#2               ; [CPU_] |166|

   LCR       #_sico                ; [CPU_] |166|

   SUBB      SP,#2                 ; [CPU_U]

   LRETR     ; [CPU_]

 

When the unused variables a, b, c, and d are given first in the structure passed to SiCoTest(), the 10 assembly code instructions below are generated by the compiler for SiCoTest(), which includes the additional instruction highlighted. However, with the ADDITIONAL instruction this code executes in 53 cycles, 8 cycles LESS than that above which executes in 61 cycles having one less instruction:

_SiCoTest:

   ADDB      SP,#2                 ; [CPU_U]

   MOVL      *-SP[2],XAR4          ; [CPU_] |162|

;----------------------------------------------------------------------

; 166 | sico(1, &sc->sine, &sc->cosine);   //sine,cosine first=61  a,b,c,d first=53

;----------------------------------------------------------------------

   MOVL      XAR5,*-SP[2]          ; [CPU_] |166|

   MOVL      XAR4,*-SP[2]          ; [CPU_] |166|

   MOVIZ     R0H,#16256            ; [CPU_] |166|

   ADDB      XAR5,#10              ; [CPU_] |166|

   ADDB      XAR4,#8               ; [CPU_] |166|

   LCR       #_sico                ; [CPU_] |166|

   SUBB      SP,#2                 ; [CPU_U]

   LRETR     ; [CPU_]

 

The additional instruction above is the result of the sine variable no longer being the first variable in the structure and, thus, an offset to sine must be added to XAR4. However, when sine is first, there is no need for the additional instruction to add an offset to XAR4 for the pointer to sine passed to sico() and originally sincos().

 

It is worthy to note that when executing sincos() the timing results were 192 cycles when sine and cosine were given first and 184 cycles when a, b, c, and d were given first, which is consistent with the odd results above of fewer cycles with more instructions.

Also, when sincos() and sico() were not executed and sine and cosine were simply set to 1 and 2 in SiCoTest(),  1 less cycle was consumed when the 1 less instruction was executed as expected.

 

Note that when prefetch was enabled, the results were as expected with 1 less (not 8 more) cycle when 1 less assembly instruction existed (i.e. when sine and cosine are located before a, b, c, d in the structure passed to SiCoTest()). Also, with prefetch enabled, there were approximately 100 cycles saved when executing sincos() which is consistent with the results reported at http://e2e.ti.com/support/microcontrollers/c2000/f/171/t/201078.aspx.

 

HOWEVER with prefetch enabled, adding other variables to a similar structure in which those added variables are used, also generated unexpected timing results. For example, additional cycles were consumed when used variables (residing in the first 8 words of the structure and can be accessed by a single instruction instead of two) are placed before the unused variables.

 

I would like to know how it is possible that code which contains an ADDITIONAL instruction can execute in 8 FEWER cycles. Note that these results are very consistent and repeatable, even after recycling power to the entire system and “cleaning” the code after code changes.

 

Is it possible that Timestamp_get32() is the root of the problem even though it is calibrated each time before the elapsed time is measured?

 

Thank you,

Tim Ball

TDB Consulting

  • Hi Tim,

    You mentioned turning on prefetch - so I assume the code is running from flash.  Here is what I think is happening.  I believe you are seeing the effect of the opcode alignment in flash.   The flash "word" consists of multiple C28x words.  For example on 2833x or 2802x, or 2806x the "flash word" is 4 C28x words. 

    Consider the case of an opcode that consists of 32-bits happens to straddle a "flash word" boundary.  If this happens, and prefetch is not enabled, then it will take two fetches to get the opcode - one to get the low half of the opcode and one to get the other half..  Each of these will see the impact of the weightstates.  If the opcode does not straddle a "flash word" boundary then only one fetch is required to get the opcode.

    So my theory is perhaps the new instruction is changing the alignment of instructions within flash and thus the cycles change.  I'm actually wondering why the delta isn't a multiple of 15 (default waitstates). 

    -Lori

  • Hello Lori:

     

    Thank you for your swift response.

    It appears that your theory about the new instruction changing the alignment of instructions within flash may be correct based on the information below. However, also notice the cycles that result when prefetch is enabled, then after optimization is also turned on, then after optimization for speed is also selected. In all cases the number of cycles increase and decrease as instructions are added.

     

    The same tests as those in the original comment above were performed but by calling sincos() instead of sico() in SiCoTest(). Inline assembly assembly NOPs (i.e. asm(" NOP");) were inserted in the 'C' function SiCoTest() just before the sincos()call, such as that below, to examine the affects of instruction alignments with prefetch disabled and enabled and optimization off and on:

    void SiCoTest(SICOSTRUCT *sc)

    {

       asm(" NOP");

       sincos(1, &sc->sine, &sc->cosine);

    }

     

    In the table below, the left column indicates the number of inline assembly NOPs that were inserted in the 'C' function SiCoTest() just before sincos() is called. The numbers in the columns to the right indicate the cycles that resulted when the unused structure variables sine,cosine (sin,cos) were given first in the structure and when a,b,c,d were given first in the structure. The abbreviations below those headings are “NP” for No Prefetch, “PF” for PreFetch, “P3” for Prefetch with level 3 optimization (no speed vs. size selected), and “P3 Speed 5” for Prefetch with level 3 optimization with Speed setting 5 selected. The numbers in those columns indicate the number of cycles it took SiCoTest() to execute for the number of NOPs with the given setting. For example, in the fourth column, when sine,cosine were given first in the structure with PreFetch enabled and 0 NOPs, it took SiCoTest() 82 cycles to execute. However, when one NOP was included, it took SiCoTest() only 77 cycles to execute. That is, it took 5 cycles less (i.e. 82-77=5) to execute the additional instruction NOP.

     

    Num.   sin,cos    a,b,c,d    sin,cos   a,b,c,d   sin,cos   a,b,c,d   sin,cos

    NOPs      NP         NP         PF        PF        P3        P3     P3 Speed 5

     0        184        196        82        82        56        60        56

     1        188        200        77        78        76        76        80

     2        200        192        77        78        80        80        76

     3        192        204        77        78        81        85        80

     4        204        196        78        82        82        85        72

     5        196        208        82        86        85        81        77

     6        208        200        86        86        81        85        74

     7        200        212        86        86        85        86        79

     8        212        204        86        83        78        82        80

     

    Notice that the number of cycles increase and decrease as NOPs are inserted for all columns, even with Prefetch enabled with level 3 optimization Speed 5 selected. This leads one to wonder if the TI compiler could produce even faster executing code by inserting NOPs, as needed to increase speed, in the assembly code it generates for C code. For example, in the far right column when there were 3 NOPs inserted before calling sincos(), SiCoTest() executed in 80 cycles. However, when one additional NOP was ADDED, SiCoTest() executed in just 72 cycles, 8 cycles less with an added NOP.

     

    You might have noticed in the original comment that when executing sincos(), the timing results were 192 cycles when sine and cosine were given first and 184 cycles when a, b, c, and d were given first. However, for some unknown reason, the results were later (in this comment) consistently 184 cycles with sine,cosine first and 196 cycles with a,b,c,d first. These results were confirmed several times for accuracy.

    Therefore, I would also like to know why the same code compiled at a later date can, evidently randomly, produce timing results that are about 8 to 12 cycles different (e.g. 192-184=8 and 196-184=12) than that compiled previously. Note too that I also tried using the same .metadata files for compiling but there are still the slight differences in timing results.

     

    I am also curious why after compiling code about 25 times with Code Composer Studio (CCS), the program crashes and terminates while compiling code. After restarting CCS and loading the debugger, an error occurs during the configuration phase and CCS must be shut down again and restarted. After restarting CCS for the second time, and connecting with the debugger, the projects for the processor cores must be grouped again.

     

    Thank you for your time and efforts,

    Tim

  • Tim,

    Timothy Ball said:
    I am also curious why after compiling code about 25 times with Code Composer Studio (CCS), the program crashes and terminates while compiling code. After restarting CCS and loading the debugger, an error occurs during the configuration phase and CCS must be shut down again and restarted. After restarting CCS for the second time, and connecting with the debugger, the projects for the processor cores must be grouped again.

    For the CCS issue I'm going to have to refer you to the CCS forum (http://e2e.ti.com/support/development_tools/code_composer_studio/f/81/t/3131)

    They can hopefully help to track down the issue.  I know that is very frustrating and can feel your pain.  

    For the cycle question - you have posted a lot of information to digest and I'm afraid it will take me a while to get through it all to comment intelligently. 

    Have you run a similar experiment in RAM?  This would remove the dependency on the flash prefetch, wordsize and waitstates.

    All three of these will come into play in flash performance.  For example if code is mostly 32-bit opcodes then the prefetch buffer will fill faster and you may see additional cycles.  As mentioned earlier alignment can also change the cycles slightly.  On our newer devices the word size and prefetch size has been increased to try to combat this. 

    Timothy Ball said:
    Therefore, I would also like to know why the same code compiled at a later date can, evidently randomly, produce timing results that are about 8 to 12 cycles different (e.g. 192-184=8 and 196-184=12) than that compiled previously. Note too that I also tried using the same .metadata files for compiling but there are still the slight differences in timing results.

    This has me confused.  Is it the exact same code, same memory alignment, same compiler version, same compiler options, same flash configuration etc...?

  • Hello Lori:

    Yes, it is the exact same code with all settings the same.

    Is it possible that the different counts are related to the Timestamp_get32 function between compilings?

    The disassembly code calls for a single et = Timestamp_get32() call follows:

    761F02BF    MOVW         DP, #0x2bf

    1E0A        MOVL         @0xa, ACC

    1030        et = Timestamp_get32();

    765116FB    LCR          xdc_runtime_Timestamp_get32__F

    761F02BF    MOVW         DP, #0x2bf

    1E0C        MOVL         @0xc, ACC

           xdc_runtime_Timestamp_get32__F:

    765116F8    LCR          xdc_runtime_Timestamp_SupportProxy_get32__E

    0006        LRETR    

       

           xdc_runtime_Timestamp_SupportProxy_get32__E:

    765115F1    LCR          ti_sysbios_family_c28_f28m35x_TimestampProvider_get32__E

    0006        LRETR        

           ti_sysbios_family_c28_f28m35x_TimestampProvider_get32__E:

    8F004E0C    MOVL         XAR4, #0x004e0c

    06C4        MOVL         ACC, *+XAR4[0]

    62      }

    0006        LRETR        

  • Timothy Ball said:

    Yes, it is the exact same code with all settings the same.

    Is it possible that the different counts are related to the Timestamp_get32 function between compilings?

    Tim,

    I don't think it should be.  The code for the function Timestamp_get32 should be the same in both cases as well. 

    The compile is on the same machine / CCS setup as well?  

    -Lori

  • Timothy Ball said:
    it is the exact same code with all settings the same.

    So you have two builds that are the same in every respect.  But they perform differently.  Is that right?  I recommend you check on whether those two builds match by using the objdiff utility in the cg_xml package.

    Thanks and regards,

    -George

  • Hello George:
    Yes, the same code and settings in all respects.
    I will compare the .out files generated as you suggest and then post a working example.
    Thank you,
    Tim
  • As a background task during the last few weeks, code has been occasionally recompiled and executed in an effort to demonstrate that the number of cycles measured can occasionally randomly change by just recompiling the code as described above on 12/3/2014. Other engineers have reported this same issue when compiling code that is identical with identical settings.
    Unfortunately, the issue has not repeated itself yet. When cycles do change after a recompile, I will be sure to provide an update and where in the .out files created that differences exist.