CCS/TMS320F28335: the execution time of the same function differs noticeably when called from different files

Part Number: TMS320F28335

Tool/software: Code Composer Studio

Hi,

I find that the execution time of the same function differs noticeably when it is called from different files. Why? Is this normal?

Description:

Conditions: CCS 6.0; TMS320F28335/TMS320LF2407

I want to test the arithmetic speed of the TMS320F28335, so I wrote a simple function to run. But the execution time of the same function differs depending on whether it is defined in main.c or in another file (not main.c):

A) The function is defined in main.c, and the execution time is 1.6 us.

B) The function is defined in sub_program.c, and the execution time is 3.6 us.

I see the same situation when I test the TMS320LF2407.

A)

#include "DSP2833x_Device.h"     // DSP2833x Headerfile Include File
#include "DSP2833x_Examples.h"   // DSP2833x Examples Include File
#include "DSP28335_ex.h"
#include "para_space.h"
#include "myinclude.h"

void sub_program3(void)
{

  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
///////////////////////////////////////////////////////////////////////////////////
}

void main(void)
{
   InitSysCtrl();  //PLL, WatchDog,Clocks
   InitXintf16Gpio(); // configure IO, needs revision
   DINT;  // Disable CPU interrupts
   InitPieCtrl();   // Disable PIE interrupts,clear all PIE interrupt flags:
   IER = 0x0000;
   IFR = 0x0000;  // Disable CPU interrupts and clear all CPU interrupt flags:
   InitPieVectTable();  //Initialize the PIE vector table,open PIE interrupt
   configtestled();    // configure IO, needs revision
   LED1=1;
   //DELAY_time(10);
   LED1=1; LED2=1;  LED3=0;  LED4=0;
   MemCopy(&RamfuncsLoadStart, &RamfuncsLoadEnd, &RamfuncsRunStart);
   InitFlash();
   //DELAY_US(1)=10.5us

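   while(1)
   {
      // Measure the IO toggle period with an oscilloscope to test CPU speed.
      // Call sub_program() or sub_program3(): the two functions are identical but are defined in different files.
      // With the same code, why is calling sub_program() so much slower than calling sub_program3()?
      LED3=~LED3;             // toggle the IO pin level
      sub_program();          // measured delay time 3.6us (function defined in another file)
      //sub_program3();       // measured delay time 1.6us (function defined in main.c)
   }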

}

B)

#include "DSP2833x_Device.h"     // DSP2833x Headerfile Include File
#include "DSP2833x_Examples.h"   // DSP2833x Examples Include File
#include "DSP28335_ex.h"
#include "para_define.h"
#include "myinclude.h"

///*
void sub_program()
{
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
  //////////////////////////////////////////////////////////////////////////////////////

///////////////////////////////////////////////////////////////////////////////////
 }

  • The way the compiler works is that when no optimization is applied it has no information about where the globals are when the function is in a different file, so it performs a DP load for each variable access. In contrast, when the function is in the main file, it performs one DP write, and everything else is a single instruction cycle, so the cycle count is much lower. I'm seeing 130 and 355 cycles respectively, which is a little lower than you seem to be getting.

    If you set optimization to -o2, the compiler will recognize the redundancy in the code and just perform one line. The counts will go down to about 22 and 26 respectively. There may be a few extra cycles for the function in a different file because the extra DP loads don't go away completely.

    The numbers you're getting are reasonable. I recommend applying level -o2 optimization at least.

    Regards,

    Richard
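
To compare the cycle counts in the reply above with the oscilloscope times in the question, the conversion can be sketched in plain C. This is only an illustration: it assumes the 150 MHz TMS320F28335 clock quoted elsewhere in this thread, and `cycles_to_us` and `us_to_cycles` are hypothetical helper names, not part of any TI library.

```c
/* Hypothetical helpers for comparing cycle counts with measured times.
 * The clock frequency is passed in MHz; 150 MHz is an assumption taken
 * from the figure quoted in this thread for the TMS320F28335. */

/* Convert a CPU cycle count to microseconds. */
double cycles_to_us(unsigned long cycles, double sysclk_mhz)
{
    return (double)cycles / sysclk_mhz;
}

/* Convert a measured time in microseconds to CPU cycles. */
double us_to_cycles(double time_us, double sysclk_mhz)
{
    return time_us * sysclk_mhz;
}
```

Under that assumption, 130 and 355 cycles correspond to roughly 0.87 us and 2.37 us, while the measured 1.6 us and 3.6 us correspond to about 240 and 540 cycles. The measured loop also contains the LED toggle, the function call, and the loop branch, which may account for part of the gap.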
  • BTW, did you have floating-point enabled when you ran the test (float_support=fpu32)? I just assumed all the temp variables were long integers when I ran the function.

    When FPU support is not enabled, you can also try the --postlink_opt option to eliminate DP loads. If the FPU is enabled, it can only issue advice because of the unprotected pipe. The postlink_opt option is a linktime optimization option and comes after the linker switch (-z). This would allow you to get rid of those extra DP loads.

    Also, C2000 would perform much better if you were using an array instead of separate globals. Some of C2000’s main performance strengths are the ability to directly access memory operands in arithmetic instructions (as opposed to having to load the values into registers), and the auto-incremented address register operands. Using an array instead of temps should highlight some of this.

    Regards,

    Richard
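
The array suggestion above could look roughly like the following. This is a minimal sketch, not the original poster's code: `temps`, `fill_temps`, and `sub_program_array` are hypothetical names, and `int32_t` from `<stdint.h>` stands in for the `int32` typedef used in the TI headers.

```c
#include <stdint.h>

#define NUM_TEMPS   5    /* replaces temp1..temp5 */
#define NUM_REPEATS 10   /* the original test repeats the pattern 10 times */

/* Hypothetical replacement for the five separate globals. */
int32_t temps[NUM_TEMPS];

/* Write the same value to every element through one moving pointer,
 * so the compiler can use a single address register with
 * post-increment instead of addressing each global separately. */
static void fill_temps(int32_t value)
{
    int32_t *p = temps;
    int i;

    for (i = 0; i < NUM_TEMPS; i++) {
        *p++ = value;
    }
}

/* Same store pattern as the test function: all zeros, then all ones. */
void sub_program_array(void)
{
    int r;

    for (r = 0; r < NUM_REPEATS; r++) {
        fill_temps(0);
        fill_temps(1);
    }
}
```

Whether this is actually faster depends on the optimization level and on where the array is linked, so it would need to be measured on the target.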
  • Dear Richard,
    I am the original poster of this topic.
    I did not use floating-point arithmetic. I just declared and defined the variables as follows:
    extern int32 temp1;
    extern int32 temp2;
    extern int32 temp3;
    extern int32 temp4;
    extern int32 temp5;
    int32 temp1=1;
    int32 temp2=1;
    int32 temp3=1;
    int32 temp4=1;
    int32 temp5=1;
  • I also used a TMS320LF2407 DSP to test the same function. The test platform is CCS 3.1.

    void sub_program()
    {
    temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
    temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
    temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
    temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
    temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
    temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
    temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
    temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
    temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
    temp1=0;temp2=0;temp3=0;temp4=0;temp5=0;temp1=1;temp2=1;temp3=1;temp4=1;temp5=1;
    //////////////////////////////////////////////////////////////////////////////////////

    ///////////////////////////////////////////////////////////////////////////////////
    }
    On the TMS320LF2407, when I call the function from the same file (main.c), the measured execution time is 3.2 us.
    So there is a new problem. The TMS320LF2407 runs at 40 MHz and the TMS320F28335 runs at 150 MHz, so if the execution time on the TMS320LF2407 is 3.2 us, the TMS320F28335 should take about 0.85 us (3.2 us x 40/150). But the execution time I measure on the TMS320F28335 is 1.6 us!
    When I call the same function from another file (subprogram.c) on the TMS320LF2407, the measured execution time is 8.8 us.
    This is a serious problem, because motor control algorithms need a high-performance DSP. With a 10 kHz carrier frequency, my algorithm has only 100 us to run.
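
The arithmetic in the post above can be written out as a small C sketch. The helper names `scale_time_us` and `cycles_per_period` are hypothetical, and the scaling assumes both devices need the same number of cycles for the function, which is the poster's assumption and may not hold across two different CPU architectures and compilers.

```c
/* Naive estimate: the cycle count stays the same and only the clock
 * changes. Frequencies are in MHz, times in microseconds. */
double scale_time_us(double time_us, double from_mhz, double to_mhz)
{
    return time_us * from_mhz / to_mhz;
}

/* Number of CPU cycles available in one carrier period. */
double cycles_per_period(double carrier_hz, double sysclk_mhz)
{
    return sysclk_mhz * 1.0e6 / carrier_hz;
}
```

With the figures from the post, 3.2 us at 40 MHz scales to about 0.85 us at 150 MHz, and a 10 kHz carrier at 150 MHz leaves 15000 cycles per 100 us period.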
  • My platform for the TMS320F28335 is CCS 6.0.
    I set the optimization option "--opt _info_-on 2", but it had no effect: the measured delay is still 3.6 us when I call the function from another file, compared with 1.6 us when the same function is called from the same file (main.c).
    I then set "--opt_for_speed-mf 3", which had no effect either.
    I hope you can run the program with my settings to test it.
  • You will need to set the "--opt_level" setting to influence how the C code is optimized.  In CCSv6, right-click on the project, then select "Properties", then go to "CCS Build -> C2000 Compiler -> Optimization".  See the attached screen-shot for the window.

    Regards,

    Richard

  • First of all, thank you for your solution, but my interface looks different (see below), and I didn't see any difference after doing what you suggested, so I think this may not be the cause of my problem. Do you have any other suggestions?

  • OK, thanks.  I am using CCS v6.1.3, so apparently the settings have moved around since v6.0.  I don't have that version on my machine.

    Can you check under the other compiler categories to see if there's an Optimization level setting?  It should be: "--opt_level", or "-O".  If so, can you change it to -o2 and see if that makes a difference?

    Regards,

    Richard

  • Thanks very much! The problem is solved.

    In CCS 6.0, the "--opt_level" setting is found under "CCS Build -> C2000 Compiler -> Basic option".

    Thanks!