performance measurements on TMS570 processor

Hi

I am using a TMS570 processor to measure the cycle counts of basic functions such as FIR and FFT.

I am using the Keil MDK platform and the following code to count cycles.

I get different cycle counts (almost doubled) for the same function when I run it on a Cortex-R4 (NXP FCC2) and on a Cortex-R4 (TMS570).

Can you please tell me where I am going wrong in the measurement on the TMS570? This is the code I use for the cycle measurement.

 

/* initialisation */

 Enable_Performance_Monitor(0);
  Performance_Monitor_Start(0);  

  /* ----------------------------------------------------------------------   
  **  Initialize the timer calculation.   
  ** ------------------------------------------------------------------- */   
   
  startTime = Performance_Monitor_Read_CycleCount(0);
   
  /* ----------------------------------------------------------------------   
  ** Call the my_function      
  ** ------------------------------------------------------------------- */   
   
  my_function();   
   
  /* ----------------------------------------------------------------------   
  ** Calculate the execution time   
  ** ------------------------------------------------------------------- */   
   
  endTime = Performance_Monitor_Read_CycleCount(0);

  execTime = endTime - startTime;

where the functions are defined as follows:

    export  Enable_Performance_Monitor
Enable_Performance_Monitor
    MOV r0, #0
    MCR p15, #0, r0, c9, c12, #5  ; Write PMNXSEL Register = select performance monitor counter #0
    MOV r0, #0x01                 ; select I$ miss
    MCR p15, #0, r0, c9, c13, #1  ; Write EVTSELx Register

    MOV r0, #1
    MCR p15, #0, r0, c9, c12, #5  ; Write PMNXSEL Register = select performance monitor counter #1
    MOV r0, #0x03                 ; # D$ miss
    MCR p15, #0, r0, c9, c13, #1  ; Write EVTSELx Register

    MOV r0, #2
    MCR p15, #0, r0, c9, c12, #5  ; Write PMNXSEL Register = select performance monitor counter #2
    MOV r0, #0x08                 ; Instruction architecturally executed
    MCR p15, #0, r0, c9, c13, #1  ; Write EVTSELx Register

    MRC p15, #0, r0, c9, c12, #0  ; Read PMNC Register
    ORR r0, r0, #7                ; reset cycle + performance counters and enable
    MCR p15, #0, r0, c9, c12, #0  ; Write PMNC Register
    BX  LR

    export  Performance_Monitor_Start
Performance_Monitor_Start
    MRC p15, #0, r0, c9, c12, #1  ; Read CNTENS Register
    ORR r0, r0, #7                ; Enable counters 0,1,2
    ORR r0, r0, #0x80000000       ; Enable cycle counter
    MCR p15, #0, r0, c9, c12, #1  ; Write CNTENS Register
    BX  LR

    export  Performance_Monitor_Stop
Performance_Monitor_Stop
    MRC p15, #0, r0, c9, c12, #2  ; Read CNTENC Register
    ORR r0, r0, #7                ; Disable counters 0,1,2
    ORR r0, r0, #0x80000000       ; Disable cycle counter
    MCR p15, #0, r0, c9, c12, #2  ; Write CNTENC Register
    BX  LR

    export  Performance_Monitor_Read_CycleCount
Performance_Monitor_Read_CycleCount
    MRC p15, #0, r0, c9, c13, #0  ; Read CCNT Register
    BX  LR

    end
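
For readers following the register writes from C, the bit values used in the routines above can be given names. This is only a sketch: the macro and function names are hypothetical (they are not part of the posted project), and the bit positions assume the ARMv7 PMU layout of the PMNC and CNTENS registers. The last helper checks the PMNC divider bit, which makes CCNT count once every 64 cycles when set and is worth ruling out when two setups report very different counts.

```c
#include <stdint.h>

/* PMNC (Performance Monitor Control) bits, assuming the ARMv7 PMU layout. */
#define PMNC_E  (1u << 0)   /* enable all counters */
#define PMNC_P  (1u << 1)   /* reset the event counters */
#define PMNC_C  (1u << 2)   /* reset the cycle counter (CCNT) */
#define PMNC_D  (1u << 3)   /* CCNT counts once every 64 cycles when set */

/* CNTENS / CNTENC bits. */
#define CNT_EVENT_0_2  0x7u         /* event counters 0, 1 and 2 */
#define CNT_CYCLE      (1u << 31)   /* cycle counter */

/* Value written back to PMNC by Enable_Performance_Monitor (old | 7). */
uint32_t pmnc_enable_value(uint32_t old_pmnc)
{
    return old_pmnc | PMNC_E | PMNC_P | PMNC_C;
}

/* Value written to CNTENS by Performance_Monitor_Start. */
uint32_t cntens_start_value(uint32_t old_cntens)
{
    return old_cntens | CNT_EVENT_0_2 | CNT_CYCLE;
}

/* Nonzero if the divide-by-64 option is active in a PMNC value. */
int pmnc_divider_active(uint32_t pmnc)
{
    return (pmnc & PMNC_D) != 0u;
}
```
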

Thanks in advance ,

Kranti

 

 

 

 

  • Hi Kranti,

    Can you please check whether both devices use the same clock settings? If one device runs faster than the other, you will get more cycles on that device.

    Your code is good, but I prefer this order:

    1. Enable_Performance_Monitor(0)

    2. Performance_Monitor_Start(0)

    3. Your benchmark test, my_function()

    4. Performance_Monitor_Stop(0)

    5. Performance_Monitor_Read_CycleCount(0)

    6. Calculate the time interval from the value read in step 5 and your device clock
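
Step 6 can be sketched in C as follows. The helper names are hypothetical, and the sketch assumes CCNT is read as a 32-bit value and that the core clock is known in MHz; it is an illustration of the arithmetic, not code from the posted project.

```c
#include <stdint.h>

/* Elapsed cycles between two reads of the 32-bit CCNT register.
   Unsigned subtraction stays correct across a single counter wraparound. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to microseconds for a core clock given in MHz. */
double cycles_to_us(uint32_t cycles, double clock_mhz)
{
    return (double)cycles / clock_mhz;
}
```

For example, a count of 160 cycles at a 160 MHz core clock corresponds to 1 µs.
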

     

    Regards,

    QJ

  • Hi,

     

    I also created the project in Code Composer Studio and tried two methods of measuring the actual cycles. One is the method above, i.e. the performance monitor, and the other is the RTI counter.

    When I measure the cycles with the performance monitor I get around 6072 cycles, but with the RTI I get 771 cycles. I don't think both can be correct. I believe VCLK runs at the same clock as HCLK, i.e. the core clock, which is 160 MHz. If I set the VCLK divider to 1, the count is halved, to around 330.

    Can you please have a look at the code for both methods and let me know why the results are so different? Even when I enable optimization I do not see any change in the cycle count; can you please cross-check this as well? I am using CCS V4.

    Please import the project, then build and run it on the MCB570. In the file arm_fir_Q31_test.c you can see that the performance monitor is used by default. When you run it, execTime is 6072. If you want to run the RTI version, please use the code below.

    Thanks much for your help in resolving this issue. 

     

    Best regards,

    Kranti.

    7444.MCBTMS570_Test.zip

  • Hi Kranti,

    I will try your code tomorrow.

    Regards,

    QJ

  • Hi Kranti,

    I ran both the PMU (performance monitor unit) test and the RTI test on your benchmark code. The test results are consistent. I changed the VCLK setting:

        SYSTEM_1->VCLKR  = 1U;                    //VCLK=160MHz/2

    The settings are:

        CPU clock (PLL): 160 MHz
        RTI clock: VCLK, 80 MHz
        RTI free-running counter frequency: RTI clock/(7+1) = 10 MHz

    The test results (average):

    • 1. PMU: 6046 clock cycles = 6046/160 = 37.79 µs
    • 2. RTI: 378 clock cycles = 378/10 = 37.8 µs
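
The consistency check above can be reproduced with a few lines of C. This is a sketch with hypothetical helper names; it assumes the RTI free-running counter advances at RTI clock/(prescale + 1), as in the settings listed, and simply converts both counts to microseconds.

```c
#include <stdint.h>

/* Frequency of the RTI free-running counter in Hz: RTI clock / (prescale + 1). */
uint32_t rti_counter_hz(uint32_t rticlk_hz, uint32_t prescale)
{
    return rticlk_hz / (prescale + 1u);
}

/* Convert a tick count to microseconds for a counter running at counter_hz. */
double ticks_to_us(uint32_t ticks, uint32_t counter_hz)
{
    return (double)ticks * 1.0e6 / (double)counter_hz;
}
```

With an 80 MHz RTI clock and a prescale value of 7, `rti_counter_hz` gives 10 MHz, so 378 RTI ticks and 6046 CPU cycles at 160 MHz both come out at roughly 37.8 µs.
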

     Regards,

    QJ

  • Hi Karl,

    I am not sure I understand your question. Do you mean whether the memory is configured with wait states, or whether the code is running from internal memory or from flash? Please give me more details.

    Do we need to configure the memory in a particular way to get the best performance?

    I am placing all the code and data in the internal memory of the TMS570. Here is the linker file. I also place code and data in the internal memory of the device I am comparing against.

    You can look at my CCS project for the TMS570 attached in this thread: 7444.MCBTMS570_Test.zip

    I also ran the code from flash with the data in internal memory, but there is no big difference in cycles; they are almost the same.

    4657.sys_link.zip

     

    Best regards,

    Kranti. 

  • Kranti,

    I tried them with different CPU clocks and w/o optimization.

    160 MHz CPU: 1407 cycles, and 5266 cycles (loop unroll)

    96 MHz CPU: 1538 cycles, and 5266 cycles (loop unroll)

    Optimization makes no difference. Your code is pretty neat, so I don't think the high-level optimization in CCS will improve the performance much. For FIR or FFT, I am pretty sure that assembly code using SIMD (single instruction, multiple data) instructions will give you a very big improvement. Here is a good link on implementing a FIR filter on the Cortex-R4:

    http://www.eetimes.com/design/signal-processing-dsp/4017562/Using-the-ARM-Cortex-R4-for-DSP-part-2-Software-optimization?pageNumber=1

    What cycle counts did you get on the other device (M4?)?

    Regards,

    QJ

  • Hi QJ,

    Are the cycle counts you mentioned (160 MHz CPU: 1407 cycles, and 5266 cycles (loop unroll); 96 MHz CPU: 1538 cycles, and 5266 cycles (loop unroll)) for the add function or for the FIR function?

    Best Regards,

    Kranti

     

     

  • Hi QJ,

    Theoretically, the unrolled ADD function should take fewer cycles than the ADD function without unrolling, because the unrolled loop performs fewer loop-count comparisons.

    I wonder what causes the cycle count to increase so drastically, from 1407 (no unrolling) to 5266 (unrolled).
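
The expectation described above can be illustrated with a minimal sketch of a saturating Q31 vector add in plain and 4x-unrolled form. The function names are hypothetical and this is not the source of arm_add_q31.c from the attached project; it only shows why the unrolled loop executes about a quarter of the loop tests.

```c
#include <stdint.h>

/* Saturating Q31 addition: clamp the 64-bit sum to the int32_t range. */
static int32_t sat_add_q31(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}

/* One add and one loop test per sample. */
void add_q31_simple(const int32_t *a, const int32_t *b, int32_t *dst, uint32_t n)
{
    for (uint32_t i = 0; i < n; i++)
        dst[i] = sat_add_q31(a[i], b[i]);
}

/* Four adds per loop test; the tail loop handles the n % 4 leftover samples. */
void add_q31_unroll4(const int32_t *a, const int32_t *b, int32_t *dst, uint32_t n)
{
    uint32_t i = 0;
    for (uint32_t blocks = n >> 2; blocks > 0; blocks--, i += 4) {
        dst[i]     = sat_add_q31(a[i],     b[i]);
        dst[i + 1] = sat_add_q31(a[i + 1], b[i + 1]);
        dst[i + 2] = sat_add_q31(a[i + 2], b[i + 2]);
        dst[i + 3] = sat_add_q31(a[i + 3], b[i + 3]);
    }
    for (; i < n; i++)
        dst[i] = sat_add_q31(a[i], b[i]);
}
```

Both functions produce the same output, so any large cycle difference between them has to come from how each one is compiled and where it runs, not from the arithmetic.
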

    Best Regards,

    Kranti

     

     

  • Hi Kranti,

    I want to correct my test. File-level optimization improves the performance very much.

    1. Program-level optimization doesn't improve much.

    2. File-level optimization: I optimized arm_add_q31.c and arm_add_unroll_q31.c. Here is the comparison (cycles):

                                         add_q31     add_unroll_q31
    No file opt:                         6988        5266
    Opt_level 0 + opt_for_speed 0:       1849        1401
    Opt_level 3 + opt_for_speed 3:       1407        1353
    Opt_level 5 + opt_for_speed 5:       1407        1353

    To use file-level optimization, select the file first, then click Properties, then Basic Options.

    In my last test, I applied file-level optimization to add_q31 but forgot to apply it to add_unroll_q31.

    Regards,

    QJ

     

  • Hi QJ,

    Thank you very much for the information.

    Do we always have to set file-level options to get better optimization? I would have expected the project options to apply.

    Coming to the FIR case: with a simple FIR implementation I get around 6000 cycles, but with the unrolled code I get around 20000 cycles on the TMS570 using CCS.

    But when I run the same source code on another variant of the Cortex-R4, I get 5600 cycles.

    I have also attached the assembly file generated from the same source file (arm_fir_q31.c) built with RVDS, where I get 5600 cycles when running on the TMS570 using MDK.

    Do I need to change any project options when running in Code Composer Studio?

    Is any cache enabled when running on the TMS570 using Code Composer Studio?

    Can CCS generate a *.asm file when we build the corresponding *.c file?

    4762.arm_fir_q31_5600.zip

     

    Best Regards,

    Kranti

     

     

     

     

     

  • Hi Kranti,

    Please refer to the TMS470R1x Optimizing C/C++ Compiler User's Guide (SPNU151) for the optimization options. You can apply optimization to individual files or to the whole project. Because optimization rearranges code and allocates variables to registers, an optimized project is very hard to debug. I optimized only your two benchmark files and kept all other files as is, so that I could monitor the execution flow and check several variables manually.

    CCS can generate *.asm files. Under Assembler, please check "keep the generated assembly languages files". The *.asm files are placed in the debug folder.

    Regards,

    QJ

  • Hi QJ,

    Thank you very much for the support.

    When I try to build the attached example project, arm_fir_q15_test, the build fails because the SIMD intrinsics (__smlald etc.) cannot be linked.

    Please find the example project attached.

    Can you please let us know if I am missing any project options?

    Best Regards,

    Kranti

     

    1680.arm_fir_q15_test.zip

     

     

  • Hi Kranti,

     

    The functions you call don't exist in your project. Please add these lines to your arm_fir_q15.c to use the CCS4 R4 compiler intrinsics.

    #define __smlald _smlald

    #define __ssat   _ssat16

    #define __pkhbt _pkhbt

     

    Regards,

    QJ

  • Hi QJ,

    Thank you very much for the clarification.

    But ssat is different from ssat16. Can you please point me to a complete list of the intrinsics (a document that lists all of them)?

    Best Regards,

    Kranti

     

     

     

  • Hi QJ,

    The following functions fail to link when I try to build a project on the TMS570 using CCS:

    __clz(), __qadd(), __qsub(), __ssat()

    Best Regards,

    Kranti

     

     

  • Hi QJ,

    When I referred to the TMS470R1x Optimizing C/C++ Compiler 4.1.x Beta documentation, I could not find an intrinsic for qadd().

    I have the following inline assembly function for the GCC compiler.

    static INLINE uint32_t  _qadd(uint32_t op1, uint32_t op2)
    {
      uint32_t result=0;

      __ASM volatile ("qadd %0, %1, %2" : "=r" (result) : "r" (op2), "r" (op1) );
      return(result);
    }

    I get a build error when I use the same code with the TI compiler.

    Can I have an equivalent inline assembly function for the above _qadd() function on the TMS570 using the TI compiler?

    Best Regards

    Kranti

  • Hi 

     

    I get the following build error when I build the project in release mode.

    line 58: (of arm_add_q15.c)
                   INTERNAL ERROR: no match for ICALL
    This may be a serious problem.  Please contact customer support with a
    description of this problem and a sample of the source files that caused this
    INTERNAL ERROR message to appear.

    Cannot continue compilation - ABORTING!

    Please find the example project attached: 8053.lib_test.zip

    Can you please help us find out what the problem might be?

    Thanks in advance

    Kranti

  • Hi Kranti,

    ssat is for 32-bit values, and ssat16 is for 16-bit values. Please use the R4 intrinsics in the enclosed user's guide.

    http://www.ti.com/lit/ug/spnu151g/spnu151g.pdf
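
The difference between the two operations can be shown with a portable reference model. This is a sketch of the underlying ARM SSAT and SSAT16 instruction semantics with hypothetical function names; it is not the TI intrinsic implementation, and the exact intrinsic signatures should be taken from the user's guide.

```c
#include <stdint.h>

/* Saturate a signed 32-bit value to an n-bit signed range, 1 <= n <= 32. */
int32_t ssat_ref(int32_t x, unsigned n)
{
    int64_t max = ((int64_t)1 << (n - 1u)) - 1;
    int64_t min = -max - 1;
    if (x > max) return (int32_t)max;
    if (x < min) return (int32_t)min;
    return x;
}

/* Saturate each signed 16-bit half of x independently to n bits, 1 <= n <= 16. */
uint32_t ssat16_ref(uint32_t x, unsigned n)
{
    int32_t lo = ssat_ref((int16_t)(x & 0xFFFFu), n);
    int32_t hi = ssat_ref((int16_t)(x >> 16), n);
    return ((uint32_t)hi << 16) | ((uint32_t)lo & 0xFFFFu);
}
```

Mapping `__ssat` onto a 16-bit variant therefore changes the result whenever the input is a single 32-bit quantity rather than two packed halfwords.
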

    Regards,

    QJ

     

     

     

  • Hi Kranti,

    Please update the CCS4 Code Generation Tools (CGT) to 4.91. This will solve your problem. CGT 4.6 doesn't have the intrinsics used in your project.

    You can update CGT using "Software Updates" under the CCS "Help" menu.

    Regards,

    QJ

     

     

  • Hi QJ,

    Thanks for your help. I am now using CGT 4.91 and can build the functions in the following way:

    I used _norm() in place of __clz(),

    sadd() in place of __qadd(),

    ssub() in place of __qsub(),

    and _ssata() in place of __ssat().

    Please let me know if I have missed anything.

    The following functions still fail to link when I try to build the project on the TMS570 using CCS:

    __SHASX()

    __SHSAX()

    Can you please suggest a replacement?

    Best Regards,

    Kranti

     

  • Hi Kranti,

    CGT4.91 doesn't have SHASX and SHSAX intrinsics. You can use the sasx and ssax instructions, or use the _saddsubx intrinsic.

        The GCC and IAR compilers have a __shsax intrinsic. The usage is:

                         res = __shsax(val1,val2);                     /* res[15:0] = (val1[15:0] + val2[31:16]) >> 1;        res[31:16] = (val1[31:16] - val2[15:0]) >> 1      */

        CCS4 CGT4.91 doesn't have an _shsax intrinsic, but you can use _saddsubx.

                        res = _saddsubx(val2, val1);                /* res[15:0] = (val1[15:0] + val2[31:16]) ;        res[31:16] = (val1[31:16] - val2[15:0])     */
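
The two comments above differ only in the final halving. A portable reference model, written directly from those comments, makes the consequence visible: the halving form keeps the full intermediate result before shifting, while the non-halving form keeps only the low 16 bits of each result. The function names are hypothetical and this is not the TI intrinsic; whether `_saddsubx` matches the non-halving model exactly should be checked against the compiler user's guide.

```c
#include <stdint.h>

/* Sign-extended low and high halfwords of a packed 32-bit word. */
static int32_t lo16(uint32_t v) { return (int16_t)(v & 0xFFFFu); }
static int32_t hi16(uint32_t v) { return (int16_t)(v >> 16); }

/* Divide by two rounding toward minus infinity, without shifting a negative value. */
static int32_t halve_floor(int32_t x) { return (x - (x & 1)) / 2; }

/* Pack two results into one word, keeping the low 16 bits of each. */
static uint32_t pack16(int32_t hi, int32_t lo)
{
    return (((uint32_t)hi & 0xFFFFu) << 16) | ((uint32_t)lo & 0xFFFFu);
}

/* Halving form, following the __shsax comment above. */
uint32_t shsax_ref(uint32_t val1, uint32_t val2)
{
    int32_t sum  = lo16(val1) + hi16(val2);
    int32_t diff = hi16(val1) - lo16(val2);
    return pack16(halve_floor(diff), halve_floor(sum));
}

/* Non-halving form: same sum and difference, each truncated to 16 bits. */
uint32_t sax_nohalve_ref(uint32_t val1, uint32_t val2)
{
    int32_t sum  = lo16(val1) + hi16(val2);
    int32_t diff = hi16(val1) - lo16(val2);
    return pack16(diff, sum);
}
```

For inputs near full scale the non-halving result wraps, so a replacement without the halving step is only equivalent if the caller scales the data to leave headroom and applies the shift separately.
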

     

    I am not sure this is ok for your performance requirement.

     

    Regards,

    QJ

  • Hi Kranti,

    Did you try _saddsubx() in place of _shsax() in your application? Does it meet your performance requirement?

    Regards,

    QJ

  • Hi QJ,

    Sorry for the late reply. I tried _shasx() for _shaaddsubx in our application. It builds without any warnings.

    In another application, I get a build error when I use _sqrtf() on the TMS570 using Code Composer Studio.

    Is __builtin_sqrtf() the equivalent of _sqrtf()?

    Is there an include file that I need to include in order to use __builtin_sqrtf()?

    Best Regards,

    Kranti

     

     

     

     

  • Hi Kranti,

     __builtin_sqrtf() is a GCC extension. You can use it by compiling the project with the --gcc option.

    Regards,

    QJ

     

  • Hi QJ,

     

    Thank you for your response.  

    Is there any other function equivalent to __sqrtf() besides __builtin_sqrtf()?

    If that is the only function available, where should we set the --gcc option when compiling the project?

    Thanks in advance,

    Kranti


  • Hey Kranti,

    Hopefully you found the proper option since posting this a month and a half ago, but for anyone else who stumbles across this thread in the future, the -gcc option for compilation in Code Composer Studio can be found in the project properties.  Open the properties and expand CCS Build, then TMS470 Compiler, then click on the Language Options entry.  About fifteen items down is a checkbox labeled "Enable support for GCC extensions (--gcc)."  If you're using CCSv5, it might be necessary to click the Show Advanced Settings button in the bottom right.

    -Jordan

  • An update ... we have received the R4 dsplib from ARM and are in the process of evaluating it, making it easy to use, and making its benefits easy to understand.  This effort will require a few more weeks of work.  Look for a public posting of this library in the March 2012 timeframe.