performance measurements on TMS570 processor

kranti kumar

I am using a TMS570 processor to measure the cycle counts of basic functions such as FIR and FFT.

I am using the Keil MDK platform and the code below to count cycles.

I get different cycle counts (almost double) for the same function when I run it on a Cortex-R4 (NXP FCC2) and on a Cortex-R4 (TMS570).

Can you please tell me where I am going wrong in the measurement on the TMS570? I am using the following code for the cycle measurement.

/* initialisation */

Enable_Performance_Monitor(0);
Performance_Monitor_Start(0);

/* ----------------------------------------------------------------------
** Initialize the timer calculation.
** ------------------------------------------------------------------- */

startTime = Performance_Monitor_Read_CycleCount(0);

/* ----------------------------------------------------------------------
** Call the my_function
** ------------------------------------------------------------------- */

my_function();

/* ----------------------------------------------------------------------
** Calculate the execution time
** ------------------------------------------------------------------- */

endTime = Performance_Monitor_Read_CycleCount(0);

execTime = endTime - startTime;

where the functions are defined as below:

    export   Enable_Performance_Monitor
Enable_Performance_Monitor
    MOV r0, #0
    MCR p15, #0, r0, c9, c12, #5    ; Write PMNXSEL Register = select performance monitor counter #0
    MOV r0, #0x01                   ; select I$ miss
    MCR p15, #0, r0, c9, c13, #1    ; Write EVTSELx Register

    MOV r0, #1
    MCR p15, #0, r0, c9, c12, #5    ; Write PMNXSEL Register = select performance monitor counter #1
    MOV r0, #0x03                   ; # D$ miss
    MCR p15, #0, r0, c9, c13, #1    ; Write EVTSELx Register

    MOV r0, #2
    MCR p15, #0, r0, c9, c12, #5    ; Write PMNXSEL Register = select performance monitor counter #2
    MOV r0, #0x08                   ; Instruction architecturally executed
    MCR p15, #0, r0, c9, c13, #1    ; Write EVTSELx Register

    MRC p15, #0, r0, c9, c12, #0    ; Read PMNC Register
    ORR r0, r0, #7                  ; reset cycle + performance counters and enable
    MCR p15, #0, r0, c9, c12, #0    ; Write PMNC Register
    BX LR

    export   Performance_Monitor_Start
Performance_Monitor_Start

    MRC p15, #0, r0, c9, c12, #1    ; Read CNTENS Register
    ORR r0, r0, #7                  ; Enable counters 0,1,2
    ORR r0, r0, #0x80000000         ; Enable cycle counter
    MCR p15, #0, r0, c9, c12, #1    ; Write CNTENS Register
    BX LR

    export   Performance_Monitor_Stop
Performance_Monitor_Stop

    MRC p15, #0, r0, c9, c12, #2    ; Read CNTENC Register
    ORR r0, r0, #7                  ; Disable counters 0,1,2
    ORR r0, r0, #0x80000000         ; Disable cycle counter
    MCR p15, #0, r0, c9, c12, #2    ; Write CNTENC Register
    BX LR

    export   Performance_Monitor_Read_CycleCount
Performance_Monitor_Read_CycleCount

    MRC p15, #0, r0, c9, c13, #0    ; Read CCNT Register
    BX LR

    end

Thanks in advance,

Kranti

over 14 years ago

0 QJ Wang over 14 years ago

TI__Guru**** 197256 points

Hi Kranti,

Can you please check whether you have the same clock settings on the two devices? If one device runs faster than the other, you will get more cycles for this device.

Your code is good, but I prefer this order:

1. enable_performance_monitor(0)

2. performance_monitor_start(0)

3. your benchmark test "my_function()"

4. performance_monitor_stop(0)

5. performance_monitor_read_cyclecount(0)

6. calculate the time interval using #5 and your device clock
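
Step 6 can be sketched in C as follows. This is a minimal sketch, not code from the attached project: the helper names elapsed_cycles and cycles_to_ns are hypothetical, and it assumes the cycle counter is a free-running 32-bit counter that wraps at most once between the two reads.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed count between two reads of a free-running 32-bit counter.
   Unsigned subtraction stays correct across a single wrap-around. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Convert a cycle count to nanoseconds for a counter clocked at clock_hz.
   The multiplication is done in 64 bits so it cannot overflow. */
static uint64_t cycles_to_ns(uint32_t cycles, uint32_t clock_hz)
{
    assert(clock_hz != 0u);
    return ((uint64_t)cycles * 1000000000ull) / clock_hz;
}
```

For example, 6046 cycles at 160 MHz is cycles_to_ns(6046u, 160000000u) = 37787 ns, about 37.79 us.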

Regards,

0 kranti kumar over 14 years ago in reply to QJ Wang

Prodigy 150 points

Hi,

I also created the project in Code Composer and tried two methods of measuring the actual cycles. One is the method above, i.e. the performance monitor, and the other is the RTI counter.

When I calculate the cycles with the performance monitor I get around 6072 cycles, but with the RTI I get 771 cycles. I don't think both are correct. I believe VCLK is running at the same clock as HCLK, i.e. the core clock, at 160 MHz. If I set the VCLK divider to 1, the count halves to around 330.

Can you please have a look at the code for both methods and let me know why the results are so different? Even when I enable optimization I do not see any change in cycles; can you please cross-check this as well? I am using CCS v4.

Please import the project, then build and run it on the MCB570. In arm_fir_Q31_test.c you can see that by default we run the performance monitor. Once you run it, execTime is 6072. If you want to run the RTI measurement, please use the code below.

Thanks much for your help in resolving this issue.

Best regards,

Kranti.

7444.MCBTMS570_Test.zip

0 QJ Wang over 14 years ago in reply to kranti kumar

TI__Guru**** 197256 points

Hi Kranti,

I will try your code tomorrow.

Regards,

0 QJ Wang over 14 years ago in reply to QJ Wang

TI__Guru**** 197256 points

Hi Kranti,

I ran both the PMU (performance monitor unit) test and the RTI test on your benchmark code. The test results are consistent. I changed the VCLK setting:

SYSTEM_1->VCLKR = 1U; //VCLK=160MHz/2

The settings are: CPU clock (PLL): 160MHz

RTI clock: VCLK, 80MHz

RTI Free Running Freq: RTI Clock/(7+1) = 10MHz

The test results (average):

1. PMU: 6046 clock cycles = 6046 / 160 MHz = 37.79 us
2. RTI: 378 clock cycles = 378 / 10 MHz = 37.8 us
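
The agreement between the two results can also be checked by scaling the RTI ticks to CPU cycles. This is a minimal sketch that assumes the clock settings listed above; the function names are hypothetical and are not part of any TI library.

```c
#include <assert.h>
#include <stdint.h>

/* Frequency of the RTI free-running counter: RTI clock / (prescale + 1). */
static uint32_t rti_counter_hz(uint32_t rti_clock_hz, uint32_t prescale)
{
    assert(prescale < UINT32_MAX);
    return rti_clock_hz / (prescale + 1u);
}

/* Scale RTI ticks to the equivalent number of CPU cycles. */
static uint64_t rti_ticks_to_cpu_cycles(uint32_t ticks, uint32_t cpu_hz,
                                        uint32_t counter_hz)
{
    assert(counter_hz != 0u);
    return ((uint64_t)ticks * cpu_hz) / counter_hz;
}
```

With an 80 MHz RTI clock and a prescale value of 7 the counter runs at 10 MHz, so one RTI tick is 16 CPU cycles at 160 MHz. rti_ticks_to_cpu_cycles(378u, 160000000u, 10000000u) returns 6048, which matches the 6046 PMU cycles to within one RTI tick.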

Regards,

0 kranti kumar over 14 years ago

Prodigy 150 points

Hi Karl,

I am not sure I understand your question. Do you mean whether the memory is configured with wait states, or whether the code is running from internal memory or flash memory? Please let me know more details.

Do we need to configure the memory in a particular way to get the best performance?

I am placing all the code and data in the internal memory of the TMS570. Here is the linker file. I also place code and data in the internal memory of the device that I am comparing against.

You can look at my CCS project for the TMS570 attached in this thread, 7444.MCBTMS570_Test.zip.

I also ran the code from flash with the data in internal memory, but there is no big difference in cycles; they are almost the same.

4657.sys_link.zip

Best regards,

Kranti.

0 QJ Wang over 14 years ago in reply to kranti kumar

TI__Guru**** 197256 points

Kranti,

I tried them with different CPU clocks, with and without optimization.

160 MHz CPU: 1407 cycles, and 5266 cycles (loop unroll)

96 MHz CPU: 1538 cycles, and 5266 cycles (loop unroll)

Optimization makes no difference. Your code is pretty neat, so I don't think the high-level optimization in CCS will improve the performance much. For FIR or FFT, I am pretty sure assembly code using SIMD (single instruction, multiple data) instructions will give you a very big improvement. Here is a good link on implementing a FIR filter on the Cortex-R4.

http://www.eetimes.com/design/signal-processing-dsp/4017562/Using-the-ARM-Cortex-R4-for-DSP-part-2-Software-optimization?pageNumber=1

What cycle counts did you get on the other device (M4?)?

Regards,

0 kranti kumar over 14 years ago in reply to QJ Wang

Prodigy 150 points

Hi QJ,

Are the cycle counts you mentioned (160 MHz CPU: 1407 cycles, and 5266 cycles (loop unroll); 96 MHz CPU: 1538 cycles, and 5266 cycles (loop unroll)) for the adder function or the FIR function?

Best Regards,

Kranti

0 kranti kumar over 14 years ago in reply to kranti kumar

Prodigy 150 points

Hi QJ,

Theoretically, the cycle count for the unrolled ADD function should be lower than for the ADD function without unrolling, since the unrolled function performs fewer loop-count comparisons.

I wonder what the reason might be for the cycles increasing so drastically, from 1407 (no unrolling) to 5266 (with unrolling).

Best Regards,

Kranti

0 QJ Wang over 14 years ago in reply to kranti kumar

TI__Guru**** 197256 points

Hi Kranti,

I want to correct my test. File-level optimization improves the performance very much.

1. Program-level optimization doesn't improve much.

2. File-level optimization: I optimized arm_add_q31.c and arm_add_unroll_q31.c. Here is the comparison:

Cycles                           add_q31   add_unroll_q31

No file opt                      6988      5266

Opt_level 0 + opt_for_speed 0    1849      1401

Opt_level 3 + opt_for_speed 3    1407      1353

Opt_level 5 + opt_for_speed 5    1407      1353

To use file-level optimization, select the file first, then click Properties, then Basic Options.

In my last test, however, I applied file-level optimization to add_q31 but forgot to apply it to add_unroll_q31.

Regards,

0 kranti kumar over 14 years ago in reply to QJ Wang

Prodigy 150 points

Hi QJ,

Thank you very much for the information.

Do we always need file-level settings to get better optimization? I thought the settings would be taken from the project options.

Coming to the FIR case: with a simple FIR implementation I get around 6000 cycles, but with the unrolled code I get around 20000 cycles on the TMS570 using CCS.

But when I run the same source code on another variant of the Cortex-R4, I get 5600 cycles.

I have also attached the assembly file generated from the same source file (arm_fir_q31.c) built with RVDS, where I get 5600 cycles when running on the TMS570 using MDK.

Do I need to change any project options when running in Code Composer Studio?

Is any cache enabled when running on the TMS570 using Code Composer Studio?

Can we generate a *.asm file when we build the corresponding *.c file using CCS?

4762.arm_fir_q31_5600.zip

Best Regards,

Kranti

0 QJ Wang over 14 years ago in reply to kranti kumar

TI__Guru**** 197256 points

Hi Kranti,

Please refer to the TMS470R1x Optimizing C/C++ Compiler User's Guide (SPNU151) for the optimization options. You can apply optimization to individual files or to the whole project. Because optimization rearranges the code and allocates variables to registers, an optimized project is very hard to debug. I only optimized your two benchmark files and kept all other files as is, so that I could monitor the execution flow and check several variables manually.

CCS can generate *.asm files. Under Assembler, please check "keep the generated assembly languages files". The *.asm files are placed in the debug folder.

Regards,

0 kranti kumar over 13 years ago in reply to QJ Wang

Prodigy 150 points

Hi QJ,

Thank you very much for the support.

When I try to build the project arm_fir_q15_test (example project attached), the build fails because SIMD intrinsics such as __smlald cannot be linked.

Please find the example project attached.

Can you please let us know if I have missed any project options?

Best Regards,

Kranti

1680.arm_fir_q15_test.zip

0 QJ Wang over 13 years ago in reply to kranti kumar

TI__Guru**** 197256 points

Hi Kranti,

The functions you called don't exist in your project. Please add these lines to your arm_fir_q15.c to use the CCS4 R4 compiler intrinsics.

#define __smlald _smlald

#define __ssat _ssat16

#define __pkhbt _pkhbt

Regards,

0 kranti kumar over 13 years ago in reply to QJ Wang

Prodigy 150 points

Hi QJ,

Thank you very much for the clarification.

But ssat is different from ssat16. Can you please point me to the complete list of intrinsics (a document which lists all of them)?

Best Regards,

Kranti

0 kranti kumar over 13 years ago in reply to kranti kumar

Prodigy 150 points

Hi QJ,

The following functions fail to link when I try to build a project for the TMS570 using CCS:

__clz(), __qadd(), __qsub(), __ssat()

Best Regards,

Kranti

0 kranti kumar over 13 years ago in reply to kranti kumar

Prodigy 150 points

Hi QJ,

When I referred to the TMS470R1x Optimizing C/C++ Compiler 4.1.x Beta Doc, I did not find an intrinsic for qadd().

I have the following inline assembly function for the GCC compiler.

static INLINE uint32_t _qadd(uint32_t op1, uint32_t op2)
{
    uint32_t result=0;

    __ASM volatile ("qadd %0, %1, %2" : "=r" (result) : "r" (op2), "r" (op1) );

    return(result);
}

I get a build error when I use the same code with the TI compiler.

Can I have an equivalent inline assembly function for the above _qadd() function on the TMS570 using the TI compiler?

Best Regards

Kranti

0 kranti kumar over 13 years ago in reply to kranti kumar

Prodigy 150 points

I am getting the following build error when I build the project in release mode.

line 58: (of arm_add_q15.c)

INTERNAL ERROR: no match for ICALL

This may be a serious problem. Please contact customer support with a description of this problem and a sample of the source files that caused this INTERNAL ERROR message to appear.

Cannot continue compilation - ABORTING!

Please find the example project attached: 8053.lib_test.zip

Can you please help us find out what the problem might be?

Thanks in advance

Kranti

0 QJ Wang over 13 years ago in reply to kranti kumar

TI__Guru**** 197256 points

Hi Kranti,

ssat is for 32-bit values, and ssat16 is for 16-bit values. Please use the R4 intrinsics in the enclosed user's guide.

http://www.ti.com/lit/ug/spnu151g/spnu151g.pdf

Regards,

0 QJ Wang over 13 years ago in reply to kranti kumar

TI__Guru**** 197256 points

Hi Kranti,

Please update CCS4 Code Generation Tool (CGT) to 4.91. It will solve your problem. CGT 4.6 doesn't have the intrinsics used in your project.

You can update CGT using "Software Updates" under CCS "Help".

Regards,

0 kranti kumar over 13 years ago in reply to QJ Wang

Prodigy 150 points

Hi QJ,

Thanks for your help. I am able to use CGT 4.91 and to build the functions in the following way:

I used _norm() in place of __clz(),

_sadd() in place of __qadd(),

_ssub() in place of __qsub(),

and _ssata() in place of __ssat().

Please let me know if I have missed anything.
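
One way to check substitutions like these is to compare the intrinsic results against plain C reference models on a few test vectors. The sketch below is portable C, not TI compiler code: the ref_* names are hypothetical, and it assumes that __clz, __qadd, __qsub and __ssat follow the ARM CLZ, QADD, QSUB and SSAT (without shift) instruction semantics.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros of a 32-bit value; returns 32 for 0, as CLZ does. */
static uint32_t ref_clz(uint32_t x)
{
    uint32_t n = 0u;
    if (x == 0u) {
        return 32u;
    }
    while ((x & 0x80000000u) == 0u) {
        x <<= 1;
        n++;
    }
    return n;
}

/* Clamp a 64-bit intermediate result to the signed 32-bit range. */
static int32_t clamp_s32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Signed saturating add and subtract (QADD / QSUB behaviour). */
static int32_t ref_qadd(int32_t a, int32_t b) { return clamp_s32((int64_t)a + b); }
static int32_t ref_qsub(int32_t a, int32_t b) { return clamp_s32((int64_t)a - b); }

/* Saturate x to a signed field of `bits` bits (1..32), no shift applied. */
static int32_t ref_ssat(int32_t x, uint32_t bits)
{
    assert(bits >= 1u && bits <= 32u);
    int64_t max = ((int64_t)1 << (bits - 1u)) - 1;
    int64_t min = -max - 1;
    if (x > max) return (int32_t)max;
    if (x < min) return (int32_t)min;
    return x;
}
```

Note that QADD and QSUB saturate signed values, so the models take int32_t operands. Edge cases such as ref_clz(0), ref_qadd(INT32_MAX, 1) and ref_ssat(40000, 16) are the ones most likely to expose a mismatched substitution.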

The following functions still fail to link when I try to build the project for the TMS570 using CCS:

__SHASX()

__SHSAX()

Can you please suggest an alternative?

Best Regards,

Kranti

0 QJ Wang over 13 years ago in reply to kranti kumar

TI__Guru**** 197256 points

Hi Kranti,

CGT 4.91 doesn't have SHASX and SHSAX intrinsics. You can use the sasx and ssax instructions, or use the _saddsubx intrinsic.

The GCC and IAR compilers have a __shsax intrinsic. The usage is:

res = __shsax(val1,val2); /* res[15:0] = (val1[15:0] + val2[31:16]) >> 1; res[31:16] = (val1[31:16] - val2[15:0]) >> 1 */

CCS4 CGT 4.91 doesn't have a _shsax intrinsic, but you can use _saddsubx.

res = _saddsubx(val2, val1); /* res[15:0] = (val1[15:0] + val2[31:16]) ; res[31:16] = (val1[31:16] - val2[15:0]) */

I am not sure this is ok for your performance requirement.
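
The halving behaviour described in the __shsax comment above can be modelled in portable C to check a replacement sequence against it. This is a minimal sketch: ref_shsax, to_s16 and halve_floor are hypothetical names, the model follows the __shsax semantics quoted above, and it does not use any compiler intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret the low 16 bits of h as a signed halfword. */
static int32_t to_s16(uint32_t h)
{
    h &= 0xFFFFu;
    return (h & 0x8000u) ? (int32_t)h - 0x10000 : (int32_t)h;
}

/* Halve with rounding toward minus infinity, like an arithmetic shift right. */
static int32_t halve_floor(int32_t x)
{
    assert(x >= -65536 && x <= 65535);   /* range of a halfword sum or difference */
    return (x >= 0) ? x / 2 : -((-x + 1) / 2);
}

/* res[15:0]  = (val1[15:0]  + val2[31:16]) >> 1
   res[31:16] = (val1[31:16] - val2[15:0])  >> 1 */
static uint32_t ref_shsax(uint32_t val1, uint32_t val2)
{
    int32_t sum  = to_s16(val1) + to_s16(val2 >> 16);
    int32_t diff = to_s16(val1 >> 16) - to_s16(val2);
    uint32_t lo = (uint32_t)halve_floor(sum) & 0xFFFFu;
    uint32_t hi = (uint32_t)halve_floor(diff) & 0xFFFFu;
    return (hi << 16) | lo;
}
```

Because the sum and difference are formed at more than 16 bits before halving, ref_shsax(0x00007FFF, 0x7FFF0000) returns 0x00007FFF. A non-halving add that wraps at 16 bits, followed by a shift, would give a different result for inputs this large, so the two are only interchangeable when the operands leave enough headroom.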

Regards,

0 QJ Wang over 13 years ago in reply to QJ Wang

TI__Guru**** 197256 points

Hi Kranti,

Did you try _saddsubx() in place of _shsax() in your application? Does this meet your performance requirement?

Regards,

0 kranti kumar over 13 years ago in reply to QJ Wang

Prodigy 150 points

Hi QJ,

Sorry for the late reply. I tried _shaddsubx() for _shasx() in our application. It builds without any warnings.

In another application, I get a build error when I use _sqrtf() on the TMS570 with Code Composer Studio.

Is __builtin_sqrtf() the equivalent function for _sqrtf()?

Is there an include file that I need to include if I want to use __builtin_sqrtf()?

Best Regards,

Kranti

0 QJ Wang over 13 years ago in reply to kranti kumar

TI__Guru**** 197256 points

Hi Kranti,

__builtin_sqrtf() is a GCC extension. You can use it by compiling the project with the --gcc option.

Regards,

0 kranti kumar over 13 years ago in reply to QJ Wang

Prodigy 150 points

Hi QJ,

Thank you for your response.

Do we have any other function equivalent to __sqrtf() besides __builtin_sqrtf()?

If that is the only function we have, where should we set the --gcc option for compiling the project?

Thanks in advance,

Kranti

0 Stellaris Jordan over 13 years ago in reply to kranti kumar

TI__Intellectual 1185 points

Hey Kranti,

Hopefully you found the proper option since posting this a month and a half ago, but for anyone else who stumbles across this thread in the future, the --gcc option for compilation in Code Composer Studio can be found in the project properties. Open the properties and expand CCS Build, then TMS470 Compiler, then click on the Language Options entry. About fifteen items down is a checkbox labeled "Enable support for GCC extensions (--gcc)." If you're using CCSv5, it might be necessary to click the Show Advanced Settings button in the bottom right.

-Jordan

0 Brian Fortman over 13 years ago in reply to kranti kumar

TI__Expert 7650 points

An update: we have received the R4 dsplib from ARM and are in the process of evaluating it, making it easy to use, and understanding its benefits. This effort will require a few more weeks of work. Look for a public posting of this library in the March 2012 timeframe.
