Tiva TM4C123G floating point divide and square root

Matthew Tarca

Other Parts Discussed in Thread: TM4C123GH6PGE

Hello all,

I have what I believe to be a relatively beginner question, but I am new here, so go easy on me...

Here are the tools I am currently using:

- Tiva C series development board with a TM4C123GH6PGE processor. This is supposedly equivalent to the Stellaris LM4F232H5QG.
- Code Composer Studio v5.4.0.00091
- Compiler version: TI v5.0.4
- Runtime support library: <automatic>

And as for the question: I am running a very time-sensitive application that needs to do a moderate amount of math, including divides, square roots, and possibly trig functions such as sin(x). If I am not mistaken, the processor is capable of hardware floating-point divides and square roots; the datasheet shows the assembly instructions as "VDIV.F32" and "VSQRT.F32" respectively. Furthermore, the "ARM Cortex-M4 Processor Technical Reference Manual" (rev r0p1) states that those instructions each require 14 clock cycles.

I have written test code that successfully performs the divide and square root:

float x, y, z, zdiv;

x = 2.34457123123;
y = 5.12366434127;
zdiv = x / y;
z = sqrtf(x);

Using JTAG I am able to verify that the values of "zdiv" and "z" are what you would expect, but the timing and the assembly code are not what I expect. The last 4 lines above are run repeatedly inside a very slow loop, and I toggle a GPIO line immediately before and after those lines to get a rough estimate of timing (I understand this is imperfect). I am seeing something like 700ns execution time. I am certain that the CPU clock is configured for 80MHz; this has already been verified several other ways. Most confusing, however, is the assembly code. I see no evidence of any divide, let alone a hardware floating point divide. There is a long branch to a square root routine, which is also not what I was hoping for or expecting. As far as I can tell these are the assembly lines associated with the C code shown above:

;----------------------------------------------------------------------
; 153 | x = 2.34457123123;                                                     
;----------------------------------------------------------------------
        LDR       A1, $C$FL1            ; [DPU_3_PIPE] |153| 
        VMOV      S0, A1                ; [DPU_LIN_PIPE] |153| 
        LDR       A1, $C$CON25          ; [DPU_3_PIPE] |153| 
	.dwpsn	file "../timers.c",line 154,column 2,is_stmt,isa 1
;----------------------------------------------------------------------
; 154 | y = 5.12366434127;                                                     
;----------------------------------------------------------------------
        LDR       A2, $C$CON26          ; [DPU_3_PIPE] |154| 
	.dwpsn	file "../timers.c",line 153,column 2,is_stmt,isa 1
        VSTR.32   S0, [A1, #0]          ; [DPU_LIN_PIPE] |153| 
	.dwpsn	file "../timers.c",line 154,column 2,is_stmt,isa 1
        LDR       A1, $C$FL2            ; [DPU_3_PIPE] |154| 
        STR       A1, [A2, #0]          ; [DPU_3_PIPE] |154| 
	.dwpsn	file "../timers.c",line 155,column 2,is_stmt,isa 1
;----------------------------------------------------------------------
; 155 | zdiv = x / y;                                                          
;----------------------------------------------------------------------
        LDR       A1, $C$FL3            ; [DPU_3_PIPE] |155| 
        LDR       A2, $C$CON27          ; [DPU_3_PIPE] |155| 
        STR       A1, [A2, #0]          ; [DPU_3_PIPE] |155| 
	.dwpsn	file "../timers.c",line 156,column 2,is_stmt,isa 1
;----------------------------------------------------------------------
; 156 | z = sqrtf(x);                                                          
;----------------------------------------------------------------------
$C$DW$88	.dwtag  DW_TAG_TI_branch
	.dwattr $C$DW$88, DW_AT_low_pc(0x00)
	.dwattr $C$DW$88, DW_AT_name("sqrtf")
	.dwattr $C$DW$88, DW_AT_TI_call
        BL        sqrtf                 ; [DPU_3_PIPE] |156| 
        ; CALL OCCURS {sqrtf }           ; [] |156| 
        LDR       A1, $C$CON28          ; [DPU_3_PIPE] |156| 
	.dwpsn	file "../timers.c",line 157,column 2,is_stmt,isa 1
        STR       V3, [V1, #0]          ; [DPU_3_PIPE] |157| 
	.dwpsn	file "../timers.c",line 156,column 2,is_stmt,isa 1
        VSTR.32   S0, [A1, #0]          ; [DPU_LIN_PIPE] |156|

Any help would be so appreciated! Thanks.

over 11 years ago

0 Matthew Tarca over 11 years ago

Prodigy 30 points

Hello everyone again,

I have answered half of my own question... you may notice the two lines in my earlier code that assigned constant values to "x" and "y"... well, it turns out that with an optimization level of 2 the divide was optimized right out and computed in advance. That is why the assembly code looked like it was simply moving numbers around; that is actually what it was doing. I checked this two ways. First, I declared "x" and "y" in the main initialization routines instead of right before the computations themselves, so that the ISR had to treat the variables as globals whose values are not known at compile time. Second, I selected an optimization level of 0 and made no changes to the code whatsoever. Either way, the assembly code for the divide line then became:

;----------------------------------------------------------------------
; 156 | zdiv = x / y;                                                          
;----------------------------------------------------------------------
        LDR       A1, $C$CON29          ; [DPU_3_PIPE] |156| 
        VLDR.32   S0, [A1, #0]          ; [DPU_LIN_PIPE] |156| 
        LDR       A1, $C$CON28          ; [DPU_3_PIPE] |156| 
        VLDR.32   S1, [A1, #0]          ; [DPU_LIN_PIPE] |156| 
        VDIV.F32  S0, S1, S0            ; [DPU_LIN_PIPE] |156| 
        LDR       A1, $C$CON27          ; [DPU_3_PIPE] |156| 
        VSTR.32   S0, [A1, #0]          ; [DPU_LIN_PIPE] |156| 
	.dwpsn	file "../timers.c",line 159,column 2,is_stmt,isa 1

This was shown to be taking approximately 200ns on my oscilloscope, actually faster than I would have presumed (2 + 2 + 2 + 2 + 14 + 2 + 2 clock cycles @ 80MHz ~= 325ns). I will again attribute this to some type of optimization or instruction pipelining...
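The cycle arithmetic above can be double-checked with a small helper. This is only a sketch: the function names are made up for illustration, and the per-instruction cycle counts are the rough estimates from this post, not measured values.

```c
#include <stdint.h>

/* Convert a cycle count to nanoseconds for a core clock given in Hz.
 * 64-bit intermediate math avoids overflow; clock_hz must be nonzero. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t clock_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000ULL) / clock_hz);
}

/* Convert a measured time in nanoseconds back to (whole) core cycles. */
uint32_t ns_to_cycles(uint32_t ns, uint32_t clock_hz)
{
    return (uint32_t)(((uint64_t)ns * clock_hz) / 1000000000ULL);
}

/* Estimate from the post: six 2-cycle loads/stores plus one 14-cycle VDIV.F32. */
uint32_t divide_sequence_cycles(void)
{
    return 2u + 2u + 2u + 2u + 14u + 2u + 2u;
}
```

With an 80MHz clock, cycles_to_ns(divide_sequence_cycles(), 80000000u) gives 325, matching the estimate, while ns_to_cycles(200u, 80000000u) gives 16, so the scope reading corresponds to about 16 cycles for the whole sequence.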

I would like to point out that I still have not solidified my understanding of how the square root routine is being performed. The assembly code still shows it as a "branch with link" instruction to some routine. That line of code is taking about 375ns according to the scope, not terrible, but not exactly what I expected. I do not have proof that the "routine" being called to perform the square root is not essentially a few assembly lines to perform a hardware square root, I just would like to know for sure. Again, thanks for any comments.
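For anyone repeating the divide test: besides moving the declarations or dropping to optimization level 0, another common way to keep the optimizer from precomputing the result is to mark the test operands volatile. A minimal sketch, assuming file-scope test variables; the function name timed_divide is hypothetical:

```c
/* 'volatile' forces a real load or store at every access, so the compiler
 * cannot fold x / y into a constant at build time, even when optimizing. */
static volatile float x = 2.34457123123f;
static volatile float y = 5.12366434127f;
static volatile float zdiv;

/* Performs one divide that has to happen at run time.
 * In my build the non-constant divide came out as VDIV.F32; I have not
 * checked the listing for this volatile variant. */
float timed_divide(void)
{
    zdiv = x / y;
    return zdiv;
}
```

The volatile accesses add their own loads and stores, so any timing taken around timed_divide() includes that overhead as well as the divide itself.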

0 Chester Gillon over 11 years ago in reply to Matthew Tarca

Guru 92251 points

Matthew Tarca said:
The assembly code still shows it as a "branch with link" instruction to some routine.

The sqrtf function in the run time library validates the input argument, and if passed a negative number raises a floating point exception and returns a NaN value, as per the C standard library.

The source for the run time library is in <ccs_install_root>\ccsv5\tools\compiler\arm_<arm_compiler_version>\lib\rtssrc.zip if you want to see what it does.

If you want to bypass the overhead of calling sqrtf, try using the compiler intrinsic __sqrtf to use the hardware square root instruction directly.

0 Matthew Tarca over 11 years ago in reply to Chester Gillon

Prodigy 30 points

Chester,

Thank you very much for the reply. I have perused the RTS libraries before but found them a bit daunting. The code is a little cryptic. With a little persistence I found the lines that I believe in plain English say to return an error if a negative float is detected else "return __sqrtf(float)"....
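In rough terms, my reading of that structure looks like the sketch below. This is not the actual RTS source: checked_sqrtf and hw_sqrtf are made-up names, reporting the error through errno/EDOM is my assumption about the mechanism, and I am assuming the TI compiler defines __TI_COMPILER_VERSION__ and that NAN is available from math.h. The non-TI branch is only a stand-in so the sketch builds off-target.

```c
#include <errno.h>
#include <math.h>   /* NAN macro (C99); assumed available in the toolchain */

#if defined(__TI_COMPILER_VERSION__)
/* On the TI compiler the intrinsic maps to the hardware square root. */
#define hw_sqrtf(v) __sqrtf(v)
#else
/* Off-target stand-in: a few Newton-Raphson steps, approximate only.
 * This is NOT what the hardware or the RTS library does. */
static float hw_sqrtf(float v)
{
    float r = (v > 1.0f) ? v : 1.0f;   /* start at or above sqrt(v) */
    int i;
    if (v == 0.0f) {
        return 0.0f;
    }
    for (i = 0; i < 32; ++i) {
        r = 0.5f * (r + v / r);
    }
    return r;
}
#endif

/* Validate the argument first, then take the fast path. */
float checked_sqrtf(float v)
{
    if (v < 0.0f) {
        errno = EDOM;   /* assumed error reporting; see note above */
        return NAN;
    }
    return hw_sqrtf(v);
}
```

Calling __sqrtf directly skips the function call and the argument check, which is fine only when the input is already known to be non-negative.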

Using the intrinsic worked like a charm... now the .asm reads as

; 107 | zsqrt = __sqrtf(zdiv);                                                 
; 108 | //zsqrt = sqrtf(zdiv);                                                 
;----------------------------------------------------------------------
        VSQRT.F32 S0, S0                ; [DPU_LIN_PIPE] |107| 
        LDR       A1, $C$CON30          ; [DPU_3_PIPE] |107| 
        VSTR.32   S0, [A1, #0]          ; [DPU_LIN_PIPE] |107| 
	.dwpsn	file "../timers.c",line 109,column 5,is_stmt,isa 1

The code also appears to be executing much faster.

Lastly, I'd like to point out that, since I realized I am not running my ISR from RAM, I should not expect such deterministic timings. Anyway, thanks a lot!

MJT
