AM3358 FPU speed

Assuming I'm doing only basic calculations (addition, subtraction, division, multiplication) and only with 32 bit integers or floats: what loss in speed do I have to expect for floating point operations using the FPU of the Sitara processor, compared with the same calculations done in plain fixed point, where no FPU is required?

  • The AM335X Linux SDK supports hard floating point. FP calculations are performed by the VFP coprocessor included in the Cortex-A8 core.

  • OK, that's clear so far. The main question is about speed: are 32 bit floating point calculations that make use of the VFP faster or slower than similar fixed point operations (basic arithmetic operations only)? And in case the VFP is slower: what slowdown factor does one have to expect?

  • Hi Hans,

    The VFP on any ARM Cortex-A8 is not fully pipelined, so basic floating point operations (add, mult, sub) will take 9 - 12 cycles, whereas the equivalent fixed point operations will complete in 1 - 2 cycles.  These numbers assume the pipeline is full.  Keep in mind that NEON, if you can utilize it, is fully pipelined, so you can get 32 bit floating point SIMD operations (mult, add, sub) out in 1 - 2 cycles.  But NEON does not support double precision float.
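To make the comparison concrete, here is a minimal sketch of what "the same calculation in fixed point" might look like in C. The Q16.16 format, the `q16_16` type and the `q16_*` helper names are assumptions chosen for illustration and are not part of any TI library. The sketch only shows the arithmetic; it does not measure cycle counts.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical Q16.16 format: 16 integer bits, 16 fractional bits. */
typedef int32_t q16_16;

#define Q16_ONE 65536 /* 1.0 in Q16.16 */

/* Conversions between float and Q16.16 (truncating toward zero). */
static inline q16_16 q16_from_float(float x) { return (q16_16)(x * (float)Q16_ONE); }
static inline float  q16_to_float(q16_16 x)  { return (float)x / (float)Q16_ONE; }

/* Addition and subtraction are plain integer operations (no overflow check). */
static inline q16_16 q16_add(q16_16 a, q16_16 b) { return a + b; }
static inline q16_16 q16_sub(q16_16 a, q16_16 b) { return a - b; }

/* Multiplication needs a 64 bit intermediate, then rescaling by 2^16. */
static inline q16_16 q16_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * b) / Q16_ONE);
}

/* Division pre-scales the dividend by 2^16 so the quotient stays in Q16.16. */
static inline q16_16 q16_div(q16_16 a, q16_16 b)
{
    assert(b != 0);
    return (q16_16)(((int64_t)a * Q16_ONE) / b);
}
```

For example, `q16_mul(q16_from_float(1.5f), q16_from_float(2.0f))` yields the Q16.16 representation of 3.0. The trade-off against `float` is reduced range and precision, plus the extra rescaling work on multiply and divide.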

  • Jeff,

    I'm using only 32 bit floats, so that is not a problem.

    With TI's ARM compiler, SIMD/NEON is used when the option "--neon" is set and "--float-support=VFPv3" is used, correct?

    Cheers

    Hans

  • Hans,

    For TI's compiler you can enable NEON:  (This is somewhat old data and I haven't checked it in a while)

    "-o3 -mv7a8 --neon -mf "

    see this link for more info: http://processors.wiki.ti.com/index.php/Cortex-A8

    However, since NEON is SIMD, the compiler will not automatically utilize NEON for every add, mult or sub in your code. You will find that the compiler will auto-vectorize some simple for loops, but in general you need to either write your own NEON assembly or utilize an existing library such as the NE10 library: http://projectne10.github.io/Ne10/
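As a rough illustration of hand-written NEON use, the following sketch uses the standard ARM NEON intrinsics from `arm_neon.h` instead of raw assembly. The function name `mul_add_f32` is hypothetical, and availability of `arm_neon.h` with a given toolchain is an assumption to check (it is provided by GCC and Clang for ARM targets). The code processes four 32 bit floats per iteration when NEON is available and falls back to a plain scalar loop otherwise.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: dst[i] = a[i] * b[i] + c[i] for i in [0, n). */
void mul_add_f32(float *dst, const float *a, const float *b,
                 const float *c, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && a != NULL && b != NULL && c != NULL);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Vector part: four single precision lanes per iteration. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        float32x4_t vc = vld1q_f32(c + i);
        vst1q_f32(dst + i, vaddq_f32(vmulq_f32(va, vb), vc));
    }
#endif

    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; ++i) {
        dst[i] = a[i] * b[i] + c[i];
    }
}
```

The scalar loop at the end is also the kind of simple loop a compiler may be able to auto-vectorize on its own, so it is worth inspecting the generated assembly before writing intrinsics by hand.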