ARM Cortex-A8 using NEON and VFP

I have a customer who has some questions about how much control the programmer has, through the compiler, over routing different instruction types to the different processing units in the ARM subsystem. They are working with the AM3503 and are concerned about floating point performance. Originally they were going to use the VFP, but because that unit is not pipelined they cannot get the performance they are looking for. They need both single and double precision float support. Here are their questions:

Here's a possible option, please advise.

 

Can we compile all of our code with Neon switched on for all floating point ops? Here are the assumptions we would be making; please let us know whether they are reasonable and whether this would help.

  1. Any single float would be trapped and executed by the Neon
  2. Any double float would be executed by the VFP
  3. Everything else would be executed by the MPU A8 core.

In other words, is the compiler smart enough to match the operation with the most appropriate resource?  Is any of this feasible?  Does it buy us anything, or do we lose too many cycles forcing some sort of context switching between the A8, VFP and Neon cores?  Is there a particular compiler that does a better job with code optimization?

I'm not familiar enough with the compiler to know if this level of control is possible.

Thanks.

  • Tommy,

    The TI ARM compiler has 4 modes for the Cortex-A8, controlled by the --float_support=vfpv3 and --neon options. The --neon option without the --float_support option will result in integer operations being vectorized for NEON, but no floating point. The --float_support option without --neon will result in VFP instructions. If you specify both, you will get floating point operations vectorized for Neon as well as VFP instructions. As to the pipeline behavior of the A8, I am not an expert. If I read the TRM correctly, section 16.7.2 says that single precision VFP instructions will execute in the Neon FP pipeline if it is available. If this is correct, then generating single precision VFP instructions is all that is needed to get some pipeline benefit from Neon. I don't think the compiler will generate the SIMD form of the instruction for a single result. For reference, I am talking about the difference between:

    VADD.F32 S0, S1, S2 ; VFP instruction
    VADD.F32 D0, D1, D2 ; Neon instruction

    I don't think there is a penalty for issuing instructions on the different cores, but again I'm not an expert on A8. I recommend that you use the latest compiler, which is 4.9.0.
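
To make the three cases in the question concrete, here is a minimal C sketch. The function names are hypothetical and are not from the thread or from any TI example. The comments describe which instruction form each function is a candidate for, based on the reply above. What the compiler actually emits depends on the --float_support and --neon options and on the optimization level, so the generated assembly should be inspected rather than assumed.

```c
#include <stddef.h>

/* Scalar single-precision add. With --float_support=vfpv3 this is a
 * candidate for the scalar form, e.g. VADD.F32 S0, S1, S2. */
float add_f32(float a, float b)
{
    return a + b;
}

/* Scalar double-precision add. NEON on the Cortex-A8 only handles
 * single-precision floating point, so double-precision arithmetic
 * can only be served by VFP instructions. */
double add_f64(double a, double b)
{
    return a + b;
}

/* Element-wise single-precision add over arrays. A loop like this is
 * the kind of code a vectorizer may turn into the SIMD form, e.g.
 * VADD.F32 D0, D1, D2, when both options are given. Whether it does
 * so is an assumption to verify in the assembly output. */
void add_arrays_f32(const float *a, const float *b, float *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Building the same source once with only --float_support=vfpv3 and once with --neon added, then comparing the two assembly listings, is one way to check the behavior described above on the customer's own code.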