[FAQ] What are the different FFT implementations on a C28x, C28x FPU, and VCU?

Q: What are the different FFT implementations on a C28x, C28x FPU, and VCU?  How do I choose which one to use?

Differences:

32-bit Fixed-Point

This implementation uses the on-chip 32-bit fixed-point math capabilities of the C28x fixed-point CPU. As a rule of thumb, an optimized 32-bit implementation takes ~20 cycles for each FFT butterfly.

16-bit Fixed-Point

This implementation can use the C28x fixed-point CPU or the C28x with VCU enhancements (C28x+VCU).

C28x Fixed-Point

This implementation uses the 16-bit math capabilities of the C28x fixed-point CPU. On the C28x CPU core alone, it takes ~16 cycles for each FFT butterfly.
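To make the term "butterfly" concrete, here is a minimal sketch of one radix-2 butterfly in 16-bit (Q15) fixed-point C. It is not the TI library code, which is hand-optimized assembly. The names cplx_q15, q15_mul, and butterfly_q15 are hypothetical and exist only for this illustration, and halving the outputs at each stage is just one common way to avoid 16-bit overflow.

```c
#include <stdint.h>

/* Hypothetical complex sample type in Q15 format (illustration only). */
typedef struct {
    int16_t re;
    int16_t im;
} cplx_q15;

/* Q15 multiply: 16x16 -> 32-bit product, shifted back down to Q15.
 * Assumes arithmetic right shift of negative values, which is what
 * common compilers do. The corner case (-32768 * -32768) is not saturated. */
int16_t q15_mul(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * (int32_t)b) >> 15);
}

/* One radix-2 butterfly with twiddle factor w:
 *   a' = (a + w*b) / 2
 *   b' = (a - w*b) / 2
 * The divide-by-2 keeps the results inside 16 bits. */
void butterfly_q15(cplx_q15 *a, cplx_q15 *b, cplx_q15 w)
{
    /* Complex multiply t = w * b (four real multiplies). */
    int32_t tr = (int32_t)q15_mul(b->re, w.re) - q15_mul(b->im, w.im);
    int32_t ti = (int32_t)q15_mul(b->re, w.im) + q15_mul(b->im, w.re);
    int32_t ar = a->re;
    int32_t ai = a->im;

    a->re = (int16_t)((ar + tr) >> 1);
    a->im = (int16_t)((ai + ti) >> 1);
    b->re = (int16_t)((ar - tr) >> 1);
    b->im = (int16_t)((ai - ti) >> 1);
}
```

The four multiplies and six add/subtract operations in this routine are the work that the per-butterfly cycle counts in this FAQ refer to.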

C28x with VCU

This implementation uses the 16-bit math capabilities of the C28x with VCU. The VCU provides optimized 16-bit complex math capabilities in addition to those of the fixed-point CPU. With the C28x+VCU enhancements, each FFT butterfly takes ~5 cycles. There are currently two versions of the VCU: Type 0, more commonly referred to as VCU-I, and Type 2, referred to as VCU-II. The FFT is substantially faster on VCU-II, where a butterfly takes ~2.5 cycles on average.

Note: the VCU is not available on all devices and there are no plans to put it on future devices. 

32-bit Floating-Point

This implementation uses the extended floating-point instruction set. It uses the 32-bit floating-point math capabilities of the C28x+FPU as well as the repeat block (RPTB) instruction. As a rule of thumb, it takes ~10 cycles for each FFT butterfly. If the implementation is on a floating-point device and 32 bits are required, then this is the preferred implementation.
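The per-butterfly figures above can be turned into a rough cycle budget. The sketch below assumes a radix-2 FFT, which has (N/2) * log2(N) butterflies for an N-point transform. The function names are hypothetical, and the estimate covers butterflies only: bit reversal, windowing, loop overhead, and memory wait states are not included, so treat the result as a lower bound and not as a benchmark.

```c
#include <stdint.h>

/* Number of radix-2 butterflies in an N-point FFT: (N/2) * log2(N).
 * n is assumed to be a power of two. */
uint32_t fft_butterflies(uint32_t n)
{
    uint32_t stages = 0;
    uint32_t m;

    for (m = n; m > 1; m >>= 1) {
        stages++;               /* counts log2(n) */
    }
    return (n / 2) * stages;
}

/* Rough butterfly-only cycle estimate.
 * cycles_per_butterfly is the rule-of-thumb figure from this FAQ,
 * e.g. 20 (32-bit fixed), 16 (16-bit fixed), 10 (FPU),
 * 5 (VCU-I) or 2.5 (VCU-II). */
float fft_cycle_estimate(uint32_t n, float cycles_per_butterfly)
{
    return (float)fft_butterflies(n) * cycles_per_butterfly;
}
```

For example, a 1024-point FFT has 512 * 10 = 5120 butterflies, so the estimate is about 102400 cycles in 32-bit fixed-point and about 51200 cycles on the FPU.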

32-bit Floating-Point with TMU

The Trigonometric Math Unit (TMU) is an extension of the 32-bit single-precision Floating-Point Unit (FPU). It provides instructions that perform certain trigonometric and arithmetic operations in a cycle-efficient manner. The TMU-specific instructions speed up magnitude and phase calculations through efficient computation of square roots, divisions, and arc tangents. The TMU can be enabled by setting the following compiler options:

--float_support=fpu32 --tmu_support=tmu0

To make use of the TMU from C code, the following additional option must also be set:

--fp_mode=relaxed

This causes the compiler to replace calls to standard C math library functions, such as sin or cos, with TMU instructions.

Note: Refer to the C2000 Compiler User's Guide (http://www.ti.com/lit/spru514) for all floating-point-related compiler options.

Conclusions:

16-bit Implementation

While the C28x+VCU implementation offers the best performance, it is not available on all devices and there are no plans to include it on future devices. On VCU-I, magnitude and phase calculations perform the same as on a fixed-point device, because VCU-I has no enhancements for these algorithms. VCU-II speeds up only the magnitude calculation.

32-bit Fixed-Point FFT Performance

To improve the performance of a 32-bit fixed-point FFT:

Consider using a floating-point device. The FPU can roughly double the performance (~10 cycles per butterfly versus ~20). In addition, magnitude and phase calculations are faster in floating-point than in 32-bit fixed-point math. The trade-off in resolution between a 32-bit fixed-point and a 32-bit floating-point implementation is negligible.

If the application can tolerate a 16-bit implementation, then consider using the C28x+VCU. This would be faster than a 32-bit fixed-point implementation. The VCU does not, however, have instructions to improve the performance of a phase calculation, and only VCU-II has an instruction for magnitude. These operations are best done in floating-point.

32-bit FPU vs 16-bit VCU

The performance difference between a 16-bit VCU and a 32-bit FPU implementation is not large. The VCU-I (Type 0) does not have enhancements to improve the performance of magnitude and phase calculations. VCU-II has a new instruction that computes the magnitude of a 16-bit fixed-point complex variable in a single cycle; it does not provide any improvement for phase calculations.

CLA

While the CLA itself is not well suited for a full FFT algorithm, it could be considered for magnitude and phase calculations. This would offload these operations from the main CPU. For example, on a device like the F2806x, a floating-point FFT could be performed on the main C28x+FPU and the magnitude calculation performed on the CLA. Because the CLA has limited RAM, the number of points will be limited.