
TMS320F280049C: 1024-point fft (both Real-FFT and Complex-IFFT) using VCU0

Part Number: TMS320F280049C
Other Parts Discussed in Thread: C2000WARE

I need example code for 1024-point FFT (both directions) and need to evaluate if there is enough CPU power for my product's dsp requirements, before settling on the proper part for production.

TI literature suggests that using the VCU0 for FFT yields a speed-up of roughly 4x to 5x, but a counterexample in a related post appeared to dash those hopes.

Q1) To clarify something, if I use C28x VCU for FFT, is the required format to use IQmath, or is that just your generic terminology for fixed-point math?

Q2) Since I start with real data, process it, and then inverse-FFT back to real data, would the RealFFT be most efficient, or should I try doing the ComplexFFT on 2 sets of data ... ?

Q3) More confusion: is the CLA_FFT1024 library call the same as using the VCU or an actual CLA block?

Q4) If I can use 16-bit IQmath with VCU0, then why shouldn't it be faster than on FPU? 

Q5) Given the VCU0 has a 3-deep (?) pipeline, would it be more efficient to use 5-stages of 4-point FFTs, instead of 10-stages of 2-point?

Q6) Where can I get code for a 1024-point FFT that uses C28x+VCU0+TMU?

Q7) Or am I stuck using the 32-bit FPU, and if so, where can I get example code for a 1024-point FFT (C28x+FPU)? Actually, both would be better so I can compare.

Q8) Why does the example code in the library only go up to 256/512-point FFTs? Is a 1024-point FFT not suitable for C28x+VCU0 math or hardware?

Q9) I assume that to get the best performance, I'd need to put the FFT routines in RAM. Can you suggest a good example of how to do this?

thanks, Dan

  • Hello Dan,

For questions 2, 5, 6, and 8 I will include the VCU expert to help answer these (because it is a long weekend in the US, the response will come sometime early next week).

    1. For the VCU FFT you are not required to use IQmath (looking at the data structure for CFFT, the pointers are integer types)
    2. I cannot find the benchmarks for doing the real vs complex FFT in VCU so I will leave this for the VCU expert to respond to, but I believe it would be better to just perform an FFT on one set of data (I do not believe the CFFT has significant performance improvement over the real FFT)
    3. No, these are different functions and different source files (CLA utilizes the CLA hardware, VCU utilizes the VCU hardware)
    4. Unless I'm mistaken, it is faster; looking at the benchmarks in the VCU library doc (C2000Ware_4_XX_XX_XX\libraries\dsp\VCU\c28\docs), the 1024-point CFFT for VCU runs at 14435 cycles whereas the 1024-point CFFT_f32 for FPU runs at 53220 cycles.
5. You can find DSP FPU FFT examples in the C2000Ware directory: C2000Ware_4_XX_XX_XX\libraries\dsp\FPU\c28\examples\fft. I do not believe any of them use a 1024-point FFT, so you will have to modify an existing example or create your own, but these examples include the basic setup.
    6. The existing FFT examples in the DSP FPU library are set up to use RAM to run the FFT I believe, let me know if you need clarification on this or there are issues with the examples
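To make that setup concrete, a rough sketch of a 1024-point CFFT_f32 call is shown below. This is illustrative only, not a drop-in example: the field names (InPtr, OutPtr, CoefPtr, FFTSize, Stages) and functions (CFFT_f32_sincostable, CFFT_f32) are from the C2000Ware fpu_cfft.h header and should be verified against the 256/512-point examples in the examples\fft folder of your library version; buffer placement and alignment follow the example linker files.

    #include "fpu_cfft.h"            // C2000Ware FPU DSP library header

    #define CFFT_SIZE    1024U
    #define CFFT_STAGES  10U         // log2(1024)

    // Interleaved Re/Im buffers; section names follow the library's
    // example linker command files (placement/alignment set there)
    #pragma DATA_SECTION(fftInput,  "FFT_buffer_1")
    float fftInput[2U * CFFT_SIZE];
    #pragma DATA_SECTION(fftOutput, "FFT_buffer_2")
    float fftOutput[2U * CFFT_SIZE];
    float twiddles[CFFT_SIZE];       // twiddle-factor table

    CFFT_F32_STRUCT cfft;

    void doFFT(void)
    {
        cfft.InPtr   = fftInput;     // time-domain samples (Re, Im interleaved)
        cfft.OutPtr  = fftOutput;    // scratch/output buffer
        cfft.CoefPtr = twiddles;
        cfft.FFTSize = CFFT_SIZE;
        cfft.Stages  = CFFT_STAGES;

        CFFT_f32_sincostable(&cfft); // build the twiddle table (once, at init)
        CFFT_f32(&cfft);             // forward 1024-pt complex FFT
        // The butterfly stages ping-pong between the two buffers, so read
        // the result from cfft.CurrentOutPtr rather than assuming fftOutput.
    }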

    Best regards,

    Omer Amir

  • Dan,

    2. Start with the RFFT on real data, then use the iCFFT on the complex FFT data.

    5. I need to check on this. What is implemented is the standard implementation.

    6. The VCU0 (e.g. F28004x) DSP library does not provide asm code for a 1024-pt CFFT.  The most it does is a 256-pt CFFT. The VCU2 (e.g. F2837x) DSP library does provide asm code for a 1024-pt CFFT. The VCU2 has specific CFFT HW instructions. The TMU in general won't be needed because twiddle factors are pre-computed in tables.

    7. The FPU asm code is written more generically, not FFT size specific.

    8. Unfortunately these were developed 10+ years ago, and we don't have the context on why they stopped at 256-pt CFFT asm with VCU0. I don't see a reason why you cannot go beyond, but it's not something we're actively investing in.

    Thanks,

    Sira

Hi Omer and Sira, thank you for the pointers. I believe I have found the right examples (buried in various libraries :-) and have been doing a lot of digging into options for FFTs on real data.
    I now see that the 1024-pt "real-only data" example (FPU) itself calls a 512-pt complex FFT and then does the unwrap trick.
    I'm trying to put together working code this week to confirm my understanding. If my calculations are correct, then for first-stage development I should be able to get by with the FPU only, then see about potentially migrating to CLA-based 16-bit FFTs for faster computation (slower clock).

Q) In an ideal world (where I start with 16-bit real time-domain data), why can't I use a VCU-based FFT for the forward FFT, then use the CLA to convert to single-precision float, do some processing, and then use the main core's FPU to do the iCFFT for the final time-domain steps?

Thanks to you both (Omer and Sira) for your help!
    Dan

9. I think Dan's question is about how to set things up in the generic case when code is running from Flash. Our FFT example (at least the FPU DSP CFFT_f32 example) only has the RAM build configuration, so it doesn't help him much.

So you have to think in terms of what you can fit into RAM in your particular use case:

    - can you fit buffers in RAM?

    - can you run the entire library from RAM, in which case you can modify the linker cmd file like below

    GROUP
    {
        .TI.ramfunc
        {
            -l c28x_vcu0_library_fpu32.lib
        }
    } LOAD = FLASHC,
      RUN = RAMLS1,
      RUN_START(RamfuncsRunStart),
      LOAD_START(RamfuncsLoadStart),
      LOAD_SIZE(RamfuncsLoadSize),
      PAGE = 0

    - or can you only run a portion of the library from RAM, in which case you can modify the linker cmd file like below (just an example)

    .TI.ramfunc : LOAD = FLASHC,
                  RUN = RAMLS1,
                  RUN_START(_RamfuncsRunStart),
                  LOAD_START(_RamfuncsLoadStart),
                  LOAD_SIZE(_RamfuncsLoadSize),
                  PAGE = 0
    {
        --library=driverlib.lib<flash.obj> (.text)
    }

Hi Sira, you are correct. If I can run everything from RAM, then OK, but my gut feeling is that I may have to optimize the main FFT code to run from RAM and keep the rest in Flash, so eventually I may need an example of how to do that (especially when I try to optimize resources for production part selection).

I'd really like to know your opinion about my last question about mixing computation across all three resources: using the VCU's fast complex-math unit for 16-bit data, plus the two FPUs located on the CLA and the CPU.

  • Dan,

    I have some follow up questions:

    - why do you want to convert to float after doing an integer FFT? Why not stick to an integer FFT and IFFT?

    - If you do need float representation, I still don't see why you'll need a CLA to convert from int to float. It might be inefficient to do that. You may be better off running the conversion on the FPU32.

    Thanks,

    Sira

  • Hi Sira,

For the final product, my guess is that it would be best to do my computations as fast as possible, then shut down (go into some low-power mode) until it's time to repeat. So wouldn't it make sense to use as many computational resources as are available?
    Both the CPU+FPU and the CLA have FPU32s. By splitting up the steps, I can use all resources to get the job done quickest. I assume the CLA's FPU can do the conversion, or perhaps the VCU because it will have more available time, but where the conversion happens is not important; the question is whether I can use all three computational resources for best throughput, no?

Here's a follow-on question. In the FPU example code, I think it says the FFT buffers must be address-aligned for the FFT to work. However, the example then defines a buffer that is not power-of-two in size:
    #define TEST_SIZE       (256U)
    #pragma DATA_SECTION(test_output, "FFT_buffer_2")
    float test_output[TEST_SIZE + 2U];

    How can that be?

  • Dan,

Definitely good to do computations as fast as possible, but the approach you are suggesting is: do the FFT with the VCU0, then use the CLA for some operations (float conversion), then use the FPU32 for the IFFT. That's all sequential, not really exploiting the parallelism the architecture provides. Typically, our customers will be running some operations on the C28x CPU+FPU and, in parallel, running something else on the CLA.

    So, in your case, it may make sense to use VCU0 for FFT, and then VCU for IFFT, provided your application can run something else in parallel on the FPU32, and if you still have more stuff to do in parallel, on the CLA.

    That depends on your application and use-case requirements, which I don't know about. Based on that, you can make the call on which exact C2000 device best fits your need.

    I think the alignment restriction is on the input buffer in Flash (which is a power of 2), not the output buffer.

    #if defined(FFT_ALIGN)
    FFT_buffer_1 : > RAMLS, ALIGN = FFT_ALIGN

    FFT_buffer_2 : > RAMLS

    Thanks,

    Sira

  • Hi Sira,

Of course I would parallelize the operations, since my processing can be pipelined and a little pipeline delay can be tolerated. I thought that was obvious. :-)
    I am working on the code now, and will get back if any further questions. Thanks for the assistance.

    Dan

  • Dan,

About aligning to a boundary: the length of the buffer is not a factor. The input to ALIGN() must be a power of 2.

    Thanks,

    Sira