FFT Library question

Blair MacDonald

I found an FFT library (FFTLIB) for the C66x DSP family that says it is for all floating-point processors. Does anyone know if it runs on the C6748? I know that DSPLIB has FFT routines for the C6748, but I wanted to see if the FFTLIB code runs faster.

Blair

over 10 years ago

0 ToddMullanix over 10 years ago

TI__Guru* 96960 points

I moved your thread to the device forum.

Todd

0 Sivaraj Kuppuraj over 10 years ago

TI__Mastermind 35645 points

Hi,

FFTLIB supports only the C66x TI DSP platform in little-endian mode, and only single-precision and double-precision floating-point operations. C66x devices also include an FFT coprocessor, which is limited to fixed-point, power-of-two-sized FFT operations.

Thanks & regards,

Sivaraj K

0 ran35366 over 10 years ago in reply to Sivaraj Kuppuraj

TI__Genius 12805 points

C66x is backward compatible with C674x, but of course not the other way around.

Look at Table 7-3 in http://www.ti.com/lit/ug/spru187u/spru187u.pdf and you will see a list of intrinsics that are available on C66x but not on C674x. If any of these intrinsics is used in the FFT code, running the C66x function on a C674x core will result in an illegal opcode exception (the core will not know what to do with the instruction).

But if you do not believe me, try it.

Of course, if you recompile the natural C version of the library source for C674x, it will run on C674x, but it will be slow.

Ran
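
The recompile-for-the-target idea above can be sketched as a compile-time guard. This is only a minimal sketch, assuming the TI compiler's predefined `_TMS320C6600` macro is set only for C66x builds; the names `FFT_HAS_C66X_INTRINSICS` and `cplx_mul_natural_c` are hypothetical and are not part of FFTLIB or DSPLIB.

```c
/* Assumption: the TI C6000 compiler defines _TMS320C6600 only when building
 * for a C66x target. On a C674x build (or a host compiler) it is undefined,
 * so only the natural C path is selected. */
#if defined(_TMS320C6600)
#define FFT_HAS_C66X_INTRINSICS 1
#else
#define FFT_HAS_C66X_INTRINSICS 0
#endif

/* Hypothetical complex type used only for this illustration. */
typedef struct {
    float re;
    float im;
} cplx_t;

/* Natural C complex multiply: portable, uses no target-specific intrinsics,
 * so it can be recompiled for C674x (at the cost of speed). */
cplx_t cplx_mul_natural_c(cplx_t a, cplx_t b)
{
    cplx_t r;
    r.re = a.re * b.re - a.im * b.im;
    r.im = a.re * b.im + a.im * b.re;
    return r;
}
```

Code that wants a C66x-only fast path would place it under `#if FFT_HAS_C66X_INTRINSICS` and fall back to the natural C routine everywhere else.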

0 Blair MacDonald over 10 years ago

Intellectual 610 points

Thanks folks. Looks like it won't work. I'm having a problem getting close to the benchmark results for the DSPF_sp_icfftr2_dif routine in DSPLIB. All data is in internal RAM (not sure about the code, but caching is enabled) and I'm consistently getting about double the number of cycles as the benchmark for a 256-point FFT. Interrupts are running, but I'm not getting an interrupt during the execution of the FFT (I'm toggling an I/O line to check). The application is a SYS/BIOS app running on a C6748. Any ideas on how to troubleshoot this, or is the benchmark published in the DSPLIB package out to lunch? The DSPLIB version is 3.1.0.0.

Thanks,
Blair

0 Blair MacDonald over 10 years ago in reply to Blair MacDonald

Intellectual 610 points

Sorry, the routine in question is the FFT (DSPF_sp_cfftr2_dit) not the inverse FFT.

Blair

0 ran35366 over 10 years ago in reply to Blair MacDonald

TI__Genius 12805 points

Attachment: /cfs-file/__key/communityserver-discussions-components-files/791/7673.C66x-L1-D-Memory-Banks.pptx

When you say that the data is internal, I assume you mean L1D and not L2, right?

If the measured cycle count is higher than you expected, it may be a memory bank issue.

If the program reads two numbers from L1D in the same cycle, and both numbers are in the same memory bank, an additional delay is inserted.

I have attached two slides to illustrate the problem. They are C66x slides, but the same is true for the C6748 as well.

Look at the slides. If you need help checking whether this applies to your case, post here and I will give you instructions.

Ran
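
A quick way to check for the bank issue described above is to compute which bank an address falls in. The sketch below assumes L1D is organized as 8 banks of 32 bits each; that organization is an assumption here, not something stated in the thread, so confirm it against the attached slides or the device documentation. The function names are hypothetical.

```c
#include <stdint.h>

/* Assumptions: 8 L1D banks, each 4 bytes (32 bits) wide. */
#define L1D_NUM_BANKS        8u
#define L1D_BANK_WIDTH_BYTES 4u

/* Bank that holds the 32-bit word at byte address addr. */
unsigned l1d_bank_of(uint32_t addr)
{
    return (unsigned)((addr / L1D_BANK_WIDTH_BYTES) % L1D_NUM_BANKS);
}

/* Nonzero when two addresses map to the same bank, which makes a pair of
 * same-cycle accesses to them a candidate for a bank-conflict stall. */
int l1d_same_bank(uint32_t a, uint32_t b)
{
    return l1d_bank_of(a) == l1d_bank_of(b);
}
```

Under these assumptions, two buffers whose start addresses differ by a multiple of 32 bytes begin in the same bank, so element i of one and element i of the other always collide.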

0 Blair MacDonald over 10 years ago in reply to ran35366

Intellectual 610 points

Actually the data is in L2. Does it have to be in L1 to get the benchmark timing?

0 ran35366 over 10 years ago in reply to Blair MacDonald

TI__Genius 12805 points

Yes. Otherwise, if L1D is configured as cache, time is spent fetching the data the first time it is accessed.

If the data is small enough, put it in L1D and disable the cache.

Ran

0 Blair MacDonald over 10 years ago in reply to ran35366

Intellectual 610 points

Looks like the data will fit with no problem. I think I need a few pointers on how to locate the data in L1D. I'm using the cfg file to add segments and locate them in memory, but there is no L1 option. The available segments are IROM, IRAM, L3_CBA_RAM and DDR. I've set aside half of L1D for SRAM now and need to know how to locate my data there.

Thanks for the help.
Blair

0 ran35366 over 10 years ago in reply to Blair MacDonald

TI__Genius 12805 points

You need to understand how the linker works. Then you need a pragma to tell the linker where to put the data.

Look, for example, at www.ti.com/.../spru186w.pdf. The linker is described there. If you read the document and still need an example of how to tell the linker where to put the data, let me know.

By the way, if you use RTSC, the memory definition is part of the platform and it is done a little differently. So if you use RTSC (if you do not know what RTSC is, you are not using it), let me know.

Ran
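
One common way to do what is described above is the TI compiler's DATA_SECTION pragma. This is a minimal sketch, not a complete recipe: the section name `.l1d_data`, the array names, and the array sizes are hypothetical, and the section still has to be mapped to an L1D memory range in the linker command file or the platform definition.

```c
#include <stddef.h>

#define FFT_POINTS 256

/* The TI pragmas are guarded so the file still builds with other compilers.
 * ".l1d_data" is a hypothetical section name chosen for this example. */
#ifdef __TI_COMPILER_VERSION__
#pragma DATA_SECTION(fft_data, ".l1d_data")
#pragma DATA_ALIGN(fft_data, 8)
#endif
float fft_data[2 * FFT_POINTS];     /* interleaved real/imaginary samples */

#ifdef __TI_COMPILER_VERSION__
#pragma DATA_SECTION(fft_twiddle, ".l1d_data")
#pragma DATA_ALIGN(fft_twiddle, 8)
#endif
float fft_twiddle[2 * FFT_POINTS];  /* illustrative size, routine-dependent */

/* Bytes the two arrays need in the target memory range. */
size_t fft_l1d_bytes_needed(void)
{
    return sizeof fft_data + sizeof fft_twiddle;
}
```

Placing data by section name keeps absolute addresses out of the source, so the memory map can change without touching the C code.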

0 Blair MacDonald over 10 years ago in reply to ran35366

Intellectual 610 points

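OK, I now have the two buffers, the data and the twiddle table, located in L1D using the following code:

/* 2048 floats = 0x2000 bytes, occupying 0x00F00000 to 0x00F01FFF */
#pragma LOCATION(FftBuffer,0x00F00000);
float FftBuffer[2048];

/* Twiddle table placed 8 bytes past the end of FftBuffer */
#pragma LOCATION(TwiddleBuffer,0x00F02008);
float TwiddleBuffer[1024];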

Code runs and produces the correct result, but still takes twice the time it should. Do the addresses put the data in different banks?

Blair
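
The placement above can be sanity-checked with a little address arithmetic. The sketch below assumes 4-byte floats and assumes that the SRAM half of L1D is the 16 KB window starting at 0x00F00000; the window is a guess for illustration only, since the thread does not say which half was reserved. The type and function names are hypothetical.

```c
#include <stdint.h>

/* A memory region described by its start address and size in bytes. */
typedef struct {
    uint32_t start;
    uint32_t size;
} region_t;

/* Address one past the last byte of the region. */
uint32_t region_end(region_t r)
{
    return r.start + r.size;
}

/* Nonzero when the two regions share at least one byte. */
int regions_overlap(region_t a, region_t b)
{
    return a.start < region_end(b) && b.start < region_end(a);
}

/* Nonzero when the region lies entirely inside [base, limit). */
int region_fits(region_t r, uint32_t base, uint32_t limit)
{
    return r.start >= base && region_end(r) <= limit;
}
```

With FftBuffer at 0x00F00000 (2048 floats, 0x2000 bytes) and TwiddleBuffer at 0x00F02008 (1024 floats, 0x1000 bytes), the buffers do not overlap and the twiddle table ends at 0x00F03008, inside the assumed 16 KB window.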

0 ran35366 over 10 years ago in reply to Blair MacDonald

TI__Genius 12805 points

OK.

The last thing I can suggest is moving the twiddle buffer one more bank away, to 0x00F02010 (maybe in the software pipeline the code reads the first element of vector 1 in the same cycle as the second element of vector 2).

Other than this, the benchmark values may be off. You have the assembly source code, so you can estimate how many cycles it should take.

Ran

0 Blair MacDonald over 10 years ago in reply to ran35366

Intellectual 610 points

No luck. That didn't seem to help. Can the benchmarks be off by a factor of two? People may be making a design choice to use this processor or not based on these benchmarks, so that would be a big problem, wouldn't it? I suspect it is something I'm not doing properly, but I can't find out what yet. I'm running this code under TI-RTOS. Any chance the FFT code could be preempted by a system task like timing?

Blair

0 ran35366 over 10 years ago in reply to Blair MacDonald

TI__Genius 12805 points

Maybe. Run the test code without BIOS; I am not sure what is going on. BIOS in general does not preempt a task unless there is an event, but who knows.

Besides, I assume that you measure the time of the function and not of the task.

Try to run the code without SYS/BIOS and report back.

Thanks

Ran
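
Timing the function itself, rather than the task, might look like the sketch below. It assumes the TI toolchain's `c6x.h` header and the free-running `TSCL` cycle counter on the target, with a `clock()` fallback so the same file builds on a host. `measure_cycles` and `demo_workload` are hypothetical names, and `demo_workload` merely stands in for the FFT call.

```c
#include <stdint.h>

#ifdef __TI_COMPILER_VERSION__
#include <c6x.h>                 /* declares the TSCL control register */
static void cycles_init(void)    { TSCL = 0; }  /* a write starts the counter */
static uint32_t cycles_now(void) { return TSCL; }
#else
#include <time.h>                /* host fallback, not cycle accurate */
static void cycles_init(void)    { }
static uint32_t cycles_now(void) { return (uint32_t)clock(); }
#endif

typedef void (*work_fn)(void *arg);

/* Returns the counter ticks spent inside fn(arg), minus the cost of one
 * back-to-back counter read. */
uint32_t measure_cycles(work_fn fn, void *arg)
{
    uint32_t t0, t1, t2;

    cycles_init();
    t0 = cycles_now();
    t1 = cycles_now();           /* t1 - t0 is the read overhead */
    fn(arg);
    t2 = cycles_now();
    return (t2 - t1) - (t1 - t0);
}

/* Stand-in workload: counts how many times it was called. */
void demo_workload(void *arg)
{
    int *calls = (int *)arg;
    ++*calls;
}
```

Running the measurement twice and comparing the results gives a rough indication of how much of the first run was cache warm-up.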

0 Blair MacDonald over 10 years ago in reply to ran35366

Intellectual 610 points

You are correct, I measured the time in the function. I should get a chance to give it a try without BIOS this afternoon and I'll let you know. One thing I have seen is that there does not appear to be any change when the variables are located in L1D as opposed to L2. I'll try it this afternoon and see what happens. Is there some benchmark code around that could be run as a test? That way I'd know that I was testing under the same conditions as the benchmark.

Thanks,
Blair

0 ran35366 over 10 years ago in reply to Blair MacDonald

TI__Genius 12805 points

One more thing

Try putting the data and twiddle table in L2, enable the L1 cache, and then touch the two arrays before you run the code. Maybe (but just maybe) the cache is faster than reading from the memory.

I am scraping the bottom of the barrel here, trying to find a reason why your code runs slower...

Ran
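
The "touch" mentioned above simply reads through an array so that its lines are already in cache when the timed code runs. A plain C version could look like the sketch below, assuming 64-byte L1D cache lines (an assumption to verify for the device). `touch_buffer` is a hypothetical name and this is not the assembly routine referred to in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte L1D cache lines. Adjust for the actual device. */
#define L1D_LINE_BYTES 64u

/* Reads one byte per cache-line stride so every line of the buffer is
 * fetched into cache. Returns the number of stride reads performed. */
size_t touch_buffer(const void *buf, size_t bytes)
{
    const volatile uint8_t *p = (const volatile uint8_t *)buf;
    size_t i;
    size_t lines = 0;

    for (i = 0; i < bytes; i += L1D_LINE_BYTES) {
        (void)p[i];              /* volatile read, cannot be optimized away */
        lines++;
    }
    if (bytes != 0) {
        (void)p[bytes - 1];      /* covers a final line of an unaligned buffer */
    }
    return lines;
}
```

Calling it on both the data buffer and the twiddle table right before the timed FFT call keeps the first-access misses out of the measurement.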

0 Blair MacDonald over 10 years ago in reply to ran35366

Intellectual 610 points

Will do. I assume the "touch" synchs cache to L2? Any idea if there is a function for this or do I need to twiddle bits in the control registers?

Blair

0 Mukul Bhatnagar over 10 years ago in reply to ran35366

TI__Guru* 85595 points

This may not directly help, but please see the latest DSP benchmark page

www.ti.com/.../dreamdsp.page

www.ti.com/.../core-benchmarks.page

The DSP Benchmarking application note has the recipe for how some of the C6748 benchmarks were captured on the LCDK and what changes were made to the linker command files, etc. If you are able to reproduce the benchmark results from there, that would at least confirm that there are no issues with your setup.

Regards
Mukul

0 ran35366 over 10 years ago in reply to Mukul Bhatnagar

TI__Genius 12805 points

Search E2E for "touch". I recently answered a thread there and gave the assembly source code (or its URL) for the touch function.

Ran

0 Blair MacDonald over 10 years ago in reply to ran35366

Intellectual 610 points

One other question. I've been using DSPF_sp_cfftr2_dit for the FFT and I notice that the benchmark uses DSPF_sp_fftSPxSP. Is one faster than the other?

Blair

0 Blair MacDonald over 10 years ago in reply to Blair MacDonald

Intellectual 610 points

Turns out the problem was the FFT routine I was using. If I use DSPF_sp_fftSPxSP and make sure the memory lines up properly I get times much closer to the benchmark and all is well.

Thanks for the help.
Blair
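
Checking that the memory "lines up" can be done with a small alignment test before the FFT call. The sketch below assumes the kernel wants 8-byte (double-word) aligned buffers; that requirement is an assumption to confirm in the DSPLIB documentation for the routine used. The helper names are hypothetical.

```c
#include <stdint.h>

/* Assumption: FFT buffers should be 8-byte (double-word) aligned. */
#define FFT_ALIGN_BYTES 8u

/* Nonzero when addr is a multiple of align (align must be nonzero). */
int addr_is_aligned(uintptr_t addr, uintptr_t align)
{
    return (addr % align) == 0;
}

/* Nonzero when the input, twiddle and output buffers all meet the assumed
 * alignment requirement. */
int fft_buffers_aligned(const float *in, const float *tw, const float *out)
{
    return addr_is_aligned((uintptr_t)in, FFT_ALIGN_BYTES) &&
           addr_is_aligned((uintptr_t)tw, FFT_ALIGN_BYTES) &&
           addr_is_aligned((uintptr_t)out, FFT_ALIGN_BYTES);
}
```

A debug build could assert on `fft_buffers_aligned` right before calling the FFT routine, so a misplaced buffer shows up immediately instead of as a slow or wrong result.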
