FFT Library question

Other Parts Discussed in Thread: FFTLIB, SYSBIOS

I found an FFT library for the C66x DSP family that says it is for all floating-point processors. Does anyone know if it runs on the C6748? I know that DSPLIB has FFT routines for the C6748, but I wanted to see if the FFTLIB code runs faster.

Blair

  • I moved your thread to the device forum.

    Todd
  • Hi,

    FFTLIB supports only the C66x TI DSP platform in little-endian mode, and only single-precision and double-precision floating-point operations. C66x devices also include an FFT coprocessor, which is limited to fixed-point FFTs with power-of-two sizes.

    Thanks & regards,

    Sivaraj K


  • C66x is backward compatible with C674x, but of course not the other way around.

    Look at table 7-3 in http://www.ti.com/lit/ug/spru187u/spru187u.pdf and you will see a list of intrinsic functions that are available on C66x but not on C674x. If any of these intrinsics is used in the FFT code, running the C66x function on a C674x core will result in an illegal opcode exception (the core will not know what to do with the instruction).

    But if you do not believe me, try it.

    Of course, if you recompile the natural C version of the library source for C674x, it will run on C674x, but it will be slow.

    Ran
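
The recompile point above can be sketched in C. This is a minimal illustration, assuming the TI compiler's predefined target macros `_TMS320C6600` and `_TMS320C6740`; the function names `build_target` and `cmpy_sp` are hypothetical and are not part of FFTLIB or DSPLIB. It shows a natural C complex multiply with no C66x-only intrinsics, which is the kind of code that can be rebuilt for either core.

```c
/* Report which core the code was compiled for. The macro names are
 * assumed from the TI C6000 compiler's predefined symbols; C66x is
 * checked first because a C66x build may also define the C674x macro. */
const char *build_target(void)
{
#if defined(_TMS320C6600)
    return "C66x";
#elif defined(_TMS320C6740)
    return "C674x";
#else
    return "other";
#endif
}

/* Natural C complex multiply: out = a * b, with each value stored as
 * {real, imag}. No intrinsics are used, so it compiles for any target. */
void cmpy_sp(const float a[2], const float b[2], float out[2])
{
    float re = a[0] * b[0] - a[1] * b[1];
    float im = a[0] * b[1] + a[1] * b[0];
    out[0] = re;
    out[1] = im;
}
```

An intrinsic-based variant would belong only in a branch guarded by the C66x macro, with the plain C version kept as the fallback for C674x.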

  • Thanks folks. Looks like it won't work. I'm having a problem getting close to the benchmark results for the DSPF_sp_icfftr2_dif routine in DSPLIB. All data is in internal RAM (not sure about the code, but caching is enabled) and I'm consistently getting about double the number of cycles of the benchmark for a 256-point FFT. Interrupts are running, but I'm not getting an interrupt during the execution of the FFT (I'm toggling an I/O pin to check). The application is a SYSBIOS app running on a C6748. Any ideas on how to troubleshoot this, or is the benchmark published in the DSPLIB package out to lunch? The DSPLIB version is 3.1.0.0.

    Thanks,
    Blair
  • Sorry, the routine in question is the FFT (DSPF_sp_cfftr2_dit) not the inverse FFT.

    Blair
  • When you say that the data is internal, I assume you mean L1D and not L2, right?

    If the measured cycle count is not what you expected, it may be a memory bank issue.

    If the program reads two values from L1D in the same cycle and both values are in the same memory bank, an additional stall is inserted.

    I attach two slides to illustrate the problem (/cfs-file/__key/communityserver-discussions-components-files/791/7673.C66x-L1-D-Memory-Banks.pptx). They are C66x slides, but the same is true for C674x as well.

    Look at the slides. If you need help checking whether you have this case, post here and I will give you instructions.

    Ran

  • Actually the data is in L2. Does it have to be in L1 to get the benchmark timing?
  • Yes. Otherwise, if L1D is configured as cache, time is spent fetching the data the first time it is accessed.

    If the data is small enough, put it in L1D and disable the cache.

    Ran
  • Looks like the data will fit with no problem. I think I need a few pointers on how to locate the data in L1D. I'm using the cfg file to add segments and locate them in memory, but there is no L1 option. The available segments are IROM, IRAM, L3_CBA_RAM and DDR. I've set aside half of L1D for SRAM now and need to know how to locate my data there.

    Thanks for the help.
    Blair
  • You need to understand how the linker works, and then you need a pragma to tell the linker where to put the data.

    Look, for example, at www.ti.com/.../spru186w.pdf, where the linker is described. If you read the document and still need an example of how to tell the linker where to put the data, let me know.

    By the way, if you use RTSC, the memory definition is part of the platform and it is done a little differently. So if you use RTSC (if you do not know what RTSC is, you are not using it), let me know.


    Ran
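
The linker-plus-pragma approach described above might look like the following. This is a sketch under stated assumptions, not a tested configuration: the memory range name `L1DSRAM`, the section name `.l1d_data`, and the buffer names and sizes are hypothetical, and the 0x4000 length assumes half of a 32 KB L1D has been set aside as SRAM. In an RTSC/SYS/BIOS project the memory map comes from the platform, so the linker fragment would have to be adapted.

```c
/* Linker command file fragment (not C), shown here as a comment.
 * L1DSRAM and .l1d_data are hypothetical names:
 *
 *   MEMORY   { L1DSRAM : origin = 0x00F00000, length = 0x00004000 }
 *   SECTIONS { .l1d_data > L1DSRAM }
 */

/* The pragmas are TI-specific, so they are guarded to keep the file
 * compilable with other compilers. */
#ifdef __TI_COMPILER_VERSION__
#pragma DATA_SECTION(FftData, ".l1d_data")
#pragma DATA_ALIGN(FftData, 8)
#endif
float FftData[2 * 256];      /* 256 complex points, interleaved re/im */

#ifdef __TI_COMPILER_VERSION__
#pragma DATA_SECTION(FftTwiddle, ".l1d_data")
#pragma DATA_ALIGN(FftTwiddle, 8)
#endif
float FftTwiddle[2 * 256];   /* illustrative size for the twiddle table */
```

Naming a section and letting the linker place it avoids hard-coding addresses in the source, at the cost of less direct control over which bank each buffer starts in.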
  • OK, I now have the two buffers, the data and the twiddle table, located in L1D using the following code:

    /* Place the FFT data buffer at the start of L1D SRAM. */
    #pragma LOCATION(FftBuffer, 0x00F00000)
    float FftBuffer[2048];

    /* Place the twiddle table just past the data buffer. */
    #pragma LOCATION(TwiddleBuffer, 0x00F02008)
    float TwiddleBuffer[1024];

    The code runs and produces the correct result, but still takes twice the time it should. Do the addresses put the data in different banks?

    Blair
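
The bank question can be checked with a little arithmetic. The sketch below assumes an L1D organised as 8 banks of 4 bytes each; that organisation is an assumption that should be confirmed against the device's cache documentation, and the helper names `l1d_bank` and `same_bank` are hypothetical.

```c
#include <stdint.h>

/* Assumed L1D organisation: 8 banks, each 4 bytes wide. Check the
 * cache/megamodule documentation for the actual device. */
#define L1D_BANK_WIDTH_BYTES 4u
#define L1D_NUM_BANKS        8u

/* Bank that a byte address falls in under the assumption above. */
unsigned l1d_bank(uint32_t addr)
{
    return (unsigned)((addr / L1D_BANK_WIDTH_BYTES) % L1D_NUM_BANKS);
}

/* Nonzero when two addresses fall in the same bank. */
int same_bank(uint32_t a, uint32_t b)
{
    return l1d_bank(a) == l1d_bank(b);
}
```

Under these assumptions, 0x00F00000 maps to bank 0 and 0x00F02008 to bank 2, so the two base addresses do not collide. That alone does not settle the question, because what matters is which pair of addresses the inner loop reads in the same cycle, not where the buffers start.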
  • OK.

    The last thing I can suggest is moving the twiddle buffer one more bank away, to 0x00F02010 (maybe in the software pipeline the code reads the first element of vector 1 in the same cycle as the second element of vector 2).

    Other than this, the benchmark values may be off. You have the assembly source code, so you can estimate how many cycles it should take.

    Ran

  • No luck. That didn't seem to help. Can the benchmarks be off by a factor of two? People may be deciding whether or not to use this processor based on these benchmarks, so that would be a big problem, wouldn't it? I suspect it is something I'm not doing properly, but I can't find out what yet. I'm running this code under TI-RTOS. Is there any chance the FFT code could be preempted by a system task like timing?

    Blair

  • Maybe. Run the test code without BIOS; I am not sure what is going on. BIOS in general does not preempt a task unless there is an event, but who knows.

    Besides, I assume that you are measuring the time of the function and not of the task.

    Try to run the code without SYS/BIOS and report back.

    Thanks

    Ran
  • You are correct, I measured the time in the function. I should get a chance to give it a try without BIOS this afternoon and I'll let you know. One thing I have seen is that there does not appear to be any change when the variables are located in L1D as opposed to L2. I'll try it this afternoon and see what happens. Is there some benchmark code around that could be run as a test? That way I'd know that I was testing under the same conditions as the benchmark.

    Thanks,
    Blair
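
For comparing against a published cycle count, it helps to time only the function call with the core's free-running cycle counter. The following is a minimal sketch, assuming the TI toolchain's `c6x.h` header and its `TSCL` control register; the wrapper names `cycles_init`, `cycles_now` and `cycles_elapsed` are hypothetical, and the non-TI branch exists only so the file compiles elsewhere.

```c
#include <stdint.h>

#ifdef _TMS320C6X
#include <c6x.h>             /* declares the TSCL control register */
#else
#include <time.h>
#endif

/* Start the free-running counter. On C6000 any write to TSCL starts it. */
void cycles_init(void)
{
#ifdef _TMS320C6X
    TSCL = 0;
#endif
}

/* Current counter value (low 32 bits on the DSP, clock() ticks elsewhere). */
uint32_t cycles_now(void)
{
#ifdef _TMS320C6X
    return TSCL;
#else
    return (uint32_t)clock();
#endif
}

/* Unsigned subtraction gives the right answer across one wraparound. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}
```

A typical use is to call `cycles_init()` once, read `cycles_now()` immediately before and after the FFT call, and also time two back-to-back reads so the measurement overhead can be subtracted. Running the FFT twice and timing the second call separates cold-cache from warm-cache behaviour.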
  • One more thing

    Try putting the data and twiddle table in L2, enable the L1 cache, and then touch the two arrays before you run the code. Maybe (but just maybe) the cache is faster than reading from the memory.

    I am scraping the bottom of the barrel here trying to find a reason why your code runs slower...

    Ran
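
A "touch" here means reading through a buffer once so that its lines are already in L1D cache when the timed code runs. The sketch below is a plain C illustration of that idea, not TI's touch routine; the name `touch_buffer` is hypothetical and the 64-byte line size is an assumption to confirm for the actual device.

```c
#include <stddef.h>

/* Assumed L1D cache line size in bytes; confirm for the actual device. */
#define L1D_LINE_BYTES 64u

/* Read one byte from every line-sized step of the buffer so that the
 * lines are allocated in cache before timing starts. Returns the number
 * of reads performed. The volatile pointer keeps the compiler from
 * optimising the loads away. */
size_t touch_buffer(const void *buf, size_t bytes)
{
    const volatile unsigned char *p = (const volatile unsigned char *)buf;
    size_t reads = 0;
    size_t i;

    for (i = 0; i < bytes; i += L1D_LINE_BYTES) {
        (void)p[i];
        reads++;
    }
    if (bytes > 0) {
        (void)p[bytes - 1];  /* covers a final line when buf is unaligned */
        reads++;
    }
    return reads;
}
```

Calling it on both the data buffer and the twiddle table just before the timed FFT call removes first-access misses from the measurement. A hand-written assembly version would likely be faster, but speed matters little for a one-off warm-up.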
  • Will do. I assume the "touch" syncs the cache to L2? Any idea if there is a function for this, or do I need to twiddle bits in the control registers?

    Blair
  • This may not directly help, but please see the latest DSP benchmark page

    www.ti.com/.../dreamdsp.page

    www.ti.com/.../core-benchmarks.page

    The DSP Benchmarking application note has the recipe for how some of the C6748 benchmarks were captured on the LCDK and what changes were made to the linker command files, etc. If you are able to reproduce the benchmark results there, that would at least confirm that there are no issues with your setup.

    Regards
    Mukul
  • Search E2E for "touch". I recently answered an E2E question and gave the assembly source code (or its URL) for the touch function.

    Ran
  • One other question. I've been using DSPF_sp_cfftr2_dit for the FFT and I notice that the benchmark uses DSPF_sp_fftSPxSP. Is one faster than the other?

    Blair
  • Turns out the problem was the FFT routine I was using. If I use DSPF_sp_fftSPxSP and make sure the memory lines up properly I get times much closer to the benchmark and all is well.

    Thanks for the help.
    Blair
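
The thread does not say exactly what "lines up properly" involved. One plausible reading is 8-byte (double-word) alignment of the input, output and twiddle buffers, together with choosing the last-stage radix that a mixed-radix routine such as DSPF_sp_fftSPxSP expects. The helpers below sketch both checks; their names are hypothetical, and the radix rule (4 when N is a power of 4, otherwise 2) is an assumption based on the usual mixed-radix convention that should be confirmed against the DSPLIB documentation.

```c
#include <stdint.h>

/* Radix for the last stage of a mixed-radix FFT of size n (a power of
 * two): 4 when n is a power of 4, otherwise 2. Returns 0 for invalid n. */
int fft_last_radix(int n)
{
    int log2n = 0;

    if (n < 2 || (n & (n - 1)) != 0) {
        return 0;            /* not a power of two */
    }
    while ((1 << log2n) < n) {
        log2n++;
    }
    return (log2n % 2 == 0) ? 4 : 2;
}

/* Nonzero when the pointer is aligned to an 8-byte boundary. */
int is_aligned8(const void *p)
{
    return ((uintptr_t)p % 8u) == 0;
}
```

For the 256-point case discussed here, `fft_last_radix(256)` gives 4. Checking `is_aligned8` on each buffer before the call is a cheap way to catch a misplaced buffer during bring-up.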