[C6678] FFT performance ?

Other Parts Discussed in Thread: FFTLIB

Hi,

My customer has a question about FFT performance.
He found the following document on a website (not a TI website):

http://www.sagivtech.com/contentManagment/uploadedFiles/fileGallery/Multi_core_DSPs_vs_GPUs_TI_for_distribution.pdf

On page 10 there is an FFT performance table that lists "DSP: 0.86 usec" for a 1024-point FFT on the TI C6678 @ 1.2 GHz.

Do you have more detailed information about this? For example, verified sample code, the API used, the FFT library package, and so on.
I tried the fft_sp_1d_c2c_direct API from the C66x FFT library (fftlib_c66x_2_0_0_2) on my C6678 EVM (Core0 only) and got the following performance:

======================= CCS console=========================

...
...
...

FFT memory buffers:

    Buffer    Size(bytes)    Alignment
       0          8192            3
       1          8192            3
       2             8            3
       3             0            3
       4             0            3
       5             0            3
       6             0            3
       7             0            3
       8             0            3
       9             0            3
fft_sp_1d_c2c_direct size= 1024

max_diff = 0.006134 N = 1024 Cycle: 9866

...
...
...

================================================

This shows that the fft_sp_1d_c2c_direct API takes 9866 CPU cycles on one core at 1 GHz. If all eight cores shared the work perfectly in parallel, 9866 / 8 = 1,233.25 CPU cycles could potentially be expected. Furthermore, assuming a 1.2 GHz device, the expected processing time would be 1,233.25 cycles / 1.2 GHz = 1,027.7 nsec = 1.0277 usec.
So I see a gap between 0.86 usec (from the document above) and 1.0277 usec (from my EVM measurement).

Best Regards,
Naoki

  • Hi Naoki,

    I've forwarded this to the FFTC expert. Feedback will be posted here.

    Best Regards,
    Yordan
  • Yordan,

    Thanks for following up on this thread, but the C6678 does not have FFTC hardware. I believe TI must have achieved that FFT performance in software, with something like the FFT library.

    Best Regards,
    Naoki

  • Hi Naoki,

    Yes, this has been forwarded to the correct experts. I made a mistake in my previous post.

    Best Regards,
    Yordan
  • Naoki,

    There are two issues: first, single-core FFT performance; second, FFT across multiple cores.

    I ran the single-precision floating-point FFT on one core of the C6678 using the optimized function in DSPLIB.

    I used DSPLIB release 3.4.0.0, which is part of several Processor SDK releases for K2H, AM57, C6678 and more. DSPLIB contains a single-core implementation.

    I built and ran the test code for DSPF_sp_fftSPxSP_66_LE_ELF (you may need to change the path to the library in the linker command file, set the optimization to full, and suppress all symbolic debug). The number of cycles for a 1K FFT (complex, single-precision floating point) is 8881.

    The DSP core in the C6678 can run at up to 1.25 GHz. Other devices (K2H, for example) run the DSP at 1.2 GHz, so if you convert the cycle count into time (assuming 1.2 GHz), a 1K FFT takes about 7.4 microseconds.

    Your comment about the hardware FFT that exists in multiple TI devices is correct. Not only does the C6678 not have one, but all hardware FFT engines are for fixed-point FFT, not floating point (as far as I know).

    One last comment about the C66x core FFT: in the release unit test code, all the data and the sources are in L2 memory and the L1 caches are enabled. I assume you can get slightly better results if you disable the L1D cache and put the input (8K bytes for a 1K complex single-precision FFT), the output (again 8K) and the twiddle factors (8K) in L1D SRAM.

    About FFTLIB parallel execution: it uses OpenMP to distribute the FFT across multiple cores. We played this game when the size of the FFT is large, for example a 1M FFT. For a 1K FFT, the cost of moving data to and from shared memory so that the algorithm can be distributed across all cores may be high. I will try to benchmark the 1D C2C function, and you can try to benchmark it as well, but again, it may not give an 8x performance advantage.

    Does this help?

    Best regards

    Ran

  • Hello Ran,

    Thank you for your comments. I got almost the same number as yours for DSPF_sp_fftSPxSP using the demo project delivered with the DSPLIB (v3.4) package.

    As for FFTLIB, there is a call to DSPF_sp_fftSPxSP in the fft_sp_1d_c2c_k1_66_LE_ELF CCS demo project. I'm not sure what it is intended for, but I modified the code to benchmark both DSPF_sp_fftSPxSP and the native FFTLIB call. The implementation looks like this:

    === fft_sp_1d_c2c_d.c ===

            ...
            ...
            /* ---------------------------------------------------------------- */
            /* Compute the overhead of allocating and freeing EDMA              */
            /* ---------------------------------------------------------------- */
            p.edmaState = fft_assign_edma_resources();
            fft_free_edma_resources(p.edmaState);
            t_start = _itoll(TSCH, TSCL);
            p.edmaState = fft_assign_edma_resources();
            fft_free_edma_resources(p.edmaState);
            t_stop  = _itoll(TSCH, TSCL);
            t_overhead = t_stop - t_start;
    
            /* Kawada Added -- DSPLIB FFT */
            t_start = _itoll(TSCH, TSCL);
            DSPF_sp_fftSPxSP (N, ptr_x_cn, ptr_w_cn, ptr_y_cn, NULL, rad_cn, 0, N);
            t_stop = _itoll(TSCH, TSCL);
            t_opt  = (t_stop - t_start) - t_overhead;
            printf("DSPF_sp_fftSPxSP\tsize= %d\n", N);
            printf("\tN = %d\tCycle: %d\n\n", N, t_opt);
            ...
            ...
    

    Please note that I put the data sections (like .bss) in L2SRAM. I left .text, .const and the other sections in MSMCSRAM because L2SRAM space is limited (the demo code uses very large buffers to support various FFT sizes).

    Here is the result:

    DSPF_sp_fftSPxSP size= 1024
    N = 1024 Cycle: 10953

    ...

    FFT memory buffers:
    Buffer Size(bytes) Alignment
    0 8192 3
    1 8192 3
    2 8 3
    3 0 3
    4 0 3
    5 0 3
    6 0 3
    7 0 3
    8 0 3
    9 0 3
    fft_sp_1d_c2c_direct size= 1024
    max_diff = 1303.194458 N = 1024 Cycle: 9765

    So it looks like the FFTLIB call (9765) is a bit faster than DSPF_sp_fftSPxSP (10953) in this environment. But 9765 cycles is nowhere near the figure stated in the document. If I put all data and code sections in L2SRAM and use the OpenMP runtime, I might be able to get closer, but as you say, there should be some overhead for distributing the required data to the slave cores in the OpenMP runtime, and that overhead may be large relative to such a small workload (just a 1K FFT). I am still skeptical about the number.

    Best Regards,
    Naoki

  • Naoki

    All I can say as a user is that FFTLIB was built for multi-core algorithms, for either large FFTs or multiple FFTs in parallel. We understand the algorithm for breaking a large FFT across multiple cores (see for example the following links:
    1. www.google.com/url
    2. www.ti.com/.../spry277.pdf )
    and the "cost" in terms of performance of doing so.
    I do not think that I can add anything else.

    Please close the thread.

    Best regards

    Ran
  • Naoki

    Can you zip your FFTLIB project and attach it to this thread? I would like to look at it.


    Best regards and Thanks

    Ran
  • Ran,

    Here is a zip for you.
    Please note that you will need the latest ProcSDK (v03.01.00) for C667x and CCSv6.1.3. I'm assuming all software is installed in the default path (C:\ti).

    0317.fft_sp_1d_c2c_k1_66_LE_ELF.zip

    Best Regards,
    Naoki

  • I forgot to mention that Framework Components version 3_40_02_07 is also required. You can get it from the following path:

    software-dl.ti.com/.../index.html

    Best Regards,
    Naoki
  • Thanks Naoki,
  • Hi,

    Is there any news on this?

    With your project (fft_sp_1d_c2c_k1_66_LE_ELF.zip) I don't even get the same number of clock cycles as you, Naoki.

    Without changing anything in your project I get this result:

    fft_sp_1d_c2c_direct    size= 1024
    max_diff = 0.000000    N = 1024    Cycle: 21619

    With the original example project the DSP needs 21706 cycles. The number of cycles also does not seem to be deterministic: it changes slightly from run to run, for example between 21700 and 21706.

    Naoki, did you also run the code with the Blackhawk XDS560v2 emulator? Could my code need more cycles because I'm running it through the emulator?

    Versions:

    CCS: 6.2.0.00050
    DSPLIB: 3.4.0
    FFTLIB: 3.1.0.0
    FFTLIB_c66x: 2.0.0.2
    EDMA3: 2.12.1
    Framework Components: 3.40.2.07
    SYS/BIOS: 6.45.1.29
    XDAIS: 7.24.0.04
    C667x PDK: 2.0.3
    Board: TMDSEVM6678LE

    Regards,

    Daniel

  • Daniel,

    No, there has been no progress on this. I'm working on other things now, so I cannot try this again at the moment, sorry.
    A few things to confirm:
    Did you try the prebuilt .out file (Release/fft_sp_1d_c2c_k1_66_LE_ELF.out)?
    Did you use the evmc6678l.gel file at C:\ti\ccsv6\ccs_base\emulation\boards\evmc6678l\gel?

    Best Regards,
    Naoki

  • Thanks for getting back to me.
    I can answer yes to both questions, but the result does not change.
    Maybe someone from TI can help?