fft time in evm6678l

Other Parts Discussed in Thread: TMS320C6678, TMS320C6670

Hello, I have just bought a TMDXEVM6678L and am now using it.

My question concerns a project I am debugging from the path "Texas Instruments\dsplib_c66x_3_0_7". The project is named DSPF_sp_fftSPxSP_66_LE_ELF.

In the target configuration (ccxml) I use the Texas Instruments XDS100v1 USB Emulator and the TMS320C6678 device.

I want to run a 1024-point floating-point FFT. Before the FFT I call t1 = clock(), and after the FFT I call t2 = clock().
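A minimal sketch of this kind of clock()-based measurement, assuming the kernel under test is wrapped behind a function pointer. The names `kernel_fn`, `measure_ticks`, and `count_calls` are hypothetical and used only for illustration; `count_calls` merely stands in for the FFT call. The sketch also subtracts the cost of two back-to-back clock() calls, a common precaution that the post above does not mention. What one tick of clock() represents depends on the toolchain and target setup, so the result should be treated as ticks until that is confirmed.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical wrapper type for the code being timed. */
typedef void (*kernel_fn)(void *ctx);

/* Stand-in for the FFT call: it only counts how often it was invoked. */
static void count_calls(void *ctx)
{
    int *calls = (int *)ctx;
    (*calls)++;
}

/* Time one call of fn and subtract the measured cost of clock() itself. */
static clock_t measure_ticks(kernel_fn fn, void *ctx)
{
    clock_t t0, t1, overhead, elapsed;

    assert(fn != NULL);

    /* Two back-to-back reads estimate the overhead of the timer call. */
    t0 = clock();
    t1 = clock();
    overhead = t1 - t0;

    t0 = clock();
    fn(ctx);
    t1 = clock();
    elapsed = t1 - t0;

    /* Clamp at zero in case the overhead estimate exceeds the elapsed time. */
    return (elapsed > overhead) ? (elapsed - overhead) : (clock_t)0;
}
```

In a real test, `count_calls` would be replaced by a small wrapper that calls the FFT routine with its own buffers.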

My result is:    [C66xx_0] DSPF_sp_fftSPxSP Iter#: 1 Result Successful N = 1024 radix = 4 natC: 570776, optC: 379303

Because the C6678 runs at 1.25 GHz, I calculate the FFT time as about 303 us, which seems far too long. In the data AVNET offered, a single-precision floating-point FFT (2048-point, radix 4, C66x @ 1.25 GHz) takes 14 us.
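The conversion from cycles to time used here can be sketched as below. The helper name `cycles_to_us` is hypothetical (it is not part of DSPLIB), and the core clock in MHz is an input that has to match how the EVM is actually configured.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to microseconds.
 * core_mhz is the core clock in MHz, i.e. cycles per microsecond. */
static double cycles_to_us(uint64_t cycles, double core_mhz)
{
    assert(core_mhz > 0.0);
    return (double)cycles / core_mhz;
}
```

With optC = 379303 cycles and an assumed 1250 MHz core clock, `cycles_to_us(379303, 1250.0)` gives roughly 303.4 us.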

Why does this happen?

Thanks in advance!

  • Hi,

    I tried dsplib_c66x_3_0_8 from the latest mcsdk_2_00_00_11 on C6678 EVM.  I set MAXN to 1024.  For 1024-point SP FFT, I got 12873 cycles.

    In your setup, could you change MAXN to 1024 and see what happens?

    -Xiaohui

     

  • Hi,

    I changed my configuration from Debug mode to Release mode. For a 1024-point SP FFT I now get 15304 cycles, and for a 2048-point SP FFT I get 33751 cycles. My EVM is configured for 1 GHz, so I calculate the 2048-point SP FFT time as about 33 us. That is much more than the AVNET figure (15 us). Why?

    Also, what is the difference between Debug mode and Release mode?

  • Jie,

    The difference between debug and release mode is typically two things: (1) the majority of debug information is removed from the release version, and (2) a higher optimization level is typically used in release mode.

    Where is the data that you are operating on?  Is it in internal memory?  Or external?  If external, is data cache turned on and is the cache size large enough?

    Regards,

    Dan

     

  • Hi,

    I use the multicore shared memory. Does it need the data cache turned on? How large should the cache be?

  • Jie,

    Yes, the shared memory is external and does need cache turned on.

    The best answer that I can give you to the size of the data cache is "as large as you can afford".  Keep in mind, though, that with a 2-way set associative cache, you won't get any benefit of a cache larger than 1/2 the size of your data set. 

    Regards,

    Dan

     

  • DanRinkes,

    I moved my data from shared memory to L2, but a 2048-point floating-point FFT still takes 33 us.

    Why is that? L2 is not external memory, so do I still need to turn the cache on?

     

    Xiaohui Li's reply is quoted below. Is 12873 cycles (12 us) for a 1024-point FFT the final result with cache in use? Is that too slow? Can it be faster?

    Xiaohui Li replied to Re: fft time in evm6678l in C66x Multicore DSP Forum.

    Hi,

    I tried dsplib_c66x_3_0_8 from the latest mcsdk_2_00_00_11 on C6678 EVM.  I set MAXN to 1024.  For 1024-point SP FFT, I got 12873 cycles.

    In your setup, could you change MAXN to 1024 and see what happens?

    -Xiaohui

    Also, can I use the cache without BIOS?

    Regards,

    Jie

  • Jie,

    12873 cycles for a 1024-point floating-point FFT is the performance with both code and data (in, out, and twiddle factors) placed in L2 SRAM.  Were you able to duplicate the performance?  This is the performance we can get from the current version of C66x DSPLIB.  There will be future updates and we can expect some performance improvement.

    What kind of performance are you looking for in both the 1024- and 2048-point FFTs?

    Regards,

    Xiaohui

     

     

  • As an innocent bystander: I looked at the example, and I only see a macro for N and not MAXN -- so is that the right file? dsplib/examples/FFT_Example_66_LE_COFF? (and I'm building ELF, but as long as I link the library, I think it's fine)

     

    And that example file has 3 calls to an FFT routine. Are the times that you all are quoting for 1 of those or all 3?  For all 3, I'm getting 30000 cycles for all debug / optimization options (implying the library is optimized only), and for the 16x16 I get 5337 cycles, 16x32 12242, 32x32 13022 cycles.  Are those what you are talking about?

  • We were talking about single precision floating point FFT.

  • Xiaohui Li ,

    I have duplicated this performance; my result is 14843 cycles.

    The AVNET figure is 14 us for a single-precision floating-point FFT (2048-point, radix 4, 1.25 GHz), but my result on the EVM differs from the AVNET data, so I am afraid I have made a mistake somewhere.

    Regards,

    Jie wang.

  • What is the AVNET publication with cycle counts for FFT you referenced?  Would someone please provide me a link to it?

  • Hello,

    I got this data from an AVNET conference.

  • Hello,

    I can obtain more or less the declared performance only without a linker command file, that is, with code and data mapped from location 0. Since address 0 is declared reserved, I suppose it maps to L1 RAM (maybe for compatibility with other CPUs).

    With the following scenario:

    - code in MSMC (not L2 cacheable)

    - FFT in, out, and twiddle factors in DDR3, cacheable

    - L2 RAM configured as all cache

    I obtain the following results:

    - 1024 Complex: min=12.934us, max=26.482us

    - 2048 Complex: min=29.586us, max=58.847us

    Where max is from the first execution, just before a code cache invalidate and a data cache flush, while min is from the second execution.
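A minimal sketch of collecting min and max over repeated runs, in the spirit of the cold and warm measurements described above. All names (`kernel_fn`, `timing_stats`, `time_repeated`, `count_calls`) are hypothetical, `count_calls` only stands in for the FFT call, and the cache invalidate and flush steps are deliberately left out because they depend on the platform library in use.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical wrapper type for the code being timed. */
typedef void (*kernel_fn)(void *ctx);

/* Fastest and slowest observed run, in clock() ticks. */
typedef struct {
    clock_t min_ticks;
    clock_t max_ticks;
} timing_stats;

/* Stand-in for the FFT call: it only counts how often it was invoked. */
static void count_calls(void *ctx)
{
    int *calls = (int *)ctx;
    (*calls)++;
}

/* Run fn `runs` times and record the fastest and slowest run.
 * The first run is typically the cold one (caches not yet warmed). */
static timing_stats time_repeated(kernel_fn fn, void *ctx, int runs)
{
    timing_stats s = { 0, 0 };
    int i;

    assert(fn != NULL && runs > 0);

    for (i = 0; i < runs; i++) {
        clock_t t0 = clock();
        clock_t dt;

        fn(ctx);
        dt = clock() - t0;

        if (i == 0) {
            s.min_ticks = dt;
            s.max_ticks = dt;
        } else {
            if (dt < s.min_ticks) s.min_ticks = dt;
            if (dt > s.max_ticks) s.max_ticks = dt;
        }
    }
    return s;
}
```

Reporting both values makes it easier to tell whether a gap to a published figure comes from cache warm-up or from the kernel itself.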

     

  • Hi, DanRinkes,

    I tried the code in "..\dsplib_c66x_3_0_0_8\packages\ti\dsplib\src\DSPF_sp_fftSPxSP\c66\DSPF_sp_fftSPxSP_66_LE_ELF" on the EVM6678L, and I got the same results as Liang Wen:

    Liang Wen said:

    From "TMS320C6670 Breakthrough performance for process-intensive applications":

     

    C66x @1.2 GHz Single precision floating-point FFT, 2048 pt. radix 4 costs 14.60 us.

     

    But the code in DSPLIB doesn't achieve this performance; it reaches maybe only half the speed.

    I also don't know why.

    Here are the results and the scenario:

    • [C66xx_0] DSPF_sp_fftSPxSP Iter#: 9 Result Successful N = 2048 radix = 2 natC: 97296 optC: 33197 cycles 
    • [C66xx_0] DSPF_sp_fftSPxSP Iter#: 8 Result Successful N = 1024 radix = 4 natC: 40476 optC: 14762 cycles 
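As a rough cross-check of the "half the speed" estimate, the published time can be turned into a cycle budget and compared with the measured optC value. The helper names `us_to_cycles` and `slowdown` are hypothetical, and the sketch assumes the published figure and the measurement refer to the same kind of transform.

```c
#include <assert.h>

/* Convert a time in microseconds to cycles.
 * core_mhz is the core clock in MHz, i.e. cycles per microsecond. */
static double us_to_cycles(double time_us, double core_mhz)
{
    assert(time_us >= 0.0 && core_mhz > 0.0);
    return time_us * core_mhz;
}

/* Ratio of measured cycles to the reference cycle budget. */
static double slowdown(double measured_cycles, double reference_cycles)
{
    assert(reference_cycles > 0.0);
    return measured_cycles / reference_cycles;
}
```

With 14.60 us at 1200 MHz the budget is about 17520 cycles, so the measured 33197 cycles for N = 2048 is about 1.9 times the published figure.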

     

    • both code and data (in, out, and twiddle factors) placed in L2 SRAM
    • ccxml:  texas instruments xds100v1 usb emulator
    • I use clock() as well as on-chip Timer to measure performance, the results are almost the same
    and the project file: 7750.DSPF_sp_fftSPxSP.zip
    How can I replicate the results “C66x @1.2 GHz Single precision floating-point FFT, 2048 pt. radix 4 costs 14.60 us.” on EVM6678L?
    Thanks!