C6457(1GHz) FFT function timing

We use the 1K-point FFT implementation from the C64x+ library in our project. The target DSP is a 1 GHz C6457, and the FFT function used is DSP_fft16x16(). Document SPRUEB8B lists the cycle count for SA as 4218, which corresponds to 4.218 us on a 1 GHz DSP. However, the processing time we measured is about 86 us. Evidently our usage of the assembly library is not correct. Could you please help us find the right FFT assembly implementation/usage for our project?
 
For reference, the natural C timing is measured as 38 us (optimization level 3 with full suppression of debugging) compared to 26268 cycles in the TI document, and the intrinsic C timing is 9.2 us compared to 5369 cycles.
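To put the measured times and the published cycle counts in the same unit, the conversion can be sketched as below. The helper names are hypothetical and the 1000 MHz clock is the figure stated above. With it, 86 us is about 86000 cycles (roughly 20 times the 4218-cycle benchmark), 38 us is about 38000 cycles against 26268, and 9.2 us is about 9200 cycles against 5369.

```c
#include <stdint.h>

/* Hypothetical helper: convert a cycle count to microseconds.
 * clock_mhz is the core clock in MHz (1000.0 for a 1 GHz device),
 * so one microsecond contains clock_mhz cycles. */
static double cycles_to_us(uint32_t cycles, double clock_mhz)
{
    return (double)cycles / clock_mhz;
}

/* Hypothetical helper: convert a measured time in microseconds
 * back to an equivalent cycle count at the same clock. */
static double us_to_cycles(double us, double clock_mhz)
{
    return us * clock_mhz;
}
```

For example, cycles_to_us(4218, 1000.0) gives the 4.218 us quoted from the document, and us_to_cycles(86.0, 1000.0) gives 86000 cycles for the measured assembly time.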


Thanks

Guangyi Wang
  • Guangyi,

    The performance benchmarks published in the TI document were obtained with a C64x+ cycle-accurate simulator that assumes all code is in internal memory, so the code is unlikely to give you the same numbers on the actual device because of the additional time required for memory accesses and caching.

    As for the performance of the assembly function compared to the natural C and intrinsic C versions, it appears that the test setup for the assembly function may differ from the test bench used for the other versions. We provide a test bench for every function in the library, in the library's src folder, that can be used as a reference implementation. Can you run the test bench for the fft16x16 function and report the numbers you see?

    Regards,

    Rahul 

  •  

    Hi Rahul

     

    We use internal RAM for code space. So, the cycle count should match the benchmark.
     
    The most important issue we need to solve is the processing timing for assembly FFT implementation. It's even slower than that of the natural C implementation.
     
    Here are the 3 FFT function calls used in our DSP project for comparison:
     

    // Natural C

    DSP_fft16x16_cn(&gTwiddleFft[0], FFT_SIZE_IMP, fftInPtr, (Int16*) &gFftOutUl[0]);

    // Intrinsic C

    DSP_fft16x16_i(&gTwiddleFft[0], FFT_SIZE_IMP, fftInPtr, (Int16*) &gFftOutUl[0]);

    // Assembly

    DSP_fft16x16(&gTwiddleFft[0], FFT_SIZE_IMP, fftInPtr, (Int16*) &gFftOutUl[0]);

    Could you please help us to find out why the assembly implementation is so slow?

    Thanks

    Guangyi
  • Guangyi,

    Assuming you have the same memory setup for all tests, the benchmarks that you are describing seem unlikely unless you have a timer/counter that is overflowing and wrapping around, or the C code for some reason exits without computing the entire FFT. Have you compared the outputs to see if they are the same? Is it possible for you to send your test project so that we can replicate the scenario or maybe review the code for you?
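The two checks mentioned above (a counter that wraps around, and outputs that differ between versions) can be sketched in portable C as follows. This is a minimal sketch under stated assumptions, not the library's test bench: elapsed_cycles() and count_mismatches() are hypothetical helper names, and the start/end values are assumed to be reads of a free-running 32-bit cycle counter taken immediately before and after the FFT call.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: elapsed count between two reads of a
 * free-running 32-bit counter. Unsigned subtraction is modulo 2^32,
 * so a single wraparound between the reads still gives the right
 * answer, provided the true interval is shorter than 2^32 counts. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Hypothetical helper: count the elements that differ between two
 * 16-bit output buffers of length n (n counts int16_t elements, so a
 * complex FFT of N points has n = 2 * N). Returns 0 when identical. */
static size_t count_mismatches(const int16_t *a, const int16_t *b, size_t n)
{
    size_t mismatches = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            mismatches++;
        }
    }
    return mismatches;
}
```

At 1 GHz a 32-bit cycle counter wraps about every 4.29 seconds, so a measurement of tens of microseconds should survive one wrap when computed this way. A counter that is narrower, or that is reset between the two reads, would need different handling.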

    Regards,

    Rahul

     

  • Hi Rahul

     

    Please send me your contact info

     

    Thanks

     

    Guangyi

  •  

     

    Hi Rahul

     

    We have compared the outputs of the natural C and assembly implementations, and they are the same. The problem is that the assembly implementation consumes many more cycles than expected, even more than the natural C implementation.

     

    Is there anything extra we need to do when building the assembly routine from the TI library, such as setting build properties? We are using CCSv4.1.2.

     

    Thanks

     

    Guangyi

  • Guangyi,

    Can you send your code to the Developer Mailing List mentioned here so that we can take a look at the issue?

    http://processors.wiki.ti.com/index.php/Software_libraries#Developer_Mailing_List

    Regards,

    Rahul

  • Did you ever figure out the problem? We're having similar issues.