fft time in evm6678l

jie wang75279

Other Parts Discussed in Thread: TMS320C6678, TMS320C6670

Hello, I have just bought a TMDXEVM6678L and am using it.

My question concerns a project I am debugging from the path "Texas Instruments\dsplib_c66x_3_0_7"; the project name is DSPF_sp_fftSPxSP_66_LE_ELF.

In the target configuration (.ccxml) I use the Texas Instruments XDS100v1 USB Emulator with the TMS320C6678.

I want to do a 1024-point floating-point FFT. Before the FFT I call t1=clock(), and after the FFT I call t2=clock().

My result is: [C66x_0] DSPF_sp_fftSPxSP Iter#: 1 Result Successful N = 1024 radix = 4 natC: 570776 optC: 379303

Because the C6678 runs at 1.25 GHz, I calculate the FFT time as about 303 us, which is too long. In the data AVNET presented, a single-precision floating-point FFT (2048-point, radix 4, C66x @ 1.25 GHz) takes 14 us.

Why does this happen?

Thanks in advance!

over 14 years ago

0 Xiaohui Li over 14 years ago

TI__Intellectual 1870 points

Hi,

I tried dsplib_c66x_3_0_8 from the latest mcsdk_2_00_00_11 on C6678 EVM. I set MAXN to 1024. For 1024-point SP FFT, I got 12873 cycles.

In your setup, could you change MAXN to 1024 and see what happens?

-Xiaohui

0 jie wang75279 over 14 years ago in reply to Xiaohui Li

Intellectual 740 points

Hi,

I changed my configuration from debug mode to release mode. For a 1024-point SP FFT I then got 15304 cycles, and for a 2048-point SP FFT I got 33751 cycles. My EVM is configured to run at 1 GHz, so I calculate the 2048-point SP FFT time as 33 us. That is much more than the figure from AVNET (15 us). Why?

Also, what is the difference between debug mode and release mode?

0 DanRinkes over 14 years ago in reply to jie wang75279

TI__Expert 8055 points

Jie,

The difference between debug and release mode typically comes down to two things: 1. the majority of debug information is removed from the release version, and 2. a higher optimization level is used in release mode.

Where is the data that you are operating on? Is it in internal memory? Or external? If external, is data cache turned on and is the cache size large enough?

Regards,

Dan

0 jie wang75279 over 14 years ago in reply to DanRinkes

Intellectual 740 points

Hi,

I use the multicore shared memory. Does it need the data cache turned on? How large should the cache be?

0 DanRinkes over 14 years ago in reply to jie wang75279

TI__Expert 8055 points

Jie,

Yes, the shared memory is external and does need cache turned on.

The best answer that I can give you to the size of the data cache is "as large as you can afford". Keep in mind, though, that with a 2-way set associative cache, you won't get any benefit of a cache larger than 1/2 the size of your data set.

Regards,

Dan

0 jie wang75279 over 14 years ago in reply to DanRinkes

Intellectual 740 points

DanRinkes,

I moved my data from shared memory to L2, but a 2048-point floating-point FFT still takes 33 us.

Why is that? L2 is not external memory, so do I still need to turn the cache on?

Regarding Xiaohui Li's earlier reply (12873 cycles for a 1024-point SP FFT with MAXN set to 1024, using dsplib_c66x_3_0_8 from mcsdk_2_00_00_11): is 12873 cycles (12 us) for a 1024-point FFT the final result using cache? Is it too slow? Can it be faster?

Also, can I use the cache without BIOS?

Regards,

Jie

0 Xiaohui Li over 14 years ago in reply to jie wang75279

TI__Intellectual 1870 points

Jie,

12873 cycles for a 1024-point floating-point FFT is the performance with both code and data (in, out, and twiddle factors) placed in L2 SRAM. Were you able to duplicate the performance? This is the performance we can get from the current version of C66x DSPLIB. There will be future updates, and we can expect some performance improvement.

What kind of performance are you looking for for both 1024 and 2048 FFT?

Regards,

Xiaohui

0 Tim Wentz over 14 years ago in reply to Xiaohui Li

TI__Intellectual 1270 points

As an innocent bystander: I looked at the example, and I only see a macro for N and not MAXN -- so is that the right file? dsplib/examples/FFT_Example_66_LE_COFF? (and I'm building ELF, but as long as I link the library, I think it's fine)

And that example file has 3 calls to an FFT routine. Are the times that you all are quoting for 1 of those or all 3? For all 3, I'm getting 30000 cycles for all debug / optimization options (implying the library is optimized only), and for the 16x16 I get 5337 cycles, 16x32 12242, 32x32 13022 cycles. Are those what you are talking about?

0 Xiaohui Li over 14 years ago in reply to Tim Wentz

TI__Intellectual 1870 points

We were talking about single precision floating point FFT.

0 jie wang75279 over 14 years ago in reply to Xiaohui Li

Intellectual 740 points

Xiaohui Li ,

I have reproduced this performance; my result is 14843 cycles.

The figure from AVNET is 14 us for a single-precision floating-point FFT (2048-point, radix 4, 1.25 GHz), but my test result on the EVM differs from the AVNET data, so I am afraid I have made a mistake.

Regards,

Jie wang.

0 James Steed over 13 years ago in reply to jie wang75279

Expert 1905 points

What is the AVNET publication with cycle counts for FFT you referenced? Would someone please provide me a link to it?

0 jie wang75279 over 13 years ago in reply to James Steed

Intellectual 740 points

Hello,

I got this data at an AVNET conference.

0 Alberto Chessa over 13 years ago in reply to jie wang75279

Mastermind 6670 points

Hello,

I can obtain more or less the declared performance only without a linker command file, that is, with code and data mapped from location 0. Since address 0 is declared reserved, I suppose it maps to L1 RAM (maybe for compatibility with other CPUs).

With the following scenario:

- code in MSMC (not L2 cacheable)

- FFT in, out and twiddle factors in DDR3, cacheable

- L2 RAM configured as all cache

I obtain the following results:

- 1024 Complex: min=12.934us, max=26.482us

- 2048 Complex: min=29.586us, max=58.847us

Where max is from the first execution, just before a code cache invalidate and a data cache flush, while min is from the second execution.

0 Hui Zhang over 13 years ago

Prodigy 10 points

Hi, DanRinkes,

I tried the code in "..\dsplib_c66x_3_0_0_8\packages\ti\dsplib\src\DSPF_sp_fftSPxSP\c66\DSPF_sp_fftSPxSP_66_LE_ELF" on the EVM6678L, and I got the same results as Liang Wen:

Liang Wen said:

from"TMS320C6670 Breakthrough performance for process-intensive applications"

C66x @1.2 GHz Single precision floating-point FFT, 2048 pt. radix 4 costs 14.60 us.

but the code in dsplib doesn't achieve this performance , maybe only half the speed...

i also don't know why...

Here are the results and the scenario:

[C66xx_0] DSPF_sp_fftSPxSP Iter#: 9 Result Successful N = 2048 radix = 2 natC: 97296 optC: 33197 cycles
[C66xx_0] DSPF_sp_fftSPxSP Iter#: 8 Result Successful N = 1024 radix = 4 natC: 40476 optC: 14762 cycles

Both code and data (in, out, and twiddle factors) are placed in L2 SRAM.
ccxml: Texas Instruments XDS100v1 USB Emulator
I use clock() as well as the on-chip timer to measure performance; the results are almost the same.

and the project file: 7750.DSPF_sp_fftSPxSP.zip

How can I replicate the results “C66x @1.2 GHz Single precision floating-point FFT, 2048 pt. radix 4 costs 14.60 us.” on EVM6678L?

Thanks!
