Neon vs DSP FFT performance

Other Parts Discussed in Thread: OMAP3530

I'm seeing some odd timing results comparing FFT performance between the ARM/Neon and DSP cores of the OMAP3530. 

  • FFT size 32768
  • ARM/Neon-optimized FFmpeg FFT timing: 10.4 ms (32-bit floating point, complex)
  • DSP timing: 60 ms (16-bit fixed point, complex)

With the large FFT size, it is not possible to place the data in internal DSP memory.  Is there a logical explanation for why the Neon would appear to outperform the DSP?
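
For reference, the Neon measurement brackets a call sequence like the one below (a minimal sketch against FFmpeg's avfft interface; the buffer handling is illustrative, not the exact test code):

#include <libavcodec/avfft.h>
#include <libavutil/mem.h>

#define FFT_BITS 15                              /* 2^15 = 32768 points */

void neon_fft_once(void)
{
    FFTContext *ctx = av_fft_init(FFT_BITS, 0);  /* forward transform */
    FFTComplex *buf = av_malloc((1 << FFT_BITS) * sizeof(FFTComplex));

    /* ... fill buf with 32768 complex float samples ... */

    av_fft_permute(ctx, buf);                    /* bit-reversal reorder */
    av_fft_calc(ctx, buf);                       /* in-place complex FFT */

    av_free(buf);
    av_fft_end(ctx);
}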

Thanks.

 

  • Forgot to mention: the DSP FFT kernel is from the TI C64x+ DSPLIB.

     

  • Rick,

    Can you describe your setup for running the FFT code on the DSP? Are you running the code only on the DSP, or are you calling the DSP function from your ARM application using DSPLink? At what operating frequencies of the ARM and DSP are you running these tests? Please note that on the OMAP3530 platform the ARM operating frequency is higher than the DSP frequency.

    I am a little surprised by the numbers you are observing. The performance number for the 16-bit FFT with 32K samples on the DSP is around 205K cycles, which on the OMAP3530 platform should be around 400 µs (at 520 MHz). You can find these performance numbers in the documentation of the C64x+ DSPLIB, in the docs folder of the library. Please also take a look at the performance numbers for a 64K-sample FFT on OMAP3530 that we observed while calling this function from an ARM application using our new DSP+ARM development tool, C6EZAccel, at the following URL.

    http://processors.wiki.ti.com/index.php/C6Accel_FAQ

     

    Regards,

    Rahul

  • Rahul,

    Thanks for the response. 

    Briefly, my overall setup for running DSP code on the OMAP3530 is as follows (see the sketch after this list):

    • DSPLink v1.63 used for general DSP control, such as code loading, processor start, etc.
      • Minimal use of DSPLink functionality, to minimize ARM/DSP IPC overhead
        • Notify API called from the ARM to signal the DSP to start the calculation
        • Notify API called from the DSP to signal the ARM that the calculation is complete
        • No other DSPLink facilities (RingIO, MSGQ, etc.) used at runtime
    • CMEM module used on the ARM side to allocate contiguous memory (outside the kernel-aware region) as ARM/DSP shared memory
    • C64x+ DSPLIB used for the FFT
    • Clocks:
      • ARM 500 MHz
      • DSP 365 MHz
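
    In outline, the ARM-side control path is something like the sketch below. Treat it as pseudocode: the header names and call signatures follow the DSPLink 1.63 and CMEM documentation as I read them, and the image path, event numbers, and buffer size are placeholders.

    #include <proc.h>                /* DSPLink GPP-side PROC API */
    #include <notify.h>              /* DSPLink GPP-side NOTIFY API */
    #include <cmem.h>                /* TI CMEM contiguous allocator */

    #define DSP_ID    0
    #define IPS_ID    0
    #define EVT_START 5              /* placeholder event numbers */
    #define EVT_DONE  6

    static volatile int dspDone = 0;

    /* Fires when the DSP signals that the calculation is complete. */
    static Void doneCb(Uint32 eventNo, Pvoid arg, Pvoid info)
    {
        dspDone = 1;
    }

    int run_dsp_fft(void)
    {
        CMEM_AllocParams params = CMEM_DEFAULTPARAMS;
        void *shared;

        PROC_setup(NULL);
        PROC_attach(DSP_ID, NULL);
        PROC_load(DSP_ID, "dsp_fft.out", 0, NULL);   /* placeholder image */
        PROC_start(DSP_ID);
        NOTIFY_register(DSP_ID, IPS_ID, EVT_DONE, (FnNotifyCbck)doneCb, NULL);

        /* Contiguous ARM/DSP shared buffer: input + output + twiddles. */
        CMEM_init();
        shared = CMEM_alloc(384 * 1024, &params);

        /* ... fill input and twiddles, write back the ARM cache ... */

        NOTIFY_notify(DSP_ID, IPS_ID, EVT_START, (Uint32)CMEM_getPhys(shared));
        while (!dspDone)
            ;                        /* spin (or sleep) until the DSP replies */

        /* ... invalidate the ARM cache over the output, consume results ... */
        return 0;
    }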

    The clock cycles quoted in the DSPLIB doc assume all data is located in internal memory; for a 32K-point, 16-bit FFT, that is not possible. Incidentally, I do not see numbers for a 64K-point data set in the DSPLIB doc; are you extrapolating from the highest shown, 16K? For my case of 32K points, 384 KB of data space is required: 32768 complex points at 4 bytes each is 128 KB per buffer, and input, output, and twiddle factors each need their own buffer since in-place calculation is not allowed per the doc. The DSP cache is configured for the maximum available (L1 data = 32 KB, L2 data = 64 KB). I was able to get the DSP version of the 32K-point FFT down to 29.3 ms by using the "cache optimized" version of the DSPLIB FFT (DSP_fft16x16r(), as per the example in the doc), and by catching an error in the cache configuration of the ARM/DSP shared memory region (the appropriate MAR bits were not set; see the sketch below).
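
    For anyone who hits the same cache-configuration problem: the fix amounts to marking the external shared region cacheable from the DSP side. With DSP/BIOS 5 this can be done through the BCACHE module, roughly as below (the base address, length, and buffer names are placeholders for wherever your shared region lives; each MAR bit covers a 16 MB page):

    #include <bcache.h>          /* DSP/BIOS 5 cache control module */

    /* Make the external ARM/DSP shared region cacheable on the DSP. */
    BCACHE_setMar((Ptr)0x87000000, 0x01000000, 1);

    /* Keep caches coherent around each transfer (buffers hypothetical): */
    BCACHE_inv((Ptr)inBuf, inSize, TRUE);     /* invalidate before reading */
    BCACHE_wb((Ptr)outBuf, outSize, TRUE);    /* write back after writing  */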

    It is still odd that I seem to get better performance on the ARM side, since it is also using external RAM for all data. Several things to note: the ARM has 256 KB of L2 cache, the ARM is clocked higher, and the ARM is using floating point data (I believe Neon is more heavily optimized for floats, even though it can handle both floating and fixed point). Attached is a test project with the Neon-optimized FFT from FFmpeg. The project uses scons for building; to build for a Gumstix dev board: "scons -f GSconstruct".

    -Rick

     

    FFMPEG_FFT.zip
  • Also, my method of timing analysis is to set a GPIO line high just before the call to the FFT function and low immediately after the function returns; the debug macros are sketched below.
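
    For completeness, the debug macros are along these lines (illustrative only; the GPIO bank base and pin are placeholders for the board wiring, with the set/clear register offsets taken from the OMAP35xx TRM):

    #define GPIO5_BASE        0x49056000u   /* OMAP35xx GPIO5 bank */
    #define GPIO_SETDATAOUT   0x94u
    #define GPIO_CLEARDATAOUT 0x90u
    #define DBG_PIN           (1u << 10)    /* placeholder pin in the bank */

    #define GPIO_DBG_HI() \
        (*(volatile unsigned int *)(GPIO5_BASE + GPIO_SETDATAOUT) = DBG_PIN)
    #define GPIO_DBG_LO() \
        (*(volatile unsigned int *)(GPIO5_BASE + GPIO_CLEARDATAOUT) = DBG_PIN)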

    -Rick

     

  • Rick,

    We are looking into this issue. Once we replicate it, we will be able to profile the application and let you know the cause of this occurrence. As you can see from the benchmark I reported earlier, the DSP executes the FFT in about 400 µs; however, when you call a function on the DSP using DSPLink, there is overhead from cache invalidations as well as address translation, which is probably the source of the extra time you are seeing.

    We will take a look at this and see if we can help you achieve performance better than the 6.9 ms you are expecting.

    Regards,

    Rahul

  • Rahul

    For the OMAP3530, the MMU mapping is one-to-one, physical to virtual. What overhead would the address translation cause? The cache calls are also minimal and do not explain numbers in the millisecond range.

    For Notify, the profiling numbers are in the microsecond range, roughly 200 to 400 µs, not in the millisecond range.

    Could there be some delay in clearing the previous event on the ARM core? There are cache calls in a loop in the DSP ISR if the ARM core has not cleared the previous event. This could happen if the application on the ARM is busy with something else, preventing the previous event from being cleared and thereby increasing the latency.

    Deepali

  • I've observed roughly the same timing for Notify.  To test this, a task on the DSP side waits for a Notify from the ARM (a semaphore pend, posted from the DSP Notify callback ISR), then the DSP immediately responds with a Notify back to the ARM.  Timing was measured from immediately before the ARM sends its Notify to immediately after the ARM receives the Notify response from the DSP.  The round trip takes ~300-350 µs, so about half of that per Notify; a sketch of the DSP side follows.
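
    Roughly, the DSP side of that ping-pong test looks like the sketch below (assuming the DSPLink 1.63 DSP-side Notify API and the DSP/BIOS SEM module; event numbers are placeholders, and pingSem is created elsewhere with SEM_create(0, NULL)):

    #include <sem.h>                    /* DSP/BIOS semaphores */
    #include <sys.h>                    /* SYS_FOREVER */
    #include <notify.h>                 /* DSPLink DSP-side Notify */

    #define IPS_ID   0
    #define EVT_PING 5                  /* placeholder event numbers */
    #define EVT_PONG 6

    static SEM_Handle pingSem;          /* created during init */

    /* ISR-context callback: the ARM's Notify has arrived. */
    static Void pingCb(Uint32 eventNo, Ptr arg, Ptr info)
    {
        SEM_post(pingSem);              /* wake the waiting task */
    }

    Void pingPongTask(Void)
    {
        NOTIFY_register(ID_GPP, IPS_ID, EVT_PING, (FnNotifyCbck)pingCb, NULL);
        for (;;) {
            SEM_pend(pingSem, SYS_FOREVER);              /* wait for ARM */
            NOTIFY_notify(ID_GPP, IPS_ID, EVT_PONG, 0);  /* reply at once */
        }
    }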

    The FFT timing reported is as observed on the DSP (code excerpt below, where N = 32K).

    GPIO_DBG_HI();  /* timing marker: line high */
    /* First pass: run the early stages of the full N-point FFT, stopping at
       sub-blocks of N/4 so the working set stays cache-resident. */
    DSP_fft16x16r( N,   &x1_16x16[0],       &w_16x16[0],       y1_16x16, N/4,   0,     N );
    /* Remaining passes: four independent N/4-point FFTs, one per quarter of
       the data, all sharing the sub-twiddle table starting at w[2*3*N/4]. */
    DSP_fft16x16r( N/4, &x1_16x16[0],       &w_16x16[2*3*N/4], y1_16x16, RADIX, 0,     N );
    DSP_fft16x16r( N/4, &x1_16x16[2*N/4],   &w_16x16[2*3*N/4], y1_16x16, RADIX, N/4,   N );
    DSP_fft16x16r( N/4, &x1_16x16[2*N/2],   &w_16x16[2*3*N/4], y1_16x16, RADIX, N/2,   N );
    DSP_fft16x16r( N/4, &x1_16x16[2*3*N/4], &w_16x16[2*3*N/4], y1_16x16, RADIX, 3*N/4, N );
    GPIO_DBG_LO();  /* timing marker: line low */

    The GPIO_DBG_*() macros toggle a GPIO line directly from the DSP code, observable on a scope, so the observed time must come from calculations related to the FFT, data memory accesses, or program memory accesses.

     

  • Rick, 

    In your study of Neon vs DSP, have you compared these benchmarks:

    - Dhrystone

    - Whetstone

    - BDTI

    - FFTW

    Rgds, Sandeep

  • Rick Rogers said:

    [...] my overall setup for running DSP code on the OMAP3530 is as follows:

    • DSPLink v1.63 used for general DSP control, such as code loading, processor start, etc.
      • Minimal use of DSPLink functionality, to minimize ARM/DSP IPC overhead
        • Notify API called from the ARM to signal the DSP to start the calculation
        • Notify API called from the DSP to signal the ARM that the calculation is complete
        • No other DSPLink facilities (RingIO, MSGQ, etc.) used at runtime
    • CMEM module used on the ARM side to allocate contiguous memory (outside the kernel-aware region) as ARM/DSP shared memory

    Could you provide the code you mentioned above? It looks like I'm stuck in my bi-directional Notify program with my own memory allocation...