Neon vs DSP FFT performance

Other Parts Discussed in Thread: OMAP3530

I'm seeing some odd timing results comparing FFT performance between the ARM/Neon and DSP cores of the OMAP3530. 

  • FFT size 32768
  • ARM/Neon-optimized FFmpeg FFT timing: 10.4 ms (32-bit floating point, complex)
  • DSP timing: 60 ms (16-bit fixed point, complex)

With the large FFT size, it is not possible to place the data in internal DSP memory.  Is there a logical explanation for why the Neon would appear to outperform the DSP?
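
For reference, the Neon measurement brackets a call sequence like the one below (a minimal sketch against FFmpeg's avfft interface; the buffer handling is illustrative, not the exact test code):

#include <libavcodec/avfft.h>
#include <libavutil/mem.h>

#define FFT_BITS 15                              /* 2^15 = 32768 points */

void neon_fft_once(void)
{
    FFTContext *ctx = av_fft_init(FFT_BITS, 0);  /* forward transform */
    FFTComplex *buf = av_malloc((1 << FFT_BITS) * sizeof(FFTComplex));

    /* ... fill buf with 32768 complex float samples ... */

    av_fft_permute(ctx, buf);                    /* bit-reversal reorder */
    av_fft_calc(ctx, buf);                       /* in-place complex FFT */

    av_free(buf);
    av_fft_end(ctx);
}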

Thanks.

 

  • Forgot to mention: the DSP FFT kernel is from the TI C64x+ DSPLIB.

     

  • Rick,

    Can you describe your setup for running the FFT code on the DSP? Are you running the code only on the DSP, or are you calling the DSP function from your ARM application using DSPLink? At what operating frequencies of the ARM and DSP are you running these tests? Please note that on the OMAP3530 platform the ARM operating frequency is higher than the DSP frequency.

    I am a little surprised by the numbers you are observing. The performance number for the 16-bit FFT with 32K samples on the DSP is around 205K cycles, which on the OMAP3530 platform should be around 400 µs (at 520 MHz). You can find these performance numbers in the documentation of the C64x+ DSPLIB, in the docs folder of the library. Please also take a look at the performance numbers for a 64K-sample FFT on OMAP3530 that we observed while calling this function from an ARM application using our new DSP+ARM development tool, C6EZAccel, at the following URL.

    http://processors.wiki.ti.com/index.php/C6Accel_FAQ

     

    Regards,

    Rahul

  • Rahul,

    Thanks for the response. 

    Briefly, my overall setup for running DSP code on the OMAP3530 is as follows (see the sketch after this list):

    • DSPLink v1.63 used for general DSP control, such as code loading, processor start, etc.
      • Minimal use of DSPLink functionality, to minimize ARM/DSP IPC overhead
        • Notify API called from the ARM to signal the DSP to start the calculation
        • Notify API called from the DSP to signal the ARM that the calculation is complete
        • No other DSPLink facilities (RingIO, MSGQ, etc.) used at runtime
    • CMEM module used on the ARM side to allocate contiguous memory (outside the kernel-aware region) as ARM/DSP shared memory
    • C64x+ DSPLIB used for the FFT
    • Clocks:
      • ARM 500 MHz
      • DSP 365 MHz
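
    In outline, the ARM-side control path is something like the sketch below. Treat it as pseudocode: the header names and call signatures follow the DSPLink 1.63 and CMEM documentation as I read them, and the image path, event numbers, and buffer size are placeholders.

    #include <proc.h>                /* DSPLink GPP-side PROC API */
    #include <notify.h>              /* DSPLink GPP-side NOTIFY API */
    #include <cmem.h>                /* TI CMEM contiguous allocator */

    #define DSP_ID    0
    #define IPS_ID    0
    #define EVT_START 5              /* placeholder event numbers */
    #define EVT_DONE  6

    static volatile int dspDone = 0;

    /* Fires when the DSP signals that the calculation is complete. */
    static Void doneCb(Uint32 eventNo, Pvoid arg, Pvoid info)
    {
        dspDone = 1;
    }

    int run_dsp_fft(void)
    {
        CMEM_AllocParams params = CMEM_DEFAULTPARAMS;
        void *shared;

        PROC_setup(NULL);
        PROC_attach(DSP_ID, NULL);
        PROC_load(DSP_ID, "dsp_fft.out", 0, NULL);   /* placeholder image */
        PROC_start(DSP_ID);
        NOTIFY_register(DSP_ID, IPS_ID, EVT_DONE, (FnNotifyCbck)doneCb, NULL);

        /* Contiguous ARM/DSP shared buffer: input + output + twiddles. */
        CMEM_init();
        shared = CMEM_alloc(384 * 1024, &params);

        /* ... fill input and twiddles, write back the ARM cache ... */

        NOTIFY_notify(DSP_ID, IPS_ID, EVT_START, (Uint32)CMEM_getPhys(shared));
        while (!dspDone)
            ;                        /* spin (or sleep) until the DSP replies */

        /* ... invalidate the ARM cache over the output, consume results ... */
        return 0;
    }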

    The clock cycles quoted in the DSPLIB doc assume all data is located in internal memory; for a 32K-point, 16-bit FFT, that is not possible. Incidentally, I do not see numbers for a 64K-point data set in the DSPLIB doc; are you extrapolating from the highest shown, 16K? For my case of 32K points, 384 KB of data space is required: 32768 complex points at 4 bytes each is 128 KB per buffer, and input, output, and twiddle factors each need their own buffer since in-place calculation is not allowed per the doc. The DSP cache is configured for the maximum available (L1 data = 32 KB, L2 data = 64 KB). I was able to get the DSP version of the 32K-point FFT down to 29.3 ms by using the "cache optimized" version of the DSPLIB FFT (DSP_fft16x16r(), as per the example in the doc), and by catching an error in the cache configuration of the ARM/DSP shared memory region (the appropriate MAR bits were not set; see the sketch below).
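
    For anyone who hits the same cache-configuration problem: the fix amounts to marking the external shared region cacheable from the DSP side. With DSP/BIOS 5 this can be done through the BCACHE module, roughly as below (the base address, length, and buffer names are placeholders for wherever your shared region lives; each MAR bit covers a 16 MB page):

    #include <bcache.h>          /* DSP/BIOS 5 cache control module */

    /* Make the external ARM/DSP shared region cacheable on the DSP. */
    BCACHE_setMar((Ptr)0x87000000, 0x01000000, 1);

    /* Keep caches coherent around each transfer (buffers hypothetical): */
    BCACHE_inv((Ptr)inBuf, inSize, TRUE);     /* invalidate before reading */
    BCACHE_wb((Ptr)outBuf, outSize, TRUE);    /* write back after writing  */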

    It is still odd that I seem to get better performance on the ARM side, since it is also using external RAM for all data. Several things to note: the ARM has 256 KB of L2 cache, the ARM is clocked higher, and the ARM is using floating point data (I believe Neon is more heavily optimized for floats, even though it can handle both floating and fixed point). Attached is a test project with the Neon-optimized FFT from FFmpeg. The project uses scons for building; to build for a Gumstix dev board: "scons -f GSconstruct".

    -Rick

     

    FFMPEG_FFT.zip
  • Also, my method of timing analysis is to set a GPIO line high just before the call to the FFT function and low immediately after the function returns; the debug macros are sketched below.
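
    For completeness, the debug macros are along these lines (illustrative only; the GPIO bank base and pin are placeholders for the board wiring, with the set/clear register offsets taken from the OMAP35xx TRM):

    #define GPIO5_BASE        0x49056000u   /* OMAP35xx GPIO5 bank */
    #define GPIO_SETDATAOUT   0x94u
    #define GPIO_CLEARDATAOUT 0x90u
    #define DBG_PIN           (1u << 10)    /* placeholder pin in the bank */

    #define GPIO_DBG_HI() \
        (*(volatile unsigned int *)(GPIO5_BASE + GPIO_SETDATAOUT) = DBG_PIN)
    #define GPIO_DBG_LO() \
        (*(volatile unsigned int *)(GPIO5_BASE + GPIO_CLEARDATAOUT) = DBG_PIN)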

    -Rick

     

  • Rick,

    We are looking into this issue. Once we replicate it, we will be able to profile the application and let you know the cause of this occurrence. As you can see from the benchmark I reported earlier, the DSP executes the FFT in about 400 µs; however, when you call a function on the DSP using DSPLink, there is overhead from cache invalidations as well as address translation, which is probably the source of the extra time you are seeing.

    We will take a look at this and see if we can help you achieve performance better than the 6.9 ms you are expecting.

    Regards,

    Rahul

  • Rahul

    For the OMAP3530, the MMU mapping is one-to-one, physical to virtual. What overhead would the address translation cause? The cache calls are also minimal and do not explain numbers in the millisecond range.

    For Notify, the profiling numbers are in the microsecond range, roughly 200 to 400 µs, not in the millisecond range.

    Could there be some delay in clearing the previous event on the ARM core? There are cache calls in a loop in the DSP ISR if the ARM core has not cleared the previous event. This could happen if the application on the ARM is busy with something else, preventing the previous event from being cleared and thereby increasing the latency.

    Deepali

  • I've observed roughly the same timing for Notify.  To test this, a task on the DSP side waits for a Notify from the ARM (a semaphore pend, posted from the DSP Notify callback ISR), then the DSP immediately responds with a Notify back to the ARM.  Timing was measured from immediately before the ARM sends its Notify to immediately after the ARM receives the Notify response from the DSP.  The round trip takes ~300-350 µs, so about half of that per Notify; a sketch of the DSP side follows.
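
    Roughly, the DSP side of that ping-pong test looks like the sketch below (assuming the DSPLink 1.63 DSP-side Notify API and the DSP/BIOS SEM module; event numbers are placeholders, and pingSem is created elsewhere with SEM_create(0, NULL)):

    #include <sem.h>                    /* DSP/BIOS semaphores */
    #include <sys.h>                    /* SYS_FOREVER */
    #include <notify.h>                 /* DSPLink DSP-side Notify */

    #define IPS_ID   0
    #define EVT_PING 5                  /* placeholder event numbers */
    #define EVT_PONG 6

    static SEM_Handle pingSem;          /* created during init */

    /* ISR-context callback: the ARM's Notify has arrived. */
    static Void pingCb(Uint32 eventNo, Ptr arg, Ptr info)
    {
        SEM_post(pingSem);              /* wake the waiting task */
    }

    Void pingPongTask(Void)
    {
        NOTIFY_register(ID_GPP, IPS_ID, EVT_PING, (FnNotifyCbck)pingCb, NULL);
        for (;;) {
            SEM_pend(pingSem, SYS_FOREVER);              /* wait for ARM */
            NOTIFY_notify(ID_GPP, IPS_ID, EVT_PONG, 0);  /* reply at once */
        }
    }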

    The FFT timing reported is as observed on the DSP (code excerpt below, where N = 32K).

    GPIO_DBG_HI();  /* timing marker: line high */
    /* First pass: run the early stages of the full N-point FFT, stopping at
       sub-blocks of N/4 so the working set stays cache-resident. */
    DSP_fft16x16r( N,   &x1_16x16[0],       &w_16x16[0],       y1_16x16, N/4,   0,     N );
    /* Remaining passes: four independent N/4-point FFTs, one per quarter of
       the data, all sharing the sub-twiddle table starting at w[2*3*N/4]. */
    DSP_fft16x16r( N/4, &x1_16x16[0],       &w_16x16[2*3*N/4], y1_16x16, RADIX, 0,     N );
    DSP_fft16x16r( N/4, &x1_16x16[2*N/4],   &w_16x16[2*3*N/4], y1_16x16, RADIX, N/4,   N );
    DSP_fft16x16r( N/4, &x1_16x16[2*N/2],   &w_16x16[2*3*N/4], y1_16x16, RADIX, N/2,   N );
    DSP_fft16x16r( N/4, &x1_16x16[2*3*N/4], &w_16x16[2*3*N/4], y1_16x16, RADIX, 3*N/4, N );
    GPIO_DBG_LO();  /* timing marker: line low */

    The GPIO_DBG_*() macros toggle a GPIO line directly from the DSP code, observable on a scope, so the observed time must come from calculations related to the FFT, data memory accesses, or program memory accesses.

     

  • Rick, 

    In your study of Neon vs DSP, have you compared these benchmarks:

    - Dhrystone

    - Whetstone

    - BDTI

    - FFTW

    Rgds, Sandeep

  • Rick Rogers said:

    [...] my overall setup for running DSP code on the OMAP3530 is as follows:

    • DSPLink v1.63 used for general DSP control, such as code loading, processor start, etc.
      • Minimal use of DSPLink functionality, to minimize ARM/DSP IPC overhead
        • Notify API called from the ARM to signal the DSP to start the calculation
        • Notify API called from the DSP to signal the ARM that the calculation is complete
        • No other DSPLink facilities (RingIO, MSGQ, etc.) used at runtime
    • CMEM module used on the ARM side to allocate contiguous memory (outside the kernel-aware region) as ARM/DSP shared memory

    Could you provide the code you mentioned above? It looks like I'm stuck in my bi-directional Notify program with my own memory allocation...