XTCIEVMK2LX: Performance of FFTLIB

Part Number: XTCIEVMK2LX

Hello, experts.

I have been testing FFTLIB and am trying to run an FFT over a very large input, 20M points, with the 1D single-precision (SP) real-to-complex (R2C) FFT kernel (fft_sp_1d_r2c.c).

I based my test code on the kernel's example project and ran the FFT in direct mode, because the input size is limited in EDMA mode.

It takes too long: over 4 seconds to create the FFT plan and about 3.8 seconds to execute it.

The times were measured with the TSCL/TSCH time-stamp counter registers. They give the number of elapsed cycles, so I converted cycles to seconds by dividing by the clock speed, 1 GHz.
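As a minimal sketch of that conversion, the cycle count can be kept in 64 bits before dividing by the clock rate. The helper names `combine_tsc` and `cycles_to_seconds` are hypothetical; `combine_tsc` is only a host-side stand-in for `_itoll(TSCH, TSCL)` so the arithmetic can be checked off-target. Note the assumption behind the 64-bit type: if `clock_t` is only 32 bits wide on the target (the `%u` format in the test program suggests this, but it is not confirmed here), a stored count wraps after 2^32 cycles, about 4.29 s at 1 GHz.

```c
#include <assert.h>
#include <stdint.h>

/* Join the high and low 32-bit halves of a time-stamp counter into one
 * 64-bit cycle count (hypothetical stand-in for _itoll(TSCH, TSCL)). */
uint64_t combine_tsc(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}

/* Convert a cycle count to seconds for a core clock of clock_hz. */
double cycles_to_seconds(uint64_t cycles, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)cycles / clock_hz;
}
```

For example, `cycles_to_seconds(combine_tsc(0, 3755507597u), 1e9)` gives roughly 3.76 s.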

I found that with a power-of-2 input size larger than 20M, 2^25, creating the FFT plan takes less time than with the 20M input size (~1 s). The execution time is similar, though (~3.8 s).
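A minimal sketch of how such a power-of-2 size could be derived from the signal length. `next_pow2` is a hypothetical helper, not part of FFTLIB. Zero-padding the 20M samples up to that size is an assumption about how it would be used, and it changes the frequency-bin spacing of the result, so it is not a drop-in replacement for a true 20M-point transform.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two that is >= n (n must be in the range 1..2^63). */
uint64_t next_pow2(uint64_t n)
{
    uint64_t p = 1;
    assert(n >= 1 && n <= (UINT64_C(1) << 63));
    while (p < n) {
        p <<= 1;   /* double until we reach or pass n */
    }
    return p;
}
```

For the 20M-point case, `next_pow2(20000000)` returns 33554432, which is the 2^25 size used in the second measurement.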

I'm wondering whether this is the best performance I can expect from my DSP board, which has C66x DSP cores. When I run the FFT with Python and SciPy's fftpack on my laptop, it takes less than 10 us. My laptop has a 2.53 GHz CPU, and I know it is much faster than the DSP with its 1 GHz clock, but the gap seems too big considering that the DSP chip is specialized for signal processing. I guess it would be faster if I used all 4 DSP cores on the board, but it would not be 4 times faster because of the limits of parallel processing. Even if it were 4x faster, that would still not be fast enough for my purpose (the total elapsed time must be less than 1 s).
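One way to sanity-check both timings is to compare them against a nominal operation count. The sketch below uses the common benchmarking convention of 2.5 * N * log2(N) floating-point operations for a real-input FFT. This is only a convention for comparing results, not the actual instruction count of FFTLIB or SciPy, and the function names are hypothetical. It is restricted to power-of-2 sizes to keep the logarithm exact.

```c
#include <assert.h>
#include <stdint.h>

/* Nominal operation count for a real-input FFT of power-of-two length n,
 * using the benchmarking convention 2.5 * n * log2(n). */
double real_fft_nominal_flops(uint64_t n)
{
    unsigned log2n = 0;
    assert(n >= 2 && (n & (n - 1)) == 0);   /* power of two only */
    while ((UINT64_C(1) << log2n) < n) {
        log2n++;
    }
    return 2.5 * (double)n * (double)log2n;
}

/* Throughput (in GFLOP/s) implied by finishing that nominal work
 * in the given number of seconds. */
double implied_gflops(uint64_t n, double seconds)
{
    assert(seconds > 0.0);
    return real_fft_nominal_flops(n) / seconds / 1e9;
}
```

For N = 2^25 the nominal count is about 2.1e9 operations. An execution time of 3.71 s then corresponds to roughly 0.57 GFLOP/s, while 10 us would correspond to roughly 2.1e5 GFLOP/s. That second figure is far beyond what a laptop CPU can deliver, so the laptop measurement may not be timing a full transform of the same size.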

Am I missing something? Could it be much faster? The following is a code snippet from the test program:

...

#pragma DATA_SECTION(x_i,  ".ddr_mem");
#pragma DATA_SECTION(y_i,  ".ddr_mem");
#pragma DATA_SECTION(w_i,  ".ddr_mem");

#pragma DATA_ALIGN(x_i,  8);
#pragma DATA_ALIGN(w_i,  8);
#pragma DATA_ALIGN(y_i,  8);

#define MAXN  (2048*2048*8)
#define M     (2*MAXN)
#define PAD   (0)
#define TEST_SIZE	(2048*2048*8)

float x_i [M + 2 * PAD];
float y_i [M + 2 * PAD];
float w_i [4*2048 + 2 * PAD];

...

float const CLOCK_SPEED = 1e9;

int main ()
{
    int     j, N = TEST_SIZE;
    clock_t t_start, t_stop, t_overhead, t_opt, t_mset_x, t_mset_y, t_fill_x, t_cplan, t_tplan;
    float t_opt_s, t_total_s;
    fft_plan_t p;
    fft_callout_t plan_fxns;

    TSCL=0;TSCH=0;

    plan_fxns.memoryRequest   = fft_memory_request;
    plan_fxns.memoryRelease   = fft_memory_release;

    t_start = _itoll(TSCH, TSCL);
    memset (x_i,  0x55, sizeof (x_i) );
    t_stop  = _itoll(TSCH, TSCL);
    t_mset_x = t_stop - t_start;

    t_start = _itoll(TSCH, TSCL);
    memset (y_i,  0xA5, sizeof (y_i) );
    t_stop  = _itoll(TSCH, TSCL);
    t_mset_y = t_stop - t_start;

    t_start = _itoll(TSCH, TSCL);
    for (j = 0; j < N; j++) {
      x_i[PAD + j] = cos (2 * 3.1415 * 50 * j / (double) N);
    }
    t_stop  = _itoll(TSCH, TSCL);
    t_fill_x = t_stop - t_start;

    t_start = _itoll(TSCH, TSCL);
    t_stop  = _itoll(TSCH, TSCL);
    t_overhead = t_stop - t_start;

    plan_fxns.ecpyRequest = NULL;
    plan_fxns.ecpyRelease = NULL;

    t_start = _itoll(TSCH, TSCL);
    p = fft_sp_plan_1d_r2c (N, FFT_DIRECT, plan_fxns);
    t_stop  = _itoll(TSCH, TSCL);
    t_cplan = t_stop - t_start;

    t_start = _itoll(TSCH, TSCL);
    fft_execute (p);
    t_stop = _itoll(TSCH, TSCL);
    t_opt  = (t_stop - t_start) - t_overhead;
    t_opt_s = t_opt / CLOCK_SPEED;

    t_start = _itoll(TSCH, TSCL);
    fft_destroy_plan (p);
    t_stop  = _itoll(TSCH, TSCL);
    t_tplan = t_stop - t_start;
        
    printf("fft_sp_1d_r2c_direct\tsize= %d\n", N);
    printf("\tN = %d\tCycle: %u (%.2fs)\n", N, t_opt, t_opt_s);
    printf("\tt_mset_x: %u (%.2fs)\n", t_mset_x, t_mset_x / CLOCK_SPEED);
    printf("\tt_mset_y: %u (%.2fs)\n", t_mset_y, t_mset_y / CLOCK_SPEED);
    printf("\tt_fill_x: %u (%.2fs)\n", t_fill_x, t_fill_x / CLOCK_SPEED);
    printf("\tt_cplan: %u (%.2fs)\n", t_cplan, t_cplan / CLOCK_SPEED);
    printf("\tt_tplan: %u (%.2fs)\n", t_tplan, t_tplan / CLOCK_SPEED);

    t_total_s = t_opt / CLOCK_SPEED
              + t_mset_x / CLOCK_SPEED
              + t_mset_y / CLOCK_SPEED
              + t_fill_x / CLOCK_SPEED
              + t_cplan / CLOCK_SPEED
              + t_tplan / CLOCK_SPEED;

    printf("\tTotal: %fs\n\n", t_total_s);
}

...

The outputs are as follows.

- With 20M-point input:

 fft_sp_1d_r2c_direct size= 20000000
 N = 20000000 Cycle: 3755507597 (3.76s)
 t_mset_x: 43494258 (0.04s)
 t_mset_y: 43432856 (0.04s)
 t_fill_x: 2978519097 (2.98s)
 t_cplan: 4229220388 (4.23s)
 t_tplan: 288 (0.00s)
 Total: 11.050175s

- With 2^25-point input:

 fft_sp_1d_r2c_direct size= 33554432
 N = 33554432 Cycle: 3711867346 (3.71s)
 t_mset_x: 42884486 (0.04s)
 t_mset_y: 42929794 (0.04s)
 t_fill_x: 1658446567 (1.66s)
 t_cplan: 857928992 (0.86s)
 t_tplan: 288 (0.00s)
 Total: 6.314057s

Thank you!