2D floating-point FFT with C6A8168 and c6accel: memory is slow

Patrick Emer

Hi,

I have implemented a 2D FFT on the first development board, using c6accel and floating point. The 2D FFT runs exclusively on the DSP side of the link.

I have now noticed that the first dimension of the FFT runs quite fast as long as it always works on the same memory.

For example:

512 x 512 FFT, input buffer = output buffer, with every transform run on the same 512-point single-precision floating-point buffer: 8192 us

If I advance the pointer so that each transform works on new memory, the result is as follows: ~13800 us

These figures show that memory access is expensive.

If I transpose the image for the second dimension and run both FFT passes for the two dimensions, it takes ~58000 us.

I know that I am thrashing the cache lines in the DSP L1 cache.

I think the memory access is too slow.

Does anybody know a solution for this problem?

Or is it a hardware bug?

My environment: ezsdk 05_01_01_80

regards

Patrick

over 14 years ago

Rahul Prabhu over 14 years ago

TI Guru 116170 points

Patrick,

Are you calling the floating point FFT function in C6accel using the unit server we have provided in the package, or have you created your own setup for calling the function over SYSLINK instead of the codec engine? In case you have created your own setup using SYSLINK, I suspect the cache is not being enabled.

Using the unit server we have provided in the package, the 8192 pt floating point FFT was timed at 1.586 ms on C6A8168. You can replicate this by running the c6accel test application included in the package under soc/app. You can see the BIOS configuration of the unit server under $(C6accel_install_dir)/soc/packages/ti/c6accel_unitservers/TI816x/bios6.cfg to check the cache configuration.

Regards,

Rahul

Patrick Emer over 14 years ago in reply to Rahul Prabhu

Prodigy 110 points

Dear Rahul,

thank you for your answer.

I forgot to tell you that I am already using the current version of c6accel from the ezsdk.

I will be back in the office on Wednesday and will then give you a detailed answer and implementation information.

regards

Patrick

Patrick Emer over 14 years ago in reply to Rahul Prabhu

Prodigy 110 points

Dear Rahul,

the implementation of the single-precision float FFT test in c6accel is not ideal: the SP FFT algorithm is an in-place algorithm, but the test does not use the same memory for InBuf and OutBuf.

You get better performance if you use the same memory for both.

Please test:

in file: (c6accel_2_01_00_08) /soc/app/c6accel_testfxns.c

at line 6373:

replace:

inX = (float *)pSrcBuf_16bpp;

with

inX = (float *)pOutBuf_16bpp;

But this is not my problem.

My problem is that I am using the current ezsdk and c6accel, and memory accesses are not very fast.

My example code, in file: (c6accel_2_01_00_08)/dsp/alg/src/C6accel_ti_dspFunctionCall.c

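/* Gather two adjacent columns of a row-major image of complex
   single-precision floats (re, im pairs), e.g. 512 * 2 floats per row.
   size = complex elements per row; each loop pass handles two rows. */

#include <stdint.h>

void copy_8_df(float *dst_1, float *dst_2, float *src, int n, const int size)
{
    const int size_2 = size * 2 * 2;   /* floats in two rows */

    do {
        /* each 64-bit load fetches one complex value (re, im) */
        register int64_t x1 = ((int64_t *) src)[0];          /* row r,     column c     */
        register int64_t x2 = ((int64_t *) src)[1];          /* row r,     column c + 1 */
        register int64_t x3 = ((int64_t *) src)[size];       /* row r + 1, column c     */
        register int64_t x4 = ((int64_t *) src)[1 + size];   /* row r + 1, column c + 1 */

        src += size_2;                 /* advance two rows */

        ((int64_t *) dst_1)[0] = x1;
        ((int64_t *) dst_2)[0] = x2;
        ((int64_t *) dst_1)[1] = x3;
        ((int64_t *) dst_2)[1] = x4;

        dst_1 += 4;                    /* two complex values written per column */
        dst_2 += 4;

    } while (--n > 0);
    /* n = 256 = 512 * 2 / 4 */
}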

/* Paste into the switch case */

    dsp_calc_p *C6ACCEL_TI_IMG_dsp_calc_pparamPtr;
    C6ACCEL_TI_IMG_dsp_calc_pparamPtr = pFnArray;
    /* Parameter check */
    if (((C6ACCEL_TI_IMG_dsp_calc_pparamPtr->iptr) > INBUF15)     ||
        ((C6ACCEL_TI_IMG_dsp_calc_pparamPtr->ptr_brev) > INBUF15) ||
        ((C6ACCEL_TI_IMG_dsp_calc_pparamPtr->tw_w) > INBUF15)     ||
        ((C6ACCEL_TI_IMG_dsp_calc_pparamPtr->optr) > OUTBUF15)    ||
        ((C6ACCEL_TI_IMG_dsp_calc_pparamPtr->height) <= 0)        ||
        ((C6ACCEL_TI_IMG_dsp_calc_pparamPtr->radix) <= 0)         ||
        ((C6ACCEL_TI_IMG_dsp_calc_pparamPtr->width) <= 0)) {
        return (IUNIVERSAL_EPARAMFAIL);
    }
    else
    {   /* Use row to loop through image line by line using existing DSP API */
        int r;
        float *pInFloat = (float *)inBufs->descs[C6ACCEL_TI_IMG_dsp_calc_pparamPtr->iptr].buf;
        float *pOutFloat = (float *)outBufs->descs[C6ACCEL_TI_IMG_dsp_calc_pparamPtr->optr].buf;
        float *tw_w = (float *)inBufs->descs[C6ACCEL_TI_IMG_dsp_calc_pparamPtr->tw_w].buf;
        float *tw_h = (float *)inBufs->descs[C6ACCEL_TI_IMG_dsp_calc_pparamPtr->tw_h].buf;
        unsigned char *brev = (unsigned char *)inBufs->descs[C6ACCEL_TI_IMG_dsp_calc_pparamPtr->ptr_brev].buf;
        int radix = C6ACCEL_TI_IMG_dsp_calc_pparamPtr->radix;
        int offset = C6ACCEL_TI_IMG_dsp_calc_pparamPtr->offset;
        int width = C6ACCEL_TI_IMG_dsp_calc_pparamPtr->width;
        int height = C6ACCEL_TI_IMG_dsp_calc_pparamPtr->height;

        /* First dimension: in-place FFT of every row */
        for (r = 0; r < height; r++)
        {
            DSPF_sp_fftSPxSP(width,
                             pInFloat + r * width * 2,
                             tw_w,
                             pInFloat + r * width * 2,
                             brev,
                             radix,
                             offset,
                             width);
        }

        int count = width * 2 / 4;   /* loop passes of copy_8_df, two rows each */

        /* Scratch space for two columns; sized for width == height */
        float *outbuf;
        outbuf = memalign(16, width * sizeof(float) * 2 * 2);

        /* Second dimension: gather two columns, transform both in place */
        for (r = 0; r < width; r += 2)
        {
            /* start at column r (2 floats per complex value) */
            copy_8_df(outbuf, &outbuf[width * 2], pInFloat + r * 2, count, width);

            DSPF_sp_fftSPxSP(height,
                             outbuf,
                             tw_h,
                             outbuf,
                             brev,
                             radix,
                             offset,
                             height);

            DSPF_sp_fftSPxSP(height,
                             &outbuf[width * 2],
                             tw_h,
                             &outbuf[width * 2],
                             brev,
                             radix,
                             offset,
                             height);

            /* columns r and r + 1 are stored as consecutive rows (transposed result) */
            memcpy(pOutFloat + r * width * 2, outbuf, width * sizeof(float) * 2 * 2);
        }

        free(outbuf);
    }

    Memory_cacheWbInv(inBufs->descs[C6ACCEL_TI_IMG_dsp_calc_pparamPtr->iptr].buf, inBufs->descs[C6ACCEL_TI_IMG_dsp_calc_pparamPtr->iptr].bufSize);

My questions are:

Is there a hardware bug, given that the memory copy is so slow? (Please note: if the second FFT dimension is copied with two nested for loops, the overall result is also slow.)

Is there a way to rotate (transpose) the image after the first FFT dimension by using a special library, a function, or a special EDMA setting?

What is the best setting for this function in bios6.cfg?

Thanks in advance for an answer.

regards

Patrick

P.S. I updated the timings in the first post.
