J722SXH01EVM: TDA4 poor FFTLIB performance

Part Number: J722SXH01EVM
Other Parts Discussed in Thread: FFTLIB, SYSCONFIG

Hello,
I have built version 9.02 of FFTLIB to test the performance of our J722SXH01EVM reference board. I am running 16 1024-point FFTs and trying to match the documented performance of 31,650 cycles for a 1024x16 FFT. I am using the FFTLIB_fft1dBatched_i32fc_c32fc_o32fc function to test this, but it takes 568,441 cycles to run the batch. That is roughly 18x slower than expected, which is close to the batch size of 16, so I suspect that batching or vectorization is not working. The documentation is minimal and the code is very hard to read, so I'm finding it difficult to debug this issue.

The code snippet I am running is here:

        curTime = ClockP_getTimeUsec(); /* get time as measured by timer associated with ClockP module */

        
        #define NUM_POINTS 1024
        #define NUM_CHANNELS 16


        uint8_t pBlock[FFTLIB_FFT1DBATCHED_I32FC_C32FC_O32FC_PBLOCK_SIZE];
    	FFTLIB_F32* pX = (FFTLIB_F32 *) memalign (128, NUM_POINTS*NUM_CHANNELS*2 * sizeof (FFTLIB_F32));
    	FFTLIB_F32* pW = (FFTLIB_F32 *) memalign (128, NUM_POINTS*2 * sizeof (FFTLIB_F32));
    	FFTLIB_F32* pY = (FFTLIB_F32 *) memalign (128, NUM_POINTS*NUM_CHANNELS*2 * sizeof (FFTLIB_F32));
    	FFTLIB_bufParams1D_t bufParamsData;
    	FFTLIB_bufParams1D_t bufParamsTw;
    	uint32_t numPoints = NUM_POINTS;
    	uint32_t numChannels = NUM_CHANNELS;
    	
    	bufParamsData.data_type = FFTLIB_FLOAT32;
    	bufParamsData.dim_x = NUM_POINTS*NUM_CHANNELS*2;
    	bufParamsTw.data_type = FFTLIB_FLOAT32;
    	bufParamsTw.dim_x = NUM_POINTS*2;
    	
    	tw_gen_f32(pW, numPoints); // Generate twiddle factors
    	
    	
    	FFTLIB_STATUS status = FFTLIB_fft1dBatched_i32fc_c32fc_o32fc_checkParams(pX, &bufParamsData, pW, &bufParamsTw, pY, &bufParamsData, numPoints, numChannels, pBlock);
    	uint64_t checkTime = ClockP_getTimeUsec();
    	DebugP_log("FFTLIB_STATUS = %d TIME = %d usecs\r\n", status, (uint32_t)(checkTime-curTime));
    	checkTime = ClockP_getTimeUsec();
    	
    	int i = 0;
    	status = FFTLIB_fft1dBatched_i32fc_c32fc_o32fc_init(pX, &bufParamsData, pW, &bufParamsTw, pY, &bufParamsData, numPoints, numChannels, pBlock);
    	status = FFTLIB_fft1dBatched_i32fc_c32fc_o32fc_kernel(pX, &bufParamsData, pW, &bufParamsTw, pY, &bufParamsData, numPoints, numChannels, pBlock);
    	
    	uint64_t fftTime = ClockP_getTimeUsec();
    	DebugP_log("FFTLIB_STATUS = %d TIME = %d usecs\r\n", status, (uint32_t)(fftTime-checkTime));
    	
    	

        curTime = ClockP_getTimeUsec() - curTime; /* elapsed time; ClockP returns a 64-bit value, so the subtraction won't overflow */

        DebugP_log("FFT FLOAT32 ... DONE (Measured time = %d usecs) !!!\r\n",
            (uint32_t)curTime);
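The snippet above reports elapsed microseconds, while the documented figure is in cycles, so the two need to be put on the same scale before comparing. A minimal sketch of that arithmetic follows. The 1000 MHz clock is only a placeholder assumption (the thread does not state the C7x core frequency), and `usec_to_cycles` and `slowdown_x100` are hypothetical helper names, not part of FFTLIB or the SDK.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMPTION: placeholder core clock in MHz; substitute the frequency
 * your C7x core actually runs at. Not taken from the thread. */
#define ASSUMED_CORE_CLOCK_MHZ 1000u

/* Convert elapsed microseconds to core cycles: cycles = usec * MHz. */
static uint64_t usec_to_cycles(uint64_t usec, uint32_t clock_mhz)
{
    assert(clock_mhz != 0u);
    return usec * (uint64_t)clock_mhz;
}

/* Slowdown factor scaled by 100 to stay in integer math
 * (for example, a return value of 250 means 2.50x slower). */
static uint64_t slowdown_x100(uint64_t measured_cycles, uint64_t reference_cycles)
{
    assert(reference_cycles != 0u);
    return (measured_cycles * 100u) / reference_cycles;
}
```

With the figures quoted in the question, `slowdown_x100(568441, 31650)` gives the slowdown relative to the documented 31,650 cycles. Note that the timed region in the snippet also covers the `_init` call, not only the `_kernel` call.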

  • It also occurs to me that I don't know whether the L2 cache is enabled. It is marked as "unused" in the linker .map file, but I don't know how to enable it. How can I check or enable the L2 cache? I can't find any documentation on it. I am using one of the examples from the MCU+ SDK as a baseline project, running the C7524 on its own.

    I have also tried linking such that the heap is contained in the L2 SRAM but nothing works if I do that.

  • Hi Tyler,

    FFTLIB is currently not supported for J722S and is not part of the MCU+ SDK package for the 9.2 SDK, so I cannot guarantee any performance for the platform. However, I can give you a few pointers in this area:

    I would expect you to be able to achieve cycle counts in the ballpark of Sitara's AM62A device, which uses a similar C75x variant. A few things to point out: the numbers given in the datasheet profile the optimized kernel execution function alone, meaning just the FFTLIB_fft1dBatched_i32fc_c32fc_o32fc_kernel() function. You can see this in the driver files, which are the _d files within the test/ folder for each kernel. From the numbers you have provided, I suspect that the input buffers allocated with memalign (lines 9-11 of your snippet) are not in L2SRAM and may instead be in DDR, which would explain the significantly higher cycle count. You should verify the address space in which these buffers are allocated, as well as your linker file.
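One way to verify where a buffer landed is to compare its address against the memory region it should be in. The sketch below assumes the L2SRAM origin and length from the linker fragment quoted later in this thread (0x7E000000, 0x200000); `buffer_in_region` and `require_in_l2sram` are hypothetical helper names, not FFTLIB or SDK functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: region taken from the linker fragment quoted in this thread;
 * confirm against your own linker command file. */
#define L2SRAM_BASE 0x7E000000ull
#define L2SRAM_LEN  0x200000ull

/* True if [addr, addr + size) lies entirely inside [base, base + len). */
static bool buffer_in_region(uint64_t addr, uint64_t size,
                             uint64_t base, uint64_t len)
{
    if (size > len) {
        return false;
    }
    return addr >= base && (addr - base) <= (len - size);
}

/* On target, call this right after allocating a buffer; it traps if the
 * buffer is not fully inside the assumed L2SRAM range. */
static void require_in_l2sram(const void *p, size_t size)
{
    assert(buffer_in_region((uint64_t)(uintptr_t)p, (uint64_t)size,
                            L2SRAM_BASE, L2SRAM_LEN));
}
```

For example, `require_in_l2sram(pX, NUM_POINTS * NUM_CHANNELS * 2 * sizeof(FFTLIB_F32))` placed after the `memalign` call would flag a buffer that ended up in DDR. Logging the pointer value with `DebugP_log` and comparing it against the .map file gives the same information.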

    Regarding the L2 memory: there is no L2 cache on the C7524 or, to my understanding, on any current C75x variant, which is why the TRM does not mention one. References to an L2 cache pertain to C71x (the C7x variant on other Jacinto devices), which has a configurable L2 cache.

    Best,

    Asha

  • Thanks for the clarification. I'm now trying to allocate the buffers in SRAM, but I'm running into an issue.

    I'm allocating the buffer like this:

    __attribute__ ((section (".l2mem"), aligned (64)))
    uint8_t FFTLIB_fft1dBatched_i32fc_c32fc_o32fc_pBlock
        [FFTLIB_FFT1DBATCHED_I32FC_C32FC_O32FC_PBLOCK_SIZE];

    and I've got SRAM in the linker file like this:

    MEMORY
    {
        L2SRAM (RWX):  org = 0x7E000000,                len = 0x200000
        ...
    }
    
    SECTIONS
    {
        ...
        .l2mem      >       L2SRAM
        ...
    
    }

    It links successfully, but I get an exception when I access the buffer. Do I need to enable the memory somehow? I haven't been able to find material on this - if you can point me in the right direction I'd appreciate it.

  • Hi Tyler,

    The L2SRAM address looks correct to me - if you suspect any issues with the linker file I would recommend looking at linker files given in MMALIB or vision_apps. 

    For allocating in L2SRAM, you can see how this is done in the driver files. The TI_memalign function is used to allocate the input buffers into L2SRAM.
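TI_memalign itself is not reproduced here. As a rough illustration of the general idea (handing out aligned buffers from a fixed region instead of the default heap), the following is a hypothetical bump allocator over a static pool. The names `pool`, `POOL_SIZE`, and `pool_memalign` are assumptions for this sketch; on target the pool array would additionally carry a section attribute so that the linker places it in L2SRAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: 512 KiB is an arbitrary pool size for illustration. */
#define POOL_SIZE (512u * 1024u)

/* Hypothetical pool; on target, add a section attribute (for example
 * the .l2mem section used elsewhere in this thread) to place it in L2SRAM. */
static _Alignas(128) uint8_t pool[POOL_SIZE];
static size_t pool_used = 0;

/* Return `size` bytes aligned to `align` (a power of two), or NULL if the
 * pool is exhausted. Memory is never freed individually. */
static void *pool_memalign(size_t align, size_t size)
{
    assert(align != 0u && (align & (align - 1u)) == 0u);

    uintptr_t base = (uintptr_t)pool;
    uintptr_t cur = base + pool_used;
    uintptr_t aligned = (cur + align - 1u) & ~((uintptr_t)align - 1u);
    size_t start = (size_t)(aligned - base);

    if (start > POOL_SIZE || size > POOL_SIZE - start) {
        return NULL;
    }
    pool_used = start + size;
    return &pool[start];
}
```

For the buffers in the original snippet, the two data buffers need 1024 * 16 * 2 * 4 = 131,072 bytes each and the twiddle buffer needs 1024 * 2 * 4 = 8,192 bytes, so all three fit in this pool and well within the 0x200000-byte L2SRAM region.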

    Best,

    Asha

  • The issue was that the SRAM was not mapped in the SysConfig file; I've got it working now. It's odd that the example had it set up in the linker file but not in SysConfig.