PROCESSOR-SDK-AM62X: Unexpectedly High FFT Execution Time on AM62x – Is FPU Enabled?

soheil jabb

Hi TI Team,

Following a suggestion from a TI agent, I removed the FFT-related part from my previous post (linked below) to keep that discussion focused:

https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1504886/processor-sdk-am62x-gpio-toggling-delay-inaccuracy-on-am62x-m4f-core/5785320#5785320

Now, I'm creating this separate post to ask specifically about the issue I'm facing with FFT processing performance.

I'm using the AM62x platform and running an FFT on 128 real double-precision samples. The function I call is fft_calculation_128(x, mag, phase, 3), and I measure its execution time with ClockP_getTimeUsec(). The measured time is consistently around 25 ms, which seems unusually high:

#include <stdio.h>
#include <math.h>
#include <drivers/soc/am62x/soc.h>
#include <kernel/dpl/DebugP.h>
#include <kernel/dpl/TimerP.h>
#include <kernel/dpl/ClockP.h>
#include <drivers/ipc_notify.h>
#include <drivers/ipc_rpmsg.h>
#include </drivers/mcspi/v0/cslr_mcspi.h>
#include "ti_drivers_config.h"
#include "ti_drivers_open_close.h"
#include "ti_board_open_close.h"
#include "ti_dpl_config.h"
#include "/ADE9000/ADE9000RegMap.h"
#include "/ADE9000/ADE9000.h"
#include "s_Pripherals_Config.h"
#include <kernel/dpl/HwiP.h>
#include "Calibration.h"
#include "/FFT/fft_opt.h"

ClockP_Params clockParams;

ClockP_Object clockObj;

void CS_GPIO_Init()
{
    M4F_GPIO->DIR &= ~(0x2000);
}

void Clock_Init()
{
    ClockP_Params_init(&clockParams);

    clockParams.timeout = ClockP_usecToTicks(1);

    clockParams.period = clockParams.timeout;

    clockParams.start = 1;
}

const double sine_half_cycle_LUT[128] = {
0.000000, 0.024541, 0.049068, 0.073565, 0.098017, 0.122411, 0.146730, 0.170962,
0.195090, 0.219101, 0.242980, 0.266713, 0.290285, 0.313682, 0.336890, 0.359895,
0.382683, 0.405241, 0.427555, 0.449611, 0.471397, 0.492898, 0.514103, 0.534998,
0.555570, 0.575808, 0.595699, 0.615232, 0.634393, 0.653173, 0.671559, 0.689541,
0.707107, 0.724247, 0.740951, 0.757209, 0.773010, 0.788346, 0.803208, 0.817585,
0.831470, 0.844854, 0.857729, 0.870087, 0.881921, 0.893224, 0.903989, 0.914210,
0.923880, 0.932993, 0.941544, 0.949528, 0.956940, 0.963776, 0.970031, 0.975702,
0.980785, 0.985278, 0.989177, 0.992480, 0.995185, 0.997290, 0.998795, 0.999699,
1.000000, 0.999699, 0.998795, 0.997290, 0.995185, 0.992480, 0.989177, 0.985278,
0.980785, 0.975702, 0.970031, 0.963776, 0.956940, 0.949528, 0.941544, 0.932993,
0.923880, 0.914210, 0.903989, 0.893224, 0.881921, 0.870087, 0.857729, 0.844854,
0.831470, 0.817585, 0.803208, 0.788346, 0.773010, 0.757209, 0.740951, 0.724247,
0.707107, 0.689541, 0.671559, 0.653173, 0.634393, 0.615232, 0.595699, 0.575808,
0.555570, 0.534998, 0.514103, 0.492898, 0.471397, 0.449611, 0.427555, 0.405241,
0.382683, 0.359895, 0.336890, 0.313682, 0.290285, 0.266713, 0.242980, 0.219101,
0.195090, 0.170962, 0.146730, 0.122411, 0.098017, 0.073565, 0.049068, 0.024541
};

void hello_world_main(void *args)
{
    double test = 1.0;

    const uint8_t harmonics = 3;

    /* Interleaved complex buffer: x[2*i] is the real part, x[2*i + 1] the imaginary part */
    double x[256];
    double mag[harmonics];
    double phase[harmonics];

    /* Open drivers to open the UART driver for console */
    Drivers_open();
    Board_driversOpen();

    Clock_Init();

    CS_GPIO_Init();

    Timer_Init();

    for(uint16_t i = 0; i < 128; i++)
    {
        x[i * 2] = sine_half_cycle_LUT[i];
        x[i * 2 + 1] = 0.0;
    }

    DebugP_log("test is %f\r\n", test);

    /* Keep the timestamps at full 64-bit width; a uint16_t would wrap every 65536 us */
    uint64_t strTime = ClockP_getTimeUsec();

    fft_calculation_128(x, mag, phase, 3);

    uint64_t endTime = ClockP_getTimeUsec();

    for(uint8_t i = 0; i < 128; i++)
    {
        DebugP_log("fft amounts of x[%d] is %f + %f j\r\n ", i, x[i * 2], x[i * 2 + 1]);

        ClockP_usleep(100);
    }

    DebugP_log("the time spend of fft calculation is %u us\r\n", (unsigned int)(endTime - strTime));

    while(1)
    {

    }

    /* Not reached because of the infinite loop above */
    Board_driversClose();
    Drivers_close();
}

fft amounts of x[0] is 81.483242 + 0.000000 j
fft amounts of x[1] is -27.166536 + 0.000000 j
fft amounts of x[2] is -5.436583 + 0.000000 j
fft amounts of x[3] is -2.332302 + 0.000000 j
fft amounts of x[4] is -1.297546 + 0.000000 j
fft amounts of x[5] is -0.827208 + 0.000000 j
...
the time spend of fft calculation is 25271 us
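
A note on the measurement itself: to my knowledge ClockP_getTimeUsec() in the MCU+ SDK returns a 64-bit microsecond count, so the width of the variables that hold the timestamps matters. A 16-bit variable wraps every 65536 us, which makes any reading ambiguous by a multiple of that amount. The sketch below is plain C with hypothetical helper names (not SDK functions) and made-up timestamps, and only illustrates the effect:

```c
#include <stdint.h>

/* Elapsed time computed from full-width 64-bit timestamps. */
static uint64_t elapsed_us_u64(uint64_t start_us, uint64_t end_us)
{
    return end_us - start_us;
}

/* Same calculation after both timestamps were truncated to 16 bits,
 * as happens when they are stored in uint16_t variables.
 * Any multiple of 65536 us in the true interval is lost. */
static uint16_t elapsed_us_u16(uint64_t start_us, uint64_t end_us)
{
    uint16_t s = (uint16_t)start_us;
    uint16_t e = (uint16_t)end_us;
    return (uint16_t)(e - s);
}
```

With these helpers, a hypothetical interval of 90807 us and one of 25271 us produce the same 16-bit result, so a 16-bit reading alone cannot tell them apart.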

This high latency is causing issues for real-time processing in my application.

I would like to ask:

1. Could this performance issue be related to the system clock configuration, or to the CPU frequency not being set properly?
2. Is the FPU (Floating Point Unit) enabled by default on AM62x? If not, how can I confirm that and enable it properly in my environment?
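
For reference on the second question: on ARMv7-M cores such as the Cortex-M4F, access to the FPU is controlled by the CP10 and CP11 fields (bits 20 to 23) of the CPACR register at address 0xE000ED88. Executing an FPU instruction while those fields are clear raises a UsageFault. Whether the SDK startup code sets them is not stated in this thread, so the following is only a sketch of how the register value could be checked. The helper name is hypothetical and not part of any SDK:

```c
#include <stdint.h>
#include <stdbool.h>

/* Coprocessor Access Control Register (CPACR) address on ARMv7-M. */
#define CPACR_ADDR            0xE000ED88u
/* CP10 and CP11 access fields occupy bits 20..23; 0b11 each means full access. */
#define CPACR_CP10_CP11_FULL  (0xFu << 20)

/* Returns true when both CP10 and CP11 are set to full access. */
static bool fpu_access_enabled(uint32_t cpacr_value)
{
    return (cpacr_value & CPACR_CP10_CP11_FULL) == CPACR_CP10_CP11_FULL;
}

/* On the target only (assumed usage, not taken from the SDK):
 *   uint32_t cpacr = *(volatile uint32_t *)CPACR_ADDR;
 *   DebugP_log("FPU access enabled: %d\r\n", fpu_access_enabled(cpacr));
 */
```

The decision logic is kept separate from the register read so that it can be checked on a host as well as on the M4F.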

I’d appreciate any guidance to improve the FFT performance on this platform.

Best regards,
Soheil

5 months ago

Tushar Thakur 4 months ago

TI__Mastermind 49558 points

Hi Soheil,

From which core are you doing the above FFT calculation?

Also, at what frequency is the core running?

Regards,

Tushar

soheil jabb 4 months ago in reply to Tushar Thakur

Prodigy 40 points

Hi Tushar,

Thanks for your support.
I am running the FFT calculation from the M4F core on the AM62x.
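
One point that bears on this: the FPU of the Cortex-M4F implements single precision only, so arithmetic on double values is carried out by software library routines rather than FPU instructions. Timing a float variant of the transform may therefore be a useful comparison. The sketch below is not the fft_calculation_128() used in this thread. It is a generic in-place radix-2 FFT for 128 interleaved complex float samples, with hypothetical names, and it builds its twiddle table by recurrence only to stay free of math library calls. On the target a precomputed const table would be the more likely choice.

```c
#include <stdint.h>

#define FFT_N 128u

/* Twiddle factors W^k = exp(-j*2*pi*k/128) for k = 0..63. */
static float tw_re[FFT_N / 2];
static float tw_im[FFT_N / 2];

/* Fills the twiddle table by repeated complex multiplication with W^1. */
static void fft128_f32_init(void)
{
    const float c = 0.99879545620517f; /* cos(2*pi/128) */
    const float s = 0.04906767432742f; /* sin(2*pi/128) */
    tw_re[0] = 1.0f;
    tw_im[0] = 0.0f;
    for (unsigned k = 1; k < FFT_N / 2; k++) {
        tw_re[k] = tw_re[k - 1] * c + tw_im[k - 1] * s;
        tw_im[k] = tw_im[k - 1] * c - tw_re[k - 1] * s;
    }
}

/* In-place forward FFT; x holds 128 complex samples as {re, im} pairs. */
static void fft128_f32(float *x)
{
    /* Bit-reversal permutation of the complex samples. */
    for (unsigned i = 1, j = 0; i < FFT_N; i++) {
        unsigned bit = FFT_N >> 1;
        for (; j & bit; bit >>= 1) {
            j ^= bit;
        }
        j ^= bit;
        if (i < j) {
            float tr = x[2 * i], ti = x[2 * i + 1];
            x[2 * i] = x[2 * j];
            x[2 * i + 1] = x[2 * j + 1];
            x[2 * j] = tr;
            x[2 * j + 1] = ti;
        }
    }

    /* Butterfly stages for sub-transform lengths 2, 4, ..., 128. */
    for (unsigned len = 2; len <= FFT_N; len <<= 1) {
        unsigned stride = FFT_N / len; /* step through the twiddle table */
        for (unsigned start = 0; start < FFT_N; start += len) {
            for (unsigned k = 0; k < len / 2; k++) {
                unsigned a = start + k;
                unsigned b = a + len / 2;
                float wr = tw_re[k * stride];
                float wi = tw_im[k * stride];
                float vr = x[2 * b] * wr - x[2 * b + 1] * wi;
                float vi = x[2 * b] * wi + x[2 * b + 1] * wr;
                x[2 * b] = x[2 * a] - vr;
                x[2 * b + 1] = x[2 * a + 1] - vi;
                x[2 * a] += vr;
                x[2 * a + 1] += vi;
            }
        }
    }
}
```

fft128_f32_init() has to be called once before the first transform. Single precision gives roughly 7 significant digits, so whether it is accurate enough depends on the application.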

I observed during my tests that the M4F core requires around 10 cycles just to perform a simple multiplication between two floating-point numbers. This seems very slow for such basic operations, and I believe one of the reasons my FFT code takes a long time is related to this.

Here is the small test code I used:

M4F_GPIO->DIR &= ~(0x2000);

float val_1 = 5.55556;
float val_2 = 6.35978;
float val_3 = 0.0;

while(1)
{
    M4F_GPIO->SET_DATA = 0x2000;

    // I repeat this line 100 times
    val_3 = val_1 * val_2;
    ...

    M4F_GPIO->CLR_DATA = 0x2000;

    // I repeat this line 100 times
    val_3 = val_1 * val_2;
    ...
}
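
A caveat about this kind of test: val_3 is never read and the operands never change, so an optimizing compiler is free to drop the repeated multiplications or compute the product once. In that case the GPIO timing no longer reflects 100 multiplications. One common way to keep them in the generated code is to make the variables volatile, as in the sketch below. The function name is hypothetical, and the volatile accesses add a load and a store per multiplication, so the measured cycle count includes that memory traffic and the loop overhead, not only the multiply.

```c
/* Operands and result are volatile so that each multiplication below is
 * actually performed: two loads, one multiply, one store per iteration. */
static volatile float val_1 = 5.55556f;
static volatile float val_2 = 6.35978f;
static volatile float val_3 = 0.0f;

/* Performs `count` single-precision multiplications and returns the last result. */
static float multiply_repeatedly(unsigned count)
{
    for (unsigned i = 0; i < count; i++) {
        val_3 = val_1 * val_2;
    }
    return val_3;
}
```

On the target, the call to multiply_repeatedly(100) would sit between the SET_DATA and CLR_DATA writes.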

Is there any way to reduce the number of cycles needed for these float operations?
Or is this slow behavior due to the memory type (e.g., internal/external memory) that the code is using?
Also, could you please confirm whether the FPU (Floating Point Unit) is enabled by default on the M4F core?

Thanks again for your help!

Best regards,
Soheil

Tushar Thakur 4 months ago in reply to soheil jabb

TI__Mastermind 49558 points

Hi Soheil,

Please allow me some time to check the above and get back to you.

Regards,

Tushar

Tushar Thakur 4 months ago in reply to Tushar Thakur

TI__Mastermind 49558 points

Hi Soheil,

Thanks for your patience.

Can you check how many assembly instructions it takes to execute the above multiplication?

For me, it takes only 4 assembly instructions. Please refer to the image below.
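
When comparing instruction counts, it may help to put the same multiplication in two small functions, one in float and one in double, and look at the disassembly of each. The function names below are hypothetical. The comments describe what a hardware-float build for the Cortex-M4F is generally expected to produce, and the exact output depends on the compiler and its options:

```c
/* Expected to compile to a single vmul.f32 followed by the return. */
float mul_f32(float a, float b)
{
    return a * b;
}

/* Expected to become a call to a software routine (__aeabi_dmul under the
 * ARM EABI), since there are no double-precision FPU instructions to use. */
double mul_f64(double a, double b)
{
    return a * b;
}
```

If the FFT code works on double values throughout, the second pattern is the one that applies to it.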

Regards,

Tushar
