I am trying to compute a 512-point FFT using CFFT_f32. To measure the execution time, I toggle a GPIO pin around the function call. The measured time comes out to around 4 ms, which is far higher than the figure given in the FPU library user guide (24k cycles). How can I improve the performance of my code?
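
To put the two numbers on the same scale, here is a rough sketch of the conversion between cycle counts and wall-clock time. The 200 MHz CPU clock is only an assumption for illustration (the actual device clock is not stated above), and the helper names `cycles_to_us` and `us_to_cycles` are hypothetical, not part of the FPU library.

```c
#include <stdio.h>

/* Assumed CPU clock for illustration only; replace with the real SYSCLK. */
#define ASSUMED_CPU_HZ 200000000.0

/* Convert a cycle count to microseconds at the given clock frequency. */
static double cycles_to_us(double cycles, double cpu_hz)
{
    return cycles * 1.0e6 / cpu_hz;
}

/* Convert a measured time in microseconds to cycles at the given clock. */
static double us_to_cycles(double microseconds, double cpu_hz)
{
    return microseconds * cpu_hz / 1.0e6;
}

/* Print the user-guide figure and the GPIO measurement side by side. */
static void print_comparison(void)
{
    double guide_us = cycles_to_us(24000.0, ASSUMED_CPU_HZ);
    double measured_cycles = us_to_cycles(4000.0, ASSUMED_CPU_HZ);

    printf("User guide: 24000 cycles = %.1f us at the assumed clock\n", guide_us);
    printf("Measured:   4000 us = %.0f cycles at the assumed clock\n", measured_cycles);
}
```

Under that assumed clock, 24k cycles would be about 120 µs, while 4 ms corresponds to about 800k cycles, so the gap is roughly a factor of 33. With a different clock the absolute times change, but the comparison is done the same way.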