Hi,
I may have discovered a bug that causes a cache coherence problem in fft_sp_2d_r2c and similar functions.
When N1 and N2 are small enough, such as 16 or 32, the first part of the input/output array may be unchanged after fft_execute() returns (that is, it still holds the same values as before the call).
I think the cause is that the input is not large enough to flush the in/out array out of the cache, and ecpy copies the transposed data directly to memory without invalidating the cache, so the CPU keeps reading the stale cached lines.
The problem can be solved by manually invalidating the L2 cache for the input/output buffer after calling fft_execute().
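A minimal sketch of that workaround, in case it helps. It only works out the range to invalidate, rounded out to whole cache lines, and passes it to a hook. The names invalidate_fn, invalidate_buffer and record_invalidate are hypothetical, and the 128-byte line size is an assumption that matches the C66x L2 cache; please check it for your device. On the target, the hook would wrap the platform's block-invalidate call (for example CACHE_invL2 from the TI CSL or Cache_inv from SYS/BIOS, with the signature taken from your SDK).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed L2 cache line size in bytes (128 on C66x; verify for your device). */
#define L2_LINE_BYTES 128u

/* Hypothetical hook: wraps whatever block-invalidate call the platform provides. */
typedef void (*invalidate_fn)(void *addr, size_t bytes);

/*
 * Round [buf, buf + bytes) out to whole cache lines and hand the range
 * to the hook. Returns the number of bytes passed to the hook.
 *
 * Caution: if buf is not line-aligned and padded to a whole number of
 * lines, the rounded range also covers neighbouring data, and any
 * unwritten (dirty) neighbouring data would be discarded.
 */
size_t invalidate_buffer(void *buf, size_t bytes, invalidate_fn invalidate)
{
    const uintptr_t mask = (uintptr_t)(L2_LINE_BYTES - 1u);
    uintptr_t start, end;

    assert(invalidate != NULL);
    if (bytes == 0) {
        return 0;
    }

    start = (uintptr_t)buf & ~mask;                 /* round start down */
    end = ((uintptr_t)buf + bytes + mask) & ~mask;  /* round end up */

    invalidate((void *)start, (size_t)(end - start));
    return (size_t)(end - start);
}

/* Host-side stand-in that only records the request, for trying this on a PC. */
void *last_addr;
size_t last_bytes;

void record_invalidate(void *addr, size_t bytes)
{
    last_addr = addr;
    last_bytes = bytes;
}
```

The intended use is to call invalidate_buffer(buffer, buffer_size_in_bytes, hook) right after fft_execute() returns and before the CPU reads the results. The byte count depends on how the in/out buffer is laid out, so it is left to the caller here.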
Regards,
Li