My customer has implemented a 4096-point real FFT and wants to compare its execution time for two memory layouts. Here is the scenario:
"I have written FFT code putting the input & output buffers and twiddle
factors in External Memory. I just now moved it all into fast on-chip RAM,
combining RAML4&5 for input buffer, RAML6&7 for output buffer, and moving
the previous stuff in RAML5 to RAML2. I left Twiddle factors and window
functions in Flash. I was able to get exactly the same result.
Now I want to compare the time taken by the 2 methods.
I used GPIO32 and measured the time between pulses on the
oscilloscope.
Doing a 4096 point real FFT using only on-chip memory took 3.65 ms without
any window function, 4.15 ms with window multiply.
With the input/output buffers in external memory and the window function
applied, it took 15.75 ms."
Why does the same FFT code take so much longer with (fast) external RAM than with on-chip RAM? The XINTF bus is slower, but the difference in timing (15.75 ms versus 4.15 ms) is quite large.
Any thoughts/feedback will be appreciated.
Regards,
Pradeep Shinde
DCAT, Dallas
Because of the CPU pipeline, execution from external RAM can be much slower, depending on where you have everything linked. The external RAM appears as a single block of memory that allows only one access at a time. The worst case is with both code and data external: the pipeline could then be trying to do an instruction fetch, a data read, and a data write to external RAM at the same time. The external interface becomes the bottleneck, and everything slows down. In the same scenario with internal RAM, the fetch, read, and write could each go to a different RAM block, so there would be no bottleneck and everything would run single-cycle.
Regards,
David
Thanks, David.
So external RAM is essentially just additional memory. I was under the impression that XINTF bus speed was the only criterion when estimating execution speed.
Regards,
Pradeep Shinde
Pradeep,
A key point in David's message is allocating to "different RAM blocks". Each RAM block is single-access, but you can access different RAM blocks at the same time. So if you put all the code and data in one internal RAM block, you would also introduce stalls. Even putting code and one buffer in the same block could cost additional cycles. The best scenario has everything in separate RAM blocks so that no access has to wait for another to complete.
Lori