Hi,
I am using a special compiler that synthesizes generic gcc-compatible DSP code, and I would like to get it running as efficiently as possible without resorting to hand-coding special instructions. I have compiled my own custom code using C6RunApp for the Experimenter L138 board; as I understand it, the program then runs almost exclusively on the C674x unit. However, my own program only reaches 19 MFLOPS. (If I use a more standard benchmark such as http://www.netlib.org/benchmark/linpackc.new it reports only 37 MFLOPS, and less for less optimal dataset sizes.)
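To be clear about what an MFLOPS figure like the ones above means: it is a known number of floating-point operations divided by the elapsed time. The following is only a minimal sketch of that calculation, not the actual generated program. The names saxpy, mflops and measure_saxpy_mflops are made up for illustration, and it assumes the standard clock() function is usable on the target.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Illustrative kernel (not the actual program): y[i] += a * x[i].
   Each element costs one multiply and one add, i.e. 2 flops. */
static void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Convert a flop count and an elapsed time in seconds to MFLOPS. */
static double mflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds / 1.0e6;
}

/* Time `reps` passes of saxpy over n elements and return MFLOPS.
   Assumes clock() is available; returns -1.0 if the run was too
   short for clock() to register any elapsed time. */
static double measure_saxpy_mflops(size_t n, int reps, float *x, float *y)
{
    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        saxpy(n, 0.5f, x, y);
    clock_t stop = clock();

    double seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return -1.0;
    return mflops(2.0 * (double)n * (double)reps, seconds);
}
```

Calling measure_saxpy_mflops with two caller-allocated float arrays of length n and a repetition count large enough to run for a second or more gives a figure comparable in kind to the numbers quoted above, although a different kernel will of course give a different result.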
My question is: do I need to change the setup in order to get closer to 300 MFLOPS? That would correspond to about one floating-point operation per cycle at 300 MHz, which is the clock speed of the ARM core anyway.
For instance, I couldn't find any special flags to pass to the compiler that tell it which processor to target so that it can optimize the code better. I am already building in release mode.
Or how much would it help to disable DSP/BIOS? I don't specifically need to use C6Run to compile my program.
Or is there some reason why Code Composer Studio would do a much better job of optimizing the DSP code?
Is there a document I should read about managing the memory layout better? The program that generates the gcc code is just allocating a large number of scalar floats, so they aren't on the heap. Or should I assume that cache misses are causing this massive slowdown?
Thanks in advance for your response,
John