
Maximizing Generic Floating-Point Performance for C674x on L138

Hi,

I am using a special compiler that synthesizes generic gcc-compatible DSP code, and I would like it to run as efficiently as possible without resorting to hand-coding special instructions. I have compiled my own custom code with C6RunApp for the Experimenter L138 board; as I understand it, this way the program runs almost exclusively on the C674x unit. However, my program only achieves 19 MFLOPS. (If I use a more standard benchmark such as http://www.netlib.org/benchmark/linpackc.new, it reports only 37 MFLOPS at best, and less for suboptimal dataset sizes.)
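To show the kind of measurement I mean, here is a minimal FLOP-rate check along these lines. This is not my generated code or the LINPACK kernel; the dot-product kernel, the 2-FLOPs-per-iteration count, and the `clock()`-based timing are just illustrative assumptions:

```c
#include <time.h>

/* Illustrative micro-kernel: 'reps' passes of a dot product over
 * n floats, i.e. 2 FLOPs per inner iteration.  Just a floor-check
 * for the DSP build, not a real benchmark. */
float dot_reps(const float *x, const float *y, int n, int reps)
{
    float acc = 0.0f;
    for (int r = 0; r < reps; ++r)
        for (int i = 0; i < n; ++i)
            acc += x[i] * y[i];
    return acc;
}

/* Time the kernel and convert to MFLOPS.  clock() granularity is
 * coarse, so reps should be large enough for a run of >1 second. */
double measure_mflops(const float *x, const float *y, int n, int reps)
{
    clock_t t0 = clock();
    volatile float sink = dot_reps(x, y, n, reps);  /* keep result live */
    clock_t t1 = clock();
    (void)sink;
    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    return secs > 0.0 ? (2.0 * n * reps) / (secs * 1e6) : 0.0;
}
```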

My question is: do I need to change my setup to get closer to 300 MFLOPS, which would correspond to the 300 MHz clock speed of the ARM core anyway?

For instance, I couldn't find any special flags to pass to the compiler that specify the target processor and help it optimize the code better. I am already building in release mode.

Alternatively, how much would it help to disable DSP/BIOS? I don't specifically need to use C6Run to compile my program.

Or is there some reason why Code Composer Studio would do a much better job of optimizing the DSP code?

Is there a document I should read about managing the memory layout better? The program that generates the gcc-compatible code just allocates a large number of scalar floats, so they are not on the heap. Or should I assume that cache misses are causing this massive slowdown?
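One thing I could try on the layout side: instead of emitting many independent scalar globals, which the linker may scatter across .bss, group the state into one contiguous struct so related values share cache lines. A hypothetical sketch (the names and the update rule are made up, not my generated code):

```c
/* Hypothetical: collapse what would otherwise be many separate
 * scalar globals into one struct, so related values land in
 * adjacent cache lines instead of being scattered across .bss. */
struct state {
    float a, b, c;       /* formerly independent scalars */
    float coeff[256];    /* formerly 256 separate globals */
};

struct state g_state;    /* one contiguous, zero-initialized block */

/* Made-up update step; the point is only that every access now
 * touches nearby memory. */
float step(float in)
{
    g_state.a = in * g_state.coeff[0];
    g_state.b = g_state.a + g_state.coeff[1];
    g_state.c = g_state.b * 0.5f;
    return g_state.c;
}
```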

Thanks in advance for your response,

John

  • Hello all,

    I did some further testing regarding cache misses and program size. For example, consider the sample C6RunApp benchmark program emqbit. The executable bench_arm is only 12,959 bytes and runs entirely on the ARM. The executable bench_dsp, which runs the same program but primarily on the DSP, is 331,031 bytes. (Sizes are similar for C6RunLib compilation, and for other programs the DSP-runnable executable always seems to be about this size.) So the binary code for the inter-processor communication appears to be quite large, yet the L2 cache on the DSP is only 256 KB. This means the DSP's L2 cache could well be missing constantly, especially once the program's data is taken into account, which would explain the low performance: the program has to go all the way back to DRAM across the system bus.
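    The arithmetic behind this argument can be made explicit (sizes are the ones measured above; the helper function is just illustrative):

    ```c
    /* Back-of-the-envelope check: how many bytes of the DSP-side
     * image cannot be resident in L2 even before any program data
     * is cached (0 if the image fits). */
    long l2_overflow(long image_bytes, long l2_bytes)
    {
        return image_bytes > l2_bytes ? image_bytes - l2_bytes : 0;
    }
    ```

    With image_bytes = 331031 and l2_bytes = 256 * 1024 = 262144, some 68,887 bytes (about 67 KB) of code alone can never be cache-resident, before counting any data.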

    Is it possible to avoid C6Run altogether to decrease the binary size and see whether my code runs faster? Is my understanding correct that I can do that simply by implementing all the inter-processor communication myself using DSPLink utilities?

    Thanks again very much for your response. I will report back on my progress once I receive a suggestion.

    Best,

    Edgar

    PS. Just to emphasize the problem, I ran the same code on the BeagleBoard-xM's DSP using C6Run. It runs at 50% of the speed I currently get on the L138 Experimenter, even though the BeagleBoard-xM's DSP has to emulate floating-point instructions. The BeagleBoard-xM would suffer roughly the same overhead if the slowdown were due to C6Run's code size eating up the DSP L2 cache.