Linpack Benchmarks Numbers on C6678

Hi,

I'm trying to run the Linpack benchmark on a C6678 EVM board. This is the Linpack source file I'm using.

I saw some slides from TI claiming that you got 11 GFLOPs/core on the same chip running that benchmark.

(page 5 of the slides)

So I compiled and ran the Linpack benchmark on the C6678 with the same configuration, but it only reached 130 MFLOPs/core without optimization. Then I manually optimized the source code and turned on all the compiler optimizations, but could only get up to 500 MFLOPs. (The theoretical limit is 16 GFLOPs @ 1 GHz.)

Even though I didn't use OpenMP and only ran the benchmark on one core, it should reach at least a few GFLOPs rather than MFLOPs.

So I'm wondering whether I missed something important for getting high performance. Could you please point me in the right direction to reproduce the numbers in the slides?

Thanks in advance,

Shang

  • Hi Shang,

    With reference to the attached presentation, the maximum performance achieved is 25.3 GFLOPS @ 1 GHz. That figure was measured for the whole C6678 device (8 cores), not per core.

    The theoretical floating-point limit is 16 GFLOPs/core @ 1 GHz and 20 GFLOPs/core @ 1.25 GHz.
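
    To put the figures in this thread side by side, the arithmetic can be sketched in a few lines of C. This is only an illustration: it assumes the peak scales linearly with clock and core count (16 floating-point operations per cycle per core, which is what 16 GFLOPs/core @ 1 GHz implies), and the function names are made up for this example.

    ```c
    #include <stdio.h>

    /* Assumption: 16 floating-point operations per cycle per core,
       as implied by the 16 GFLOPs/core @ 1 GHz figure in this thread. */
    #define FLOPS_PER_CYCLE_PER_CORE 16.0

    /* Theoretical peak in GFLOPs for a given clock (GHz) and core count. */
    double peak_gflops(double clock_ghz, int cores)
    {
        return FLOPS_PER_CYCLE_PER_CORE * clock_ghz * cores;
    }

    /* Fraction of peak reached by a measured rate (both in GFLOPs). */
    double efficiency(double measured_gflops, double peak)
    {
        return measured_gflops / peak;
    }

    /* Hypothetical helper: prints the figures discussed in this thread. */
    void print_summary(void)
    {
        double one_core = peak_gflops(1.0, 1);   /* single core @ 1 GHz  */
        double device   = peak_gflops(1.0, 8);   /* all 8 cores @ 1 GHz  */

        printf("peak, 1 core : %.1f GFLOPs\n", one_core);
        printf("peak, 8 cores: %.1f GFLOPs\n", device);
        /* 500 MFLOPs measured on one core (from the original post). */
        printf("0.5 GFLOPs on 1 core  : %.1f%% of peak\n",
               100.0 * efficiency(0.5, one_core));
        /* 25.3 GFLOPS quoted from the presentation for the whole device. */
        printf("25.3 GFLOPS on 8 cores: %.1f%% of peak\n",
               100.0 * efficiency(25.3, device));
    }
    ```

    Under these assumptions, 500 MFLOPs on one core is about 3% of the single-core peak, while 25.3 GFLOPS across eight cores is roughly 20% of the device peak.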

    Thank you.
  • You can download an example for the KeyStone I device (C6678) here:

    The datasheet numbers are accurate for what they are: the design goal and true capability of the C66x architecture at the stated instruction cycle rate. Your application may use the architecture at this peak rate, or it may rely on other high-performance features of the device such as caching, wide internal memory buses, 128-bit data types, 8 parallel functional units, and so on.

    Please refer to the threads below, which may help you understand more:

    The first suggestion I would make is to look at the assembly code that is being generated. You need to give the compiler as much information as possible so that it can optimize as much as possible.
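
    As an illustration of giving the compiler more information, here is a minimal sketch of the daxpy kernel that dominates Linpack. It is not taken from the benchmark source or from TI's port. `restrict` is standard C99; `#pragma MUST_ITERATE` is a TI compiler pragma and is guarded so the code still builds elsewhere. The trip-count promise (at least 2 iterations, always a multiple of 2) is an assumption that the caller would have to guarantee; drop the pragma if it does not hold for your data.

    ```c
    #include <stddef.h>

    /* y[i] += a * x[i], the inner kernel of Linpack (daxpy).
       `restrict` promises the compiler that x and y never overlap,
       so loads and stores can be reordered and software-pipelined. */
    void daxpy_hinted(size_t n, double a,
                      const double *restrict x, double *restrict y)
    {
        size_t i;
    #ifdef __TI_COMPILER_VERSION__
        /* TI-only hint. ASSUMPTION: n >= 2 and n is a multiple of 2.
           The result is undefined if the caller breaks this promise. */
        #pragma MUST_ITERATE(2, , 2)
    #endif
        for (i = 0; i < n; i++) {
            y[i] += a * x[i];
        }
    }
    ```

    Comparing the generated assembly with and without these hints is one way to see whether the loop is being software-pipelined.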

    We have a 4-day workshop on Optimization on the C6000.  You can get the whole thing here: 

    http://processors.wiki.ti.com/index.php/TMS320C6000_DSP_Optimization_Workshop 

    If the assembly is actually optimized correctly, then you're likely getting stalls. These could be due to the specifics of the cache configuration, data memory placement, or data alignment. All of these issues are also covered in the workshop above.
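
    For the placement and alignment points, a rough sketch of what that can look like in source is shown here. `DATA_ALIGN` and `DATA_SECTION` are TI compiler pragmas and are guarded so the file still compiles with other toolchains. The section name `.l2_fast`, the array size, and the helper `is_aligned` are hypothetical; the section would have to be mapped to on-chip memory in your own linker command file.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    #define N 1024   /* hypothetical problem size */

    #ifdef __TI_COMPILER_VERSION__
    /* TI-only: start the array on a 128-byte boundary and place it in a
       named section. ".l2_fast" is a made-up name for this example; map
       it to fast internal memory in the linker command file. */
    #pragma DATA_ALIGN(vec_x, 128)
    #pragma DATA_SECTION(vec_x, ".l2_fast")
    #endif
    double vec_x[N];

    /* Returns 1 if p is aligned to `alignment` bytes, otherwise 0.
       Useful as a debug check before entering a hot loop. */
    int is_aligned(const void *p, size_t alignment)
    {
        return ((uintptr_t)p % alignment) == 0;
    }
    ```

    A debug build could call `is_aligned(vec_x, 128)` on the target to confirm that the buffer really landed where the pragmas asked for it.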

    Thank you.

  • I'd also suggest taking a look at the following application notes:

    - Throughput Performance Guide for C66x KeyStone Devices

    - Optimizing Loops on the C66x DSP

  • Thank you Raja, I'll try their linpack and see whether I would get similar numbers.
  • Hi Shang,

    Please refer to the post below from Asheesh:

    Thank you.