About the C6678 20 GFLOPS performance

Hi,

I am currently working with the C6678 and exploring its processing performance. According to the user guide, each core running at 1.25 GHz supports up to 20 GFLOPS. However, looking at the core architecture, I do not understand how one can actually exploit these 20 GFLOPS. I am not sure about the reasoning below and would like the opinion of a TI expert.

As far as I understand, each core has two register files, and each register file is connected through a 64-bit bus, so only two single-precision load/store operations can take place per cycle for each register file. Consequently, each register file can feed only one single-precision multiplication per cycle (its two 32-bit operands having been loaded from L1 in the previous cycle). With two register files, a core can therefore perform only two single-precision multiplications per cycle when it operates on new operands loaded from L1 in the previous cycle. In a realistic scenario, then, each core running at 1.25 GHz would have a capacity of 2.5 GFLOPS. So 100% utilization (i.e. 20 GFLOPS per core) is not practical unless the calculations are done on the same data over and over.

Is that right, or am I missing something?

Thank you very much in advance,

Mike

  • Mike,

    I have the same question for TI.

    The 20 GFLOPS figure (single precision) can be found in the documentation (SPRUGH7, November 2010, p. 29): "1.1.1 4x Multiply
    The new C66x Core ISA significantly improves the maximum number multiply
    operations that can be executed per cycle. The core can now execute up to 32
    (16x16-bit) multiplies per cycle or up to 8 single-precision floating-point multiplies per
    cycle."

    So the peak for floating-point multiplies is 8 * 1.25 GHz = 10 GFLOPS (.M units), plus 10 GFLOPS for addition/subtraction (other units), which gives 20 GFLOPS,

    but I can't achieve even 30% of this.

  • Mike,

    I may not be the most experienced expert on this topic, but I will make an attempt to answer your question.  

    First, let's look at the FLOPS measurement to figure out where the 20 GFLOPS comes from.  FLOPS stands for Floating Point Operations Per Second.  As far as I understand, this includes not only floating-point multiplications but also floating-point additions.

    So, now let's look at the available instructions and the units that they consume.  Using the QMPYSP instruction, we can do 4 single-precision multiplications in one cycle on the .M1 unit and 4 single-precision multiplications on the .M2 unit.  Additionally, using the DADDSP instruction, we can do 2 SP additions on the .L1 unit, 2 on the .L2 unit, 2 on the .S1 unit, and 2 on the .S2 unit.  So, that's a total of 16 floating-point operations per cycle.  These are all SIMD (Single Instruction Multiple Data) instructions.

    16 FLOP/Cycle * 1.25 GCycles/second = 20 GFLOP/second.  So, this is where the calculation comes from.
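
    The arithmetic above can be written out as a small C sketch. The per-unit counts are the ones quoted in this post; the names `flop_per_cycle` and `peak_gflops` are made up for illustration and are not part of any TI library.

    ```c
    /* Per-cycle single-precision operation counts, as quoted in the post above. */
    enum {
        SP_MUL_PER_M_UNIT  = 4, /* QMPYSP: 4 SP multiplies per .M unit     */
        NUM_M_UNITS        = 2, /* .M1 and .M2                             */
        SP_ADD_PER_LS_UNIT = 2, /* DADDSP: 2 SP additions per .L/.S unit   */
        NUM_LS_UNITS       = 4  /* .L1, .L2, .S1 and .S2                   */
    };

    /* Theoretical floating-point operations issued per cycle by one core. */
    int flop_per_cycle(void)
    {
        return SP_MUL_PER_M_UNIT * NUM_M_UNITS
             + SP_ADD_PER_LS_UNIT * NUM_LS_UNITS;
    }

    /* Theoretical peak in GFLOPS for a given core clock in GHz. */
    double peak_gflops(double clock_ghz)
    {
        return flop_per_cycle() * clock_ghz;
    }
    ```

    Calling `peak_gflops(1.25)` reproduces the per-core figure discussed here. It is only an upper bound: it assumes every .M, .L and .S unit issues a SIMD floating-point instruction on every cycle.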

    Now, addressing your point that "utilization is not practical".  This is not necessarily true, although it may not hold for every DSP algorithm.  I will give you a very simple example of one that is realistic.  Consider an FIR filter.  If you pre-load the coefficients into registers, then you never have to fetch them again.  So, every cycle, you can bring in 2 samples and then generate the next output sample.
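
    As a rough illustration of the FIR idea, here is a minimal scalar sketch in portable C. It is not C66x-specific code and uses no intrinsics; the function name `fir4`, the fixed length of 4 taps and the argument layout are assumptions made for the example. The coefficients and the sliding window of samples live in local variables, so each output needs only one new load while performing 4 multiplies and 3 additions.

    ```c
    #include <stddef.h>

    /* 4-tap FIR: y[n] = c0*x[n] + c1*x[n+1] + c2*x[n+2] + c3*x[n+3].
     * The caller must provide n_out + 3 input samples in x. */
    void fir4(const float *restrict x, const float *restrict coef,
              float *restrict y, size_t n_out)
    {
        /* Load the coefficients once; they stay in locals (ideally in
         * registers) for the whole loop and are never fetched again. */
        const float c0 = coef[0], c1 = coef[1], c2 = coef[2], c3 = coef[3];

        /* Sliding window of the three most recent samples. */
        float x0 = x[0], x1 = x[1], x2 = x[2];

        for (size_t n = 0; n < n_out; ++n) {
            const float x3 = x[n + 3];          /* the only load per output */
            y[n] = c0 * x0 + c1 * x1 + c2 * x2 + c3 * x3;
            x0 = x1;                            /* shift the window */
            x1 = x2;
            x2 = x3;
        }
    }
    ```

    Whether the compiler actually maps such a loop onto QMPYSP and DADDSP depends on the compiler and its options, so the generated assembly is worth checking. The point of the sketch is only that the ratio of arithmetic to memory traffic grows once operands are reused from registers.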

    You'll definitely have to design your algorithm to take advantage of the architecture, which is why understanding the architecture is important.  

    I'd also suggest taking a look at the following application notes:

    http://www.ti.com/lit/an/sprabk5/sprabk5.pdf - Throughput Performance Guide for C66x KeyStone Devices

    http://www.ti.com/lit/an/sprabg7/sprabg7.pdf - Optimizing Loops on the C66x DSP

    Regards,

    Dan

  • Ivan,

    See the comments I posted for Mike.  The first suggestion that I would make is to take a look at the assembly code that is being generated (assuming you're using C).  If the kernel of your algorithm is not using the functional units efficiently, then there are likely more hints that you need to give the compiler in order to let it optimize as much as possible.
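
    To give an idea of what such hints can look like, here is a hedged sketch of a dot-product kernel. The `restrict` qualifiers are standard C99; `_nassert` and `#pragma MUST_ITERATE` are TI C6000 compiler features and are guarded so that the code still builds with other compilers. The function name `dot_sp` and the promises it makes (8-byte aligned arrays, trip count a nonzero multiple of 4) are assumptions for the example, and the caller has to guarantee them.

    ```c
    /* Dot product of two single-precision arrays.
     * Assumed promises from the caller (not checked here):
     *   - a and b do not alias (restrict),
     *   - both arrays are 8-byte aligned,
     *   - n is a nonzero multiple of 4. */
    float dot_sp(const float *restrict a, const float *restrict b, int n)
    {
        float sum = 0.0f;
        int i;

    #ifdef __TI_COMPILER_VERSION__
        /* Tell the TI compiler about the alignment so it can use wide loads. */
        _nassert(((unsigned int)a & 7) == 0);
        _nassert(((unsigned int)b & 7) == 0);
        /* Trip count: at least 4, always a multiple of 4. */
        #pragma MUST_ITERATE(4, , 4)
    #endif
        for (i = 0; i < n; ++i) {
            sum += a[i] * b[i];
        }
        return sum;
    }
    ```

    After adding hints like these, it is worth looking at the generated assembly again to see whether the loop was software pipelined and how many functional units the kernel uses.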

    We have a 4-day workshop on Optimization on the C6000.  You can get the whole thing here: 

    http://processors.wiki.ti.com/index.php/TMS320C6000_DSP_Optimization_Workshop 

    If the assembly is actually optimized correctly, then you're likely getting stalls.  These could be due to the specifics of the cache configuration, data memory placement, or data alignment.  All of these issues are also covered in the workshop above.

    Regards,
    Dan