Compiler/TMS320C6678: Any Compiler Option to Speed Up Double Division of C66X DSP Core

Part Number: TMS320C6678
Other Parts Discussed in Thread: MATHLIB

Tool/software: TI C/C++ Compiler

I am currently stuck on slow double-precision division on the DSP: a single double division costs almost 250 cycles (with the -O3 option and cache enabled). The TCI6614 datasheet contains the following description:

The C66x core incorporates 90 new instructions targeted for floating point (FPi) and vector math oriented (VPi) processing....The C66x CPU also supports SIMD for floating-point operations.

This shows several instruction enhancements in the C66x. Is there a special compiler option that switches on these enhancements to speed up my floating-point division? Any other suggestion for speeding up double division would also be appreciated.

  • Hi, 

    I have a similar issue, and I suspect that the C66x floating-point instructions are not used even though I set -O3 in the compiler options. Please share your solution if you solve this problem. Thanks!

    BRs

  • There is no compiler flag that speeds up the division operation. Unlike the multiply, accumulate, and addition operations, division is not done natively in hardware; it is implemented in software in the run-time support (RTS) library of the TI C6000 compiler.

    To speed up this operation, we provide an optimized version of this function, divdp, in MATHLIB. Please check out MATHLIB and the associated benchmarks to see the performance improvement over the RTS library:

    http://www.ti.com/tool/MATHLIB

    Benchmarks for double precision division:

    Hope this helps.

    Regards,

    Rahul
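
As a rough illustration of the suggestion above, the sketch below routes divisions through one wrapper so that the MATHLIB routine can be switched in for C66x builds. The header path `<ti/mathlib/mathlib.h>`, the `USE_TI_MATHLIB` macro, and the `fast_div` helper are assumptions made for this example, not something stated in the thread; please verify the header name and the `divdp` prototype against the MATHLIB documentation for your release.

```c
/* Assumption: MATHLIB is installed and its top-level header is reachable
 * as <ti/mathlib/mathlib.h>. USE_TI_MATHLIB is a hypothetical macro that
 * you would define only when building for the C66x target. */
#ifdef USE_TI_MATHLIB
#include <ti/mathlib/mathlib.h>
#endif

/* Hypothetical wrapper: every division in the hot path goes through here,
 * so the implementation can be swapped without touching the call sites. */
double fast_div(double num, double den)
{
#ifdef USE_TI_MATHLIB
    return divdp(num, den);   /* MATHLIB double-precision division */
#else
    return num / den;         /* fallback: the compiler's own division */
#endif
}
```

Keeping the choice behind one function also makes it easy to benchmark both variants on the same input data.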

  • Hi Rahul,

    Thanks for the valuable information. I just found that the option "--fp_mode=relaxed" speeds up double division a lot when the division is inside a loop. However, when the division is in ordinary non-loop code (e.g. mixed with other add/subtract operations, or inside std::complex division), it has no effect at all. In my case, speed is more important than precision, and I do not want to rewrite the third-party library. If this option worked in the general case, I think my problem would be solved.

    I will try the "divdp" function in MATHLIB. One question: which column gives the cycle count for MATHLIB division? 281 is for RTS. Is it 66 or 48 cycles for MATHLIB division?
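
The two situations described in this post can be pictured with the minimal sketch below. The function names are hypothetical, the build line in the comment assumes the TI `cl6x` compiler driver, and the comments only restate the poster's observation; they are not a documented guarantee of what the compiler does.

```c
#include <stddef.h>

/* Possible build line on the TI toolchain (assumption):
 *     cl6x -O3 --fp_mode=relaxed kernel.c
 */

/* Pattern 1: division inside a counted loop. This is the case where the
 * poster saw a large speed-up with --fp_mode=relaxed. */
void divide_arrays(const double *num, const double *den, double *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        out[i] = num[i] / den[i];
    }
}

/* Pattern 2: a single division mixed into straight-line arithmetic. This
 * is the case where the poster saw no effect from the option. */
double offset_ratio(double a, double b, double c)
{
    return a + b / c - 1.0;
}
```

Timing both functions with and without the option would be one way to confirm whether the difference reported here also shows up in your own build.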

  • Here is a clarification of the three columns highlighted in the divsp benchmark shared.

    • RTS: The native compiler benchmark, using the division operation from the RTS library for the C66x architecture.
    • C: The benchmark of the optimized C implementation in MATHLIB for a single division operation, where the code branches to the function.
    • Inline: The benchmark of the optimized MATHLIB function with the division operation inlined. This saves branching latency but increases code size.

    Hope this helps.

    Regards,

    Rahul
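
The difference between the "C" and "Inline" columns can be sketched in portable C as below. The function bodies use a reciprocal seed refined by Newton-Raphson steps, which is a common way to build division in software; this is a stand-in chosen for illustration, not the MATHLIB source, and the names `div_inline` and `div_call` are hypothetical. The single-precision seed also limits this sketch to divisors that fit in the `float` range.

```c
/* Inline variant: the body is copied into every call site, so there is no
 * call/return branch, at the price of larger code. */
static inline double div_inline(double a, double b)
{
    /* Low-precision reciprocal seed (stand-in for a hardware estimate). */
    double x = (double)(1.0f / (float)b);

    /* Each Newton-Raphson step roughly doubles the number of correct bits. */
    x = x * (2.0 - b * x);
    x = x * (2.0 - b * x);

    return a * x;   /* a / b computed as a * (1 / b) */
}

/* Out-of-line variant: one shared copy of the code, reached by branching
 * to the function and back. */
double div_call(double a, double b)
{
    return div_inline(a, b);
}
```

A caller inside a tight loop would typically prefer the inline form, while code that divides in many scattered places may prefer the out-of-line form to keep the program smaller.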