Compiler/TMS320C6678: Any Compiler Option to Speed Up Double Division of C66X DSP Core

PapaDog

Part Number: TMS320C6678
Other Parts Discussed in Thread: MATHLIB

Tool/software: TI C/C++ Compiler

I am currently stuck on slow double-precision division on the DSP: a single double division costs almost 250 cycles (with the -O3 option and memory cache enabled). The TCI6614 datasheet contains the following description:

The C66x core incorporates 90 new instructions targeted for floating point (FPi) and vector math oriented (VPi) processing. ... The C66x CPU also supports SIMD for floating-point operations.

This describes several instruction enhancements in the C66x. Is there a special compiler option that switches on these enhancements to speed up my floating-point division? Any other suggestion for speeding up double division would also be appreciated.

over 4 years ago

0 Lin Hu1 over 4 years ago

Prodigy 10 points

Hi,

I have run into a similar issue, and I suspect that the C66x floating-point instructions are not used even though I set -O3 in the compiler options. Please share your solution if you solve this problem. Thanks!

BRs

0 Rahul Prabhu over 4 years ago

TI Guru 114410 points

There is no compiler flag that speeds up the division operation. Unlike multiply, accumulate, and addition, division is not performed natively in hardware; it is implemented in software in the run-time support (RTS) library of the TI C6000 compiler.
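To give a rough idea of what a software division routine can look like, here is a minimal sketch in portable C that uses Newton-Raphson reciprocal refinement. It is not the RTS or MATHLIB implementation. The name `nr_divide`, the linear seed, and the fixed count of four iterations are assumptions chosen for illustration, and the sketch assumes a finite, nonzero divisor in the normal range.

```c
#include <math.h>

/*
 * Sketch only: software double division via Newton-Raphson reciprocal
 * refinement. nr_divide is a hypothetical name, not an RTS or MATHLIB
 * function. Assumes b is finite, nonzero and normal; zero, infinity,
 * NaN and subnormal inputs are not handled.
 */
double nr_divide(double a, double b)
{
    int e, i;
    double m, x;

    m = frexp(fabs(b), &e);               /* |b| = m * 2^e, m in [0.5, 1)   */
    x = 48.0 / 17.0 - (32.0 / 17.0) * m;  /* linear seed for 1/m,           */
                                          /* relative error at most 1/17    */

    for (i = 0; i < 4; ++i)               /* each step squares the error    */
        x = x * (2.0 - m * x);

    x = ldexp(x, -e);                     /* x is now approximately 1/|b|   */
    return a * copysign(x, b);            /* restore the sign of b          */
}
```

For example, `nr_divide(1.0, 3.0)` agrees with `1.0 / 3.0` to within a few units in the last place. The loop contains only multiplies and subtracts, which is why a routine of this shape can map onto hardware that has fast multiply and add but no divide.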

To speed up this operation, we provide an optimized version of this function, divdp, that users can call instead. Please check out MATHLIB and the associated benchmarks to see the improvement in performance over the RTS library:

http://www.ti.com/tool/MATHLIB

Benchmarks for double precision division:

Hope this helps.

Regards,

Rahul

0 PapaDog over 4 years ago in reply to Rahul Prabhu

Prodigy 190 points

Hi Rahul,

Thanks for the valuable information. I just found that the option "--fp_mode=relaxed" speeds up double division a lot when the division is inside a loop. However, when the division is in ordinary non-loop code (e.g. mixed with other add/subtract operations, or inside std::complex division), it has no effect at all. In my case speed is more important than precision, and I do not want to rewrite the third-party library. If this option worked in the common case, I think my problem would be solved.
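To make the "division inside a loop" case concrete, here is a minimal sketch of that code shape: a simple counted loop doing element-wise double division. The function name `div_array` and its signature are hypothetical. Whether a given build actually speeds this loop up under "--fp_mode=relaxed" is the observation reported above, not something the sketch itself shows.

```c
#include <stddef.h>

/*
 * Sketch only: element-wise double division in a simple counted loop.
 * div_array is a hypothetical name. The restrict qualifiers (C99) tell
 * the compiler that the three arrays do not overlap.
 */
void div_array(const double *restrict num, const double *restrict den,
               double *restrict out, size_t n)
{
    size_t i;

    for (i = 0; i < n; ++i)
        out[i] = num[i] / den[i];   /* one independent division per element */
}
```

Each iteration is independent of the others, which is what separates this case from a single division embedded in straight-line code.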

I will try the "divdp" function in MATHLIB. One question: which column gives the cycle count for MATHLIB division? 281 is for RTS. Is it 66 or 48 cycles for the MATHLIB division?

0 Rahul Prabhu over 4 years ago in reply to PapaDog

TI Guru 114410 points

Here is a clarification of the three columns highlighted in the divsp benchmark I shared.

RTS: This is the native compiler benchmark, using the division operation from the RTS library for the C66x architecture.
C: This is the benchmark for the optimized C implementation in MATHLIB for a single division operation, where the code branches to the function.
Inline: This is the optimized MATHLIB function with the division operation inlined. This saves the branching latency but increases code size.
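
The difference between the "C" and "Inline" columns can be pictured with the small sketch below. All names are hypothetical and are not the MATHLIB entry points, and the plain `a / b` body only stands in for an optimized routine. The sketch just contrasts a function that is reached by a call with one whose body is expanded at the use site.

```c
/* Hypothetical names; these are not the MATHLIB entry points. */

/* Called form: one shared copy of the code, reached by a branch. */
double my_div_call(double a, double b)
{
    return a / b;   /* stand-in for an optimized division routine */
}

/* Inlined form: the body may be expanded at every use site, which
 * avoids the call and return but repeats the code each time. */
static inline double my_div_inline(double a, double b)
{
    return a / b;   /* same stand-in body */
}

/* Two callers that differ only in which form they use. */
double scale_call(double x, double y)
{
    return my_div_call(x, y) + 1.0;
}

double scale_inline(double x, double y)
{
    return my_div_inline(x, y) + 1.0;
}
```

Both callers return the same value. Only the generated code differs, and an optimizing compiler may inline the called form as well when its definition is visible.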

Hope this helps.

Regards,

Rahul
