CCS/TMS320C6678: Hardware Setting for C66x DSP Floating-Point Division Calculation

Part Number: TMS320C6678


Tool/software: Code Composer Studio

Hi experts,

I am working on double-precision division on the C66x DSP, and a single operation is expensive (almost 250 cycles). Is there any hardware configuration (a register setting or some board-level connection) that can reduce the cycle count to a reasonable amount?

  • Hello!

    The old wisdom is: don't divide, add and multiply :-) With that in mind, I would first check whether division is absolutely necessary. If you divide by a fixed set of denominators, you could precalculate their reciprocals and replace division with multiplication, i.e. instead of Ni/Di, first calculate the set of (1/Di) and then proceed with Ni*(1/Di).
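
    A minimal sketch of that idea in portable C, assuming the denominators live in a small table that is known before the hot loop runs. The function names and the index array sel are hypothetical, made up for illustration only:

    ```c
    #include <stddef.h>

    /* Hypothetical helper: fill inv[i] = 1 / den[i] once, outside the hot path.
     * These are the only true divisions in the whole scheme. */
    void precompute_reciprocals(const double *den, double *inv, size_t count)
    {
        for (size_t i = 0; i < count; ++i) {
            inv[i] = 1.0 / den[i];
        }
    }

    /* Hot path: out[i] approximates num[i] / den[sel[i]] using a multiply only.
     * sel[i] is the index of the denominator to use for element i. */
    void divide_by_table(const double *num, const size_t *sel,
                         const double *inv, double *out, size_t n)
    {
        for (size_t i = 0; i < n; ++i) {
            out[i] = num[i] * inv[sel[i]];
        }
    }
    ```

    Note that Ni*(1/Di) is rounded twice, so it can differ from Ni/Di in the last bit. Whether that matters depends on the application.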

    If that's not the case and the denominators are all unknown a priori, first check whether you really need double precision; if single precision suffices, proceed with floats.

    Regardless of the above decision, the next thing is to get www.ti.com/.../sprabg7.pdf. In particular, pay attention to clause 3.1.5, Usage of Division Instructions. It has an example of the _rcpsp() intrinsic, which calculates 1/x in 1 (literally one) cycle and gives you the exact exponent but only 8 bits of mantissa. You can then apply Newton iterations, and after just the second one you get full float precision. If you absolutely need doubles, consider the _rcpdp() intrinsic, which does the same for doubles. After the initial approximation it takes 3 Newton iterations to reach full double precision.
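
    The Newton step for a reciprocal is x_new = x * (2 - d * x), and each step roughly doubles the number of correct bits (about 8, 16, 32, 64), which is where the 3 iterations for double come from. Below is a rough sketch of the double version, not the code from the application note. It assumes the TI compiler defines _TMS320C6X and that c6x.h declares _rcpdp(); on any other compiler it falls back to a hypothetical host stand-in that fakes an estimate with about 8 mantissa bits so the refinement can be tried off-target. Zero, infinity and NaN inputs are not handled.

    ```c
    #ifdef _TMS320C6X
    #include <c6x.h>                      /* TI compiler intrinsics, incl. _rcpdp() */
    #define RCP_ESTIMATE(d) _rcpdp(d)
    #else
    #include <math.h>
    /* Host stand-in (an assumption, for off-target testing only): returns 1/d
     * with the mantissa truncated to about 8 bits, mimicking a coarse estimate. */
    static double rcp_estimate_host(double d)
    {
        int e;
        double m = frexp(1.0 / d, &e);    /* 1/d = m * 2^e, 0.5 <= |m| < 1 */
        m = floor(m * 256.0) / 256.0;     /* drop everything below 2^-8 */
        return ldexp(m, e);
    }
    #define RCP_ESTIMATE(d) rcp_estimate_host(d)
    #endif

    /* Reciprocal of d: coarse estimate refined by three Newton steps. */
    double recip_newton(double d)
    {
        double x = RCP_ESTIMATE(d);
        x = x * (2.0 - d * x);            /* ~16 correct bits */
        x = x * (2.0 - d * x);            /* ~32 correct bits */
        x = x * (2.0 - d * x);            /* ~64, i.e. beyond double's 53 */
        return x;
    }

    /* n / d expressed as a multiply by the refined reciprocal. */
    double div_newton(double n, double d)
    {
        return n * recip_newton(d);
    }
    ```

    The result may still differ from a true IEEE division in the last bit or so, since nothing here applies a final correctly rounded correction.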

    These techniques pipeline very well under -O3, and you should see a considerable gain. However, the gain comes mainly in loops. Sometimes it pays off to calculate all the necessary quotients in one loop, perhaps sacrificing some memory for a buffer to store them, and then proceed with the rest of the algorithm.
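
    As an illustration of the batching idea, here is a single-precision sketch that computes all quotients into a buffer in one simple loop, the kind of loop a software pipeliner generally handles well. The function name divide_batch and the buffer layout are hypothetical. As an assumption, the TI build uses _rcpsp() from c6x.h, and other compilers get a host stand-in with roughly 8 mantissa bits. Zero, infinity and NaN denominators are not handled.

    ```c
    #include <stddef.h>

    #ifdef _TMS320C6X
    #include <c6x.h>                      /* TI compiler intrinsics, incl. _rcpsp() */
    #define RCPSP_ESTIMATE(d) _rcpsp(d)
    #else
    #include <math.h>
    /* Host stand-in (an assumption, for off-target testing only). */
    static float rcpsp_estimate_host(float d)
    {
        int e;
        float m = frexpf(1.0f / d, &e);   /* 1/d = m * 2^e, 0.5 <= |m| < 1 */
        m = floorf(m * 256.0f) / 256.0f;  /* keep about 8 mantissa bits */
        return ldexpf(m, e);
    }
    #define RCPSP_ESTIMATE(d) rcpsp_estimate_host(d)
    #endif

    /* quot[i] ~= num[i] / den[i] for i in [0, n). The restrict qualifiers
     * promise the compiler that the three buffers do not overlap. */
    void divide_batch(const float *restrict num, const float *restrict den,
                      float *restrict quot, size_t n)
    {
        for (size_t i = 0; i < n; ++i) {
            float d = den[i];
            float x = RCPSP_ESTIMATE(d);
            x = x * (2.0f - d * x);       /* first Newton step, ~16 bits */
            x = x * (2.0f - d * x);       /* second step, covers float's 24 bits */
            quot[i] = num[i] * x;
        }
    }
    ```

    The rest of the algorithm can then read from quot instead of dividing inline.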

    Hope this helps.

  • -O3 has already been applied. "--fp_mode=relaxed" only helps for code with large loops.

    So I take your answer to mean that there is no special DSP register or hardware configuration that speeds up double division across the whole application? My only choice is to refine the code.

  • Hello!

    To my knowledge, there is no better way. There is no native division acceleration beyond the already mentioned _rcpsp()/_rcpdp() instructions.