This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

Bugs in sqrt_16() sqrtv.asm, and non-optimal performance

There is an off-by-one error in the sqrt_16() routine of dsplib for all output values, e.g., sqrt(0) = 0x0001 (0.000030517578125), instead of the expected 0.

In addition, the code is not taking full advantage of the C55x instruction set, and thus its memory, stack, and CPU usage are higher than necessary.  The two problems are actually related, although it is not necessary to optimize the routine to fix the bug.

The bug is that the final result must be rounded and then multiplied by 2, but the rounding is applied to the wrong digit because it is applied before the shift/multiply.  In other words, instead of actually rounding the value, the code is merely adding 1 to every result, which explains why this routine claims that the square root of 0 is 1.

The fix is to alter the value added to the final result so that it is rounded without excessive offset.  Even though sqrt_16() is only an estimation, and it is only carried out to fiveiterations, there's no need to have the additional error of being off by one.

The non-optimal performance is related to two things.  First, an intermediate value is temporarily saved in the output vector and accessed once per loop, when there are plenty of registers available that could hold the value.  Second, the ROUND opcode and round modes of various other opcodes are never used, and instead a manual rounding is done via creating a temporary stack variable for the offset and inserting additional instructions.  It is fairly simple to rewrite this code to use the ROUND function where necessary, and even more importantly to use the R (Round) modifier for instructions that can round a result without incurring an additional processor cycle.

Note that sqrt_16() is interesting because the typically useful FRCT mode of the processor is not used because the iteration involves an explicit divide-by-two on paper than ends up being a free operation when FRCT mode is off.  The only catch is that the final result must be multiplied by two (which is where the whole bug comes into play).

I found all of this in the process of writing and testing a sqrt_32() variation which accepts a 32-bit vector and calculates the 16-bit square root of each vector element.  Since sqrt_16() was already an off-place operation, the 32-bit to 16-bit operation fits well.  There were surprisingly few changes needed to support 32-bit input, and due to the nature of fixed-point division, it makes more sense to start with a 32-bit dividend.