Tool/software: TI C/C++ Compiler
Hi everybody,
I need to exploit the full 8 floating-point operations per cycle.
For example, as far as I know, the 4-way SIMD single-precision multiply "QMPYSP" can only be generated by the compiler if we use the intrinsic function "_qmpysp".
To keep the code clean, I'm trying to wrap two of these QMPYSP operations in a function (to get the 8 FLOP/cycle) and call this function from the inner loop of my code. I'd like the compiler to generate an SPLOOP/SPKERNEL pair, as if the two QMPYSP operations were written directly in the inner loop, without any function call.
I've tried the following code, without success:
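#include <c6x.h>   /* TI header that declares the C6000 intrinsics */

/* Eight packed single-precision floats, held as two 128-bit quads. */
typedef struct {
    __x128_t lo;
    __x128_t hi;
} f32_x8_operand_t;

#pragma FUNC_IS_PURE (f32_mul_x8)
#pragma NO_HOOKS (f32_mul_x8)
#pragma FUNC_ALWAYS_INLINE (f32_mul_x8)
static f32_x8_operand_t f32_mul_x8(f32_x8_operand_t src0, f32_x8_operand_t src1)
{
    f32_x8_operand_t dst;
    dst.lo = _qmpysp(src0.lo, src1.lo);   /* four products, low quad */
    dst.hi = _qmpysp(src0.hi, src1.hi);   /* four products, high quad */
    return dst;
}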
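For anyone who wants to check the arithmetic off-target, the helper and the kind of inner loop that calls it can be modelled in portable C. This is only a minimal host-side sketch, not TI code: `quad_f32`, `f32_x8_model_t`, `qmpysp_model`, `f32_mul_x8_model` and `mul_arrays_x8` are hypothetical names introduced for illustration, and the model assumes only that `_qmpysp` multiplies four single-precision lanes pairwise. It says nothing about how the TI compiler schedules the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side stand-in for __x128_t: four packed floats. */
typedef struct {
    float lane[4];
} quad_f32;

/* Hypothetical stand-in for the two-quad operand used in the post. */
typedef struct {
    quad_f32 lo;
    quad_f32 hi;
} f32_x8_model_t;

/* Models the lane-wise product expected from one _qmpysp call. */
static quad_f32 qmpysp_model(quad_f32 a, quad_f32 b)
{
    quad_f32 r;
    for (int i = 0; i < 4; ++i) {
        r.lane[i] = a.lane[i] * b.lane[i];
    }
    return r;
}

/* Mirrors the shape of f32_mul_x8: two quad multiplies, eight products. */
static f32_x8_model_t f32_mul_x8_model(f32_x8_model_t src0, f32_x8_model_t src1)
{
    f32_x8_model_t dst;
    dst.lo = qmpysp_model(src0.lo, src1.lo);
    dst.hi = qmpysp_model(src0.hi, src1.hi);
    return dst;
}

/* Inner loop: one helper call per iteration, eight products per call. */
static void mul_arrays_x8(const f32_x8_model_t *a, const f32_x8_model_t *b,
                          f32_x8_model_t *out, size_t count)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < count; ++i) {
        out[i] = f32_mul_x8_model(a[i], b[i]);
    }
}
```

On the target, the loop body would call `f32_mul_x8` on `f32_x8_operand_t` values instead; the model is only useful for comparing results against the DSP output.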
In summary, the function body is properly inlined, but the inlined code never ends up between an SPLOOP/SPKERNEL pair; instead, the loop is closed with an ordinary branch. I imagine that a loop built on an ordinary branch cannot take advantage of loop pipelining in hardware (only out-of-order pipelines can achieve that).
Do you have any recommendations?
Thanks,
Carlo