Compiler/DRA829V: C66: how to write code that exploits the full 8 floating-point operations per clock cycle

Colombo Carlo

Part Number: DRA829V

Tool/software: TI C/C++ Compiler

Hi everybody,

I need to exploit the full 8 floating point operations per cycle.

For example, as far as I know, the 4-way SIMD single-precision multiply "QMPYSP" can only be generated by the compiler if we use the intrinsic function "_qmpysp".

To start writing clean code, I'm trying to put 2 of these QMPYSP in a function (and get the 8 FLOP/cycle) and call this function from the inner loop of my code. I'd like to have the compiler generate a pair of SPLOOP/SPKERNEL as if the 2 QMPYSP were directly written in the inner loop, without any function call.

I've tried the following code, without success:

    typedef struct {
        __x128_t lo;
        __x128_t hi;
    } f32_x8_operand_t;

    #pragma FUNC_IS_PURE (f32_mul_x8)
    #pragma NO_HOOKS (f32_mul_x8)
    #pragma FUNC_ALWAYS_INLINE (f32_mul_x8)
    static f32_x8_operand_t f32_mul_x8(f32_x8_operand_t src0, f32_x8_operand_t src1)
    {
        f32_x8_operand_t dst;
        dst.lo = _qmpysp(src0.lo, src1.lo);
        dst.hi = _qmpysp(src0.hi, src1.hi);
        return dst;
    }

In summary, the function body is properly inlined, but never between an SPLOOP/SPKERNEL pair; instead, a normal branch operation is generated. I imagine that a normal branch cannot take advantage of loop pipelining in hardware (only out-of-order pipelines can achieve that).

Do you have any recommendation?

Thanks,

Carlo

over 5 years ago

George Mock over 5 years ago

TI Guru 251220 points

To get an idea of how the _qmpysp intrinsic is typically used, I recommend you install DSPLIB for C66x. Inspect the source file DSPF_sp_mat_mul_gemm.c. It is located in a directory with a path similar to ...

C:\ti\dsplib_c66x_3_4_0_0\packages\ti\dsplib\src\DSPF_sp_mat_mul_gemm\c66

See how it uses the intrinsic _qmpysp. Your use should be similar.

Colombo Carlo said:
I'd like to have the compiler generate a pair of SPLOOP/SPKERNEL as if the 2 QMPYSP were directly written in the inner loop, without any function call.

Assembly code generated by the compiler for this DSPLIB function contains two QMPYSP instructions in parallel, inside a software pipelined loop.

Perhaps a better way to solve your problem is to call this function, or a similar function, in DSPLIB.

Thanks and regards,

-George

Colombo Carlo over 5 years ago in reply to George Mock

TI Mastermind 24640 points

Hi George,

Thank you, but I need something more.

I used the _qmpysp intrinsic just as I showed in the example code. What I want is to encapsulate it in a function, "f32_mul_x8", and call this function inside a C loop. I expect this C loop to be translated into a hardware loop (SPLOOP/SPKERNEL) through compiler optimization (i.e., function inlining followed by the hardware-loop passes).

From what I've tried, it seems that the hardware-loop pass runs before function inlining, which would explain why it doesn't work as I expect. If I rewrite "f32_mul_x8" as a C macro (a dirty solution), my experiment works as I expect. I'm wondering whether there is a #pragma that solves the issue.

Any suggestions?

Best regards,

Carlo
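
Editor's sketch: the macro workaround mentioned in the post above is not shown in the thread, so the following is only one plausible shape for it, not Carlo's actual code. The names F32_MUL_X4, F32_MUL_X8, f32_x4_t, f32_mul_x4_host and f32_mul_x8_array are hypothetical. The non-TI branch is a plain-C stand-in added purely so the sketch can be compiled and checked on a host machine; it says nothing about the code the C6000 compiler would generate, and whether the loop actually becomes an SPLOOP/SPKERNEL pair still has to be confirmed in the generated assembly.

```c
#ifdef _TMS320C6600
#include <c6x.h>                 /* TI intrinsics such as _qmpysp */
typedef __x128_t f32_x4_t;
#define F32_MUL_X4(a, b) _qmpysp((a), (b))
#else
/* Assumption: host-side stand-in for a 4 x float vector, for off-target checks only. */
typedef struct { float v[4]; } f32_x4_t;
static f32_x4_t f32_mul_x4_host(f32_x4_t a, f32_x4_t b)
{
    f32_x4_t r;
    int k;
    for (k = 0; k < 4; k++)
        r.v[k] = a.v[k] * b.v[k];   /* lane-wise single-precision multiply */
    return r;
}
#define F32_MUL_X4(a, b) f32_mul_x4_host((a), (b))
#endif

typedef struct {
    f32_x4_t lo;
    f32_x4_t hi;
} f32_x8_operand_t;

/* Hypothetical macro form of f32_mul_x8: the two 4-way multiplies are
 * expanded textually into the caller's loop body, so no inlining pass is needed.
 * Note that each argument is evaluated more than once. */
#define F32_MUL_X8(dst, src0, src1)                      \
    ((dst).lo = F32_MUL_X4((src0).lo, (src1).lo),        \
     (dst).hi = F32_MUL_X4((src0).hi, (src1).hi))

/* Inner loop: multiplies n groups of 8 floats element by element. */
void f32_mul_x8_array(f32_x8_operand_t *restrict out,
                      const f32_x8_operand_t *restrict a,
                      const f32_x8_operand_t *restrict b,
                      int n)
{
    int i;
    for (i = 0; i < n; i++)
        F32_MUL_X8(out[i], a[i], b[i]);
}
```

The restrict qualifiers are an addition of this sketch; they tell the compiler that the three arrays do not overlap, which generally gives a software pipeliner more freedom to schedule loads and stores.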

George Mock over 5 years ago in reply to Colombo Carlo

TI Guru 251220 points

Consider using either #pragma FORCEINLINE or #pragma FUNC_ALWAYS_INLINE. Both of the pragmas affect inlining. Otherwise, they are quite different. Please search for them in the C6000 compiler manual.

Thanks and regards,

-George
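
Editor's sketch: the reply above names #pragma FORCEINLINE without showing its use. As described in the C6000 compiler manual, FORCEINLINE is placed before a statement and applies to the calls made in that statement, whereas FUNC_ALWAYS_INLINE names a function. The following shows the call-site form under stated assumptions: the loop function f32_mul_x8_loop, the f32_x4_t alias and the host stand-in f32_mul_x4_host are hypothetical, the non-TI branch exists only so the sketch compiles off-target, and the thread does not establish that this pragma changes whether the loop is emitted as SPLOOP/SPKERNEL.

```c
#ifdef _TMS320C6600
#include <c6x.h>                 /* TI intrinsics such as _qmpysp */
typedef __x128_t f32_x4_t;
#define F32_MUL_X4(a, b) _qmpysp((a), (b))
#else
/* Assumption: host-side stand-in for a 4 x float vector, for off-target checks only. */
typedef struct { float v[4]; } f32_x4_t;
static f32_x4_t f32_mul_x4_host(f32_x4_t a, f32_x4_t b)
{
    f32_x4_t r;
    int k;
    for (k = 0; k < 4; k++)
        r.v[k] = a.v[k] * b.v[k];
    return r;
}
#define F32_MUL_X4(a, b) f32_mul_x4_host((a), (b))
#endif

typedef struct {
    f32_x4_t lo;
    f32_x4_t hi;
} f32_x8_operand_t;

/* Same function as in the original post, written against the alias above. */
static f32_x8_operand_t f32_mul_x8(f32_x8_operand_t src0, f32_x8_operand_t src1)
{
    f32_x8_operand_t dst;
    dst.lo = F32_MUL_X4(src0.lo, src1.lo);
    dst.hi = F32_MUL_X4(src0.hi, src1.hi);
    return dst;
}

/* Hypothetical caller: requests inlining at the call site rather than
 * on the function definition. */
void f32_mul_x8_loop(f32_x8_operand_t *restrict out,
                     const f32_x8_operand_t *restrict a,
                     const f32_x8_operand_t *restrict b,
                     int n)
{
    int i;
    for (i = 0; i < n; i++) {
#ifdef _TMS320C6600
#pragma FORCEINLINE              /* applies to calls in the next statement only */
#endif
        out[i] = f32_mul_x8(a[i], b[i]);
    }
}
```

Inspecting the compiler-generated assembly for this loop is the only reliable way to tell whether the call was inlined early enough for software pipelining.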
