
Very inefficient code generation: FUNC_ALWAYS_INLINE pragma ignored and overhead from unexpected conditional code



In the artificial testcase below, the compiler (cl6x 7.4.2 with options  -mv6600 -os -k -o3) refuses to inline ffswap() into test_ffswap(), even with a FUNC_ALWAYS_INLINE pragma.

#pragma FUNC_ALWAYS_INLINE(ffswap);

static inline DFF ffswap(DFF dff)
{
    float tmp = dff.x;
    dff.x = dff.y;
    dff.y = tmp;
    return dff;
}

DFF test_ffswap(DFF dff) { return ffswap(dff); }

 

The result is that test_ffswap() takes 30 CPU cycles (9 within test_ffswap() and another 21 within ffswap()), which is 5 times longer than it should take.  That is, with inlining (and proper optimization), test_ffswap() should reduce to three moves (to do the swap), a return instruction, and perhaps a NOP, for a total of 6 cycles.

4505.t.asm

There are two sources of inefficiency.  First, the function is not being inlined.  Second, the compiler generates conditional code in ffswap() that performs additional loads and stores, which significantly increases the function's execution time.

1) Why isn't the function ffswap() being inlined?  How do I get the function to be inlined?

2) What is the conditional code in ffswap() doing?  It's obviously coming from the back end of the compiler, because it does not show up in the optimizer comments.  Is this alignment related, or something else?  How can I avoid this code?

I noticed that these two problems show up together.  When I come across a function that cannot be inlined it usually contains this conditional code, so I assume that the problems are related.

I can force inlining by passing the structure in pieces (see below), but this is very ugly, and the unexpected conditional code then moves into test_ffswap().  The resulting test_ffswap() is twice as fast as the original code (15 cycles compared to 30), but it still takes 2.5 times longer than it should.  If I write the ffswap() function in assembly, it cannot be inlined at all, so that does not solve the problem.

4035.tt.asm

For this case, I could avoid the problem by converting back and forth to the x128_t type for each call, but I need a solution that will also work for structures that are not exactly 128 bits.
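For reference, the "pass the structure in pieces" workaround could look something like the following sketch.  It is a hypothetical reconstruction: the thread shows only the generated assembly, and the DFF definition and the function name ffswap_parts are assumptions.

```c
#include <assert.h>

typedef struct { float x, y; } DFF;   /* assumed definition; not shown in the thread */

/* hypothetical "pieces" version: scalar parameters and outputs, which the
 * compiler inlines without the struct-by-value machinery */
static inline void ffswap_parts(float x, float y, float *ox, float *oy)
{
    *ox = y;
    *oy = x;
}

DFF test_ffswap(DFF dff)
{
    DFF r;
    ffswap_parts(dff.x, dff.y, &r.x, &r.y);
    return r;
}
```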

  • I believe everything you're seeing is a consequence of the way the TI compiler handles structs passed by value.  You can ignore the conditional code; it's not really adding to the problem.  When a struct is passed by value, the compiler actually passes a pointer to the struct, and it is up to the called function to make a copy, which is exactly what's going on here.

    In addition, when a struct is returned by value, the compiler passes an extra first parameter which is a pointer to the intended destination of the struct; the caller is responsible for either creating space for this struct and passing its address, or passing NULL, which indicates that the struct return value will not be used.  That is what the conditional code is checking.

    The compiler is forced to make a copy; it must preserve the value of the struct passed to test_ffswap, so at least one copy must be made.  The real problem here is that the compiler is making yet another copy of the parameter to ffswap, which is a missed optimization opportunity.

    I believe it's refusing to inline because it takes a struct-typed parameter.  I'm not sure offhand why this is a restriction.

    I've submitted SDSCM00047505 to analyze this test case.  It may require a big green alien thinking cap (Shh!)

    As a workaround, consider rewriting the code in this fashion, which inlines very well (it doesn't make test_ffswap any better, but it would be quite efficient where the struct is not passed by value):

    static inline void ffswap(DFF *dff) { float tmp = dff->x; dff->x = dff->y; dff->y = tmp; }
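    In portable C, the lowering described in this answer behaves roughly like the sketch below.  This is not actual compiler output; the name ffswap_lowered and the DFF definition are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y; } DFF;   /* assumed definition */

/* Rough equivalent of what the compiler makes of "DFF ffswap(DFF dff)":
 * the caller passes a pointer to its copy of the argument and a pointer
 * to the return slot, which may be NULL if the result is discarded. */
static void ffswap_lowered(DFF *ret, const DFF *arg)
{
    DFF local = *arg;              /* the callee's mandatory copy */
    float tmp = local.x;
    local.x = local.y;
    local.y = tmp;
    if (ret != NULL)               /* the conditional code: store the result */
        *ret = local;              /* only when the caller asked for it */
}
```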

  • Archaeologist,

    Thanks for your explanation and for submitting the report.   Can you please elaborate a little bit on when/why this occurs?  In particular, consider the following test program:

    #include <complex.h>
    
    complex float
    test64(float x, float y, float z)
    {
        return x * y + z * __I__;
    }
    
    complex double
    test128(double x, double y, double z)
    {
        return x * y + z * __I__;
    }
    

    There are two functions: test64() returns a 64-bit structure, and test128() returns a 128-bit structure.

    Using compiler version 7.4.2 and options -mv6600 -o, the conditional code overhead shows up only in test128(), not test64() - see attached.  Why?

    3681.bug66.asm

     

    Using compiler version 7.4.2 and options -mv6740 -o, the conditional code overhead shows up in both test128() and test64() - see attached.  Why?

    0638.bug674.asm

     

    Presumably there is a difference in calling conventions between the 674x and 66x.  Can you explain what this is and why the behavior differs?

     

     

  • Yes, the calling convention differs between C674x and C66x.  For C66x EABI, small structs (64 bits or less) passed (or returned) by value are actually passed (or returned) by value, not as a pointer.  Complex float types are represented internally as a structure.  In the "complex float" case, the structure fits in 64 bits and is thus returned directly in A5:A4.  In the "complex double" case, the structure does not fit in 64 bits and must be copied to a structure pointed to by A3.
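    The size distinction is easy to check on a host compiler.  The 64-bit rule itself is specific to the C66x EABI, and the sizes below assume the usual 32-bit float and 64-bit double.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* 64 bits total (two floats): fits the small-struct rule, returned in A5:A4 */
static size_t float_complex_bits(void)  { return sizeof(float complex) * 8; }

/* 128 bits total (two doubles): exceeds 64 bits, returned via the pointer in A3 */
static size_t double_complex_bits(void) { return sizeof(double complex) * 8; }
```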

  • Some additional comments that may prove helpful ... When you build for a C66x device by using the build option --silicon_version=6600, the default ABI is the newer EABI, and not the older COFF ABI.  When you build for a C674x device with --silicon_version=6740, the default ABI is COFF ABI, and not EABI.  This difference in ABI is why it is possible for the calling convention to be different.  

    Thanks and regards,

    -George

  • The ffswap function isn't being inlined because of an issue in the compiler.  I don't believe there is a workaround.  This issue will be addressed in the 8.0 version of the compiler (to be released in 2014).