static inlining inconsistent performance

Hi,

In order to make our c6678 code faster, we have replaced a Dot_product() function with a DSPLIB equivalent as shown.

for loop ... {
#if 1
L_tmp = DSP_dotprod ((short *)y2_fx,(short *)y2_fx,L_SUBFR);
#else
L_tmp = Dot_product(y2_fx, y2_fx, L_SUBFR);
#endif
}

We added the DSP_dotprod() source code as a static inline function. This always qualifies the loop and results in faster performance (as expected) in some cases, but dramatically slower performance in others.

What intermediate or report information can the c66x compiler generate so I might get some idea of what's making the difference? If I should post specific code examples (e.g. a "fast" one vs. a "slow" one), please let me know.

-- 

Thanks!

Regards,
Sarvani Chadalapaka
HPC Systems Engineer
Signalogic Inc.

  • Sarvani Chadalapaka said:
    We added the DSP_dotprod() source code as a static inline function. This always qualifies the loop and results in faster performance (as expected) in some cases, but dramatically slower performance in others.

    What is different among these cases?  Different compiler versions?  Different build options?  Different memory configuration?  Different device?

    Thanks and regards,

    -George

  • George Mock said:

    What is different among these cases?  Different compiler versions?  Different build options?  Different memory configuration?  Different device?

    George,

    None. The compiler versions are the same, and I am building for the c6678 device. I am building the same library, so the memory configuration and build options are the same. Is it okay to post source code examples or attach them separately?

    -- 
    Thanks!
    
    Regards,
    Sarvani Chadalapaka
    HPC Systems Engineer
    Signalogic Inc.
  • Dramatic performance deltas on C6x are typically due to software pipelining. Have the compiler keep the assembly code (--keep_asm option) and look at the assembly for the loop. You may find that one version has been software pipelined and the other has not. If so, that would account for the delta, and you'd need to figure out why one did not get software pipelined.
  • Hi Archaeologist,

    I have generated assembly code as you suggested, and I see that there are no loop disqualifications. However, the run time increases when I call the static inlined DSP_dotprod() instead of Dot_product(), as I mentioned in my previous post.

    Here is an excerpt of the generated assembly code without static inlining. Note that in this case, when Dot_product() is called, it takes fewer cycles.

    $C$L114:
    $C$DW$L$pitch_ol_fx$148$B:
    LDW .D2T1 *+SP(2780),A4
    LDW .D2T1 *+SP(2688),A31 ; |851|
    LDW .D2T1 *+SP(2652),A5 ; |859|
    NOP 2
    SADD2 .S1 A3,A4,A3 ; |851|
    EXT .S1 A3,16,16,A11 ; |851|
    SADD2 .S1 A11,A31,A3 ; |853|
    EXT .S1 A3,16,16,A4 ; |853|
    STH .D1T1 A3,*A5 ; |853|
    SUB .L2X B6,A4,B4 ; |859|
    ADDAH .D2 B10,B4,B4 ; |859|
    NOP 1
    $C$DW$126 .dwtag DW_TAG_TI_branch
    .dwattr $C$DW$126, DW_AT_low_pc(0x00)
    .dwattr $C$DW$126, DW_AT_name("Dot_product")
    .dwattr $C$DW$126, DW_AT_TI_call

    CALLP .S2 Dot_product,B3
    || MV .L1X B4,A4 ; |859|

    $C$RL11: ; CALL OCCURS {Dot_product} {0} ; |3378|
    $C$DW$L$pitch_ol_fx$148$E:

    The following is the asm excerpt when the inlined DSPLIB function DSP_dotprod() is used. Note that this version of the code is slower by about 1458019 cycles:

    $C$L112:
    $C$DW$L$pitch_ol_fx$144$B:

    LDW .D2T2 *+SP(2664),B31 ; |851|
    || CMPGT .L1 A25,0,A0 ; |2730|
    || SADD2 .S2X B4,A21,B4 ; |851|
    || ZERO .S1 A3 ; |2722|
    || ZERO .D1 A6 ; |2722|

    [ A0] ADD .L1 3,A25,A7 ; |2731|
    EXT .S2 B4,16,16,B9 ; |851|

    [!A0] B .S1 $C$L116 ; |2730|
    || [ A0] SHR .S2X A7,2,B7 ; |2731|

    [ A0] MVC .S2 B7,ILC
    SADD2 .S2 B9,B31,B4 ; |853|

    EXT .S2 B4,16,16,B5 ; |853|
    || STH .D1T2 B4,*A18 ; |853|

    SUB .L2 B6,B5,B5 ; |859|
    ADDAH .D2 B10,B5,B8 ; |859|
    ; BRANCHCC OCCURS {$C$L116} ; |2730|
    $C$DW$L$pitch_ol_fx$144$E:

    $C$L113: ; PIPED LOOP PROLOG
    .dwpsn file "/root/Signalogic/SIG_C6X/Voice/EVS/src/lib_com/prot_fx.h",line 2730,column 0,is_stmt,isa 0

    SPLOOP 1 ;10 ; (P)
    || MV .L1X B8,A7

    ;** --------------------------------------------------------------------------*
    $C$L114: ; PIPED LOOP KERNEL
    $C$DW$L$pitch_ol_fx$146$B:

    LDDW .D1T1 *A7++,A5:A4 ; |2731| (P) <0,0>
    || LDDW .D2T2 *B8++,B5:B4 ; |2731| (P) <0,0>

    NOP 3

    SPMASK L2
    || MV .L2X A3,B7

    DOTP2 .M2X B4,A4,B6 ; |2731| (P) <0,5>
    || DOTP2 .M1X B5,A5,A3 ; |2732| (P) <0,5>

    NOP 2
    NOP 1
    .dwpsn file "/root/Signalogic/SIG_C6X/Voice/EVS/src/lib_com/prot_fx.h",line 2733,column 0,is_stmt,isa 0

    SPKERNEL 9,0
    || ADD .L2 B6,B7,B7 ; |2731| <0,9>
    || ADD .L1 A3,A6,A6 ; |2732| <0,9>

    $C$DW$L$pitch_ol_fx$146$E:

    Normally, static inlining a small function should reduce the cycles and improve performance. I am trying to understand what it is that I am not doing right to cause this behavior? 

    -- 
    Thanks!
    
    Regards,
    Sarvani Chadalapaka
    HPC Systems Engineer
    Signalogic Inc.
  • How many times does DSP_dotprod get called over the course of the entire program?
  • Archaeologist,

    DSP_dotprod gets called about 388569 times over the course of the entire program.

    --
    Thanks!

    Regards,
    Sarvani Chadalapaka
    HPC Systems Engineer
    Signalogic Inc.

  • Okay. I am trying to figure the magnitude of the for loop and the magnitude of the slowdown. You said the "code" is slower by 1458019 cycles; I need to figure out if that is the delta of the total program cycles, or for each instance of the for loop. The numbers don't make much sense if we consider 1458019 whole program cycles, so I must conclude that you mean that each of the 388569 occurrences of the for loop is on average 1458019 cycles slower; is that correct?
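As a rough sanity check on the two readings, here is a minimal C sketch that computes what each one implies from the two figures quoted above. The helper names are hypothetical and not from the thread; only the two input numbers are. The first helper gives the average extra cycles per call if 1458019 is a whole-program total, and the second gives the whole-program total if 1458019 is the slowdown of every call.

```c
#include <assert.h>

/* Hypothetical helpers for illustration; only the input figures come from the thread. */

/* Reading A: the delta is a whole-program total.
 * Returns the average extra cycles per call. */
static double avg_delta_per_call(long long total_delta_cycles, long long calls)
{
    assert(calls > 0);
    return (double)total_delta_cycles / (double)calls;
}

/* Reading B: the delta applies to every call.
 * Returns the whole-program extra cycles. */
static long long total_delta(long long per_call_delta_cycles, long long calls)
{
    assert(calls > 0);
    return per_call_delta_cycles * calls;
}
```

With the thread's numbers, avg_delta_per_call(1458019, 388569) comes out a little under 4 cycles per call, while total_delta(1458019, 388569) is on the order of 10^11 cycles, so the two readings differ by many orders of magnitude.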
  • Archaeologist,

    Yes, that is correct: the total slowdown of the code, over an average of 388569 function calls, is 1458019 program cycles.
    Here is some more information. We have examined closely a few test cases and what appears to happen is that in some cases inlining can increase the size of a section of code more than if function calls were used, and if the increase is enough, then there can be L1P cache misses that impact performance more than function call overhead. This leads to some (hopefully better) questions:

    1) If we aim to keep inline functions as small as possible, and let the compiler do the best job possible of inlining, would this preclude using pragmas like MUST_ITERATE in the function code? Especially for "library" type functions that cannot make assumptions about loop length or multiple-of-iterations, it would seem that any guidelines we arbitrarily place on inline function code, while beneficial in some cases, might hinder the compiler in other cases.

    2) Re. auto-inlining, TI docs frequently mention the compiler needs to "see" function source in order to make a decision whether to inline. What exactly does this mean? Does the compiler actually compile function source in order to decide? Or does it use rule-of-thumb estimates, for example how many lines of source in the function?

    3) Is there a TI tool or script that can parse compiler asm output and collect and organize total build statistics? For example, currently we take the following steps:

    -generate asm files for a library consisting of say, 100 C or C++ files

    -grep these files for loop disqualifications due to a function call

    -manually look at each asm file disqualification to find which function call, get some idea of number of cycles, and mark candidate functions for possible inlining

    -look at each candidate function C/C++ source, inline and/or optimize if suitable

    As you can imagine, this is extremely tedious. It would seem the first 3 steps could be automated with a script.

    -- 
    Thanks!
    
    Regards,
    Sarvani Chadalapaka
    HPC Systems Engineer
    Signalogic Inc.
  • Sarvani Chadalapaka said:

    1) If we aim to keep inline functions as small as possible, and let the compiler do the best job possible of inlining, would this preclude using pragmas like MUST_ITERATE in the function code?

    No.  A pragma takes no code space, and isn't consulted until after inlining has happened.
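To illustrate the point, here is a minimal sketch of a static inline dot product that carries a MUST_ITERATE pragma. It is not the DSPLIB source: the name dotprod_sketch is hypothetical, the plain C loop stands in for the optimized kernel, and the requirement that the count be a positive multiple of 4 is an assumption made for illustration that should be checked against the DSPLIB documentation for DSP_dotprod. The pragma is guarded so the same code also builds with non-TI compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an inlined dot product; not TI's DSPLIB code.
 * Assumes n is a positive multiple of 4 (illustrative assumption). */
static inline int32_t dotprod_sketch(const int16_t *x, const int16_t *y, int n)
{
    int32_t sum = 0;
    int i;

    /* Precondition check sits outside the loop; define NDEBUG to remove it. */
    assert(n > 0 && n % 4 == 0);

    /* Trip-count hint for the optimizer: at least 4 iterations, a multiple of 4.
     * It generates no instructions of its own. */
#ifdef __TI_COMPILER_VERSION__
    #pragma MUST_ITERATE(4, , 4)
#endif
    for (i = 0; i < n; i++)
        sum += (int32_t)x[i] * y[i];

    return sum;
}
```

A library routine that cannot promise a minimum count or a multiple would weaken or drop the pragma arguments; the hint only helps when the promise actually holds for every caller.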

    Sarvani Chadalapaka said:

    2) Re. auto-inlining, TI docs frequently mention the compiler needs to "see" function source in order to make a decision whether to inline. What exactly does this mean?

    Inlining replaces a call to a function with the body of the called function. The compiler must know what that body contains if it is to insert it; that is true for all inlining.

    Sarvani Chadalapaka said:

    3) Is there a TI tool or script that can parse compiler asm output and collect and organize total build statistics?

    The compiler can generate advice indicating which functions are called from inner loops and might be profitably inlined, but that code has not been kept up to date and may not tell you as much as you'd like to know.  Try --gen_opt_info=2 and look for a file ending in .nfo.

    Sarvani Chadalapaka said:

    -generate asm files for a library consisting of say, 100 C or C++ files

    -grep these files for loop disqualifications due to a function call

    -manually look at each asm file disqualification to find which function call, get some idea of number of cycles, and mark candidate functions for possible inlining

    I could imagine doing these steps, aside from the cycle estimate, with an awk or perl script -- a quick-and-crude version would be something like "egrep -w 'global|disqualified|B'" and then for each disqualification, look up to the .global for the caller and down to the B for the callee.
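The quick-and-crude idea above could be sketched in C roughly as follows. This is an illustration under stated assumptions, not a TI tool: the struct and function names are hypothetical, the "Disqualified loop" marker text is assumed and may need adjusting to your compiler's comments, and the callee is taken from the "CALL OCCURS {name}" comment seen in the excerpt earlier in this thread rather than from the branch instruction itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Scanner state (all names here are hypothetical, for illustration only). */
struct scan_state {
    char caller[128];  /* symbol from the most recent .global line */
    char callee[128];  /* callee of the most recently reported call */
    int pending;       /* nonzero after a call disqualification, until a call is seen */
    int reports;       /* number of caller -> callee pairs reported so far */
};

/* Copy up to n bytes of src into dst (capacity cap), always NUL-terminating. */
static void copy_span(char *dst, size_t cap, const char *src, size_t n)
{
    if (n >= cap)
        n = cap - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
}

/* Feed one line of compiler-generated assembly to the scanner. */
static void scan_line(struct scan_state *st, const char *line)
{
    static const char call_tag[] = "CALL OCCURS {";
    const char *p;

    assert(st != NULL && line != NULL);

    if ((p = strstr(line, ".global")) != NULL) {
        /* Remember the symbol name that follows the directive. */
        p += strlen(".global");
        p += strspn(p, " \t");
        copy_span(st->caller, sizeof st->caller, p, strcspn(p, " \t,;\r\n"));
    } else if (strstr(line, "Disqualified loop") != NULL &&
               strstr(line, "call") != NULL) {
        /* Assumed marker text for a loop disqualified by a function call. */
        st->pending = 1;
    } else if (st->pending && (p = strstr(line, call_tag)) != NULL) {
        /* First call after the disqualification: report caller -> callee. */
        p += strlen(call_tag);
        copy_span(st->callee, sizeof st->callee, p, strcspn(p, "}\r\n"));
        printf("%s -> %s\n", st->caller, st->callee);
        st->reports++;
        st->pending = 0;
    }
}
```

A driver would zero-initialize a struct scan_state, read each .asm file line by line with fgets(), and pass every line to scan_line(). Cycle estimates would still have to be gathered separately.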