C6000-CGT: Loop carried dependency optimization / C6000

Part Number: C6000-CGT

I have a simple loop that the compiler refuses to optimize because of a high loop-carried dependency bound. I boiled the problem down to this very short demonstration loop:

void LoopCarriedDependency(float *restrict * restrict list)
{ int i;
  float * restrict p1;
  float * restrict p2;
  for (i = 0; i < 10; i++)
  { p1 = *list;
    p2 = *list++;
    *p2 = *p1 + 2;
  }
}

Of course, the compiler has to assume that entries in the list array could point to the same memory location, which creates a dependency between iterations, but I want to tell the compiler that this will never be the case. I added the restrict keyword wherever the compiler accepted it, without success. How can I remove the dependency, knowing that all list entries point to different locations?
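
To make the aliasing concern concrete, here is a minimal host-side sketch in plain C99. It is not TI-specific, and the names `add_two_each` and `demo_alias_effect` are hypothetical. The restrict qualifiers are dropped so that calling the loop with two entries pointing at the same float is well defined. In that case the second iteration has to read what the first one stored, which is the ordering the compiler is protecting.

```c
#include <assert.h>
#include <stddef.h>

/* Same loop body as in the question, without restrict qualifiers, so that
   calling it with aliasing entries is well defined. Hypothetical name. */
static void add_two_each(float **list, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        float *p = list[i];
        *p = *p + 2;   /* load, add, store through the same pointer */
    }
}

/* Returns the final value of x after processing a list whose two entries
   both point to x. */
static float demo_alias_effect(void)
{
    float x = 1.0f;
    float *list[2];
    list[0] = &x;
    list[1] = &x;      /* alias: iteration 1 depends on iteration 0's store */
    add_two_each(list, 2);
    return x;          /* 1 + 2 + 2 */
}
```

With distinct entries each element is simply incremented by 2 and the order of iterations does not matter, which is the guarantee the question is trying to communicate to the compiler.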

Here is the assembly output:

;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : ../test.c
;* Loop source line : 5
;* Loop opening brace source line : 6
;* Loop closing brace source line : 9
;* Known Minimum Trip Count : 10
;* Known Maximum Trip Count : 10
;* Known Max Trip Count Factor : 10
;* Loop Carried Dependency Bound(^) : 10
;* Unpartitioned Resource Bound : 2
;* Partitioned Resource Bound(*) : 3
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 0 0
;* .D units 0 3*
;* .M units 0 0
;* .X cross paths 0 0
;* .T address paths 0 0
;* Logical ops (.LS) 0 1 (.L or .S unit)
;* Addition ops (.LSD) 0 0 (.L or .S or .D unit)
;* Bound(.L .S .LS) 0 1
;* Bound(.L .S .D .LS .LSD) 0 2
;*
;* Searching for software pipeline schedule at ...
;* ii = 10 Schedule found with 2 iterations in parallel
;*
;* Register Usage Table:
;* +-----------------------------------------------------------------+
;* |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;* |00000000001111111111222222222233|00000000001111111111222222222233|
;* |01234567890123456789012345678901|01234567890123456789012345678901|
;* |--------------------------------+--------------------------------|
;* 0: | | **** |
;* 1: | | *** |
;* 2: | | *** |
;* 3: | | *** |
;* 4: | | **** |
;* 5: | | *** |
;* 6: | | *** |
;* 7: | | *** |
;* 8: | | *** |
;* 9: | | *** |
;* +-----------------------------------------------------------------+
;*
;* Done
;*
;* Loop will be splooped
;* Collapsed epilog stages : 0
;* Collapsed prolog stages : 0
;* Minimum required memory pad : 0 bytes
;*
;* Minimum safe trip count : 1
;* Min. prof. trip count (est.) : 3
;*
;* Mem bank conflicts/iter(est.) : { min 0.000, est 0.000, max 0.000 }
;* Mem bank perf. penalty (est.) : 0.0%
;*
;*
;* Total cycles (est.) : 10 + min_trip_cnt * 10 = 110
;*----------------------------------------------------------------------------*
;* SINGLE SCHEDULED ITERATION
;*
;* $C$C104:
;* 0 LDW .D2T2 *B6++(4),B5 ; [B_D64P] |6|
;* 1 NOP 4 ; [A_L674]
;* 5 LDW .D2T2 *B5(0),B4 ; [B_D64P] |6| ^
;* 6 NOP 4 ; [A_L674]
;* 10 ADDSP .L2 B7,B4,B4 ; [B_L674] |6| ^
;* 11 NOP 3 ; [A_L674]
;* 14 STW .D2T2 B4,*B5(0) ; [B_D64P] |6| ^
;* || SPBR $C$C104 ; []
;* 15 NOP 5 ; [A_L674]
;* 20 ; BRANCHCC OCCURS {$C$C104} ; [] |5|
;*----------------------------------------------------------------------------*
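
One way to read the bound of 10 from the single scheduled iteration above: the instructions marked with ^ are the LDW at cycle 5, the ADDSP at cycle 10 and the STW at cycle 14. If the next iteration's LDW might read the location this STW writes, it cannot issue before cycle 15, so consecutive iterations start at least 15 - 5 = 10 cycles apart. The sketch below redoes that arithmetic together with the total-cycle estimate from the report. The function names are hypothetical, the cycle numbers are copied from the listing rather than from a datasheet, and the one-cycle store-to-load spacing is an assumption that happens to be consistent with the reported bound.

```c
#include <assert.h>

/* Smallest iteration interval that keeps a possibly aliasing load of
   iteration i+1 after the store of iteration i.
   load_cycle / store_cycle: issue cycles within one scheduled iteration.
   Assumption: the dependent load must issue at least one cycle after
   the store. */
static int min_ii_for_store_to_load(int load_cycle, int store_cycle)
{
    /* The next iteration's load issues at load_cycle + ii and must be
       strictly later than store_cycle. */
    return store_cycle + 1 - load_cycle;
}

/* The report's estimate: fixed overhead + trip count * ii. */
static int estimated_total_cycles(int overhead, int trip_count, int ii)
{
    return overhead + trip_count * ii;
}
```

For the listing above, `min_ii_for_store_to_load(5, 14)` gives 10 and `estimated_total_cycles(10, 10, 10)` gives 110, matching the reported "Loop Carried Dependency Bound" and "Total cycles (est.)" lines.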

  • What version of the compiler is being used?  Please show all the build options exactly as the compiler sees them.  Please copy and paste the text, and do not use a screen shot.

    Thanks and regards,

    -George

  • I'm using version 7.4.24 (because I need the COFF file format).

    ;******************************************************************************
    ;* TMS320C6x C/C++ Codegen PC v7.4.24 *
    ;* Date/Time created: Mon Feb 15 18:04:17 2021 *
    ;******************************************************************************
    .compiler_opts --abi=coffabi --c64p_l1d_workaround=off --endian=little --hll_source=on --long_precision_bits=40 --mem_model:code=near --mem_model:const=data --mem_model:data=far_aggregates --object_format=coff --silicon_version=6740 --symdebug:dwarf

    ;******************************************************************************
    ;* GLOBAL FILE PARAMETERS *
    ;* *
    ;* Architecture : TMS320C674x *
    ;* Optimization : Enabled at level 3 *
    ;* Optimizing for : Speed *
    ;* Based on options: -o3, no -ms *
    ;* Endian : Little *
    ;* Interrupt Thrshld : Disabled *
    ;* Data Access Model : Far Aggregate Data *
    ;* Pipelining : Enabled *
    ;* Speculate Loads : Enabled with threshold = 0 *
    ;* Memory Aliases : Presume not aliases (optimistic) *
    ;* Debug Info : DWARF Debug w/Optimization *
    ;* *
    ;******************************************************************************

    But a newer version (8.3.2) gives the same result:

    ;******************************************************************************
    ;* G3 TMS320C6x C/C++ Codegen PC v8.3.2 *
    ;* Date/Time created: Mon Feb 15 18:08:52 2021 *
    ;******************************************************************************
    .compiler_opts --abi=eabi --array_alignment=8 --c64p_l1d_workaround=off --endian=little --hll_source=on --long_precision_bits=32 --mem_model:code=near --mem_model:const=data --mem_model:data=far_aggregates --object_format=elf --silicon_version=6740 --symdebug:dwarf --symdebug:dwarf_version=3

    ;******************************************************************************
    ;* GLOBAL FILE PARAMETERS *
    ;* *
    ;* Architecture : TMS320C674x *
    ;* Optimization : Enabled at level 3 *
    ;* Optimizing for : Speed *
    ;* Based on options: -o3, no -ms *
    ;* Endian : Little *
    ;* Interrupt Thrshld : Disabled *
    ;* Data Access Model : Far Aggregate Data *
    ;* Pipelining : Enabled *
    ;* Speculate Loads : Enabled with threshold = 0 *
    ;* Memory Aliases : Presume not aliases (optimistic) *
    ;* Debug Info : DWARF Debug *
    ;* *
    ;******************************************************************************

    .asg A15, FP
    .asg B14, DP
    .asg B15, SP
    .global $bss


    $C$DW$CU .dwtag DW_TAG_compile_unit
    .dwattr $C$DW$CU, DW_AT_name("D:/Daten/CHR_NG/ChrocoNextDev/DSP/App/src/test.c")
    .dwattr $C$DW$CU, DW_AT_producer("TI G3 TMS320C6x C/C++ Codegen PC v8.3.2 Copyright (c) 1996-2018 Texas Instruments Incorporated")
    .dwattr $C$DW$CU, DW_AT_TI_version(0x01)
    .dwattr $C$DW$CU, DW_AT_comp_dir("D:\Daten\CHR_NG\ChrocoNextDev\DSP\App\ccs\Debug")
    ; C:\ti\ccs900\ccs\tools\compiler\ti-cgt-c6000_8.3.2\bin\opt6x.exe C:\\Users\\C02FA~1.DIE\\AppData\\Local\\Temp\\{5A289050-465B-46DA-A537-357A97D8CC27} C:\\Users\\C02FA~1.DIE\\AppData\\Local\\Temp\\{FD87524B-C092-437F-B865-70D883C323A0}
    .sect ".text"
    .clink
    .global LoopCarriedDependency

    $C$DW$1 .dwtag DW_TAG_subprogram
    .dwattr $C$DW$1, DW_AT_name("LoopCarriedDependency")
    .dwattr $C$DW$1, DW_AT_low_pc(LoopCarriedDependency)
    .dwattr $C$DW$1, DW_AT_high_pc(0x00)
    .dwattr $C$DW$1, DW_AT_TI_symbol_name("LoopCarriedDependency")
    .dwattr $C$DW$1, DW_AT_external
    .dwattr $C$DW$1, DW_AT_TI_begin_file("D:/Daten/CHR_NG/ChrocoNextDev/DSP/App/src/test.c")
    .dwattr $C$DW$1, DW_AT_TI_begin_line(0x01)
    .dwattr $C$DW$1, DW_AT_TI_begin_column(0x06)
    .dwattr $C$DW$1, DW_AT_decl_file("D:/Daten/CHR_NG/ChrocoNextDev/DSP/App/src/test.c")
    .dwattr $C$DW$1, DW_AT_decl_line(0x01)
    .dwattr $C$DW$1, DW_AT_decl_column(0x06)
    .dwattr $C$DW$1, DW_AT_TI_max_frame_size(0x00)
    .dwpsn file "D:/Daten/CHR_NG/ChrocoNextDev/DSP/App/src/test.c",line 2,column 1,is_stmt,address LoopCarriedDependency,isa 0

    .dwfde $C$DW$CIE, LoopCarriedDependency
    $C$DW$2 .dwtag DW_TAG_formal_parameter
    .dwattr $C$DW$2, DW_AT_name("list")
    .dwattr $C$DW$2, DW_AT_TI_symbol_name("list")
    .dwattr $C$DW$2, DW_AT_type(*$C$DW$T$30)
    .dwattr $C$DW$2, DW_AT_location[DW_OP_reg4]


    ;******************************************************************************
    ;* FUNCTION NAME: LoopCarriedDependency *
    ;* *
    ;* Regs Modified : B4,B5,B6,B7 *
    ;* Regs Used : A4,B3,B4,B5,B6,B7 *
    ;* Local Frame Size : 0 Args + 0 Auto + 0 Save = 0 byte *
    ;******************************************************************************
    LoopCarriedDependency:
    ;** --------------------------------------------------------------------------*
    $C$DW$3 .dwtag DW_TAG_variable
    .dwattr $C$DW$3, DW_AT_name("$O$C1")
    .dwattr $C$DW$3, DW_AT_TI_symbol_name("$O$C1")
    .dwattr $C$DW$3, DW_AT_type(*$C$DW$T$27)
    .dwattr $C$DW$3, DW_AT_location[DW_OP_reg21]

    $C$DW$4 .dwtag DW_TAG_variable
    .dwattr $C$DW$4, DW_AT_name("$O$K6")
    .dwattr $C$DW$4, DW_AT_TI_symbol_name("$O$K6")
    .dwattr $C$DW$4, DW_AT_type(*$C$DW$T$18)
    .dwattr $C$DW$4, DW_AT_location[DW_OP_reg23]

    $C$DW$5 .dwtag DW_TAG_variable
    .dwattr $C$DW$5, DW_AT_name("list")
    .dwattr $C$DW$5, DW_AT_TI_symbol_name("list")
    .dwattr $C$DW$5, DW_AT_type(*$C$DW$T$30)
    .dwattr $C$DW$5, DW_AT_location[DW_OP_reg22]

    .dwcfi cfa_offset, 0
    ; EXCLUSIVE CPU CYCLES: 2
    ;** 2 ----------------------- list = list;
    ;** 5 ----------------------- L$1 = 10;
    ;** ----------------------- K$6 = 2.0F;
    ;** ----------------------- #pragma MUST_ITERATE(10, 10, 10)
    ;** ----------------------- #pragma LOOP_FLAGS(4096u)
    ;** -----------------------g2:
    ;** 6 ----------------------- C$1 = *list++;
    ;** 6 ----------------------- *C$1 = *C$1+K$6;
    ;** 5 ----------------------- if ( L$1 = L$1-1 ) goto g2;
    .dwpsn file "D:/Daten/CHR_NG/ChrocoNextDev/DSP/App/src/test.c",line 5,column 13,is_stmt,isa 0
    MVK .L2 9,B6 ; [B_L674] |5|
    MVC .S2 B6,ILC ; [B_Sb674]
    ;*----------------------------------------------------------------------------*
    ;* SOFTWARE PIPELINE INFORMATION
    ;*
    ;* Loop found in file : D:/Daten/CHR_NG/ChrocoNextDev/DSP/App/src/test.c
    ;* Loop source line : 5
    ;* Loop opening brace source line : 6
    ;* Loop closing brace source line : 9
    ;* Known Minimum Trip Count : 10
    ;* Known Maximum Trip Count : 10
    ;* Known Max Trip Count Factor : 10
    ;* Loop Carried Dependency Bound(^) : 10
    ;* Unpartitioned Resource Bound : 2
    ;* Partitioned Resource Bound(*) : 3
    ;* Resource Partition:
    ;* A-side B-side
    ;* .L units 0 0
    ;* .S units 0 0
    ;* .D units 0 3*
    ;* .M units 0 0
    ;* .X cross paths 0 0
    ;* .T address paths 0 0
    ;* Logical ops (.LS) 0 1 (.L or .S unit)
    ;* Addition ops (.LSD) 0 0 (.L or .S or .D unit)
    ;* Bound(.L .S .LS) 0 1
    ;* Bound(.L .S .D .LS .LSD) 0 2
    ;*
    ;* Searching for software pipeline schedule at ...
    ;* ii = 10 Schedule found with 2 iterations in parallel
    ;*
    ;* Register Usage Table:
    ;* +-----------------------------------------------------------------+
    ;* |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
    ;* |00000000001111111111222222222233|00000000001111111111222222222233|
    ;* |01234567890123456789012345678901|01234567890123456789012345678901|
    ;* |--------------------------------+--------------------------------|
    ;* 0: | | **** |
    ;* 1: | | *** |
    ;* 2: | | *** |
    ;* 3: | | *** |
    ;* 4: | | **** |
    ;* 5: | | *** |
    ;* 6: | | *** |
    ;* 7: | | *** |
    ;* 8: | | *** |
    ;* 9: | | *** |
    ;* +-----------------------------------------------------------------+
    ;*
    ;* Done
    ;*
    ;* Loop will be splooped
    ;* Collapsed epilog stages : 0
    ;* Collapsed prolog stages : 0
    ;* Minimum required memory pad : 0 bytes
    ;*
    ;* Minimum safe trip count : 1
    ;* Min. prof. trip count (est.) : 3
    ;*
    ;* Mem bank conflicts/iter(est.) : { min 0.000, est 0.000, max 0.000 }
    ;* Mem bank perf. penalty (est.) : 0.0%
    ;*
    ;*
    ;* Total cycles (est.) : 10 + min_trip_cnt * 10 = 110
    ;*----------------------------------------------------------------------------*
    ;* SINGLE SCHEDULED ITERATION
    ;*
    ;* $C$C104:
    ;* 0 LDW .D2T2 *B6++(4),B5 ; [B_D64P] |6|
    ;* 1 NOP 4 ; [A_L674]
    ;* 5 LDW .D2T2 *B5(0),B4 ; [B_D64P] |6| ^
    ;* 6 NOP 4 ; [A_L674]
    ;* 10 ADDSP .L2 B7,B4,B4 ; [B_L674] |6| ^
    ;* 11 NOP 3 ; [A_L674]
    ;* 14 STW .D2T2 B4,*B5(0) ; [B_D64P] |6| ^
    ;* || SPBR $C$C104 ; []
    ;* 15 NOP 5 ; [A_L674]
    ;* 20 ; BRANCHCC OCCURS {$C$C104} ; [] |5|

    ;*----------------------------------------------------------------------------*

    Christoph Dietz

  • Thank you for the additional information.  I filed EXT_EP-10251 to have this investigated.  You are welcome to follow it with that link.  Because the generated code is correct, the issue is filed not as a bug, but as an enhancement request.  In this specific case, it requests that the compiler generate code that runs much faster.

    Christoph Dietz said:
    I'm using version 7.4.24 (because I need the COFF file format).

    The 7.4.x series of releases is inactive, i.e. no further releases are planned.  Version 7.4.24 is the last version of the C6000 compiler to support the old COFF ABI.  Unfortunately, this means there will be no fix supplied in a compiler version that supports COFF ABI.  In similar cases, we usually find a workaround that allows you to move past the issue.

    Thanks and regards,

    -George

  • Thank you for filing the enhancement request! In the meantime I found a workaround that might be interesting for other users:

    If I introduce an arbitrary condition on the write-back store, the compiler correctly no longer assumes a loop-carried dependency and software-pipelines the loop as expected, at the cost of the resources needed to evaluate the condition. The result is worth it, especially for loops longer than this test case: the iteration interval drops from 10 cycles to 2 cycles.

    #include <float.h>   /* FLT_MAX */

    void LoopCarriedDependency(float ** list)
    { int i;
      float * p1;
      float * restrict p2;
      for (i = 0; i < 10; i++)
      { p1 = *list;
        p2 = *list++;
        if (*p1 < FLT_MAX)
          *p2 = *p1 + 2;
      }
    }

    is compiled to:

    ;*----------------------------------------------------------------------------*
    ;* SOFTWARE PIPELINE INFORMATION
    ;*
    ;* Loop found in file : D:/Christoph/Precitec/chr_NG/soft/Dasganzezeugs/Chroconextdev/DSP/App/src/test.c
    ;* Loop source line : 8
    ;* Loop opening brace source line : 9
    ;* Loop closing brace source line : 13
    ;* Known Minimum Trip Count : 10
    ;* Known Maximum Trip Count : 10
    ;* Known Max Trip Count Factor : 10
    ;* Loop Carried Dependency Bound(^) : 0
    ;* Unpartitioned Resource Bound : 2
    ;* Partitioned Resource Bound(*) : 2
    ;* Resource Partition:
    ;* A-side B-side
    ;* .L units 0 0
    ;* .S units 1 0
    ;* .D units 2* 1
    ;* .M units 0 0
    ;* .X cross paths 0 2*
    ;* .T address paths 2* 1
    ;* Long read paths 0 0
    ;* Long write paths 0 0
    ;* Logical ops (.LS) 0 1 (.L or .S unit)
    ;* Addition ops (.LSD) 0 1 (.L or .S or .D unit)
    ;* Bound(.L .S .LS) 1 1
    ;* Bound(.L .S .D .LS .LSD) 1 1
    ;*
    ;* Searching for software pipeline schedule at ...
    ;* ii = 2 Schedule found with 8 iterations in parallel
    ;*
    ;* Register Usage Table:
    ;* +-----------------------------------------------------------------+
    ;* |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
    ;* |00000000001111111111222222222233|00000000001111111111222222222233|
    ;* |01234567890123456789012345678901|01234567890123456789012345678901|
    ;* |--------------------------------+--------------------------------|
    ;* 0: |* * ** | ***** |
    ;* 1: |* **** | * * * |
    ;* +-----------------------------------------------------------------+
    ;*
    ;* Done
    ;*
    ;* Loop will be splooped
    ;* Collapsed epilog stages : 0
    ;* Collapsed prolog stages : 0
    ;* Minimum required memory pad : 0 bytes
    ;*
    ;* Minimum safe trip count : 1
    ;* Min. prof. trip count (est.) : 2
    ;*
    ;* Mem bank conflicts/iter(est.) : { min 0.000, est 0.125, max 1.000 }
    ;* Mem bank perf. penalty (est.) : 5.9%
    ;*
    ;* Effective ii : { min 2.00, est 2.13, max 3.00 }
    ;*
    ;*
    ;* Total cycles (est.) : 14 + min_trip_cnt * 2 = 34
    ;*----------------------------------------------------------------------------*
    ;* SINGLE SCHEDULED ITERATION
    ;*
    ;* $C$C29:
    ;* 0 LDW .D1T1 *A6++,A3 ; |10|
    ;* 1 NOP 4
    ;* 5 MV .S2X A3,B5 ; |10|
    ;* || LDW .D1T1 *A3,A3 ; |11|
    ;* 6 ROTL .M2 B5,0,B6 ; |10| Split a long life
    ;* 7 NOP 2
    ;* 9 MVD .M2 B6,B4 ; |10| Split a long life
    ;* 10 CMPLTSP .S1 A3,A5,A4 ; |11|
    ;* || ADDSP .L2X B8,A3,B7 ; |12|
    ;* 11 ROTL .M1 A4,0,A0 ; |11| Split a long life
    ;* 12 NOP 2
    ;* 14 [ A0] STW .D2T2 B7,*B4 ; |12|
    ;* || SPBR $C$C29
    ;* 15 NOP 1
    ;* 16 ; BRANCHCC OCCURS {$C$C29} ; |8|
    ;*----------------------------------------------------------------------------*
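
For anyone adopting the guarded store, it may be worth checking on a host machine that the guard does not change results. The sketch below assumes IEEE 754 single precision with round-to-nearest; the names `add_two_plain`, `add_two_guarded` and `variants_agree` are hypothetical, and restrict is dropped because it plays no role in a host-side check. Under that assumption the guard only skips FLT_MAX and +infinity, and adding 2 to either of them yields the same value again, so both variants should agree for every non-NaN input. A NaN input stays NaN in both variants, but it is left out of the data because NaN never compares equal.

```c
#include <assert.h>
#include <float.h>
#include <math.h>
#include <stddef.h>

#define N 10

/* Original loop from the question (restrict dropped for the host check). */
static void add_two_plain(float **list)
{
    int i;
    for (i = 0; i < N; i++) {
        float *p = *list++;
        *p = *p + 2;
    }
}

/* Workaround loop: store only when the loaded value is below FLT_MAX. */
static void add_two_guarded(float **list)
{
    int i;
    for (i = 0; i < N; i++) {
        float *p = *list++;
        if (*p < FLT_MAX)
            *p = *p + 2;
    }
}

/* Runs both variants on identical data; returns 1 if all results match. */
static int variants_agree(void)
{
    float a[N] = { 0.0f, -1.5f, 1.0e30f, -FLT_MAX, FLT_MAX,
                   INFINITY, -INFINITY, 3.0f, 4.0f, 5.0f };
    float b[N];
    float *la[N], *lb[N];
    int i;
    for (i = 0; i < N; i++) {
        b[i] = a[i];       /* identical copy for the second variant */
        la[i] = &a[i];
        lb[i] = &b[i];
    }
    add_two_plain(la);
    add_two_guarded(lb);
    for (i = 0; i < N; i++)
        if (a[i] != b[i])
            return 0;
    return 1;
}
```

This only compares the numerical results of the two loops on a host compiler. It says nothing about the schedule the C6000 compiler produces, which still has to be read from the software pipeline information in the generated assembly.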

  • Thank you for posting a workaround.  I added it to the entry I filed.

    Thanks and regards,

    -George