Simple nested loop not pipelining on TMS320C64x+

Hello,

The following loop is not pipelining. The message CCS is giving me is "advice #30003:  (Performance) Loop cannot be scheduled efficiently, as it contains complex conditional expression. Try to simplify condition."

	int a, b, ar[100] = {0};

	for (a = 0; a < ST_N_ROWS; a++)
		for (b = 0; b < 100; b++)
			ar[b]++;

Searching online, solutions always deal with fixing nested if-statements, but not nested loops. I've tried unrolling, and a variety of minor tweaks, with no success. -O3 is set during compilation.

Does anyone have any advice?
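
For readers who want to reproduce this, here is a minimal self-contained sketch of the loop wrapped in a function. The names `N_ROWS`, `N_COLS`, and `bump_rows` are made up for illustration (`N_ROWS` stands in for `ST_N_ROWS`, whose value is not given above). The `MUST_ITERATE` pragma is a TI compiler trip-count hint, guarded here so the file also builds with other compilers. Whether it has any effect on advice #30003 has not been verified.

```c
#include <assert.h>
#include <stddef.h>

#define N_ROWS 1000   /* assumption: stands in for ST_N_ROWS, value not given */
#define N_COLS 100    /* size of ar[] in the post */

/* Hypothetical helper: increments every element of ar once per row. */
void bump_rows(int *ar)
{
    int a, b;

    assert(ar != NULL);
    for (a = 0; a < N_ROWS; a++) {
#ifdef __TI_COMPILER_VERSION__
        /* TI-specific hint: the inner loop runs exactly 100 times.
           Keep the literals in sync with N_COLS. */
        #pragma MUST_ITERATE(100, 100, 100)
#endif
        for (b = 0; b < N_COLS; b++)
            ar[b]++;
    }
}
```

Calling `bump_rows` on a zero-initialized array of `N_COLS` ints should leave every element equal to `N_ROWS`.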

  • The processor is the TMS320C64x+.

  • Hi Andrei,

    Welcome to the TI E2E forum. I hope you will find many good answers here and in the TI.com documents and in the TI Wiki Pages (for processor issues). Be sure to search those for helpful information and to browse for the questions others may have asked on similar topics (e2e.ti.com).

    I tried to reproduce the issue by creating a simple hello world example and pasting in your code, but I was not able to. I am using CCSv5.5 and compiler v7.4.4.

    What is your compiler version, and CCS version?

    You can also refer to the Optimizing Compiler User's Guide.
    Section 3.2 introduces optimizing software pipelining,
    and Section 3.6 covers file-level optimization.

    The user guide is available at the link below.
    www.ti.com/.../spru187v.pdf
  • Hi Arvind,

    Some notes:

    • CCS: v6.1.0.00104

    • Compiler: TI v8.0.3; the remark is also present with v7.4.12 and v6.1.23

    • The following compiler flags are set:
      • --mv6400+ --abi=eabi -O3 --opt_for_speed=4 --include_path="D:/TI/ccsv6/tools/compiler/ti-cgt-c6000_8.0.3/include" --advice:performance -g  --issue_remarks --verbose_diagnostics --diag_warning=225 --gen_func_subsections=on --debug_software_pipeline --gen_opt_info=2 --gen_profile_info -k --c_src_interlist --asm_listing --output_all_syms

    • The following linker flags are set:
      • -mv6400+ -g -O3 --verbose_diagnostics --diag_warning=225 --issue_remarks --gen_func_subsections --debug_software_pipeline --opt_for_speed=4 --gen_opt_info=2 --gen_profile_info -k --c_src_interlist --asm_listing --output_all_syms -z -m"dsp.map" --warn_sections -i"D:/TI/ccsv6/tools/compiler/ti_cgt_c6000_6.1.23/lib" -i"D:/TI/ccsv6/tools/compiler/ti_cgt_c6000_6.1.23/include" --reread_libs --xml_link_info="dsp_linkInfo.xml" --rom_model

    • The file with the remark has been stripped to:
      • void main ()
        {
        	int a, b, ar[100] = {0};
        
        	for (a = 0; a < 1000; a++)
        	    for (b = 0; b < 100; b++)
        	        ar[b]++;
        
        	while(1);
        }

    • This loop is a test case for a more complex example that exhibits the same remark

  • Hello,

    I'm afraid there is a compiler flaw. I quickly tried your snippet in existing projects, in the following form:

    void foo(void)
    {
        int a, b, ar[100] = {0};
    
        for (a = 0; a < 1000; a++)
            for (b = 0; b < 100; b++)
                ar[b]++;
    
        while(1);
        //return;
    }
    

    In CCS 3.3 with v6.0.31 for C641x, the loop is unrolled and the .nfo advice contains the following message:

    		======Unroll-and-jam Result Summary======
    
    LOOP#1 in foo() is unroll-and-jammed by a factor of 8
    	old_II = 1   new_II = 1 
    	reg_pre_before = 6 reg_pre_after = 0
    	inner loop size = 24   outer loop size = 48
    
    

    and the assembly contains a large amount of code matching that unrolling.

    With CCS 6.1 and v7.4.14 I see the message that the loop was disqualified because of control code. However, I believe the loop was unrolled. The interlisted assembly contains something like this:

    ;*----------------------------------------------------------------------------*
    ;*   SOFTWARE PIPELINE INFORMATION
    ;*      Disqualified loop: Loop contains control code
    ;*      Disqualified loop: Bad loop structure
    ;*----------------------------------------------------------------------------*
    $C$L1:    
    $C$DW$L$foo$3$B:
    ;**	-----------------------g2:
    ;** 8	-----------------------    ((int (*)[2])ar)[0] = _dadd_c(1, ((int (*)[2])ar)[0]);
    ;** 8	-----------------------    ((int (*)[2])ar)[0] = _dadd_c(1, ((int (*)[2])ar)[0]);
    ;** 8	-----------------------    ((int (*)[2])ar)[1] = _dadd_c(1, ((int (*)[2])ar)[1]);
    ;** 8	-----------------------    ((int (*)[2])ar)[1] = _dadd_c(1, ((int (*)[2])ar)[1]);
    ;** 8	-----------------------    ((int (*)[2])ar)[2] = _dadd_c(1, ((int (*)[2])ar)[2]);
    ;** 8	-----------------------    ((int (*)[2])ar)[2] = _dadd_c(1, ((int (*)[2])ar)[2]);
    ;** 8	-----------------------    ((int (*)[2])ar)[3] = _dadd_c(1, ((int (*)[2])ar)[3]);
    ;** 8	-----------------------    ((int (*)[2])ar)[3] = _dadd_c(1, ((int (*)[2])ar)[3]);
    ;** 8	-----------------------    ((int (*)[2])ar)[4] = _dadd_c(1, ((int (*)[2])ar)[4]);
    ;** 8	-----------------------    ((int (*)[2])ar)[4] = _dadd_c(1, ((int (*)[2])ar)[4]);
    ;** 8	-----------------------    ((int (*)[2])ar)[5] = _dadd_c(1, ((int (*)[2])ar)[5]);
    ;** 8	-----------------------    ((int (*)[2])ar)[5] = _dadd_c(1, ((int (*)[2])ar)[5]);
    ;** 8	-----------------------    ((int (*)[2])ar)[6] = _dadd_c(1, ((int (*)[2])ar)[6]);
    ;** 8	-----------------------    ((int (*)[2])ar)[6] = _dadd_c(1, ((int (*)[2])ar)[6]);
    ;** 8	-----------------------    ((int (*)[2])ar)[7] = _dadd_c(1, ((int (*)[2])ar)[7]);
    ;** 8	-----------------------    ((int (*)[2])ar)[7] = _dadd_c(1, ((int (*)[2])ar)[7]);
    ;** 8	-----------------------    ((int (*)[2])ar)[8] = _dadd_c(1, ((int (*)[2])ar)[8]);
    ;** 8	-----------------------    ((int (*)[2])ar)[8] = _dadd_c(1, ((int (*)[2])ar)[8]);
    ;** 8	-----------------------    ((int (*)[2])ar)[9] = _dadd_c(1, ((int (*)[2])ar)[9]);
    ;** 8	-----------------------    ((int (*)[2])ar)[9] = _dadd_c(1, ((int (*)[2])ar)[9]);
    ;** 8	-----------------------    ((int (*)[2])ar)[10] = _dadd_c(1, ((int (*)[2])ar)[10]);
    ;** 8	-----------------------    ((int (*)[2])ar)[10] = _dadd_c(1, ((int (*)[2])ar)[10]);
    ;** 8	-----------------------    ((int (*)[2])ar)[11] = _dadd_c(1, ((int (*)[2])ar)[11]);
    ;** 8	-----------------------    ((int (*)[2])ar)[11] = _dadd_c(1, ((int (*)[2])ar)[11]);
    ;** 8	-----------------------    ((int (*)[2])ar)[12] = _dadd_c(1, ((int (*)[2])ar)[12]);
    ;** 8	-----------------------    ((int (*)[2])ar)[12] = _dadd_c(1, ((int (*)[2])ar)[12]);
    ;** 8	-----------------------    ((int (*)[2])ar)[13] = _dadd_c(1, ((int (*)[2])ar)[13]);
    ;** 8	-----------------------    ((int (*)[2])ar)[13] = _dadd_c(1, ((int (*)[2])ar)[13]);
    ;** 8	-----------------------    ((int (*)[2])ar)[14] = _dadd_c(1, ((int (*)[2])ar)[14]);
    ;** 8	-----------------------    ((int (*)[2])ar)[14] = _dadd_c(1, ((int (*)[2])ar)[14]);
    ;** 8	-----------------------    *(C$5 = &ar+120) = _dadd_c(1, *(C$6 = &ar+120));
    ;** 8	-----------------------    ((int (*)[2])ar)[15] = _dadd_c(1, ((int (*)[2])ar)[15]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[1] = _dadd_c(1, ((int (*)[2])C$6)[1]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[1] = _dadd_c(1, ((int (*)[2])C$6)[1]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[2] = _dadd_c(1, ((int (*)[2])C$6)[2]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[2] = _dadd_c(1, ((int (*)[2])C$6)[2]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[3] = _dadd_c(1, ((int (*)[2])C$6)[3]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[3] = _dadd_c(1, ((int (*)[2])C$6)[3]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[4] = _dadd_c(1, ((int (*)[2])C$6)[4]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[4] = _dadd_c(1, ((int (*)[2])C$6)[4]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[5] = _dadd_c(1, ((int (*)[2])C$6)[5]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[5] = _dadd_c(1, ((int (*)[2])C$6)[5]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[6] = _dadd_c(1, ((int (*)[2])C$6)[6]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[6] = _dadd_c(1, ((int (*)[2])C$6)[6]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[7] = _dadd_c(1, ((int (*)[2])C$6)[7]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[7] = _dadd_c(1, ((int (*)[2])C$6)[7]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[8] = _dadd_c(1, ((int (*)[2])C$6)[8]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[8] = _dadd_c(1, ((int (*)[2])C$6)[8]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[9] = _dadd_c(1, ((int (*)[2])C$6)[9]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[9] = _dadd_c(1, ((int (*)[2])C$6)[9]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[10] = _dadd_c(1, ((int (*)[2])C$6)[10]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[10] = _dadd_c(1, ((int (*)[2])C$6)[10]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[11] = _dadd_c(1, ((int (*)[2])C$6)[11]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[11] = _dadd_c(1, ((int (*)[2])C$6)[11]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[12] = _dadd_c(1, ((int (*)[2])C$6)[12]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[12] = _dadd_c(1, ((int (*)[2])C$6)[12]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[13] = _dadd_c(1, ((int (*)[2])C$6)[13]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[13] = _dadd_c(1, ((int (*)[2])C$6)[13]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[14] = _dadd_c(1, ((int (*)[2])C$6)[14]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[14] = _dadd_c(1, ((int (*)[2])C$6)[14]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[15] = _dadd_c(1, ((int (*)[2])C$6)[15]);
    ;** 8	-----------------------    ((int (*)[2])C$5)[15] = _dadd_c(1, ((int (*)[2])C$6)[15]);
    ;** 8	-----------------------    *(C$3 = &ar+248) = _dadd_c(1, *(C$4 = &ar+248));
    	.dwpsn	file "../test.c",line 8,column 13,is_stmt,isa 0
    
               LDDW    .D2T2   *+SP(104),B9:B8   ; |8| 
    ||         MVK     .S1     128,A4
    
               LDDW    .D2T2   *+SP(96),B17:B16  ; |8| 
    ||         ADD     .L1X    SP,A4,A4
    
               LDDW    .D2T2   *+SP(88),B19:B18  ; |8| 
    ||         LDDW    .D1T1   *+A4(120),A1:A0   ; |8| 
    
               LDDW    .D2T2   *+SP(80),B21:B20  ; |8| 
               LDDW    .D2T2   *+SP(72),B23:B22  ; |8| 
    
    

    followed by the assembly matching those _dadd_c calls.

    It is interesting to note that if I replace the very last statement, while(1), with a bare return, the situation changes. This time the loop is both unrolled and pipelined:

    ;** --------------------------------------------------------------------------*
    ;**   BEGIN LOOP $C$L1
    ;** --------------------------------------------------------------------------*
    $C$L1:    
    $C$DW$L$foo$3$B:
    ;**	-----------------------g2:
    ;** 7	-----------------------    L$2 = 50;
    ;**  	-----------------------    U$9 = &ar;
    ;**  	-----------------------    #pragma MUST_ITERATE(50, 50, 50)
    ;**  	-----------------------    // LOOP BELOW UNROLLED BY FACTOR(2)
    ;**  	-----------------------    #pragma LOOP_FLAGS(4098u)
    ;**	-----------------------g3:
    ;** 8	-----------------------    *U$9 = _dadd_c(1, *U$9);
    ;** 8	-----------------------    *U$9 = _dadd_c(1, *U$9);
    ;** 8	-----------------------    *U$9 = _dadd_c(1, *U$9);
    ;** 8	-----------------------    *U$9 = _dadd_c(1, *U$9);
    ;** 8	-----------------------    *U$9 = _dadd_c(1, *U$9);
    ;** 8	-----------------------    *U$9 = _dadd_c(1, *U$9);
    ;** 8	-----------------------    *U$9 = _dadd_c(1, *U$9);
    ;** 8	-----------------------    *U$9 = _dadd_c(1, *U$9);
    ;** 7	-----------------------    U$9 += 8;
    ;** 7	-----------------------    if ( L$2 = L$2-1 ) goto g3;
    	.dwpsn	file "../test.c",line 7,column 21,is_stmt,isa 0
               ADD     .L1X    8,SP,A3
               LDDW    .D1T1   *A3++,A7:A6       ; |8| (P) <0,0> 
               ADD     .L2     8,SP,B16
               DINT                              ; interrupts off
               NOP             2
    
               DADD    .L2X    1,A7:A6,B5:B4     ; |8| (P) <0,6> 
    ||         LDDW    .D1T1   *A3++,A7:A6       ; |8| (P) <1,0> 
    
               STDW    .D2T2   B5:B4,*B16++      ; |8| (P) <0,11> 
               LDDW    .D2T2   *-B16(8),B9:B8    ; |8| (P) <0,12> 
               NOP             2
    
               DADD    .L2X    1,A7:A6,B5:B4     ; |8| (P) <1,6> 
    ||         LDDW    .D1T1   *A3++,A7:A6       ; |8| (P) <2,0> 
    
               STDW    .D2T2   B5:B4,*B16++      ; |8| (P) <1,11> 
               NOP             1
    $C$DW$L$foo$3$E:
    ;*----------------------------------------------------------------------------*
    ;*   SOFTWARE PIPELINE INFORMATION
    ;*
    ;*      Loop found in file               : ../test.c
    ;*      Loop source line                 : 7
    ;*      Loop opening brace source line   : 8
    ;*      Loop closing brace source line   : 8
    ;*      Loop Unroll Multiple             : 2x
    ;*      Known Minimum Trip Count         : 50                    
    ;*      Known Maximum Trip Count         : 50                    
    ;*      Known Max Trip Count Factor      : 50
    ;*      Loop Carried Dependency Bound(^) : 0
    ;*      Unpartitioned Resource Bound     : 8
    ;*      Partitioned Resource Bound(*)    : 8
    ;*      Resource Partition:
    ;*                                A-side   B-side
    ;*      .L units                     0        0     
    ;*      .S units                     1        0     
    ;*      .D units                     8*       8*    
    ;*      .M units                     0        0     
    ;*      .X cross paths               0        2     
    ;*      .T address paths             8*       8*    
    ;*      Long read paths              0        0     
    ;*      Long write paths             0        0     
    ;*      Logical  ops (.LS)           4        5     (.L or .S unit)
    ;*      Addition ops (.LSD)          0        0     (.L or .S or .D unit)
    ;*      Bound(.L .S .LS)             3        3     
    ;*      Bound(.L .S .D .LS .LSD)     5        5     
    ;*
    ;*      Searching for software pipeline schedule at ...
    ;*         ii = 8  Schedule found with 10 iterations in parallel
    ;*      Done
    ;*
    ;*      Epilog not removed
    ;*      Collapsed epilog stages       : 0
    ;*
    ;*      Prolog not entirely removed
    ;*      Collapsed prolog stages       : 5
    ;*
    ;*      Minimum required memory pad   : 0 bytes
    ;*
    ;*      For further improvement on this loop, try option -mh72
    ;*
    ;*      Minimum safe trip count       : 9 (after unrolling)
    ;*----------------------------------------------------------------------------*
    

    I hope someone from TI will take a look at this case.
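
For readers unfamiliar with the term in the .nfo summary above, here is a hand-written sketch of what unroll-and-jam by a factor of 8 amounts to for this loop nest: the outer loop is unrolled 8 times and the resulting copies of the inner loop are fused ("jammed") into one. It only illustrates the transformation and is not the compiler's actual output. The function names and the `ROWS`, `COLS`, and `JAM` macros are hypothetical.

```c
#include <assert.h>

#define ROWS 1000
#define COLS 100
#define JAM  8      /* unroll factor reported in the .nfo summary */

/* Reference: the loop nest as written in the test case. */
void increment_plain(int *ar)
{
    int a, b;

    for (a = 0; a < ROWS; a++)
        for (b = 0; b < COLS; b++)
            ar[b]++;
}

/* Outer loop unrolled by JAM, with the inner loop copies fused into one body. */
void increment_jammed(int *ar)
{
    int a, b;

    assert(ROWS % JAM == 0);   /* this sketch assumes no remainder iterations */
    for (a = 0; a < ROWS; a += JAM)
        for (b = 0; b < COLS; b++) {
            ar[b]++; ar[b]++; ar[b]++; ar[b]++;
            ar[b]++; ar[b]++; ar[b]++; ar[b]++;
        }
}
```

Both functions leave the array in the same state. The jammed form exposes 8 independent increments per inner iteration, which is the kind of work the `_dadd_c` lines in the interlisted output appear to be packing.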

  • Removing '--gen_profile_info' from the compiler flags solved the issue. My loops have been splooped. Many thanks for the help!

  • Hi,

    We are not sure that it is a compiler flaw. To confirm this, we will move this thread to the compiler forum, where the experts can answer it better.

    Thanks & regards,
    Sivaraj K
  • Andrei Khramtsov said:
    Removing '--gen_profile_info' from the compiler flags solved the issue.

    I agree that this was the problem the whole time.

    Thanks and regards,

    -George

  • George,

    I did not have the '--gen_profile_info' option in either case, and the output was different for an infinite while(1) versus a bare return after the loop. Isn't that worth attention?
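
One way to examine the two cases in isolation, offered only as a sketch: move the loop nest into its own function so that the trailing `while(1)` lives in the caller rather than in the function that contains the loop. The names `increment_all` and `foo_forever` are hypothetical. It has not been verified in this thread that this changes the pipelining result, and at -O3 the compiler may inline the helper anyway.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper holding only the loop nest from the test case. */
void increment_all(int *ar, int rows, int cols)
{
    int a, b;

    assert(ar != NULL);
    for (a = 0; a < rows; a++)
        for (b = 0; b < cols; b++)
            ar[b]++;
}

/* The caller keeps the infinite loop outside the function with the kernel. */
void foo_forever(void)
{
    int ar[100] = {0};

    increment_all(ar, 1000, 100);
    while (1)
        ;   /* never returns, as in the original test case */
}
```

Comparing the software pipeline information emitted for `increment_all` against the original `foo()` would show whether the control flow after the loop is what triggers the disqualification.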

  • rrlagic said:
    With CCS 6.1 and v7.4.14 I see the message that loop was disqualified because of control code.

    I cannot reproduce this result.  Please send the exact source and build options used.

    Thanks and regards,

    -George

  • Hello,

    The source code is

    void foo(void)
    {
        int a, b, ar[100] = {0};
    
        for (a = 0; a < 1000; a++)
            for (b = 0; b < 100; b++)
                ar[b]++;
    
        //while(1);
        return;
    }
    

    It is in a separate file. For that file I have set file-specific options as follows:

    -mv64+ --abi=eabi -O3 -ms0 -g --include_path="C:/TI/ccsv6/tools/compiler/c6000_7.4.14/include" --include_path="C:/TI/pdk_C6670_1_1_2_6/packages/ti/drv/cppi" --include_path="C:/TI/pdk_C6670_1_1_2_6/packages/ti/drv/qmss" --include_path="C:/TI/dsplib_c66x_3_2_0_1" --include_path="C:/TI/pdk_C6670_1_1_2_6/packages/ti/platform" --define=_TMS320C66 --define=_MTP300_ --define=_L2SRAM2_4_ --define=TC_PLATFORM_V_0_5 --define=_FFT_WINDOW_COMPENSATION_ --define=_LOOPBACK_DATA_ --display_error_number --diag_warning=225 -k --quiet

    I build only this file. Please see the attached source with generated assembly for both cases: while(1) and return at the end of the function.

    6330.test.zip

  • Thank you for the test case.  I'm sorry to report I still cannot reproduce the problem.  I build that exact source with those exact options.  I look at the assembly and I see one loop in it.  It is scheduled with a typical software pipeline.  I do not see any comments about a loop being disqualified because it contains control code.  I'm sure you find this frustrating.  But I don't know what else to do.

    Thanks and regards,

    -George

    Well, I don't really mind until that happens to my production code. What worries me is how to reproduce the exact behaviour. The output assembly I showed was obtained with test.c added to an existing SYS/BIOS project. I tried creating a fresh empty project with main.c and placed the code there: the loop does pipeline. I created a fresh SYS/BIOS project from the typical template and dropped the loop in: it does pipeline. However, in our production project the loop is disqualified with either the 7.4.12 or the 7.4.14 compiler. This was tried on 3 different machines. If you are curious about this, could you please suggest what else to check?

  • rrlagic said:
    However, in our production project the loop is disqualified with either 7.4.12 or 7.4.14 compiler.

    Please preprocess the source file which contains the problem loop, and attach it to your next post. Let me know which function contains the loop. Also show exactly how the compiler is invoked.  I can usually reproduce the problem then.

    Thanks and regards,

    -George

  • Hello George,

    Thank you for continuing to follow this.

    In the attachment please find the source itself, its preprocessed version, and the build console output. To build that file I added it to an existing project, then set file-specific options for -O3 optimization, keeping the assembly, and interlisting the source. There is just one function, foo(), in that file and nothing for the preprocessor to do, so it is no surprise that the .pp is identical to the C source. src.zip