Poor code generation for C674X vs C671x

Hi all,


I'm having trouble generating compact code using CCS5.5 vs CCS2.2. The attached file details the issue.


CCS 2.2 generates a 6 cycle loop for C671x

In the CCS2.2 section of the .lst file in the attached .zip, line 2609 is the start of the loop at source line 998, which iterates with a 6 cycle kernel.

CCS 5.5 generates an 18 cycle loop for C674x

In the CCS5.5 section of the .lst file in the attached .zip, line 6242 is the start of the loop at source line 998, which iterates with an 18 cycle kernel.

What gives?

Also, some loop code is repeated in the CCS5.5 section. There are 2 code sections that implement looping for line 998 of the C source. Why is that?


Any insights appreciated.


Thanks,

Andrew

C674x_compiler_opt.zip
  • Andrew,

    I will check this with the CCS team, to see if they have something to share.

    Best regards,
    Pavel

  • I have moved this thread to the TI C/C++ Compiler forum.

    Regards,
    Pavel

  • Andrew,

    Have you tried the C674x optimization described in the post below:

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/p/319115/1144579.aspx#1144579

    If you are using a CCStudio project for your DSP benchmark, you can also try enabling DSP compiler and linker optimization with the -O3 flag.

    In CCS, Project -> Properties -> Build -> C6000 Compiler -> Optimization
    Optimization level (--opt_level, -O) 3
    
    Select "3" from the drop down menu. 

    Then Project -> Build All. You should have in the console window:

        Invoking: C6000 Compiler
        "/home/users/pbotev/ti/ccsv5/tools/compiler/c6000_7.4.2/bin/cl6x" -mv6740 --abi=coffabi -O3 -g ....
        Invoking: C6000 Linker
        "/home/users/pbotev/ti/ccsv5/tools/compiler/c6000_7.4.2/bin/cl6x" -mv6740 --abi=coffabi -O3 -g

    Then load and run the new *.out file.

  • Switching to "-O3" made no difference whatsoever.

    Next idea?


    Thanks,

    Andrew

  • I can't guarantee it, but I'm confident adding the restrict keyword to the pointer pSampleOutput will make things better.  For details see this wiki article on tuning loops for C6000.  
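
    As a rough illustration of the restrict suggestion: the thread does not include the source of the loop at line 998, so the function name, the two input pointers, and the gains below are hypothetical. Only pSampleOutput is a name taken from the thread. The sketch assumes the output buffer never overlaps either input, which is exactly the promise restrict makes to the optimizer.

```c
#include <stddef.h>

/*
 * Hypothetical two-input mix loop. The restrict qualifiers promise the
 * compiler that, inside this function, the memory written through
 * pSampleOutput is not accessed through pInA or pInB, so loads for later
 * iterations may be scheduled ahead of stores from earlier ones.
 * Calling it with overlapping buffers would be undefined behavior.
 */
void mix_restrict(float *restrict pSampleOutput,
                  const float *restrict pInA,
                  const float *restrict pInB,
                  float gainA, float gainB, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        pSampleOutput[i] = gainA * pInA[i] + gainB * pInB[i];
    }
}
```

    Whether this removes the dependence in the real loop depends on how the pointers there are actually declared and used.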

    Andrew Elder said:
    CCS 5.5 generates an 18 cycle loop for C674x

    I don't see it in the file you attached.

    Andrew Elder said:
    Also, some loop code is repeated in the CCS5.5 section. There are 2 code sections that implement looping for line 998 of the C source. Why is that?

    Sometimes the compiler generates multiple versions of a loop.  Some runtime check prior to the loops chooses between them.  This check may make sure certain pointers are not aliased, or that the iteration count is above some threshold, or something similar.
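
    Written out by hand, the idea of two loop versions selected by a runtime check might look like the sketch below. This is not the code the compiler emits; the function names and the simple scaling loop are hypothetical and only illustrate an aliasing check that picks between a version that may be scheduled aggressively and a conservative one.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when the byte ranges [p, p+bytes) and [q, q+bytes) do not overlap. */
static int ranges_disjoint(const void *p, const void *q, size_t bytes)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t b = (uintptr_t)q;
    return (a + bytes <= b) || (b + bytes <= a);
}

/* Version 1: no aliasing, so loads and stores may be freely reordered. */
static void scale_fast(float *restrict dst, const float *restrict src,
                       float gain, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = gain * src[i];
    }
}

/* Version 2: possible aliasing, so memory order must be preserved. */
static void scale_safe(float *dst, const float *src, float gain, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = gain * src[i];
    }
}

/* Runtime check ahead of the loops; returns 1 if the fast version ran. */
int scale(float *dst, const float *src, float gain, size_t n)
{
    if (ranges_disjoint(dst, src, n * sizeof(float))) {
        scale_fast(dst, src, gain, n);
        return 1;
    }
    scale_safe(dst, src, gain, n);
    return 0;
}
```

    Both versions come from the same source loop, which would explain seeing two code sections attributed to the same source line in a listing.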

    Thanks and regards,

    -George

  • You've sent us the listing file for compiler version 4.20, which is very old.  The compiler has had many performance and correctness improvements.  It may be the case that the old schedule wasn't correct in all cases, and the compiler is now generating correct code.

    The loop in the 7.4.4 output you point out has ii=56; I can't find a loop with ii=18.  In this ii=56 loop, it shows a loop-carried dependence bound of 56, which forces the ii to be at least 56.  Loop-carried dependences are usually due to pointers.  The 7.4.4 compiler is more conservative in some possibly aliasing cases, such as const pointers.  It may be the case that your code needs to use the restrict keyword appropriately to tell the optimizer it may reorder certain operations.
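
    The relationship between the two bounds can be written as a one-line calculation. This is a sketch of the standard software-pipelining lower bound, not of the compiler's actual search; the function name is hypothetical. The ii the compiler finally reports can be larger than this bound if no schedule is found at the minimum.

```c
/*
 * Lower bound on the iteration interval (ii) of a software-pipelined loop:
 * it can be no smaller than the loop-carried dependence bound and no
 * smaller than the partitioned resource bound.
 */
unsigned min_ii(unsigned loop_carried_bound, unsigned resource_bound)
{
    return (loop_carried_bound > resource_bound) ? loop_carried_bound
                                                 : resource_bound;
}
```

    With a loop-carried dependence bound of 56, min_ii returns 56 for any resource bound up to 56, which is why removing the dependence (for example with restrict) matters more than any other tuning here.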

    There isn't enough source code here to compile the test case.  To analyze the problem, we'd need a compilable test case that demonstrates the problem.  Please preprocess the file as described at http://processors.wiki.ti.com/index.php/Preprocess_Complex_Source_Code_for_Bug_Submissions and post the corresponding preprocessed output.

  • George Mock said:

    I can't guarantee it, but I'm confident adding the restrict keyword to the pointer pSampleOutput will make things better.  For details see this wiki article on tuning loops for C6000.  

    CCS 5.5 generates an 18 cycle loop for C674x

    I don't see it in the file you attached.

    Thanks and regards,

    -George

    Hi George,

    Thanks for the feedback. First let me double check I'm reading the .lst file correctly.

        6243                    ;*----------------------------------------------------------------------------*
        6244                    ;*   SOFTWARE PIPELINE INFORMATION
        6245                    ;*
        6246                    ;*      Loop found in file               : /home/andrew/asi/sw/dsp/ax6/axmixengine.c
        6247                    ;*      Loop source line                 : 998
        6248                    ;*      Loop opening brace source line   : 998
        6249                    ;*      Loop closing brace source line   : 1007
        6250                    ;*      Known Minimum Trip Count         : 16                    
        6251                    ;*      Known Maximum Trip Count         : 32                    
        6252                    ;*      Known Max Trip Count Factor      : 2
        6253                    ;*      Loop Carried Dependency Bound(^) : 0
        6254                    ;*      Unpartitioned Resource Bound     : 6
        6255                    ;*      Partitioned Resource Bound(*)    : 6
        6256                    ;*      Resource Partition:
        6257                    ;*                                A-side   B-side
        6258                    ;*      .L units                     0        0     
        6259                    ;*      .S units                     0        0     
        6260                    ;*      .D units                     6*       6*    
        6261                    ;*      .M units                     2        2     
        6262                    ;*      .X cross paths               3        2     
        6263                    ;*      .T address paths             6*       6*    
        6264                    ;*      Long read paths              0        0     
        6265                    ;*      Long write paths             0        0     
        6266                    ;*      Logical  ops (.LS)           3        1     (.L or .S unit)
        6267                    ;*      Addition ops (.LSD)          0        1     (.L or .S or .D unit)
        6268                    ;*      Bound(.L .S .LS)             2        1     
        6269                    ;*      Bound(.L .S .D .LS .LSD)     3        3     
        6270                    ;*
        6271                    ;*      Searching for software pipeline schedule at ...
        6272                    ;*         ii = 6  Schedule found with 3 iterations in parallel
        6273                    ;*      Done
        6274                    ;*
        6275                    ;*      Loop will be splooped
        6276                    ;*      Collapsed epilog stages       : 0
        6277                    ;*      Collapsed prolog stages       : 0
        6278                    ;*      Minimum required memory pad   : 0 bytes
        6279                    ;*
        6280                    ;*      Minimum safe trip count       : 1
        6281                    ;*----------------------------------------------------------------------------*

    This reports a schedule with ii = 6, but the loop code below

        6289 00000940           $C$L64:    ; PIPED LOOP KERNEL
        6290 00000940           $C$DW$L$_AxMixerSrc2Dest$21$B:
        6291 00000940 049496e6             LDW     .D2T2   *B5++(16),B9      ; |999| (P) <0,0>
        6292                    
        6293 00000944     2ce7             SPMASK          L1,L2
        6294 00000948 03d90058  ||         ADD     .L1     8,A22,A7
        6295 00000946     fb07  ||         MV      .L2X    A22,B7
        6296                    
        6297 0000094c     2ce7             SPMASK          L1,L2
        6298 0000094e     6a06  ||         MV      .L1     A20,A3
        6299 00000950 0955105b  ||         ADD     .L2X    8,A21,B18
        6300 00000954 021c96e7  ||         LDW     .D2T2   *B7++(16),B4      ; |1001| (P) <0,2>
        6301 00000958 021c9664  ||         LDW     .D1T1   *A7++(16),A4      ; |1005| (P) <0,2>
        6302                    
        6303 00000960 088c9665             LDW     .D1T1   *A3++(16),A17     ; |999| (P) <0,3>
        6304 00000964 04c896e6  ||         LDW     .D2T2   *B18++(16),B9     ; |1003| (P) <0,3>
        6305                    
        6306 00000968     0c6e             NOP             1
        6307                    
        6308 0000096a     2c67             SPMASK          L1
        6309 0000096c 04510059  ||         ADD     .L1     8,A20,A8
        6310 00000970 09a4ce02  ||         MPYSP   .M2     B6,B9,B19         ; |999| (P) <0,5>
        6311                    
        6312 00000974     2de7             SPMASK          L1,S1,L2
        6313 00000978 04d08059  ||         ADD     .L1     4,A20,A9
        6314 00000976     18ce  ||         MV      .S1X    B17,A16
        6315 00000980 08d1905b  ||         ADD     .L2X    12,A20,B17
        6316 00000984 09209664  ||         LDW     .D1T1   *A8++(16),A18     ; |1003| (P) <0,6>
        6317                    
        6318 00000988 02249665             LDW     .D1T1   *A9++(16),A4      ; |1001| (P) <0,7>
        6319 0000098c 024496e7  ||         LDW     .D2T2   *B17++(16),B4     ; |1005| (P) <0,7>
        6320 00000990 08920e00  ||         MPYSP   .M1     A16,A4,A17        ; |1005| (P) <0,7>
        6321                    
        6322 00000994 09921e01             MPYSP   .M1X    A16,B4,A19        ; |1001| (P) <0,8>
        6323 00000998 04a4ce02  ||         MPYSP   .M2     B6,B9,B9          ; |1003| (P) <0,8>
        6324                    
        6325 0000099c 00000000             NOP             1
        6326                    
        6327 000009a0     2ce6             SPMASK          L2
        6328 000009a2     1647  ||         MV      .L2X    A20,B8
        6329 000009a4 024e3218  ||         ADDSP   .L1X    B19,A17,A4        ; |999| (P) <0,10>
        6330                    
        6331 000009a8 001f0001             SPMASK          L1,S1,L2
        6332 000009ac 02d10059  ||         ADD     .L1     8,A20,A5
        6333 000009b0 035081a1  ||         ADD     .S1     4,A20,A6
        6334 000009b4 0851905a  ||         ADD     .L2X    12,A20,B16
        6335                    
        6336 000009b8 02126219             ADDSP   .L1     A19,A4,A4         ; |1001| <0,12>
        6337 000009c0 04c4921a  ||         ADDSP   .L2X    A17,B4,B9         ; |1005| <0,12>
        6338                    
        6339 000009c4 02265218             ADDSP   .L1X    B9,A18,A4         ; |1003| <0,13>
        6340 000009c8     0c6e             NOP             1
        6341 000009ca     924f             MV      .S2X    A4,B4             ; |999| <0,15> Define a twin register
        6342                    
        6343 000009cc 02189675             STW     .D1T1   A4,*A6++(16)      ; |1001| <0,16>
        6344 000009d0 04c096f6  ||         STW     .D2T2   B9,*B16++(16)     ; |1005| <0,16>
        6345                    
        6346 000009d4 08034001             SPKERNEL 1,0
        6347 000009d8 02149675  ||         STW     .D1T1   A4,*A5++(16)      ; |1003| <0,17>
        6348 000009e0 022096f6  ||         STW     .D2T2   B4,*B8++(16)      ; |999| <0,17>

    takes 18 cycles. I don't understand where the ii = 6 comes from. Am I misunderstanding the loop constraints somehow?


    Regards,

    Andrew

  • The C674x includes a feature called a loop buffer.  It is a different way to program software pipelined loops that requires far less code size and power (among other things).  The instructions used to program the loop buffer include SPLOOP, SPMASK, and SPKERNEL.  It is all quite complicated.  I've never learned it beyond the description I'm giving you now.  The main point is: You can't simply count cycles in the assembly to work out the II.  Instead, rely on this comment ...

    Andrew Elder said:
       6272                    ;*         ii = 6  Schedule found with 3 iterations in parallel

    The loop II is 6.  
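
    One way to reconcile the two numbers, assuming the usual software-pipelining model: the listing's cycle tags run from <0,0> to <0,17>, so a single iteration spans 18 cycles, but with "3 iterations in parallel" a new iteration starts every ii = 6 cycles. The sketch below estimates total cycles under that model. The function names are hypothetical, and the estimate ignores loop setup overhead and memory stalls.

```c
/* Length of one iteration from start to finish: stages * ii cycles. */
unsigned single_iteration_length(unsigned ii, unsigned stages)
{
    return ii * stages;
}

/*
 * Approximate cycles for the whole pipelined loop (trips >= 1): a new
 * iteration starts every ii cycles, and the last one needs a further
 * (stages - 1) * ii cycles to drain after it starts.
 */
unsigned pipelined_cycles(unsigned ii, unsigned stages, unsigned trips)
{
    return ii * (trips + stages - 1u);
}
```

    For the listing above, single_iteration_length(6, 3) gives the 18 cycles counted in the assembly, while pipelined_cycles(6, 3, 32) gives about 204 cycles for the maximum trip count of 32, rather than 18 * 32 = 576.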

    Thanks and regards,

    -George

  • How different are the cycle counts for this loop when you run it?