Compiler: CGT8 optimizations for K2 and Nyquist

Tool/software: TI C/C++ Compiler

Hello!

We have observed a large increase in code size when moving from CGT 7.3.23 to CGT 8.3.4. Both builds use the same compiler flags:

--mem_model:data=far -pdse9 -pdse48 -pdse190 -pdse225 -pdse262 -pdse849 -pdse994 -mi1000 -mv6600 -mo --strip_coff_underscore --disable_push_pop -o3 -ms0 

One reason the binary has grown is that the compiler adds unnecessary NOPs to the code to align fetch packet boundaries.

This is visible when it uses so-called compact instructions, i.e. 16-bit opcodes:

CGT 8.3.x

840006f0   10102413           CALLP.S2      $Tramp$S$$AaMemCheckTag (PC+33056 = 0x84008800),B3
840006f4       0c6e ||        NOP           1
840006f6       0c6e ||        NOP           1
840006f8       0c6e ||        NOP           1
840006fa       0c6e ||        NOP           1
840006fc   ec401c0c           .fphead       n, l, W, BU, nobr, nosat, 1100010b
84000700             $C$RL204:
84000700       2226           CMPEQ.L1      1,A4,A0
84000702       3a76 ||        MVK.D1        1,A4
84000704   0473902b ||        MVK.S2        0xffffe720,B8
84000708   04737c29 ||        MVK.S1        0xffffe6f8,A8
8400070c       0727 ||        MVK.L2        0,B6

CGT 7.3.x

821f616c   100c1612           CALLP.S2      $Tramp$S$$AaMemCheckTag (PC+24752 = 0x821fc210),B3
821f6170       2226           CMPEQ.L1      1,A4,A0
821f6172       48aa    [ A0]  BNOP.S1       $C$L63 (PC+68 = 0x821f61a4),2
821f6174   0446aca9           MVK.S1        0xffff8d59,A8
821f6178   020017aa ||        MVK.S2        0x002f,B4
821f617c   e2208003           .fphead       n, l, W, BU, br, nosat, 0010001b

Another visible problem also adds unnecessary NOPs, and it likewise does NOT appear in CGT 7.3.x builds:

1081ec50   0fff6410           B.S1          odo_send_helper (PC-1248 = 0x1081e760)  -> here we call "odo_send_helper" and the actual "return" is done inside odo_send_helper
1081ec54       71f7           LDW.D2T2      *++B15[2],B3
1081ec56       8047           MV.L2         B0,B4
1081ec58       1313           MVK.S2        16,B6
1081ec5a       16c6           MV.L1X        B5,A8
1081ec5c   ec000000           .fphead       n, l, W, BU, nobr, nosat, 1100000b
1081ec60   00000000           NOP  

And the compiler generates an extra return symbol + NOPs, which is never used!!         

1081ec64             $C$RL24:
1081ec64   00000000           NOP          
1081ec68   00000000           NOP          
1081ec6c   00000000           NOP          
1081ec70   00000000           NOP          
1081ec74   00000000           NOP          
1081ec78   00000000           NOP          
1081ec7c   00000000           NOP          
1081ec80             send:
1081ec80             .text:send:
1081ec80   01ab0228           MVK.S1        0x5604,A3
1081ec84   018843e8           MVKH.S1       0x10870000,A3
1081ec88   018c0264           LDW.D1T1      *+A3[0],A3
1081ec8c       d246           MV.L1X        B4,A6
1081ec8e       cf27           MVK.L2        14,B6
1081ec90   0400a358           MVK.L1        0,A8
1081ec94   0fff5c10           B.S1          odo_send_helper (PC-1312 = 0x1081e760)
1081ec98   020c4266           LDW.D1T2      *+A3[2],B4
1081ec9c   e1000000           .fphead       n, l, W, BU, nobr, nosat, 0001000b
1081eca0   00006000           NOP           4
1081eca4             $C$RL26:
1081eca4   00000000           NOP          
1081eca8   00000000           NOP          
1081ecac   00000000           NOP          
1081ecb0   00000000           NOP          
1081ecb4   00000000           NOP          
1081ecb8   00000000           NOP          
1081ecbc   00000000           NOP          
1081ecc0             odo_send_w_s_safe:

Would you be able to explain why the new compiler generates such suboptimal code?

Second problem is that we are seeing unnecessary symbols (visible in the code but used nowhere) in DWARF structure:

0x00000000 0x00000004 poolIdPrivate
0x00000000 0x00000004 DSP2ARMSender1_
0x00000000 0x00000004 DSP2ARMReceiver1_
0x00000000 0x00000019 $P$T1$2
0x00000000 0x00000004 ret_addr
0x00000000 0x00000004 bufferSize
0x00000000 0x00000004 bufferSize
0x00000000 0x00000004 CHIPDSP_MASK
0x00000000 0x00000800 fftcHostDesc
0x00000000 0x00000004 TEST_LENGTH_IN_SECS
0x00000000 0x00000004 HwSemProcess5_
0x00000000 0x00000004 TBTS_TEST_COMPLETE_IND_MSG
0x00000000 0x00000018 gEthernetLoopbackStatsLastPeriod
0x00000000 0x00000008 IpAddr
0x00000000 0x00000400 gEthFrameBuffer
0x00000000 0x00000002 fragmentIdentification

Br,

Risto Alasaarela

  • The compiler option -ms0 (the long form equivalent is --opt_for_space=0) says you prefer to optimize for speed over size.  I suspect that has something to do with it.

    For each problem that causes unnecessary NOP instructions, I presume you can identify one source file with that problem.  For each such source file, please follow the directions in the article How to Submit a Compiler Test Case.  In case it is not obvious which function contains the problem, please add the comment // PROBLEM FUNCTION.  

    Regarding ...

    Risto Alasaarela1 said:
    Second problem is that we are seeing unnecessary symbols (visible in the code but used nowhere) in DWARF structure

    I presume by comparing the Dwarf output of the two source files you submit, that I will see the difference.  What problem is caused by these extra symbols?

    Thanks and regards,

    -George

  • Hello!

    Just to emphasize: we are seeing tens of kB more NOPs in targets compiled with CGT8 than in targets compiled with CGT7.3.23.

    Unfortunately, I cannot provide my source files because that would reveal Nokia IP. However, we have found the problems mentioned above in the RTS compilation too:

    [ralasaar@ouling36 lib]$ ./../bin/dis6x catrigf.c.obj |grep "RL50" -A 16 | head -n 16
    00000090 01888163 ADDKPC.S2 $C$RL50 (PC+32 = 0x000000a0),B3,4
    00000094 0c6e || NOP 1
    00000096 0c6e || NOP 1
    00000098 0c6e || NOP 1
    0000009a 0c6e || NOP 1
    0000009c ec201c00 .fphead n, l, W, BU, nobr, nosat, 1100001b
    000000a0 $C$RL50:
    000000a0 02341fdb MV.L2X A13,B4
    000000a4 05100fd9 || MV.L1 A4,A10
    000000a8 10000013 || CALLP.S2 $C$RL50 (PC+0 = 0x000000a0),B3
    000000ac 023006a0 || MV.S1 A12,A4
    000000b0 $C$RL52:
    000000b0 10000013 CALLP.S2 $C$RL50 (PC+0 = 0x000000a0),B3
    000000b4 02101fdb || MV.L2X A4,B4
    000000b8 02280fd8 || MV.L1 A10,A4
    000000bc $C$RL54:


    And another type:


    [ralasaar@ouling36 lib]$ ./../bin/dis6x algorithm.cpp.obj |grep 000000bc -A 10
    000000bc c8180344 [ A0] STDW.D1T1 A17:A16,*+A6[0]
    000000c0 $C$L2:
    000000c0 008c8363 BNOP.S2 B3,4
    000000c4 020c0fd8 || MV.L1 A3,A4
    000000c8 $C$L3:
    000000c8 00000000 NOP
    000000cc 00000000 NOP
    000000d0 00000000 NOP
    000000d4 00000000 NOP
    000000d8 00000000 NOP
    000000dc 00000000 NOP
    --

    Then for the second problem:

    The unnecessary DWARF symbols are not a fatal problem for us, but they interfere with our internal tool that analyses the compiler output. We would just like some explanation for the phenomenon.

    Br,

    Risto

  • I tried a similar experiment on a program I have.  I see about 16% more single cycle NOP instructions when building with version 8.3.4 than when building with version 7.3.23.  I filed the entry CODEGEN-6929 to have this investigated.  You are welcome to follow it with the SDOWP link below in my signature.
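
    A count like this can be reproduced on any dis6x listing with a short script. The following is a minimal sketch in Python, assuming the text layout shown in the listings earlier in this thread (address, opcode, optional || marker, mnemonic). The name count_nops and the regular expression are illustrative and not part of the TI tools; lines that do not match the assumed layout are skipped.

```python
import re

# Assumed dis6x line layout: 8-digit address, opcode (8 hex digits for a
# 32-bit instruction, 4 for a 16-bit compact one), optional "||", then NOP.
NOP_LINE = re.compile(
    r"^\s*[0-9a-fA-F]{8}\s+([0-9a-fA-F]{8}|[0-9a-fA-F]{4})\s+(?:\|\|\s+)?NOP\b"
)


def count_nops(listing: str) -> tuple[int, int]:
    """Return (number of NOP instructions, bytes they occupy)."""
    count = size = 0
    for line in listing.splitlines():
        match = NOP_LINE.match(line)
        if match:
            count += 1
            size += len(match.group(1)) // 2  # two hex digits per byte
    return count, size


# Sample lines copied from the CGT 8.3.x listing in the first post.
SAMPLE = """\
840006f0   10102413           CALLP.S2      $Tramp$S$$AaMemCheckTag (PC+33056 = 0x84008800),B3
840006f4       0c6e ||        NOP           1
840006f6       0c6e ||        NOP           1
840006f8       0c6e ||        NOP           1
840006fa       0c6e ||        NOP           1
840006fc   ec401c0c           .fphead       n, l, W, BU, nobr, nosat, 1100010b
84000700             $C$RL204:
84000700       2226           CMPEQ.L1      1,A4,A0
"""

if __name__ == "__main__":
    print(count_nops(SAMPLE))
```

    Running the same function over the full disassembly from each compiler version gives two (count, bytes) pairs that can be compared directly.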

    Thanks and regards,

    -George

  • Please add the option --no_compress, rebuild your project for both versions, and compare the code size.  If they are about the same, then what we're looking at is a deficiency in opcode compression.  If they are significantly different, I think the problem lies elsewhere.

  • I cannot find any plan in the referenced CODEGEN ticket for addressing my finding. Are you able to estimate when a new version for CGT8.4 would be available?

    Br,

    Risto

  • At this time, there are no plans for a future release of the C6000 compiler.  That being the case, the point of this investigation is to find the root cause, and recommend a workaround.

    Thanks and regards,

    -George

  • Hello!

    Would it be possible for you to provide plan for the WA? Or at least update the status more actively? This finding is becoming a blocker for Nokia in adopting this new toolset.

    Br,

    Risto

  • If we determine that the only way to solve your problem is by issuing a compiler release, then we'll discuss it.  However, we are not at that point yet.  

    What about the experiment with --no_compress requested by Archaeologist?  What happened?

    Risto Alasaarela1 said:
    Would it be possible for you to provide plan for the WA?

    Sorry, but what does WA stand for?

    Thanks and regards,

    -George

  • Hello!

    I can confirm that the generated code sizes for CGT7.3.23 and CGT8.3.4 still differ considerably when compiling with the --no_compress option. The CGT8 code size is bigger.

    WA = workaround

    Br,

    Risto

  • Then it is clear your code is quite different from the substitute test case I submitted with CODEGEN-6929.  Building it shows different results.

    Rather than focus on the difference in NOPs, it is better to focus on understanding the reason for the overall code size difference.  The only way to pursue that is with a test case from you.  To avoid you sending me the entire project, I need you to do a bit of work to identify one file to send.

    Please use the technique described in the article Find Source of Code Size Increase to determine which functions increased in size the most.  For one source file that contains some of those functions, please follow the directions in the article How to Submit a Compiler Test Case.  Especially note the part about protecting intellectual property.
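
    Whatever tool produces the per-function sizes, the comparison itself reduces to ranking functions by growth between the two builds. A minimal sketch in Python follows; the function name and the sizes are made up for illustration and do not come from the referenced article or from any real build.

```python
def rank_growth(old: dict[str, int], new: dict[str, int]) -> list[tuple[str, int]]:
    """Rank functions in the new build by size growth in bytes, largest first.

    Functions missing from the old build count as growing from 0 bytes.
    Functions that exist only in the old build are ignored.
    """
    growth = {name: size - old.get(name, 0) for name, size in new.items()}
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical per-function sizes in bytes, for illustration only.
    sizes_old = {"func_a": 128, "func_b": 256}
    sizes_new = {"func_a": 192, "func_b": 260, "func_c": 32}
    for name, delta in rank_growth(sizes_old, sizes_new):
        print(f"{name}: {delta:+d} bytes")
```

    The functions at the top of such a ranking point to the source file worth submitting as a test case.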

    Thanks and regards,

    -George

  • In this post, I explain the reasons for many of the NOP instructions you see.  I hope it will convince you the cause of the code size increase must lie elsewhere.

    The reason for these NOP instructions ...

    Risto Alasaarela1 said:

    840006f0   10102413           CALLP.S2      $Tramp$S$$AaMemCheckTag (PC+33056 = 0x84008800),B3
    840006f4       0c6e ||        NOP           1
    840006f6       0c6e ||        NOP           1
    840006f8       0c6e ||        NOP           1
    840006fa       0c6e ||        NOP           1
    840006fc   ec401c0c           .fphead       n, l, W, BU, nobr, nosat, 1100010b
    84000700             $C$RL204:

    and for these NOP instructions ...

    Risto Alasaarela1 said:

    And the compiler generates an extra return symbol + NOPs, which is never used!!         

    1081ec64             $C$RL24:
    1081ec64   00000000           NOP          
    1081ec68   00000000           NOP          
    1081ec6c   00000000           NOP          
    1081ec70   00000000           NOP          
    1081ec74   00000000           NOP          
    1081ec78   00000000           NOP          
    1081ec7c   00000000           NOP          
    1081ec80             send:
    1081ec80             .text:send:

    ... is related.

    C6000 instructions are organized into execute packets and fetch packets.  An execute packet is a set of 1-8 instructions that are all in parallel.  A fetch packet is a group of 8 instructions, on a 32-byte boundary, that are fetched for execution all at once.  An execute packet that is the target of a branch may not span a fetch packet boundary (mostly).  For the details, search the C6600 CPU manual for the section titled Execute Packet Restrictions.

    The assembler enforces this restriction.  A partial explanation of how this works is in appendix section A.10 of the application note Advanced Linker Techniques for Convenient and Efficient Memory Usage.  That explains the NOP instructions, not in parallel, contained in the second example I quote above.  They fill out that text subsection to a multiple of 32 bytes.

    The NOP 1 instructions in the first example I quote, all in parallel, are added by the assembler to cause the next execute packet, which always has a label, to start on a fetch packet boundary.  Another way to see this is to search the disassembly for other instances of .fphead.  The .fphead directives that are not followed by a label do not have the extra NOP 1 instructions.  The .fphead directives that are followed by a label often have the extra NOP 1 instructions.
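
    The arithmetic behind both kinds of padding can be checked against the addresses in the quoted listings. This is a minimal sketch in Python, assuming only the 32-byte fetch packet size stated above; the function name is made up for illustration.

```python
FETCH_PACKET_BYTES = 32  # 8 instruction words of 4 bytes each


def padding_to_fetch_packet(addr: int) -> int:
    """Bytes needed to advance addr to the next 32-byte boundary (0 if aligned)."""
    return (-addr) % FETCH_PACKET_BYTES


if __name__ == "__main__":
    # Addresses taken from the two listings quoted in this post.
    for addr in (0x1081EC64, 0x840006F4):
        pad = padding_to_fetch_packet(addr)
        print(f"{addr:#010x}: {pad} bytes of padding")
```

    In the second listing, $C$RL24 sits at 0x1081ec64, 28 bytes short of the next 32-byte boundary at 0x1081ec80, which matches the seven 4-byte NOP words shown there. In the first listing, the 12 bytes from 0x840006f4 to 0x84000700 are the four 16-bit NOP 1 instructions plus the 4-byte .fphead word.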

    The C6000 compiler tools have behaved this way from the beginning.  In particular, the version 7.3.x tools and the 8.3.x tools are no different regarding this detail.  Thus, it is unlikely this is the cause of the increase in code size.

    One experiment to consider ... As the appendix in the linker application note explains, using the compiler feature to put functions in subsections may cause a code size increase.  The compiler option is --gen_func_subsections, or -mo for short.  You build with this option enabled.  Consider turning it off.   While I am skeptical it will solve the problem, it can't hurt to try.

    That being the case, I renew the request for a single file test case I made in the post dated Dec 13.  I continue to think that is the best way forward.

    Thanks and regards,

    -George