C6000 code gen tool

Hi,

I recently upgraded my code gen tool from 6.0.28 to 6.1.19 and then to 7.0.4. My optimized code, which fit into the sploop (with the optimization flag set to -o2) before the upgrade, no longer does. The generated asm code is also different: it is longer and less efficient. Below are the software pipeline headers from the asm listings that the different versions of the code gen tool generate for the same C source file:

Using TMS320C6x Assembler PC v7.0.4:

    2546                    ;*----------------------------------------------------------------------------*
    2547                    ;*   SOFTWARE PIPELINE INFORMATION
    2548                    ;*
    2549                    ;*      Loop source line                 : 191
    2550                    ;*      Loop opening brace source line   : 192
    2551                    ;*      Loop closing brace source line   : 253
    2552                    ;*      Known Minimum Trip Count         : 1                   
    2553                    ;*      Known Maximum Trip Count         : 65536                   
    2554                    ;*      Known Max Trip Count Factor      : 1
    2555                    ;*      Loop Carried Dependency Bound(^) : 19
    2556                    ;*      Unpartitioned Resource Bound     : 14
    2557                    ;*      Partitioned Resource Bound(*)    : 14
    2558                    ;*      Resource Partition:
    2559                    ;*                                A-side   B-side
    2560                    ;*      .L units                     0        0    
    2561                    ;*      .S units                    11       11    
    2562                    ;*      .D units                    13       12    
    2563                    ;*      .M units                     7        6    
    2564                    ;*      .X cross paths               9        9    
    2565                    ;*      .T address paths            13       13    
    2566                    ;*      Long read paths              0        0    
    2567                    ;*      Long write paths             0        0    
    2568                    ;*      Logical  ops (.LS)           5        4     (.L or .S unit)
    2569                    ;*      Addition ops (.LSD)         12       13     (.L or .S or .D unit)
    2570                    ;*      Bound(.L .S .LS)             8        8    
    2571                    ;*      Bound(.L .S .D .LS .LSD)    14*      14*   
    2572                    ;*
    2573                    ;*      Searching for software pipeline schedule at ...
    2574                    ;*         ii = 19 Did not find schedule
    2575                    ;*         ii = 20 Did not find schedule
    2576                    ;*         ii = 21 Schedule found with 3 iterations in parallel
    2577                    ;*      Done
    2578                    ;*
    2579                    ;*      Epilog not entirely removed
    2580                    ;*      Collapsed epilog stages       : 1
    2581                    ;*
    2582                    ;*      Prolog not removed
    2583                    ;*      Collapsed prolog stages       : 0
    2584                    ;*
    2585                    ;*      Minimum required memory pad   : 0 bytes
    2586                    ;*
    2587                    ;*      For further improvement on this loop, try option -mh14
    2588                    ;*
    2589                    ;*      Minimum safe trip count       : 2
    2590                    ;*----------------------------------------------------------------------------*

 

Using TMS320C6x Assembler PC v6.1.19 and TMS320C6x Assembler PC v6.1.12:

    2637                    ;*----------------------------------------------------------------------------*
    2638                    ;*   SOFTWARE PIPELINE INFORMATION
    2639                    ;*
    2640                    ;*      Loop source line                 : 191
    2641                    ;*      Loop opening brace source line   : 192
    2642                    ;*      Loop closing brace source line   : 253
    2643                    ;*      Known Minimum Trip Count         : 1                   
    2644                    ;*      Known Maximum Trip Count         : 65536                   
    2645                    ;*      Known Max Trip Count Factor      : 1
    2646                    ;*      Loop Carried Dependency Bound(^) : 14
    2647                    ;*      Unpartitioned Resource Bound     : 13
    2648                    ;*      Partitioned Resource Bound(*)    : 14
    2649                    ;*      Resource Partition:
    2650                    ;*                                A-side   B-side
    2651                    ;*      .L units                     0        0    
    2652                    ;*      .S units                    11       11    
    2653                    ;*      .D units                    13       12    
    2654                    ;*      .M units                     7        6    
    2655                    ;*      .X cross paths               9        9    
    2656                    ;*      .T address paths            13       13    
    2657                    ;*      Long read paths              0        0    
    2658                    ;*      Long write paths             0        0    
    2659                    ;*      Logical  ops (.LS)           5        4     (.L or .S unit)
    2660                    ;*      Addition ops (.LSD)         11       13     (.L or .S or .D unit)
    2661                    ;*      Bound(.L .S .LS)             8        8    
    2662                    ;*      Bound(.L .S .D .LS .LSD)    14*      14*   
    2663                    ;*
    2664                    ;*      Searching for software pipeline schedule at ...
    2665                    ;*         ii = 14 Did not find schedule
    2666                    ;*         ii = 15 Did not find schedule
    2667                    ;*         ii = 16 Did not find schedule
    2668                    ;*         ii = 17 Schedule found with 4 iterations in parallel
    2669                    ;*      Done
    2670                    ;*
    2671                    ;*      Epilog not entirely removed
    2672                    ;*      Collapsed epilog stages       : 2
    2673                    ;*
    2674                    ;*      Prolog not entirely removed
    2675                    ;*      Collapsed prolog stages       : 1
    2676                    ;*
    2677                    ;*      Minimum required memory pad   : 0 bytes
    2678                    ;*
    2679                    ;*      For further improvement on this loop, try option -mh28
    2680                    ;*
    2681                    ;*      Minimum safe trip count       : 2
    2682                    ;*----------------------------------------------------------------------------*

 

Using TMS320C6x Assembler PC v6.0.28:

    2061                    ;*----------------------------------------------------------------------------*
    2062                    ;*   SOFTWARE PIPELINE INFORMATION
    2063                    ;*
    2064                    ;*      Loop source line                 : 191
    2065                    ;*      Loop opening brace source line   : 192
    2066                    ;*      Loop closing brace source line   : 253
    2067                    ;*      Known Minimum Trip Count         : 1                   
    2068                    ;*      Known Maximum Trip Count         : 65536                   
    2069                    ;*      Known Max Trip Count Factor      : 1
    2070                    ;*      Loop Carried Dependency Bound(^) : 12
    2071                    ;*      Unpartitioned Resource Bound     : 11
    2072                    ;*      Partitioned Resource Bound(*)    : 13
    2073                    ;*      Resource Partition:
    2074                    ;*                                A-side   B-side
    2075                    ;*      .L units                     0        0    
    2076                    ;*      .S units                    10       10    
    2077                    ;*      .D units                    11        9    
    2078                    ;*      .M units                     7        6    
    2079                    ;*      .X cross paths               9        7    
    2080                    ;*      .T address paths            13*      13*   
    2081                    ;*      Long read paths              0        0    
    2082                    ;*      Long write paths             0        0    
    2083                    ;*      Logical  ops (.LS)           3        5     (.L or .S unit)
    2084                    ;*      Addition ops (.LSD)          9        9     (.L or .S or .D unit)
    2085                    ;*      Bound(.L .S .LS)             7        8    
    2086                    ;*      Bound(.L .S .D .LS .LSD)    11       11    
    2087                    ;*
    2088                    ;*      Searching for software pipeline schedule at ...
    2089                    ;*         ii = 13 Did not find schedule
    2090                    ;*         ii = 14 Schedule found with 3 iterations in parallel
    2091                    ;*      Done
    2092                    ;*
    2093                    ;*      Loop will be splooped
    2094                    ;*      Collapsed epilog stages     : 0
    2095                    ;*      Collapsed prolog stages     : 0
    2096                    ;*      Minimum required memory pad : 0 bytes
    2097                    ;*
    2098                    ;*      Minimum safe trip count     : 1
    2099                    ;*----------------------------------------------------------------------------*

 

Note also the differences in ii: it changes from 14 (with 6.0.28) to 17 (with 6.1.19) and then to 21 (with 7.0.4). This degradation affects not only this loop but other loops as well. Having these loops fit into sploop is essential to our project, and we need to update our code gen tool to 7.0.4. A lot of effort went into optimizing these loops into sploop, and we can't afford to examine unexpected behavior and make major changes with every compiler/tool upgrade.
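
For a rough sense of scale, here is a small sketch that turns those ii values into a percentage. It assumes the steady-state cost of a software-pipelined loop is about ii cycles per iteration and ignores prolog and epilog; the helper name `ii_increase_percent` is made up for illustration.

```c
#include <assert.h>

/* Percent increase in steady-state cycles per iteration, rounded down.
 * Assumption: cost per iteration is roughly ii cycles. */
int ii_increase_percent(int ii_old, int ii_new)
{
    assert(ii_old > 0);
    return (ii_new - ii_old) * 100 / ii_old;
}
```

Under that assumption, going from ii = 14 to ii = 17 costs about 21% more cycles per iteration, and going from 14 to 21 costs 50% more.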

Can you recommend any compiler flag changes (other than -o2) that would restore the previous performance? Or is this a bug (or undocumented feature) in the latest code gen tool that will soon be fixed?

Thanks,

-- Louis

  • To make progress on this, we need a test case we can build ourselves.  Please submit one.  It does not have to run, only build down to assembly.  Further details are in the last part of the forum welcome message.

    Thanks and regards,

    -George

  • Here is the test case (I have verified that it compiles and runs):

    You can try it with the three versions of the code gen tool that I mentioned and see the differences in performance.  Thanks for your help:

    -------------------------------------------------------------------------------------------------------------------------------------------------

    #include <csl_types.h>

    typedef long long Int64;

    typedef struct
    {
        Int16 r;
        Int16 c;
        Int16 c1;
        Int16 c2;
        Int16 c3;
        Int16 c4;
        Int16 c5;
    } LP;

    void test(Int16 *restrict pDst,
              const Uint16 *restrict pSrc1,
              const Uint16 *restrict pSrc2,
              const LP *restrict pLP,
              const int length)
    {
        int i;
        Int16 sr, sc, co = 0, ro = 0;

        for (i = 0; i < length; i++)
        {
            Int16 *restrict pDst0, *restrict pDst2, *restrict pDst3, *restrict pDst4;
            const Int16 *restrict pC;
            Int64 c3210, p0, p1;
            Int32 pSh;
            Int32 rco, i1, i2, c32, c10, o10, o32;
            Int16 o4;
            Uint16 e;

            e = *pSrc1++;
            i1 = e | (e << 16);

            e = *pSrc2++;
            i2 = e | (e << 16);

            pDst += (co * 600 + ro);
            o10 = (_mem2((void*)pDst)  << 16) |
                     _mem2((void*)(pDst - 600));
            o32 = (_mem2((void*)(pDst + 2*600)) << 16) |
                     _mem2((void*)(pDst + 600));
            o4  = (_mem2((void*)(pDst + 3*600))) ;

            pC = (Int16*)(2 + (Int16*)pLP);
            c3210 = _mem8_const(pC);
            c10 = _loll(c3210);
            c32 = _hill(c3210);

            p0 = _smpy2ll(i1, c10);
            pSh = _packh2(_mpy32(_hill(p0), 4), _mpy32(_loll(p0), 4));
            o10 = _sadd2(o10, pSh);

            p0 = _smpy2ll(i1, c32);
            pSh = _packh2(_mpy32(_hill(p0), 4), _mpy32(_loll(p0), 4));
            o32 = _sadd2(o32, pSh);
            p0 = _smpy2ll(i2, c10);
            p1 = _smpy2ll(i2, c32);
            o10 = _sadd2(o10, _packh2(_mpy32(_loll(p0), 4), 0));
            o32 = _sadd2(o32, _packh2(_mpy32(_loll(p1), 4), _mpy32(_hill(p0), 4)));
            o4  = _sadd2(o4, _packh2(0, _mpy32(_hill(p1), 4)));

            pDst0 = pDst - 600;
            pDst2 = pDst + 600;
            pDst3 = pDst + 2*600;
            pDst4 = pDst + 3*600;
            _mem2((void*)(pDst0)) = (Uint16)(o10 & 0x0FFFF);
            _mem2((void*)(pDst))  = (Uint16)(o10 >> 16);
            _mem2((void*)(pDst2)) = (Uint16)(o32 & 0x0FFFF);
            _mem2((void*)(pDst3)) = (Uint16)(o32 >> 16);
            _mem2((void*)(pDst4)) = (Uint16)o4;

            pLP++;
            rco = _mem4((void*)pLP);
            co = (Int16)_packh2(0, rco);
            ro = (Int16)_pack2(0, rco);
            sc += co;
            sr += ro;
        }
    }

  • I cannot match the results you show.  Please tell me exactly which build options you use.

    Thanks and regards,

    -George

  • Here are the options.  There is nothing special and I think the main one is -O2.

    -mv64+ -g -O2 --define="_DEBUG" --define="TYPES_BIOS" --define="USE_BIOS_6" --define="_OS_SUPPORT" --define="_COM_BENCH" --define="CHIP_C6472" --include_path="C:/Program Files/Texas Instruments/C6000 Code Generation Tools 7.0.4/include" --include_path="C:/Program Files/Texas Instruments/ccsv4/emulation/boards/evmc6472/bsl/inc" --include_path="D:/vws/ll_M2_integ4/spDsp/EVM/M2_EVM6472/../../../spTools/Lib/csl_c6472_03_00_06_03/inc"  --diag_warning="225" --mem_model:data=far -k --asm_listing

     

    Thanks,

    -- Louis

  • I cannot reproduce your results exactly.  But I do see the same ii values you get.  So I filed a performance bug against the compiler.  It is SDSCM00039298 in the SDOWP system, which you can track with the SDOWP link in my sig below.

    Thanks and regards,

    -George

  • I have the exact same problem as mentioned above when migrating code to a newer chip generation with a newer compiler.

    /Daniel

  • George,

    I am a co-worker of Louis, who posted this originally. We are seeing an impact (higher cycle counts and hence poorer performance) in many algorithms when we move to the later-generation compiler. The impact seems substantial and could significantly affect the overall performance of our system.

    What is the timeframe for fixing this issue?

    Thanks,

    Somnath Banik 

  • There are two distinct issues in this test case.

    First, the restricted pointers pDst0, pDst2, pDst3, and pDst4 are defined within the loop.  That means that the restriction covers only a single iteration.  Are you sure this is what you want?  It turns out to be significant, because we had a bug in this situation in which we missed an alias *across* iterations, and as a result had to make a conservative choice that effectively removes the "restrict" qualifier on variables defined *within* a loop.  To work around it -- or perhaps express what you really mean -- define the variables outside the loop and assign them within it.
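
    A minimal sketch of that workaround, written in portable C99 rather than with the C6000 intrinsics.  The function name, the STRIDE constant, and the loop body are hypothetical; only the placement of the restrict-qualified declarations matters.  Note that declaring them outside the loop promises the compiler that the pointers do not alias across iterations either, so the sketch requires n <= STRIDE to keep the two rows disjoint.

    ```c
    #include <assert.h>
    #include <stdint.h>

    #define STRIDE 8  /* hypothetical row pitch; the test case above uses 600 */

    /* Adds src[i] into two rows of dst that are STRIDE elements apart.
     * src must not overlap dst. */
    void add_to_two_rows(int16_t *dst, const int16_t *src, int n)
    {
        /* Declared once, outside the loop: the no-alias promise now spans
         * every iteration instead of a single one. */
        int16_t *restrict row_lo;
        int16_t *restrict row_hi;
        int i;

        assert(n <= STRIDE);  /* keeps the two rows from overlapping */

        for (i = 0; i < n; i++) {
            /* Only assigned inside the loop. */
            row_lo = dst + i;
            row_hi = dst + STRIDE + i;
            *row_lo = (int16_t)(*row_lo + src[i]);
            *row_hi = (int16_t)(*row_hi + src[i]);
        }
    }
    ```

    Whether this recovers the older schedule for a particular loop is something to confirm in the software pipeline header of the listing.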

    Second, _mem2 is somewhat inefficient, because while there are instructions for unaligned 32-bit and 64-bit accesses, there aren't any for unaligned 16-bit accesses, and the compiler implements _mem2 with two 8-bit instructions.  The 6.0.28 compiler is converting the _mem2 to _mem4 which takes only one instruction.  I can mimic that by replacing _mem2 with (Uint16)_mem4 in the test case, though I haven't completely convinced myself that it's correct to do so.
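
    To make the open question concrete, here is a host-side model of the substitution.  These functions are not the TI intrinsics; they are hypothetical stand-ins that assemble values byte by byte in little-endian order, which assumes the device runs in little-endian mode.  The model suggests two things to check: truncating the wide load gives the same value as the 16-bit load, but the wide load also reads the two bytes that follow.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Model of an unaligned little-endian 16-bit load (stand-in for _mem2). */
    uint16_t load16_model(const uint8_t *p)
    {
        return (uint16_t)(p[0] | (p[1] << 8));
    }

    /* Model of an unaligned little-endian 32-bit load (stand-in for _mem4). */
    uint32_t load32_model(const uint8_t *p)
    {
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
    }

    /* Model of (Uint16)_mem4(p): same low 16 bits, but 4 bytes are read,
     * so the caller must own at least 4 bytes starting at p. */
    uint16_t load16_via_32_model(const uint8_t *p, size_t bytes_left)
    {
        assert(bytes_left >= 4);
        return (uint16_t)load32_model(p);
    }
    ```

    The model covers loads only.  A store cannot be widened the same way, because a 32-bit store would overwrite the two neighboring bytes.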

    If I make both of these changes to the try1.c file, I get ii=14 and SPLOOP with the latest 7.x compiler.

    The first issue won't be "fixed."  We had a bug before, and now we are preserving correct behavior at the cost of performance in some cases.  The second issue will be fixed when I understand the right way to proceed and the patch goes through the release process.

    If you aren't using both restricted pointers defined within a loop and _mem2, you don't have the exact same problem.  You may have a different problem that we can solve, in which case I encourage you to start another thread with your own test case.  There were substantial changes in the compiler between 6.0.x and 7.x, and while there were many performance improvements, there were also some degradations and we would like to fix them.

  • In our case, we can't just change _mem2 to _mem4 without making some significant changes to the algorithm. (I am curious whether (Uint16)_mem4 is equivalent to _mem2, and whether it has any side effects.)

    If I leave the _mem2 there but move the restrict pointer declarations out of the loop, I get the following result.

    Here is the header of the loop in the listing file after compiling it:

        2628                    ;*----------------------------------------------------------------------------*
        2629                    ;*   SOFTWARE PIPELINE INFORMATION
        2630                    ;*
        2631                    ;*      Loop source line                 : 193
        2632                    ;*      Loop opening brace source line   : 194
        2633                    ;*      Loop closing brace source line   : 253
        2634                    ;*      Known Minimum Trip Count         : 1                   
        2635                    ;*      Known Maximum Trip Count         : 65536                   
        2636                    ;*      Known Max Trip Count Factor      : 1
        2637                    ;*      Loop Carried Dependency Bound(^) : 14
        2638                    ;*      Unpartitioned Resource Bound     : 13
        2639                    ;*      Partitioned Resource Bound(*)    : 14
        2640                    ;*      Resource Partition:
        2641                    ;*                                A-side   B-side
        2642                    ;*      .L units                     0        0    
        2643                    ;*      .S units                    11       11    
        2644                    ;*      .D units                    13       12    
        2645                    ;*      .M units                     7        6    
        2646                    ;*      .X cross paths               9        9    
        2647                    ;*      .T address paths            13       13    
        2648                    ;*      Long read paths              0        0    
        2649                    ;*      Long write paths             0        0    
        2650                    ;*      Logical  ops (.LS)           5        4     (.L or .S unit)
        2651                    ;*      Addition ops (.LSD)         11       13     (.L or .S or .D unit)
        2652                    ;*      Bound(.L .S .LS)             8        8    
        2653                    ;*      Bound(.L .S .D .LS .LSD)    14*      14*   
        2654                    ;*
        2655                    ;*      Searching for software pipeline schedule at ...
        2656                    ;*         ii = 14 Did not find schedule
        2657                    ;*         ii = 15 Did not find schedule
        2658                    ;*         ii = 16 Did not find schedule
        2659                    ;*         ii = 17 Schedule found with 4 iterations in parallel
        2660                    ;*      Done
        2661                    ;*
        2662                    ;*      Epilog not entirely removed
        2663                    ;*      Collapsed epilog stages       : 2
        2664                    ;*
        2665                    ;*      Prolog not entirely removed
        2666                    ;*      Collapsed prolog stages       : 1
        2667                    ;*
        2668                    ;*      Minimum required memory pad   : 0 bytes
        2669                    ;*
        2670                    ;*      For further improvement on this loop, try option -mh28
        2671                    ;*
        2672                    ;*      Minimum safe trip count       : 2
        2673                    ;*----------------------------------------------------------------------------*

    It improves a little (ii drops from 21 to 17), but the loop still does not fit into sploop like it did with the older version (6.0.28) of the compiler.  I hope a later compiler release will include the patch that fixes the problem.

    Thanks,

    -- Louis

  • I took a closer look at your code.  The _mem2 intrinsic does not appear necessary.  Removing all _mem2, plus adding the build option -ms0, results in a SPLOOP with ii=12.

    The memory intrinsics are intended to be used with pointers that are declared to access a type smaller than the type accessed via the intrinsic.  Usually, the pointer is of type "char *", and the intrinsic is accessing a 32-bit word, or 64-bit double word.  But this code always applies _mem2 to pointers of type "Int16 *".  That is, the types are a match.  Another reason to possibly use _mem2 is that the pointers could somehow become unaligned.  I looked for a way in which any of these pointers could become unaligned, and could not find any.  Unless the function could be invoked something like this ...

    test((Int16 *) &array_of_char[index], ...);

    and index can be an odd number.   That seems very unlikely.

    Thus it is OK for you to remove all the _mem2 intrinsics.  Change all _mem2(expression) to just *(expression).
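
    If you want to guard that assumption during testing, a sketch like the following can help.  It is portable C with hypothetical helper names, not part of the compiler's library: it checks that a pointer is 2-byte aligned before doing the plain dereference that replaces _mem2.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Returns nonzero when p is suitably aligned for a 16-bit access. */
    int is_aligned16(const void *p)
    {
        return ((uintptr_t)p & 1u) == 0;
    }

    /* Plain dereference in place of _mem2; valid only for aligned pointers. */
    int16_t load_aligned16(const int16_t *p)
    {
        assert(is_aligned16(p));  /* would fire for the odd-index case above */
        return *p;
    }
    ```

    The assert compiles away when NDEBUG is defined, so it should not affect the optimized build.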

    This alone improves the loop quite a bit.  But it does not result in an SPLOOP.  For that, add the build option -ms0.  This changes the trade-off between speed and size.  The default is for the compiler to optimize for speed without any concern for code size.  Adding -ms0 changes that trade-off by one notch away from speed and in favor of size.  This trade-off must have been weighed a bit differently in the v6.0.x compiler.

    Thanks and regards,

    -George