#pragma FUNCTION_OPTIONS

Other Parts Discussed in Thread: TM4C123BH6ZRB, TM4C123GH6PGE

I am using Compiler Version TI 5.2.5 targeting multiple Tiva processors.

I need to optimize a single function.  The optimization options are --opt_level=n --opt_for_speed=m, which as shorthand I will write as (n,m).

This function takes the following times, with the settings listed in this order:

project level, file level, function level

(off,0), (off,0), (no pragmas): 451msec

(off,0), (3,5), (no pragmas): 151msec

(off,0), (off,0), (3,5): 353msec

BTW, the function below is the only function defined in the file.  There are no global or static data defined at the file level.

I have tried the pragma statements below, and neither form achieves the 151msec of file-level optimization:

#pragma FUNCTION_OPTIONS ( CRC_Calc, "--opt_level=3" );

#pragma FUNCTION_OPTIONS ( CRC_Calc, "--opt_for_speed=5" );

and

#pragma FUNCTION_OPTIONS ( CRC_Calc, "--opt_level=3 --opt_for_speed=5" );

 

The function reads (standard CRC 16 bit CCITT calculation):

UINT16 CRC_Calc(UINT16 seed,UINT8 *p, UINT32 len)
{
    UINT16 rv = seed;
    while (len--) {
        UINT8 i;
        UINT8 byte = *p++;
        for (i = 0; i < 8; ++i) {
            UINT32 osum = rv;
            rv <<= 1;
            if (byte & 0x80)
                rv |= 1 ;
            if (osum & 0x8000)
                rv ^= 0x1021;
            byte <<= 1;
        }
    }
    return rv;
}

I do not want to optimize an entire library or an entire file of multiple functions. If I have the space, I desire non-optimized code for the ease of debugging.
I can see wanting to optimize a single function out of a file of many functions.

So I have several questions:

Why are the pragmas coming up with a different performance than the file optimizations?

Is the string I am handing to the pragma of the correct syntax?

Should I conclude at this point, if you need to optimize a single function, drag it into its own file and optimize at the file level?

 

 

  • With --opt_level=3, the compiler does automatic inlining of function calls.  When the option is applied to one function only, this same inlining cannot be done.  So, is it possible that some calls to CRC_Calc are inlined?  And, in that context, optimized yet more than otherwise?

    Thanks and regards,

    -George

  • John Osen said:
    I am using Compiler Version TI 5.2.5 targeting multiple Tiva processors.

    The hardware CRC module in the TM4C129 series Tiva processors supports CRC16-CCITT as used by CCITT/ITU X.25.

    i.e. if all your Tiva processors are TM4C129 devices you may be able to use the hardware CRC module to get faster performance than the software function.

  • That is good to know. Unfortunately, we are using all 123 parts.
  • Just a quick question. On the 129 CRC module, do you kick it off and it gives you an interrupt or flag when done? Or is it a function call that returns when it is done?
  • I agree, opt_level=3 does nothing for me. I dropped it back to 2 and received the same results. I have rerun the data and get:

    TM4C123BH6ZRB – 128KB part
    206 msec, 12KB monitor, max. optimization on all components: (4,0). (Focused on size.)
    457 msec, 32KB monitor, optimization on all components: (off,0).
    150 msec, 32KB monitor, commonlib project level optimization (2,5)
    150 msec, 32KB monitor, CRC.c file level optimization (2,5) – BEST CASE
    353 msec, 32KB monitor, CRC_Calc function level optimization (2,5)
    236 msec, 32KB monitor, CRC_Calc hand optimized using 'register' keywords, all optimization turned off.

    I am still left with the question on why
    #pragma FUNCTION_OPTIONS ( CRC_Calc, "--opt_level=2 --opt_for_speed=5" );
    does not duplicate the performance of setting the same options at the file level for a file containing only a single function definition.
  • John Osen said:
    I am still left with the question on why
    #pragma FUNCTION_OPTIONS ( CRC_Calc, "--opt_level=2 --opt_for_speed=5" );
    does not duplicate the performance of setting the same options at the file level for a file containing only a single function definition.

    Just to be clear, I want to restate your case.  The function CRC_Calc is in a source file with no other functions.  When you build that source file with --opt_level=2 --opt_for_speed=5 you get better results than if you apply the same options with the FUNCTION_OPTIONS pragma.  Is this correct?  For now, I presume it is.

    I agree that's odd.  To explain it, I need to build that source file down to assembly both ways, and compare that assembly code.  Please submit a preprocessed form of that source file.  Show the build options exactly as the compiler sees them. 

    Thanks and regards,

    -George

  • John Osen said:
    On the 129 CRC module, do you kick it off and it gives you an interrupt or flag when done?

    On the TM4C129 CRC module you have to feed the input data into the CRC module, either with:

    a) A software loop, which returns when it is done. E.g. see the CRCDataProcess() function in the TivaWare driverlib\crc.c source file.

    b) DMA, which could raise an interrupt upon completion.

    Since the CRC module calculations are single-cycle, the limiting factor on the CRC calculation is how quickly the input can be written to the CRC module. If you have a large block of flash to CRC, I am not sure whether DMA or a software loop would feed the CRC module fastest.

  • You restated my case perfectly. I am attaching file now.
  • // The preprocessed file:
    //******************************************************************************
    //
    // Toro Confidential and Proprietary
    //
    // This work contains valuable confidential and proprietary information.
    // Disclosure, use or reproduction without the written authorization of Toro
    // is prohibited. This unpublished work by Toro is protected by the laws of
    // the United States and other countries. If publication of this work should
    // occur the following notice shall apply:
    //
    // Copyright © 2015 The Toro Company, All Rights Reserved.
    //
    //******************************************************************************
    //
    // File: CRC.c
    // Originated by: osenjw
    //
    // Description: CRC support functions
    //

    // ************* NOTE **********************
    // For this function to meet performance expectations it is assumed that the file
    // Properties -> Build -> ARM Compiler -> Optimization level = 1 Local Optimizations
    // Properties -> Build -> ARM Compiler -> Speed vs. size trade-offs = 5 speed
    // ************* NOTE **********************


    // This algorithm can be found in various places. It is the one used in XModem standard.
    // Often referred to its 'polynomial identifier' 0x1021. This infers: x^16 + x^12 + x^5 + 1
    // This algorithm is used as it does not require 512 bytes of flash that can bloat the
    // Monitor. It is the same CRC algorithm used on the AVR.
    // Initial timing results is that for an application between 16KB start and 126KB end, it
    // takes 206 msec.

    // Also note that Tiva 129 variants have a CRC module. So at the time a Tiva 129 is to be
    // supported, the available CRC module will need to be investigated.

    //******************************************************************************
    //
    // Toro Confidential and Proprietary
    //
    // This work contains valuable confidential and proprietary information.
    // Disclosure, use or reproduction without the written authorization of Toro
    // is prohibited. This unpublished work by Toro is protected by the laws of
    // the United States and other countries. If publication of this work should
    // occur the following notice shall apply:
    //
    // Copyright © 2015 The Toro Company, All Rights Reserved.
    //
    //******************************************************************************
    //
    // File: CRC.h
    // Originated by: osenjw
    //
    // Description: CRC support public function declarations
    //

    /* Calculate the CRC for a region of memory. EE would need to be copied to RAM to calculate.
    * Usually you need to make the first calculation with seed = 0, then feed the return back
    * into the function iteratively, unless you do the whole memory area at once. You can
    * do about 50 memory locations in .1msec = 100usec @ 80MHz*/


    //******************************************************************************
    //
    // Toro Confidential and Proprietary
    //
    // This work contains valuable confidential and proprietary information.
    // Disclosure, use or reproduction without the written authorization of Toro
    // is prohibited. This unpublished work by Toro is protected by the laws of
    // the United States and other countries. If publication of this work should
    // occur the following notice shall apply:
    //
    // Copyright © 2015 The Toro Company, All Rights Reserved.
    //
    //******************************************************************************
    //
    // File: globaldefs.h
    // Originated by: osenjw, from AVR CommonLib
    //
    // Description: global #defines and typedefs
    //
    //


    /* Bit manipulation macros */

    /* Utility */
    //#define FALSE 0
    //#define TRUE 1
    //typedef unsigned char BOOL;
    typedef enum BOOL {
    FALSE = 0,
    TRUE = 1
    } BOOL;

    typedef unsigned char UINT8;
    typedef unsigned short UINT16;
    typedef unsigned int UINT32; // could say long, but conflicts with TIVA_WARE
    typedef unsigned long long UINT64;
    typedef signed char SINT8;
    typedef signed short SINT16;
    typedef signed int SINT32; // could say long, but conflicts with TIVA_WARE
    typedef signed long long SINT64;
    typedef float FLT32;
    typedef union UNION64 {
    UINT64 uint64;
    UINT32 uint32[2];
    UINT16 uint16[4];
    UINT8 uint8[8];
    } UNION64;
    typedef union UNION32 {
    UINT32 uint32;
    UINT16 uint16[2];
    UINT8 uint8[4];
    } UNION32;



    /*!
    * \brief Calculates the CRC for a region of memory
    * \remark Tested on FLASH only
    */
    extern UINT16 CRC_Calc(UINT16 seed, UINT8 *p, UINT32 len);


    UINT16 CRC_Calc(UINT16 seed,UINT8 *p, UINT32 len)
    {
    UINT16 rv = seed;
    while (len--) {
    UINT8 i;
    UINT8 byte = *p++;
    for (i = 0; i < 8; ++i) {
    UINT32 osum = rv;
    rv <<= 1;
    if (byte & 0x80)
    rv |= 1 ;
    if (osum & 0x8000)
    rv ^= 0x1021;
    byte <<= 1;
    }
    }
    return rv;
    }

    // This is supposedly the faster table driven routine. Kept here as it can be hard to
    // find. Not tested. Found at: stackoverflow.com/.../crc16-calculation-in-c
    // it looks to require 1 less increment/decrement and one less assign than the above algorithm
    // while requiring an additional table look up. JWO assumes this is a wash. It certainly does
    // not look like 2x faster.
    //static const unsigned short crc16tab[256]= {
    //0x0000,0x1021,0x2042,0x3063,0x4084,0x50a5,0x60c6,0x70e7,
    //0x8108,0x9129,0xa14a,0xb16b,0xc18c,0xd1ad,0xe1ce,0xf1ef,
    //0x1231,0x0210,0x3273,0x2252,0x52b5,0x4294,0x72f7,0x62d6,
    //0x9339,0x8318,0xb37b,0xa35a,0xd3bd,0xc39c,0xf3ff,0xe3de,
    //0x2462,0x3443,0x0420,0x1401,0x64e6,0x74c7,0x44a4,0x5485,
    //0xa56a,0xb54b,0x8528,0x9509,0xe5ee,0xf5cf,0xc5ac,0xd58d,
    //0x3653,0x2672,0x1611,0x0630,0x76d7,0x66f6,0x5695,0x46b4,
    //0xb75b,0xa77a,0x9719,0x8738,0xf7df,0xe7fe,0xd79d,0xc7bc,
    //0x48c4,0x58e5,0x6886,0x78a7,0x0840,0x1861,0x2802,0x3823,
    //0xc9cc,0xd9ed,0xe98e,0xf9af,0x8948,0x9969,0xa90a,0xb92b,
    //0x5af5,0x4ad4,0x7ab7,0x6a96,0x1a71,0x0a50,0x3a33,0x2a12,
    //0xdbfd,0xcbdc,0xfbbf,0xeb9e,0x9b79,0x8b58,0xbb3b,0xab1a,
    //0x6ca6,0x7c87,0x4ce4,0x5cc5,0x2c22,0x3c03,0x0c60,0x1c41,
    //0xedae,0xfd8f,0xcdec,0xddcd,0xad2a,0xbd0b,0x8d68,0x9d49,
    //0x7e97,0x6eb6,0x5ed5,0x4ef4,0x3e13,0x2e32,0x1e51,0x0e70,
    //0xff9f,0xefbe,0xdfdd,0xcffc,0xbf1b,0xaf3a,0x9f59,0x8f78,
    //0x9188,0x81a9,0xb1ca,0xa1eb,0xd10c,0xc12d,0xf14e,0xe16f,
    //0x1080,0x00a1,0x30c2,0x20e3,0x5004,0x4025,0x7046,0x6067,
    //0x83b9,0x9398,0xa3fb,0xb3da,0xc33d,0xd31c,0xe37f,0xf35e,
    //0x02b1,0x1290,0x22f3,0x32d2,0x4235,0x5214,0x6277,0x7256,
    //0xb5ea,0xa5cb,0x95a8,0x8589,0xf56e,0xe54f,0xd52c,0xc50d,
    //0x34e2,0x24c3,0x14a0,0x0481,0x7466,0x6447,0x5424,0x4405,
    //0xa7db,0xb7fa,0x8799,0x97b8,0xe75f,0xf77e,0xc71d,0xd73c,
    //0x26d3,0x36f2,0x0691,0x16b0,0x6657,0x7676,0x4615,0x5634,
    //0xd94c,0xc96d,0xf90e,0xe92f,0x99c8,0x89e9,0xb98a,0xa9ab,
    //0x5844,0x4865,0x7806,0x6827,0x18c0,0x08e1,0x3882,0x28a3,
    //0xcb7d,0xdb5c,0xeb3f,0xfb1e,0x8bf9,0x9bd8,0xabbb,0xbb9a,
    //0x4a75,0x5a54,0x6a37,0x7a16,0x0af1,0x1ad0,0x2ab3,0x3a92,
    //0xfd2e,0xed0f,0xdd6c,0xcd4d,0xbdaa,0xad8b,0x9de8,0x8dc9,
    //0x7c26,0x6c07,0x5c64,0x4c45,0x3ca2,0x2c83,0x1ce0,0x0cc1,
    //0xef1f,0xff3e,0xcf5d,0xdf7c,0xaf9b,0xbfba,0x8fd9,0x9ff8,
    //0x6e17,0x7e36,0x4e55,0x5e74,0x2e93,0x3eb2,0x0ed1,0x1ef0
    //};
    //
    //unsigned short crc16_ccitt(unsigned char *buf, int len)
    //{
    //
    // register int counter;
    // register unsigned short crc = 0;
    // for (counter = 0; counter < len; counter++)
    // crc = (crc << 8) ^ crc16tab[((crc >> 8) ^ *(char *) buf++) & 0x00FF];
    // return crc;
    //}
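One caveat on the commented-out table routine above: it does not look like a drop-in replacement for CRC_Calc. CRC_Calc shifts each message bit directly into the low end of the register, whereas crc16_ccitt folds the byte into the high end, so for the same input the two should only agree if CRC_Calc is also fed two trailing zero bytes. Below is a minimal sketch of a table-driven variant that keeps CRC_Calc's behaviour. The names crc_bitwise, crc_table_init and crc_table_driven are hypothetical, the table is built at run time in RAM (a const table in flash would cost the 512 bytes the file comments want to avoid), and no timing on a Tiva part is claimed.

```c
#include <stdint.h>

/* Bitwise reference: same logic as CRC_Calc in the post, with <stdint.h> types. */
uint16_t crc_bitwise(uint16_t seed, const uint8_t *p, uint32_t len)
{
    uint16_t rv = seed;
    while (len--) {
        uint8_t byte = *p++;
        for (int i = 0; i < 8; ++i) {
            uint16_t top = rv & 0x8000u;      /* bit about to be shifted out */
            rv = (uint16_t)(rv << 1);
            if (byte & 0x80u)
                rv |= 1u;                     /* shift the next message bit in */
            if (top)
                rv ^= 0x1021u;
            byte = (uint8_t)(byte << 1);
        }
    }
    return rv;
}

/* 256 entries x 2 bytes = 512 bytes. */
static uint16_t crc_table[256];

/* Hypothetical helper: fill the table for polynomial 0x1021. */
void crc_table_init(void)
{
    for (unsigned n = 0; n < 256; ++n) {
        uint16_t r = (uint16_t)(n << 8);
        for (int i = 0; i < 8; ++i)
            r = (r & 0x8000u) ? (uint16_t)((r << 1) ^ 0x1021u)
                              : (uint16_t)(r << 1);
        crc_table[n] = r;
    }
}

/* Hypothetical table-driven variant of crc_bitwise: one lookup per byte.
 * crc_table_init() must have been called first. */
uint16_t crc_table_driven(uint16_t seed, const uint8_t *p, uint32_t len)
{
    uint16_t rv = seed;
    while (len--)
        rv = (uint16_t)(((rv << 8) | *p++) ^ crc_table[rv >> 8]);
    return rv;
}
```

On a host build, feeding the ASCII string "123456789" followed by two zero bytes with seed 0 should give 0x31C3 from both functions, which is the published check value for CRC-16/XMODEM. Whether the lookup actually beats the optimized bit loop on a TM4C123 would still need to be measured.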
  • And the compiler command:
    'Building file: C:/Data/Code/LM4FCommon/CommonLib/trunk/CRC.c'
    'Invoking: ARM Compiler'
    "C:/ti/ccsv6/tools/compiler/ti-cgt-arm_5.2.5/bin/armcl" -mv7M4 --code_state=16 --float_support=FPv4SPD16 --abi=eabi -me -O2 --opt_for_speed=5 --include_path="C:/ti/ccsv6/tools/compiler/ti-cgt-arm_5.2.5/include" --include_path="C:/ti/TivaWare_C_Series" -g --define=css --define=MONITOR --define=MRF24J40 --define=PART_TM4C123GH6PGE --define=TARGET_IS_BLIZZARD_RA1 --define=ccs="ccs" --diag_warning=225 --display_error_number --gen_func_subsections=on --preproc_with_compile --preproc_dependency="CRC.pp" "C:/Data/Code/LM4FCommon/CommonLib/trunk/CRC.c"
    'Finished building: C:/Data/Code/LM4FCommon/CommonLib/trunk/CRC.c'
    ' '
  • I compiled both ways while retaining the asm file. I do not read ARM assembler, but can guess at it.

    Comparing the ASM files, it looks like:
    - optimizing at the file level utilizes all 16 of the general purpose registers, 0 stack
    - optimizing at the function level only uses 8 general purpose registers and 24 bytes of stack and the assembly looks like there is a whole lotta loading into registers going on.

    It is almost as if the '--opt_level=2' is being ignored. I tried -O2 syntax, and found no change.
  • Thank you for the test case.  I can reproduce the same results.  I agree with your assessment.  I filed CODEGEN-1515 in the SDOWP system to have this addressed.  You are welcome to follow it with the SDOWP link below in my signature.

    Thanks and regards,

    -George

  • I don't see the other compiler command, only the one using --opt_level=2, but if it uses --opt_level=off, then the optimiser part of the compiler will not be invoked at all. The interpretation of the pragma to control optimisation level will simply not happen.
    If you want a FUNCTION_OPTIONS pragma to modulate --opt_level, you'll need to have at least --opt_level=0 on the command line. (And indeed, that's what I was doing in testing, and could not reproduce your results. The code with and without the pragma was about the same.)
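Going by the explanation above, a setup in which the pragma has something to act on might look like the sketch below. The build line in the comment is illustrative only. The `__TI_COMPILER_VERSION__` guard is an addition so that other compilers skip the pragma, and the typedefs are copied from the globaldefs.h listing earlier in the thread. Whether this reaches the file-level 150 msec figure on compiler 5.2.5 is not established here.

```c
/* Typedefs as in globaldefs.h from the preprocessed listing. */
typedef unsigned char  UINT8;
typedef unsigned short UINT16;
typedef unsigned int   UINT32;

/*
 * Assumed build line (only the optimization flag matters for this sketch):
 *   armcl ... --opt_level=0 ... CRC.c
 * With --opt_level=off on the command line the optimizer is reportedly not
 * invoked at all, so the pragma below would be ignored.
 */
#if defined(__TI_COMPILER_VERSION__)
/* In C, the pragma names the function and precedes its definition. */
#pragma FUNCTION_OPTIONS ( CRC_Calc, "--opt_level=2 --opt_for_speed=5" );
#endif

UINT16 CRC_Calc(UINT16 seed, UINT8 *p, UINT32 len)
{
    UINT16 rv = seed;
    while (len--) {
        UINT8 i;
        UINT8 byte = *p++;
        for (i = 0; i < 8; ++i) {
            UINT32 osum = rv;       /* remember the bit that is shifted out */
            rv <<= 1;
            if (byte & 0x80)
                rv |= 1;
            if (osum & 0x8000)
                rv ^= 0x1021;
            byte <<= 1;
        }
    }
    return rv;
}
```

The rest of the file and the project can then stay at --opt_level=0 for easier debugging, with only CRC_Calc asking for more.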
  • Something needs to change here, though I am unsure of what that is.  So I filed CODEGEN-1517 in the SDOWP system to have this investigated.  You are welcome to track this issue with the SDOWP link below in my signature.

    Thanks and regards,

    -George