
Compiler/TMS320C6657: C6657 optimization doesn't perform as well as C6747

Part Number: TMS320C6657

Tool/software: TI C/C++ Compiler

I implemented the gain function below on the C6657 EVM.

However, the optimization does not seem to perform as well as I expected.

Question 1)

In the .asm file (output with the -k option), is it correct that "ii" and "Loop Unroll Multiple" mean the number of cycles per loop iteration and the unroll factor, respectively?

If "ii = 20" and "Loop Unroll Multiple = 2", does this mean one iteration of the original loop costs 10 cycles (20/2)?
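The arithmetic in Question 1 can be written out as a small sketch. The helper names cyclesPerSample and printGainLoopEstimates are hypothetical; the first simply divides the reported ii by the unroll multiple, and the table holds the four values quoted in the gain function below. It is a static estimate only, so prolog/epilog cycles and memory stalls are not included.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: estimated cycles per original loop iteration (one
// audio sample here) for a software-pipelined loop whose kernel has
// initiation interval `ii` and was unrolled `unroll` times.
// Static estimate only: prolog/epilog, cache misses and wait states are ignored.
double cyclesPerSample(int ii, int unroll)
{
    assert(ii > 0 && unroll > 0);
    return static_cast<double>(ii) / unroll;
}

// Prints the estimate for the four variants of the gain loop
// (ii and unroll values as reported in the .asm comments in this thread).
void printGainLoopEstimates()
{
    struct Case { const char* name; int ii; int unroll; };
    static const Case cases[] = {
        { "A", 20, 2 }, { "B", 8, 8 }, { "C", 2, 2 }, { "D", 4, 8 }
    };
    const std::size_t count = sizeof(cases) / sizeof(cases[0]);
    for (std::size_t k = 0; k < count; k++) {
        std::printf("case %s: ii = %2d, unroll = %dx -> %.1f cycles/sample\n",
                    cases[k].name, cases[k].ii, cases[k].unroll,
                    cyclesPerSample(cases[k].ii, cases[k].unroll));
    }
}
```

For a 256-sample frame this puts the kernel of case A at roughly 256 × 10 = 2560 cycles and case D at roughly 256 × 0.5 = 128 cycles, before loop overhead.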

Question 2)

If the answer to Question 1 is yes, then, as you can see in the gain function below,

the optimization result changes completely depending on how the loop is written: from 10 cycles to 0.5 cycles per audio sample.

I would like to write it as "case A", but the optimization does not perform well (10 cycles/sample).

"Case D" is the best, but it costs overhead for copying the "pfData" array.

In my experience coding for the C6747, code written like "case A" gave high performance.

Would it be possible to optimize the function as written in "case A" by using a #pragma or an optimization option?

void CCompressor::Calc( float pfData[256] )
{
   int i;
   float pfTemp[256];
   float fPreGain = m_fPreGain; // local copy of the member variable m_fPreGain
   memcpy( pfTemp, pfData, sizeof(float)*SHIFT_SIZE );

   // gain function
   for (i = 0; i < 256; i++){
       pfData[i] *= m_fPreGain;  // A: ii = 20, Loop Unroll = 2x --> 10  cycles/sample
//     pfData[i] *= fPreGain;    // B: ii =  8, Loop Unroll = 8x --> 1   cycle/sample

//     pfTemp[i] *= m_fPreGain;  // C: ii =  2, Loop Unroll = 2x --> 1   cycle/sample
//     pfTemp[i] *= fPreGain;    // D: ii =  4, Loop Unroll = 8x --> 0.5 cycles/sample
   }

   // other processing
   ;
   ;
}
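One likely contributor to the gap between cases A and B, offered here as an assumption rather than a confirmed diagnosis, is pointer aliasing. In case A the gain is read through this on every iteration, and since pfData is a plain float*, the compiler has to allow for a store to pfData[i] changing m_fPreGain; it therefore reloads the gain after every store instead of keeping it in a register. The free-function sketch below (function names hypothetical) shows that this caution is necessary: when the gain lives inside the buffer being scaled, the re-reading form and the local-copy form really do compute different results.

```cpp
#include <cassert>

// Case A style: *gain is re-read on every iteration, like
// pfData[i] *= m_fPreGain. A store to data[i] may change *gain,
// so the compiler cannot keep the gain in a register.
void scaleRereadGain(float* data, const float* gain, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        data[i] *= *gain;
    }
}

// Case B style: the gain is copied into a local once, like
// fPreGain = m_fPreGain. Later stores to data[] cannot change g.
void scaleLocalGain(float* data, const float* gain, int n)
{
    assert(n >= 0);
    const float g = *gain;
    for (int i = 0; i < n; i++) {
        data[i] *= g;
    }
}
```

Starting from {2, 3, 4} with gain pointing at data[0], the re-reading version produces {4, 12, 16} and the local-copy version produces {4, 6, 8}. Copying the member into a local, as in case B, removes the dependency because nothing stored in the loop can change the local.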

  • Y_S said:

    In the .asm file (output with the -k option), is it correct that "ii" and "Loop Unroll Multiple" mean the number of cycles per loop iteration and the unroll factor, respectively?

    If "ii = 20" and "Loop Unroll Multiple = 2", does this mean one iteration of the original loop costs 10 cycles (20/2)?

    That is correct.  This is a good way to evaluate the performance of a software pipelined loop generated by the compiler.  Note, however, this is a static analysis which only considers the CPU cycles used in one iteration of the loop.  Not included is the number of loop iterations, or cycles lost to memory effects like cache misses or wait states.  

    Y_S said:
    I would like to write it as "case A", but the optimization does not perform well (10 cycles/sample).

    I cannot reproduce that result.  But I had to guess at a few things such as the compiler version, and the build options used.  If you want me to reproduce these results, then please follow the directions in the article How to Submit a Compiler Test Case.

    Y_S said:
    Would it be possible to optimize the function as written in "case A" by using a #pragma or an optimization option?

    Consider adding this line ...

       _nassert(((int)pfData % 8) == 0);

    This tells the compiler the pointer pfData is aligned to an 8-byte boundary.  For further background, please see this article.  It is likely this change will double the performance of the loop.
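A sketch of how the hint could sit in the gain function follows. It assumes the TI C6000 compiler (detected here via its predefined __TI_COMPILER_VERSION__ macro) and that every caller really passes an 8-byte-aligned buffer; _nassert is only a promise to the optimizer, so passing a misaligned pointer would be a bug. On a host build the hint compiles away and only the portable check remains. The names applyGain, isAligned8 and kNumSamples are hypothetical.

```cpp
#include <cassert>
#include <stdint.h>

#if defined(__TI_COMPILER_VERSION__)
#include <c6x.h>              // TI intrinsics, including _nassert()
#endif

const int kNumSamples = 256;  // hypothetical frame size

// Portable alignment check; uintptr_t instead of int so it also
// compiles on 64-bit hosts, where a pointer does not fit in an int.
inline bool isAligned8(const void* p)
{
    return (reinterpret_cast<uintptr_t>(p) % 8u) == 0;
}

// Gain loop with the alignment hint suggested in the thread.
// Precondition: pfData is 8-byte aligned and holds kNumSamples floats.
void applyGain(float* pfData, float fPreGain)
{
    assert(isAligned8(pfData));  // debug-build check of the promise below
#if defined(__TI_COMPILER_VERSION__)
    // Promise to the optimizer only; nothing is checked at run time.
    // On C6000, int and pointers are both 32 bits, so (int) is safe here.
    _nassert(((int)pfData % 8) == 0);
#endif
    for (int i = 0; i < kNumSamples; i++) {
        pfData[i] *= fPreGain;
    }
}
```

On the caller side, the buffer can be given 8-byte alignment with C++11 alignas(8) where the compiler version supports it, or with the TI DATA_ALIGN pragma; the exact pragma syntax for C++ should be checked in the compiler user's guide.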

    Thanks and regards,

    -George

  • Please submit the requested test case.

    Thanks and regards,

    -George

  • Hi,

    Thank you for your quick reply.

    // Problem 1

    Following your advice, I prepared a compact version of the C++ file, attached below.

    While preparing it, I found the bottleneck.

    "Case 1-A" gives 10 cycles/sample.

    However, without line 12 (float m_pfAmpRingBuf[64];, which looks meaningless in this compact C++ file but is needed for other processing), the performance of "case 1-A" improves from 10 cycles/sample to 0.5 cycles/sample (20 times faster).
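If m_pfAmpRingBuf has to stay in the class, one possible workaround, an untested sketch rather than a confirmed fix for this compiler version, is to read the member into a local once and mark the buffer parameter __restrict (the qualifier the TI string.h declarations in OptSample.pp.txt already use in C++ mode), so the optimizer may assume stores through pfData do not touch the object. The class is a cut-down, hypothetical stand-in for CCompressor: only Calc, m_fPreGain and m_pfAmpRingBuf come from the thread, while kFrame and the constructor are illustrative. The code sticks to C++03 in case the compiler is not in C++11 mode.

```cpp
#include <cassert>

// Hypothetical cut-down stand-in for CCompressor.
class CCompressor {
public:
    static const int kFrame = 256;   // hypothetical frame size

    explicit CCompressor(float fPreGain) : m_fPreGain(fPreGain)
    {
        for (int i = 0; i < 64; i++) {
            m_pfAmpRingBuf[i] = 0.0f;
        }
    }

    // Written close to case 1-A, but the member is read once and the
    // buffer is declared __restrict, so stores through pfData are not
    // assumed to modify this object.
    void Calc(float* __restrict pfData)
    {
        assert(pfData != 0);
        const float fPreGain = m_fPreGain;
        for (int i = 0; i < kFrame; i++) {
            pfData[i] *= fPreGain;
        }
        // ... other processing that uses m_pfAmpRingBuf ...
    }

private:
    float m_pfAmpRingBuf[64];  // kept: needed by the other processing
    float m_fPreGain;
};
```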

     

    With a different way of coding, "case 1-D" also achieves 0.5 cycles/sample as I expect, but it costs overhead for copying the array.

    // Problem 2

    In the attached .pp file, I included another example of the optimization results.

    In this matrix update function, I simply calculate the average of the elements of the matrix.

    I tested 4 versions of the same averaging function: "case 2-A", "case 2-B", ..., "case 2-D".

    I would like to write it as "case 2-A", but its performance is 26 cycles per data element, which is not good.

    "Case 2-D" has the best performance (2 cycles per data element), but it also costs overhead for copying the array.
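The case 2 code is not reproduced in this thread, so the following is only a hypothetical illustration of how the principle behind case 1 would carry over to an averaging loop: accumulate into a local variable and store to the member once after the loop, instead of updating a member on every iteration. The class CMatrixStats, the row-major layout and the choice of averaging all elements into one value are assumptions; whether this matches the difference between cases 2-C and 2-D can only be confirmed from the compiler's .asm output.

```cpp
#include <cassert>

// Hypothetical class; the real case 2 code is not shown in the thread.
class CMatrixStats {
public:
    CMatrixStats() : m_fMean(0.0f) {}

    // Mean of all elements of a rows x cols matrix stored row-major.
    void Update(const float* pfMat, int rows, int cols)
    {
        assert(pfMat != 0 && rows > 0 && cols > 0);
        const int n = rows * cols;
        float fSum = 0.0f;             // local accumulator, not a member
        for (int i = 0; i < n; i++) {
            fSum += pfMat[i];
        }
        m_fMean = fSum / (float)n;     // single store to the member
    }

    float Mean() const { return m_fMean; }

private:
    float m_fMean;
};
```

Writing m_fMean += pfMat[i] inside the loop instead would likely force a load and store through this on every iteration, unless the compiler can prove pfMat does not point at the member.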

     

    In particular, I do not understand the difference between "case 2-C" and "case 2-D": 14 cycles and 2 cycles, respectively.

    Furthermore, the asm file shows two optimization results for "case 2-C": 14 cycles and 2 cycles.

    Finally, now that I am implementing our processing on the C6657 EVM, the total processing load is about 500% of this DSP's capacity.

    I know this C66x DSP offers high performance.

    My estimate before implementation was around 60%.
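The load percentage can be tied back to the cycle counts with simple arithmetic: load = cycles spent per frame ÷ (clock rate × frame period). The sketch below (hypothetical helper name) assumes a block-based process with a fixed frame size; the clock, frame and sample-rate values in the usage note are round illustrative numbers, not figures from this project.

```cpp
#include <cassert>

// Hypothetical helper: CPU load in percent for a block-based audio process.
//   cyclesPerFrame : cycles needed to process one frame
//   clockHz        : DSP core clock
//   frameSize      : samples per frame
//   sampleRateHz   : audio sample rate
double cpuLoadPercent(double cyclesPerFrame, double clockHz,
                      int frameSize, double sampleRateHz)
{
    assert(clockHz > 0.0 && frameSize > 0 && sampleRateHz > 0.0);
    const double framePeriodSec  = frameSize / sampleRateHz;
    const double availableCycles = clockHz * framePeriodSec;
    return 100.0 * cyclesPerFrame / availableCycles;
}
```

For example, 2,000,000 cycles per 256-sample frame at a hypothetical 64 kHz sample rate and 1 GHz clock gives 4,000,000 available cycles per frame and therefore a load of 50%.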

     

    I suspect that one of the biggest reasons for the poor performance is the optimization result.

    What I would like to know is the appropriate build option or #pragma to obtain the fastest result for case 1-A and case 2-A (case 2-C is also acceptable).

    Best regards,

    Y_S

  • Thank you for sending in a test case.  However, the attachment cannot be downloaded.  I do not know why.  Please try again.  When you post your next reply, click the link in the lower right corner titled "Insert Code, Attach Files and more".  This brings up a more feature-rich message compose interface.  Use the paper clip icon to attach the text file.

    Thanks and regards,

    -George

  • OptSample.pp.txt
    /*****************************************************************************/
    /* string.h   v8.2.5                                                         */
    /*                                                                           */
    /* Copyright (c) 1993-2018 Texas Instruments Incorporated                    */
    /* http://www.ti.com/                                                        */
    /*                                                                           */
    /*  Redistribution and  use in source  and binary forms, with  or without    */
    /*  modification,  are permitted provided  that the  following conditions    */
    /*  are met:                                                                 */
    /*                                                                           */
    /*     Redistributions  of source  code must  retain the  above copyright    */
    /*     notice, this list of conditions and the following disclaimer.         */
    /*                                                                           */
    /*     Redistributions in binary form  must reproduce the above copyright    */
    /*     notice, this  list of conditions  and the following  disclaimer in    */
    /*     the  documentation  and/or   other  materials  provided  with  the    */
    /*     distribution.                                                         */
    /*                                                                           */
    /*     Neither the  name of Texas Instruments Incorporated  nor the names    */
    /*     of its  contributors may  be used to  endorse or  promote products    */
    /*     derived  from   this  software  without   specific  prior  written    */
    /*     permission.                                                           */
    /*                                                                           */
    /*  THIS SOFTWARE  IS PROVIDED BY THE COPYRIGHT  HOLDERS AND CONTRIBUTORS    */
    /*  "AS IS"  AND ANY  EXPRESS OR IMPLIED  WARRANTIES, INCLUDING,  BUT NOT    */
    /*  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR    */
    /*  A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT    */
    /*  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,    */
    /*  SPECIAL,  EXEMPLARY,  OR CONSEQUENTIAL  DAMAGES  (INCLUDING, BUT  NOT    */
    /*  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,    */
    /*  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY    */
    /*  THEORY OF  LIABILITY, WHETHER IN CONTRACT, STRICT  LIABILITY, OR TORT    */
    /*  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE    */
    /*  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.     */
    /*                                                                           */
    /*****************************************************************************/
    
    
    
    #pragma diag_push
    #pragma CHECK_MISRA("-6.3") /* standard types required for standard headers */
    #pragma CHECK_MISRA("-19.1") /* #includes required for implementation */
    #pragma CHECK_MISRA("-20.1") /* standard headers must define standard names */
    #pragma CHECK_MISRA("-20.2") /* standard headers must define standard names */
    
    /*---------------------------------------------------------------------------*/
    /* <cstring> IS RECOMMENDED OVER <string.h>.  <string.h> IS PROVIDED FOR     */
    /* COMPATIBILITY WITH C AND THIS USAGE IS DEPRECATED IN C++                  */
    /*---------------------------------------------------------------------------*/
    extern "C" namespace std
    {
     
    
    typedef unsigned size_t;
    
    /*****************************************************************************/
    /* linkage.h   v8.2.5                                                        */
    /*                                                                           */
    /* Copyright (c) 1998-2018 Texas Instruments Incorporated                    */
    /* http://www.ti.com/                                                        */
    /*                                                                           */
    /*****************************************************************************/
    
    
    #pragma diag_push
    #pragma CHECK_MISRA("-19.4") /* macros required for implementation */
    
    /*--------------------------------------------------------------------------*/
    /* Define _CODE_ACCESS ==> how to call RTS functions                        */
    /*--------------------------------------------------------------------------*/
    
    /*--------------------------------------------------------------------------*/
    /* Define _DATA_ACCESS ==> how to access RTS global or static data          */
    /*--------------------------------------------------------------------------*/
    /*--------------------------------------------------------------------------*/
    /* Define _DATA_ACCESS_NEAR ==> some C6000 RTS data must always be near     */
    /*--------------------------------------------------------------------------*/
    
    /*--------------------------------------------------------------------------*/
    /* Define _IDECL ==> how inline functions are declared                      */
    /*--------------------------------------------------------------------------*/
    
    /*--------------------------------------------------------------------------*/
    /* If compiling with non-TI compiler (e.g. GCC), nullify any TI-specific    */
    /* language extensions.                                                     */
    /*--------------------------------------------------------------------------*/
    
    #pragma diag_pop
    
    
    #pragma diag_push
    #pragma CHECK_MISRA("-19.4") /* macros required for implementation */
    
    
    #pragma diag_pop
    
    static __inline size_t  strlen(const char *string);
    
    static __inline char *strcpy(char * __restrict dest,
                            const char * __restrict src);
    static __inline char *strncpy(char * __restrict dest,
                             const char * __restrict src, size_t n);
    static __inline char *strcat(char * __restrict string1,
                            const char * __restrict string2);
    static __inline char *strncat(char * __restrict dest,
                             const char * __restrict src, size_t n);
    static __inline char *strchr(const char *string, int c);
    static __inline char *strrchr(const char *string, int c);
    
    static __inline int  strcmp(const char *string1, const char *string2);
    static __inline int  strncmp(const char *string1, const char *string2, size_t n);
    
     int     strcoll(const char *string1, const char *_string2);
     size_t  strxfrm(char * __restrict to,
                                 const char * __restrict from, size_t n);
     char   *strpbrk(const char *string, const char *chs);
     size_t  strspn(const char *string, const char *chs);
     size_t  strcspn(const char *string, const char *chs);
     char   *strstr(const char *string1, const char *string2);
     char   *strtok(char * __restrict str1,
                                const char * __restrict str2);
     char   *strerror(int _errno);
     char   *strdup(const char *string);
    
    
     void   *memmove(void *s1, const void *s2, size_t n);
    #pragma diag_push
    #pragma CHECK_MISRA("-16.4") /* false positives due to builtin declarations */
     void   *memcpy(void * __restrict s1,
                                const void * __restrict s2, size_t n);
    #pragma diag_pop
    
    static __inline int     memcmp(const void *cs, const void *ct, size_t n);
    static __inline void   *memchr(const void *cs, int c, size_t n);
    
     void   *memset(void *mem, int ch, size_t length);
    
    
    } /* extern "C" namespace std */
    
    
    
    namespace std {
    
    #pragma diag_push
    #pragma CHECK_MISRA("-19.4") /* macros required for implementation */
    
    
    #pragma diag_pop
    
    #pragma diag_push /* functions */
    
    /* MISRA exceptions to avoid changing inline versions of the functions that
       would be linked in instead of included inline at different mf levels */
    /* these functions are very well-tested, stable, and efficient; it would
       introduce a high risk to implement new, separate MISRA versions just for the
       inline headers */
    
    #pragma CHECK_MISRA("-5.7") /* keep names intact */
    #pragma CHECK_MISRA("-6.1") /* false positive on use of char type */
    #pragma CHECK_MISRA("-8.5") /* need to define inline functions */
    #pragma CHECK_MISRA("-10.1") /* use implicit casts */
    #pragma CHECK_MISRA("-10.3") /* need casts */
    #pragma CHECK_MISRA("-11.5") /* casting away const required for standard impl */
    #pragma CHECK_MISRA("-12.1") /* avoid changing expressions */
    #pragma CHECK_MISRA("-12.2") /* avoid changing expressions */
    #pragma CHECK_MISRA("-12.4") /* avoid changing expressions */
    #pragma CHECK_MISRA("-12.5") /* avoid changing expressions */
    #pragma CHECK_MISRA("-12.6") /* avoid changing expressions */
    #pragma CHECK_MISRA("-12.13") /* ++/-- needed for reasonable implementation */
    #pragma CHECK_MISRA("-13.1") /* avoid changing expressions */
    #pragma CHECK_MISRA("-14.7") /* use multiple return points */
    #pragma CHECK_MISRA("-14.8") /* use non-compound statements */
    #pragma CHECK_MISRA("-14.9") /* use non-compound statements */
    #pragma CHECK_MISRA("-17.4") /* pointer arithmetic needed for reasonable impl */
    #pragma CHECK_MISRA("-17.6") /* false positive returning pointer-typed param */
    
    static __inline size_t strlen(const char *string)
    {
       size_t      n = (size_t)-1;
       const char *s = string;
    
       do n++; while (*s++);
       return n;
    }
    
    static __inline char *strcpy(char * __restrict dest, const char * __restrict src)
    {
         char       *d = dest;
         const char *s = src;
    
         while (*d++ = *s++);
         return dest;
    }
    
    static __inline char *strncpy(char * __restrict dest,
                             const char * __restrict src,
                             size_t n)
    {
         if (n) 
         {
    	 char       *d = dest;
    	 const char *s = src;
    	 while ((*d++ = *s++) && --n);              /* COPY STRING         */
    	 if (n-- > 1) do *d++ = '\0'; while (--n);  /* TERMINATION PADDING */
         }
         return dest;
    }
    
    static __inline char *strcat(char * __restrict string1,
                            const char * __restrict string2)
    {
       char       *s1 = string1;
       const char *s2 = string2;
    
       while (*s1) s1++;		     /* FIND END OF STRING   */
       while (*s1++ = *s2++);	     /* APPEND SECOND STRING */
       return string1;
    }
    
    static __inline char *strncat(char * __restrict dest,
                             const char * __restrict src, size_t n)
    {
        if (n)
        {
    	char       *d = dest;
    	const char *s = src;
    
    	while (*d) d++;                      /* FIND END OF STRING   */
    
    	while (n--)
    	  if (!(*d++ = *s++)) return dest; /* APPEND SECOND STRING */
    	*d = 0;
        }
        return dest;
    }
    
    static __inline char *strchr(const char *string, int c)
    {
       char        tch, ch  = c;
       const char *s        = string;
    
       for (;;)
       {
           if ((tch = *s) == ch) return (char *) s;
           if (!tch)             return (char *) 0;
           s++;
       }
    }
    
    static __inline char *strrchr(const char *string, int c)
    {
       char        tch, ch = c;
       char       *result  = 0;
       const char *s       = string;
    
       for (;;)
       {
          if ((tch = *s) == ch) result = (char *) s;
          if (!tch) break;
          s++;
       }
    
       return result;
    }
    
    static __inline int strcmp(const char *string1, const char *string2)
    {
       int c1, res;
    
       for (;;)
       {
           c1  = (unsigned char)*string1++;
           res = c1 - (unsigned char)*string2++;
    
           if (c1 == 0 || res != 0) break;
       }
    
       return res;
    }
    
    static __inline int strncmp(const char *string1, const char *string2, size_t n)
    {
         if (n) 
         {
    	 const char *s1 = string1;
    	 const char *s2 = string2;
    	 unsigned char cp;
    	 int         result;
    
    	 do 
    	    if (result = (unsigned char)*s1++ - (cp = (unsigned char)*s2++))
                    return result;
    	 while (cp && --n);
         }
         return 0;
    }
    
    static __inline int memcmp(const void *cs, const void *ct, size_t n)
    {
       if (n) 
       {
           const unsigned char *mem1 = (unsigned char *)cs;
           const unsigned char *mem2 = (unsigned char *)ct;
           int                 cp1, cp2;
    
           while ((cp1 = *mem1++) == (cp2 = *mem2++) && --n);
           return cp1 - cp2;
       }
       return 0;
    }
    
    static __inline void *memchr(const void *cs, int c, size_t n)
    {
       if (n)
       {
          const unsigned char *mem = (unsigned char *)cs;   
          unsigned char        ch  = c;
    
          do 
             if ( *mem == ch ) return (void *)mem;
             else mem++;
          while (--n);
       }
       return 0;
    }
    
    
    } /* namespace std */
    
    
    
    #pragma diag_pop
    
    
    #pragma diag_push
    
    /* using declarations must occur outside header guard to support including both
       C and C++-wrapped version of header; see _CPP_STYLE_HEADER check */
    /* this code is for C++ mode only and thus also not relevant for MISRA */
    #pragma CHECK_MISRA("-19.15")
    
    using std::size_t;
    using std::strlen;
    using std::strcpy;
    using std::strncpy;
    using std::strcat;
    using std::strncat;
    using std::strchr;
    using std::strrchr;
    using std::strcmp;
    using std::strncmp;
    using std::strcoll;
    using std::strxfrm;
    using std::strpbrk;
    using std::strspn;
    using std::strcspn;
    using std::strstr;
    using std::strtok;
    using std::strerror;
    using std::strdup;
    using std::memmove;
    using std::memcpy;
    using std::memcmp;
    using std::memchr;
    using std::memset;
    
    
    
    #pragma diag_pop
    /*****************************************************************************/
    /*  C6X.H v8.2.5                                                             */
    /*                                                                           */
    /* Copyright (c) 1996-2018 Texas Instruments Incorporated                    */
    /* http://www.ti.com/                                                        */
    /*                                                                           */
    /*****************************************************************************/
    
    /*****************************************************************************/
    /*  VECT.H v8.2.5                                                            */
    /*                                                                           */
    /* Copyright (c) 1996-2018 Texas Instruments Incorporated                    */
    /* http://www.ti.com/                                                        */
    /*                                                                           */
    /*****************************************************************************/
    
    
    extern "C"
    {
    
    /*****************************************************************************/
    /* STDINT.H v8.2.5                                                           */
    /*                                                                           */
    /* Copyright (c) 2002-2018 Texas Instruments Incorporated                    */
    /* http://www.ti.com/                                                        */
    /*                                                                           */
    /*****************************************************************************/
    
    /* 7.18.1.1 Exact-width integer types */
    
        typedef   signed char   int8_t;
        typedef unsigned char  uint8_t;
        typedef          short  int16_t;
        typedef unsigned short uint16_t;
        typedef          int    int32_t;
        typedef unsigned int   uint32_t;
    
        typedef          __int40_t  int40_t;
        typedef unsigned __int40_t uint40_t;
    
        typedef          long long  int64_t;
        typedef unsigned long long uint64_t;
    
    /* 7.18.1.2 Minimum-width integer types */
    
        typedef  int8_t   int_least8_t;
        typedef uint8_t  uint_least8_t;
    
        typedef  int16_t  int_least16_t;
        typedef uint16_t uint_least16_t;
        typedef  int32_t  int_least32_t;
        typedef uint32_t uint_least32_t;
    
        typedef  int40_t  int_least40_t;
        typedef uint40_t uint_least40_t;
    
        typedef  int64_t  int_least64_t;
        typedef uint64_t uint_least64_t;
    
    /* 7.18.1.3 Fastest minimum-width integer types */
    
        typedef  int32_t  int_fast8_t;
        typedef uint32_t uint_fast8_t;
        typedef  int32_t  int_fast16_t;
        typedef uint32_t uint_fast16_t;
    
        typedef  int32_t  int_fast32_t;
        typedef uint32_t uint_fast32_t;
    
        typedef  int40_t  int_fast40_t;
        typedef uint40_t uint_fast40_t;
    
        typedef  int64_t  int_fast64_t;
        typedef uint64_t uint_fast64_t;
    
    /* 7.18.1.4 Integer types capable of holding object pointers */
        typedef          int intptr_t;
        typedef unsigned int uintptr_t;
    
    /* 7.18.1.5 Greatest-width integer types */
        typedef          long long intmax_t;
        typedef unsigned long long uintmax_t;
    
    /* 
       According to footnotes in the 1999 C standard, "C++ implementations
       should define these macros only when __STDC_LIMIT_MACROS is defined
       before <stdint.h> is included." 
    */
    
    
    typedef float float32_t;
    
    
    
    
    
    
    
    struct __simd128_int32_t { int32_t _v[4]; } __attribute__((aligned(16))) __attribute__((vector_type)); typedef struct __simd128_int32_t int32x4_t;
    
    typedef int32x4_t __x128_t;
    
    } /* extern "C" */
    
    extern "C"
    {
    
    /*****************************************************************************/
    /*                                                                           */
    /* NOTICE TO THOSE WHO USE INTRINSICS AND PACKED DATA                        */
    /*                                                                           */
    /* This note contains information on a new __float2_t type.                  */
    /* It also contains recommendations on the use of the "double" type.         */
    /*                                                                           */
    /* In order to better support packed data compiler optimizations in the      */
    /* future, the use of the type "double" for packed data is now discouraged   */
    /* and its support may be discontinued in the future.                        */
    /*                                                                           */
    /* There are several recommendations and changes as a result.  Note that     */
    /* these changes do NOT break compatibility with older code (source files    */
    /* or object files).                                                         */
    /*                                                                           */
    /* (1) long long should be used for 64-bit packed integer data.  The double  */
    /*     type should be used only for double-precision floating point values.  */
    /*                                                                           */
    /* (2) There is a new type, __float2_t, that holds two floats and should     */
    /*     be used instead of double for holding two floats.  For now, this new  */
    /*     type is typedef'ed to double in c6x.h, but could be changed in the    */
    /*     future to a structure or vector type to allow better optimization of  */
    /*     packed data floats.  We recommend the use of __float2_t for any       */
    /*     float x2 data instead of double.                                      */
    /*                                                                           */
    /* (3) There are new __float2_t manipulation intrinsics (see below) that     */
    /*     should be used to create and manipulate objects of type __float2_t.   */
    /*                                                                           */
    /* (4) C66 intrinsics that deal with packed float data are now declared      */
    /*     using __float2_t instead of double.  (Those intrinsics are declared   */
    /*     in this file, c6x.h.)                                                 */
    /*                                                                           */
    /* (5) When using any intrinsic that involves __float2_t, c6x.h must be      */
    /*     included.                                                             */
    /*                                                                           */
    /* (6) Certain intrinsics that used double to store fixed-point packed       */
    /*     data have been deprecated.  They will still be supported in the       */
    /*     near future, but their descriptions will be removed from the          */
    /*     compiler user's guide (spru187).  Deprecated: _mpy2, _mpyhi, _mpyli,  */
    /*     _mpysu4, _mpyu4, and _smpy2.  Use the long long versions instead.     */
    /*                                                                           */
    /* Please see                                                                */
    /* http://processors.wiki.ti.com/index.php/C6000_Intrinsics_and_Type_Double  */
    /* and the C6000 Compiler User's Guide (v7.2), spru187, for more             */
    /* information.                                                              */
    /*                                                                           */
    /*****************************************************************************/
    /* If not using host intrinsics, define __float2_t items. */
      typedef double   __float2_t;
      /*-------------------------------------------------------------------------*/
      /* __float2_t manipulation intrinsics                                      */
      /*                                                                         */
      /* Since __float2_t is just a typedef to double at this time, we simply    */
      /* use #defines to "create" the __float2_t manipulation intrinsics.  The   */
      /* __float2_t intrinsics are listed in this comment for convenience.       */
      /*                                                                         */
      /* __float2_t _lltof2(long long)   Reinterpret long long as __float2_t     */
      /* long long  _f2toll(__float2_t)  Reinterpret __float2_t as long long     */
      /* __float2_t _ftof2(float, float) Create a __float2_t from 2 floats       */
      /* float      _hif2(__float2_t)    Return the hi 32 bits of a __float2_t   */
      /* float      _lof2(__float2_t)    Return the lo 32 bits of a __float2_t   */
      /* __int40_t  _f2tol(__float2_t)   Reinterpret __float2_t as 40-bit type   */
      /* __float2_t _ltof2(__int40_t)    Reinterpret 40-bit type as __float2_t   */
      /*                                                                         */
      /* __float2_t & _amem8_f2(void *)         |                                */
      /* __float2_t & _amem8_f2_const(void *)   | Allows (un)aligned loads and   */
      /* __float2_t & _mem8_f2(void *)          | stores of 8 bytes to and from  */
      /* __float2_t & _mem8_f2_const(void *)    | memory.                        */
      /*                                                                         */
      /* __float2_t _fdmv_f2(float, float)  Move two floats with one instruction */
      /* __float2_t _hif2_128(__x128_t)     Return hi 64 bits of __x128_t        */
  /* __float2_t _lof2_128(__x128_t)     Return lo 64 bits of __x128_t        */
      /* __x128_t   _f2to128(__float2_t, __float2_t)  Compose __x128_t           */
      /* __float2_t _fdmvd_f2(float, float) Delayed move of two floats           */
      /*-------------------------------------------------------------------------*/
    
    
      /* _mem8_f2 and _mem8_f2_const for C6400 and compatible */
    
      /* _fdmv_f2 for C6400+ and compatible */
    
      /* __float2_t manipulation intrinsics for __x128_t and C6600 */
    
    
    unsigned  _extu	   (unsigned, unsigned, unsigned);
    int       _ext	   (int,      unsigned, unsigned);
    unsigned  _set	   (unsigned, unsigned, unsigned);
    unsigned  _clr	   (unsigned, unsigned, unsigned);
    unsigned  _extur   (unsigned, int);
    int       _extr	   (int,      int);
    unsigned  _setr	   (unsigned, int);
    unsigned  _clrr	   (unsigned, int);
    int       _sadd	   (int,      int);
    int	  _ssub	   (int,      int);
    int       _sshl	   (int,      unsigned);
    int	  _add2	   (int,      int);
    int	  _sub2	   (int,      int);
    unsigned  _subc	   (unsigned, unsigned);
    unsigned  _lmbd	   (unsigned, unsigned);
    int       _abs	   (int);
    __int40_t _labs	   (__int40_t);
    unsigned  _norm	   (int);
    int	  _smpy	   (int,      int);
    int	  _smpyhl  (int,      int);
    int	  _smpylh  (int,      int);
    int	  _smpyh   (int,      int);
    int	  _mpy	   (int,      int);
    int	  _mpyus   (unsigned, int);
    int	  _mpysu   (int,      unsigned);
    unsigned  _mpyu	   (unsigned, unsigned);
    int	  _mpyhl   (int,      int);
    int	  _mpyhuls (unsigned, int);
    int	  _mpyhslu (int,      unsigned);
    unsigned  _mpyhlu  (unsigned, unsigned);
    int	  _mpylh   (int,      int);
    int	  _mpyluhs (unsigned, int);
    int	  _mpylshu (int,      unsigned);
    unsigned  _mpylhu  (unsigned, unsigned);
    int	  _mpyh	   (int,      int);
    int	  _mpyhus  (unsigned, int);
    int	  _mpyhsu  (int,      unsigned);
    unsigned  _mpyhu   (unsigned, unsigned);
    
    __int40_t _lsadd   (int, __int40_t);
    __int40_t _lssub   (int, __int40_t);
    int       _sat	   (__int40_t);
    unsigned  _lnorm   (__int40_t);
    
    double    _fabs    (double);
    float     _fabsf   (float);
    long long _mpyidll (int,      int);
    int    	  _spint   (float);
    int    	  _dpint   (double);
    float  	  _rcpsp   (float);
    double 	  _rcpdp   (double);
    float  	  _rsqrsp  (float);
    double 	  _rsqrdp  (double);
    
    /*double    _mpyid   (int,      int);  Deprecated.  Use _mpyidll instead. */
    
    unsigned  _hi(double);      /* Return the hi 32 bits of a double as an int    */
    float     _hif(double);     /* Return the hi 32 bits of a double as a float   */
    unsigned  _hill(long long); /* Return the hi 32 bits of a long long as an int */
    unsigned  _lo(double);      /* Return the lo 32 bits of a double as an int    */
    float     _lof(double);     /* Return the lo 32 bits of a double as a float   */
    unsigned  _loll(long long); /* Return the lo 32 bits of a long long as an int */
      
    double 	  _itod(unsigned, unsigned);  /* Create a double from 2 ints    */
    double 	  _ftod(float,    float);     /* Create a double from 2 floats  */
    long long _itoll(unsigned, unsigned); /* Create a long long from 2 ints */
    float  	  _itof(unsigned);            /* Reinterpret int as float.      */
    unsigned  _ftoi(float);               /* Reinterpret float as int.      */
    
    __int40_t _dtol(double);              /* Reinterpret double as 40-bit type    */
    double    _ltod(__int40_t);           /* Reinterpret 40-bit type as double    */
    long long _dtoll(double);             /* Reinterpret double as long long      */
    double    _lltod(long long);          /* Reinterpret long long as double      */
    
      /* Define pseudo intrinsics for some pseudo instructions */
    int       _add4      (int,      int);
    int       _avg2      (int,      int);
    unsigned  _avgu4     (unsigned, unsigned);
    int       _cmpeq2    (int,      int);
    int       _cmpeq4    (int,      int);
    int       _cmpgt2    (int,      int);
    unsigned  _cmpgtu4   (unsigned, unsigned);
    int       _dotp2     (int,      int);
    int       _dotpn2    (int,      int);
    int       _dotpnrsu2 (int,      unsigned);
    int       _dotprsu2  (int,      unsigned);
    int       _dotpsu4   (int,      unsigned);
    unsigned  _dotpu4    (unsigned, unsigned);
    int       _gmpy4     (int,      int);
    __int40_t _ldotp2    (int,      int);
    int       _max2      (int,      int);
    unsigned  _maxu4     (unsigned, unsigned);
    int       _min2      (int,      int);
    unsigned  _minu4     (unsigned, unsigned);
    long long _mpy2ll    (int,      int);
    long long _mpyhill   (int,      int);
    int       _mpyhir    (int,      int);
    long long _mpylill   (int,      int);
    int       _mpylir    (int,      int);
    long long _mpysu4ll  (int,      unsigned);
    long long _mpyu4ll   (unsigned, unsigned);
    unsigned  _pack2     (unsigned, unsigned);
    unsigned  _packh2    (unsigned, unsigned);
    unsigned  _packh4    (unsigned, unsigned);
    unsigned  _packhl2   (unsigned, unsigned);
    unsigned  _packl4    (unsigned, unsigned);
    unsigned  _packlh2   (unsigned, unsigned);
    unsigned  _rotl      (unsigned, unsigned);
    int       _sadd2     (int,      int);
    unsigned  _saddu4    (unsigned, unsigned);
    int       _saddus2   (unsigned, int);
    unsigned  _shlmb     (unsigned, unsigned);
    int       _shr2      (int,      unsigned);
    unsigned  _shrmb     (unsigned, unsigned);
    unsigned  _shru2     (unsigned, unsigned);
    long long _smpy2ll   (int,      int);
    int       _spack2    (int,      int);
    unsigned  _spacku4   (int,      int);
    int       _sshvl     (int,      int);
    int       _sshvr     (int,      int);
    int       _sub4      (int,      int);
    int       _subabs4   (int,      int);
         
    int       _abs2      (int);
    unsigned  _bitc4     (unsigned);
    unsigned  _bitr      (unsigned);
    unsigned  _deal      (unsigned);
    int       _mvd       (int);
    unsigned  _shfl      (unsigned);
    unsigned  _swap4     (unsigned);
    unsigned  _unpkhu4   (unsigned);
    unsigned  _unpklu4   (unsigned);
    unsigned  _xpnd2     (unsigned);
    unsigned  _xpnd4     (unsigned);
    
    /*double  _mpy2      (int,      int);  Deprecated: use _mpy2ll instead */
    /*double  _mpyhi     (int,      int);  Deprecated: use _mpyhill instead */
    /*double  _mpysu4    (int,      unsigned);  Deprecated: use _mpysu4ll instead */
    /*double  _mpyu4     (unsigned, unsigned);  Deprecated: use _mpyu4ll instead */
    /*double  _smpy2     (int,      int);  Deprecated: use _smpy2ll instead */
    
    
    long long _addsub    (int,       int);
    long long _addsub2   (unsigned,  unsigned);
    long long _cmpy      (unsigned,  unsigned);
    unsigned  _cmpyr     (unsigned,  unsigned);
    unsigned  _cmpyr1    (unsigned,  unsigned);
    long long _ddotph2   (long long, unsigned);
    unsigned  _ddotph2r  (long long, unsigned);
    long long _ddotpl2   (long long, unsigned);
    unsigned  _ddotpl2r  (long long, unsigned);
    long long _ddotp4    (unsigned,  unsigned);
    long long _dpack2    (unsigned,  unsigned);
    long long _dpackx2   (unsigned,  unsigned);
    long long _dmv       (unsigned,  unsigned);
    double    _fdmv      (float,     float);
    unsigned  _gmpy      (unsigned,  unsigned);
    long long _mpy32ll   (int,       int);
    int       _mpy32     (int,       int);
    long long _mpy32su   (int,       unsigned);
    long long _mpy32us   (unsigned,  int);
    long long _mpy32u    (unsigned,  unsigned);
    long long _mpy2ir    (unsigned,  int);
    unsigned  _rpack2    (unsigned,  unsigned);
    long long _saddsub   (int,       int);
    long long _saddsub2  (unsigned,  unsigned);
    long long _shfl3     (unsigned,  unsigned);
    int       _smpy32    (int,       int);
    int       _ssub2     (int,       int);
    unsigned  _xormpy    (unsigned,  unsigned);
    
    long long  _dcmpyr1    (long long, long long);
    long long  _dccmpyr1   (long long, long long);
    long long  _cmpy32r1   (long long, long long);
    long long  _ccmpy32r1  (long long, long long);
    long long  _mpyu2      (unsigned,  unsigned);
    int        _dotp4h     (long long, long long);
    long long  _dotp4hll   (long long, long long);
    int        _dotpsu4h   (long long, long long);
    long long  _dotpsu4hll (long long, long long);
    long long  _dadd       (long long, long long);
    long long  _dadd_c     (int,       long long);
    long long  _dsadd      (long long, long long);
    long long  _dadd2      (long long, long long);
    long long  _dsadd2     (long long, long long);
    long long  _dsub       (long long, long long);
    long long  _dssub      (long long, long long);
    long long  _dssub2     (long long, long long);
    long long  _dapys2     (long long, long long);
    long long  _dshr       (long long, unsigned);
    long long  _dshru      (long long, unsigned);
    long long  _dshl       (long long, unsigned);
    long long  _dshr2      (long long, unsigned);
    long long  _dshru2     (long long, unsigned);
    unsigned   _shl2       (unsigned , unsigned);
    long long  _dshl2      (long long, unsigned);
    long long  _dxpnd4     (unsigned);
    long long  _dxpnd2     (unsigned);
    int        _crot90     (int);
    long long  _dcrot90    (long long);
    int        _crot270    (int);
    long long  _dcrot270   (long long);
    long long  _dmax2      (long long, long long);
    long long  _dmin2      (long long, long long);
    long long  _dmaxu4     (long long, long long);
    long long  _dminu4     (long long, long long);
    unsigned   _dcmpgt2    (long long, long long);
    unsigned   _dcmpeq2    (long long, long long);
    unsigned   _dcmpgtu4   (long long, long long);
    unsigned   _dcmpeq4    (long long, long long);
    long long  _davg2      (long long, long long);
    long long  _davgu4     (long long, long long);
    long long  _davgnr2    (long long, long long);
    long long  _davgnru4   (long long, long long);
    long long  _unpkbu4    (unsigned);
    long long  _unpkh2     (unsigned);
    long long  _unpkhu2    (unsigned);
    long long  _dpackl2    (long long, long long);
    long long  _dpackh2    (long long, long long);
    long long  _dpackhl2   (long long, long long);
    long long  _dpacklh4   (unsigned,  unsigned);
    long long  _dpackl4    (long long, long long);
    long long  _dpackh4    (long long, long long);
    long long  _dspacku4   (long long, long long);
    void       _mfence     ();
    __float2_t _dmpysp     (__float2_t, __float2_t);
    __float2_t _daddsp     (__float2_t, __float2_t);
    __float2_t _dsubsp     (__float2_t, __float2_t);
    __float2_t _dinthsp    (unsigned);
    __float2_t _dinthspu   (unsigned);
    __float2_t _dintsp     (long long);
    __float2_t _dintspu    (long long);
    unsigned   _dspinth    (__float2_t);
    long long  _dspint     (__float2_t);
    
    int        _land       (int, int);
    int        _landn      (int, int);
    int        _lor        (int, int);
    
    long long  _dmvd       (int,       int);
    double     _fdmvd      (float,     float);
    
    __float2_t _complex_mpysp           (__float2_t, __float2_t); /* CMPYSP then DADDSP */
    __float2_t _complex_conjugate_mpysp (__float2_t, __float2_t); /* CMPYSP then DSUBSP */
    
    long long  _xorll_c    (int, long long);
    
    __x128_t   __attribute__((builtin)) _dcmpy      (long long, long long);
    __x128_t   __attribute__((builtin)) _dccmpy     (long long, long long);
    long long  __attribute__((builtin)) _cmatmpyr1  (long long, __x128_t);
    long long  __attribute__((builtin)) _ccmatmpyr1 (long long, __x128_t);
    __x128_t   __attribute__((builtin)) _cmatmpy    (long long, __x128_t);
    __x128_t   __attribute__((builtin)) _ccmatmpy   (long long, __x128_t);
    __x128_t   __attribute__((builtin)) _qsmpy32r1  (__x128_t,  __x128_t);
    __x128_t   __attribute__((builtin)) _qmpy32     (__x128_t,  __x128_t);
    __x128_t   __attribute__((builtin)) _dsmpy2     (long long, long long);
    __x128_t   __attribute__((builtin)) _dmpy2      (long long, long long);
    __x128_t   __attribute__((builtin)) _dmpyu2     (long long, long long);
    __x128_t   __attribute__((builtin)) _dmpysu4    (long long, long long);
    __x128_t   __attribute__((builtin)) _dmpyu4     (long long, long long);
    __x128_t   __attribute__((builtin)) _cmpysp     (__float2_t, __float2_t);
    __x128_t   __attribute__((builtin)) _qmpysp     (__x128_t,  __x128_t);
    long long  __attribute__((builtin)) _ddotp4h    (__x128_t,  __x128_t);
    long long  __attribute__((builtin)) _ddotpsu4h  (__x128_t,  __x128_t);
    
    __x128_t   __attribute__((builtin)) _ito128  (unsigned,  unsigned, unsigned, unsigned);
    __x128_t   __attribute__((builtin)) _fto128  (float,     float,    float,    float);
    __x128_t   __attribute__((builtin)) _llto128 (long long, long long);
    __x128_t   __attribute__((builtin)) _dto128  (double,    double);
    
    long long  __attribute__((builtin)) _hi128   (__x128_t);
    double     __attribute__((builtin)) _hid128  (__x128_t);
    long long  __attribute__((builtin)) _lo128   (__x128_t);
    double     __attribute__((builtin)) _lod128  (__x128_t);
    
    unsigned  __attribute__((builtin)) _get32_128  (__x128_t, __attribute__((constrange((0), (3)))) unsigned);
    float     __attribute__((builtin)) _get32f_128 (__x128_t, __attribute__((constrange((0), (3)))) unsigned);
    
    __x128_t  __attribute__((builtin)) _dup32_128 (unsigned);
    
    
    extern __cregister volatile unsigned int AMR;
    extern __cregister volatile unsigned int CSR;
    extern __cregister volatile unsigned int IFR;
    extern __cregister volatile unsigned int ISR;
    extern __cregister volatile unsigned int ICR;
    extern __cregister volatile unsigned int IER;
    extern __cregister volatile unsigned int ISTP;
    extern __cregister volatile unsigned int IRP;
    extern __cregister volatile unsigned int NRP;
    
    extern __cregister volatile unsigned int GFPGFR;
    extern __cregister volatile unsigned int DIER;
    
    extern __cregister volatile unsigned int FADCR;
    extern __cregister volatile unsigned int FAUCR;
    extern __cregister volatile unsigned int FMCR;
    
    extern __cregister volatile unsigned int DESR;
    extern __cregister volatile unsigned int DETR;
    
    extern __cregister volatile unsigned int REP;
    extern __cregister volatile unsigned int TSCL;
    extern __cregister volatile unsigned int TSCH;
    extern __cregister volatile unsigned int ARP;
    extern __cregister volatile unsigned int ILC;
    extern __cregister volatile unsigned int RILC;
    extern __cregister volatile unsigned int PCE1;
    extern __cregister volatile unsigned int DNUM;
    extern __cregister volatile unsigned int SSR;
    extern __cregister volatile unsigned int GPLYA;
    extern __cregister volatile unsigned int GPLYB;
    extern __cregister volatile unsigned int TSR;
    extern __cregister volatile unsigned int ITSR;
    extern __cregister volatile unsigned int NTSR;
    extern __cregister volatile unsigned int ECR;
    extern __cregister volatile unsigned int EFR;
    extern __cregister volatile unsigned int IERR;
    
    extern __cregister volatile unsigned int DMSG;
    extern __cregister volatile unsigned int CMSG;
    extern __cregister volatile unsigned int DT_DMA_ADDR;
    extern __cregister volatile unsigned int DT_DMA_DATA;
    extern __cregister volatile unsigned int DT_DMA_CNTL;
    extern __cregister volatile unsigned int TCU_CNTL;
    extern __cregister volatile unsigned int RTDX_REC_CNTL;
    extern __cregister volatile unsigned int RTDX_XMT_CNTL;
    extern __cregister volatile unsigned int RTDX_CFG;
    extern __cregister volatile unsigned int RTDX_RDATA;
    extern __cregister volatile unsigned int RTDX_WDATA;
    extern __cregister volatile unsigned int RTDX_RADDR;
    extern __cregister volatile unsigned int RTDX_WADDR;
    extern __cregister volatile unsigned int MFREG0;
    extern __cregister volatile unsigned int DBG_STAT;
    extern __cregister volatile unsigned int BRK_EN;
    extern __cregister volatile unsigned int HWBP0_CNT;
    extern __cregister volatile unsigned int HWBP0;
    extern __cregister volatile unsigned int HWBP1;
    extern __cregister volatile unsigned int HWBP2;
    extern __cregister volatile unsigned int HWBP3;
    extern __cregister volatile unsigned int OVERLAY;
    extern __cregister volatile unsigned int PC_PROF;
    extern __cregister volatile unsigned int ATSR;
    extern __cregister volatile unsigned int TRR;
    extern __cregister volatile unsigned int TCRR;
    
    } /* extern "C" */
    
    /*****************************************************************************/
    /* DATA_IS_ALIGNED_2, DATA_IS_ALIGNED_4, DATA_IS_ALIGNED_8 -                 */
    /*     Tell the compiler that data is already aligned to a 2-byte, 4-byte    */
    /*     or 8-byte boundary.  Note: this macro does not change the             */
    /*     alignment of data.  Use DATA_ALIGN to change alignment.               */
    /*****************************************************************************/
    
    
    /*****************************************************************************/
    /* SAVE_AMR -                                                                */
    /*     Define a local 'volatile unsigned int' variable in your interrupt     */
    /*     routine.                                                              */
    /*     When invoking this macro, pass that local variable to save the AMR.   */
    /*                                                                           */
    /*     If you interrupted an assembly coded routine that may be using        */
    /*     circular addressing, and you interrupt into a C coded interrupt       */
    /*     service routine, you need to set the AMR to 0 for the C code and save */
    /*     off the AMR register, so that it will have the correct value upon     */
    /*     leaving the C interrupt service routine and returning to the assembly */
    /*     code.                                                                 */
    /*                                                                           */
/*     Add this macro immediately after your local variable definitions      */
    /*     and before the start of your C interrupt code.                        */
    /*****************************************************************************/
    
    /*****************************************************************************/
    /* RESTORE_AMR -                                                             */
    /*    When invoking this macro, pass the same local variable that was passed */
    /*    to the SAVE_AMR macro.  This macro will restore the AMR to the value   */
    /*    it had when interrupted out of the hand assembly routine.              */
    /*                                                                           */
    /*    Add this macro immediately before exiting the C interrupt service      */
    /*    routine.                                                               */ 
    /*****************************************************************************/
    
    /*****************************************************************************/
    /* SAVE_SAT -                                                                */
    /*     Define a local 'volatile unsigned int' variable in your interrupt     */
    /*     routine.                                                              */
    /*     When invoking this macro, pass that local variable to save the SAT    */
    /*     bit.                                                                  */
    /*                                                                           */
    /*     If you interrupted a routine that was performing saturated arithmetic */
    /*     and the interrupt service routine is also performing saturated        */
    /*     arithmetic, then you must save and restore the SAT bit in your        */
    /*     interrupt service routine.                                            */
    /*                                                                           */
/*     Add this macro immediately after your local variable definitions      */
    /*     and before the start of your C interrupt code.                        */
    /*****************************************************************************/
    
    /*****************************************************************************/
    /* RESTORE_SAT -                                                             */
    /*    When invoking this macro, pass the same local variable that was passed */
    /*    to the SAVE_SAT macro.  This macro will restore the SAT bit to the     */
    /*    value it had when your application was interrupted.                    */
    /*                                                                           */
    /*    Add this macro immediately before exiting the C interrupt service      */
    /*    routine.                                                               */ 
    /*****************************************************************************/
    /*****************************************************************************/
    /* assert.h   v8.2.5                                                         */
    /*                                                                           */
    /* Copyright (c) 1993-2018 Texas Instruments Incorporated                    */
    /* http://www.ti.com/                                                        */
    /*                                                                           */
    /*  Redistribution and  use in source  and binary forms, with  or without    */
    /*  modification,  are permitted provided  that the  following conditions    */
    /*  are met:                                                                 */
    /*                                                                           */
    /*     Redistributions  of source  code must  retain the  above copyright    */
    /*     notice, this list of conditions and the following disclaimer.         */
    /*                                                                           */
    /*     Redistributions in binary form  must reproduce the above copyright    */
    /*     notice, this  list of conditions  and the following  disclaimer in    */
    /*     the  documentation  and/or   other  materials  provided  with  the    */
    /*     distribution.                                                         */
    /*                                                                           */
    /*     Neither the  name of Texas Instruments Incorporated  nor the names    */
    /*     of its  contributors may  be used to  endorse or  promote products    */
    /*     derived  from   this  software  without   specific  prior  written    */
    /*     permission.                                                           */
    /*                                                                           */
    /*  THIS SOFTWARE  IS PROVIDED BY THE COPYRIGHT  HOLDERS AND CONTRIBUTORS    */
    /*  "AS IS"  AND ANY  EXPRESS OR IMPLIED  WARRANTIES, INCLUDING,  BUT NOT    */
    /*  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR    */
    /*  A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT    */
    /*  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,    */
    /*  SPECIAL,  EXEMPLARY,  OR CONSEQUENTIAL  DAMAGES  (INCLUDING, BUT  NOT    */
    /*  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,    */
    /*  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY    */
    /*  THEORY OF  LIABILITY, WHETHER IN CONTRACT, STRICT  LIABILITY, OR TORT    */
    /*  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE    */
    /*  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.     */
    /*                                                                           */
    /*****************************************************************************/
    
    
    
    
    #pragma diag_push
    #pragma CHECK_MISRA("-6.3") /* standard types required for standard headers */
    #pragma CHECK_MISRA("-19.4") /* macros required for implementation */
    #pragma CHECK_MISRA("-19.7") /* macros required for implementation */
    #pragma CHECK_MISRA("-19.13") /* # and ## required for implementation */
    
    
    /*---------------------------------------------------------------------------*/
    /* <cassert> IS RECOMMENDED OVER <assert.h>.  <assert.h> IS PROVIDED FOR     */
    /* COMPATIBILITY WITH C AND THIS USAGE IS DEPRECATED IN C++                  */
    /*---------------------------------------------------------------------------*/
    
    
    extern "C" namespace std
    {
    
    extern  void __c6xabi_abort_msg(const char *msg);
    
    
    
    } /* extern "C" namespace std */
    
    #pragma diag_pop
    
    
    #pragma diag_push
    
    /* using declarations must occur outside header guard to support including both
       C and C++-wrapped version of header; see _CPP_STYLE_HEADER check */
    /* this code is for C++ mode only and thus also not relevant for MISRA */
    #pragma CHECK_MISRA("-19.15")
    
    using std::_nassert;
    
    
    
    #pragma diag_pop
    
    /***********************************/
    /*   Problem 1                     */
    /***********************************/
    
    class CComp{
      private:
    	float m_fPreGain;
    	float m_pfAmpRingBuf[64]; // without this member, case 1-A improves from 10 cycles/sample to 0.5 cycles/sample.
    
      public:
    	CComp();
    	void Calc(float pfData[256]);
    
    };
    
    
    CComp::CComp(void)
    {
    	m_fPreGain = 2.0F;
    }
    
    void CComp::Calc( float pfData[256] )
    {
    	int i;
    
    	float pfTemp[256];
    	float fPreGain = m_fPreGain;
    	memcpy( pfTemp, pfData, sizeof(float)*256 );
    
    	_nassert(((int)pfData % 8) == 0);
    
    	// pre gain
    	for (i = 0; i < 256; i++){
    		pfData[i] *= m_fPreGain; // case 1-A: ii = 20, Loop Unroll =  2x  -->   10 cycle/sample
    //		pfData[i] *= fPreGain;   // case 1-B: ii =  8, Loop Unroll =  8x  -->    1 cycle/sample
    
    //		pfTemp[i] *= m_fPreGain; // case 1-C: ii =  2, Loop Unroll =  2x  -->    1 cycle/sample
    //		pfTemp[i] *= fPreGain;   // case 1-D: ii =  4, Loop Unroll =  8x  -->  0.5 cycle/sample
    	}
    }
    
    /***********************************/
    /*   Problem 2                     */
    /***********************************/
    
    
    typedef struct Complex{
    	float re;
    	float im;
    }Complex;
    
    typedef Complex XMatrix[(8)][(8)];
    
    void CCovarianceMatrix_Update( const XMatrix xm, XMatrix xmAve, float fUpdateCoef )
    {
    	int i, k;
    	float fUpdateCoefInv = 1.0F - fUpdateCoef;
    	float *pf1;
    	float *pf2;
    
    	_nassert(((int)xm % 8) == 0);
    	_nassert(((int)xmAve % 8) == 0);
    
    	// case 2-A
    	for(i=0;i<(8);i++){
    		for(k=0;k<(8);k++){ // 26 cycles
    			xmAve[i][k].re = fUpdateCoef * xmAve[i][k].re + fUpdateCoefInv * xm[i][k].re;
    			xmAve[i][k].im = fUpdateCoef * xmAve[i][k].im + fUpdateCoefInv * xm[i][k].im;
    		}
    	}
    
    	// case 2-B
    	for(i=0;i<(8);i++){
    		pf1 = (float *)xm[i];
    		pf2 = (float *)xmAve[i];
    		for(k=0;k<(8);k++){ // 3 cycles
    			*pf2 = fUpdateCoef * (*pf2) + fUpdateCoefInv * (*pf1);
    			pf1++;
    			pf2++;
    			*pf2 = fUpdateCoef * (*pf2) + fUpdateCoefInv * (*pf1);
    			pf1++;
    			pf2++;
    		}
    	}
    
    	// case 2-C
    	pf1 = (float *)xm;
    	pf2 = (float *)xmAve;
    
    	for(i=0;i<(8)*(8);i++){ // 14 or 2 cycles
    		*pf2 = fUpdateCoef * (*pf2) + fUpdateCoefInv * (*pf1);
    		pf1++;
    		pf2++;
    		*pf2 = fUpdateCoef * (*pf2) + fUpdateCoefInv * (*pf1);
    		pf1++;
    		pf2++;
    	}
    
    	// case 2-D
    	XMatrix xmAve2;
    	XMatrix xm2;
    
    	memcpy( xm2, xm, sizeof(XMatrix) );
    	memcpy( xmAve2, xmAve, sizeof(XMatrix) );
    
    	pf1 = (float *)xm2;
    	pf2 = (float *)xmAve2;
    
    	for(i=0;i<(8)*(8);i++){ // 2 cycles
    		*pf2 = fUpdateCoef * (*pf2) + fUpdateCoefInv * (*pf1);
    		pf1++;
    		pf2++;
    		*pf2 = fUpdateCoef * (*pf2) + fUpdateCoefInv * (*pf1);
    		pf1++;
    		pf2++;
    	}
    
    	memcpy( xmAve, xmAve2, sizeof(XMatrix) );
    }
    

  • Hi George,

    Thank you for your support.
    Could you download the attached pp file?
    I would be glad if you could tell me how to optimize these loops.

    Best regards,
    Y_S
  • I recommend you change the function CComp::Calc to the following ...

    void CComp::Calc( float pfData[restrict 256] )
    {
    	int i;
    	float fPreGain = m_fPreGain;
    
    	_nassert(((int)pfData % 8) == 0);
    
    	// pre gain
    	for (i = 0; i < 256; i++){
    		pfData[i] *= fPreGain;
    	}
    }

    You never showed which build options you use.  I built it with this command:

    % cl6x -mv6600 --opt_level=3 --debug_software_pipeline file.cpp

    Then inspect the resulting .asm file.  After accounting for unrolling, this loop produces a result every 0.5 cycles.
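The cycles-per-sample figures used throughout this thread come from two values in the software-pipelining comment block that the compiler writes into the .asm file: the iteration interval (ii) and the loop unroll multiple. A new pipelined iteration starts every ii cycles and covers "unroll" original iterations, so cycles per sample is roughly ii / unroll. The sketch below only illustrates that arithmetic; the helper names are hypothetical, and prolog/epilog stages and call overhead are deliberately ignored.

```cpp
#include <cassert>

// Cycles spent per original loop iteration (one audio sample) in the
// software-pipelined kernel. ii and unroll are the values reported on the
// "ii = ..." and "Loop Unroll Multiple" lines of the .asm comment block.
// Hypothetical helper, for illustration only.
double cyclesPerSample(int ii, int unroll)
{
    assert(ii > 0 && unroll > 0);
    return static_cast<double>(ii) / unroll;
}

// Approximate kernel cycles for tripCount samples: the unrolled kernel runs
// tripCount / unroll times, starting a new iteration every ii cycles.
// Prolog/epilog stages and function-call overhead are NOT included.
long kernelCycles(int tripCount, int ii, int unroll)
{
    assert(ii > 0 && unroll > 0 && tripCount % unroll == 0);
    return static_cast<long>(tripCount / unroll) * ii;
}
```

With the values quoted for case 1-A (ii = 20, unroll 2x) this gives 10 cycles/sample, or about 2560 kernel cycles for 256 samples; case 1-D (ii = 4, unroll 8x) gives 0.5 cycles/sample, or about 128 kernel cycles.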

    There are two changes to point out.  The first is the use of restrict on pfData.  This tells the compiler that, within this function, pfData is the only way the memory it points to is accessed.  To understand more, please read about how the restrict keyword is used in this article.  The second change is copying the member variable m_fPreGain to a local variable.  I don't have a good explanation for why that helps, but the results are very good.
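One plausible reason the local copy helps (an assumption, not something confirmed in this thread): inside a member function, m_fPreGain is really this->m_fPreGain, a float read through a pointer. Since pfData is also a float pointer, the compiler generally has to allow for a store to pfData[i] changing that float, so it must reload the gain after every store. Copying it to a local removes that dependency, and restrict achieves the same effect by promising that no such overlap exists. The host-side sketch below uses hypothetical function names and shows that reading through the pointer and reading a local copy really are different programs when the gain overlaps the buffer. The RESTRICT macro is also an assumption of this sketch: it maps to restrict under the TI compiler (as in the reply above) and to the common __restrict extension elsewhere.

```cpp
#include <cassert>

// Portable spelling of restrict for this sketch (assumption): the TI compiler
// accepted restrict in C++ mode in the reply above; GCC, Clang and MSVC
// spell the extension __restrict.
#if defined(__TI_COMPILER_VERSION__)
#define RESTRICT restrict
#else
#define RESTRICT __restrict
#endif

// Reads the gain through a pointer on every iteration, like case 1-A reads
// this->m_fPreGain. If gain points into data, a store to data[i] changes the
// gain used by later iterations, so the compiler must reload it each time.
void scaleReload(float* data, const float* gain, int n)
{
    for (int i = 0; i < n; i++) {
        data[i] *= *gain;
    }
}

// Copies the gain to a local once, like fPreGain in case 1-B. Stores to data
// can no longer affect it, so that dependency between iterations disappears.
void scaleHoisted(float* data, const float* gain, int n)
{
    const float g = *gain;
    for (int i = 0; i < n; i++) {
        data[i] *= g;
    }
}

// Same loop as scaleReload, but RESTRICT promises that memory written through
// data is accessed only through data in this function, so the compiler is
// allowed to load *gain once. Calling it with gain pointing into data would
// break that promise (undefined behaviour).
void scaleRestrict(float* RESTRICT data, const float* gain, int n)
{
    for (int i = 0; i < n; i++) {
        data[i] *= *gain;
    }
}
```

For example, with data = {2, 3, 4} and gain pointing at data[0], scaleReload produces {4, 12, 16} while scaleHoisted produces {4, 6, 8}. Because the results can differ, the compiler cannot hoist the load on its own unless it can prove there is no overlap.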

    As for the second problem ... I made several attempts to optimize it concisely, but failed.  I have contacted one of the experts on the compiler development team and expect to hear back on Monday.

    Thanks and regards,

    -George

  • Thank you for your reply.

    I build the program with the options below:

    -mv6600 -O2

    As for problem 1,

    I understand your advice about using restrict, and it should also help with other optimizations.

    However, even in case 1-A, if I remove m_pfAmpRingBuf[64], the loop produces a result every 0.5 cycles.

    I'd like to know why the presence or absence of m_pfAmpRingBuf[] makes this difference.

    As for problem 2,

    this problem is more important to me because most of our functions are matrix operations.

    These matrix operations consume most of the DSP's computational power.

    So if you get an answer from the compiler expert, please let me know.

    Thanks and best regards,

    Y_S

  • Y_S said:

    As for problem 1,

    I understand your advice about using restrict, and it should also help with other optimizations.

    However, even in case 1-A, if I remove m_pfAmpRingBuf[64], the loop produces a result every 0.5 cycles.

    I can reproduce the result, but I don't know why.  I'll ask an expert on the compiler development team.

    Y_S said:
    As for problem 2,

    Consider rewriting it like this ...

    typedef struct Complex{
        float comps[2];
        float& re()      { return comps[0]; }
        float& im()      { return comps[1]; }
        float re() const { return comps[0]; }
        float im() const { return comps[1]; }
    } Complex;
    
    typedef Complex XMatrix[(8)][(8)];
    
    void CCovarianceMatrix_Update(const XMatrix xm, XMatrix xmAve, float fUpdateCoef)
    {
        int i, k;
        float fUpdateCoefInv = 1.0F - fUpdateCoef;
    
        _nassert(((int)xm % 8) == 0);
        _nassert(((int)xmAve % 8) == 0);
    
        for(i=0; i < (8); i++)
        {
            for(k=0; k < (8); k++)
            {
                xmAve[i][k].re() =   fUpdateCoef    * xmAve[i][k].re()
                                   + fUpdateCoefInv * xm[i][k].re();
                xmAve[i][k].im() =   fUpdateCoef    * xmAve[i][k].im()
                                   + fUpdateCoefInv * xm[i][k].im();
            }
        }
    }

    This achieves ii = 2.  It uses an array-based layout similar to case 2-B, while the re() and im() member functions keep the code as readable as the original .re and .im fields.
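If other code relies on the old layout, for example the (float *) casts in cases 2-B to 2-D, it may be worth confirming on the host that the array-based Complex still occupies exactly two floats and computes the same values as the original field-based struct. The sketch below assumes that goal; ComplexOld, ComplexNew, updateOld and updateNew are hypothetical names, N stands in for the 8 used in the thread, and _nassert is omitted because it is a TI intrinsic.

```cpp
#include <cassert>

const int N = 8;  // matrix dimension used in the thread

// Original field-based layout (case 2-A).
struct ComplexOld { float re; float im; };

// Array-based layout from the reply above, with accessor functions.
struct ComplexNew {
    float comps[2];
    float& re()       { return comps[0]; }
    float& im()       { return comps[1]; }
    float  re() const { return comps[0]; }
    float  im() const { return comps[1]; }
};

// Both layouts should be two floats with no padding, so code that treats the
// matrix as a flat float array (cases 2-B to 2-D) sees the same memory.
static_assert(sizeof(ComplexOld) == 2 * sizeof(float), "unexpected padding");
static_assert(sizeof(ComplexNew) == 2 * sizeof(float), "unexpected padding");

// Case 2-A update with the original struct.
void updateOld(const ComplexOld xm[N][N], ComplexOld xmAve[N][N], float coef)
{
    const float inv = 1.0F - coef;
    for (int i = 0; i < N; i++) {
        for (int k = 0; k < N; k++) {
            xmAve[i][k].re = coef * xmAve[i][k].re + inv * xm[i][k].re;
            xmAve[i][k].im = coef * xmAve[i][k].im + inv * xm[i][k].im;
        }
    }
}

// Same update with the array-based struct and accessors.
void updateNew(const ComplexNew xm[N][N], ComplexNew xmAve[N][N], float coef)
{
    const float inv = 1.0F - coef;
    for (int i = 0; i < N; i++) {
        for (int k = 0; k < N; k++) {
            xmAve[i][k].re() = coef * xmAve[i][k].re() + inv * xm[i][k].re();
            xmAve[i][k].im() = coef * xmAve[i][k].im() + inv * xm[i][k].im();
        }
    }
}
```

Running both updates on the same inputs and comparing element by element is a cheap regression check before measuring ii on the target.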

    Thanks and regards,

    -George

  • George Mock said:
    I can reproduce the result.  But I don't know why.  I'll ask an expert on the compiler development team.

    I apologize.  I overlooked this one.  I'll get back to you next week.

    Thanks and regards,

    -George

  • George Mock said:

    As for problem 1,

    I understand your advice about using restrict, and it should also help with other optimizations.

    However, even in case 1-A, if I remove m_pfAmpRingBuf[64], the loop produces a result every 0.5 cycles.

    I can reproduce that result, but I don't know why.  I'll ask an expert on the compiler development team.

    It turns out this is due to a problem in the compiler.  The entry CODEGEN-6296 has been filed in the SDOWP system to have it investigated.  You are welcome to follow it using the SDOWP link in my signature below.

    Thanks and regards,

    -George