Compiler/TMS320C6657: Loop optimization

Negin Madani

Part Number: TMS320C6657

Tool/software: TI C/C++ Compiler

I have this loop that I am trying to optimize:

      for (j=0; j<L_SUBFR; j++){
    	  pG729ADecoder->Postfilt.scal_res2[j] = pG729ADecoder->Postfilt.res2[j]>>2;
      }

where L_SUBFR is 40. Here is the pipeline information:

If I try the restrict keyword on scal_res2 and res2, this is what I get with no decrease in cycle count:

I have also tried the MUST_ITERATE and UNROLL pragmas with different unroll factors, but the cycle count only increases.

Is there a way to optimize this loop? I would appreciate any suggestion!

over 6 years ago

0 George Mock over 6 years ago

TI Guru 232790 points

Negin said:
If I try the restrict keyword on scal_res2 and res2, this is what I get with no decrease in cycle count:

Try making scal_res2 and res2 local variables, and restrict modify them. If that doesn't work, then please submit a test case I can compile. It does not have to run. Follow the directions in the article How to Submit a Compiler Test Case.

Thanks and regards,

-George

0 Victor Kazmirenko over 6 years ago in reply to George Mock

Guru 13042 points

I vote with both hands for George's suggestion. Nested dereferencing is the first thing to get rid of. I can confirm from past experience with similar arrays that were members of nested structures: making local pointers to the actual data, with restrict qualification, helps a lot.

0 Negin Madani over 6 years ago in reply to George Mock

Prodigy 230 points

George Mock said:
Try making scal_res2 and res2 local variables, and restrict modify them.

This did save me a lot of cycles!

Here is what I did:

 Word16 * restrict pres2;
 Word16 * restrict pscal_res2;

 pres2 = pG729ADecoder->Postfilt.res2;
 pscal_res2 = pG729ADecoder->Postfilt.scal_res2;

.
.
.

      for (j=0; j<L_SUBFR; j++)
      {
    	  pscal_res2[j] = pres2[j]>>2;
      }

Based on profiling information, I can verify that the number of cycles for the function containing the loop has decreased. However, the .asm file does not show any software pipeline information for this loop after making this modification.

Do you know why this happened?

Thanks,

Negin

0 George Mock over 6 years ago in reply to Negin Madani

TI Guru 232790 points

Negin said:
Do you know why this happened?

Unfortunately, no. Please submit a test case as described in my previous post.

Thanks and regards,

-George

0 Victor Kazmirenko over 6 years ago in reply to Negin Madani

Guru 13042 points

Hi!

There is no easy way to guess why pipelining was not applied. We need to see the options and the output, as described in the test case submission guide George mentioned.

You might gain more by continuing to optimise this loop. By using local pointers you eliminated unnecessary dereference operations, and with the restrict qualification you told the compiler that source and destination do not overlap, which gives it more flexibility. But you can do even more. You may benefit from SIMD instructions, because your 16-bit data could be loaded, processed and stored in packed form. For that you need to ensure, and tell the compiler, that your data pointers are aligned. To my knowledge, members of structures are aligned at least as strictly as their type requires, but arrays are aligned on a larger boundary, 16 bytes on C66 IIRC. Thus it should be safe to tell the compiler that the data are aligned, using _nassert. I expect the compiler might then issue the SHR2 instruction to process your data in pairs.
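
To make the alignment hint concrete, here is a minimal sketch, not tested on the C6657. It assumes that both arrays really are aligned on an 8-byte boundary, which has to be verified for the actual structure layout. The function name scale_res2 is made up for illustration. _nassert is the TI compiler intrinsic; the fallback macro is only there so the same file also builds with a host compiler.

```c
typedef short Word16;
#define L_SUBFR 40

/* _nassert is a TI compiler intrinsic. On other compilers make it a no-op
   so the same source still builds and runs on a host machine. */
#ifndef __TI_COMPILER_VERSION__
#define _nassert(cond) ((void)0)
#endif

/* Hypothetical helper: dst[j] = src[j] >> 2 for one subframe. */
void scale_res2(Word16 *restrict dst, const Word16 *restrict src)
{
    int j;

    /* ASSUMPTION: the caller guarantees that both buffers start on an
       8-byte boundary. If that is not true, the hint is a lie and the
       generated code may misbehave. */
    _nassert(((int)dst & 7) == 0);
    _nassert(((int)src & 7) == 0);

    for (j = 0; j < L_SUBFR; j++) {
        dst[j] = (Word16)(src[j] >> 2);
    }
}
```

It would be called as scale_res2(pG729ADecoder->Postfilt.scal_res2, pG729ADecoder->Postfilt.res2). Whether the compiler then actually picks SHR2 has to be checked in the generated .asm file.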

0 Negin Madani over 6 years ago in reply to George Mock

Prodigy 230 points

George Mock said:
Please submit a test case as described in my previous post.

I have attached the preprocessed file.

Compiler version: 8.1.4

Compiler options:

"C:\\ti\\ccsv6\\utils\\bin\\gmake" -k src/POSTFILT.obj 
'Building file: C:/.../POSTFILT.C'
'Invoking: C6000 Compiler'
"C:/ti/ccsv6/tools/compiler/ti-cgt-c6000_8.1.4/bin/cl6x" -mv6600 --abi=eabi -O3 --opt_for_speed=5 --include_path="C:/ti/ccsv6/tools/compiler/ti-cgt-c6000_8.1.4/include" --include_path="../../inc" --advice:performance=all -g --preproc_with_comment --preproc_with_compile --diag_wrap=off --display_error_number --diag_warning=225 --no_bad_aliases --debug_software_pipeline --obj_directory="src"  "C:/.../POSTFILT.C"
"C:/.../POSTFILT.C", line 169: advice #30009: (Performance) If you know that this loop will always execute at a multiple of <20> and at least <20> times, try adding "#pragma MUST_ITERATE(20, ,20)" just before the loop.
"C:/.../POSTFILT.C", line 158: advice #30009: (Performance) If you know that this loop will always execute at a multiple of <42> and at least <42> times, try adding "#pragma MUST_ITERATE(42, ,42)" just before the loop.
'Finished building: C:/.../POSTFILT.C'
' '

Thanks,

Negin
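
For reference, advice #30009 in the build output asks for a pragma placed directly before the loop. Below is a minimal sketch of what that looks like, assuming a loop whose trip count n is only known at run time. The function shift_right_by_2 and its parameters are made up for illustration and are not taken from POSTFILT.C. In MUST_ITERATE the first argument is the minimum trip count, the second (left empty, as in the advice) is the maximum, and the third is the multiple.

```c
/* Hypothetical example, not from POSTFILT.C. */
void shift_right_by_2(short *restrict dst, const short *restrict src, int n)
{
    int j;

    /* ASSUMPTION: every caller passes n >= 20 and n is a multiple of 20.
       The pragma is TI-specific, so it is hidden from other compilers. */
#ifdef __TI_COMPILER_VERSION__
    #pragma MUST_ITERATE(20, , 20)
#endif
    for (j = 0; j < n; j++) {
        dst[j] = (short)(src[j] >> 2);
    }
}
```

The pragma is a promise to the compiler, so it is only safe when the stated minimum and multiple hold for every call.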

POSTFILT.txt

/*------------------------------------------------------------------------*
 *                         POSTFILTER.C                                   *
 *------------------------------------------------------------------------*
 * Performs adaptive postfiltering on the synthesis speech                *
 * This file contains all functions related to the post filter.           *
 *------------------------------------------------------------------------*/
/*****************************************************************************/
/* string.h   v8.1.4                                                         */
/*                                                                           */
/* Copyright (c) 1993-2017 Texas Instruments Incorporated                    */
/* http://www.ti.com/                                                        */
/*                                                                           */
/*  Redistribution and  use in source  and binary forms, with  or without    */
/*  modification,  are permitted provided  that the  following conditions    */
/*  are met:                                                                 */
/*                                                                           */
/*     Redistributions  of source  code must  retain the  above copyright    */
/*     notice, this list of conditions and the following disclaimer.         */
/*                                                                           */
/*     Redistributions in binary form  must reproduce the above copyright    */
/*     notice, this  list of conditions  and the following  disclaimer in    */
/*     the  documentation  and/or   other  materials  provided  with  the    */
/*     distribution.                                                         */
/*                                                                           */
/*     Neither the  name of Texas Instruments Incorporated  nor the names    */
/*     of its  contributors may  be used to  endorse or  promote products    */
/*     derived  from   this  software  without   specific  prior  written    */
/*     permission.                                                           */
/*                                                                           */
/*  THIS SOFTWARE  IS PROVIDED BY THE COPYRIGHT  HOLDERS AND CONTRIBUTORS    */
/*  "AS IS"  AND ANY  EXPRESS OR IMPLIED  WARRANTIES, INCLUDING,  BUT NOT    */
/*  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR    */
/*  A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT    */
/*  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,    */
/*  SPECIAL,  EXEMPLARY,  OR CONSEQUENTIAL  DAMAGES  (INCLUDING, BUT  NOT    */
/*  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,    */
/*  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY    */
/*  THEORY OF  LIABILITY, WHETHER IN CONTRACT, STRICT  LIABILITY, OR TORT    */
/*  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE    */
/*  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.     */
/*                                                                           */
/*****************************************************************************/



#pragma diag_push
#pragma CHECK_MISRA("-6.3") /* standard types required for standard headers */
#pragma CHECK_MISRA("-19.1") /* #includes required for implementation */
#pragma CHECK_MISRA("-20.1") /* standard headers must define standard names */
#pragma CHECK_MISRA("-20.2") /* standard headers must define standard names */

 

typedef unsigned size_t;

/*****************************************************************************/
/* linkage.h   v8.1.4                                                        */
/*                                                                           */
/* Copyright (c) 1998-2017 Texas Instruments Incorporated                    */
/* http://www.ti.com/                                                        */
/*                                                                           */
/*  Redistribution and  use in source  and binary forms, with  or without    */
/*  modification,  are permitted provided  that the  following conditions    */
/*  are met:                                                                 */
/*                                                                           */
/*     Redistributions  of source  code must  retain the  above copyright    */
/*     notice, this list of conditions and the following disclaimer.         */
/*                                                                           */
/*     Redistributions in binary form  must reproduce the above copyright    */
/*     notice, this  list of conditions  and the following  disclaimer in    */
/*     the  documentation  and/or   other  materials  provided  with  the    */
/*     distribution.                                                         */
/*                                                                           */
/*     Neither the  name of Texas Instruments Incorporated  nor the names    */
/*     of its  contributors may  be used to  endorse or  promote products    */
/*     derived  from   this  software  without   specific  prior  written    */
/*     permission.                                                           */
/*                                                                           */
/*  THIS SOFTWARE  IS PROVIDED BY THE COPYRIGHT  HOLDERS AND CONTRIBUTORS    */
/*  "AS IS"  AND ANY  EXPRESS OR IMPLIED  WARRANTIES, INCLUDING,  BUT NOT    */
/*  LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR    */
/*  A PARTICULAR PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT    */
/*  OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,    */
/*  SPECIAL,  EXEMPLARY,  OR CONSEQUENTIAL  DAMAGES  (INCLUDING, BUT  NOT    */
/*  LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,    */
/*  DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY    */
/*  THEORY OF  LIABILITY, WHETHER IN CONTRACT, STRICT  LIABILITY, OR TORT    */
/*  (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE    */
/*  OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.     */
/*                                                                           */
/*****************************************************************************/


#pragma diag_push
#pragma CHECK_MISRA("-19.4") /* macros required for implementation */

/*--------------------------------------------------------------------------*/
/* Define _CODE_ACCESS ==> how to call RTS functions                        */
/*--------------------------------------------------------------------------*/

/*--------------------------------------------------------------------------*/
/* Define _DATA_ACCESS ==> how to access RTS global or static data          */
/*--------------------------------------------------------------------------*/
/*--------------------------------------------------------------------------*/
/* Define _DATA_ACCESS_NEAR ==> some C6000 RTS data must always be near     */
/*--------------------------------------------------------------------------*/

/*--------------------------------------------------------------------------*/
/* Define _IDECL ==> how inline functions are declared                      */
/*--------------------------------------------------------------------------*/

/*--------------------------------------------------------------------------*/
/* If compiling with non-TI compiler (e.g. GCC), nullify any TI-specific    */
/* language extensions.                                                     */
/*--------------------------------------------------------------------------*/

#pragma diag_pop


#pragma diag_push
#pragma CHECK_MISRA("-19.4") /* macros required for implementation */


#pragma diag_pop

static __inline size_t  strlen(const char *string);

static __inline char *strcpy(char *dest, const char *src);
static __inline char *strncpy(char *dest, const char *src, size_t n);
static __inline char *strcat(char *string1, const char *string2);
static __inline char *strncat(char *dest, const char *src, size_t n);
static __inline char *strchr(const char *string, int c);
static __inline char *strrchr(const char *string, int c);

static __inline int  strcmp(const char *string1, const char *string2);
static __inline int  strncmp(const char *string1, const char *string2, size_t n);

 int     strcoll(const char *string1, const char *_string2);
 size_t  strxfrm(char *to, const char *from, size_t n);
 char   *strpbrk(const char *string, const char *chs);
 size_t  strspn(const char *string, const char *chs);
 size_t  strcspn(const char *string, const char *chs);
 char   *strstr(const char *string1, const char *string2);
 char   *strtok(char *str1, const char *str2);
 char   *strerror(int _errno);
 char   *strdup(const char *string);


 void   *memmove(void *s1, const void *s2, size_t n);
#pragma diag_push
#pragma CHECK_MISRA("-16.4") /* false positives due to builtin declarations */
 void   *memcpy(void *s1, const void *s2, size_t n);
#pragma diag_pop

static __inline int     memcmp(const void *cs, const void *ct, size_t n);
static __inline void   *memchr(const void *cs, int c, size_t n);

 void   *memset(void *mem, int ch, size_t length);






#pragma diag_push
#pragma CHECK_MISRA("-19.4") /* macros required for implementation */


#pragma diag_pop

#pragma diag_push /* functions */

/* MISRA exceptions to avoid changing inline versions of the functions that
   would be linked in instead of included inline at different mf levels */
/* these functions are very well-tested, stable, and efficient; it would
   introduce a high risk to implement new, separate MISRA versions just for the
   inline headers */

#pragma CHECK_MISRA("-5.7") /* keep names intact */
#pragma CHECK_MISRA("-6.1") /* false positive on use of char type */
#pragma CHECK_MISRA("-8.5") /* need to define inline functions */
#pragma CHECK_MISRA("-10.1") /* use implicit casts */
#pragma CHECK_MISRA("-10.3") /* need casts */
#pragma CHECK_MISRA("-11.5") /* casting away const required for standard impl */
#pragma CHECK_MISRA("-12.1") /* avoid changing expressions */
#pragma CHECK_MISRA("-12.2") /* avoid changing expressions */
#pragma CHECK_MISRA("-12.4") /* avoid changing expressions */
#pragma CHECK_MISRA("-12.5") /* avoid changing expressions */
#pragma CHECK_MISRA("-12.6") /* avoid changing expressions */
#pragma CHECK_MISRA("-12.13") /* ++/-- needed for reasonable implementation */
#pragma CHECK_MISRA("-13.1") /* avoid changing expressions */
#pragma CHECK_MISRA("-14.7") /* use multiple return points */
#pragma CHECK_MISRA("-14.8") /* use non-compound statements */
#pragma CHECK_MISRA("-14.9") /* use non-compound statements */
#pragma CHECK_MISRA("-17.4") /* pointer arithmetic needed for reasonable impl */
#pragma CHECK_MISRA("-17.6") /* false positive returning pointer-typed param */

static __inline size_t strlen(const char *string)
{
   size_t      n = (size_t)-1;
   const char *s = string;

   do n++; while (*s++);
   return n;
}

static __inline char *strcpy(register char *dest, register const char *src)
{
     register char       *d = dest;     
     register const char *s = src;

     while (*d++ = *s++);
     return dest;
}

static __inline char *strncpy(register char *dest,
		     register const char *src,
		     register size_t n)
{
     if (n) 
     {
	 register char       *d = dest;
	 register const char *s = src;
	 while ((*d++ = *s++) && --n);              /* COPY STRING         */
	 if (n-- > 1) do *d++ = '\0'; while (--n);  /* TERMINATION PADDING */
     }
     return dest;
}

static __inline char *strcat(char *string1, const char *string2)
{
   char       *s1 = string1;
   const char *s2 = string2;

   while (*s1) s1++;		     /* FIND END OF STRING   */
   while (*s1++ = *s2++);	     /* APPEND SECOND STRING */
   return string1;
}

static __inline char *strncat(char *dest, const char *src, register size_t n)
{
    if (n)
    {
	char       *d = dest;
	const char *s = src;

	while (*d) d++;                      /* FIND END OF STRING   */

	while (n--)
	  if (!(*d++ = *s++)) return dest; /* APPEND SECOND STRING */
	*d = 0;
    }
    return dest;
}

static __inline char *strchr(const char *string, int c)
{
   char        tch, ch  = c;
   const char *s        = string;

   for (;;)
   {
       if ((tch = *s) == ch) return (char *) s;
       if (!tch)             return (char *) 0;
       s++;
   }
}

static __inline char *strrchr(const char *string, int c)
{
   char        tch, ch = c;
   char       *result  = 0;
   const char *s       = string;

   for (;;)
   {
      if ((tch = *s) == ch) result = (char *) s;
      if (!tch) break;
      s++;
   }

   return result;
}

static __inline int strcmp(register const char *string1,
		  register const char *string2)
{
   register int c1, res;

   for (;;)
   {
       c1  = (unsigned char)*string1++;
       res = c1 - (unsigned char)*string2++;

       if (c1 == 0 || res != 0) break;
   }

   return res;
}

static __inline int strncmp(const char *string1, const char *string2, size_t n)
{
     if (n) 
     {
	 const char *s1 = string1;
	 const char *s2 = string2;
	 unsigned char cp;
	 int         result;

	 do 
	    if (result = (unsigned char)*s1++ - (cp = (unsigned char)*s2++))
                return result;
	 while (cp && --n);
     }
     return 0;
}

static __inline int memcmp(const void *cs, const void *ct, size_t n)
{
   if (n) 
   {
       const unsigned char *mem1 = (unsigned char *)cs;
       const unsigned char *mem2 = (unsigned char *)ct;
       int                 cp1, cp2;

       while ((cp1 = *mem1++) == (cp2 = *mem2++) && --n);
       return cp1 - cp2;
   }
   return 0;
}

static __inline void *memchr(const void *cs, int c, size_t n)
{
   if (n)
   {
      const unsigned char *mem = (unsigned char *)cs;   
      unsigned char        ch  = c;

      do 
         if ( *mem == ch ) return (void *)mem;
         else mem++;
      while (--n);
   }
   return 0;
}





#pragma diag_pop


#pragma diag_push

/* using declarations must occur outside header guard to support including both
   C and C++-wrapped version of header; see _CPP_STYLE_HEADER check */
/* this code is for C++ mode only and thus also not relevant for MISRA */
#pragma CHECK_MISRA("-19.15")


#pragma diag_pop

//   Types definitions

typedef short Word16;
typedef int   Word32;
typedef int   Flag;

/*___________________________________________________________________________
 |                                                                           |
 |   Operators prototypes                                                    |
 |___________________________________________________________________________|
*/

Word16 sature(Word32 L_var1, Flag overflow_carry[]);             /* Limit to 16 bits,    1 */
Word32 L_mult(Word16 var1, Word16 var2, Flag overflow_carry[]);  /* Long mult,           1 */
Word32 L_msu(Word32 L_var3, Word16 var1, Word16 var2, Flag overflow_carry[]); /* Msu,    1 */
Word32 L_sub(Word32 L_var1, Word32 L_var2, Flag overflow_carry[]);   /* Long sub,        2 */
Word32 L_shl(Word32 L_var1, Word16 var2, Flag overflow_carry[]); /* Long shift left,     2 */
Word16 shl(Word16 var1, Word16 var2, Flag overflow_carry[]);     /* Short shift left,    1 */
Word16 shr(Word16 var1, Word16 var2, Flag overflow_carry[]);     /* Short shift right,   1 */
Word32 L_mac(Word32 L_var3, Word16 var1, Word16 var2, Flag overflow_carry[]); /* Mac,    1 */
Word32 L_macNs(Word32 L_var3, Word16 var1, Word16 var2, Flag overflow_carry[]);/* Mac without sat, 1*/
Word32 L_msuNs(Word32 L_var3, Word16 var1, Word16 var2, Flag overflow_carry[]);/* Msu without sat, 1*/
Word32 L_add(Word32 L_var1, Word32 L_var2, Flag overflow_carry[]);   /* Long add,        2 */
Word32 L_add_c(Word32 L_var1, Word32 L_var2, Flag overflow_carry[]); /*Long add with c,  2 */
Word32 L_sub_c(Word32 L_var1, Word32 L_var2, Flag overflow_carry[]); /*Long sub with c,  2 */
Word16 shr_r(Word16 var1, Word16 var2, Flag overflow_carry[]);/* Shift right with round, 2 */
Word16 mac_r(Word32 L_var3, Word16 var1, Word16 var2, Flag overflow_carry[]);/* Mac with rounding, 2*/
Word16 msu_r(Word32 L_var3, Word16 var1, Word16 var2, Flag overflow_carry[]);/* Msu with rounding, 2*/
Word32 L_shr_r(Word32 L_var1, Word16 var2, Flag overflow_carry[]);/* Long shift right with round,  3*/
Word32 L_sat(Word32 L_var1, Flag overflow_carry[]);            /* Long saturation,       4 */
Word16 div_s(Word16 var1, Word16 var2, Flag overflow_carry[]); /* Short division,       18 */

/*
   ITU-T G.729A Speech Coder    ANSI-C Source Code
   Version 1.1    Last modified: September 1996

   Copyright (c) 1996,
   AT&T, France Telecom, NTT, Universite de Sherbrooke
   All rights reserved.
*/

/*---------------------------------------------------------------*
 * LD8A.H                                                        *
 * ~~~~~~                                                        *
 * Function prototypes and constants use for G.729A 8kb/s coder. *
 *                                                               *
 *---------------------------------------------------------------*/

/*--------------------------------------------------------------------------*
 *       Codec constant parameters (coder, decoder, and postfilter)         *
 *--------------------------------------------------------------------------*/




/*-------------------------------*
 * Mathematic functions.         *
 *-------------------------------*/

Word32 Inv_sqrt(   /* (o) Q30 : output value   (range: 0<=val<1)           */
  Word32 L_x,       /* (i) Q0  : input value    (range: 0<=val<=7fffffff)   */
  Flag overflow_carry[]
);

void Log2(
  Word32 L_x,       /* (i) Q0 : input value                                 */
  Word16 *exponent, /* (o) Q0 : Integer part of Log2.   (range: 0<=val<=30) */
  Word16 *fraction,  /* (o) Q15: Fractional part of Log2.  (range: 0<=val<1) */
  Flag overflow_carry[]
);

Word32 Pow2(        /* (o) Q0  : result       (range: 0<=val<=0x7fffffff) */
  Word16 exponent,  /* (i) Q0  : Integer part.      (range: 0<=val<=30)   */
  Word16 fraction,   /* (i) Q15 : Fractional part.   (range: 0.0<=val<1.0) */
  Flag overflow_carry[]
);

/*----------------------------------*
 * Main coder and decoder functions *
 *----------------------------------*/

/*-------------------------------*
 * LPC analysis and filtering.   *
 *-------------------------------*/
void Lag_window(
  Word16 m,         /* (i)     : LPC order                        */
  Word16 r_h[],     /* (i/o)   : Autocorrelations  (msb)          */
  Word16 r_l[],      /* (i/o)   : Autocorrelations  (lsb)          */
  Flag overflow_carry[]
);
void Lsp_Az(
  Word16 lsp[],    /* (i) Q15 : line spectral frequencies            */
  Word16 a[],       /* (o) Q12 : predictor coefficients (order = 10)  */
  Flag overflow_carry[]
);

void Lsf_lsp(
  Word16 lsf[],    /* (i) Q15 : lsf[m] normalized (range: 0.0<=val<=0.5) */
  Word16 lsp[],    /* (o) Q15 : lsp[m] (range: -1<=val<1)                */
  Word16 m         /* (i)     : LPC order                                */
);

void Lsp_lsf(
  Word16 lsp[],    /* (i) Q15 : lsp[m] (range: -1<=val<1)                */
  Word16 lsf[],    /* (o) Q15 : lsf[m] normalized (range: 0.0<=val<=0.5) */
  Word16 m         /* (i)     : LPC order                                */
);

void Int_qlpc(
 Word16 lsp_old[], /* input : LSP vector of past frame              */
 Word16 lsp_new[], /* input : LSP vector of present frame           */
 Word16 Az[],       /* output: interpolated Az() for the 2 subframes */
 Flag overflow_carry[]
);

void Weight_Az(
  Word16 a[],      /* (i) Q12 : a[m+1]  LPC coefficients             */
  Word16 gamma,    /* (i) Q15 : Spectral expansion factor.           */
  Word16 m,        /* (i)     : LPC order.                           */
  Word16 ap[],      /* (o) Q12 : Spectral expanded LPC coefficients   */
  Flag overflow_carry[]
);

void Residu(
  Word16 a[],    /* (i) Q12 : prediction coefficients                     */
  Word16 x[],    /* (i)     : speech (values x[-m..-1] are needed (m=10)  */
  Word16 y[],    /* (o)     : residual signal                             */
  Word16 lg,      /* (i)     : size of filtering                           */
  Flag overflow_carry[]
);

void Syn_filt(
  Word16 a[],     /* (i) Q12 : a[m+1] prediction coefficients   (m=10)  */
  Word16 x[],     /* (i)     : input signal                             */
  Word16 y[],     /* (o)     : output signal                            */
  Word16 lg,      /* (i)     : size of filtering                        */
  Word16 mem[],   /* (i/o)   : memory associated with this filtering.   */
  Word16 update,   /* (i)     : 0=no update, 1=update of memory.         */
  Flag overflow_carry[]
);

void Convolve(
  Word16 x[],      /* (i)     : input vector                           */
  Word16 h[],      /* (i) Q12 : impulse response                       */
  Word16 y[],      /* (o)     : output vector                          */
  Word16 L         /* (i)     : vector size                            */
);

/*--------------------------------------------------------------------------*
 *       LTP constant parameters                                            *
 *--------------------------------------------------------------------------*/


/*-----------------------*
 * Pitch functions.      *
 *-----------------------*/

Word16 Pitch_ol_fast(  /* output: open loop pitch lag                        */
   Word16 signal[],    /* input : signal used to compute the open loop pitch */
                       /*     signal[-pit_max] to signal[-1] should be known */
   Word16   pit_max,   /* input : maximum pitch lag                          */
   Word16   L_frame,    /* input : length of frame to compute pitch           */
   Flag overflow_carry[]
);

Word16 Pitch_fr3_fast(/* (o)     : pitch period.                          */
  Word16 exc[],       /* (i)     : excitation buffer                      */
  Word16 xn[],        /* (i)     : target vector                          */
  Word16 h[],         /* (i) Q12 : impulse response of filters.           */
  Word16 L_subfr,     /* (i)     : Length of subframe                     */
  Word16 t0_min,      /* (i)     : minimum value in the searched range.   */
  Word16 t0_max,      /* (i)     : maximum value in the searched range.   */
  Word16 i_subfr,     /* (i)     : indicator for first subframe.          */
  Word16 *pit_frac,    /* (o)     : chosen fraction.                       */
  Flag overflow_carry[]
);

Word16 G_pitch(      /* (o) Q14 : Gain of pitch lag saturated to 1.2       */
  Word16 xn[],       /* (i)     : Pitch target.                            */
  Word16 y1[],       /* (i)     : Filtered adaptive codebook.              */
  Word16 g_coeff[],  /* (i)     : Correlations need for gain quantization. */
  Word16 L_subfr,     /* (i)     : Length of subframe.                      */
  Flag overflow_carry[]
);

Word16 Enc_lag3(     /* output: Return index of encoding */
  Word16 T0,         /* input : Pitch delay              */
  Word16 T0_frac,    /* input : Fractional pitch delay   */
  Word16 *T0_min,    /* in/out: Minimum search delay     */
  Word16 *T0_max,    /* in/out: Maximum search delay     */
  Word16 pit_min,    /* input : Minimum pitch delay      */
  Word16 pit_max,    /* input : Maximum pitch delay      */
  Word16 pit_flag,    /* input : Flag for 1st subframe    */
  Flag overflow_carry[]
);

void Dec_lag3(        /* output: return integer pitch lag       */
  Word16 index,       /* input : received pitch index           */
  Word16 pit_min,     /* input : minimum pitch lag              */
  Word16 pit_max,     /* input : maximum pitch lag              */
  Word16 i_subfr,     /* input : subframe flag                  */
  Word16 *T0,         /* output: integer part of pitch lag      */
  Word16 *T0_frac,     /* output: fractional part of pitch lag   */
  Flag overflow_carry[]
);

Word16 Interpol_3(      /* (o)  : interpolated value  */
  Word16 *x,            /* (i)  : input vector        */
  Word16 frac           /* (i)  : fraction            */
);

void Pred_lt_3(
  Word16   exc[],       /* in/out: excitation buffer */
  Word16   T0,          /* input : integer pitch lag */
  Word16   frac,        /* input : fraction of lag   */
  Word16   L_subfr,      /* input : subframe size     */
  Flag overflow_carry[]
);

Word16 Parity_Pitch(    /* output: parity bit (XOR of 6 MSB bits)    */
   Word16 pitch_index,   /* input : index for which parity to compute */
   Flag overflow_carry[]
);

Word16  Check_Parity_Pitch( /* output: 0 = no error, 1= error */
  Word16 pitch_index,       /* input : index of parameter     */
  Word16 parity,             /* input : parity bit             */
  Flag overflow_carry[]
);

void Cor_h_X(
     Word16 h[],        /* (i) Q12 :Impulse response of filters      */
     Word16 X[],        /* (i)     :Target vector                    */
     Word16 D[],         /* (o)     :Correlations between h[] and D[] */
                        /*          Normalized to 13 bits            */
	 Flag overflow_carry[]
);

/*-----------------------*
 * Innovative codebook.  *
 *-----------------------*/


/* The following constants are Q15 fractions.
   These fractions are used to keep maximum precision on "alp" sum */


Word16  ACELP_Code_A(    /* (o)     :index of pulses positions    */
  Word16 x[],            /* (i)     :Target vector                */
  Word16 h[],            /* (i) Q12 :Impulse response of filters  */
  Word16 T0,             /* (i)     :Pitch lag                    */
  Word16 pitch_sharp,    /* (i) Q14 :Last quantized pitch gain    */
  Word16 code[],         /* (o) Q13 :Innovative codebook          */
  Word16 y[],            /* (o) Q12 :Filtered innovative codebook */
  Word16 *sign,           /* (o)     :Signs of 4 pulses            */
  Flag overflow_carry[]
);

void Decod_ACELP(
  Word16 sign,      /* (i)     : signs of 4 pulses.                       */
  Word16 index,     /* (i)     : Positions of the 4 pulses.               */
  Word16 cod[],      /* (o) Q13 : algebraic (fixed) codebook excitation    */
  Flag overflow_carry[]
);
/*--------------------------------------------------------------------------*
 *       LSP constant parameters                                            *
 *--------------------------------------------------------------------------*/





/*-------------------------------*
 * LSP VQ functions.             *
 *-------------------------------*/

void Lsf_lsp2(
  Word16 lsf[],    /* (i) Q13 : lsf[m] (range: 0.0<=val<PI) */
  Word16 lsp[],    /* (o) Q15 : lsp[m] (range: -1<=val<1)   */
  Word16 m,         /* (i)     : LPC order                   */
  Flag overflow_carry[]
);

void Lsp_lsf2(
  Word16 lsp[],    /* (i) Q15 : lsp[m] (range: -1<=val<1)   */
  Word16 lsf[],    /* (o) Q13 : lsf[m] (range: 0.0<=val<PI) */
  Word16 m,         /* (i)     : LPC order                   */
  Flag overflow_carry[]
);
void Get_wegt(
  Word16 flsp[],    /* Q13 */
  Word16 wegt[],     /* Q11 -> normalized */
  Flag overflow_carry[]
);
void Lsp_expand_1(
  Word16 buf[],          /* Q13 */
  Word16 gap,             /* Q13 */
  Flag overflow_carry[]
);

void Lsp_expand_2(
  Word16 buf[],         /* Q13 */
  Word16 gap,            /* Q13 */
  Flag overflow_carry[]
);

void Lsp_expand_1_2(
  Word16 buf[],         /* Q13 */
  Word16 gap,            /* Q13 */
  Flag overflow_carry[]
);

void Lsp_get_quant(
  Word16 lspcb1[][10],      /* Q13 */
  Word16 lspcb2[][10],      /* Q13 */
  Word16 code0,
  Word16 code1,
  Word16 code2,
  Word16 fg[][10],            /* Q15 */
  Word16 freq_prev[][10],     /* Q13 */
  Word16 lspq[],                /* Q13 */
  Word16 fg_sum[],               /* Q15 */
  Flag overflow_carry[]
);

void Lsp_get_tdist(
  Word16 wegt[],        /* normalized */
  Word16 buf[],         /* Q13 */
  Word32 *L_tdist,      /* Q27 */
  Word16 rbuf[],        /* Q13 */
  Word16 fg_sum[],       /* Q15 */
  Flag overflow_carry[]
);

void Lsp_last_select(
  Word32 L_tdist[],     /* Q27 */
  Word16 *mode_index,
  Flag overflow_carry[]
);

void Lsp_pre_select(
  Word16 rbuf[],              /* Q13 */
  Word16 lspcb1[][10],      /* Q13 */
  Word16 *cand,
  Flag overflow_carry[]
);

void Lsp_select_1(
  Word16 rbuf[],              /* Q13 */
  Word16 lspcb1[],            /* Q13 */
  Word16 wegt[],              /* normalized */
  Word16 lspcb2[][10],      /* Q13 */
  Word16 *index,
  Flag overflow_carry[]
);

void Lsp_select_2(
  Word16 rbuf[],              /* Q13 */
  Word16 lspcb1[],            /* Q13 */
  Word16 wegt[],              /* normalized */
  Word16 lspcb2[][10],      /* Q13 */
  Word16 *index,
  Flag overflow_carry[]
);

void Lsp_stability(
  Word16 buf[],     /* Q13 */
  Flag overflow_carry[]
);

void Relspwed(
  Word16 lsp[],                          /* Q13 */
  Word16 wegt[],                         /* normalized */
  Word16 lspq[],                         /* Q13 */
  Word16 lspcb1[][10],                 /* Q13 */
  Word16 lspcb2[][10],                 /* Q13 */
  Word16 fg[2][4][10],          /* Q15 */
  Word16 freq_prev[4][10],         /* Q13 */
  Word16 fg_sum[2][10],             /* Q15 */
  Word16 fg_sum_inv[2][10],         /* Q12 */
  Word16 code_ana[],
  Flag overflow_carry[]
);


void Lsp_prev_compose(
  Word16 lsp_ele[],             /* Q13 */
  Word16 lsp[],                 /* Q13 */
  Word16 fg[][10],            /* Q15 */
  Word16 freq_prev[][10],     /* Q13 */
  Word16 fg_sum[],               /* Q15 */
  Flag overflow_carry[]
);

void Lsp_prev_extract(
  Word16 lsp[10],                 /* Q13 */
  Word16 lsp_ele[10],             /* Q13 */
  Word16 fg[4][10],           /* Q15 */
  Word16 freq_prev[4][10],    /* Q13 */
  Word16 fg_sum_inv[10],           /* Q12 */
  Flag overflow_carry[]
);

void Lsp_prev_update(
  Word16 lsp_ele[10],             /* Q13 */
  Word16 freq_prev[4][10]     /* Q13 */
);

/*-------------------------------*
 * gain VQ constants.            *
 *-------------------------------*/


/*--------------------------------------------------------------------------*
 * gain VQ functions.                                                       *
 *--------------------------------------------------------------------------*/

void Gain_predict(
  Word16 past_qua_en[],/* (i) Q10 :Past quantized energies                  */
  Word16 code[],    /* (i) Q13 : Innovative vector.                         */
  Word16 L_subfr,   /* (i)     : Subframe length.                           */
  Word16 *gcode0,   /* (o) Qxx : Predicted codebook gain                    */
  Word16 *exp_gcode0, /* (o)    : Q-Format(gcode0)                           */
  Flag overflow_carry[]
);

void Gain_update(
  Word16 past_qua_en[],/* (i) Q10 :Past quantized energies                  */
  Word32 L_gbk12,    /* (i) Q13 : gbk1[indice1][1]+gbk2[indice2][1]          */
  Flag overflow_carry[]
);

void Gain_update_erasure(
  Word16 past_qua_en[],/* (i) Q10 :Past quantized energies                   */
  Flag overflow_carry[]
);

void Corr_xy2(
      Word16 xn[],           /* (i) Q0  :Target vector.                  */
      Word16 y1[],           /* (i) Q0  :Adaptive codebook.              */
      Word16 y2[],           /* (i) Q12 :Filtered innovative vector.     */
      Word16 g_coeff[],      /* (o) Q[exp]:Correlations between xn,y1,y2 */
      Word16 exp_g_coeff[],   /* (o)       :Q-format of g_coeff[]         */
	  Flag overflow_carry[]
);

/*-----------------------*
 * Bitstream function    *
 *-----------------------*/

void  prm2bits_ld8k(Word16 prm[], Word16 bits[]);
void  bits2prm_ld8k(Word16 bits[], Word16 prm[]);


/*-----------------------------------*
 * Post-filter functions.            *
 *-----------------------------------*/

void pit_pst_filt(
  Word16 *signal,      /* (i)     : input signal                        */
  Word16 *scal_sig,    /* (i)     : input signal (scaled, divided by 4) */
  Word16 t0_min,       /* (i)     : minimum value in the searched range */
  Word16 t0_max,       /* (i)     : maximum value in the searched range */
  Word16 L_subfr,      /* (i)     : size of filtering                   */
  Word16 *signal_pst,   /* (o)     : harmonically postfiltered signal    */
  Flag overflow_carry[]
);


/*--------------------------------------------------------------------------*
 * Constants and prototypes for taming procedure.                           *
 *--------------------------------------------------------------------------*/


/*--------------------------------------------------------------------------*
 * Prototypes for auxiliary functions.                                      *
 *--------------------------------------------------------------------------*/
//Word16 Random(Flag overflow_carry[]);


/*
   ITU-T G.729A Speech Coder    ANSI-C Source Code
   Version 1.1    Last modified: September 1996

   Copyright (c) 1996,
   AT&T, France Telecom, NTT, Universite de Sherbrooke
   All rights reserved.
*/

/* Double precision operations */

void   L_Extract(Word32 L_32, Word16 *hi, Word16 *lo, Flag overflow_carry[]);
Word32 L_Comp(Word16 hi, Word16 lo, Flag overflow_carry[]);
Word32 Mpy_32(Word16 hi1, Word16 lo1, Word16 hi2, Word16 lo2, Flag overflow_carry[]);
Word32 Mpy_32_16(Word16 hi, Word16 lo, Word16 n, Flag overflow_carry[]);
Word32 Div_32(Word32 L_num, Word16 denom_hi, Word16 denom_lo, Flag overflow_carry[]);

typedef struct {
	Word16 *exc;
	Word16 old_exc[80+143+(10+1)];	// Excitation vector
	Word16 lsp_old[10];							// Lsp (Line spectral pairs)
	Word16 mem_syn[10];							// Filter's memory
	Word16 sharp;           					// pitch sharpening of previous frame
	Word16 old_T0;          					// integer delay of previous frame
	Word16 gain_code;       					// Code gain
	Word16 gain_pitch;      					// Pitch gain
} Dec_ld8a_s;

typedef struct {
	Word16 freq_prev[4][10];   		// Q13
	Word16 freq_prev_reset[10];  		// Q13
	Word16 prev_ma;                  	// previous MA prediction coef.
	Word16 prev_lsp[10];              	// previous LSP vector
} LspDec_s;

typedef struct {						// inverse filtered synthesis (with A(z/GAMMA2_PST))
	Word16 * res2;
	Word16 * scal_res2;
	Word16 res2_buf[143+40];
	Word16 scal_res2_buf[143+40];
	Word16 mem_syn_pst[10]; 				// memory of filter 1/A(z/GAMMA1_PST)
	Word16 mem_pre;
	Word16 past_gain;
} Postfilt_s;

// 2nd order high pass filter state variable for Post_Process
typedef struct {
	Word16 y2_hi, y2_lo, y1_hi, y1_lo, x0, x1;
} Post_Process_s;

typedef struct {
	unsigned short L23h	: 8;
	unsigned short L01	: 1+7;

	unsigned short C1h	: 5;
	unsigned short P0	: 1;
	unsigned short P1	: 8;
	unsigned short L23l	: 2;

	unsigned short GAB1h: 4;
	unsigned short S1	: 4;
	unsigned short C1l	: 8;

	unsigned short C2h	: 8;
	unsigned short P2	: 5;
	unsigned short GAB1l: 3;

	unsigned short GAB2	: 3+4;
	unsigned short S2	: 4;
	unsigned short C2l	: 5;
} Bits_s;

typedef struct {
	Flag Overflow_Carry[2];				// Flag array: Overflow_Carry[0] = Overflow, Overflow_Carry[1] = Carry
	Word16 *synth; 						// Synthesis

	Dec_ld8a_s Dec_ld8a;
	Postfilt_s Postfilt;
	LspDec_s LspDec;
	Post_Process_s Post_Process;
	Bits_s Bitstream;

	Word16 synth_buf[80+10]; 		// Synthesis
	Word16 parm[11+1];            // Synthesis parameters
	Word16 Az_dec[(10+1)*2];               // Decoded Az for post-filter
	Word16 T2[2];                       // Pitch lag for 2 subframes
	Word16 Past_qua_en[4]; 				// Gain predictor, Past quantized energies = -14.0 in Q10
	Word16 bad_lsf;        				// bad LSF indicator
	Word16 seed;
} G729ADecoder_s;


void Init_Decod_ld8a(G729ADecoder_s* pG729ADecoder);
void Lsp_decw_reset(G729ADecoder_s* pG729ADecoder);
void Init_Post_Filter(G729ADecoder_s* pG729ADecoder);
void Post_Process(G729ADecoder_s* pG729ADecoder);
int G729ADecoderInit(void* ptrG729ADecoder);
Word16 Random(G729ADecoder_s* pG729ADecoder);
void G729A_Decoder(void* ptrG729ADecoder, short* pSerialInput, short* pSynthOut);
void Bits2prm(G729ADecoder_s* pG729ADecoder);

void Decod_ld8a(
  Word16  parm[],      /* (i)   : vector of synthesis parameters
                                  parm[0] = bad frame indicator (bfi)  */
  Word16  synth[],     /* (o)   : synthesis speech                     */
  Word16  A_t[],       /* (o)   : decoded LP filter in 2 subframes     */
  Word16  *T2,          /* (o)   : decoded pitch lag in 2 subframes     */
  G729ADecoder_s* pG729ADecoder
);

void Lsp_iqua_cs(
 Word16 prm[],          /* input : codes of the selected LSP*/
 Word16 lsp_q[],        /* output: Quantized LSP parameters*/
 Word16 erase,           /* input : frame erase information */
 G729ADecoder_s* pG729ADecoder
);

void D_lsp(
  Word16 prm[],          /* (i)     : indexes of the selected LSP */
  Word16 lsp_q[],        /* (o) Q15 : Quantized LSP parameters    */
  Word16 erase,           /* (i)     : frame erase information     */
  G729ADecoder_s* pG729ADecoder
);

void Dec_gain(
  Word16 index,     /* (i)     : Index of quantization.                     */
  Word16 code[],    /* (i) Q13 : Innovative vector.                         */
  Word16 L_subfr,   /* (i)     : Subframe length.                           */
  Word16 bfi,       /* (i)     : Bad frame indicator                        */
  Word16 *gain_pit, /* (o) Q14 : Pitch gain.                                */
  Word16 *gain_cod,  /* (o) Q1  : Code gain.                                 */
  G729ADecoder_s* pG729ADecoder
);

void Post_Filter(
  Word16 *syn,       /* in/out: synthesis speech (postfiltered is output)    */
  Word16 *Az_4,       /* input : interpolated LPC parameters in all subframes */
  Word16 *T,            /* input : decoded pitch lags in all subframes          */
  G729ADecoder_s* pG729ADecoder
);

void preemphasis(
  Word16 *signal,  /* (i/o)   : input signal overwritten by the output */
  Word16 g,        /* (i) Q15 : preemphasis coefficient                */
  Word16 L,         /* (i)     : size of filtering                      */
  G729ADecoder_s* pG729ADecoder
);

void agc(
  Word16 *sig_in,   /* (i)     : postfilter input signal  */
  Word16 *sig_out,  /* (i/o)   : postfilter output signal */
  Word16 l_trm,      /* (i)     : subframe size            */
  G729ADecoder_s* pG729ADecoder
);



/*---------------------------------------------------------------*
 *    Postfilter constant parameters (defined in "ld8a.h")       *
 *---------------------------------------------------------------*
 *   L_FRAME     : Frame size.                                   *
 *   L_SUBFR     : Sub-frame size.                               *
 *   M           : LPC order.                                    *
 *   MP1         : LPC order+1                                   *
 *   PIT_MAX     : Maximum pitch lag.                            *
 *   GAMMA2_PST  : Formant postfiltering factor (numerator)      *
 *   GAMMA1_PST  : Formant postfiltering factor (denominator)    *
 *   GAMMAP      : Harmonic postfiltering factor                 *
 *   MU          : Factor for tilt compensation filter           *
 *   AGC_FAC     : Factor for automatic gain control             *
 *---------------------------------------------------------------*/

/*---------------------------------------------------------------*
 * Procedure    Init_Post_Filter:                                *
 *              ~~~~~~~~~~~~~~~~                                 *
 *  Initializes the postfilter parameters:                       *
 *---------------------------------------------------------------*/

void Init_Post_Filter(G729ADecoder_s* pG729ADecoder)
{
  pG729ADecoder->Postfilt.res2  = pG729ADecoder->Postfilt.res2_buf + 143;
  pG729ADecoder->Postfilt.scal_res2  = pG729ADecoder->Postfilt.scal_res2_buf + 143;

  memset(pG729ADecoder->Postfilt.mem_syn_pst, 0, 2*10);
  memset(pG729ADecoder->Postfilt.res2_buf, 0, 2*(143+40));
  memset(pG729ADecoder->Postfilt.scal_res2_buf, 0, 2*(143+40));

  return;
}


/*------------------------------------------------------------------------*
 *  Procedure     Post_Filter:                                            *
 *                ~~~~~~~~~~~                                             *
 *------------------------------------------------------------------------*
 *  The postfiltering process is described as follows:                    *
 *                                                                        *
 *  - inverse filtering of syn[] through A(z/GAMMA2_PST) to get res2[]    *
 *  - use res2[] to compute pitch parameters                              *
 *  - perform pitch postfiltering                                         *
 *  - tilt compensation filtering; 1 - MU*k*z^-1                          *
 *  - synthesis filtering through 1/A(z/GAMMA1_PST)                       *
 *  - adaptive gain control                                               *
 *------------------------------------------------------------------------*/



void Post_Filter(
  Word16 *syn,       /* in/out: synthesis speech (postfiltered is output)    */
  Word16 *Az_4,      /* input : interpolated LPC parameters in all subframes */
  Word16 *T,          /* input : decoded pitch lags in all subframes          */
  G729ADecoder_s* restrict pG729ADecoder
)
{
 /*-------------------------------------------------------------------*
  *           Declaration of parameters                               *
  *-------------------------------------------------------------------*/

 Word16 res2_pst[40];  /* res2[] after pitch postfiltering */
 Word16 syn_pst[80];   /* post filtered synthesis speech   */

 Word16 Ap3[(10+1)], Ap4[(10+1)];  /* bandwidth expanded LP parameters */

 Word16 *Az;                 /* pointer to Az_4:                 */
                             /*  LPC parameters in each subframe */
 Word16   t0_max, t0_min;    /* closed-loop pitch search range   */
 Word16   i_subfr;           /* index for beginning of subframe  */

 Word16 h[22];

 Word16  i, j;
 Word16  temp1, temp2;
 Word32  L_tmp;

 Word16 * restrict pres2;
 Word16 * restrict pscal_res2;

 pres2 = pG729ADecoder->Postfilt.res2;
 pscal_res2 = pG729ADecoder->Postfilt.scal_res2;

   /*-----------------------------------------------------*
    * Post filtering                                      *
    *-----------------------------------------------------*/

    Az = Az_4;

    for (i_subfr = 0; i_subfr < 80; i_subfr += 40)
    {
      /* Find pitch range t0_min - t0_max */

      t0_min = (sature(*T++ -3,pG729ADecoder->Overflow_Carry));
      t0_max = (sature(t0_min+6,pG729ADecoder->Overflow_Carry));
//      if (sub(t0_max, PIT_MAX, pG729ADecoder->Overflow_Carry) > 0) {
      if (t0_max > 143) {
        t0_max = 143;
        t0_min = (sature(t0_max-6,pG729ADecoder->Overflow_Carry));
      }

      /* Find weighted filter coefficients Ap3[] and Ap4[] */

      Weight_Az(Az, 18022, 10, Ap3, pG729ADecoder->Overflow_Carry);
      Weight_Az(Az, 22938, 10, Ap4, pG729ADecoder->Overflow_Carry);

      /* filtering of synthesis speech by A(z/GAMMA2_PST) to find res2[] */

      Residu(Ap3, &syn[i_subfr], pG729ADecoder->Postfilt.res2, 40, pG729ADecoder->Overflow_Carry);

      /* scaling of "res2[]" to avoid energy overflow */

      for (j=0; j<40; j++)
      {
    	  //pG729ADecoder->Postfilt.scal_res2[j] = shr(pG729ADecoder->Postfilt.res2[j], 2, pG729ADecoder->Overflow_Carry);
    	  //pG729ADecoder->Postfilt.scal_res2[j] = pG729ADecoder->Postfilt.res2[j]>>2;
    	  pscal_res2[j] = pres2[j]>>2;
      }

      /* pitch postfiltering */

      pit_pst_filt(pG729ADecoder->Postfilt.res2, pG729ADecoder->Postfilt.scal_res2, t0_min, t0_max, 40, res2_pst, pG729ADecoder->Overflow_Carry);

      /* tilt compensation filter */

      /* impulse response of A(z/GAMMA2_PST)/A(z/GAMMA1_PST) */

      memcpy(h, Ap3, 2*(10+1));

      memset(&h[10+1], 0, 2*(22-10-1));

      Syn_filt(Ap4, h, h, 22, &h[10+1], 0, pG729ADecoder->Overflow_Carry);

      /* 1st correlation of h[] */

      L_tmp = L_mult(h[0], h[0], pG729ADecoder->Overflow_Carry);
      /*for (i=1; i<L_H; i++) L_tmp = L_mac(L_tmp, h[i], h[i], pG729ADecoder->Overflow_Carry);*/
      for (i=1; i<22; i++){
    	  L_tmp = (_sadd((L_tmp),(_smpy((h[i]),(h[i])))));
          if (L_tmp == (Word32)0x80000000L || L_tmp == (Word32)0x7fffffffL){
        	  pG729ADecoder->Overflow_Carry[0] = 1;
    	  }
      }

      temp1 = ((L_tmp)>>16);

      L_tmp = L_mult(h[0], h[1], pG729ADecoder->Overflow_Carry);
      /*for (i=1; i<L_H-1; i++) L_tmp = L_mac(L_tmp, h[i], h[i+1], pG729ADecoder->Overflow_Carry);*/
      for (i=1; i<22-1; i++){
    	  L_tmp = (_sadd((L_tmp),(_smpy((h[i]),(h[i+1])))));
          if (L_tmp == (Word32)0x80000000L || L_tmp == (Word32)0x7fffffffL){
        	  pG729ADecoder->Overflow_Carry[0] = 1;
    	  }
      }

      temp2 = ((L_tmp)>>16);

      if(temp2 <= 0) {
        temp2 = 0;
      }
      else {
        temp2 = (_smpy((temp2),(26214))>>16);
        temp2 = div_s(temp2, temp1, pG729ADecoder->Overflow_Carry);
      }

      preemphasis(res2_pst, temp2, 40, pG729ADecoder);

      /* filtering through  1/A(z/GAMMA1_PST) */

      Syn_filt(Ap4, res2_pst, &syn_pst[i_subfr], 40, pG729ADecoder->Postfilt.mem_syn_pst, 1, pG729ADecoder->Overflow_Carry);

      /* scale output to input */

      agc(&syn[i_subfr], &syn_pst[i_subfr], 40, pG729ADecoder);

      /* update res2[] buffer;  shift by L_SUBFR */

     
      memcpy(&pG729ADecoder->Postfilt.res2[-143], &pG729ADecoder->Postfilt.res2[40-143], 2*143);
      memcpy(&pG729ADecoder->Postfilt.scal_res2[-143], &pG729ADecoder->Postfilt.scal_res2[40-143], 2*143);

      Az += (10+1);
    }

    /* update syn[] buffer */

    memcpy(&syn[-10], &syn[80-10], 2*10);

    /* overwrite synthesis speech by postfiltered synthesis speech */

    memcpy(syn, syn_pst, 2*80);

    return;
}

0 Negin Madani over 6 years ago in reply to Victor Kazmirenko

Prodigy 230 points

rrlagic said:
For that you need to ensure and tell compiler your data pointers are aligned.

Thanks for your comment! I tried using nassert as you suggested:

      _nassert ((Word16) pres2 % 8 == 0);
      _nassert((Word16) pscal_res2 % 8 == 0);
      #pragma MUST_ITERATE(40,40,40)
      #pragma UNROLL(8)

      for (j=0; j<L_SUBFR; j++)
      {
    	  pscal_res2[j] = pres2[j]>>2;
      }

The average number of cycles for the function without adding the pragmas and nassert was 2031.

Adding nassert increased it to 2115, but including the pragmas (with nassert) brought it down to 1940.

I am still not able to see any software pipeline information in my .asm file, though (not even a "disqualified loop"). It is as if the loop is not even recognized as a loop.

Do you think I can further optimize the loop? I have submitted the test case in my previous post and would appreciate your suggestions.

Thanks,

Negin

0 Negin Madani over 6 years ago in reply to Negin Madani

Prodigy 230 points

Negin said:
I tried using nassert as you suggested

Actually, this is causing the function to produce wrong results. I also get this warning:

#770-D conversion from pointer to smaller integer

Any ideas on how to fix this?

Thanks,

Negin

0 Keith Barkley over 6 years ago in reply to Negin Madani

Guru 35620 points

Are pointers 16 bits?

0 Negin Madani over 6 years ago in reply to Keith Barkley

Prodigy 230 points

Keith Barkley said:
Are pointers 16 bits?

They are 32 bits:

0 Keith Barkley over 6 years ago in reply to Negin Madani

Guru 35620 points

Then maybe you should use a Word32 in the cast instead of Word16.

0 George Mock over 6 years ago in reply to Negin Madani

TI__Guru**** 232790 points

Negin said:

Based on profiling information, I can verify that the number of cycles for the function containing the loop has decreased. However, the .asm file does not show any software pipeline information for this loop after making this modification.

Do you know why this happened?

Yes. Because this loop ...

Negin said:
for (j=0; j<L_SUBFR; j++)
{ pscal_res2[j] = pres2[j]>>2; }

... is completely unrolled. All those operations are implemented as a straight line block of code.

Thanks and regards,

-George

0 Victor Kazmirenko over 6 years ago in reply to Negin Madani

Guru 13042 points

Hi!

As George suggested, the loop might be unrolled completely, although I could not reproduce that in my test. Either way, the cycle count you report is dominated by the rest of your function, not by the loop under consideration. Load, shift, store is a very simple sequence, and it benefits from SIMD. I expected an SHR2 instruction, but I am old; things have become even better now. Consider this fragment:

#pragma FUNC_CANNOT_INLINE ( mytest );
#pragma   FUNCTION_OPTIONS ( mytest, "--opt_level3 --opt_for_speed5" );

void mytest(void)
{
    const   int16 * RESTRICT src = (int16 *) mysrc;
            int16 * RESTRICT dst = (int16 *) mydst;
    int i;

    _nassert ( (int) src % 8 == 0 );
    _nassert ( (int) dst % 8 == 0 );
    #pragma MUST_ITERATE(40,40,40)

    for ( i = 0; i < 40; i++ )
        dst[i] = src[i] >> 2;
}

Assembly for this piece looks like:

mytest:
;** --------------------------------------------------------------------------*
;* 1304	-----------------------    C$1 = there was source;
;* 1304	-----------------------    src = (int (*)[14][1200])C$1+19712;
;* 1305	-----------------------    dst = (int (*)[14][1200])C$1+24512;
;**  	-----------------------    U$15 = (short  const (*)[4])src;
;**  	-----------------------    U$18 = (short (*)[4])dst;
;* 1312	-----------------------    L$1 = 10;
;**  	-----------------------    #pragma MUST_ITERATE(10, 10, 10)
;**  	-----------------------    // LOOP BELOW UNROLLED BY FACTOR(4)
;**  	-----------------------    #pragma LOOP_FLAGS(4098u)
;**	-----------------------g2:
;* 1313	-----------------------    *U$18++{8} = _dshr2(*U$15++{8}, 2);
;* 1312	-----------------------    if ( L$1 = L$1-1 ) goto g2;
;**  	-----------------------    return;

As you can see, the structure of the loop is very simple, and that is confirmed by the pipelining info:

;*      Loop Unroll Multiple             : 4x
;*      Known Minimum Trip Count         : 10                    
;*      Known Maximum Trip Count         : 10                    
;*      Known Max Trip Count Factor      : 10
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 1
;*      Partitioned Resource Bound(*)    : 1
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0     
;*      .S units                     0        1*    
;*      .D units                     1*       1*    
;*      .M units                     0        0     
;*      .X cross paths               0        1*    
;*      .T address paths             1*       1*    
;*      Long read paths              0        0     
;*      Long write paths             0        0     
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          0        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        1*    
;*      Bound(.L .S .D .LS .LSD)     1*       1*    
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 1  Schedule found with 7 iterations in parallel
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      Minimum safe trip count       : 1 (after unrolling)
;*----------------------------------------------------------------------------*
$C$L43:    ; PIPED LOOP PROLOG

           SPLOOPD 1       ;7                ; (P) 
||         MVC     .S2     B7,ILC

;** --------------------------------------------------------------------------*
$C$L44:    ; PIPED LOOP KERNEL
$C$DW$L$mytest$3$B:
           LDDW    .D1T1   *A3++,A5:A4       ; |1313| (P) <0,0> 
           NOP             4
           DSHR2   .S2X    A5:A4,2,B5:B4     ; |1313| (P) <0,5> 

           SPKERNEL 6,0
||         STDW    .D2T2   B5:B4,*B6++       ; |1313| <0,6> 

$C$DW$L$mytest$3$E:
;** --------------------------------------------------------------------------*
$C$L45:    ; PIPED LOOP EPILOG
;** --------------------------------------------------------------------------*

I expected it to operate on pairs, but it processes them by fours! So for an unrolled trip count of 10, it should take just a few dozen cycles to complete. A fully unrolled version might be even faster. I would check the rest of the function for performance.
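To make the "by fours" point concrete, here is a minimal portable C sketch (not TI code) of what the generated loop does: each step handles four 16-bit samples, so 40 samples take 10 steps. The names `scale_scalar` and `scale_by_fours` are hypothetical; on the C66x the compiler gets the same effect with the LDDW / DSHR2 / STDW group shown in the listing above.

```c
#include <assert.h>
#include <stdint.h>

#define L_SUBFR 40  /* subframe length from the original post */

/* Reference version: one sample per iteration, as in the original loop.
   Note: >> on a negative value is implementation-defined in C; this
   sketch assumes an arithmetic shift, as the original code does. */
void scale_scalar(const int16_t *restrict src, int16_t *restrict dst)
{
    int i;
    for (i = 0; i < L_SUBFR; i++)
        dst[i] = (int16_t)(src[i] >> 2);
}

/* Four samples per step, imitating one 64-bit load, one packed
   shift and one 64-bit store per step. */
void scale_by_fours(const int16_t *restrict src, int16_t *restrict dst)
{
    int step;
    for (step = 0; step < L_SUBFR / 4; step++) {
        const int16_t *s = src + 4 * step;
        int16_t *d = dst + 4 * step;
        d[0] = (int16_t)(s[0] >> 2);
        d[1] = (int16_t)(s[1] >> 2);
        d[2] = (int16_t)(s[2] >> 2);
        d[3] = (int16_t)(s[3] >> 2);
    }
}
```

Both functions produce the same output for any input; the second one only makes the grouping visible.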

As for the warning you saw, it is related to the type cast in _nassert(). If you control-click on it to find its declaration, you'll see

extern _CODE_ACCESS void _nassert(int);

and that is the root cause of the warning. You have to cast to (int).
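A small sketch of an alignment test whose cast does not truncate the pointer. It assumes a C99 toolchain with `<stdint.h>`, and the helper name `is_aligned8` is hypothetical. Keep in mind that `_nassert` generates no code: it is a promise to the optimizer, not a runtime check, so it is only safe when a test like this one actually holds for the pointer in question.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if p lies on an 8-byte boundary, 0 otherwise.
   uintptr_t is wide enough to hold the pointer value, so no
   "conversion from pointer to smaller integer" warning arises. */
int is_aligned8(const void *p)
{
    return ((uintptr_t)p % 8u) == 0u;
}
```

During debugging, a call such as `is_aligned8(pres2)` could be checked with an ordinary `assert()` before trusting the `_nassert` line.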

0 Alberto Chessa over 6 years ago in reply to Negin Madani

Mastermind 6650 points

Hi,

Loop: it has been completely unrolled and implemented as a sequence of 10 DSHR2 instructions (each works on 64 bits = 4 Word16 values; 4 * 10 = 40).

Conversion warning: your _nassert casts a 32-bit value to Word16. The _nassert concerns the pointer itself, not the type it points to:
_nassert ((int) pres2 % 8 == 0); //use "int"- sizeof(int) == sizeof(Word16*)

Warning: the assertion precondition could be false. As far as I know, 16-byte array alignment applies only to top-level arrays. Inside a struct, the constraint comes from the element type (2 bytes in your case).

To guarantee 8-byte alignment, you have to place a (possibly dummy) field with an 8-byte alignment constraint before the array declaration (or use an __attribute__).

Note that the constraint is also respected if:
1. the struct contains at least one 8-byte type, and
2. (sizeof() of the fields that precede your array) % 8 == 0.
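The layout rule can be checked on any C compiler with `offsetof`. The sketch below uses two hypothetical structs (`Loose` and `Padded`), not the decoder's real ones. In `Loose` the array only gets the alignment of its 2-byte element type; in `Padded` a `double` member ahead of it raises the alignment of the struct and pushes the array to an offset that follows the `double`. Exact offsets depend on the target ABI, so the helpers report them instead of assuming them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int16_t head[3];    /* 6 bytes of 2-byte-aligned data            */
    int16_t array[40];  /* follows immediately after head            */
} Loose;

typedef struct {
    char    a;
    char    b;
    double  dummy;      /* raises the struct's alignment requirement */
    int16_t array[40];  /* starts right after dummy                  */
} Padded;

/* Byte offset of the array inside each struct. */
size_t loose_array_offset(void)  { return offsetof(Loose, array); }
size_t padded_array_offset(void) { return offsetof(Padded, array); }
```

If `loose_array_offset() % 8` is nonzero, an 8-byte alignment assertion on that array is false even when the struct itself starts on an 8-byte boundary.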

0 Victor Kazmirenko over 6 years ago in reply to Alberto Chessa

Guru 13042 points

It seems I was giving unverified advice about alignment. Arrays that are structure members appear to get aligned as their element type requires, unless a preceding member has a larger alignment requirement. This could be the reason you see wrong numbers after the modification. One way to ensure alignment is to place a dummy field of the required alignment ahead of the array, like:

struct 
{
    char a;
    char b;

    double dummy;       // enforces 8B alignment
    short  array[size]; // aligned array.
};
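One more thing worth checking, offered as a hedged observation rather than a diagnosis: Init_Post_Filter sets res2 = res2_buf + 143, so the pointer used in the loop sits 143 * 2 = 286 bytes past the start of the buffer, and 286 % 8 = 6. Even with an 8-byte-aligned buffer, that pointer would not be 8-byte aligned. The sketch below only does this arithmetic; the helper name `byte_misalignment` is hypothetical, and it assumes Word16 is a 2-byte type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int16_t Word16;   /* assumption: Word16 is a 16-bit integer */

#define PIT_MAX 143       /* element offset used in Init_Post_Filter */

/* Number of bytes by which (base + elem_offset) misses an
   `align`-byte boundary, assuming base itself is aligned to it. */
size_t byte_misalignment(size_t elem_offset, size_t align)
{
    return (elem_offset * sizeof(Word16)) % align;
}
```

Here `byte_misalignment(PIT_MAX, 8)` evaluates to 6. An offset that is a multiple of 4 elements would keep 8-byte alignment, but whether the buffer layout can be changed that way depends on the rest of the decoder.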
