
TI Compiler optimization: Linux vs. PC



I have a project being developed entirely under a Linux environment, and another being developed entirely under a PC environment.

I am running the same version of the TI Optimizing compiler on the same code with the same compiler flags, but it appears (from examining the .asm files) that the PC compiler produces loop code that is twice as efficient as the Linux compiler.

I've attached the .asm files (with the interlisting and extra loop debugging flags turned on).  You will see that both compilers claim to be v7.3.1 and report the same flags used, and the only difference is that one is Unix and one is PC.

4606.pc_bayer_conversion_fast.asm

2604.linux_bayer_conversion_fast.asm

If you search for the first instance of "SOFTWARE PIPELINE INFORMATION" in both files, you will see that the first loop (at line 16 of the code) has an ii (initiation interval) of 2 for a total cycle count of 1288 with the PC compiler, and an ii of 4 for a total cycle count of 2568 with the Linux compiler.  Just looking at the interlisting shortly before the software pipeline information, it appears that the PC compiler does some significantly different things in generating the optimized C code.
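(A rough sanity check on those numbers: for a software-pipelined loop, total cycles ≈ ii × kernel iterations + prolog/epilog overhead, so doubling the ii from 2 to 4 should roughly double the loop time, and indeed 2568 ≈ 2 × 1288.)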

So, is CGT v7.3.1 on Linux really the same as CGT v7.3.1 on the PC?  Should I expect inferior performance for code built on Linux?  Is there something obvious I am missing to make the Linux performance match the PC?

Just for completeness, when I do -version for the Linux version I get

TMS320C6x C/C++ Compiler                v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R

TMS320C6x C/C++ Parser                  v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x EABI C/C++ Parser             v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x C/C++ File Merge              v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x C/C++ Optimizer               v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x C/C++ Codegen                 v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x Consultant Generator          v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x Assembly Preprocessor         v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x Assembler                     v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x Compressor                    v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x Embed Utility                 v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x C Source Interlister          v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x Linker                        v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x Absolute Lister               v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x Strip Utility                 v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x XREF Utility                  v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x C++ Demangler                 v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x Hex Converter                 v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x Library Builder               v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x Name Utility                  v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x Object File Display           v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R
TMS320C6x Archiver                      v7.3.1
Build Number 1LJRN-MP70E-UARAR-SAW-ZAZG_X_T_R

and for the PC I get

TMS320C6x C/C++ Compiler                v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R

TMS320C6x C/C++ Parser                  v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x EABI C/C++ Parser             v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x C/C++ File Merge              v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x C/C++ Optimizer               v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x C/C++ Codegen                 v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x Consultant Generator          v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x Assembly Preprocessor         v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x Assembler                     v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x Compressor                    v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x Embed Utility                 v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x C Source Interlister          v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x Linker                        v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x Absolute Lister               v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x Strip Utility                 v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x XREF Utility                  v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x C++ Demangler                 v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x Hex Converter                 v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x Library Builder               v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x Name Utility                  v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x Object File Display           v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R
TMS320C6x Archiver                      v7.3.1
Build Number 1LJRN-KDADEMDK-RTARQ-TAV-ZAZG_X_T_R

  • Jay Gowdy81418 said:
    So, is CGT v7.3.1 on Linux really the same as CGT v7.3.1 on the PC?

    Yes.

    Should I expect inferior performance for code built on Linux?

    No.

    Is there something obvious I am missing to make the Linux performance match the PC?

    No, it should be exactly the same.

    I'll look at your assembly code, but without a compilable test case, it may not be possible to find the problem.

  • Those two assembly files are dramatically different; in particular, the PC version has about 10 times the amount of DWARF debugging information. Are you absolutely sure you are using identical flags on Linux and PC?

    The performance differences happen far upstream of software pipelining.  I would need a test case to investigate further.

  • So I tried to make a standalone example, and found a bizarre thing.

    My code is structured like this:

    static inline void downsample(...) {
        ...
    }

    void realfunc() {
        downsample(...);
    }

    When I build this code on the PC under Code Composer, the loop in downsample takes 1288 cycles.  When I build the SAME code with the SAME compiler version and the SAME flags (as far as the .asm output file is concerned) under Linux, the loop takes 2568 cycles.

    Under Linux, if I take out the static inline modifiers from the downsample function, so I have just

    void downsample(...) ...

    Then I get the original 1288 cycle count in the augmented .asm file.

    I'll try and get this packaged up into a standalone example that shows the problem, but do you have any ideas?

      Jay

  • It looks like the PC/Linux difference was a red herring.  I tracked this down to a different compiler flag bringing in the "EDMA" version of some memory transfer code on the PC and an inlined "memcpy"-based version on Linux.  When I used the same flag on the PC side, I could make the problem come and go there as well.

    I have attached a stripped-down file illustrating the problem.

    typedef struct {
      void* restrict dest;
      const void* restrict src;
      void* restrict dest_base;      /* set in main setups so we can use offsets */
      const void* restrict src_base; /* in later code to avoid passing in original pointers */
      unsigned short num_bytes;   /* bytes in a single block */
      unsigned short num_blocks;  /* number of blocks in an array */
      unsigned short block_stride;  /* number of bytes between blocks in an array transfer */
    } memxfr_struct;
    
    static void memxfr_set_dest(memxfr_struct* spec, void* dest)
    {
      spec->dest = dest;
    }
    
    static void memxfr_set_src(memxfr_struct* spec, const void* src)
    {
      spec->src = src;
    }
    
    static void* memxfr_get_dest(memxfr_struct* spec)
    {
      return spec->dest;
    }
    
    static const void* memxfr_get_src(memxfr_struct* spec)
    {
      return spec->src;
    }
    
    static void memxfr_set_num_bytes(memxfr_struct* spec, unsigned short num_bytes)
    {
      spec->num_bytes = num_bytes;
    }
    
    static void memxfr_set_block_stride(memxfr_struct* spec, unsigned short block_stride)
    {
      spec->block_stride = block_stride;
    }
    
    static void memxfr_set_num_blocks(memxfr_struct* spec, unsigned short num_blocks)
    {
      spec->num_blocks = num_blocks;
    }
    
    static int memxfr_block_setup(memxfr_struct* spec, 
                                          const void* src, void* dest, unsigned short num_bytes)
    {
      spec->src = spec->src_base = src;
      spec->dest = spec->dest_base = dest;
      spec->num_bytes = num_bytes;
      spec->num_blocks = 1;
      spec->block_stride = 0;
    
      return 0;
    }
    
    static int memxfr_array_setup(memxfr_struct* spec, 
                                          const void* src, void* dest,
                                          unsigned short num_bytes,
                                          unsigned short num_blocks,
                                          unsigned short block_stride)
    {
      spec->src = spec->src_base = src;
      spec->dest = spec->dest_base = dest;
      spec->num_bytes = num_bytes;
      if (num_blocks < 1)
        num_blocks = 1;
      spec->num_blocks = num_blocks;
      spec->block_stride = block_stride;
    
      return 0;
    }
    
    static int memxfr_start(memxfr_struct* spec)
    {
      return 0;
    }
    
    static int memxfr_wait(memxfr_struct* spec, int timeout)
    {
      unsigned short i;
      unsigned char* restrict dest = spec->dest;
      const unsigned char* restrict src = spec->src;
    
      for (i=0;i<spec->num_blocks;i++) {
      /* CRITICAL SECTION */
    #if 1
        memcpy(dest, src, spec->num_bytes);
    #endif
        src += spec->block_stride;
        dest += spec->num_bytes;
      }
    
      return 0;
    }
    
    static int memxfr_release(memxfr_struct* spec)
    {
      return 0;
    }
    
    #define RAW_IMAGE_WIDTH 1280
    #define RAW_IMAGE_HEIGHT 960
    #define OUTPUT_HEIGHT 720
    #define BUFFER_OUTPUT_HEIGHT 960
    
    /* CRITICAL: switch if you want to change efficiency */
    static inline void downsample_line(const unsigned short* src, unsigned char* dest)
    //void downsample_line(const unsigned short* src, unsigned char* dest)
    {
      short i;
    
      for (i=0;i<RAW_IMAGE_WIDTH;i++) {
        /* swap bytes and shift by 4 simultaneously */
        dest[i] = ((src[i] & 0xFF) << 4) | ((src[i] >> 12));
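        /* equivalently: byte-swap the 16-bit pixel, then shift right by 4:
           ((lo << 8) | hi) >> 4 == (lo << 4) | (hi >> 4),
           with lo = src[i] & 0xFF and hi = src[i] >> 8 */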
      }
    }
    
    int ds_bayer_to_yuv420sp_fast(const unsigned short * restrict pusBayer_Raw_Img, unsigned char *restrict pucyuv420sp_ImgBuffer, short row_start)
    {
      memxfr_struct imgxfr;
      memxfr_struct Yxfr, UVxfr;
      unsigned short raw_img_data[RAW_IMAGE_WIDTH];
      unsigned char img_data[RAW_IMAGE_WIDTH*2];
      unsigned char Y[RAW_IMAGE_WIDTH], UV[RAW_IMAGE_WIDTH];
      short r, c;
      short red, green, blue;
      unsigned char* restrict cur;
      unsigned char* restrict next;
      unsigned char* tmp;
    
      /* make sure row start is in bounds */
      if (row_start < 1 || row_start >= RAW_IMAGE_HEIGHT-OUTPUT_HEIGHT-1)
        return 0;
    
      if (memxfr_block_setup(&imgxfr, 0L, raw_img_data,
                             RAW_IMAGE_WIDTH*sizeof(unsigned short))) {
        return -1;
      }
      if (memxfr_block_setup(&Yxfr, Y, 0L, sizeof(Y))) {
        return -1;
      }
      if (memxfr_block_setup(&UVxfr, UV, 0L, sizeof(UV))) {
        return -1;
      }
    
      memxfr_set_src(&imgxfr, pusBayer_Raw_Img + (row_start+1)*RAW_IMAGE_WIDTH);
      memxfr_start(&imgxfr);
    
      cur = &img_data[0];
      next = &img_data[RAW_IMAGE_WIDTH];
    
      downsample_line(pusBayer_Raw_Img + (row_start)*RAW_IMAGE_WIDTH, cur);
    
      for (r=row_start;r<row_start+OUTPUT_HEIGHT;r++) {
        /* if necessary, wait for the transfer of the Bayer pattern from DDR */
        if (memxfr_wait(&imgxfr, 10)) {
          return -1;
        }
    
        /* make a down sampled copy of the next row */
        downsample_line(raw_img_data, next);
    
        /* start transferring the next image data while processing data */
        if (r != row_start+OUTPUT_HEIGHT-1) {
          memxfr_set_src(&imgxfr, pusBayer_Raw_Img+ (r+2)*RAW_IMAGE_WIDTH);
          memxfr_start(&imgxfr);
        }
    
        /* make sure we have finished writing from previous time */
        if (r != row_start) {
          if (memxfr_wait(&Yxfr, 1)) {
            return -1;
          }
        }
    
        /* calculate the bilinearly interpolated Y value */
        /* on even rows we also do U and V */
        if (!(r & 1)) {
          /* make sure we are finished transferring the UV outputs */
          if (r != row_start) {
            if (memxfr_wait(&UVxfr, 1)) {
              return -1;
            }
          }
    
          for (c=0;c<RAW_IMAGE_WIDTH-2;c+=2) {
            red = cur[c+1];
            green = (cur[c] + next[c+1])/2;
            blue = next[c];
          
            Y[c] = (red+green+blue)/3;
            UV[c] = red;
            UV[c+1] = blue;
    
            Y[c+1] = (red+green+blue)/3;
          }
          /* have to handle the last cell separately so we don't read beyond the
             end of the line */
          c = RAW_IMAGE_WIDTH-2;
          red = cur[c+1];
          green = (cur[c] + next[c+1])/2;
          blue = next[c];
          
          Y[c+1] = Y[c] = (red+green+blue)/3;
          UV[c] = red;
          UV[c+1] = green;
    
          /* start the transfer of the UV values to DDR */
          memxfr_set_dest(&UVxfr,
                  pucyuv420sp_ImgBuffer + RAW_IMAGE_WIDTH*BUFFER_OUTPUT_HEIGHT +
                  (r-row_start)*RAW_IMAGE_WIDTH/2);
          memxfr_start(&UVxfr);
        } else {
          /* odd rows, no U and V */
          for (c=0;c<RAW_IMAGE_WIDTH-2;c+=2) {
            Y[c] = (next[c+1]+(cur[c+1] + next[c])/2+cur[c])/3;
            Y[c+1] = (next[c+1]+(cur[c+1] + next[c+2])/2+cur[c+2])/3;
          }
          /* have to handle the last cell separately so we don't read beyond the
             end of the line */
          c = RAW_IMAGE_WIDTH-2;
          Y[c+1] = Y[c] = (next[c+1]+(cur[c+1] + next[c])/2+cur[c])/3;
        }
    
        /* start the transfer of the Y values to DDR */
        memxfr_set_dest(&Yxfr, pucyuv420sp_ImgBuffer + (r-row_start)*RAW_IMAGE_WIDTH);
        memxfr_start(&Yxfr);
    
        /* swap image data buffers */
        tmp = cur;
        cur = next;
        next = tmp;
      }
    
      /* make sure all output transfers are done */
      memxfr_wait(&Yxfr, 1);
      memxfr_wait(&UVxfr, 1);
    
      /* and clean up memory transfers */
      memxfr_release(&imgxfr);
      memxfr_release(&Yxfr);
      memxfr_release(&UVxfr);
    
      return 1;
    }
    
    

    This code no longer works (I've stripped out the real RGB->YUV conversions), but it does compile and build a .asm file when I use the command line:

    c:/ti/ccsv5/tools/compiler/c6000/bin/cl6x.exe  -c -pdsw225 -O3 --symdebug:skeletal --no_bad_aliases --use_const_for_alias_analysis --debug_software_pipeline --optimizer_interlist --opt_for_speed=5 -k --src_interlist  -I. -mv6740 --abi=eabi --display_error_number --diag_suppress=179 --diag_warning=225  --output_file=Release/downsample.oe674 -fc downsample.c

    It appears that there are two "critical" points, either of which changes the speed of the first loop at line 114 (the loop in downsample_line) by a factor of two: commenting out the "memcpy" call at line 88 (inside memxfr_wait) OR changing the function definition at line 110 (downsample_line) from static inline void to plain void.

    I am not quite grasping why such minor changes cause such a major performance hit.  I suspect I am missing some underlying cause, and I would appreciate any help in spotting it.

      Thanks,

          Jay

  • One thing to note: changing "static inline void" to "void" is two changes, removing "static" and removing "inline".  The function is small enough that the optimizer inlines it either way, with or without the inline keyword; however, when the function is static, the optimizer can delete the out-of-line body entirely.  When the function is not static, the optimizer must assume some other module may call it.  In the case where memcpy is not called, comparing "static inline void" against "static void" shows no interesting difference between the two.  Analyzing the other cases now...
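    A minimal sketch of the distinction (function names are mine, not from the project):

    /* "static inline": every call can be inlined and the out-of-line
       body deleted, because no other translation unit can name it. */
    static inline void bump_static(int *p) { *p += 1; }

    /* No "static": the calls the compiler sees may still be inlined, but
       an out-of-line copy must survive for potential external callers. */
    void bump_extern(int *p) { *p += 1; }

    void caller(int *p)
    {
      bump_static(p);  /* inlined; no standalone bump_static is emitted */
      bump_extern(p);  /* may be inlined here too, but bump_extern is still emitted */
    }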

  • Without the memcpy call, the function memxfr_wait effectively does nothing, and calls to it can be eliminated.  I argue that the cases where memcpy is not called are not relevant.
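    In other words, once the memcpy is compiled out, memxfr_wait reduces to roughly this:

    /* What the optimizer is left with when the memcpy is removed: the
       pointer increments are dead code and the loop has no observable
       effects, so the whole function collapses to "return 0" and the
       call sites can be deleted. */
    static int memxfr_wait(memxfr_struct* spec, int timeout)
    {
      return 0;
    }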

  • Thanks for digging into this.

    I think I understand this point and the previous one, but what I don't get is why eliminating the memcpy call, or changing the function from static to externally visible, affects the performance of the loop in the downsample function by a factor of two.  In those situations the compiler must be making some conservative assumption that constrains how it can optimize, resulting in a 2600 cycle count instead of a 1300 cycle count.  My fundamental questions are these: is there some lay-person way to understand (and possibly identify) what this compiler assumption is, and is there some less obscure way to get the compiler to make a different assumption than changing the function signature or eliminating calls to memcpy that occur before downsample is called?

       Jay

  • This may not be related to the problem, but function ds_bayer_to_yuv420sp_fast contains the following flip/flop construct which uses restrict incorrectly:

    unsigned char* restrict cur;
    unsigned char* restrict next;
    unsigned char* tmp;
    for (r = row_start; r < row_start + OUTPUT_HEIGHT; r++) {
        /* swap image data buffers */
        tmp = cur;
        cur = next;
        next = tmp;
    }

    cur and next from adjacent iterations of the loop alias each other, which violates the rules of restrict.  Please see section "4.9 Flip-flop pointers and FIFOs" in the paper Performance Tuning with the "Restrict" Keyword, which can be found at http://processors.wiki.ti.com/index.php/Restrict_Type_Qualifier
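    For reference, one repair in the spirit of that paper (a sketch, not a quote from it) is to drop restrict from the flip-flop pointers themselves and push it into a helper function, since a parameter's restrict guarantee is scoped to a single call; process_row below is a hypothetical stand-in for the per-row work:

    static void process_row(unsigned char* restrict cur,
                            unsigned char* restrict next)
    {
      /* per-row work; within any one call, cur and next never alias */
    }

    unsigned char* cur;   /* note: no restrict on the flip-flop pointers */
    unsigned char* next;
    unsigned char* tmp;
    for (r = row_start; r < row_start + OUTPUT_HEIGHT; r++) {
      process_row(cur, next);
      /* swapping is now legal: the no-alias promise lives inside process_row */
      tmp = cur; cur = next; next = tmp;
    }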

  • Okay, I've gotten as far as I can, and my answer to you is "I don't know."  The compiler is a complicated beast, and I can't quite put my finger on what the difference is here.  It probably has to do with the use of restrict in a function that gets inlined, but I can't put it into an easy-to-understand form, because I don't understand it.  I've submitted performance defect SDSCM00045437 for further analysis.

  • Thanks for the effort.  You helped me isolate the key, albeit somewhat arbitrary, change that makes the code run faster, even if we couldn't get to the bottom of why.  That is good enough for me.

    Also, thanks for the tip on the flip/flop pointers and restrict.   After reading the doc, that actually makes sense and I've changed my code to do the right thing.

      Jay