Compiler: poor performance after upgrading the code generation toolchain from 7.4.x to 8.3.x

Huan Hou

Hi, all

I recently upgraded the C6000 code generation toolchain from 7.4.x to 8.3.x for builds targeting C66x DSPs. Performance got worse, so I checked the generated assembly. I found that the new compiler does worse when an inline function takes a parameter passed as a restrict pointer, as below:

static void inline _smpy2_hi_lo (int src1, int src2, int *restrict out_hi, int *restrict out_lo)
{
    long long out = _smpy2ll(src1, src2);
    *out_hi = _hill(out);
    *out_lo = _loll(out);
}

for()
{
    ....
    _smpy2_hi_lo(inA1, inB1, &out_hi1, &out_lo1);
    .....
}

I know I can simplify the code by using just _smpy2ll with its 64-bit return value. But I wonder why the same code works well with compiler v7.4.x but not with v8.3.x. Is there a simple compiler option in the latest v8.3.x compiler to handle inline functions with pointer parameters like this?
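A minimal sketch of that simplification: keep the packed 64-bit result as a value and split it at the call site, so no output pointers are involved. The names sat_mpy_x2, smpy2ll_model, hi32, lo32, and mul_q15_pair are hypothetical host-side stand-ins, assumed to mirror _smpy2ll, _hill, _loll, and _packh2 so the sketch builds with any C99 compiler; on C66x the intrinsics would be called directly. It only shows the source shape and says nothing about why 8.3.x behaves differently.

```c
#include <stdint.h>

/* Hypothetical model of one SMPY2 lane: signed 16x16 multiply, doubled,
 * with the single overflow case (-32768 * -32768) saturated. */
static inline int32_t sat_mpy_x2(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    return (p == 0x40000000) ? INT32_MAX : p * 2;
}

/* Assumed stand-in for _smpy2ll: upper halfwords feed the high word,
 * lower halfwords feed the low word of the 64-bit result. */
static inline int64_t smpy2ll_model(int32_t src1, int32_t src2)
{
    uint32_t u1 = (uint32_t)src1, u2 = (uint32_t)src2;
    int32_t hi = sat_mpy_x2((int16_t)(u1 >> 16), (int16_t)(u2 >> 16));
    int32_t lo = sat_mpy_x2((int16_t)(u1 & 0xFFFFu), (int16_t)(u2 & 0xFFFFu));
    return (int64_t)(((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo);
}

/* Assumed stand-ins for _hill and _loll. */
static inline int32_t hi32(int64_t v) { return (int32_t)((uint64_t)v >> 32); }
static inline int32_t lo32(int64_t v) { return (int32_t)(uint32_t)v; }

/* Multiply two packed Q15 pairs. The 64-bit product stays a value and is
 * split here; the last line packs the two upper halfwords (as _packh2 would). */
static inline int32_t mul_q15_pair(int32_t a, int32_t b)
{
    int64_t out = smpy2ll_model(a, b);
    uint32_t hi = (uint32_t)hi32(out);
    uint32_t lo = (uint32_t)lo32(out);
    return (int32_t)((hi & 0xFFFF0000u) | (lo >> 16));
}
```

For example, multiplying the Q15 pair (0.5, 0.25) by (0.5, 0.5) yields the pair (0.25, 0.125), that is 0x20001000.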

Thank you.

over 3 years ago

0 George Mock over 3 years ago

TI__Guru**** 244030 points

Please see if adding the compiler option --legacy improves performance. If it doesn't, then for a source file that contains a problem call to _smpy2_hi_lo, please follow the directions in the article How to Submit a Compiler Test Case.

Thanks and regards,

-George

0 Huan Hou over 3 years ago in reply to George Mock

Expert 1505 points

Thanks, George, for your advice. I tried the --legacy option, but it doesn't help; it actually makes performance even worse. I'll submit a compiler test case as you suggested.

0 Huan Hou over 3 years ago in reply to George Mock

Expert 1505 points

test_smpy2ll.pp.txt

static void inline _smpy2_hi_lo (int src1, int src2, int *restrict out_hi, int *restrict out_lo)
{
    long long out = _smpy2ll(src1, src2);
    *out_hi = _hill(out);
    *out_lo = _loll(out);
}
void ym_mult_q15(short *restrict pSrcA,
                 short *restrict pSrcB,
                 short *restrict pDst,
                 unsigned int blockSize)
{
    unsigned int blkCnt;                               /* loop counters */
    int inA1, inA2, inB1, inB2;                  /* temporary input variables */
    long long inA, inB;
    long long *restrict p_srcA = (long long*)pSrcA;
    long long *restrict p_srcB = (long long*)pSrcB;
    long long *restrict p_dst = (long long*)pDst;

    _nassert(((int)pSrcA & 0x07) == 0);
    _nassert(((int)pSrcB & 0x07) == 0);
    _nassert(((int)pDst & 0x07) == 0);
    _nassert((int)blockSize > 0);
    _nassert((int)(blockSize & 7) == 0);
    /* loop Unrolling */
    blkCnt = blockSize >> 3U;
    while (blkCnt > 0U)
    {
        int out_lo1, out_hi1, out_hi2, out_lo2;
        /* read two samples at a time from sourceA */
        inA = _amem8(p_srcA);
        p_srcA++;
        inA1 = _loll(inA);
        inA2 = _hill(inA);
        /* read two samples at a time from sourceB */
        inB = _amem8(p_srcB);
        p_srcB++;
        inB1 = _loll(inB);
        inB2 = _hill(inB);

        /* multiply mul = sourceA * sourceB */
        _smpy2_hi_lo(inA1, inB1, &out_hi1, &out_lo1);
        _smpy2_hi_lo(inA2, inB2, &out_hi2, &out_lo2);
        /* store the result */
        _amem8(p_dst) = _itoll(_packh2(out_hi2, out_lo2), _packh2(out_hi1, out_lo1));
        p_dst++;

        inA = _amem8(p_srcA);
        p_srcA++;
        inA1 = _loll(inA);
        inA2 = _hill(inA);
        /* read two samples at a time from sourceB */
        inB = _amem8(p_srcB);
        p_srcB++;
        inB1 = _loll(inB);
        inB2 = _hill(inB);

        /* multiply mul = sourceA * sourceB */
        _smpy2_hi_lo(inA1, inB1, &out_hi1, &out_lo1);
        _smpy2_hi_lo(inA2, inB2, &out_hi2, &out_lo2);
        /* store the result */
        _amem8(p_dst) = _itoll(_packh2(out_hi2, out_lo2), _packh2(out_hi1, out_lo1));
        p_dst++;
        /* Decrement the blockSize loop counter */
        blkCnt--;
    }
}

Hi, George. Attached is the generated pp file from my Linux makefile.

compiler version ti-cgt-c6000_8.3.12

compiler options: --abi=eabi --strip_coff_underscore -mv6600 -O3 -pm -mf -mt --debug_software_pipeline --src_interlist --preproc_with_comment --preproc_with_compile

If there is anything you would like me to try, please let me know. Thanks a lot.

0 George Mock over 3 years ago in reply to Huan Hou

TI__Guru**** 244030 points

Thank you for the test case. I am able to reproduce the same behavior. I filed the entry EXT_EP-10852 to have this investigated. You are welcome to follow it with that link.

The best workaround is to remove restrict from the parameters to the function _smpy2_hi_lo. These same parameters are already restrict modified in the calling function ym_mult_q15.
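A minimal sketch of the source shape this workaround leads to: the inline helper takes plain output pointers, while restrict stays on the caller's buffer pointers. The names sat_mpy_x2, smpy2_hi_lo, and mult_q15_pairs are hypothetical, and the arithmetic is a host-side stand-in assumed to mirror _smpy2ll and _packh2 so the sketch builds with any C99 compiler. Whether a given loop then pipelines as well as it did under 7.4.x still has to be checked in the compiler's software-pipeline report.

```c
#include <stdint.h>

/* Hypothetical model of one SMPY2 lane: signed 16x16 multiply, doubled,
 * with the single overflow case (-32768 * -32768) saturated. */
static inline int32_t sat_mpy_x2(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    return (p == 0x40000000) ? INT32_MAX : p * 2;
}

/* Inline helper with NO restrict on the output pointers (the workaround). */
static inline void smpy2_hi_lo(int32_t src1, int32_t src2,
                               int32_t *out_hi, int32_t *out_lo)
{
    uint32_t u1 = (uint32_t)src1, u2 = (uint32_t)src2;
    *out_hi = sat_mpy_x2((int16_t)(u1 >> 16), (int16_t)(u2 >> 16));
    *out_lo = sat_mpy_x2((int16_t)(u1 & 0xFFFFu), (int16_t)(u2 & 0xFFFFu));
}

/* The caller keeps restrict on its own buffers, as ym_mult_q15 does. */
void mult_q15_pairs(const int32_t *restrict a, const int32_t *restrict b,
                    int32_t *restrict dst, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        int32_t hi, lo;
        smpy2_hi_lo(a[i], b[i], &hi, &lo);
        /* Pack the upper halfword of each product (what _packh2 is assumed to do). */
        dst[i] = (int32_t)(((uint32_t)hi & 0xFFFF0000u) | ((uint32_t)lo >> 16));
    }
}
```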

Thanks and regards,

-George

0 Huan Hou over 3 years ago in reply to George Mock

Expert 1505 points

George, thanks for your feedback. Yes, removing the restrict keyword from the inline function did make the optimization of ym_mult_q15 work. But the weird thing is that not all inline functions with the restrict keyword hurt the optimization. I have several inline functions with similar declarations, and only a few of them have the same optimization problem as _smpy2_hi_lo. I also have some other functions that do not work well with the new compiler even after the restrict keyword is removed. Please help take a look at what the problem could be.

test_accum2Ch.pp.txt

// 4 element accumulation for interleaved 2 chans
static int inline accumInter2Ch4Elements(const signed short *p_rowIn, const signed char *p_coef, int *restrict p_conOut1)
{
    int convOut, convOut1;
    long long in0, out0, out1, coef0;
    long long *restrict p_srcA = (long long*)p_rowIn;
    long long *restrict p_srcB = (long long*)p_coef;

    in0 = _amem8(p_srcA);   coef0 = _amem8(p_srcB);
    out0 = _ddotp4(_loll(in0), _loll(coef0));
    out1 = _ddotp4(_hill(in0), _hill(coef0));
    convOut = _loll(out0) + _loll(out1);
    convOut1 = _hill(out0) + _hill(out1);
    *p_conOut1 = convOut1;
    return convOut;
}

// to do the convolution operations without padding, the padding should be done before the convolution
void conv2D_inner_loop(const signed short *restrict p_freqTimeChan, const unsigned int inCols, const signed char *restrict p_coef, 
    const unsigned int freqKernelSize, signed short *restrict p_out, const unsigned int freqStride)
{
    unsigned int inFreq, outFreq = 0;
    int convOut, convOut1, convOut2, convOut3;
    int convOutCh1, convOut1Ch1, convOut2Ch1, convOut3Ch1;
    const signed short *restrict p_rowIn, *restrict p_rowIn1, *restrict p_rowIn2, *restrict p_rowIn3;
    const unsigned int out_shift = 15;

    convOut1 = convOut2 = convOut3 = convOut = 0;
    convOut1Ch1 = convOut2Ch1 = convOut3Ch1 = convOutCh1 = 0;
    p_rowIn = p_freqTimeChan + inCols * outFreq;
    p_rowIn1 = p_freqTimeChan + inCols * (outFreq + 1 * freqStride);
    p_rowIn2 = p_freqTimeChan + inCols * (outFreq + 2 * freqStride);
    p_rowIn3 = p_freqTimeChan + inCols * (outFreq + 3 * freqStride);
    for (inFreq = 0; inFreq < freqKernelSize * inCols; inFreq+=4)
    {
        int cvOutCh1, cvOut1Ch1, cvOut2Ch1, cvOut3Ch1;
        convOut += accumInter2Ch4Elements(p_rowIn, p_coef, &cvOutCh1);
        convOut1 += accumInter2Ch4Elements(p_rowIn1, p_coef, &cvOut1Ch1);
        convOut2 += accumInter2Ch4Elements(p_rowIn2, p_coef, &cvOut2Ch1);
        convOut3 += accumInter2Ch4Elements(p_rowIn3, p_coef, &cvOut3Ch1);
        convOutCh1 = convOutCh1 + cvOutCh1;
        convOut1Ch1 = convOut1Ch1 + cvOut1Ch1;
        convOut2Ch1 = convOut2Ch1 + cvOut2Ch1;
        convOut3Ch1 = convOut3Ch1 + cvOut3Ch1;
        p_coef+=8; p_rowIn+=4; p_rowIn1+=4; p_rowIn2+=4; p_rowIn3+=4;
    }
    _amem4(&p_out[0]) = _spack2( (convOutCh1 >> out_shift), (convOut >> out_shift));
    _amem4(&p_out[1]) = _spack2( (convOut1Ch1 >> out_shift), (convOut1 >> out_shift));
    _amem4(&p_out[2]) = _spack2( (convOut2Ch1 >> out_shift), (convOut2 >> out_shift));
    _amem4(&p_out[2]) = _spack2( (convOut3Ch1 >> out_shift), (convOut3 >> out_shift));
}

compiler version ti-cgt-c6000_8.3.12

compiler options: --abi=eabi --strip_coff_underscore -mv6600 -O3 -pm -mf -mt --debug_software_pipeline --src_interlist --preproc_with_comment --preproc_with_compile

From the optimization output, with CGT v7.4.24 the loop above reaches ii = 4, but with CGT v8.3.12 it only reaches ii = 7. That is 75% more cycles per iteration of the pipelined loop, which is quite a performance loss. In this case the performance is similar whether or not the restrict keyword is used on the inline function's parameter. I checked the assembly code, and it seems the compiler fails to generate load instructions with automatic address update, which the old toolchains did implicitly.

Is there any workaround to make the compiler do better? I hope the new toolchain keeps improving, which is why I plan to upgrade.

Best Regards

-Huan

0 George Mock over 3 years ago in reply to Huan Hou

TI__Guru**** 244030 points

Thank you for a second test case. I can reproduce the same behavior. I filed EXT_EP-10854 to have this investigated. You are welcome to follow it with that link.

At the same time, the way you code the inline function is a bit odd. It returns one int result, then assigns the other int result through a pointer. Then, after all the calls to the inline function, it adds those two int results together. It is hard for the compiler to consistently see through all of that to generate good code.

As an alternative, consider using one of the functions from dsplib. Even if it does not have a function exactly like yours, it will have something close. Learn from the programming techniques it uses. It rarely (never?) uses inline functions the way your code does.

Thanks and regards,

-George

0 Huan Hou over 3 years ago in reply to George Mock

Expert 1505 points

Sorry George. I don't catch your point. I think you may have confused the two variables cvOutCh1 and convOutCh1. The result returned through the pointer goes into the local variable cvOutCh1, which is then accumulated into the outer variable convOutCh1. This function is like FIR filtering of the two channels together, quite similar to the FIR filtering in dsplib, but with two-channel input to reduce the memory load operations.

Sure, I should have named the accumulated variable better to avoid the misunderstanding.

Thank you again for filing the compiler issue for further investigation.

Best Regards

-Huan

0 JohnS over 3 years ago in reply to Huan Hou

TI__Guru**** 161500 points

Huan,

George is out today and will be back in the office on Monday.

Regards,

John

0 George Mock over 3 years ago in reply to Huan Hou

TI__Guru**** 244030 points

Huan Hou said:
Sorry George. I don't catch your point.

I don't think it is a critically important point. I'm saying that ...

    *p_conOut1 = convOut1;
    return convOut;

... is unusual. I would not expect any compiler to handle that very well. Somehow, the older compiler does. I refer you to dsplib as an example of more typical coding techniques.

Thanks and regards,

-George

0 Huan Hou over 3 years ago in reply to George Mock

Expert 1505 points

Hi, George. Judging from the generated assembly, both the new compiler and the old compiler handle the part you listed well. For the example code, the assembly shows two places where the new compiler does worse:

1. the duplicated load of the coef

2. the extra addition for the address update instead of addressing with increment

Those extra instructions occupy more execution units, which increases the iteration interval.
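One source-level experiment aimed at the first point, sketched under assumptions and not tested on either compiler version: read the shared coefficient block once per loop step in the caller and hand it to the helper, which returns both channel sums by value. The names acc2_t, dot4_2ch, and accum_two_rows are hypothetical, and dot4_2ch is a plain-C stand-in for the two _ddotp4 calls that is not claimed to match the lane order of that instruction.

```c
#include <stdint.h>
#include <string.h>

/* Two channel sums returned by value (hypothetical type). */
typedef struct { int32_t ch0; int32_t ch1; } acc2_t;

/* Plain-C stand-in for the two _ddotp4 calls: 4 samples against
 * 8 coefficients interleaved as ch0, ch1, ch0, ch1, ... (assumed layout). */
static inline acc2_t dot4_2ch(const int16_t s[4], const int8_t c[8])
{
    acc2_t r = {0, 0};
    for (int k = 0; k < 4; k++) {
        r.ch0 += s[k] * c[2 * k];
        r.ch1 += s[k] * c[2 * k + 1];
    }
    return r;
}

/* Two rows share each coefficient block; n_samples is a multiple of 4. */
void accum_two_rows(const int16_t *row0, const int16_t *row1,
                    const int8_t *coef, unsigned n_samples, acc2_t out[2])
{
    acc2_t a0 = {0, 0}, a1 = {0, 0};
    for (unsigned i = 0; i < n_samples; i += 4) {
        int8_t c[8];
        memcpy(c, coef, sizeof c);      /* coefficient block read once per step */
        acc2_t r0 = dot4_2ch(row0, c);  /* both rows reuse the same local copy */
        acc2_t r1 = dot4_2ch(row1, c);
        a0.ch0 += r0.ch0; a0.ch1 += r0.ch1;
        a1.ch0 += r1.ch0; a1.ch1 += r1.ch1;
        coef += 8; row0 += 4; row1 += 4;
    }
    out[0] = a0;
    out[1] = a1;
}
```

On the target, the local copy would presumably be a single _amem8 load whose value is passed to the helper. Whether that removes the duplicated load in the 8.3.x schedule would have to be confirmed in the generated assembly.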

Hope this helps.

Best Regards

Huan
