Compiler/C6000-CGT: Software Pipelined Loop Problem

Curtis Belknap

Part Number: C6000-CGT

Tool/software: TI C/C++ Compiler

Hello,

I have a function that estimates a simple covariance matrix from signals received from multiple ADC channels. When the code is optimized at -O3, software pipelining is enabled and causes the output of my covariance matrix to deviate from the expected values. What is interesting is that the deviation only happens on the third through seventh iterations of the m-indexed loop.

void calc_B(float* restrict B, int32_t n_chan, int32_t n_samples, float* samples)
{
    int32_t* restrict offset1;
    int32_t* restrict offset2;
    int32_t idx;
    int64_t samples1;
    int64_t samples2;
    int64_t samples3;
    int64_t samples4;
    int32_t b_idx;
    int64_t lltmp;
    int32_t itmp;
    float scale;
    float acc1;
    float acc2;
    float acc3;
    float acc4;
    int32_t n;
    int32_t m;

    // Faster than 1.0f / n_samples
    scale = recipsp(n_samples);

    b_idx = 0;

    for (n = 0; n < n_chan; n++)
    {
        offset1 = &input_samples[n_samples * n];

        acc1 = 0.0f;
        acc2 = acc1;
        acc3 = acc2;
        acc4 = acc3;

        // First inner-loop calculates matrix diagonal elements
        for (idx = 0; idx < n_samples; idx += 8)
        {
            samples1 = _amem8(offset1 + idx);                        
            itmp     = _hill(samples1);                                    
            lltmp    = _mpy2ll(itmp, itmp);                       
            itmp     = _loll(samples1);                                    
            lltmp    = _dadd(lltmp, _mpy2ll(itmp, itmp));
            acc1    += _hill(lltmp) + _loll(lltmp); 

            samples2 = _amem8(offset1 + idx);                        
            itmp     = _hill(samples2);                                    
            lltmp    = _mpy2ll(itmp, itmp);                       
            itmp     = _loll(samples2);                                    
            lltmp    = _dadd(lltmp, _mpy2ll(itmp, itmp));
            acc2    += _hill(lltmp) + _loll(lltmp); 
            
            samples3 = _amem8(offset1 + idx);                        
            itmp     = _hill(samples3);                                    
            lltmp    = _mpy2ll(itmp, itmp);                       
            itmp     = _loll(samples3);                                    
            lltmp    = _dadd(lltmp, _mpy2ll(itmp, itmp));
            acc3    += _hill(lltmp) + _loll(lltmp); 
            
            samples4 = _amem8(offset1 + idx);                        
            itmp     = _hill(samples4);                                    
            lltmp    = _mpy2ll(itmp, itmp);                       
            itmp     = _loll(samples4);                                    
            lltmp    = _dadd(lltmp, _mpy2ll(itmp, itmp));
            acc4    += _hill(lltmp) + _loll(lltmp); 
        }

        B[b_idx++] = scale * (acc1 + acc2 + acc3 + acc4);

        // Off-diagonal entries (upper-right triangular)
        for (m = n + 1; m < n_chan; m++)
        {
            offset2 = &input_samples[n_samples * m];
            acc1 = 0.0f;
            acc2 = acc1;

            for (idx = 0; idx < n_samples; idx += 4)
            {
                samples1 = _amem8(offset1 + idx);
                samples2 = _amem8(offset2 + idx);
                itmp     = _hill(samples2);
                lltmp    = _mpy2ll(_hill(samples1), itmp);
                itmp     = _loll(samples2);
                lltmp    = _dadd(lltmp, _mpy2ll(_loll(samples1), itmp));
                acc1    += _hill(lltmp) + _loll(lltmp); 

                samples3 = _amem8(offset1 + idx + 2);
                samples4 = _amem8(offset2 + idx + 2);
                itmp     = _hill(samples4);
                lltmp    = _mpy2ll(_hill(samples3), itmp);
                itmp     = _loll(samples4);
                lltmp    = _dadd(lltmp, _mpy2ll(_loll(samples3), itmp));
                acc2    += _hill(lltmp) + _loll(lltmp); 
            }

            B[b_idx++] = scale * (acc1 + acc2);
        }
    }
}
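
For comparison against the -O3 output, one off-diagonal entry can be modeled on a host in plain C without any intrinsics. This is only a sketch: ref_offdiag is a hypothetical helper name, and it assumes each int32_t word packs two signed 16-bit values, so that the _mpy2ll / _dadd / _hill / _loll sequence amounts to summing the 16x16-bit products of matching halves. It accumulates in a 64-bit integer rather than in two float accumulators, so small rounding differences from the optimized routine are expected and a tolerance should be used when comparing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side reference for one off-diagonal entry of B.
 * Assumption: every int32_t word holds two packed signed 16-bit values
 * (high half and low half), and n_words counts 32-bit words. */
float ref_offdiag(const int32_t *ch1, const int32_t *ch2,
                  int32_t n_words, float scale)
{
    int64_t acc = 0;   /* exact integer accumulation, no float rounding */
    int32_t i;

    assert(ch1 != NULL && ch2 != NULL);

    for (i = 0; i < n_words; i++) {
        uint32_t w1 = (uint32_t)ch1[i];
        uint32_t w2 = (uint32_t)ch2[i];

        /* Split each word into its signed 16-bit halves. */
        int16_t lo1 = (int16_t)(uint16_t)(w1 & 0xFFFFu);
        int16_t hi1 = (int16_t)(uint16_t)(w1 >> 16);
        int16_t lo2 = (int16_t)(uint16_t)(w2 & 0xFFFFu);
        int16_t hi2 = (int16_t)(uint16_t)(w2 >> 16);

        /* Products of matching halves, widened so they cannot overflow. */
        acc += (int64_t)lo1 * lo2 + (int64_t)hi1 * hi2;
    }

    /* Scale once at the end, as the optimized routine does. */
    return scale * (float)acc;
}
```

Calling this once per (n, m) pair with the two channel pointers and comparing against the matching B entry would show exactly which m iterations deviate and by how much.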

I have read the compiler guide's advice on using the volatile keyword (https://www.ti.com/lit/ug/sprui04c/sprui04c.pdf#page=65 section 4.4.2). Making acc1 and acc2 volatile is sufficient to fix the numerical issue, but it introduces an intolerable performance decrease (~25% slower). Reading through the generated assembly, it looks like making acc1 and acc2 volatile destroys the inner-loop software pipeline.

Why would the software pipelined loop only disturb some of the m-indexed loop values?

--Curtis

over 3 years ago