How to optimize this code?

Hi,

This is the computationally intensive part of my code. Does anyone have any suggestions on how to optimize it?

// Reset delay_idx
delay_idx = (start_pulse_idx-BUFSIZE-delay[0]+max_shift-N_taps)%LONG_BUFSIZE;
for (idx_samp = 0; idx_samp < BUFSIZE; idx_samp++) // BUFSIZE-N_taps
{
    y_ll[0] = _dinthsp(p_data[0][idx_samp]);

    // Set tap_delay_idx
    tap_delay_idx = delay_idx;

    // Clear y
    y_est_ll[0] = 0;
    for (idx_tap = 0; idx_tap < 12; idx_tap++) // N_taps*2
    {
        // Calculate y_est
        y_est_ll[0] = _daddsp(y_est_ll[0], _complex_mpysp(x_ll_shared_data[0][tap_delay_idx], h_est_ll[idx_tap]));

        tap_delay_idx++;
        tap_delay_idx = (tap_delay_idx >= LONG_BUFSIZE) ? tap_delay_idx - LONG_BUFSIZE : tap_delay_idx; // wrap
    }

    delay_idx++;
    delay_idx = (delay_idx >= LONG_BUFSIZE) ? delay_idx - LONG_BUFSIZE : delay_idx; // wrap

    // Write output values
    x_ll_data[0][idx_samp] = _dsubsp(y_ll[0], y_est_ll[0]);
}

where the *_ll variables are __float2_t types, p_data is a volatile Uint32*, LONG_BUFSIZE = 10*BUFSIZE, BUFSIZE = 15360, and the iterators are global Uint32s.

  • Splitting into two loops seems to save ~500k ticks. Any other suggestions? I only need to shave off another 100k ticks!!! :)

    // Convert input values
    for (idx_samp = 0; idx_samp < BUFSIZE; idx_samp++) // BUFSIZE-N_taps
    {
        y_ll[idx_samp] = _dinthsp(p_data[0][idx_samp]);
    }

    // Cancellation
    // Reset delay_idx
    delay_idx = (start_pulse_idx-BUFSIZE-delay[0]+max_shift-N_taps)%LONG_BUFSIZE;
    for (idx_samp = 0; idx_samp < BUFSIZE; idx_samp++) // BUFSIZE-N_taps
    {
        // Set tap_delay_idx
        tap_delay_idx = delay_idx;

        // Clear y
        y_est_ll[0] = 0;
        for (idx_tap = 0; idx_tap < 12; idx_tap++) // N_taps*2
        {
            // Calculate y_est
            y_est_ll[0] = _daddsp(y_est_ll[0], _complex_mpysp(x_ll_shared_data[0][tap_delay_idx], h_est_ll[idx_tap]));

            tap_delay_idx++;
            tap_delay_idx = (tap_delay_idx >= LONG_BUFSIZE) ? tap_delay_idx - LONG_BUFSIZE : tap_delay_idx; // wrap
        }

        delay_idx++;
        delay_idx = (delay_idx >= LONG_BUFSIZE) ? delay_idx - LONG_BUFSIZE : delay_idx; // wrap

        // Write output values
        x_ll_data[0][idx_samp] = _dsubsp(y_ll[idx_samp], y_est_ll[0]);
    }
  • Some general comments not specific to your processor.
    Are y_ll and y_est_ll used elsewhere? If not, those two arrays could be replaced by two local variables. The index increment+check/reset/set could be reordered to check/reset/increment. Making the index variables local would help the compiler put variables into registers, and hinting that the most commonly used variables should be in registers might also help. That works only if your code snippet is contained in its own function, i.e. it is not part of a large chunk of code.

    void dostuff(void)
    {
        register __float2_t y; // Local summation variable.
        register Uint32 tap_delay_idx;
        ...
        // Clear y
        y = 0;
        for (idx_tap = 0; idx_tap < 12; idx_tap++) // N_taps*2
        {
            // Calculate y_est
            y = _daddsp(y, _complex_mpysp(x_ll_shared_data[0][tap_delay_idx], h_est_ll[idx_tap]));
            tap_delay_idx = (tap_delay_idx == (LONG_BUFSIZE-1)) ? 0 : tap_delay_idx + 1;
        }
        y_est_ll[0] = y; // Save y.
        ...
    }
    More could be done by dereferencing the 2D array pointers into 1D array pointers. I guess you pad the LONG_BUFSIZE buffers by BUFSIZE so that you don't have to bother wrapping around at all; that avoids the index wrap tests at the expense of a lot of memory.
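
    For illustration, a minimal sketch of the padding idea, using the names from the snippet above. It assumes x_ll_shared is allocated with BUFSIZE + N_taps*2 extra elements of headroom past LONG_BUFSIZE (an assumption on my part, not something from the original code):

    // Once per buffer update, mirror the head of the circular buffer past
    // its end, so neither loop ever needs a wrap test.
    for (idx_tap = 0; idx_tap < BUFSIZE + N_taps*2; idx_tap++)
        x_ll_shared_data[0][LONG_BUFSIZE + idx_tap] = x_ll_shared_data[0][idx_tap];

    // The inner loop can then index linearly, with no wrap test:
    for (idx_tap = 0; idx_tap < N_taps*2; idx_tap++)
        y_est_ll[0] = _daddsp(y_est_ll[0],
            _complex_mpysp(x_ll_shared_data[0][delay_idx + idx_tap], h_est_ll[idx_tap]));

    Alternatively, the producer could write each new sample to both regions, trading the mirror copy for a second store.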
  • Hi Norman,

    one thing to look at is, e.g., whether you actually use the full datapath bandwidth ... that can help a lot.

    There's a really good application report explaining all the optimization techniques in great detail (datapath optimization is explained in Chapter 4.1.4, Exploiting Wide Loads and Stores (SIMD)):

    http://www.ti.com/lit/an/spra666/spra666.pdf
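
    As a sketch of what that could look like for the input-conversion loop (untested; it assumes a little-endian device, that p_data is 8-byte aligned (otherwise use _mem8), and that <c6x.h> is included for the intrinsics):

    // Fetch two packed Uint32 samples with one 64-bit load, then convert both
    for (idx_samp = 0; idx_samp < BUFSIZE; idx_samp += 2)
    {
        long long pair = _amem8((void *)&p_data[0][idx_samp]); // casts away volatile
        y_ll[idx_samp]     = _dinthsp(_loll(pair)); // low word:  first sample
        y_ll[idx_samp + 1] = _dinthsp(_hill(pair)); // high word: second sample
    }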

    Kind regards,

    one and zero

  • Norman,

    Thank you for a very clear and extremely helpful post. The idea about rearranging the increment+check/reset/set was neat and made a large difference. Thanks also for the information about using local variables, a separate function, and registers. Overall I managed to cut the cycle count roughly in half, which means I can use nearly twice the number of taps in my filter!

    Using 1D pointers didn't seem to make any further improvement.

    The code is now very slow to build, even when I don't make any changes to the function in question - is there any way around this? E.g. telling the compiler that a certain section has not changed?

    Here is the final function for the community to see

    void DoCancellation()
    {
        // Define input pointers
        Uint32 *p_data;
        // Define output pointers
        __float2_t *x_ll_data;
        __float2_t *x_ll_shared_data;
        __float2_t *h_est_ll_data;

        // Assign input pointers
        p_data = &TmpBuf[0][0];
        // Assign output pointers
        x_ll_data = &x_ll[0][0];
        x_ll_shared_data = &x_ll_shared[0][0];
        h_est_ll_data = &h_est_ll[0];

        register Uint32 delay_idx_loc;
        register Uint32 tap_delay_idx_loc;
        register Uint32 idx_samp_loc;
        register Uint32 idx_tap_loc;
        register __float2_t y_est_ll_loc;
        register __float2_t y_ll_loc;

        // Set delay_idx_loc
        delay_idx_loc = (start_pulse_idx-BUFSIZE-delay[0]+max_shift-N_taps)%LONG_BUFSIZE;
        for (idx_samp_loc = 0; idx_samp_loc < BUFSIZE; idx_samp_loc++) // BUFSIZE-N_taps
        {
            // Set tap_delay_idx_loc
            tap_delay_idx_loc = delay_idx_loc;

            // Clear y
            y_est_ll_loc = 0;
            for (idx_tap_loc = 0; idx_tap_loc < TAPS*2; idx_tap_loc++) // N_taps*2
            {
                // Calculate y_est
                y_est_ll_loc = _daddsp(y_est_ll_loc, _complex_mpysp(x_ll_shared_data[tap_delay_idx_loc], h_est_ll_data[idx_tap_loc]));

                tap_delay_idx_loc = (tap_delay_idx_loc == (LONG_BUFSIZE-1)) ? 0 : tap_delay_idx_loc + 1;
            }
            delay_idx_loc = (delay_idx_loc == (LONG_BUFSIZE-1)) ? 0 : delay_idx_loc + 1;

            // Write output values
            y_ll_loc = _dinthsp(p_data[idx_samp_loc]);
            x_ll_data[idx_samp_loc] = _dsubsp(y_ll_loc, y_est_ll_loc);
        }
    }
  • You can also put your pointer variables into registers. Note that there is a limited number of registers; eventually the compiler will stop putting variables in registers and put them on the stack instead. You should declare your variables in order of precedence: the variables you think are used the most are declared first - usually the variables in the midst of loops. Variables that are used only once or twice should not be in registers.
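
    For example, applied to the DoCancellation() function above (the ordering below is just a guess at which variables are hottest):

    void DoCancellation()
    {
        // Most frequently used variables first, so they win registers
        register __float2_t *x_ll_shared_data = &x_ll_shared[0][0];
        register __float2_t *h_est_ll_data    = &h_est_ll[0];
        register Uint32 tap_delay_idx_loc;
        register Uint32 idx_tap_loc;
        ...
    }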

    The compiler can do register assignment automatically, depending on the optimization level. However, the compiler can't do that if it does not know the loop count; if the loop count is passed in at run time, manual register assignment is required. Automatic loop unrolling optimization is another matter. The downside of heavy optimization is that the code is hard to step through, since the generated instructions no longer match the source.
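
    One way to give the compiler the trip count (not mentioned above, but it is the standard TI compiler mechanism for exactly this case) is the MUST_ITERATE pragma:

    // Promise cl6x the loop runs exactly 15360 (BUFSIZE) times, a multiple of 2;
    // this lets it software-pipeline and unroll without a run-time trip test.
    #pragma MUST_ITERATE(15360, 15360, 2)
    for (idx_samp_loc = 0; idx_samp_loc < BUFSIZE; idx_samp_loc++)
    {
        ...
    }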

    As for compilation speed... how much code is in one file? Breaking the code up into several files may not reduce the overall build time, but at least you can see progress and possibly identify which bits of code are slow to compile.
  • Answers are bound to be speculative, because it's impossible to actually compile the snippet; the macros describing the vector lengths are missing. But I can tell that

    __float2_t foo(__float2_t *a,__float2_t *b, int j)
    {
        __float2_t ret1 = 0, ret2 = 0;
        int i;
    
        for (i = 0; i < 2*N;) {
            ret1 = _daddsp(ret1,_complex_mpysp(a[j],b[i++]));
            j = (j == JMAX-1) ? 0 : j+1;
            ret2 = _daddsp(ret2,_complex_mpysp(a[j],b[i++]));
            j = (j == JMAX-1) ? 0 : j+1;
        }
    
        return _daddsp(ret1, ret2);
    }
    

    asymptotically should be 50% faster than

    __float2_t foo(__float2_t *a,__float2_t *b, int j)
    {
        __float2_t ret = 0;
        int i;
    
        for (i = 0; i < 2*N;) {
            ret = _daddsp(ret,_complex_mpysp(a[j],b[i++]));
            j = (j == JMAX-1) ? 0 : j+1;
        }
    
        return ret;
    }
    

    Do note "asymptotically" and "should be". Asymptotically means "the larger N, the closer to 50%" (well, provided that the data is still in cache). And "should be" means "unless the compiler already did it for you given some specific flags". The 50% was observed with cl6x 7.4.x invoked with -mv6600. The downside is that you get a slightly different result. But different doesn't have to mean worse; depending on the data set it can even be more accurate. Or the whole thing can be misleading because N is never large enough. As already said, it's bound to be speculative...

  • Thanks Norman - roughly how many registers are available on the C6670?
  • That's very interesting, so it does two multiplications and additions per loop? By 50% faster, do you mean twice the speed or 150% of the speed?

    Thanks!
  • David Halls said:
    That's very interesting, so it does two multiplications and additions per loop?

    The key point is that it does a pair of independent operations per iteration, so that they can overlap. If you have only one accumulator, the execution time can't be less than the addition latency times the number of iterations (plus loop epilogue overhead). Multiple accumulators allow you to break this barrier (for large enough N).

    David Halls said:
    By 50% faster, do you mean twice the speed or 150% of the speed?

    5% improvement means 1.05 improvement factor, right? Hence 50% means 1.5. Once again, asymptotic.

  • >> 5% improvement means 1.05 improvement factor, right? Hence 50% means 1.5. Once again, asymptotic.

    Yes, I concur, just checking.

    >> The key point is that it does a pair of independent operations per iteration, so that they can overlap. If you have only one accumulator, the execution time can't be less than the addition latency times the number of iterations (plus loop epilogue overhead). Multiple accumulators allow you to break this barrier (for large enough N).

    Interesting stuff. If the inner loop is only 10-20 iterations, this won't achieve much speed-up? The outer loop in my code is large (BUFSIZE=15360), though....
  • David Halls said:
    if the inner loop is only 10-20 iterations, this won't achieve much speed-up?

    Correct. As implied, there are "epilogue costs". In more tangible terms it means that the loops in question execute in 3*2*N+15 and 4*N+18 cycles respectively. So for N=5 (recall that there are 2*N iterations in my examples, so N=5 corresponds to 10 iterations) the improvement would be 45/38=1.18; for a larger N=50 it would already be 315/218=1.44, approaching the asymptotic 1.5. That is specifically for the loop in question; if it's placed in a bigger context, the overall improvement would be even less, right?

    David Halls said:
    The outer loop in my code is (BUFSIZE=15360), though....

    As I said, if the code was actually ready to be compiled... The compiler might even choose to fully unroll the inner loop for such a small N. Well, you did mention it was 12 in the first post, but to be honest I missed that number in the first post... Sorry...