How to improve this decimation FIR performance?

Robert W

Hi,

I always get advice to write efficient C code on this forum. Here is an example I do not know how to write it in C, not in asm. The code is from Simulink DSP Coder generation. It is a decimation FIR (Decimation rate is 2).

for (j = 0; j < 27; j++) {
int da, db;
da = Hcic1_DWork.Hciccomp2_StatesBuff[curTapIdx];
db = (int32_T)Hcic1_ConstP.Hciccomp2_FILT[cffIdx];
tmp = _mpyilr(da, db);

tmp = tmp + Hcic1_DWork.Hciccomp2_Sums;
Hcic1_DWork.Hciccomp2_Sums = tmp;
cffIdx++;
curTapIdx -= 2;
if (curTapIdx < 0) {
curTapIdx += 54;
}
}

The decimation FIR has 54 coefs. Each one of two filter phases pass through the above code for half coefs (27). The software pipeline is disabled because of the internal if clause.

I find that the FIR in DSPLIB are all block processed. There is no polyphase decimation FIR either. The input sampling data is at 96 KSPS in my project. The above code is a single data sample based, i.e. for each input data, the function is called once. I don't know how to write it in C to make the for loop efficiently although I know the assembly code can use circular buffer to pipeline the loop (but with some hard hand coding). I hope to get your opinion on efficient C coding, especially from RandyP.

Another question is about DSPLIB function DSP_fir_gen(). There are input data x[], FIR coef h[] and output data r[]. Where is the FIR internal states for repeat function calling? These function is only for one time call for initial state 0's?

Thanks,

over 12 years ago

0 RandyP over 12 years ago

TI__Guru* 84110 points

Robert,

While looking back through old posts with no replies, I found yours. We apologize that you have not received a reply after so long, and I am especially sad that you mentioned me and I was not aware of it.

Have you solved your problem after this time has passed?

What are the starting values for curTapIdx and cffIdx?

Which DSP are you writing this for?

Regards,
RandyP

0 Robert W over 12 years ago in reply to RandyP

Mastermind 7100 points

Sorry for the first post, especially if it generated some small disturbing. I want to get the best performance from C. Then what FIR filter performance can get? I read through the optimization manual of C6000. It seems that if one prefers circular addressing in FIR, he has to use linear assembly. Is it right? I shall use C if I can control C to get similar performance.

Thanks,

0 Robert W over 12 years ago in reply to Robert W

Mastermind 7100 points

I forget to mention that I realized that in that example the coefficients may approach 64. Thus we consider circular addressing. How to manage the data size was also an issue on performance. Thanks for you the forum.

0 RandyP over 12 years ago in reply to Robert W

TI__Guru* 84110 points

Robert,

No disturbance or offsense on our side, and still our apologies to you.

It will help to have answers to these questions:

Which DSP are you writing this for?
What are the starting values for curTapIdx and cffIdx?
What is the true data type of Hcic1_ConstP.Hciccomp2_FILT[cffIdx]; that you need to cast it to int32_T?
If Hcic1_ConstP.Hciccomp2_FILT[cffIdx] is exactly 27 elements in size, can it be duplicated to double length for more efficient optimization?
What are your performance requirements for this filter application? "As fast as possible" is not a practical answer.
What are your performance measurements now?
Which compiler version are you using?
What are your compiler optimization settings?

Robert W said:
Another question is about DSPLIB function DSP_fir_gen(). There are input data x[], FIR coef h[] and output data r[]. Where is the FIR internal states for repeat function calling? These function is only for one time call for initial state 0's?

There is some confusion here, either with me or with our documentation. The DSP_fir_gen() function does an FIR sum-of-products for each of the nr coefficients with the first nr of the nr+nh-1 input samples, then it repeats that for the nr coefficients with the next nr input samples starting at input sample 1. This is repeated until nh output samples have been generated by running the FIR calculation on nh sequential overlapping sets of input samples. There is no need for internal states in an FIR function in this way, because this is run on a block of samples for best optimization. The next call to DSP_fir_gen() will reuse some of the input samples from the previous call to DSP_fir_gen(), and that may be what addresses your concern. Please see the description of the DSP_fir_gen() function in the DSPLIB Programmer's Reference Guide.

Regards,
RandyP

0 RandyP over 12 years ago in reply to RandyP

TI__Guru* 84110 points

I would like to add that, in my opinion, it is best or easiest to stay away from the assembly circular addressing. This causes issues in a system that requires real-time response to external interrupts while the processing is going on. It only works well for a very tightly closed and predictable environment.

Using C and the Optimizing Compiler to solve your performance requirements is my recommendation.

Regards, RandyP

0 Andy Polyakov over 12 years ago in reply to RandyP

Expert 1340 points

RandyP said:
I would like to add that, in my opinion, it is best or easiest to stay away from the assembly circular addressing. This causes issues in a system that requires real-time

Not true. Issues, if there are any, have nothing to do with real-time requirements. AMR handling in interrupt handlers is either free or charge or taikes additional pair of cycles, literally. The only possible issue is that target multi-process/thread environment might lack support for AMR and you're not in position to change that.

Now to original question. Originator asserted that "software pipeline is disabled because of the internal if clause." This isn't true. Compiler should generate conditional instruction and pipeline the code. But there is one interesting point. Consider following:

for(i=0;i<N;i++) tmp+=_mpylir(inp[i],coeff);

This is/should be compiled to SPLOOP 1, meaning that it effectively spends one cycles per input element. Now consider something like code discussed by originator. It is/should be compiled as SPLOOP 3, because this is how long it takes to handle curTapIdx. Meaning that loop would be three times slower. If one can arrange handling as following:

curTapIdx-=2; if (curTapIdx) curTapIdx=54;

Then it should be compiled as SPLOOP 2, i.e. would be only 2 times slower.

Would it pay off to code it in assembler with circular addressing? Yes, because using circular buffer would allow to operate at SPLOOP 1.

As for two and three times faster/slower. This is asymptotic coefficient for "indefinite" loops. In reality, and especially in this case when you take 27 spins, it's not two/three times difference, because you have to account for loop epilogues. I mean loop executes in N*M+epilogue, where N is SPLOOP factor, M is number of spins, which in turn means that you spend N+epilogue/M cycles per input element. So that improvement coefficient is (N1+epilogue/M)/(N2+epilogue/M) and not N1/N2.

Bottom line is that it's all about SPLOOP coefficient and what is sufficient for given application.

There is another possibility. If you arrange two accumulators, e.g.

for(i=0;i<N;i+=2) { tmp1+=_mpylir(inp[i],coeff); tmp2+=_mpylir(inp[i+1],coeff); } result=tmp1+tmp2;

then you should be able to utilize both multiplicators at same time.

0 Andy Polyakov over 12 years ago in reply to Andy Polyakov

Expert 1340 points

Andy Polyakov said:
There is another possibility. If you arrange two accumulators, e.g.
for(i=0;i<N;i+=2) { tmp1+=_mpylir(inp[i],coeff); tmp2+=_mpylir(inp[i+1],coeff); } result=tmp1+tmp2;

then you should be able to utilize both multiplicators at same time.

To avoid possible misunderstanding. Above is not a complement to SPLOOP 1, but an alternative. I mean I don't think it would be possible to process pair of input elements from original example with SPLOOP 1. It would take SPLOOP 2, but then it would make no sense as performance would be same. So that if you can achieve SPLOOP 1, there would be hardly a way to improve the result. Above is rather a way to improve performance of slower loops, for example by improving say SPLOOP 2 processing one element at a time with SPLOOP 3 processing a pair, thus resulting in 2/1.5 improvement. In other words above approach is not likely to actually double performance. Well, it would double performance of the above oversimplified example, but not more complicated code like Robert's.

0 RandyP over 12 years ago in reply to Andy Polyakov

TI__Guru* 84110 points

Robert,

You are in good hands with Andy Polyakov. If you want to pursue the questions I asked above, I will be happy to help.

Also, if you would like a Moderator to move this thread to the TI C/C++ Compiler Forum, we can do that.

Regards,
RandyP

0 Robert W over 12 years ago in reply to RandyP

Mastermind 7100 points

Thank you and Andy. I have much to digest from your posts. These info is really for me to get better performance in DSP applications.

0 Andy Polyakov over 12 years ago in reply to Robert W

Expert 1340 points

Even though circular buffer arguably appears appropriate, one should ask yourself if it's possible to solve the problem by wasting some memory. For example in this case it should be possible to place two copies of vector that is addressed with cumbersome curTapIdx next to each other and by pointing curTapIdx to second half prior entering the loop, you can avoid adjusting it inside the loop. The adjustment would effectively be taken outside the loop and nothing would prevent compiler from generating SPLOOP 1. Of course one should remember to weight costs of maintaining second copy against SPLOOP factor decrease... I mean SPLOOP 2 as results of arranging adjustment as if (!(curTapIdx-=2) might turn out adequate.

Processors

Processors forum

How to improve this decimation FIR performance?