FPU FIR bug?

Hi,

There seems to be a bug in the FPU FIR implementation. When feeding an impulse input to an 255 order FIR, I expect the output would be the same as the FIR coefficients. It seemed to work fine for lower order FIRs like 127. Here is the code and linker cmd part.

code at http://snipt.org/APm2

firldb align(0x800) > RAML0 PAGE = 0
firfilt align(0x800) > RAML1 PAGE = 0
coefffilt align(0x800)> RAML2 PAGE = 0
sigIn align(0x800) > RAML6 PAGE = 1
sigOut align(0x800) > RAML7 PAGE = 1

over 12 years ago

0 yl over 12 years ago

Expert 1215 points

Come on, any one? It should be easy to verify and fix. I am just reluctant to spend time learning another assembly that I am not gonna need for anything else.

0 Vishal_Coelho over 12 years ago in reply to yl

TI__Mastermind 20850 points

I see the issue. The routine uses circular addressing which, on the C28x, only supports buffers of size 256 16-bit words....thats why it works with order 127 and below.

With 255 and above, the dbuffer needs to be atleast 512 words and as it goes through the multiply-accumulate it wraps around the dbuffer halfway through the coefficients. I need to investigate this further; I have filed a bug against it for the time being.

If you dont mind waiting a day or 2, Ill write a routine to work with a larger set of coefficients.

0 yl over 12 years ago in reply to Vishal_Coelho

Expert 1215 points

Hi Vishal, Thank you very much. I don't minding waiting a day or 2. Looking forward to your fix.

0 Vishal_Coelho over 12 years ago in reply to yl

TI__Mastermind 20850 points

Hi Yan8622,

Ok the circular addressing route is proving more difficult to fix, it'll take a while to figure out how to get it to work with N > 128.

I did write up an alternative implementation that uses the same object structure but uses a delay line implementation without circular buffering (uses data move instruction MOVD32 instead). It slower than the original code but works correctly for any order filter.

5224.FPU_FIR_130731.asm

You will have to replace the FIR_FP_calc function with the one I wrote and recompile the library to use it.

0 yl over 12 years ago in reply to Vishal_Coelho

Expert 1215 points

It seems to generate correct result. But it is way too slow to do anything useful.

This is really frustrating. This project is based on the assumption that F28069 is able to perform close to the published benchmarks.

0 Vishal_Coelho over 12 years ago in reply to yl

TI__Mastermind 20850 points

The code I have posted is slow because it uses a branch statement in there to repeat over the N-1 taps. What I typically do, to speed things, up is manually unroll that loop and repeat the MAC instruction N-1 times to give the fastest run time. How fast are you sampling?

0 yl over 12 years ago in reply to Vishal_Coelho

Expert 1215 points

48K audio, frame size of 16.

0 Vishal_Coelho over 12 years ago in reply to yl

TI__Mastermind 20850 points

Ok,so ive unrolled the loop. Each tap takes 2 cycles so for NTAPS=256 you are looking at 512 cycles for the FIR plus some overhead which Ive assumed to be a 100 (its probably lower)

That means 612 cycles per run of the FIR; if you are running the 28069 @ 90MHz you are looking at a max possible frerquency of 147058.823 ~ 147K that gives you an oversampling rate of 3X. This should work for your application. The downside is you need to manually unroll that loop if you decide to change filter size. let me know if this works

6607.FPU_FIRUnrolled_130805.asm

0 yl over 12 years ago in reply to Vishal_Coelho

Expert 1215 points

It isn't good enough for me. I need to run 2 256-tap FIR on two 48k audio stream and a bunch of other things, the maximum cycle count for a 256-tap FIR I can take is 350 cycles (mind you, the published benchmark is 305 cycles).

0 yl over 12 years ago in reply to Vishal_Coelho

Expert 1215 points

Also there seems to be a one sample delay and flipped in the original FIR implementation. If you can fix that, it will be great because I can stitch a 256-FIR out of two 128 FIRs.

This is the delay

input = 1 0 0 0 0 0 0 0 0

tap = 0.9 0.8 0.7

expected output = 0.9 0.8 0.7

I can also work with = 0.7 0.8 0.9

actual output = 0.8 0.9 0.0

0 Frank Bormann over 12 years ago in reply to Vishal_Coelho

Mastermind 7585 points

Vishal,

why didn't you use bit-reverse adressing mode for the buffer? This addressing mode is not limited to 256.

0 Vishal_Coelho over 12 years ago in reply to yl

TI__Mastermind 20850 points

Hi Yan,

There was a typo in defining the structure of the FIR. Instead of the element order, it should have be nTaps. If you set the macro FIR_ORDER to the number of taps( 128) instead of the order it will work correctly.

I have attached a sample project with the .asm; you will need to change the value of the project macro DEVSUPPORT_ROOT_DIR to match the location of the f2833x header files on your machine.

4721.FPU_FIR_Template_130821.zip

I tested it by generating coefficients of value 1 to 128 and then checking the impulse response. It works correctly

I have put in a bug fix request on this module. We will try to get the fix out as soon as we can. I appreciate your patience in the matter and apologize for the problems it has caused.

0 Vishal_Coelho over 12 years ago in reply to Frank Bormann

TI__Mastermind 20850 points

Frank,

Honestly, it never crossed my mind to try that approach. Now that you bring it up I recall reading a white paper we published way back about implementing circular buffers that way. Ill take a look at it and post a fix shortly.

0 Anandhi Ramesh over 12 years ago in reply to Vishal_Coelho

Prodigy 130 points

Hi Vishal,

I came across this same issue today, and got to the point where I could identify that the circular addressing was the limitation on the tap length. I tried using AMODE=1, whereby 15bits are compared instead of 7 for circular addressing, but I am not familiar enough with the 28x assembly programming to make this change myself. Any idea if this is a possible way to going around the issue of tap length limitation?

Also, any update on the bit reverse addressing approach to the circular buffer? I saw the whitepaper http://www.ti.com/lit/an/spra292/spra292.pdf. Is this what you were referring to?

Thanks,

Anandhi

0 Anandhi Ramesh over 12 years ago in reply to Vishal_Coelho

Prodigy 130 points

Hi Vishal,

There is a small issue in this code, I think. In the Init function for the circular addressing asm, we find

MOV ACC, *+XAR4[3] << #1 ; AL = order*2 (for float)
ADDB AL, #1

But now that the value being sent is ntaps, this should be

MOV ACC, *+XAR4[3] << #1 ; AL = order*2 (for float)
SUBB AL, #2

Am I correct?

Thanks,

Anandhi

0 Vishal_Coelho over 12 years ago in reply to Anandhi Ramesh

TI__Mastermind 20850 points

Hi Anandhi,

Anandhi Ramesh said:
Any idea if this is a possible way to going around the issue of tap length limitation?

I am pretty sure AMODE=1 circular addressing mode can be used, the only difference being that amode=0 uses the XAR6 pointer, amode =1 uses xar6+ar1 (base+offset) with the offset ar1 incrementing mod N (N = buffer size). The code would need to be altered a bit to work with this mode.

Anandhi Ramesh said:
Also, any update on the bit reverse addressing approach to the circular buffer? I saw the whitepaper http://www.ti.com/lit/an/spra292/spra292.pdf. Is this what you were referring to?

Yup this is the right paper. I have this function on my TODO list, however, i am currently wrapped up in another release and it may be a while before i get to it.

0 Vishal_Coelho over 12 years ago in reply to Vishal_Coelho

TI__Mastermind 20850 points

Ok, I think I fixed it. Please try it out and let me know if there are issues. I will be doing further tests on my end before pushing this out with the next FPU lib release

2474.FIR_f32.asm

0 Anandhi Ramesh over 12 years ago in reply to Vishal_Coelho

Prodigy 130 points

Vishal,

I tried this code, and find that it assumes a symmetrical filter. Is this correct? For an assymmetric FIR filter, it only works if the coefficients are given in reverse.

Anandhi

0 Vishal_Coelho over 12 years ago in reply to Anandhi Ramesh

TI__Mastermind 20850 points

These are the lines of code in the filter that do the multiply- accumulate. It accesses the filter taps in order but the pointer to the incoming data always points to the last entry x(N-1) at the start of the process

MOVL *+XAR6[AR1%++], XAR2 ; Store x(0), advance AR1 to point to x(N-1)
MOVZ AR3, AR1 ;
RPT AR0 ; Repeat order + 1 times
|| MACF32 R7H, R3H, *+XAR6[AR1%++], *XAR7++ ; R3H = R3H + R2H | R2H = x(N-k-1) * h(k) odd cycles
; R7H = R7H + R6H | R6H = x(N-k-1) * h(k) even cycles

The filter should be doing a convolution with time reversed input, the filter taps needn't be reversed in my opinion

y(k) = h(0)x(N-1) + h(1)x(N-2) + .... + h(N-1)x(0)

Are you not getting the right results with an asymmetric filter?

0 Anandhi Ramesh over 12 years ago in reply to Vishal_Coelho

Prodigy 130 points

For a FIR filter, we have

y(k) = h(0)*x(0) + h(1)*x(1)+.. h(N-1)x(N-1) as the usual implementation equation. According to your equation, the filter coefficients are reversed, arent they? This does give me the wrong output when i feed in an asymmetric filter, it works for symmetric filters because the flip doesnt matter.

If I try to modify it by using a decrement instead of increment on the filter register (XAR7) there is a compilation error. For fixing this, the cbIndex needs to be decremented, but again, this needs to be done cyclically, and this isnt doable currently in the assembly code either. If i write it in C, that increases cycles quite a bit.

0 Anandhi Ramesh over 12 years ago in reply to Anandhi Ramesh

Prodigy 130 points

Vishal,

I have another query regarding the performance of the function.

In the document sprue02a, its mentioned that

When repeated the MACF32 takes 3 + N cycles where N is the number of times the instruction is repeated.

But in my code i see that it takes 3 + (N*2) cycles instead. Is this because of the different mode in which we are operating? Or is it an erroneous observation that I can fix somehow?

N*2 cycles equal to or greater than the number of cycles we saw without the circular buffering (including a separate delay buffer manipulation function). So, its really not helpful.

anandhi

0 Anandhi Ramesh over 12 years ago in reply to Anandhi Ramesh

Prodigy 130 points

Vishal,

updated: I find that the numbers are as mentioned in the document. Please ignore the previous query. I don't yet know why the project in which i was testing it earlier shows larger numbers, but might be due to command file and optimization setting differences.

I did notice on error in your code. XAR2 is being used without being saved elsewhere. In my project, this was causing issues, changing the value of a different variable declared in my C code. I had to resolve it by adding a PUSH and POP. Same applies to many of the FPU registers, of which only R6, R7 are being saved before use, and then restored. This doesnt cause an issue in my code, but I suppose it might in other cases?

Thanks,

Anandhi

0 Vishal_Coelho over 12 years ago in reply to Anandhi Ramesh

TI__Mastermind 20850 points

Anandhi,

Anandhi Ramesh said:
I did notice on error in your code. XAR2 is being used without being saved elsewhere. In my project, this was causing issues, changing the value of a different variable declared in my C code. I had to resolve it by adding a PUSH and POP.

Yes, that is an error. I actually have to save XAR1, 2 ,3 per calling convention. Thanks for bringing that to my attention.

C2000™︎ microcontrollers

C2000 microcontrollers forum

FPU FIR bug?