Execution time for FIR16 filter

Ayaka Ohira

Hi there,

I am using the Fixed point FIR filter library provided by TI on F28069.

My filter is a band-pass FIR16 of order 16. It works well, but it takes longer to run than the filter library document specifies.

I am referring to the "C28x Fixed Point DSP Library" v1.01 document dated 10th Jan 2011. Page 32 has benchmark information, with a table showing how many cycles the filter takes for different numbers of taps.

As I'm using 16 taps, I expected my fir calculation to take 58 cycles, which is 644ns at 90MHz. However, it's taking approx. 1.1us, nearly double the expected time. I measured the time by toggling a GPIO just before the FIR calculation and toggling it again just after the calculation has finished.

Could someone please explain to me why the FIR calculation takes so much longer than the specification?

Thanks

Ayaka

#include <fir.h>

/* FIR16 coefficients (fir_16o_400k)
   FIR16 BP order=16 Hamming Fs=400000Hz Fc=5000-80000Hz
   {-126,11,189,-298,-1815,-2091,1990,8979,12563,8979,1990,-2091,-1815,-298,189,11,-126,} */
#define FIR_ORDER_RX 16
#define RX_COEFF {\
	12563,-8248557,722886,12449749,-19466007,-118882602,-137035587,130416651,588513154}

#define FIR_ORDER_SIZE ((FIR_ORDER_RX+2)/2)
#pragma DATA_SECTION(alRxBuf,"firRx");
long alRxBuf[FIR_ORDER_SIZE];

/* Define constant coefficient array (used for Tx and Rx) and place the .econst/.const section in
non-volatile memory */
const long RxCoeff_const[FIR_ORDER_SIZE]= RX_COEFF;
long RxCoeff[FIR_ORDER_SIZE];

#define FIR_ORDER FIR_ORDER_RX

FIR16 fir_rx = FIR16_DEFAULTS;

void init_filter(void)
{
	fir_rx.dbuffer_ptr = alRxBuf;
	fir_rx.coeff_ptr=(long *)RxCoeff;
	fir_rx.order=FIR_ORDER;
	fir_rx.init(&fir_rx);
}

void process_task(int sample)
{
	GpioDataRegs.GPATOGGLE.bit.GPIO31 = 1;
	fir_rx.input = sample;
	fir_rx.calc(&fir_rx);
	GpioDataRegs.GPATOGGLE.bit.GPIO31 = 1;
}
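
The expected figure in the question can be sanity-checked with a small conversion between cycles and time. This is only a sketch: `cycles_to_ns` and `ns_to_cycles` are hypothetical helper names, not part of the TI library, and the 90MHz SYSCLK is the value stated in the post.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to nanoseconds at the given SYSCLK (Hz),
   rounded to the nearest nanosecond. Hypothetical helper. */
uint32_t cycles_to_ns(uint32_t cycles, uint32_t sysclk_hz)
{
    assert(sysclk_hz != 0u);
    uint64_t num = (uint64_t)cycles * 1000000000ULL;
    return (uint32_t)((num + sysclk_hz / 2u) / sysclk_hz);
}

/* Convert a measured time in nanoseconds back to cycles,
   rounded to the nearest cycle. Hypothetical helper. */
uint32_t ns_to_cycles(uint32_t ns, uint32_t sysclk_hz)
{
    uint64_t num = (uint64_t)ns * sysclk_hz;
    return (uint32_t)((num + 500000000ULL) / 1000000000ULL);
}
```

At 90MHz, 58 cycles comes to about 644ns, while a measured 1.1us corresponds to roughly 99 cycles, so the GPIO measurement sits about 41 cycles above the documented benchmark.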

over 9 years ago

0 Gautam Iyer over 9 years ago

Guru 192875 points

Hi Ayaka,

I hope your clock is not being halved by some DIV2 setting? Are you sure it is running at 90MHz?

Regards,
Gautam

0 Ayaka Ohira over 9 years ago in reply to Gautam Iyer

Intellectual 440 points

Hello,

Thanks for your reply.

I don't think the DIV2 setting applies in my case.
I thought DIV2 was for the USB clock (correct me if I'm wrong).

Anyway, my clock setup uses an external crystal input (20MHz) from GPIO19, with PLLCR[DIV] set to 9 and PLLSTS[DIVSEL] set to 2.
That gives me a SYSCLK of 90MHz.

Thanks,
Ayaka
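
The clock arithmetic in this reply can be written out as a short check. This is a sketch under assumptions: the helper names are hypothetical, and the DIVSEL-to-divider mapping (0 or 1 gives /4, 2 gives /2, 3 gives /1) is an assumption that should be confirmed against the device reference manual. Only the DIVSEL = 2 case is backed by the numbers in the post.

```c
#include <assert.h>
#include <stdint.h>

/* Map PLLSTS[DIVSEL] to a clock divider.
   ASSUMED mapping: 0 or 1 -> /4, 2 -> /2, 3 -> /1. Verify in the TRM. */
uint32_t divsel_to_divider(uint32_t divsel)
{
    assert(divsel <= 3u);
    if (divsel == 3u) {
        return 1u;
    }
    if (divsel == 2u) {
        return 2u;
    }
    return 4u;
}

/* SYSCLK = OSCCLK * PLLCR[DIV] / divider.
   DIV = 0 is treated as PLL bypass (multiplier of 1), also an assumption. */
uint32_t sysclk_hz(uint32_t oscclk_hz, uint32_t pllcr_div, uint32_t divsel)
{
    uint32_t mult = (pllcr_div == 0u) ? 1u : pllcr_div;
    return (uint32_t)(((uint64_t)oscclk_hz * mult) / divsel_to_divider(divsel));
}
```

With the values from the post (20MHz input, DIV = 9, DIVSEL = 2), this gives 20MHz x 9 / 2 = 90MHz.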

0 Gautam Iyer over 9 years ago in reply to Ayaka Ohira

Guru 192875 points

SYSCLK =90MHz is what we need. BTW can you explain:

As I'm using 16 taps, I expected my fir calculation to take 58 cycles, which is 644ns at 90MHz.

How did you calculate the number of cycles? Also, you can use the clock profiler to check the actual number of cycles consumed.
You can refer to: processors.wiki.ti.com/.../Profiling_on_C28x_Targets

Regards,
Gautam

0 Vishal_Coelho over 9 years ago

TI__Mastermind 20850 points

It looks like your coefficients (.econst) are in flash. Can you copy them over to RAM at startup and then retry the filter?

0 Ayaka Ohira over 9 years ago in reply to Vishal_Coelho

Intellectual 440 points

Hi Vishal,

Oh, I actually do that already; I forgot to paste that part, sorry. See the code below: I copy the coefficients across using memcpy.

Gautam,

The number of cycles is specified in TI's fixed-point library document, as I mentioned in my post. I have not measured the actual number of cycles, but since I'm using TI's library I would expect the document to be accurate enough.

Thanks for your help!

Ayaka

void init_filter(void)
{
     // RxCoeff is in RAM, RxCoeff_const is in .econst.
     // On C28x sizeof() counts 16-bit words, so sizeof(RxCoeff) == FIR_ORDER_SIZE * 2.
     memcpy(RxCoeff, RxCoeff_const, sizeof(RxCoeff));

     fir_rx.dbuffer_ptr = alRxBuf;
     fir_rx.coeff_ptr = (long*) RxCoeff;
     fir_rx.order = FIR_ORDER;
     fir_rx.init(&fir_rx);
}

0 Vishal_Coelho over 9 years ago in reply to Ayaka Ohira

TI__Mastermind 20850 points

Hi Ayaka,

I benchmark the filters with the code running out of RAM, the coefficients in RAM, and all interrupts turned off. I use the clock tool in CCS (in the debug view, Run->Clock->Enable) and count from the point of the function call, i.e. from the LCR <fn_name> instruction in the disassembly window, to the return.

I'm not sure that setting/clearing a GPIO would add enough overhead to account for the discrepancy you are seeing. Can you use the clock to verify the cycle count for just the FIR function call? If there is still a large discrepancy between the user guide and what you measure, I can take a look at it.
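
As an alternative to the CCS clock tool or GPIO toggling, a free-running timer can be read before and after the call and the fixed measurement overhead subtracted. This is not the method described above, only a sketch. The helper names are hypothetical, and it assumes a 32-bit down counter running at SYSCLK with a period of 0xFFFFFFFF (on the target the reads might come from a CPU timer count register such as CpuTimer1Regs.TIM.all, which is an assumption about your setup).

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two reads of a free-running 32-bit DOWN counter.
   Assumes the counter period is 0xFFFFFFFF, so a single wrap between the
   two reads is handled by unsigned subtraction. Hypothetical helper. */
uint32_t elapsed_cycles(uint32_t start, uint32_t stop)
{
    return start - stop;
}

/* Remove the fixed measurement overhead, which can be found by timing an
   empty region with the same two reads. An overhead larger than the raw
   measurement means the calibration is wrong, hence the assert. */
uint32_t net_cycles(uint32_t start, uint32_t stop, uint32_t overhead)
{
    uint32_t raw = elapsed_cycles(start, stop);
    assert(raw >= overhead);
    return raw - overhead;
}
```

On the target, `start` would be read immediately before the filter call and `stop` immediately after it, with interrupts disabled so that nothing else runs in between.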

0 Ayaka Ohira over 9 years ago in reply to Vishal_Coelho

Intellectual 440 points

Hi Vishal,

Thanks for your reply.

I've tried that. It's a bit more fiddly than I'd hoped.

When I put a breakpoint on fir_rx.calc(&fir_rx); enabled the clock, and set another breakpoint at the GPIO toggle, the clock showed 80 cycles. That's a little over what I expected, I guess.

What do you think? Is that expected?

Thanks,
Ayaka

0 Vishal_Coelho over 9 years ago in reply to Ayaka Ohira

TI__Mastermind 20850 points

Hmm, OK. Are the coefficients and the delay line in separate physical RAMs? The reason I ask is that the FIR16 routine uses the DMAC instruction to do the multiply-accumulate, and it uses both the program and data buses to read one coefficient and one delay line element. If the two are in the same RAM block, you will have contention, and the operation gets delayed.
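
One way to keep the two arrays in different physical RAM blocks is to give each its own named section and map the sections to different memory ranges in the linker command file. The sketch below reuses the array, section, and memory range names that appear in this thread (firRx, RxCoeff, RAMM0, RAML5678). The linker fragment in the comment is an assumption about how your own command file is laid out, so treat it as an illustration only.

```c
#define FIR_ORDER_RX   16
#define FIR_ORDER_SIZE ((FIR_ORDER_RX + 2) / 2)

/* Delay line in its own section. */
#pragma DATA_SECTION(alRxBuf, "firRx")
long alRxBuf[FIR_ORDER_SIZE];

/* RAM copy of the coefficients in a different section, so the linker
   can place it in a different physical RAM block from the delay line. */
#pragma DATA_SECTION(RxCoeff, "RxCoeff")
long RxCoeff[FIR_ORDER_SIZE];

/*
 * Linker command file fragment (ASSUMED memory range names and page):
 *
 *   SECTIONS
 *   {
 *       firRx   : > RAMM0,    PAGE = 1
 *       RxCoeff : > RAML5678, PAGE = 1
 *   }
 */
```

After rebuilding, the .map file shows where each section actually landed, which is the quickest way to confirm that the two arrays ended up in different blocks.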

0 Ayaka Ohira over 9 years ago in reply to Vishal_Coelho

Intellectual 440 points

Hi Vishal,

Thanks for the reply.

My linker script had put both of them in the same RAM (RAMM0).

When I moved one of them to another RAM (RAML5678), the calculation took 73 cycles, as opposed to 81 cycles before.

That's definitely an improvement, but it's still a bit far from the expected figure. What else could I try?

Thanks,

Ayaka

0 Vishal_Coelho over 9 years ago in reply to Ayaka Ohira

TI__Mastermind 20850 points

Can you refer to the .map file and tell me the addresses of alRxBuf and RxCoeff?

0 Ayaka Ohira over 9 years ago in reply to Vishal_Coelho

Intellectual 440 points

Sure.

alRxBuf is at 0x50 and RxCoeff is at 0x13be4

In the map file:

SECTION ALLOCATION MAP

 output                                  attributes/
section   page    origin      length       input sections
--------  ----  ----------  ----------   ----------------
firRx
*          1    00000050    00000012     UNINITIALIZED
                  00000050    00000012     rx_process.obj (firRx)
...
...
RxCoeff
*          1    00013be4    00000012     UNINITIALIZED
                  00013be4    00000012     rx_process.obj (RxCoeff)

GLOBAL DATA SYMBOLS: SORTED BY DATA PAGE

address    data page           name
--------   ----------------    ----
00000050      1 (00000040)     _alRxBuf
...
...
00013be4    4ef (00013bc0)     _RxCoeff

Let me know if you'd like to see the full map file; I can email it to you.

Thanks,
Ayaka

0 Vishal_Coelho over 9 years ago in reply to Ayaka Ohira

TI__Mastermind 20850 points

Hmm, these seem to be in order. Let's look at the disassembly window. Can you post the assembly code from the line where you set the breakpoint, i.e. fir_rx.calc(&fir_rx), to the next line toggling the GPIO? I'm trying to see whether those extra 20 cycles are just code overhead in those two lines of code.

You also have the option of generating assembly with interlisted C code. Right-click on the C file, go to its properties, then under C2000 Compiler->Advanced Options->Assembler Options select --keep_asm; right under that is the source interlist option, so select --src_interlist as well. Now rebuild the .c file and you will find the assembly in the output folder. It will have the C code and the corresponding assembly for it, so we should be able to figure out whether there are actually 20 cycles between the function call and the toggling of the GPIO.

0 Ayaka Ohira over 9 years ago in reply to Vishal_Coelho

Intellectual 440 points

Hi Vishal,

Here's the assembly around where fir_Rx.calc is called. I find it a bit difficult to read, but maybe you can have a look and see what's going on.

Thanks,

Ayaka

;***	-----------------------g4:
;*** 475	-----------------------    led_set(0);
;*** 476	-----------------------    fir_Rx.input = adcVal[i]-adcave;
;*** 477	-----------------------    (*fir_Rx.calc)(&fir_Rx);
;*** 478	-----------------------    led_clear(0);
;*** 480	-----------------------    dwErxAgc = claDwAgcResult;
;*** 482	-----------------------    dwTmp = __lmin(__lmax((long)fir_Rx.output*dwErxAgc>>8, (-32768L)), 32767L);
;*** 494	-----------------------    Rx_decode_task((int)dwTmp);
;*** 498	-----------------------    calculate_agc((int)dwTmp);
;*** 470	-----------------------    if ( (++i) < 512u ) goto g4;
	.dwpsn	file "../rx_process.c",line 475,column 3,is_stmt
        MOVB      AL,#0                 ; [CPU_] |475| 
$C$DW$123	.dwtag  DW_TAG_TI_branch
	.dwattr $C$DW$123, DW_AT_low_pc(0x00)
	.dwattr $C$DW$123, DW_AT_name("_led_set")
	.dwattr $C$DW$123, DW_AT_TI_call
        LCR       #_led_set             ; [CPU_] |475| 
        ; call occurs [#_led_set] ; [] |475| 
        MOVW      DP,#_fir_Rx+6    ; [CPU_U] 
	.dwpsn	file "../rx_process.c",line 476,column 3,is_stmt
        MOV       AL,*+XAR2[AR1]        ; [CPU_] |476| 
	.dwpsn	file "../rx_process.c",line 477,column 3,is_stmt
        MOVL      XAR4,#_fir_Rx    ; [CPU_U] |477| 
	.dwpsn	file "../_rx_process.c",line 476,column 3,is_stmt
        SUB       AL,AR3                ; [CPU_] |476| 
        MOV       @_fir_Rx+6,AL    ; [CPU_] |476| 
	.dwpsn	file "../rx_process.c",line 477,column 3,is_stmt
        MOVL      XAR7,@_fir_Rx+10 ; [CPU_] |477| 
$C$DW$124	.dwtag  DW_TAG_TI_branch
	.dwattr $C$DW$124, DW_AT_low_pc(0x00)
	.dwattr $C$DW$124, DW_AT_TI_call
	.dwattr $C$DW$124, DW_AT_TI_indirect
        LCR       *XAR7                 ; [CPU_] |477| 
        ; call occurs [XAR7] ; [] |477| 
	.dwpsn	file "../rx_process.c",line 478,column 3,is_stmt
        MOVB      AL,#0                 ; [CPU_] |478| 
$C$DW$125	.dwtag  DW_TAG_TI_branch
	.dwattr $C$DW$125, DW_AT_low_pc(0x00)
	.dwattr $C$DW$125, DW_AT_name("_led_clear")
	.dwattr $C$DW$125, DW_AT_TI_call
        LCR       #_led_clear           ; [CPU_] |478| 
        ; call occurs [#_led_clear] ; [] |478|

0 Vishal_Coelho over 9 years ago in reply to Ayaka Ohira

TI__Mastermind 20850 points

OK, I see, so the GPIO toggle is a function call. In the disassembly window I would set the breakpoint at the instruction on line 36

LCR *XAR7

That is the filter function call. Set the breakpoint there; once it is reached, start the clock and single-step (in the disassembly window itself) to the next instruction, i.e.

MOVB AL, #0

and check whether the number of cycles is 58. That is how I benchmark the library functions: from the point of the LCR instruction. It's possible that by setting the breakpoint in the C code you are executing code that does not directly pertain to the FIR filter, and that is where the extra 20 cycles are coming from.

Also, FYI, I think v1.20.00.00 of the library is available in controlSUITE.
