Execution time for FIR16 filter

Hi there,

I am using the Fixed point FIR filter library provided by TI on F28069.

The filter I have is a band-pass FIR16 of order 16. It is working great, but the problem is that it takes longer than what's specified in the filter library document.

I am referring to the "C28x Fixed Point DSP Library" v1.01 document, dated 10th Jan 2011. Page 32 has benchmark information, with a table showing how many cycles the filter takes for different numbers of taps.

As I'm using 16 taps, I expected my fir calculation to take 58 cycles, which is 644ns at 90MHz. However, it's taking approx. 1.1us, nearly double the expected time. I measured the time by toggling a GPIO just before the fir calculation and toggling it again just after the calculation has finished.
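
The cycle and time figures above can be cross-checked with a couple of one-line conversions. This is only a minimal sketch; the helper names `cycles_to_ns` and `ns_to_cycles` are made up for illustration and are not part of any TI library.

```c
/* Convert a CPU cycle count to nanoseconds at a given clock frequency. */
static double cycles_to_ns(unsigned long cycles, double sysclk_hz)
{
    return (double)cycles * 1.0e9 / sysclk_hz;
}

/* Convert a measured duration in nanoseconds back to (fractional) cycles. */
static double ns_to_cycles(double ns, double sysclk_hz)
{
    return ns * sysclk_hz / 1.0e9;
}
```

With a 90MHz SYSCLK, `cycles_to_ns(58UL, 90.0e6)` comes out at about 644.4ns, and `ns_to_cycles(1100.0, 90.0e6)` at 99 cycles, so the GPIO measurement corresponds to roughly 40 cycles more than the documented figure.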

Could someone please explain to me why the FIR calculation takes so much longer than the specification?

Thanks

Ayaka 

#include <fir.h>
/* FIR16 coefficients (fir_16o_400k)
   FIR16 BP order=16 Hamming Fs=400000Hz Fc=5000-80000Hz
   {-126,11,189,-298,-1815,-2091,1990,8979,12563,8979,1990,-2091,-1815,-298,189,11,-126,} */
#define FIR_ORDER_RX 16
#define RX_COEFF {\
	12563,-8248557,722886,12449749,-19466007,-118882602,-137035587,130416651,588513154}
#define FIR_ORDER_SIZE ((FIR_ORDER_RX+2)/2)
#pragma DATA_SECTION(alRxBuf,"firRx");
long alRxBuf[FIR_ORDER_SIZE];

/* Define constant coefficient array (used for Tx and Rx) and place the .econst/.const section in
non-volatile memory */
const long RxCoeff_const[FIR_ORDER_SIZE]= RX_COEFF;
long RxCoeff[FIR_ORDER_SIZE];

#define FIR_ORDER FIR_ORDER_RX

FIR16 fir_rx = FIR16_DEFAULTS;

void init_filter(void)
{
	fir_rx.dbuffer_ptr = alRxBuf;
	fir_rx.coeff_ptr=(long *)RxCoeff;
	fir_rx.order=FIR_ORDER;
	fir_rx.init(&fir_rx);
}

void process_task(int sample)
{
	GpioDataRegs.GPATOGGLE.bit.GPIO31 = 1;
	fir_rx.input = sample;
	fir_rx.calc(&fir_rx);
	GpioDataRegs.GPATOGGLE.bit.GPIO31 = 1;
}
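
A side note on the RX_COEFF values, since they look nothing like the 16-bit coefficients in the comment: each 32-bit word appears to hold two 16-bit coefficients. The sketch below reproduces the nine values in the post from the seventeen listed coefficients. The ordering is inferred purely from the numbers in this post, not taken from the library documentation, and because the filter is symmetric other orderings would give the same words. The names `pack_pair` and `pack_coeffs` are hypothetical.

```c
#include <stdint.h>

#define NUM_TAPS   17                    /* order 16 => 17 coefficients */
#define NUM_PACKED ((NUM_TAPS + 1) / 2)  /* 9 words, same as FIR_ORDER_SIZE */

/* 16-bit coefficients as listed in the comment in the post. */
static const int16_t h[NUM_TAPS] = {
    -126, 11, 189, -298, -1815, -2091, 1990, 8979, 12563,
    8979, 1990, -2091, -1815, -298, 189, 11, -126
};

/* Pack two 16-bit values into one 32-bit word: hi in bits 31..16, lo in bits 15..0.
   The final cast relies on the usual two's-complement conversion. */
static int32_t pack_pair(int16_t hi, int16_t lo)
{
    uint32_t word = ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
    return (int32_t)word;
}

/* Assumed ordering (inferred from the post): word k has h[8-k] in the low half
   and h[17-k] in the high half, with a zero pad in the high half of word 0. */
static void pack_coeffs(int32_t out[NUM_PACKED])
{
    int k;
    for (k = 0; k < NUM_PACKED; k++) {
        int16_t hi = (k == 0) ? 0 : h[NUM_TAPS - k];
        int16_t lo = h[(NUM_PACKED - 1) - k];
        out[k] = pack_pair(hi, lo);
    }
}
```

Calling `pack_coeffs` on a nine-element array gives 12563, -8248557, 722886 and so on, matching RX_COEFF word for word.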

  • Hi Ayaka,

    I hope your clock is not being halved by some DIV2 setting? Are you sure it is 90MHz?

    Regards,
    Gautam
  • Hello,

    Thanks for your reply.

    I don't think DIV2 term matters to me?
    I thought DIV2 was for the USB clock (correct me if I'm wrong).

    Anyway, my clock setup uses an external crystal input (20MHz) from GPIO19, with PLLCR[DIV] set to 9 and PLLSTS[DIVSEL] set to 2.
    That gives me a SYSCLK of 90MHz.

    Thanks,
    Ayaka
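
The clock arithmetic in the reply above (20MHz x 9 / 2 = 90MHz) can be written out as a small helper. This is a sketch under assumptions: the DIVSEL mapping (0 or 1 gives /4, 2 gives /2, 3 gives /1) and the treatment of PLLCR[DIV] = 0 as PLL bypass are my reading of the F2806x documentation and should be checked against the reference manual. The function names are hypothetical.

```c
#include <stdint.h>

/* Divider selected by PLLSTS[DIVSEL]. Assumed mapping: 0,1 -> /4, 2 -> /2, 3 -> /1. */
static uint32_t divsel_to_divider(unsigned divsel)
{
    switch (divsel & 3u) {
    case 2u:  return 2u;
    case 3u:  return 1u;
    default:  return 4u;
    }
}

/* SYSCLK = OSCCLK * PLLCR[DIV] / divider. PLLCR[DIV] == 0 is treated as
   PLL bypass (multiplier of 1), which is an assumption of this sketch. */
static uint32_t sysclk_hz(uint32_t oscclk_hz, unsigned pllcr_div, unsigned divsel)
{
    uint32_t mult = (pllcr_div == 0u) ? 1u : pllcr_div;
    uint64_t vco  = (uint64_t)oscclk_hz * mult;   /* 64-bit to avoid overflow */
    return (uint32_t)(vco / divsel_to_divider(divsel));
}
```

For the settings quoted above, `sysclk_hz(20000000u, 9u, 2u)` returns 90000000.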
  • SYSCLK = 90MHz is what we need. BTW, can you explain this:
    "As I'm using 16 taps, I expected my fir calculation to take 58 cycles, which is 644ns at 90MHz."

    How did you calculate the number of cycles? Also, you can use the clock profiler to check the actual number of cycles consumed.
    You can refer to: processors.wiki.ti.com/.../Profiling_on_C28x_Targets

    Regards,
    Gautam
  • It looks like your coefficients (.econst) are in FLASH. Can you copy them over to RAM at startup and then retry the filter?

  • Hi Vishal,

    Oh I actually do that already. Forgot to copy and paste it. Sorry. See the code below. I do copy it across using memcpy.

    Gautam,

    The number of cycles is specified in the fixed point library document I referred to in my post. I have not checked the actual number of cycles, but since I'm using TI's library, the document should be accurate enough.

    Thanks for your help!

    Ayaka

    void init_filter(void)
    {
         // RxCoeff is in RAM, RxCoeff_const is in .econst
         memcpy(RxCoeff, RxCoeff_const, FIR_ORDER_SIZE * 2);
    
         fir_rx.dbuffer_ptr = alRxBuf;
         fir_rx.coeff_ptr = (long*) RxCoeff;
         fir_rx.order = FIR_ORDER;
         fir_rx.init(&fir_rx);
    }

  • Hi Ayaka,

    I benchmark the filters running out of RAM, with the coefficients in RAM and all interrupts turned off. I use the clock tool in CCS (in debug view, Run->Clock->Enable), measuring from the point of the function call, i.e. the LCR <fn_name> instruction in the disassembly window, to the return.

    I'm not sure that setting/clearing a GPIO would add enough overhead to account for the discrepancy you are seeing. Can you use the clock to verify the cycle count for just the FIR function call? If there is still a large discrepancy between the user guide and what you measure, I can take a look at it.
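
If the CCS clock tool is awkward to use, another common approach is to read a free-running hardware timer before and after the call and subtract the cost of the measurement itself. The sketch below only shows the arithmetic. It assumes a 32-bit down-counting timer with its period set to 0xFFFFFFFF, and the function names are hypothetical.

```c
#include <stdint.h>

/* Cycles elapsed between two reads of a free-running 32-bit DOWN-counter
   (start_count is read first). Unsigned subtraction also covers a single
   wrap-around, provided the timer period is 0xFFFFFFFF. */
static uint32_t elapsed_cycles(uint32_t start_count, uint32_t end_count)
{
    return start_count - end_count;
}

/* Remove the cost of the measurement itself (timer reads, call overhead),
   found by timing an empty region the same way. Clamps at zero. */
static uint32_t net_cycles(uint32_t measured, uint32_t overhead)
{
    return (measured > overhead) ? (measured - overhead) : 0u;
}
```

On the target, the two counts would come from reading a CPU timer counter register (for example `CpuTimer1Regs.TIM.all`, if your project uses TI's device header files) immediately before and after `fir_rx.calc(&fir_rx)`, with interrupts disabled so that nothing else runs in between.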
  • Hi Vishal,

    Thanks for your reply.

    I've tried that. It's a bit more fiddly than I'd hoped.

    When I put a breakpoint at fir_rx.calc(&fir_rx);, enabled the clock, and set another breakpoint at the toggling of the GPIO, the clock showed 80 cycles. That's a little over what I expected, I guess.

    What do you think? Is that expected?

    Thanks,
    Ayaka
  • Hmm, OK. Are the coefficients and delay line in separate physical RAMs? The reason I ask is that the FIR16 routine uses the DMAC instruction to do the multiply-accumulate, and it uses both the program and data buses to read one coefficient and one delay line element. If the two are in the same RAM block, you will have contention, and the operation gets delayed.

  • Hi Vishal,

    Thanks for the reply.

    My linker script had put both of them in the same RAM (RAMM0).

    When I moved one of them to another RAM (RAML5678), the calculation took 73 cycles, as opposed to 81 cycles before.

    That's definitely an improvement, but still a bit far from what's expected. What else could I try?

    Thanks,

    Ayaka
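
For readers who want to reproduce the change described above, this is roughly what the placement looks like. It is a sketch, not the poster's actual files: the section names `firRx` and `RxCoeff` and the memory range names `RAMM0` and `RAML5678` come from this thread, while the linker fragment in the comment and the `__TMS320C28XX__` guard (a macro the TI C28x compiler is expected to predefine) are assumptions.

```c
#define FIR_ORDER_RX   16
#define FIR_ORDER_SIZE ((FIR_ORDER_RX + 2) / 2)

/* DATA_SECTION is a TI compiler pragma; the guard lets the file build elsewhere. */
#ifdef __TMS320C28XX__
#pragma DATA_SECTION(alRxBuf, "firRx")
#pragma DATA_SECTION(RxCoeff, "RxCoeff")
#endif
long alRxBuf[FIR_ORDER_SIZE];   /* delay line */
long RxCoeff[FIR_ORDER_SIZE];   /* coefficients, copied to RAM at startup */

/* Matching linker command file entries (not C, shown for context only).
 * Putting the two sections in different memory ranges keeps the delay line
 * and the coefficients in different physical RAM blocks:
 *
 *   SECTIONS
 *   {
 *       firRx   : > RAMM0,    PAGE = 1
 *       RxCoeff : > RAML5678, PAGE = 1
 *   }
 */
```

The resulting addresses can then be confirmed in the .map file.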

  • Can you refer to the .map file and tell me the addresses for alRxBuf and RxCoeff?
  • Sure.

    alRxBuf is at 0x50 and RxCoeff is at 0x13be4

    In the map file:

    SECTION ALLOCATION MAP

     output                                  attributes/
    section   page    origin      length       input sections
    --------  ----  ----------  ----------   ----------------
    firRx
    *          1    00000050    00000012     UNINITIALIZED
                      00000050    00000012     rx_process.obj (firRx)
    ...
    ...
    RxCoeff
    *          1    00013be4    00000012     UNINITIALIZED
                      00013be4    00000012     rx_process.obj (RxCoeff)


    GLOBAL DATA SYMBOLS: SORTED BY DATA PAGE

    address     data page           name
    --------    ----------------    ----
    00000050       1 (00000040)     _alRxBuf
    ...
    ...
    00013be4     4ef (00013bc0)     _RxCoeff


    Let me know if you'd like to see the actual map file then maybe I can email it to you.

    Thanks,
    Ayaka
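
A quick way to sanity-check addresses like these is to look up which RAM block each one falls in. The sketch below does that with a small table. The block boundaries are written from memory of the F2806x data sheet and are an assumption here, so please verify them against the data sheet for your exact part. The type and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;
    uint32_t    start;   /* first word address in the block */
    uint32_t    end;     /* last word address in the block  */
} ram_block_t;

/* ASSUMED F28069 RAM map; check against the device data sheet. */
static const ram_block_t ram_blocks[] = {
    { "M0", 0x000000u, 0x0003FFu },
    { "M1", 0x000400u, 0x0007FFu },
    { "L0", 0x008000u, 0x0087FFu },
    { "L1", 0x008800u, 0x008BFFu },
    { "L2", 0x008C00u, 0x008FFFu },
    { "L3", 0x009000u, 0x009FFFu },
    { "L4", 0x00A000u, 0x00BFFFu },
    { "L5", 0x00C000u, 0x00DFFFu },
    { "L6", 0x00E000u, 0x00FFFFu },
    { "L7", 0x010000u, 0x011FFFu },
    { "L8", 0x012000u, 0x013FFFu },
};

/* Return the block containing addr, or NULL if it is not in the table. */
static const ram_block_t *find_block(uint32_t addr)
{
    size_t i;
    for (i = 0; i < sizeof(ram_blocks) / sizeof(ram_blocks[0]); i++) {
        if (addr >= ram_blocks[i].start && addr <= ram_blocks[i].end) {
            return &ram_blocks[i];
        }
    }
    return NULL;
}

/* Non-zero when both addresses are in the same known block. */
static int same_block(uint32_t a, uint32_t b)
{
    const ram_block_t *ba = find_block(a);
    const ram_block_t *bb = find_block(b);
    return (ba != NULL) && (ba == bb);
}
```

With this table, 0x50 falls in M0 and 0x13be4 in L8, so `same_block(0x50u, 0x13be4u)` returns 0, which agrees with the two buffers now being in separate blocks.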
  • Hmm, these seem to be in order. Let's look at the disassembly window. Can you post the assembly code from the line where you set the breakpoint, i.e. fir_rx.calc(&fir_rx), to the next line toggling the GPIO? I'm trying to see whether that extra 20 cycles is just code overhead in those two lines of code.

    You also have the option of generating assembly with interlisted C code. Right-click on the C file, go to its properties, then under C2000 Compiler->Advanced Options->Assembler Options select --keep_asm; right under that is the source interlist option, select --src_interlist. Now rebuild the .c file. You will find the assembly in the output folder, with the C code and the corresponding assembly for it, so we should be able to figure out whether there are actually 20 cycles between the function call and the toggling of the GPIO.
  • Hi Vishal,

    Here's the assembly around where fir_Rx.calc is called. I find it a bit difficult to read, but maybe you can have a look and see what's going on.

    Thanks,

    Ayaka

    ;***	-----------------------g4:
    ;*** 475	-----------------------    led_set(0);
    ;*** 476	-----------------------    fir_Rx.input = adcVal[i]-adcave;
    ;*** 477	-----------------------    (*fir_Rx.calc)(&fir_Rx);
    ;*** 478	-----------------------    led_clear(0);
    ;*** 480	-----------------------    dwErxAgc = claDwAgcResult;
    ;*** 482	-----------------------    dwTmp = __lmin(__lmax((long)fir_Rx.output*dwErxAgc>>8, (-32768L)), 32767L);
    ;*** 494	-----------------------    Rx_decode_task((int)dwTmp);
    ;*** 498	-----------------------    calculate_agc((int)dwTmp);
    ;*** 470	-----------------------    if ( (++i) < 512u ) goto g4;
    	.dwpsn	file "../rx_process.c",line 475,column 3,is_stmt
            MOVB      AL,#0                 ; [CPU_] |475| 
    $C$DW$123	.dwtag  DW_TAG_TI_branch
    	.dwattr $C$DW$123, DW_AT_low_pc(0x00)
    	.dwattr $C$DW$123, DW_AT_name("_led_set")
    	.dwattr $C$DW$123, DW_AT_TI_call
            LCR       #_led_set             ; [CPU_] |475| 
            ; call occurs [#_led_set] ; [] |475| 
            MOVW      DP,#_fir_Rx+6    ; [CPU_U] 
    	.dwpsn	file "../rx_process.c",line 476,column 3,is_stmt
            MOV       AL,*+XAR2[AR1]        ; [CPU_] |476| 
    	.dwpsn	file "../rx_process.c",line 477,column 3,is_stmt
            MOVL      XAR4,#_fir_Rx    ; [CPU_U] |477| 
    	.dwpsn	file "../_rx_process.c",line 476,column 3,is_stmt
            SUB       AL,AR3                ; [CPU_] |476| 
            MOV       @_fir_Rx+6,AL    ; [CPU_] |476| 
    	.dwpsn	file "../rx_process.c",line 477,column 3,is_stmt
            MOVL      XAR7,@_fir_Rx+10 ; [CPU_] |477| 
    $C$DW$124	.dwtag  DW_TAG_TI_branch
    	.dwattr $C$DW$124, DW_AT_low_pc(0x00)
    	.dwattr $C$DW$124, DW_AT_TI_call
    	.dwattr $C$DW$124, DW_AT_TI_indirect
            LCR       *XAR7                 ; [CPU_] |477| 
            ; call occurs [XAR7] ; [] |477| 
    	.dwpsn	file "../rx_process.c",line 478,column 3,is_stmt
            MOVB      AL,#0                 ; [CPU_] |478| 
    $C$DW$125	.dwtag  DW_TAG_TI_branch
    	.dwattr $C$DW$125, DW_AT_low_pc(0x00)
    	.dwattr $C$DW$125, DW_AT_name("_led_clear")
    	.dwattr $C$DW$125, DW_AT_TI_call
            LCR       #_led_clear           ; [CPU_] |478| 
            ; call occurs [#_led_clear] ; [] |478| 
    

  • Ok, I see, so the GPIO toggle is a function call. In the disassembly window I would set the breakpoint at this instruction:

    LCR  *XAR7

    That is the filter function call. Set the breakpoint there; once it is reached, start the clock and single-step (in the disassembly window itself) to the next instruction, i.e.

    MOVB AL, #0

    and check whether the number of cycles is 58. That is how I benchmark the library functions, from the point of the LCR instruction. It's possible that by setting the breakpoint in the C code you are executing code that does not directly pertain to the FIR filter, and that is where the extra 20 cycles are coming from.

    Also, FYI, I think v1.20.00.00 of the library is available in controlSUITE.