CALCULATION TAKING TOO MUCH TIME??

Other Parts Discussed in Thread: TMS320C6472, SPRC542

Hello

I want to use the TMS320C6472 for real-time simulation, so I want to check how much time one core takes to multiply a 20*20 matrix by a 20*1 column vector. All the data are of type 'int'. I am using the evm6472.gel file that comes with the kit. For now I want to use a single core for the calculation, and I am using core 3. In the GEL file the PLL1 multiplier is configured as 25, so the core runs at 625 MHz.

My linker command file contains:

-c
-lrts64plus.lib
-heap  0xa000
-stack 0x2000

/* Memory Map 1 - the default */
MEMORY
{
        L1D:     o = 00f00000h   l = 00008000h
        L1P:     o = 00e00000h   l = 00008000h
        L2:      o = 00800000h   l = 00098000h
}

SECTIONS
{
    .text       >       L2
    .stack      >       L2
    .bss        >       L2
    .cinit      >       L2
    .cio        >       L2
    .const      >       L2
    .data       >       L2
    .switch     >       L2
    .sysmem     >       L2
    .far        >       L2
}

My main program contains:

#include <stdio.h>

/* Timer 3 registers (addresses as used on this board) */
#define EMUMGT_CLKSPD3 (*((volatile unsigned int *) 0x02610004))
#define CNTLO3 (*((volatile unsigned int *) 0x02610010))
#define CNTHI3 (*((volatile unsigned int *) 0x02610014))
#define PRDLO3 (*((volatile unsigned int *) 0x02610018))
#define PRDHI3 (*((volatile unsigned int *) 0x0261001C))
#define TCR3   (*((volatile unsigned int *) 0x02610020))
#define TGCR3  (*((volatile unsigned int *) 0x02610024))

int main(void)
{
    int sum, ord, i, j, arr[20][20], b[20], x[20];

    /* Timer setup: maximum period, counter cleared, timer disabled */
    PRDHI3 = 0xFFFFFFFF;
    PRDLO3 = 0xFFFFFFFF;
    CNTLO3 = 0x00000000;
    CNTHI3 = 0x00000000;
    TGCR3  = 0x00000003;
    TCR3   = 0x00000000;

    /* Generate matrix 'arr' and vector 'b' */
    ord = 20;
    for (i = 0; i < ord; i++) {
        b[i] = i + 1;
        for (j = 0; j < ord; j++)
            arr[i][j] = i + j + 2;
    }

    TCR3 |= 0x00000040;    /* timer starts */

    /* Matrix-by-vector multiplication */
    for (i = 0; i < ord; i++) {
        sum = 0;
        for (j = 0; j < ord; j++)
            sum += arr[i][j] * b[j];
        x[i] = sum;
    }

    TCR3 &= 0xFFFFFFBF;    /* timer stops */

    printf("\ncalculation ends");
    printf("\ncounter low:%x", CNTLO3);
    printf("\ncounter high:%x", CNTHI3);

    for (i = 0; i < ord; i++)
        printf("\nx[%d]=%d", i, x[i]);

    return 0;
}

Problem:

I get a count of 0xB19, i.e. 2841 in decimal. Since the timer advances once every 6 CPU cycles, the calculation time is 2841*6/625 usec = 27.27 usec. The calculation involves:

20 assignments (sum=0)

400 multiply and accumulate

400 increment (j) and compare

20 increment (i) and compare

With a clock of 625 MHz, one clock cycle is 1.6 ns. Even at such a high clock rate it takes 27.27 usec to do fewer than 1000 operations (840 by the count above), so on average one operation takes more than 27.27 ns, roughly 20 clock cycles. Isn't that far too long? Or am I mistaken somewhere? Please correct me.

The timers are getting the correct frequency. I verified that by generating a timer interrupt after 0xB19 counts, with the timer in clock mode. The output pin (TIMO2) is available at the HPI DC port, and at that pin I get a 50% duty-cycle square wave with a period of around 2*27 usec.
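The arithmetic above can be reproduced with a few lines of C. This is only a sketch: it assumes, as the 2841*6/625 formula implies, that the timer advances once every 6 CPU cycles, and the helper names are made up for illustration.

```c
#include <stdio.h>

/* Convert a raw timer count into CPU cycles, assuming the timer
   input clock is the CPU clock divided by timer_div (6 here). */
unsigned long long timer_to_cpu_cycles(unsigned long long count,
                                       unsigned timer_div)
{
    return count * timer_div;
}

/* Elapsed time in microseconds for a CPU clock given in MHz. */
double cycles_to_us(unsigned long long cycles, double cpu_mhz)
{
    return (double)cycles / cpu_mhz;
}

/* Operation count used in the post for an n x n matrix-vector product:
   n assignments (sum=0), n*n multiply-accumulates,
   n*n inner increment/compares and n outer increment/compares. */
unsigned ops_for_matvec(unsigned n)
{
    return n + n * n + n * n + n;
}

/* Print cycles, time and cycles per operation for a 625 MHz core. */
void report_timing(unsigned long long count)
{
    unsigned long long cycles = timer_to_cpu_cycles(count, 6);
    unsigned ops = ops_for_matvec(20);

    printf("CPU cycles: %llu\n", cycles);
    printf("elapsed: %.2f usec\n", cycles_to_us(cycles, 625.0));
    printf("cycles per operation: %.1f\n", (double)cycles / ops);
}
```

For example, report_timing(0xB19) works out to 17046 CPU cycles, about 27.27 usec, or roughly 20 cycles per operation.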

Regards,

AC.

  • Hi AC,

    I suspect you haven't chosen -o3 (the highest optimization level) for the Code Generation Tools. This is the first thing I'd check.

    Then I strongly recommend having a look at the C6000 Programmer's Guide:

    http://focus.ti.com/lit/ug/spru198j/spru198j.pdf

    It will help you identify areas for improvement on various fronts.
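As one illustration of the kind of source-level change such a guide discusses, the same loop can be written so the compiler knows the arrays do not overlap. This is a sketch, not code from the guide: the function name matvec_int, the flat row-major layout, and the fixed order ORD are assumptions.

```c
#define ORD 20   /* matrix order, fixed at compile time (assumption) */

/* x = A * b, with A stored row-major as ORD*ORD ints.
   The restrict qualifiers promise the compiler that A, b and x
   do not overlap, so loads and stores can be scheduled freely. */
void matvec_int(const int *restrict A, const int *restrict b,
                int *restrict x)
{
    int i, j;
    for (i = 0; i < ORD; i++) {
        int sum = 0;
        for (j = 0; j < ORD; j++)
            sum += A[i * ORD + j] * b[j];
        x[i] = sum;
    }
}
```

It could be called from the program above as matvec_int(&arr[0][0], b, x). When the trip count is only known at run time, the TI compiler's MUST_ITERATE pragma is the usual way to pass that information to the optimizer.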

     

    Kind regards,

    one and zero

  • Hi AC,

    Optimized code for matrix multiplication is available in the DSPLIB for the C64x+, which can be downloaded here:

    http://focus.ti.com/docs/toolsw/folders/print/sprc265.html

    With it you will be able to multiply a 20*20 matrix by a 20*1 matrix in 156 cycles.

    Regards,

    Rahul

  • Hello,

     

    Thanks for the answers. You were right: I was not using optimization level 3. After enabling it, the execution time dropped drastically. Now I can multiply a 20*20 matrix by a 20*1 vector in less than 1 usec, but only for integer data. For floating-point data the time is around 100 times longer, probably because the TMS320C6472 is a fixed-point processor and floating-point operations are done in software. Is there a trick to get around this? In other words, how can I do floating-point calculations reasonably efficiently (relatively speaking) on a fixed-point processor?

     

    Regards,

    AC.

  • AC,

    Have you considered using Q formats for your floating-point matrices? With a Q format, you convert each floating-point value to a suitable fixed-point value and, after the matrix multiplication, convert the result back to floating point. To convert a floating-point number to Qm.n format, multiply it by 2^n and round to the nearest integer.
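Applied to the original 20*20 problem, the idea might look like the sketch below. It assumes Qm.15 values stored in 32-bit integers (so each |value| must stay below 65536) and a 64-bit accumulator for the Q30 products; the names to_q15 and matvec_q15 are hypothetical, and the IQmath library mentioned below offers tested routines for this kind of work.

```c
#include <stdint.h>

#define ORD 20   /* matrix order, as in the original post */
#define QN  15   /* fractional bits: values are stored as Qm.15 */

/* Convert n floats to Qm.15 integers: multiply by 2^15 and round to
   nearest. Assumes every |in[k]| < 65536 so the result fits in int32_t. */
void to_q15(const float *in, int32_t *out, int n)
{
    int k;
    for (k = 0; k < n; k++) {
        float s = in[k] * (float)(1L << QN);
        out[k] = (int32_t)(s >= 0.0f ? s + 0.5f : s - 0.5f);
    }
}

/* x = A * b, with A (row-major, ORD x ORD) and b already in Q15.
   Each product of two Q15 values is Q30; the 64-bit accumulator gives
   headroom for the sum, and dividing by 2^30 converts back to float. */
void matvec_q15(const int32_t *qA, const int32_t *qb, float *x)
{
    int i, j;
    for (i = 0; i < ORD; i++) {
        int64_t acc = 0;
        for (j = 0; j < ORD; j++)
            acc += (int64_t)qA[i * ORD + j] * qb[j];
        x[i] = (float)acc / (float)(1LL << (2 * QN));
    }
}
```

If the matrix stays constant across simulation steps, it only needs to be converted once, so the per-step cost is mostly the integer multiply-accumulate loop.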

    As a simple example, consider the following.

    We want to multiply 3.25 and 7.425 on a fixed point DSP.

    Q15 format of 3.25 is 3.25*32768 = 106496

    Q15 format of 7.425 is 7.425*32768 = 243302 (rounding)

    Q15(3.25) * Q15(7.425) = 106496 *243302  = 25910689792

    The product of two Q15 numbers is a Q30 number, so to convert the result back to a floating-point value we divide by 2^30 (1073741824). Note that this product does not fit in 32 bits, so it has to be computed in a 64-bit type.

    Answer = 25910689792/1073741824 ≈ 24.1312 ≈ 3.25 * 7.425 (exactly 24.13125)
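The same worked example in C, as a sketch with hypothetical helper names. The operands are widened to 64 bits before multiplying, because the Q30 product 25910689792 is larger than any 32-bit integer.

```c
#include <stdint.h>

/* Multiply two Qm.15 values; the result is in Q30.
   Both operands are widened first so the product cannot overflow. */
int64_t q15_mul(int32_t a, int32_t b)
{
    return (int64_t)a * (int64_t)b;
}

/* Convert a Q30 value back to floating point by dividing by 2^30. */
double q30_to_double(int64_t p)
{
    return (double)p / 1073741824.0;
}
```

q15_mul(106496, 243302) returns 25910689792, and q30_to_double of that gives about 24.13121 rather than the exact 24.13125, because 7.425*32768 = 243302.4 was rounded to 243302.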

    Note: using Q format can introduce precision errors from rounding, and values must stay within the range of the chosen format to avoid overflow.

    For more details, take a look at the IQmath library, which can be found here:

    http://focus.ti.com/docs/toolsw/folders/print/sprc542.html

     

    Regards,

    Rahul

     

  • Thanks Rahul,

     

    I did not know about this library. I think it will help. Thanks for the example.

     

    Regards,

    AC