TI C66x multiply intrinsics for 64-bit output

Hi,

I could not find a suitable instruction for the operation below.

I am planning to use the _qmpy32 intrinsic shown below, but I could not find any instruction that gives the result as 2x64 bits instead of 4x32 bits, so with _qmpy32 I am getting the wrong result. Can anyone tell me whether the C66x DSP has another instruction that multiplies two 32-bit values and returns a 64-bit result?

int32_t beta[4] ={185931936,84529224,-144944792,-175891288};
int32_t alfa[4] ={28505,24851,11653,13268};

C code:

int64_t mult = 0;
for (int i = 0; i < 4; i++) {
    /* widen to 64 bits before multiplying so the product cannot overflow */
    mult += (int64_t)beta[i] * alfa[i];
}

C66x CODE:

__x128_t _qmpy32 (__x128_t src1, __x128_t src2);
  • By definition _qmpy32 rounds the results to 32 bits
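
To see why four 32-bit results cannot hold these products, here is a minimal portable C sketch (not C66x-specific; the helper names fits_in_int32 and mul32x32_to_64 are made up for illustration). It computes a full 64-bit product and checks whether that product is still representable as a 32-bit signed value. For the first pair from the question, 185931936 * 28505, the full product is far outside the 32-bit range, so any 32-bit result loses information.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if v is representable as int32_t, else 0. */
static int fits_in_int32(int64_t v)
{
    return v >= INT32_MIN && v <= INT32_MAX;
}

/* Full-width product: widen one operand first so the multiply is done in 64 bits. */
static int64_t mul32x32_to_64(int32_t x, int32_t y)
{
    return (int64_t)x * y;
}
```

For example, fits_in_int32(mul32x32_to_64(185931936, 28505)) returns 0, while the same check on a small product such as 1000 * 1000 returns 1.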

    I wrote a small piece of C code that does what you want. The compiler gives me one multiplication per cycle (a 2-cycle loop with 2 additions in the loop), and it does not use any intrinsics. Look at the code, compile it, and see what performance you get.
    If that performance is good enough for you, then you are done.

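    long long function1 (long *a, long *b)
    {
        long long xx = 0;   /* 64-bit accumulator */
        long long aa, bb;
        int i;

        /* Tell the compiler that both arrays are 8-byte aligned */
        _nassert((int) a % 8 == 0);
        _nassert((int) b % 8 == 0);

        #pragma UNROLL(2)
        for (i = 0; i < 1024; i++)
        {
            /* Widen each operand to 64 bits before multiplying */
            aa = (long long) *a++;
            bb = (long long) *b++;
            xx = xx + aa * bb;
        }

        return xx;
    }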
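
As a quick host-side check of the same idea, the sketch below applies the widen-then-multiply pattern to arrays of 32-bit values such as the beta and alfa arrays from the question. It is plain C with no TI intrinsics or pragmas, and the function name dot_product_64 is an illustrative assumption, not something from the TI toolchain.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sum of a[i] * b[i], with every product formed in 64 bits. */
static int64_t dot_product_64(const int32_t *a, const int32_t *b, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++) {
        /* Cast before the multiply; casting the 32-bit product afterwards
           would be too late, because the overflow has already happened. */
        acc += (int64_t)a[i] * (int64_t)b[i];
    }
    return acc;
}
```

With the four-element beta and alfa arrays from the question, dot_product_64(beta, alfa, 4) evaluates to 3377858310944, which can serve as a reference value when checking the DSP build.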

    And the assembly says:


    ;*----------------------------------------------------------------------------*
    ;*   SOFTWARE PIPELINE INFORMATION
    ;*
    ;*      Loop found in file               : ../main.c
    ;*      Loop source line                 : 22
    ;*      Loop opening brace source line   : 23
    ;*      Loop closing brace source line   : 27
    ;*      Loop Unroll Multiple             : 2x
    ;*      Known Minimum Trip Count         : 512
    ;*      Known Maximum Trip Count         : 512
    ;*      Known Max Trip Count Factor      : 512
    ;*      Loop Carried Dependency Bound(^) : 2
    ;*      Unpartitioned Resource Bound     : 2
    ;*      Partitioned Resource Bound(*)    : 2
    ;*      Resource Partition:
    ;*                                A-side   B-side
    ;*      .L units                     1        1
    ;*      .S units                     0        0
    ;*      .D units                     1        1
    ;*      .M units                     1        1
    ;*      .X cross paths               1        1
    ;*      .T address paths             1        1
    ;*      Logical  ops (.LS)           0        0     (.L or .S unit)
    ;*      Addition ops (.LSD)          3        3     (.L or .S or .D unit)
    ;*      Bound(.L .S .LS)             1        1
    ;*      Bound(.L .S .D .LS .LSD)     2*       2*
    ;*
    ;*      Searching for software pipeline schedule at ...
    ;*         ii = 2  Schedule found with 6 iterations in parallel
    ;*      Done
    ;*


    Does it answer your question? If so, please close the thread.

    Ran