TI C66x multiply intrinsics for 64-bit output

studinstru sggs

Hi,

I could not find a suitable instruction for the operation below.

I am planning to use the _qmpy32 instruction shown below, but I could not find any instruction that gives me the result as 2x64 bits instead of 4x32 bits, and because of that I am getting a wrong result. Can anyone tell me whether the C66x DSP has another instruction that multiplies two 32-bit values and returns the result in 64 bits?

int32_t beta[4] ={185931936,84529224,-144944792,-175891288};
int32_t alfa[4] ={28505,24851,11653,13268};

C code:

int64_t mult = 0;

for (int i = 0; i < 4; i++) {
    /* widen one operand first so the product is computed in 64 bits */
    mult += (int64_t)beta[i] * alfa[i];
}

C66x CODE:

__x128_t _qmpy32 (__x128_t src1, __x128_t src2);

over 8 years ago

0 Raja over 8 years ago


0 ran35366 over 8 years ago in reply to Raja


By definition, _qmpy32 returns only 32 bits of each product, so the full 64-bit results are lost.
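To make the effect concrete, here is a minimal portable C sketch (not C66x-specific) that compares a full 64-bit accumulation with a model that keeps only 32 bits of each product. The helper names dot_wide and dot_low32 are made up for illustration, and the model assumes the retained bits are the low 32 bits of each product; check the instruction set guide for the exact behavior of _qmpy32.

```c
#include <stdint.h>

/* Example data taken from the question. */
static const int32_t beta[4] = {185931936, 84529224, -144944792, -175891288};
static const int32_t alfa[4] = {28505, 24851, 11653, 13268};

/* Full-precision dot product: widen one operand to 64 bits before the
 * multiply so each product is computed as a 64-bit value. */
static int64_t dot_wide(const int32_t *a, const int32_t *b, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int64_t)a[i] * b[i];
    }
    return acc;
}

/* Model of a multiply that keeps only 32 bits of each product
 * (ASSUMPTION: the low 32 bits). The multiply is done on unsigned
 * values so the wrap-around is well defined in C. */
static int64_t dot_low32(const int32_t *a, const int32_t *b, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++) {
        uint32_t low = (uint32_t)a[i] * (uint32_t)b[i];
        /* reinterpret the low 32 bits as signed (wraps on common compilers) */
        acc += (int32_t)low;
    }
    return acc;
}
```

With the arrays from the question, dot_wide(beta, alfa, 4) needs far more than 32 bits, so dot_low32 cannot return the same value. If an intrinsic is still preferred, the TI compiler documentation lists _mpy32ll as a 32x32 multiply with a 64-bit result; that is worth confirming in the compiler guide for your toolchain version.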

I wrote a small piece of C code that does what you want. The compiler gives me one multiplication per cycle (a 2-cycle loop with 2 additions in the loop), and it does not use any intrinsics. Compile it and check the performance. If the performance is good enough for you, then you are done. Here is the code:

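long long function1 (long *a, long *b)
{
    long long xx = 0 ;      /* 64-bit accumulator */
    long long aa, bb ;
    int i ;

    /* tell the compiler that both pointers are 8-byte aligned */
    _nassert((int) a % 8 == 0);
    _nassert((int) b % 8 == 0);

    #pragma UNROLL(2)
    for (i = 0; i < 1024; i++)
    {
        /* widen both operands to 64 bits before multiplying */
        aa = (long long) *a++ ;
        bb = (long long) *b++ ;
        xx = xx + aa * bb ;
    }

    return xx ;
}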

And the assembly says:

;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop found in file : ../main.c
;* Loop source line : 22
;* Loop opening brace source line : 23
;* Loop closing brace source line : 27
;* Loop Unroll Multiple : 2x
;* Known Minimum Trip Count : 512
;* Known Maximum Trip Count : 512
;* Known Max Trip Count Factor : 512
;* Loop Carried Dependency Bound(^) : 2
;* Unpartitioned Resource Bound : 2
;* Partitioned Resource Bound(*) : 2
;* Resource Partition:
;* A-side B-side
;* .L units 1 1
;* .S units 0 0
;* .D units 1 1
;* .M units 1 1
;* .X cross paths 1 1
;* .T address paths 1 1
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 3 3 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1 1
;* Bound(.L .S .D .LS .LSD) 2* 2*
;*
;* Searching for software pipeline schedule at ...
;* ii = 2 Schedule found with 6 iterations in parallel
;* Done
;*
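The report above shows a 2x unroll with ii = 2, which means two products every two cycles. As a rough illustration of that loop shape, here is a hand-unrolled sketch in portable C. The name dot_unroll2, the use of two separate accumulators, and the requirement that n is even are all assumptions made for this example; the schedule the compiler actually generates may look different.

```c
#include <stdint.h>

/* Example data taken from the question. */
static const int32_t beta[4] = {185931936, 84529224, -144944792, -175891288};
static const int32_t alfa[4] = {28505, 24851, 11653, 13268};

/* Hypothetical helper: 64-bit dot product, unrolled by two by hand.
 * ASSUMPTION: n is even. */
static int64_t dot_unroll2(const int32_t *a, const int32_t *b, int n)
{
    int64_t acc0 = 0;   /* accumulates even-indexed products */
    int64_t acc1 = 0;   /* accumulates odd-indexed products */
    for (int i = 0; i < n; i += 2) {
        /* each operand is widened before the multiply */
        acc0 += (int64_t)a[i]     * b[i];
        acc1 += (int64_t)a[i + 1] * b[i + 1];
    }
    return acc0 + acc1;
}
```

Calling dot_unroll2(beta, alfa, 4) gives the same 64-bit sum as the plain loop in the question once the operands are widened.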

Does this answer your question? If so, please close the thread.

Ran
