
How to multiply two short numbers efficiently

Hi,

    I want to multiply two arrays of short (16-bit) numbers quickly, as shown in the code below:

    short a[1000];
    short b[1000];
    short c[1000];
    int i;

    for (i = 0; i < 1000; i++)
        c[i] = a[i] * b[i];

    There is a DSPLIB function, DSP_mul32, but its two operands must be int (32-bit) and the result is also int. Does anyone have a solution? Thanks.

  • Literature Number: SPRU187S
    March 2011
    TMS320C6000 Optimizing Compiler v 7.2
    7.5.5 Using Intrinsics to Access Assembly Language Statements
    Table 7-3. TMS320C6000 C/C++ Compiler Intrinsics
    int _mpy (int src1, int src2);
    MPY
    Multiplies the 16 LSBs of src1 by the 16 LSBs of src2 and returns the result. Values can be signed or unsigned.

    Never used it myself. Might work in your situation. Very non-portable.
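    For illustration only, a minimal sketch of how the intrinsic would look in the loop from the original post (a, b, c and i are the variables from the question; _mpy is the TI compiler built-in listed in the table above):

    for (i = 0; i < 1000; i++)
        c[i] = (short)_mpy(a[i], b[i]);  /* MPY of the low 16 bits; result truncated back to short */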


  • Hi, Wong,

        Maybe I did not express the question clearly. I need to do a lot of multiplications of two short numbers, but there is no DSPLIB function for this. DSP_mul32 does a similar task, but its two operands must be int (32-bit), and the time spent converting the short values to int is unacceptable.

  • I think you'll have to implement your own. Something like this:

    /* Element-wise 16-bit multiply: r[i] = x[i] * y[i] for nx elements. */
    void DSP_mul16(const short *x, const short *y, short *r, short nx)
    {
      register short i=nx;
      register short a;
      register short b;
      register int   c;

      if(i==0) return;          /* nothing to do for an empty array */

      do
      {
        a   = *x++;
        b   = *y++;
        c   = _mpy(a, b);       /* one MPY instruction: 16 x 16 -> 32 bit */
       *r++ = (short)c;         /* truncate back to 16 bits, as in the original loop */
      }
      while(--i);
    }

    In theory, each line of C code should translate into one line of assembler.
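    As a usage sketch (not part of the original reply), with the 1000-element arrays from the question the call would be:

    DSP_mul16(a, b, c, 1000);   /* c[i] = a[i] * b[i] for i = 0..999 */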


  • I inserted your subroutine into my program, but the time cost is still high. Does the function need to be compiled with optimization turned on?

  • Sorry, I've never been able to figure out the optimization controls on the C6000 compiler. The compiler seems to unroll loops or combine repeated code on a whim.  You could use the assembler output of the compiler as a basis for your own assembler module. I doubt you can significantly reduce the number of instructions per loop. If your nx is constant, you could replace the loop with a fixed number of multiplies. However, the compiler may combine your repeated multiplies back into a loop. Note that unrolling a loop may result in code larger than the L1 cache, which can be slower than a tight loop.
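    To illustrate the fixed-count idea (a sketch only, assuming nx is known to be 4 and x, y, r are the pointers from DSP_mul16 above):

    /* "A fixed number of multiplies" for a known count of 4: */
    r[0] = (short)_mpy(x[0], y[0]);
    r[1] = (short)_mpy(x[1], y[1]);
    r[2] = (short)_mpy(x[2], y[2]);
    r[3] = (short)_mpy(x[3], y[3]);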

  • Let me clarify a few points here.

    • The reason there is no DSPLIB function for a 16-bit multiply is that it can be accomplished by a single assembly instruction.  The way to access these instructions from C code is to use intrinsics, as Norman mentioned.  But there is a better approach than using _mpy.
    • The C6000 family consists of 32-bit processors.  There are no 16-bit registers.  So, while we can get 16-bit results, we have to do arithmetic operations on 32-bit values and then shift the results to capture the portions of the data that we need.
    • Optimization definitely needs to be turned on to get full performance; the C6000 architecture is very complex.  (See the sketch after this list.)
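    As a minimal sketch of that last point (not from the original thread): if the plain C loop is written with restrict-qualified pointers plus alignment and trip-count hints, the optimizer can software-pipeline it and, on devices that support packed 16-bit multiplies, generate them on its own. The alignment and trip-count values below are assumptions about the caller's data, and mul16 is just an illustrative name.

    /* Sketch only: optimizer-friendly plain C. Assumes the arrays do not overlap,
     * are 8-byte aligned, and nx is a multiple of 8 (and at least 8).
     * Build with optimization enabled, e.g. -o3 on the TI compiler. */
    void mul16(const short * restrict x, const short * restrict y,
               short * restrict r, int nx)
    {
        int i;
        _nassert((int)x % 8 == 0);      /* alignment hints enable wide/packed loads */
        _nassert((int)y % 8 == 0);
        _nassert((int)r % 8 == 0);
        #pragma MUST_ITERATE(8,,8)      /* trip-count hint: >= 8 and a multiple of 8 */
        for (i = 0; i < nx; i++)
            r[i] = x[i] * y[i];
    }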

    Norman said "In Theory, each line of C code should translate into an assembly instruction." I don't believe that's good theory.  

    Example: consider the following C code.  If the value of x (stored in memory) is equal to the value of y (stored in memory), then increment the value of c (stored in memory).  This takes more than two assembly instructions.  Yes, some C statements will translate to one assembly instruction, but most don't.

    if (x == y)
      c++;

    1. Fetch x from memory and store it in a register (call it A0).
    2. Fetch y from memory and store it in a register (call it B0).
    3. Compare A0 and B0 and store the result in another register.
    4. Fetch c from memory and store it in a register (call it A1).
    5. (Conditional on the result of step 3 being true) add 1 to A1.
    6. (Conditional on the result of step 3 being true) write the value in A1 back to the location of c in memory.

    The point is, without any optimization, this is likely going to take more than six cycles.  With optimization on, much of this work can be done in parallel: we could fetch x and y on the first cycle; on the second cycle, compare A0 and B0 and fetch c; on the third, conditionally add 1; and write back on the fourth cycle.


    What is the processing time in your current configuration?  And what are you expecting to get?  What happens if you remove debug symbols and turn up the optimization level of your initial code?  How much improvement do you see?


    Here is a great application note on Optimizing Loops

    http://www.ti.com/lit/an/spra666/spra666.pdf

    Also, I recommend reading the latest version of the compiler guide.  It details all of the switches that can be passed to the optimizer and explains how they are used.

    http://www.ti.com/lit/ug/spru187t/spru187t.pdf

    Also, familiarize yourself with the assembly instructions.  Knowing what the CPU can do will allow you to more thoroughly understand what the compiler is doing.

    http://www.ti.com/lit/ug/sprugh7/sprugh7.pdf

    Here's a link to a 3.5-day workshop and its materials, all on optimization for the C6000.

    http://processors.wiki.ti.com/index.php/TMS320C6000_DSP_Optimization_Workshop


    Regards,

    Dan


  • Thanks, I will continue the work following your suggestions.