Hardware Multiplier 16x16 MSP430AFE254

MAC Engineering

Expert 1460 points

We are seeking the advice of experienced MSP programmers regarding the Hardware Multiplier.

Our goal is to keep the SD24's 24-bit resolution and to process a polynomial efficiently.

Example Expression:

+6.383091E-08 * 1EFFFFh

Q1) Is it efficient to use the 16x16 Hardware Multiplier for operands wider than 16 bits?

Q2) If so, can someone demonstrate how to perform the above example multiplication in C?

This example is oversimplified; TI document slaa042.pdf, page 21, is similar to our end goal.

Note that we do not program in assembly language, so slaa042.pdf, Fig. 4 (40x40 Unsigned Multiplication, MPYU40) makes very little sense to us.

We are using CCS v4.

Thanks in advance for clear help on how to load the registers and do efficient 24-bit calculations,

MAC Engineering

over 13 years ago

0 Antonio Espirito-Santo over 13 years ago

Expert 2855 points

Hi,

If you really want efficiency, you should program this in assembly. The code below shows you how to implement the multiply-by-hand procedure in C. Confirm that you have enabled the hardware multiplier.

Best Regards,

AES

//*************************************************************************************************
// Multiplication by hand algorithm implemented in C.
// This example multiplies two unsigned 32-bit values. The result is stored in 64 bits.
//
//                                      b1 : b0
//                                    x a1 : a0
//                       ____________________________
//                                   c0_high : c0_low ------------> c0 (32 bits) --- a0*b0
//                         c1_high : c1_low             ------------> c1 (32 bits) --- a0*b1
//                         c2_high : c2_low             ------------> c2 (32 bits) --- a1*b0
//      + c3_high : c3_low                       ------------> c3 (32 bits) --- a1*b1
//            ______________________________________
//            res1_1 : res1_0 : res0_1 : res0_0  -----------> res (64 bits)
//
//    res0_0 = c0_low
//    res0_1 = c0_high + c1_low + c2_low
//    res1_0 = c1_high + c2_high + c3_low + carry
//    res1_1 = c3_high + carry
//*************************************************************************************************

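#include <msp430.h>                       // Generic header; picks the device header from the project's device setting

unsigned long int a;
unsigned long int b;

unsigned int a1, a0, b1, b0;              // 16-bit halves of a and b

unsigned long int c3,c2,c1,c0;            // 32-bit partial products

unsigned long int res1, res0;             // res1:res0 together hold the 64-bit result
unsigned long int t;                      // shifted partial product being added to res0

void main(void)
{
WDTCTL = WDTPW+WDTHOLD;                   // Stop watchdog timer

b = 6383091;                              // 6.383091E-08 as an integer mantissa (scaled by 1E+14)
a = 0x1EFFFF;

a0 = a & 0xffff;
a1 = (a & 0xffff0000)>>16;

b0 = b & 0xffff;
b1 = (b & 0xffff0000)>>16;

c0 = (unsigned long int) a0*b0;           // the cast forces a 32-bit result
c1 = (unsigned long int) a0*b1;
c2 = (unsigned long int) a1*b0;
c3 = (unsigned long int) a1*b1;

// Carry out of the low 32 bits is detected by comparison (sum < addend).
// Reading the carry flag from the status register after a C statement is
// not reliable, because the compiler decides which instructions run in between.
res0 = c0;
t = (c1 & 0xffff)<<16;
res0 += t;
res1 = (res0 < t);
t = (c2 & 0xffff)<<16;
res0 += t;
res1 += (res0 < t);

res1 += (c1 & 0xffff0000)>>16;
res1 += (c2 & 0xffff0000)>>16;
res1 += c3;
}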
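
As a minimal sketch, the same partial-product scheme can be packaged as a reusable function. The name `mul32x32_64` and the use of `<stdint.h>` types are assumptions for illustration and do not come from the post; the 64-bit result is returned as two 32-bit halves so that no 64-bit variable type is required. Carries out of the low half are detected by comparing the sum with the addend.

```c
#include <stdint.h>

/* Hypothetical helper (name is an assumption, not from the thread):
 * unsigned 32x32 -> 64 multiply built from four 16x16 -> 32 products.
 * The result is returned as two 32-bit halves: *hi (upper) and *lo (lower). */
static void mul32x32_64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint16_t a0 = (uint16_t)(a & 0xFFFFu), a1 = (uint16_t)(a >> 16);
    uint16_t b0 = (uint16_t)(b & 0xFFFFu), b1 = (uint16_t)(b >> 16);

    /* Each cast forces a 16x16 -> 32 multiplication. */
    uint32_t c0 = (uint32_t)a0 * b0;
    uint32_t c1 = (uint32_t)a0 * b1;
    uint32_t c2 = (uint32_t)a1 * b0;
    uint32_t c3 = (uint32_t)a1 * b1;

    uint32_t low  = c0;
    uint32_t high = c3 + (c1 >> 16) + (c2 >> 16);
    uint32_t t;

    /* Add the low halves of c1 and c2 into bits 16..31 of the low word.
     * An unsigned sum that ends up smaller than the addend has wrapped,
     * which means a carry into the high word. */
    t = c1 << 16; low += t; if (low < t) high++;
    t = c2 << 16; low += t; if (low < t) high++;

    *hi = high;
    *lo = low;
}
```

Called as `mul32x32_64(0x1EFFFF, 6383091, &hi, &lo)`, this reproduces the example from the question, with the decimal exponent of the coefficient still to be applied by the caller.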

0 MAC Engineering over 13 years ago in reply to Antonio Espirito-Santo

Expert 1460 points

Thanks for the informative answer to my question.

Can you clarify “enabled the hardware multiplier” for CCS compiler?

0 Antonio Espirito-Santo over 13 years ago in reply to MAC Engineering

Expert 2855 points

Hi,

Even if your device has a hardware multiplier, if you select the option to build without one, the generated code will perform multiplications in software. This requires more computation time.
If your device has a hardware multiplier, CCS will enable this option automatically. Check this option in Project properties >> C/C++ Build >> MSP430 Linker >> Basic Options.

Best Regards,

AES

0 MAC Engineering over 13 years ago in reply to Antonio Espirito-Santo

Expert 1460 points

Dear AES

Thank you for the highly skilled and accurate answer to my question!

Now all I need is CCS v5 so I can store the answer to the expression in a 64-bit variable!

Is it time for me to learn assembly language? Can you give me an idea if there is a savings in program size by using assembly language. I know that there are a ton of variables (like CCS ability), but say one has a 2k program in C, would an equivalent program be smaller in assembly (thus more room for additional function)?

I have a project for an F2013 that needs a little more memory, but with C I am out of room.

Thanks again,

MAC

0 Jens-Michael Gross over 13 years ago in reply to MAC Engineering

Guru 227245 points

MAC Engineering said:
Can you give me an idea if there is a savings in program size by using assembly language.

When it comes down to low-level math, assembly indeed is much better than C, especially for addition and multiplication.

The reason is a limitation of the C language itself: if you multiply two values in C, the result has the same size as the larger of the two operands. So multiplying (or adding) two 16-bit values will always give only a 16-bit result. If you want a 32-bit result, you need to cast one of the operands to 32 bits (the compiler will then convert the other one to 32 bits too), and the operation performed will be a 32x32->32 bit multiplication. That wastes time and also some code space, since the hardware multiplier can easily do a 16x16->32 multiplication (and the MPY32 can do 32x32->64 or 8/16x32->64 as well, in a few cycles).
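
The effect described above can be sketched as follows. On the MSP430 `int` is 16 bits wide, so a plain `a * b` on two 16-bit operands keeps only the low 16 bits. To make the sketch behave the same on a PC compiler, where `int` is usually 32 bits, the truncation is modeled explicitly with fixed-width types. Both function names are hypothetical.

```c
#include <stdint.h>

/* Models what "a * b" yields on a target with 16-bit int:
 * only the low 16 bits of the product survive. */
static uint16_t mul16_truncated(uint16_t a, uint16_t b)
{
    return (uint16_t)((uint32_t)a * b);
}

/* Casting one operand to 32 bits before multiplying keeps the full
 * product. In C this is formally a 32x32 -> 32 multiplication, even
 * though both inputs fit in 16 bits. */
static uint32_t mul16_widened(uint16_t a, uint16_t b)
{
    return (uint32_t)a * b;
}
```

For example, 300 * 300 = 90000 does not fit in 16 bits: the truncated version returns 24464 (90000 - 65536), while the widened version returns 90000.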

In my own projects I wrote some inline functions for multiplication:
inline unsigned long int mul16x16(int a, int b) etc.
which contain inline assembly code. MSPGCC handles this very well, and it does not even hinder optimization too much (MSPGCC inline assembly allows the compiler to pick the registers used in the code by itself, and to be notified about register clobbering).
As a result, this is much faster than the standard '*' operator (especially since it does not require a function call), not larger (since only the required registers are used, allowing better optimization than across a function call), and it also saves the space for the runtime library multiplication function.
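
A rough C-only approximation of such a helper, without inline assembly, might look like the sketch below. It is not the MSPGCC code described above. It assumes the 16-bit hardware multiplier register names `MPY`, `OP2`, `RESLO` and `RESHI` from the TI device headers, that the compiler defines `__MSP430__`, and that no interrupt service routine uses the multiplier (otherwise the register sequence would need to run with interrupts disabled). The name `mul16x16u` is hypothetical, and the plain-C fallback exists only so the sketch can be compiled and checked off-target.

```c
#include <stdint.h>

#if defined(__MSP430__)
#include <msp430.h>   /* assumed to provide MPY, OP2, RESLO, RESHI */
#endif

/* Hypothetical helper: unsigned 16x16 -> 32 multiply.
 * Assumption: no ISR touches the multiplier, so no interrupt lock is used. */
static inline uint32_t mul16x16u(uint16_t a, uint16_t b)
{
#if defined(__MSP430__)
    MPY = a;   /* operand 1, written to the unsigned-multiply address */
    OP2 = b;   /* writing operand 2 starts the multiplication */
    return ((uint32_t)RESHI << 16) | RESLO;
#else
    /* Off-target fallback: same result, computed by the host compiler. */
    return (uint32_t)a * b;
#endif
}
```

Whether this beats the compiler's own `(unsigned long)a * b` depends on the toolchain and optimization level, so it is worth comparing the generated assembly before adopting it.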

Anything that goes beyond a single 'monolithic' operation (e.g. polynomials or encryption that uses multiply-accumulate, which is also supported by the hardware) is entirely outside the scope of C and requires assembly language if you want it done efficiently.

BTW, if you don't use the '*' operator anywhere, it makes no difference whether HWM usage is enabled or disabled in the compiler settings.
This setting only controls whether the compiler generates calls to the swmult32 or hwmult32 runtime library functions when you do a 32-bit multiplication.
On MSPGCC there's a similar setting, but there, for 16x16, inline code is generated if the multiplier is used (default), or a call to (sw) mult16 is generated if not. Also, you can define whether ISRs may use the multiplier, in which case the inlined code gets wrapped with interrupts disabled (default again). If your ISRs are not using the multiplier, you can save an additional 6 bytes for every multiplication, as no wrapping is necessary.

However, if your code doesn't fit into the flash anymore, you can try IAR, which optimizes slightly better. The free IAR version might be what you need, as long as its code size limit isn't exceeded.

0 Bernhard Weller over 13 years ago in reply to MAC Engineering

Genius 4925 points

MAC Engineering said:
would an equivalent program be smaller in assembly (thus more room for additional function)?

I'd say probably yes, but it comes at a price: software maintenance. Maybe it's just me, but I find it a lot harder to get my head around an assembly program than a C program, and the time I spend understanding the assembly program could already be spent (to some extent) on changing the C program.

I agree that there are cases in which you probably don't want to use C/C++, but before switching to assembler (that is, if you already have a C program) you should make sure that all the optimizations are in place, and if you are hitting size constraints, tune it for size optimization.

Just as an example: one of my codes in debug mode (no optimizations at all) takes 3kB and has a cycle time of nearly 100µs. If I compile my retail code, with all optimizations in place but tuned for maximum speed (I don't hit size constraints) the code takes 2.2kB and the cycle time is reduced to roughly 60µs. So a quite massive improvement (and yes my code is still working 100% fine with all optimizations, it's not like half of my program was optimized away because I forgot a volatile somewhere or something like that).

Well, you can of course say that's just an example of how badly one can write code in C. Maybe so; nevertheless, the optimizations should not be forgotten when it comes to deciding which language to use. Of course it's hard to estimate how well the optimizations will work in general (I guess my example is one where optimization works really well).

I think that an approach like the one proposed by Jens might work out very well - replace the C code where it really bloats the flash usage and keep it there where it doesn't matter which language is used (because it's more descriptive).

Hmm maybe my statements fall in the category "Well thank you Captain Obvious" - if so, I'm sorry - it's just hard to guess the background of all the people here...

0 MAC Engineering over 13 years ago in reply to Bernhard Weller

Expert 1460 points

Bernhard Weller said:
you should make sure that all the optimizations are in place, and if you are hitting size constraints tune it to size optimization.

This is getting interesting, I am using CCS v4, and will switch to v5 soon so I can use 64-bit variables.

Where in CCS does one set the "optimizations"? I need to follow your method if you saved 1/3 space! In addition, I need to learn assembler for the F2013 because it does not have the Hardware Multiplier.

Your comments are very good, if something is obvious my feelings do not get hurt. Ego keeps smart people from learning :)

MAC

0 Antonio Espirito-Santo over 13 years ago in reply to MAC Engineering

Expert 2855 points

Hi,

MAC Engineering said:
Where in CCS does one set the "optimizations"?

In CCS 4.2, go to Project properties >> C/C++ Build >> MSP430 Compiler >> Optimizations. Be sure to read the C/C++ Compiler V3.3 documentation.

In CCS 5.1, go to Project properties >> C/C++ >> MSP430 Compiler >> Optimizations. Be sure to read the C/C++ Compiler V4 documentation.

Best regards,

AES

0 MAC Engineering over 13 years ago in reply to Antonio Espirito-Santo

Expert 1460 points

Dear AES,

I am finding your comments very helpful and clear. I want to tap into your experience if I can.

CCS:

Do you know if,

a) CCS v4.2 can be on the same computer as v5.1 (to finish v4.2 projects before jumping into v5.1 and any new set of problems it introduces)?

b) CCS v4.2 has to be removed, and then v5.1 can be installed and used without too much pain?

MATH:
Example 1: a=x/z

Example 2: y=k(x/z), where k is a constant such as 1000 and x and z are 32-bit variables (ratiometric conversion).

Example 3: y=(x/z) + a, where a is the ADC offset calculated in Example 1.

a) Example 1, if z=32, is assembler language more efficient than using a C-language shift statement? If so, can you give me examples of the above math {Example 1, 2, 3} in assembler language?

Hardware Multiplier:

While stepping statement-by-statement through the C example you provided earlier and watching the Hardware Multiplier registers, I noticed that 'operand 1' never has a value loaded into it, but 'operand 2' has changing values. Is it that the operation happens too fast to see in the debugger? It seems that the last value loaded into 'operand 1' should remain until new values are loaded?

Thanks for your help, MAC

0 Antonio Espirito-Santo over 13 years ago in reply to MAC Engineering

Expert 2855 points

Hi,

MAC Engineering said:
CCS:

You don't need to remove CCSV4 to install CCSV5. You can have both IDE versions installed on the same computer. But I suggest that you definitely upgrade to CCSV5.

MAC Engineering said:
MATH:

Just to clarify, will "z" always be equal to 32 or another 2^N number?

MAC Engineering said:
Hardware Multiplier:

Are you stepping your code within the C view? If yes, you should notice that each C statement generates several assembly instructions. Alternatively, you can open the assembler view and step through the generated assembly code. This last option will give you a very interesting view of how the C compiler manages the multiplication operation in hardware.

Best Regards,

AES

0 MAC Engineering over 13 years ago in reply to Antonio Espirito-Santo

Expert 1460 points

AES said:
You don't need to remove the CCSV4 to install the CCSV5.

Thanks, sounds like you may have this on your computer and it works?

AES said:
Just to clarify, will "z" always be equal to 32 or another 2^N number?

Well, yes as my task is to keep things smooth, thus I try to always average by 2^n, but if you will, show me how to do division by a variable also.

AES said:
Are you stepping your code within the C view?

Yes, but I was not looking at assembly. I was looking at the Registers view. I expected to see operand 1 in the MPY register.

AES said:
Alternatively, you can open the assembler

Good idea, I have been assembler averse. I do not have a book, but I do use an RPN calculator (the other guy's HP-48), so I fully understand how to use a stack for calculation. I was thinking that if you can show me how to use assembler within a C program to perform these basic math functions, the example will be worth 1000 assembler books :)

Thanks again for your help!

0 Antonio Espirito-Santo over 13 years ago in reply to MAC Engineering

Expert 2855 points

Hi MAC

MAC Engineering said:
Thanks, sounds like you may have this on your computer and it works?

Yes, I have CCS 4.2, CCS 5.1, IAR, and mspgcc with Eclipse installed on my computer.

MAC Engineering said:
but if you will, show me how to do division by a variable also.

As you probably know, the MSP430 doesn't support division in hardware; this operation must be performed in software. A quick search on the internet turned up a very simple algorithm that you can try to implement in assembly. It probably isn't the most efficient, but it is a good starting point.
If you want to learn to program in assembly for the MSP430, why not implement this in assembly? Afterwards, you could compare the computation time required by your implementation with the time required by the division routine used by the C compiler.

MAC Engineering said:
I was thinking if you can show me how to use assembler within a C-program to perform these basic math functions

Inline assembly is one solution; the other solution is to mix C and assembler with the MSP430.

Best Regards,

AES

0 Jens-Michael Gross over 13 years ago in reply to MAC Engineering

Guru 227245 points

MAC Engineering said:
a) Example 1, if z=32, is assembler language more efficient than using a C-language shift statement?

No. If you indeed use ">>5" for the division, it will probably generate exactly the same code as if you'd done it in assembly language.
However, if you use /32, the compiler will most likely call the software division function. At least that's what MSPGCC 3.x did when I checked it some time ago. Maybe IAR or CCS (or the new MSPGCC4) can detect a division by a constant 2^x value and turn it into a shift automatically.

MAC Engineering said:
It seems that 'operand 1' should have the last value loaded into it remain until new values are loaded?

Yes. Operand 1 needs to be written only once, and the register address it is written to determines the type of operation. Operand 2 can then be written multiple times to carry out several operations in a row. This allows for fast multiply-accumulate operations where one factor is constant, or for quickly multiplying an array by the same value.
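
The "write operand 1 once, then stream operand 2" pattern could be sketched in C as below. This is an illustration only: it assumes the 16-bit multiplier register names `MPY`, `OP2`, `RESLO` and `RESHI` from the TI device headers, that `__MSP430__` is defined by the compiler, that no interrupt uses the multiplier during the loop, and that the compiler does not emit a multiplication of its own inside the loop (which would overwrite operand 1). The function name `scale_array` is hypothetical, and the plain-C branch is there only so the sketch can be checked off-target.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__MSP430__)
#include <msp430.h>   /* assumed to provide MPY, OP2, RESLO, RESHI */
#endif

/* Hypothetical helper: out[i] = k * in[i] for i = 0 .. n-1,
 * using unsigned 16x16 -> 32 products. */
static void scale_array(uint16_t k, const uint16_t *in, uint32_t *out, size_t n)
{
    size_t i;
#if defined(__MSP430__)
    MPY = k;                          /* operand 1: written once, selects unsigned multiply */
#endif
    for (i = 0; i < n; i++) {
#if defined(__MSP430__)
        OP2 = in[i];                  /* each write to operand 2 starts a new multiply */
        out[i] = ((uint32_t)RESHI << 16) | RESLO;
#else
        out[i] = (uint32_t)k * in[i]; /* off-target fallback */
#endif
    }
}
```

Checking the assembler view for this loop is a good way to confirm that operand 1 really is loaded only once on a given compiler and optimization level.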

0 Bernhard Weller over 13 years ago in reply to Jens-Michael Gross

Genius 4925 points

Jens-Michael Gross said:
Maybe IAR or CCS (or the new MSPGCC4) can detect a division by a constant 2^x value and turn it into a shift automatically.

volatile int a = PAIN / 32; // despite a being an int this is an unsigned int division
volatile int b = PAIN >> 5;

turned into:

MOV.W &PAIN+0,r15 ;
RPT #5 || RRUX.W r15 ;
MOV.W r15,0(SP) ;
MOV.W &PAIN+0,r15 ;
RPT #5 || RRUX.W r15 ;
MOV.W r15,2(SP) ;

So based on that I'd say yes, CCSv5.1 using optimization level 4 for speed will indeed detect a power of 2 division. (Which is a good thing, as /32 is just a bit easier to read than >> 5.)

This works only if you are indeed doing an unsigned int division (the fact that I did that is a bit concealed). If you use signed division, what you will get is this:

volatile int u = -64;

volatile int a = u / 32;
volatile int b = u >> 5;

-->

MOV.W #0xffc0,0x0000(SP)
MOV.W @SP,R12
MOV.W #0x0020,R13
CALLA #__divi
MOV.W R12,0x0002(SP)
MOV.W @SP,R15
RPT #5 RRAX.W R15
MOV.W R15,0x0004(SP)

And as you can see in that case the optimization fails (I'll post this over at the compiler forum, maybe they can do something about it). The shift correctly uses RRAX and the result in both cases is -2, but the __divi is of course taking "ages".

Edit: If I had thought about it just a bit more, the reason why __divi is called becomes clear: when a negative value between -31 and -1 is divided by 32, the expected result is 0, but what you get using the shift approach is -1.

But I guess checking whether the magnitude of the numerator is smaller than the divisor and returning 0 in that case, otherwise doing a shift by 5, is still faster than calling __divi.

Edit 2: As Archaeologist pointed out to me over at the Compiler forum, I've missed even more things (-33/32 = -1 but -33 >> 5 = -2), so it is probably all right to call __divi... I should just stop posting things if I notice that my brain is not doing what I think it should be doing...
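
The mismatch between `/ 32` and `>> 5` for negative values comes from rounding: C division rounds toward zero, while an arithmetic shift rounds toward negative infinity. A common workaround, sketched here as an illustration and not taken from the thread, is to add 31 to negative inputs before shifting. The function name `div32_shift` is hypothetical, and the sketch assumes that `>>` on a negative signed value is an arithmetic shift, which the C standard leaves implementation-defined (the `RRAX` instruction in the listing above behaves this way).

```c
#include <stdint.h>

/* Hypothetical helper: computes x / 32 with C's round-toward-zero
 * semantics using a shift instead of a division call.
 * Assumption: >> on a negative value is an arithmetic shift. */
static int16_t div32_shift(int16_t x)
{
    /* Negative inputs are biased by (divisor - 1) so that the
     * floor-style shift ends up rounding toward zero. */
    int16_t bias = (x < 0) ? 31 : 0;
    return (int16_t)((x + bias) >> 5);
}
```

With this bias, -33 gives -1 and -1 gives 0, matching `-33 / 32` and `-1 / 32`, while positive inputs are shifted unchanged.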

0 MAC Engineering over 13 years ago in reply to Bernhard Weller

Expert 1460 points

Thanks to all of you for the kind help,

Stay tuned, I just renewed my subscription to CCS, TI is mailing me something, I assume it is CCS v5.1? I can only guess, seems dumb to send me a card to make me feel good, but such is marketing.

When I load v5.1, I will post the difference in memory and speed between y = x/32 and y=x >> 5.

In addition I will test the optimization memory savings at level '5?'

If there is neither a speed nor a memory saving from using assembler, I question taking the time to learn it. I have other tasks that are more pressing now.

Thanks again,

MAC

0 Jens-Michael Gross over 13 years ago in reply to MAC Engineering

Guru 227245 points

MAC Engineering said:
If there is not a speed nor memory savings using assembler, I question taking the time to learn it.

Some things simply cannot be done in C because they are outside the scope of the C language.
For the GIE bit and the LPM modes (which reside in the status register, whose existence is unknown to the C language), there are compiler intrinsics ('fake' functions provided by the compiler), but some other things are simply impossible, because you don't have control over what the compiler will generate from your C code. Even a new compiler version can produce completely different binary code, breaking all the careful optimizations you did for timing-critical tasks.

However, in most cases you won't need assembly language.
