Fastest 8bit x 8bit multiply ever made for value line msp430 (probably)

Tony Philipsson

Guru 12050 points

Load R12 and R13 with the two numbers you want to multiply; each can be any value from 0 to 255.
The routine returns a 16-bit result in R13.
Every instruction is a single-word type, and it takes 27 to 35 cycles (plus 3 for the ret).
I don't have to clear C before each rrc: if the jnc was taken, C was of course already clear,
and if the add.w is executed it always clears C too (the sum can never overflow 16 bits).
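
The 27-to-35 figure can be tallied per instruction. Here is a rough C sketch of that count. It assumes 1 cycle for each register-mode instruction and 2 cycles for every jnc, whether taken or not. The function name multiply_cycles is made up for illustration.

```c
#include <stdint.h>

/* Cycle tally for the routine in this post, excluding call and ret.
 * ASSUMED costs: swpb, clrc, rrc.w and add.w on registers take 1 cycle,
 * and every jnc takes 2 cycles whether or not it jumps. */
unsigned multiply_cycles(uint8_t multiplier)      /* the value placed in R13 */
{
    unsigned cycles = 3;                 /* swpb + clrc + rrc.w on R12 */
    for (int bit = 0; bit < 8; bit++) {
        cycles += 1 + 2;                 /* rrc.w R13 + jnc */
        if (multiplier & (1u << bit))
            cycles += 1;                 /* add.w runs only for a set bit */
    }
    return cycles;
}
```

A multiplier of 0 gives the 27-cycle best case and 255 gives the 35-cycle worst case; the multiplicand in R12 does not affect the count.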

I'd like to see you C programmers turn this into a function call and post it here.

Example     mov.b  #240,r12
            mov.b  #221,r13
            call   #multiply
            bis.w  #CPUOFF+GIE,SR 
;--------------------------------
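multiply    swpb   r12              ; r12 = multiplicand << 8
            clrc
            rrc.w  r12              ; r12 = multiplicand << 7, C = 0
_bit0       rrc.w  r13              ; C = next multiplier bit
            jnc    _bit1            ; bit clear: skip the add
            add.w  r12,r13          ; bit set: add, never carries out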
_bit1       rrc.w  r13
            jnc    _bit2
            add.w  r12,r13
_bit2       rrc.w  r13
            jnc    _bit3
            add.w  r12,r13
_bit3       rrc.w  r13
            jnc    _bit4
            add.w  r12,r13
_bit4       rrc.w  r13
            jnc    _bit5
            add.w  r12,r13
_bit5       rrc.w  r13
            jnc    _bit6
            add.w  r12,r13
_bit6       rrc.w  r13
            jnc    _bit7
            add.w  r12,r13
_bit7       rrc.w  r13
            jnc    _bitend
            add.w  r12,r13
_bitend     ret
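
For the C programmers: this is not a C-callable wrapper (that depends on the compiler's calling convention), only a host-side sketch of what the routine computes. The names multiply8x8 and multiply8x8_mismatches are invented here, and the loop stands in for the eight unrolled steps.

```c
#include <stdint.h>

/* Hypothetical C model of the assembly routine; r12/r13 mirror the registers. */
uint16_t multiply8x8(uint8_t a, uint8_t b)
{
    uint16_t r12 = (uint16_t)(a << 7);   /* swpb, clrc, rrc.w: a * 128 */
    uint16_t r13 = b;                    /* multiplier sits in the low byte */

    for (int step = 0; step < 8; step++) {
        unsigned carry = r13 & 1u;       /* rrc.w moves the low bit into C */
        r13 >>= 1;                       /* C was 0 on entry, so bit 15 gets 0 */
        if (carry)                       /* jnc skips the add when C is clear */
            r13 = (uint16_t)(r13 + r12); /* <= 32767 + 32640: no carry out */
    }
    return r13;                          /* a * b */
}

/* Counts the input pairs for which the model disagrees with a * b. */
unsigned long multiply8x8_mismatches(void)
{
    unsigned long bad = 0;
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++)
            if (multiply8x8((uint8_t)a, (uint8_t)b) != a * b)
                bad++;
    return bad;
}
```

With the operands from the example, multiply8x8(240, 221) gives 53040. The bound in the add comment is why no clrc is needed between steps.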

over 11 years ago

0 Tony Philipsson over 11 years ago

Guru 12050 points

I was able to change the eight jnc to addc = 8 cycles saved
subtract one for the added inv = now only 20 to 28 cycles!!
It now also automatically ANDs the incoming values with 0xFF, at no extra cycle cost.

Adding #1 to the program counter may cause a reset on some MSP families(?)

multiply2   inv.b R13    ; invert and clear bits 8-F
            add.b #0,R12 ; clear C and clear bits 8-F
            swpb   R12
            rrc.w r12
_bit0       rrc.w r13
            addc.w #1,PC
            add.w r12,r13
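
A host-side C sketch of the idea, for anyone who wants to step through it. It ASSUMES bit 0 of PC always reads as zero and that the remaining seven steps repeat the _bit0 pattern. The names addc_one_to_pc and multiply2_model are made up for illustration.

```c
#include <stdint.h>

/* Models "addc.w #1,PC", ASSUMING bit 0 of PC is fixed to zero.
 * pc_next is the address of the instruction that follows the addc. */
uint16_t addc_one_to_pc(uint16_t pc_next, unsigned carry)
{
    return (uint16_t)((pc_next + 1u + carry) & 0xFFFEu);
}

/* Hypothetical model of multiply2. The loop stands in for eight repeated
 * steps (an assumption: only the first step appears in the listing). */
uint16_t multiply2_model(uint16_t r12, uint16_t r13)
{
    r13 = (uint8_t)~r13;                      /* inv.b: invert, clear bits 8-F */
    r12 = (uint16_t)((r12 & 0xFFu) << 7);     /* add.b #0, swpb, rrc.w */

    for (int step = 0; step < 8; step++) {
        unsigned carry = r13 & 1u;            /* rrc.w r13: C = inverted bit */
        r13 >>= 1;
        /* Address 0 stands for the add.w that follows the addc. */
        if (addc_one_to_pc(0, carry) == 0)    /* PC unchanged: add.w runs */
            r13 = (uint16_t)(r13 + r12);
        /* Otherwise PC moved on by 2, past the one-word add.w. */
    }
    return r13;
}
```

Because the multiplier is inverted first, a set carry means the original bit was 0, so skipping the add is the right action. The model says nothing about cycle counts.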

0 Jens-Michael Gross over 11 years ago in reply to Tony Philipsson

Guru 227245 points

Looks great. Where’s the ‘like’ button? :)

Are you sure that addc.w is faster than jnc? If PC is the destination, the instruction needs an extra cycle, because the next instruction fetch has to be delayed until the operation is complete. For all other registers, the register update and the fetch of the next instruction can be done in parallel.
Nice idea: adding #1 is ignored because it only changes the LSB, while #1+C adds 2 and really does advance PC.
But I dimly remember that the state of PC after fetching an instruction differs between MSPs (was it an MSP/MSPX core difference?). I'm just relying on faint memories of an errata sheet entry here.

0 zrno soli over 11 years ago in reply to Tony Philipsson

Guru 34483 points

As JMG already noted: per the MSP430x5xx family datasheet, the emulated instruction br RX (mov.w RX, PC) executes in 3 cycles. I measured it (on MSP430x5xx, not MSP430x2xx). The instruction add.w #1, PC also executes in 3 cycles.

0 Tony Philipsson over 11 years ago in reply to zrno soli

Guru 12050 points

Yes, it was IAR Workbench that simulated it wrong.

IAR stops on PC+1, as it does not emulate that bit 0 of PC is always fixed to zero.
So I did a multiplication run on 0, as then every addc will do +2.
IAR calculated the cycles needed for PC+2 wrong, as it thinks it's like any other R+2.

Instead I left a slow ACLK running while it calculated 65536 multiplications with both JNC and ADDC.W;
when the breakpoint hit on a real G2553, TA0R showed the same amount of time in both trials.

So the top post is the correct one to use and is still very fast.

Is the fetched instruction stored somewhere writable? Then subtracting C from it could change R12 into R11 and emulate a NOP.

0 Roberto Romano over 10 years ago in reply to Tony Philipsson

Guru 27285 points

Tony Philipsson said:
IAR stops on PC+1, as it does not emulate that bit 0 of PC is always fixed to zero.
So I did a multiplication run on 0, as then every addc will do +2.

Hi Tony, I was redirected here from your post. Adding 1 to PC (or SP) on a word machine results in an odd address, which is a fault.

After this the processor stops with a reset; illegal instruction and illegal address are also mapped to the reset vector.

0 Tony Philipsson over 10 years ago in reply to Roberto Romano

Guru 12050 points

>result in odd address and is a fault

Most MSP430s (later models) have a 0 hardwired into bit 0 of PC.
You cannot make it an odd number even if you try.

But IAR does not emulate that correctly.

IAR also incorrectly calculated that addc #1,PC takes 1 cycle,
but it takes two, just like JNC, so there was no cycle improvement.

0 Roberto Romano over 10 years ago in reply to Tony Philipsson

Guru 27285 points

Tony Philipsson said:
IAR also incorrectly calculated that addc #1,PC takes 1 cycle,
but it takes two, just like JNC, so there was no cycle improvement.

This is true, but the simulator (and not only the simulator) maps the ALU result to a register, and PC and SP are no different from the other general-purpose registers. So if the register has a bit 0, it gets set, and consequently this can be a problem on some silicon revisions, or why not on an IP core too; see here:

http://opencores.org/project,openmsp430

PC and SP are 16 bits wide and are neither truncated nor have bit 0 reset.

Anyway, I think this unrolled macro code can be a faster 8x8 multiply than the usual looping version, though surely a loop with the test rearranged could be shorter too?

0 old_cow_yellow over 10 years ago in reply to Tony Philipsson

Guru 58965 points

Tony Philipsson said:
I was able to change the eight jnc to addc = 8 cycles saved

Due to the pre-fetch pipeline of the MSP CPU (or CPUX), it is not possible to execute any jump/branch (conditional or otherwise) in 1 MCLK cycle.
