Rabbit (Stream Cipher)

old_cow_yellow

Other Parts Discussed in Thread: TMS320VC5509A

The Rabbit stream cipher is a relatively simple yet secure cipher, well suited to resource-limited devices. It was developed in 2003 by the Danish company Cryptico A/S (http://www.cryptico.com/). Numerous descriptions, analyses, and discussions of the algorithm can be found on the Web, for example "rabbit_p3.pdf" and "rabbit_p3source.zip" (http://www.ecrypt.eu.org/stream/p3ciphers/rabbit/).

I am very interested in porting Rabbit to the MSP430F5xx, which has a 32-bit hardware multiplier. I wonder if someone here is willing to join me in taking on this small task. (By the way, I do this just for fun. No commercial application is intended.) The reference source code cited above is written in ANSI C and is designed to work on CPUs of any word size, little- or big-endian alike.

I have already ported the most often used procedures, "static u32 RABBIT_g_func()" and "static void RABBIT_next_state()". I think during encryption and decryption, the CPU spends more than 90% of its time executing these two procedures. They are also used repeatedly during setup.

I am currently using KickStart; its 4K limit on C projects is more than adequate for Rabbit. I have no objection to moving over to CCE, but I will have to learn how to use it and I will only get the free version (I am cheap).

The g_func is called by next_state only, and the call appears only in one place inside a loop that is executed 8 times. Thus the optimizer of the compiler generates g_func inline as part of next_state. The total execution of next_state and g_func takes 1728 MCLK cycles. The total code size is 902 bytes. The stack used is 90 bytes.
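For readers who have not looked at the cipher: a minimal sketch of what the g function computes, following the published specification (RFC 4503). The 32-bit input is squared to a 64-bit value and the upper and lower halves are XORed. The names `g_wide`, `g_split`, and `g_selfcheck` are made up for illustration and are not the reference source. `g_split` shows one way to get the same result with only 32-bit arithmetic, which is closer to what a compiler has to do when no wide multiply is available.

```c
#include <assert.h>
#include <stdint.h>

/* g(x): square x to 64 bits, then XOR the high and low 32-bit halves. */
uint32_t g_wide(uint32_t x)
{
    uint64_t sq = (uint64_t)x * x;
    return (uint32_t)sq ^ (uint32_t)(sq >> 32);
}

/* Same result using only 32-bit arithmetic and the 16-bit halves of x. */
uint32_t g_split(uint32_t x)
{
    uint32_t a = x & 0xFFFFu;   /* low half  */
    uint32_t b = x >> 16;       /* high half */
    uint32_t hi = ((((a * a) >> 17) + a * b) >> 15) + b * b;
    uint32_t lo = x * x;        /* wraps modulo 2^32 */
    return hi ^ lo;
}

/* Compare both versions on a run of pseudo-random inputs. */
void g_selfcheck(void)
{
    uint32_t x = 1u;
    for (int i = 0; i < 1000; i++) {
        assert(g_wide(x) == g_split(x));
        x = x * 1664525u + 1013904223u;  /* simple LCG step */
    }
}
```

Either form needs the full 32x32 product, which is why the hardware multiplier matters so much for next_state.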

Actually, I also wrote an assembly version of next_state. It is "plug compatible" with the ANSI C version compiled with KickStart for F5xx. The execution takes 726 MCLK cycles. The code size is 490 bytes. The stack is 60 bytes.

So, anyone interested in a joint effort to finish porting the rest of Rabbit to MSP430? Challenging the F5xx implementation of next_state is welcome too!

How about the TI application engineers? This could make the F5xx look good!

-- OCY

over 16 years ago

0 nasser ramazani over 14 years ago

Prodigy 30 points

I implemented this algorithm on the TMS320VC5509A, written entirely in assembly. next_state takes 300 clock cycles!

If I can help you, send me an email.

nasser_ra21@yahoo.com

0 Jeff Tenney over 14 years ago

Guru 12160 points

I hadn't heard of Rabbit before this post, but I like the idea behind it.

old_cow_yellow said:

The reference source code cited above is written in ANSI C and is designed to work on CPUs of any word size, little- or big-endian alike.

If the code is written to be generic in this way, can you explain a little bit more about the effort to "port" the code? Is it mostly about optimizing things for a 16-bit machine with a 32-bit hardware multiplier? Or something else perhaps?

By the way, the following improvement is pretty impressive:

ORIGINAL C: 1728 MCLK cycles; 902 bytes CODE; 90 bytes STACK

OCY ASM: 726 MCLK cycles; 490 bytes CODE; 60 bytes STACK

I'm not sure I'd challenge that performance, but it really is fun to write efficient and elegant assembly language where there is a payoff. If 90% of the time is in this code, then there's a huge payoff. Nice work.

Jeff

0 old_cow_yellow over 14 years ago in reply to nasser ramazani

Guru 58965 points

Nasser,

Good to hear that you implemented it with tms320.

Did you try the test-vectors in the file test-vectors.txt inside rabbit_p3source.zip? I found that some of them are incorrect. See:
http://www.ecrypt.eu.org/stream/phorum/read.php?1,1274

--OCY

0 old_cow_yellow over 14 years ago in reply to Jeff Tenney

Guru 58965 points

Jeff,

I used the word port too loosely.

First of all, the reference source code I cited did not compile as-is with IAR. I had to make some trivial changes to make it compile. After that I had to test that it actually encrypts and decrypts correctly. I loosely called those efforts porting to MSP430.

The reference source code I cited is just a procedure to encrypt/decrypt one block. To use it in a real application, a lot more needs to be done. (Design a set of efficient library routines with an easy-to-use application interface, etc.) I also loosely called that porting to MSP430.

The IAR compiler for MSP430 generates code compatible with the 16-bit hardware multiplier of other MSP430s. It did not take advantage of the 32-bit hardware multiplier. When 32-bit multiplication is used heavily, it hurts.

-- OCY

0 Jeff Tenney over 14 years ago in reply to old_cow_yellow

Guru 12160 points

Cool. That helps me understand the effort a little better.

Did you figure out how to convince IAR to generate code that takes advantage of the 32-bit multiplier? It will do so if you give it the right settings.

Unfortunately I don't have time to contribute to this effort but I hope you and any other contributors will keep us updated on progress.

Jeff

0 Jens-Michael Gross over 14 years ago in reply to Jeff Tenney

Guru 227245 points

Jeff Tenney said:
Did you figure out how to convince IAR to generate code that takes advantage of the 32-bit multiplier

Even if it would, it wouldn't. Sounds odd, but is unfortunately true. It is part of the C language definition.

If you do a 16x16 multiplication, you'll get a 16-bit result. To get a 32-bit result (which a 16x16 multiplication can easily need), you'll have to cast one of the operands to 32 bits. But then the operation will unnecessarily be a 32x32 multiplication.

The MSP's HWM can easily do a 16x16->32 multiplication (and the 32-bit HWM can do 32x32->64), but the upper half of the result is always discarded due to the C language specification.
So as long as you code in C, the HWM cannot be used as efficiently as it could.
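The point about casting can be shown with a short sketch. It assumes a target where `int` is 16 bits (as on MSP430), so a `uint16_t * uint16_t` product keeps only its low 16 bits. The function names are illustrative. To make the example behave the same on a desktop compiler, the truncation is written out explicitly.

```c
#include <assert.h>
#include <stdint.h>

/* What `a * b` yields when int is 16 bits: only the low 16 bits survive. */
uint16_t mul16_truncated(uint16_t a, uint16_t b)
{
    return (uint16_t)((uint32_t)a * b);
}

/* Widen one operand first so the full 32-bit product is kept.
   In C this is formally a 32x32 multiplication. */
uint32_t mul16_full(uint16_t a, uint16_t b)
{
    return (uint32_t)a * b;
}
```

For example, 300 * 300 is 90000 with the cast, but 24464 (90000 modulo 65536) without it.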

For my own projects, I wrote some hand-crafted inline macros with assembly code that make optimum use of the HWM depending on the required operand and result size. But this means not using the "*" operator at all. It's a speedup factor of 2 to 10 compared to the built-in usage of the HWM.

Unfortunately, these macros work on MSPGCC only. And belong to my employer :)

0 old_cow_yellow over 14 years ago in reply to Jens-Michael Gross

Guru 58965 points

Jens-Michael Gross said:

Did you figure out how to convince IAR to generate code that takes advantage of the 32-bit multiplier
Even if it would, it wouldn't. Sounds odd, but is unfortunately true. It is part of the C language definition. ...

In the case of a 32-bit by 32-bit multiply, it could help. Yes, C requires it to discard the upper 32 bits of the result and return only the lower 32 bits. That takes a single pass through the 32-bit HWM. But currently, the IAR compiler chops both 32-bit operands into two 16-bit halves, performs 3 separate 16-bit multiplications with the 32-bit HWM in 16-bit mode, and combines the 3 results to produce a 32-bit result.
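The three-multiplication decomposition described above can be sketched as follows. This only illustrates the arithmetic and is not the code IAR actually emits; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of a*b from three 16x16 products. The a_hi*b_hi term
   only affects bits 32 and up, so it is never computed. */
uint32_t mul32_lo_three(uint32_t a, uint32_t b)
{
    uint16_t a_lo = (uint16_t)a, a_hi = (uint16_t)(a >> 16);
    uint16_t b_lo = (uint16_t)b, b_hi = (uint16_t)(b >> 16);

    uint32_t low   = (uint32_t)a_lo * b_lo;   /* full 32-bit product      */
    uint32_t cross = (uint32_t)a_lo * b_hi    /* only the low 16 bits of  */
                   + (uint32_t)a_hi * b_lo;   /* the cross terms matter   */
    return low + (cross << 16);
}
```

With a true 32x32 hardware multiply, the same result is a single operation followed by reading the low result words.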

The greatest common denominator of a great many numbers is usually a very very small number.

0 Jeff Tenney over 14 years ago in reply to old_cow_yellow

Guru 12160 points

OCY,

My IAR-built code uses the 32-bit multiplier fully, not just as a 16-bit multiplier. By chance do you build using the command line instead of the IDE? I have two relevant settings in my command-line parameters to icc430.exe:

--multiplier=32 --multiplier_location=xyz

I just ran a quick test to make sure the IDE does it right as long as you specify a properly equipped MSP430 as the target processor, and it does.

I know IAR can do this right, but I'm not sure why your IAR isn't doing it right.

JMG,

I don't mind 16x16=32 being done as 32x32=32 when I have a 32-bit multiplier. Just like I don't mind 8x8=16 being done as 16x16=16 when I have either hw multiplier.

However, the one that hurts is 32x32=64 being done as 64x64=64 because all I have is a 32-bit multiplier. I wrote little inlines (like you) for this case.

Jeff

0 old_cow_yellow over 14 years ago in reply to Jeff Tenney

Guru 58965 points

Jeff,

Thank you for pointing that out. I do not use C. My original posting was a few years old. At that time, I did look at what IAR C compiler was doing.

--OCY

0 Jens-Michael Gross over 14 years ago in reply to Jeff Tenney

Guru 227245 points

Jeff Tenney said:
I don't mind 16x16=32 being done as 32x32=32 when I have a 32-bit multiplier.

You should. It is clobbering two more registers unnecessarily for the parameter passing. :)
And of course you have to cast one of the 16 bit parameters manually to 32 bit or the compiler will do just a 16x16=16 multiplication. It's something many people simply forget (with wrong results then). It even still happens to me sometimes.

Jeff Tenney said:
However, the one that hurts is 32x32=64 being done as 64x64=64 because all I have is a 32-bit multiplier.

The original MPY16 (that's the 'correct' TI naming) code in MSPGCC (it did not support the MPY32) for 64-bit multiplication (the GCC long long type) was not using the MPY at all. And the 32x32 was done as three 16x16 multiplications. Only 16x16 and below was done inline.
At least they had optimized the calling conventions to these multiplication functions, so there was not too much register clobbering.

I wonder how you do a 32x32->64 multiplication (or 64x64->64) at all if there is no long long datatype. I mean in plain C with the standard operator. Doing four 32x32->32 multiplications with helper 32bit variables? I guess so. *shudder*. I really love the MPY32.
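One way to do it in plain C with only 32-bit variables is to use four 16x16->32 partial products (rather than four 32x32->32) and recombine them with explicit carries. A minimal sketch follows. The function name and the hi/lo out-parameters are illustrative and not taken from any compiler's runtime library.

```c
#include <assert.h>
#include <stdint.h>

/* 32x32 -> 64 using only 32-bit variables and four 16x16 -> 32 products.
   The result is returned as two 32-bit halves. */
void mul32x32_64(uint32_t a, uint32_t b, uint32_t *hi, uint32_t *lo)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    uint32_t p0 = a_lo * b_lo;   /* weight 2^0  */
    uint32_t p1 = a_lo * b_hi;   /* weight 2^16 */
    uint32_t p2 = a_hi * b_lo;   /* weight 2^16 */
    uint32_t p3 = a_hi * b_hi;   /* weight 2^32 */

    /* Middle column: a sum of three 16-bit values, so it cannot overflow. */
    uint32_t mid = (p0 >> 16) + (p1 & 0xFFFFu) + (p2 & 0xFFFFu);

    *lo = (p0 & 0xFFFFu) | (mid << 16);
    *hi = p3 + (p1 >> 16) + (p2 >> 16) + (mid >> 16);
}
```

This is the bookkeeping that a 32x32->64 hardware multiply makes unnecessary.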

0 Jeff Tenney over 14 years ago in reply to Jens-Michael Gross

Guru 12160 points

Jens-Michael Gross said:

It is clobbering two more registers unnecessarily for the parameter passing.

Yeah, it's not ideal. I just decided to let it go. One of those things where you pick your battles. I don't have any applications where I multiply under real-time constraints.

Jens-Michael Gross said:

I wonder how you do a 32x32->64 multiplication (or 64x64->64) at all if there is no long long datatype.

Ugh, that is a terrible thought. I have never used a compiler without support for 64-bit integers. They often save me from using floating point in intermediate calculations.

Jeff

0 Jens-Michael Gross over 14 years ago in reply to Jeff Tenney

Guru 227245 points

Floating point is just another battlefield. In our energy metering server (Linux with Tomcat and Java backend), we just had a case where we reached the end of the floating point resolution. x+1 resulted in x again, causing the data acquisition to stop dead. Okay, the backend had been running almost two years without a reboot, but still, nobody would even imagine that something like this could ever happen (actually nobody did, hence the crash).
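The x+1 == x effect is easy to reproduce. Below is a minimal sketch in C, assuming IEEE 754 single precision (24-bit significand). The post does not say whether the backend value was a float or a double; for a double the same thing happens at 2^53. The function name is illustrative.

```c
#include <assert.h>

/* Returns 1 if adding 1 to x no longer changes it in single precision. */
int float_increment_stalls(float x)
{
    volatile float next = x + 1.0f;  /* the store forces rounding to float */
    return next == x;
}
```

At 16777216.0f (2^24) the increment is lost to rounding, while one step earlier, at 16777215.0f, it still works.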

0 old_cow_yellow over 14 years ago in reply to old_cow_yellow

Guru 58965 points

old_cow_yellow said:

Jeff,

Thank you for pointing that out. I do not use C. My original posting was a few years old. At that time, I did look at what IAR C compiler was doing.

--OCY

I tried with the IAR KickStart that I downloaded yesterday. Yes, it did generate code to use the 32-bit*32-bit=>64-bit HWM. But the code size and execution cycles are still very close to what I reported 3 years ago.

I do not use CCS or gcc. Does anybody want to try?
