Hi
I am porting the RSA public-key code from PolarSSL (http://polarssl.org/) to the C6748. It works correctly, but signing and decryption are very slow, taking roughly 1.5 s per operation. At optimization level 1 it still works, but the speedup is small. At level 3 the results are wrong. I added logging and found the part that produces incorrect values, but I don't know the cause or how to fix it. The relevant files are:
https://www.dropbox.com/s/6vm4edlfnogun99/bn_mul.h
https://www.dropbox.com/s/tr4nkyd27sjr6zr/bignum.h
https://www.dropbox.com/s/edaqs027jxd3h0y/bignum.c
In bignum.c, the function mpi_montmul computes the Montgomery multiplication A = A * B * R^-1 mod N. At optimization level 2 or 3 the values of u1 are incorrect; these values are produced by mpi_mul_hlp. Looking at mpi_mul_hlp in more detail, it uses the macros MULADDC_INIT, MULADDC_CORE and MULADDC_STOP, which have optimized implementations for specific architectures such as Intel i386, AMD, ARM v3, Alpha, ..., but not for the C6748, so the default C code below is used:
#define MULADDC_INIT                    \
{                                       \
    t_uint s0, s1, b0, b1;              \
    t_uint r0, r1, rx, ry;              \
    b0 = ( b << biH ) >> biH;           \
    b1 = ( b >> biH );

#define MULADDC_CORE                    \
    s0 = ( *s << biH ) >> biH;          \
    s1 = ( *s >> biH ); s++;            \
    rx = s0 * b1; r0 = s0 * b0;         \
    ry = s1 * b0; r1 = s1 * b1;         \
    r1 += ( rx >> biH );                \
    r1 += ( ry >> biH );                \
    rx <<= biH; ry <<= biH;             \
    r0 += rx; r1 += (r0 < rx);          \
    r0 += ry; r1 += (r0 < ry);          \
    r0 += c; r1 += (r0 < c);            \
    r0 += *d; r1 += (r0 < *d);          \
    c = r1; *(d++) = r0;

#define MULADDC_STOP                    \
}
So, in summary, I would appreciate help with two questions:
1. Why does the code produce wrong results at optimization level 2 or higher?
2. How can MULADDC_INIT / MULADDC_CORE be implemented specifically for the C6748? This part takes most of the time in RSA decryption and signing, so a C6748-specific implementation could improve performance a lot.
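For question 2, here is a rough sketch of the direction I have in mind. I have not tested it on the C6748. It assumes that t_uint is exactly 32 bits and that the compiler provides a 64-bit unsigned type, so that one 32x32 -> 64 multiply can replace the four half-word multiplies. The function names (muladdc_half, muladdc_wide, muladdc_selfcheck) and the typedefs are only for illustration and are not taken from PolarSSL. The half-word version repeats the arithmetic of the macros above, so the two can be compared on the same input, which may also help to narrow down question 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t t_uint;  /* limb type; assumed to be exactly 32 bits */
typedef uint64_t t_udbl;  /* double-width type; assumes 64-bit integer support */

#define biL 32            /* bits in one limb */
#define biH 16            /* bits in half a limb */

/* Half-word version: same arithmetic as the generic macros quoted above.
 * Computes d[i] += s[i] * b with carry c, and returns the final carry. */
static t_uint muladdc_half(size_t n, const t_uint *s, t_uint *d,
                           t_uint b, t_uint c)
{
    t_uint s0, s1, b0, b1, r0, r1, rx, ry;

    b0 = (b << biH) >> biH;
    b1 = b >> biH;
    while (n--) {
        s0 = (*s << biH) >> biH;
        s1 = *s >> biH; s++;
        rx = s0 * b1; r0 = s0 * b0;
        ry = s1 * b0; r1 = s1 * b1;
        r1 += rx >> biH;
        r1 += ry >> biH;
        rx <<= biH; ry <<= biH;
        r0 += rx; r1 += (r0 < rx);
        r0 += ry; r1 += (r0 < ry);
        r0 += c;  r1 += (r0 < c);
        r0 += *d; r1 += (r0 < *d);
        c = r1; *d++ = r0;
    }
    return c;
}

/* Wide version: one 32x32 -> 64 multiply per limb.
 * s*b + c + d is at most (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1,
 * so the 64-bit sum cannot overflow. */
static t_uint muladdc_wide(size_t n, const t_uint *s, t_uint *d,
                           t_uint b, t_uint c)
{
    while (n--) {
        t_udbl r = (t_udbl)(*s++) * b + c + *d;
        *d++ = (t_uint)r;            /* low 32 bits are the result limb */
        c = (t_uint)(r >> biL);      /* high 32 bits are the next carry */
    }
    return c;
}

/* Run both versions on the same input and check that they agree. */
static void muladdc_selfcheck(void)
{
    const t_uint s[4] = { 0xFFFFFFFFu, 0x00000001u, 0x89ABCDEFu, 0x00000000u };
    t_uint d1[4] = { 0xFFFFFFFFu, 0xFFFFFFFFu, 0x12345678u, 0x80000000u };
    t_uint d2[4];
    t_uint c1, c2;

    memcpy(d2, d1, sizeof d1);
    c1 = muladdc_half(4, s, d1, 0xFFFFFFFFu, 0x00000007u);
    c2 = muladdc_wide(4, s, d2, 0xFFFFFFFFu, 0x00000007u);
    assert(c1 == c2);
    assert(memcmp(d1, d2, sizeof d1) == 0);
}
```

If the two versions agree at the low optimization level but muladdc_half disagrees with muladdc_wide at level 2 or 3, that would point at how the half-word code is being compiled rather than at the Montgomery logic around it. Whether the wide version is actually faster on the C6748 depends on how the compiler handles the 64-bit multiply, which I have not measured.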
Thank you very much!
Long