Accessing 32 most-significant bits from multiplication

Hi,

Sorry for asking a simple question. My first question is whether C67x DSPs can store the result of a 32-bit * 32-bit multiplication in a 64-bit register. I ask because the first page of the datasheet mentions that they can do two 32-bit * 32-bit multiplications in one clock cycle.

If yes, I don't know how to get the 32 MSBs. I am multiplying two Uint32 variables and I am getting only the 32 least-significant bits.

I have another question. I want to use fixed-point computations. I understand that whether the chip (and compiler) uses fixed-point or floating-point arithmetic depends on how the variables are defined: if a variable is a float, it is processed with floating-point arithmetic. I want to make sure that the processor (and compiler) treats a Uint32 variable and its computations as fixed-point arithmetic.

Thanks,

Vala

  • Some guesses. Compiler optimization has significant effects on what gets done. See spru187.

    #include <intrinsics.h>
    
    unsigned int x; // 32 bit
    unsigned int y; // 32 bit
    unsigned long long z; // 64 bit
    unsigned int i; // 32 bit
    
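    // Cast one operand so the multiply is done in 64 bits (64x64 to 64).
    // Plain x * y would be a 32-bit multiply, truncated before widening to z.
    z = (unsigned long long)x * y;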
    
    // Do 32x32 to 64 multiply
    z = _mpy32u(x, y);
    
    z >>= 32; // Shift down upper 32 bits, could also use intrinsic _hill()
    i = (unsigned int)z; // Truncate to 32 bits.

    To avoid using floating point, make sure all variables and constants are not "float" or "double".

  • Vala,

    Asking very detailed questions like this without the context behind it will usually not end up with you getting where you need to get. Please explain what you are trying to do, whether it is to implement a specific algorithm or study the Instruction Set Architecture of the device, or whatever your goal is.

    To learn about all the instructions and what can be stored where, please refer to the CPU & Instruction Set Reference Manual for the DSP that you are interested in. "C67x" is not specific enough to recommend a particular document; the ISA varies as the processor architectures have matured and advanced over the years. The generations are generally compatible from one to the next, but not the other way around.

    To learn how C code maps onto the selected C67x ISA, please refer to the latest Optimizing C Compiler User's Guide for the C67x DSP that you are interested in. It contains a list of available intrinsic functions, like the _mpy32u that Norman uses above, that give you access to features that are not easily or normally reached from plain C code.

    The second half of your question is about how to write a C program. The underlying architecture of a device does not matter when you are writing C, at least not in terms of bit-exact results. Any valid C program that does not run into undefined conditions, like signed overflow, will generate the same results. In C, the result of int32 x int32 is the lower 32 bits of the product. If the product is too large to fit in a signed 32-bit integer, the result is undefined, but it will usually be the lower 32 bits of the 64-bit result.

    Regards,
    RandyP
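
The point above about the lower 32 bits can be illustrated in portable C. This is only a minimal sketch, not TI-specific code: the helper names `mul_low32` and `mul_full64` are made up for illustration, and it assumes a C99 compiler with `<stdint.h>` and a 32-bit `int`. Unlike the signed case, unsigned overflow is fully defined and wraps modulo 2^32.

```c
#include <stdint.h>

/* Hypothetical helper: multiply carried out in 32 bits.
 * Unsigned arithmetic wraps modulo 2^32, so only the low 32 bits survive. */
uint32_t mul_low32(uint32_t x, uint32_t y)
{
    return x * y;
}

/* Hypothetical helper: widen one operand first, so the multiply itself
 * is carried out in 64 bits and no product bits are lost. */
uint64_t mul_full64(uint32_t x, uint32_t y)
{
    return (uint64_t)x * y;
}
```

For example, with x = 0x80000000 and y = 4 the 32-bit version returns 0, while the widened version returns 0x200000000.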

  • Hi Norman Wong,

    Thanks for your answer. It was a very good hint, but I ran into a problem. I used the basic * operator with those Uint32 variables x and y, as z = x * y; where z is an unsigned long long. I saw that it gives the correct answer only if x and y are assigned to 64-bit variables before the multiplication.

    As I used a signed representation I used z>>=30;

    Regards

  • Needing a 64-bit operand is actually what C specifies. The usual arithmetic conversions look only at the operands of the multiplication, not at the type of z on the left of the assignment, so with two 32-bit operands x * y is a 32x32 to 32 bit multiply and the upper bits are lost before the result is widened. Once an operand is 64 bits, the compiler should generate a call to a 64x64 to 64 bit multiply function because it cannot assume x and y fit in 32 bits. I am guessing that such a 64x64 multiply would consist of possibly four 32x32 to 64 bit multiplications. The intrinsic _mpy32u() should result in one 32x32 to 64 bit multiply. It just would not be portable code.
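
The guess about a 64x64 multiply being built from 32x32 to 64 bit pieces can be sketched in plain C. This is an illustration under stated assumptions, not what the TI runtime library actually does: `mul32x32` and `mul64_from_halves` are hypothetical names. Note that the fourth partial product (high half times high half) lands entirely above bit 63, so a result truncated to 64 bits needs only three of the four.

```c
#include <stdint.h>

/* Hypothetical helper: one 32x32 -> 64-bit unsigned multiply. */
static uint64_t mul32x32(uint32_t a, uint32_t b)
{
    return (uint64_t)a * b;
}

/* Hypothetical sketch: 64x64 -> 64-bit multiply from 32-bit halves. */
uint64_t mul64_from_halves(uint64_t x, uint64_t y)
{
    uint32_t xl = (uint32_t)x, xh = (uint32_t)(x >> 32);
    uint32_t yl = (uint32_t)y, yh = (uint32_t)(y >> 32);

    uint64_t low = mul32x32(xl, yl);
    /* The sum may wrap modulo 2^64, which is harmless: after the shift
     * below only its low 32 bits contribute to the 64-bit result. */
    uint64_t cross = mul32x32(xl, yh) + mul32x32(xh, yl);

    /* xh * yh would be scaled by 2^64, so it is dropped entirely. */
    return low + (cross << 32);
}
```

The result should match the compiler's own `x * y` on 64-bit operands for any inputs.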

  • Hi RandyP,

    Appreciate your complete answer.

    I went through the 'Optimizing Compiler' User's Guide and the 'CPU and Instruction' description, and I have a couple of questions. What exactly does this sentence mean? 'Fixed-Point multiply supports two 32*32 Bit multiplies per clock cycle'. Does it mean that the LSBs are produced in one cycle, or the shifted bits, or the whole 64 bits?

    MPY32 does the Multiply Signed 32-Bit × Signed 32-Bit Into 32-Bit Result (LSBs), which I thought was the simplest form of multiplication. Even this one has 3 delay slots. Why is that?

    About MPY32 (_mpy32ll), can two of them be executed in parallel? Some information from 'CPU and Instruction' file:
    Unit: .M1 or .M2 (Does it mean if needed each of them can handle a MPY32 instruction?)
    Delay Slots: 3

    Thanks.

    Regards

  • Yes, this is exactly what I thought: what is actually happening is a 64-bit * 64-bit multiplication. I'm going to try _mpy32u() right now.

    Thanks a lot.

  • Norman,

    _mpy32u() works perfectly. My code was:

    signed int NN[501];
    unsigned long long z;
    
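    z = _mpy32ll(NN[i], 0x20000000); // signed 32x32 to 64 multiply; 0x20000000 is 2^29
    z >>= 30;                        // shift the 64-bit product down by 30 bits
    buffer2 = (unsigned int)z;       // keep the low 32 bits of the shifted product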

    Also the sign extension was done correctly.
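
For comparison, the same scaled multiply can be sketched in portable C. This is a hedged illustration, not a drop-in replacement for the code above: `q30_mul` is a hypothetical name, and it assumes both operands are signed values scaled by 2^30, which is how the shift by 30 reads. Under that assumption 0x20000000 (2^29) represents 0.5.

```c
#include <stdint.h>

/* Hypothetical helper: multiply two signed values scaled by 2^30
 * (assumed format) and return the result in the same scale. */
int32_t q30_mul(int32_t a, int32_t b)
{
    /* Widen first so the full 64-bit signed product is kept. */
    int64_t product = (int64_t)a * b;

    /* Right-shifting a negative value is implementation-defined in C.
     * This sketch assumes an arithmetic (sign-preserving) shift. */
    return (int32_t)(product >> 30);
}
```

With an arithmetic shift the result rounds toward negative infinity, so 3 * 0.5 gives 1 and -3 * 0.5 gives -2 in raw units. The caller is responsible for making sure the shifted product fits in 32 bits.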

     

  • You don't need intrinsics in this case: ([unsigned] long long)a*b in place of _mpy32* does just as well. The advantage is that the code doesn't have to be TI compiler-specific and can be reused on other platforms.
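
A minimal sketch of that cast-based approach, applied to the original question of getting the 32 most-significant bits. The names `mulhi_u32` and `mulhi_s32` are hypothetical, and the code assumes a C99 compiler with `<stdint.h>`. Whether a given compiler turns the cast into a single 32x32 to 64 bit multiply instruction is not guaranteed here and would need to be checked in the generated assembly.

```c
#include <stdint.h>

/* Hypothetical helper: upper 32 bits of an unsigned 32x32 product. */
uint32_t mulhi_u32(uint32_t a, uint32_t b)
{
    /* The cast widens a, so the multiply is done in 64 bits. */
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Hypothetical helper: upper 32 bits of a signed 32x32 product. */
int32_t mulhi_s32(int32_t a, int32_t b)
{
    int64_t product = (int64_t)a * b;

    /* Assumes an arithmetic right shift for negative values, which is
     * implementation-defined in C. */
    return (int32_t)(product >> 32);
}
```

For instance, `mulhi_u32(0xFFFFFFFF, 0xFFFFFFFF)` yields 0xFFFFFFFE, the top half of 0xFFFFFFFE00000001.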

  • Andy,

    Thanks for your comment. I will try it.