
SIMD floating point multiplications

Hi,

I have some code that performs many multiplications between elements of a table, and I wanted to optimize it with the _qmpysp intrinsic.

However, the cycle count goes up when I use this intrinsic followed by four _get32f_128() calls (to extract my values).

Does this mean that I am not using the intrinsics correctly, that my code is far from optimized even without the DSP intrinsics, or something else?

 

I saw that the four calls to extract my values take 28 cycles, so I wanted to compare this against extracting the values with a shift and a mask.

This is how I tried to do it:

Given a __x128_t value, which I will call "a" here, I write output = a >> 96 & 0xFFFFFFFF to get the first value of the __x128_t. But I get the error "expression must have integral type". How can I do this?

Thanks in advance for your answers.

Best regards,

Alex

  • __x128_t is not a native, built-in type in which all operations are supported (like, say, float).  It's a "container type" that allows vector data to be stored.  That's why you're getting the error.  The _get32f_128 intrinsics are the proper way to extract values.  In most cases, these "get" intrinsics will result in no instruction or 1 move instruction.

    Regarding the suboptimal results: Can you post the C/C++ code of your loop or function?  In some cases, if a loop is not M unit bound (i.e. isn't limited by multiply bandwidth), the use of _qmpysp can make things worse because of increased register constraints.  But there could be many other things going on, so it would be best if you could post the code.

    -Todd
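
As a portable illustration of the "container type" point above, here is a minimal sketch in plain C. It does not use the TI intrinsics: the names `vec128`, `vec128_from_floats`, `vec128_get_float` and `vec128_mul` are hypothetical stand-ins, and the argument order (most significant lane first) is only assumed from how the code posted in this thread uses `_fto128`. It also assumes a 32-bit `float`. The idea is that a struct holding four lanes cannot be shifted or masked, so lanes are read through an accessor, and a float lane is a reinterpretation of bits rather than an integer extracted by `>>` and `&`.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 128-bit container: four raw 32-bit lanes.
   Like any struct, it cannot be used as an operand of >> or &. */
typedef struct {
    uint32_t lane[4];   /* lane[0] is the least significant word */
} vec128;

/* Pack four floats, most significant lane first (assumed ordering). */
vec128 vec128_from_floats(float f3, float f2, float f1, float f0)
{
    vec128 v;
    float f[4];
    f[0] = f0; f[1] = f1; f[2] = f2; f[3] = f3;
    memcpy(v.lane, f, sizeof f);   /* assumes sizeof(float) == 4 */
    return v;
}

/* Reinterpret one lane as a float; no shifting or masking involved. */
float vec128_get_float(vec128 v, int index)
{
    float f;
    memcpy(&f, &v.lane[index], sizeof f);
    return f;
}

/* Lane-wise product: a scalar analogue of a 4-way SIMD multiply. */
vec128 vec128_mul(vec128 a, vec128 b)
{
    return vec128_from_floats(
        vec128_get_float(a, 3) * vec128_get_float(b, 3),
        vec128_get_float(a, 2) * vec128_get_float(b, 2),
        vec128_get_float(a, 1) * vec128_get_float(b, 1),
        vec128_get_float(a, 0) * vec128_get_float(b, 0));
}
```

With this stand-in, writing `v >> 96` is rejected by the compiler for the same kind of reason as in the question: the operand is not of integral type.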

  • Hi Todd, and thank you for your answer. Below are the two ways I coded my function:

    // 1st way : without intrinsics

    void LU_simple(float *tableau, int nombreLignes, int nombreColonnes){
        int j,k=0,l;
        while(k<nombreColonnes){
            for(j=k;j<nombreLignes-1;j++){
                tableau[(j+1)*nombreColonnes+k] = tableau[(j+1)*nombreColonnes+k]/tableau[k*nombreColonnes+k];
            }

            for(l=k;l<nombreColonnes-1;l++){
                for(j=k;j<nombreLignes-1;j++){
                    tableau[(j+1)*nombreColonnes+(l+1)] = tableau[(j+1)*nombreColonnes+(l+1)]-tableau[(j+1)*nombreColonnes+k]*tableau[k*nombreColonnes+(l+1)];  /* (******) */
                }
            }
            k++;
        }
    }

    // 2nd way : with intrinsic

    void LU_simple(float *tableau, int nombreLignes, int nombreColonnes){
        __x128_t a, b, result;
        float x=4;
        float var1 = nombreLignes/x;
        int var2 = var1; if(var2 == nombreLignes/x) var2--;
        int i=0,j=0,k=0,l=0;

        while(k < nombreColonnes){
            for(j=k;j<nombreLignes-1;j++){
                tableau[(j+1)*nombreColonnes+k] = tableau[(j+1)*nombreColonnes+k]/tableau[k*nombreColonnes+k];
            }

            for(l = k; l<nombreColonnes-1 ; l++){
                j = 0;
                b = _fto128(tableau[k*nombreColonnes+l+1],tableau[k*nombreColonnes+l+1],tableau[k*nombreColonnes+l+1],tableau[k*nombreColonnes+l+1]);
                while(j<=var2){
                    if(4*j+4 < nombreLignes){
                        a = _fto128(tableau[(4*j+m+4)*nombreColonnes+k],tableau[(4*j+m+3)*nombreColonnes+k],tableau[(4*j+m+2)*nombreColonnes+k],tableau[(4*j+m+1)*nombreColonnes+k]);
                        result = _qmpysp(a,b);
                        tableau[(4*j+m+1)*nombreColonnes+l+1] -= _get32f_128(result,0);
                        tableau[(4*j+m+2)*nombreColonnes+l+1] -= _get32f_128(result,1);
                        tableau[(4*j+m+3)*nombreColonnes+l+1] -= _get32f_128(result,2);
                        tableau[(4*j+m+4)*nombreColonnes+l+1] -= _get32f_128(result,3);
                        j++;
                    }
                    else{
                        for(i = 0; i < nombreLignes-(4*(j-1)+4)-1; i++){
                            if(4*var2+i+m+1<nombreLignes){
                                tableau[(4*var2+i+m+1)*nombreColonnes+l+1] -= tableau[(4*var2+m+(i+1))*nombreColonnes+k]*tableau[k*nombreColonnes+l+1];
                            }
                        }
                        j++;
                    }
                }
            }
            k++;
        }
    }

     

    So I want to replace the line marked (******) in the first function with the _qmpysp intrinsic.
    I do four computations at a time and, depending on the size of the table, compute the remaining elements without the intrinsic.
    I don't think the function's content needs to be understood in detail; it is an LU factorization algorithm that can be found on the Internet.
    I'll let you take a look and tell me whether something is wrong with the way I coded it. Thanks!

    Regards,
    Alex
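
The "four at a time, then the remainder" scheme described in the post above can be sketched in portable C as follows. This is only an illustration under stated assumptions, not the poster's code: the function name `scaled_subtract_strided` and its parameters are hypothetical, the four products are written as plain scalar multiplies where a 4-way multiply could be substituted, and `dst` and `src` are assumed to point at different columns so that they never overlap.

```c
/* Subtract scale * src[i * stride] from dst[i * stride] for i = 0..count-1.
   The main loop handles four elements per pass; the tail loop handles the
   zero to three elements left over. Assumes dst and src do not overlap. */
void scaled_subtract_strided(float *dst, const float *src,
                             float scale, int count, int stride)
{
    int i = 0;
    int blocks_end = count - (count % 4);  /* largest multiple of 4 <= count */

    for (; i < blocks_end; i += 4) {
        /* Four independent products: the part a 4-way multiply could cover. */
        float p0 = src[(i + 0) * stride] * scale;
        float p1 = src[(i + 1) * stride] * scale;
        float p2 = src[(i + 2) * stride] * scale;
        float p3 = src[(i + 3) * stride] * scale;

        dst[(i + 0) * stride] -= p0;
        dst[(i + 1) * stride] -= p1;
        dst[(i + 2) * stride] -= p2;
        dst[(i + 3) * stride] -= p3;
    }

    /* Scalar remainder, with no per-iteration bounds test in the main loop. */
    for (; i < count; i++) {
        dst[i * stride] -= src[i * stride] * scale;
    }
}
```

With the variable names used in this thread, one column update for a given `k` and `l` would correspond to a call such as `scaled_subtract_strided(&tableau[(k+1)*nombreColonnes + (l+1)], &tableau[(k+1)*nombreColonnes + k], tableau[k*nombreColonnes + (l+1)], nombreLignes - 1 - k, nombreColonnes)`.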

  • Alex,

    I don't think qmpysp is going to help you much here.  It looks like this code is bound by the number of loads and stores that must occur.  Furthermore, the inner loop of the first version of your code (******) has a loop carried dependence bound of 13.  It seems the compiler thinks there is a dependence from the store into "tableau" to the next iteration of the load where we're loading values out of "tableau".  Loop carried dependence bounds can inhibit the speed of the generated software pipelined loop.  Therefore, this software pipelined loop cannot go faster than 13 cycles per iteration.

    The loop carried dependence bound also appears in the second version of your code and seems to be the limiting factor.  Use -k -mw when compiling to take a look at the software pipelining feedback in the assembly file.  The documents "Hand-Tuning Loops and Control Code on the TMS320C6000" (spra666) and "C6000 Programmer's Guide" (spru198k) have more information.

    -Todd
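
One common way to remove a suspected store-to-load dependence like the one described above is to tell the compiler that the pointers do not alias and to read loop-invariant values once. The sketch below is not something proposed in this thread and is only an assumption about what might help: the function name `lu_update_rows` is hypothetical, it uses the C99 `restrict` qualifier, it swaps the loop order so the inner loop walks along a row of the row-major table, and it assumes the multipliers in column `k` have already been computed. Whether it actually lowers the dependence bound on the target would have to be checked in the software pipelining feedback.

```c
/* Update of the trailing submatrix for pivot index k, row-major storage.
   Performs the same arithmetic as the line marked (******), with the two
   loops interchanged so that the inner loop is contiguous in memory. */
void lu_update_rows(float *tableau, int nombreLignes, int nombreColonnes, int k)
{
    /* The pivot row is only read, and never through another pointer. */
    const float *restrict pivot_row = &tableau[k * nombreColonnes];
    int j, l;

    for (j = k + 1; j < nombreLignes; j++) {
        /* Row j is distinct from row k, so the restrict promise holds. */
        float *restrict row = &tableau[j * nombreColonnes];
        const float multiplier = row[k];   /* hoisted: read once per row */

        for (l = k + 1; l < nombreColonnes; l++) {
            row[l] -= multiplier * pivot_row[l];
        }
    }
}
```

Because each store goes to `row[l]` and each load in the inner loop comes from `pivot_row[l]` or from a local variable, no iteration has to wait for the previous iteration's store.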

  • Would you be interested in seeing an optimized version of a Cholesky decomposition and the performance it gives?

    By the way, how large is your matrix?

    Ran

  • Hmmm :(

    Ok well I'm going to take a look at those documents. I already "read" them but I may find something new.

    Thanks again !

    EDIT: I just saw your message, Ran. I don't really know the size of my matrix; the program is supposed to work with any size. I went up to 500x500. I could go further, but I don't know the limit.

    Yes I'd like to see the Cholesky decomposition, maybe I will be inspired by it ^^

     

    Alex