
TMS320F28335: Calculation Optimization for tanhf() and matrix multiply

Part Number: TMS320F28335
Other Parts Discussed in Thread: CONTROLSUITE

Hi everyone,

I'm currently working on a motor control application, and I want to replace the PI controller with a fuzzy-neural network controller. In each interrupt, I need to perform three matrix multiplications and apply tanhf() to the results.

In the code, I multiply element by element and use tanhf() from the math library, but the interrupt cannot finish within 0.1 ms. Is there any way I can optimize the code, or do I need to change to a faster microcontroller?

Thanks

Yang Sun

// Matrix_multiply(*W1,inputlayer,hiddenlayer1,6,5);
p=&W1[0][0];
q=&inputlayer[0];
hiddenlayer1[0] = p[0]*q[0]  + p[1]*q[1]  + p[2]*q[2]  + p[3]*q[3]  + p[4]*q[4];
hiddenlayer1[1] = p[5]*q[0]  + p[6]*q[1]  + p[7]*q[2]  + p[8]*q[3]  + p[9]*q[4];
hiddenlayer1[2] = p[10]*q[0] + p[11]*q[1] + p[12]*q[2] + p[13]*q[3] + p[14]*q[4];
hiddenlayer1[3] = p[15]*q[0] + p[16]*q[1] + p[17]*q[2] + p[18]*q[3] + p[19]*q[4];
hiddenlayer1[4] = p[20]*q[0] + p[21]*q[1] + p[22]*q[2] + p[23]*q[3] + p[24]*q[4];
hiddenlayer1[5] = p[25]*q[0] + p[26]*q[1] + p[27]*q[2] + p[28]*q[3] + p[29]*q[4];

hiddenlayer1[0]=(float)tanhf(hiddenlayer1[0]);
hiddenlayer1[1]=(float)tanhf(hiddenlayer1[1]);
hiddenlayer1[2]=(float)tanhf(hiddenlayer1[2]);
hiddenlayer1[3]=(float)tanhf(hiddenlayer1[3]);
hiddenlayer1[4]=(float)tanhf(hiddenlayer1[4]);
hiddenlayer1[5]=(float)tanhf(hiddenlayer1[5]);

// the 2nd hidden layer
//Matrix_multiply(*W2, hiddenlayer1, hiddenlayer2,6,7);
p=&W2[0][0];
q=&hiddenlayer1[0];
hiddenlayer2[0] = p[0]*q[0]  + p[1]*q[1]  + p[2]*q[2]  + p[3]*q[3]  + p[4]*q[4]  + p[5]*q[5]  + p[6]*q[6];
hiddenlayer2[1] = p[7]*q[0]  + p[8]*q[1]  + p[9]*q[2]  + p[10]*q[3] + p[11]*q[4] + p[12]*q[5] + p[13]*q[6];
hiddenlayer2[2] = p[14]*q[0] + p[15]*q[1] + p[16]*q[2] + p[17]*q[3] + p[18]*q[4] + p[19]*q[5] + p[20]*q[6];
hiddenlayer2[3] = p[21]*q[0] + p[22]*q[1] + p[23]*q[2] + p[24]*q[3] + p[25]*q[4] + p[26]*q[5] + p[27]*q[6];
hiddenlayer2[4] = p[28]*q[0] + p[29]*q[1] + p[30]*q[2] + p[31]*q[3] + p[32]*q[4] + p[33]*q[5] + p[34]*q[6];
hiddenlayer2[5] = p[35]*q[0] + p[36]*q[1] + p[37]*q[2] + p[38]*q[3] + p[39]*q[4] + p[40]*q[5] + p[41]*q[6];

hiddenlayer2[0]=(float)tanhf(hiddenlayer2[0]);
hiddenlayer2[1]=(float)tanhf(hiddenlayer2[1]);
hiddenlayer2[2]=(float)tanhf(hiddenlayer2[2]);
hiddenlayer2[3]=(float)tanhf(hiddenlayer2[3]);
hiddenlayer2[4]=(float)tanhf(hiddenlayer2[4]);
hiddenlayer2[5]=(float)tanhf(hiddenlayer2[5]);

// the 3rd hidden layer
p=&W3[0][0];
q=&hiddenlayer2[0];
outputlayer[0] = p[0]*q[0] + p[1]*q[1] + p[2]*q[2]  + p[3]*q[3]  + p[4]*q[4]  + p[5]*q[5]  + p[6]*q[6];
outputlayer[1] = p[7]*q[0] + p[8]*q[1] + p[9]*q[2]  + p[10]*q[3] + p[11]*q[4] + p[12]*q[5] + p[13]*q[6];
outputlayer[0] = tanhf(outputlayer[0]) * 25.719642299223366f;
outputlayer[1] = tanhf(outputlayer[1]) * 25.719642299223366f;

  • Hi Yang Sun,

    I don't think it's going to be possible on the F28335. Those 14 tanh() calls will take about 726 cycles each using the RTS library, which is about 68 us at 150 MHz, so a lot of your time budget is gone on that alone.

    If you can change to an F28377D (for example) you'd have 200 MHz on each core, plus the benefit of the TMU. You could then replace tanh() with its exponential equivalent. Instead of this:

    x = (float) tanh(v);

    ...do this...

    y = expf(-2*v);
    x = (1 - y) / (1 + y);

    The exponential is still expensive, but the TMU division is not, so you'd be done in about 398 cycles instead of 726. That's about 28 us for all 14 calls at 200 MHz.
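    Put together as a plain-C sketch (the helper name is illustrative, not a TI library function), the substitution looks like this:

    ```c
    #include <math.h>

    /* tanh(v) rewritten to use a single exponential, via the identity
       tanh(v) = (1 - e^(-2v)) / (1 + e^(-2v)).
       On a TMU-equipped device the division is cheap; on the F28335
       it is done in software. tanh_via_exp is an illustrative name. */
    float tanh_via_exp(float v)
    {
        float y = expf(-2.0f * v);
        return (1.0f - y) / (1.0f + y);
    }
    ```

    The identity is exact, so apart from rounding this returns the same values as tanhf().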

    The matrix multiplications are going to be expensive in C. Your second equation looks like a (6x7)(7x1) product, which on my machine consumes 13,826 cycles (about 92 us at 150 MHz). Since you know the matrix dimensions in advance you could speed this up in several ways. The fastest is going to be hand-coded assembly, but you might try experimenting with the C compiler optimizer first, if you haven't already done so.
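    For reference, a generic loop form (names are mine, not from your post) gives the optimizer something to work with; with fixed dimensions and optimization enabled the compiler can unroll the loops, and `restrict` tells it the buffers don't alias:

    ```c
    #include <stddef.h>

    /* Row-major matrix-vector product: out[r] = sum over c of A[r][c] * x[c].
       A points at the first element of a rows x cols matrix. */
    void matvec(const float *restrict A, const float *restrict x,
                float *restrict out, size_t rows, size_t cols)
    {
        for (size_t r = 0; r < rows; ++r) {
            float acc = 0.0f;
            for (size_t c = 0; c < cols; ++c) {
                acc += A[r * cols + c] * x[c];
            }
            out[r] = acc;
        }
    }
    ```

    For example, your first layer would be matvec(&W1[0][0], inputlayer, hiddenlayer1, 6, 5).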

    We don't have a matrix library right now but it's something we have in the works.  Apart from multiplication, what other matrix operations are you doing? 

    Regards,

    Richard

  • Hi Richard,

    Thanks for your response. I just need to perform three multiplications: (6x5)(5x1), (6x7)(7x1), and (2x7)(7x1). The dimensions are known in advance, and they work out to 30, 42, and 14 scalar multiplications in my code. I'm not good at writing assembly language; are there any examples I might find in controlSUITE?

    Regards,

    Yang Sun

  • Hi Yang Sun,

    Yes, you will also have 24, 36, and 12 additions to do, respectively. I regret there aren't any matrix multiplication functions in controlSUITE. However, there are some assembly coded vector multiplication functions in the floating point DSP library, which you can find in the directory:

    ...\controlSUITE\libs\dsp\FPU\v1_50_00_00\source\vector

    I have not tried to use them in this way, but you might be able to treat each matrix term as a real vector product.
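    In other words, each output element is one dot product of a weight row with the input vector. In plain C (the FPU library's actual function names and signatures may differ) the idea reduces to a helper like this:

    ```c
    #include <stddef.h>

    /* Dot product of two length-n vectors; a library routine or
       hand-coded assembly could replace this loop. */
    float dot(const float *a, const float *b, size_t n)
    {
        float acc = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            acc += a[i] * b[i];
        }
        return acc;
    }
    ```

    Each output term then becomes one call, e.g. hiddenlayer1[1] = dot(&W1[1][0], inputlayer, 5).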

    As I said, optimized support for matrix operations is something we are looking at, but no release date has been set. I'm really sorry not to have more for you on this.

    Regards,

    Richard
  • Hi Richard,

    I replaced the tanh function with an approximation and the code now finishes in time. It would be great if the DSP libraries could support matrix calculations in the future.
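    (For reference, one common closed-form choice, not necessarily the approximation used here, is a rational Pade-style form; fast_tanh is an illustrative name. Assumed accuracy is within a few percent over the clamped range.)

    ```c
    #include <math.h>

    /* Rational approximation of tanh: x*(27 + x^2)/(27 + 9*x^2),
       clamped to +/-1 outside |x| <= 3 (the formula hits exactly 1 at x = 3).
       Avoids the expensive expf() call entirely. */
    float fast_tanh(float x)
    {
        if (x >  3.0f) return  1.0f;
        if (x < -3.0f) return -1.0f;
        float x2 = x * x;
        return x * (27.0f + x2) / (27.0f + 9.0f * x2);
    }
    ```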

    Thanks,
    Yang Sun