Double multiplication is slightly faster than float multiplication (3x3 Matrix)

Hi,

I am currently working on the C6678 multicore DSP and ran a small benchmark test. What I tested is the time it takes to multiply two 3x3 matrices, both in debug mode and in optimization mode (-O3). I measured the time using the RTOS Analyzer execution graph. I multiplied the two matrices 1000 times and calculated the mean CPU time per multiplication.

So here are the results of average time:

Debug Mode, Float(REAL32) : 1.251us (micro seconds)

Optimization Mode, Float(REAL32) : 0.231us

Debug Mode, Double(REAL64) : 1.262us

Optimization Mode, Double(REAL64) : 0.228us

Here is my question: aren't double operations supposed to take more CPU time than float operations? Why does the double type definition show slightly better performance in optimization mode?

Thank you.

  • Hi,

    Can you paste here the piece of code in which you are performing the multiply operation mentioned above?

    Well, I am no expert on which operation is faster, but I can comment on the optimization part.

    Basically, multiplication is a hardware operation and is architecture dependent. The -O3 option only optimizes the code at compile time: any redundant operation is excluded from execution on the core, so the multiplication may not actually be performed all 1000 times. There are three ways to check whether it is executed 1000 times: 1. Use a counter to count the number of times the multiplication is executed (see the sketch just below). 2. Examine the generated assembly code after compilation. 3. Compare the end result with the result obtained without -O3 and check whether they match.
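
    For option 1, here is a minimal, self-contained sketch (my illustration only, plain C, not code from any real project) of a volatile counter that the optimizer is not allowed to remove, so it records how many times the loop body really runs under -O3:

    volatile unsigned int gMulCount = 0;     /* volatile: the increments survive -O3 */

    static float MulOnce(float a, float b)
    {
        return a * b;                        /* stands in for the matrix multiply */
    }

    int main(void)
    {
        volatile float result = 0.0f;
        unsigned int i;
        for (i = 0; i < 1000; i++)
        {
            result = MulOnce(3.0f, 2.0f);    /* the operation being benchmarked */
            gMulCount++;                     /* counts the real executions */
        }
        /* Inspect gMulCount in the debugger after the loop; it should read 1000. */
        return 0;
    }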

    -O3 is just a software optimization, not a hardware optimization; hence, in practice, there should be no difference in the profile values.

    Also, for a better understanding of how the two multiplications are done, look at the assembly code in the Disassembly window. The profile values stated above include the loop iteration time, the assignment time and the time required for the multiplication itself. I believe the assignment time is different for the two cases, so it is hard to conclude anything from these profile values. Please correct me if I'm wrong. ☺

    Regards

    Sud

  • Hi Sud,

    First of all, thanks for the answer. I am certain the multiplication is executed 1000 times in optimization mode; I checked it with your option 3, comparing the end result with the result obtained without -O3 and verifying that they match.

    I agree with you about the difficulty of concluding anything just from the time values I measured. Below I share the piece of code I tested. In optimization mode, I tested this code with both the REAL32 and the REAL64 type definition (the REAL64 variant differs only in the types; see the sketch after the listing). My prediction was that the REAL64 test would take longer, but it didn't. I wonder if the C6678 multicore DSP has special hardware, instructions or something like that for double operations which provides comparable or even better performance than float operations in some cases. Would that be the reason?

    Thank you

    Gokhan

    //***********************************************************************************************************

    typedef REAL32 Matrix3x3REAL32Typ[3][3];
    
    
    
    void MultiplyMatrix3x3x3(Matrix3x3REAL32Typ argMatrix1, Matrix3x3REAL32Typ argMatrix2, Matrix3x3REAL32Typ& resultMatrix) {
        
        REAL32 sum = 0.0;
        
        for(int i = 0 ; i < 3 ; i++)
        {
          for(int j = 0 ; j < 3 ; j++)
          {
            sum = 0;
            for(int k = 0 ; k < 3 ; k++)
            {
              sum += argMatrix1[i][k] * argMatrix2[k][j];
            }          
            resultMatrix[i][j] = sum;
          }
        }
    }
    
    
    
    void Task() {
      
    
        Matrix3x3REAL32Typ  C1, C2;
        
        
        while(1)
        {
            /* Get access to resource */
            Semaphore_pend(semDusuk, BIOS_WAIT_FOREVER);
            
            
            C1[0][0] = 4.0;
            C1[0][1] = 2.0;
            C1[0][2] = 2.0;
            C1[1][0] = 2.0;
            C1[1][1] = 4.0;
            C1[1][2] = 2.0;
            C1[2][0] = 2.0;
            C1[2][1] = 2.0;
            C1[2][2] = 4.0;
            
            C2[0][0] = 1.0001;
            C2[0][1] = 0.0001;
            C2[0][2] = 0.0001;
            C2[1][0] = 0.0001;
            C2[1][1] = 1.0001;
            C2[1][2] = 0.0001;
            C2[2][0] = 0.0001;
            C2[2][1] = 0.0001;
            C2[2][2] = 1.0001;
            
            for(UINT32 i=0; i<1000; i++)
            {           
               MultiplyMatrix3x3x3(C1,C2,C1);      
            } 
            
        }
        	
        
    }
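
    For completeness, the REAL64 variant I tested differs only in the element type. Here is a minimal sketch of it (reconstructed for illustration; it assumes REAL64 is our typedef for double, just as REAL32 is for float, and the name MultiplyMatrix3x3x3REAL64 is used only here):

    typedef REAL64 Matrix3x3REAL64Typ[3][3];

    void MultiplyMatrix3x3x3REAL64(Matrix3x3REAL64Typ argMatrix1, Matrix3x3REAL64Typ argMatrix2, Matrix3x3REAL64Typ& resultMatrix) {

        REAL64 sum = 0.0;

        for(int i = 0 ; i < 3 ; i++)
        {
          for(int j = 0 ; j < 3 ; j++)
          {
            sum = 0;
            for(int k = 0 ; k < 3 ; k++)
            {
              /* double-precision multiply; on C66x this typically compiles to
                 MPYDP rather than the single-precision MPYSP */
              sum += argMatrix1[i][k] * argMatrix2[k][j];
            }
            resultMatrix[i][j] = sum;
          }
        }
    }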
    
    
    
    

    Since the C6678 is a floating-point device, there would not be much difference in profile values for float and double multiplication. You can do a small test by calculating the profile value for a single float multiplication and a single double multiplication; since this is completely hardware/architecture dependent, it's hard to comment further.
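
    A minimal sketch of such a test (my suggestion, untested; it assumes the TI compiler's c6x.h header, which exposes the free-running time-stamp counter TSCL, and uses hypothetical variable names):

    #include <c6x.h>

    volatile float  fa = 1.2345f, fb = 6.789f, fr;
    volatile double da = 1.2345,  db = 6.789,  dr;

    void ProfileSingleMultiply(void)
    {
        unsigned int t0, t1, cyclesFloat, cyclesDouble;

        TSCL = 0;                 /* any write starts the counter */

        t0 = TSCL;
        fr = fa * fb;             /* single-precision multiply */
        t1 = TSCL;
        cyclesFloat = t1 - t0;

        t0 = TSCL;
        dr = da * db;             /* double-precision multiply */
        t1 = TSCL;
        cyclesDouble = t1 - t0;

        /* Compare cyclesFloat and cyclesDouble in the debugger. The volatile
           operands keep -O3 from folding the multiplies into constants, and
           each measurement includes the small overhead of reading TSCL. */
    }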

    I'm just curious about the reduction in profile value with -O3 enabled. The function 

    void MultiplyMatrix3x3x3(Matrix3x3REAL32Typ argMatrix1, Matrix3x3REAL32Typ argMatrix2, Matrix3x3REAL32Typ& resultMatrix) {
        
        REAL32 sum = 0.0;
        
        for(int i = 0 ; i < 3 ; i++)
        {
          for(int j = 0 ; j < 3 ; j++)
          {
            sum = 0;
            for(int k = 0 ; k < 3 ; k++)
            {
              sum += argMatrix1[i][k] * argMatrix2[k][j];
            }          
            resultMatrix[i][j] = sum;
          }
        }
    }

    With -O3 enabled, the compiler will surely optimize this piece of code down to far fewer instructions. As I said before, the best way is to examine the assembly code.

    The assembly for the above function without -O3:

              MultiplyMatrix:
    00802000:   07FFF052            ADDK.S2       -32,B15
    00802004:   EC65                STW.D2T1      A6,*B15[3]
    00802006:   7246                MV.L1X        B4,A3
    00802008:   AC45     ||         STW.D2T1      A4,*B15[1]
    0080200a:   CC35                STW.D2T1      A3,*B15[2]
    14        	REAL32 sum = 0.0;
    0080200c:   0627                MVK.L2        0,B4
    0080200e:   9CC5                STW.D2T2      B4,*B15[4]
    17        	for(i = 0 ; i < 3 ; i++)
    00802010:   023CA2F6            STW.D2T2      B4,*+B15[5]
    00802014:   001068DA            CMPGT.L2      3,B4,B0
    00802018:   3074A120     [!B0]  BNOP.S1       $C$L6 (PC+232 = 0x008020e8),5
    0080201c:   E1C00008            .fphead       n, l, W, BU, nobr, nosat, 0001110b
    19        		for(j = 0 ; j < 3 ; j++)
              $C$DW$L$MultiplyMatrix$2$B, $C$L1:
    00802020:   0627                MVK.L2        0,B4
    00802022:   DCC5                STW.D2T2      B4,*B15[6]
    00802024:   001068DA            CMPGT.L2      3,B4,B0
    00802028:   305AA120     [!B0]  BNOP.S1       $C$DW$L$MultiplyMatrix$5$E (PC+180 = 0x008020d4),5
    21        			sum = 0;
              $C$DW$L$MultiplyMatrix$2$E, $C$DW$L$MultiplyMatrix$3$B, $C$L2:
    0080202c:   0627                MVK.L2        0,B4
    0080202e:   9CC5                STW.D2T2      B4,*B15[4]
    22        			for(k = 0 ; k < 3 ; k++)
    00802030:   023CE2F6            STW.D2T2      B4,*+B15[7]
    00802034:   001068DA            CMPGT.L2      3,B4,B0
    00802038:   303AA120     [!B0]  BNOP.S1       $C$DW$L$MultiplyMatrix$4$E (PC+116 = 0x00802094),5
    0080203c:   E1200000            .fphead       n, l, W, BU, nobr, nosat, 0001001b
    24        				sum += argMatrix1[i][k] * argMatrix2[k][j];
              $C$DW$L$MultiplyMatrix$3$E, $C$DW$L$MultiplyMatrix$4$B, $C$L3:
    00802040:   02BCA2E6            LDW.D2T2      *+B15[5],B5
    00802044:   043C42E6            LDW.D2T2      *+B15[2],B8
    00802048:   BC6D                LDW.D2T2      *B15[1],B6
    0080204a:   CCCD                LDW.D2T1      *B15[6],A4
    0080204c:   7246                MV.L1X        B4,A3
    0080204e:   E247                MV.L2         B4,B7
    00802050:   765A     ||         SHL.S1X       B4,0x3,A5
    00802052:   66CB     ||         SHL.S2        B5,0x3,B4
    00802054:   0210BC43            ADDAW.D2      B4,B5,B4
    00802058:   01947C40            ADDAW.D1      A5,A3,A3
    0080205c:   E3800180            .fphead       n, l, W, BU, nobr, nosat, 0011100b
    00802060:   01A07079            ADD.L1X       A3,B8,A3
    00802064:   8341     ||         ADD.L2        B4,B6,B4
    00802066:   F44D                LDW.D2T2      *B4[B7],B4
    00802068:   018C8A64 ||         LDW.D1T1      *+A3[A4],A3
    0080206c:   0FBC82E6            LDW.D2T2      *+B15[4],B31
    00802070:   00004000            NOP           3
    00802074:   01907E00            MPYSP.M1X     A3,B4,A3
    00802078:   00006000            NOP           4
    0080207c:   E0400008            .fphead       n, l, W, BU, nobr, nosat, 0000010b
    00802080:   020FF79A            FADDSP.L2X      B31,A3,B4
    00802084:   2C6E                NOP           2
    00802086:   9CC5                STW.D2T2      B4,*B15[4]
    00802088:   27C1                ADD.L2        B7,1,B4
    0080208a:   FCC5                STW.D2T2      B4,*B15[7]
    0080208c:   001068DA            CMPGT.L2      3,B4,B0
    00802090:   2FE0A120     [ B0]  BNOP.S1       $C$DW$L$MultiplyMatrix$3$E (PC-64 = 0x00802040),5
    26        			*resultMatrix[i][j] = sum;
              $C$DW$L$MultiplyMatrix$4$E, $C$DW$L$MultiplyMatrix$5$B, $C$L4:
    00802094:   DCCD                LDW.D2T2      *B15[6],B4
    00802096:   BCDD                LDW.D2T2      *B15[5],B5
    00802098:   043C62E6            LDW.D2T2      *+B15[3],B8
    0080209c:   E4C00000            .fphead       n, l, W, BU, nobr, nosat, 0100110b
    008020a0:   9CFD                LDW.D2T2      *B15[4],B7
    008020a2:   0C6E                NOP           1
    008020a4:   01907CA0            SHL.S1X       B4,0x3,A3
    008020a8:   0494ACA2            SHL.S2        B5,0x5,B9
    008020ac:   02A4BC43            ADDAW.D2      B9,B5,B5
    008020b0:   030C1FDA ||         MV.L2X        A3,B6
    008020b4:   0220A07B            ADD.L2        B5,B8,B4
    008020b8:   02989C42 ||         ADDAW.D2      B6,B4,B5
    008020bc:   E0200000            .fphead       n, l, W, BU, nobr, nosat, 0000001b
    008020c0:   A241                ADD.L2        B5,B4,B4
    008020c2:   1075                STW.D2T2      B7,*B4[0]
    008020c4:   DCCD                LDW.D2T2      *B15[6],B4
    008020c6:   6C6E                NOP           4
    008020c8:   2641                ADD.L2        B4,1,B4
    008020ca:   DCC5                STW.D2T2      B4,*B15[6]
    008020cc:   001068DA            CMPGT.L2      3,B4,B0
    008020d0:   2FB6A120     [ B0]  BNOP.S1       $C$DW$L$MultiplyMatrix$2$E (PC-148 = 0x0080202c),5
              $C$DW$L$MultiplyMatrix$5$E, $C$DW$L$MultiplyMatrix$6$B, $C$L5:
    008020d4:   BCCD                LDW.D2T2      *B15[5],B4
    008020d6:   6C6E                NOP           4
    008020d8:   2641                ADD.L2        B4,1,B4
    008020da:   BCC5                STW.D2T2      B4,*B15[5]
    008020dc:   ECE00000            .fphead       n, l, W, BU, nobr, nosat, 1100111b
    008020e0:   001068DA            CMPGT.L2      3,B4,B0
    008020e4:   2FA0A120     [ B0]  BNOP.S1       $C$L1 (PC-192 = 0x00802020),5
    29        }

    And the assembly code with -O3:

              MultiplyMatrix:
    00802180:   31F7                STW.D2T2      B3,*B15--[2]
    17        	for(i = 0 ; i < 3 ; i++)
    00802182:   64A7                MVK.L2        3,B1
    00802184:   04A6     ||         MVK.L1        0,A1
    00802186:   0392     ||         MVK.S1        0,A7
    00802188:   04000040 ||         MVK.D1        0,A8
              $C$DW$L$MultiplyMatrix$2$B, $C$L1:
    0080218c:   01149028            MVK.S1        0x2920,A2
    00802190:   01004068            MVKH.S1       0x800000,A2
    00802194:   43E0                ADD.L1        A2,A7,A6
    19        		for(j = 0 ; j < 3 ; j++)
    00802196:   6427                MVK.L2        3,B0
    00802198:   0000A358 ||         MVK.L1        0,A0
    0080219c:   E460080E            .fphead       n, l, W, BU, nobr, nosat, 0100011b
    21        			sum = 0;
              $C$DW$L$MultiplyMatrix$2$E, $C$DW$L$MultiplyMatrix$3$B, $C$L2:
    008021a0:   0194A42A            MVK.S2        0x2948,B3
    008021a4:   0180406A            MVKH.S2       0x800000,B3
    008021a8:   7051                ADD.L2X       B3,A0,B5
    008021aa:   40C0     ||         ADD.L1        A2,A1,A4
    008021ac:   A40E     ||         MV.S1         A8,A5
    22        			for(k = 0 ; k < 3 ; k++)
    008021ae:   25A7                MVK.L2        1,B3
    008021b0:   D9EF                MVC.S2        B3,ILC
              $C$DW$L$MultiplyMatrix$3$E, $C$L3:
    008021b2:   0D67                SPLOOPD       3
              $C$DW$L$MultiplyMatrix$5$B, $C$L4:
    008021b4:   01903665            LDW.D1T1      *A4++[1],A3
    008021b8:   021476E6 ||         LDW.D2T2      *B5++[3],B4
    008021bc:   E3900030            .fphead       p, l, W, BU, nobr, nosat, 0011100b
    008021c0:   01907E00            MPYSP.M1X     A3,B4,A3
    008021c4:   2C6E                NOP           2
    008021c6:   0C6E                NOP           1
    008021c8:   04434001            SPKERNEL      0x11
    008021cc:   02946798 ||         FADDSP.L1       A3,A5,A5
              $C$DW$L$MultiplyMatrix$5$E, $C$DW$L$MultiplyMatrix$7$B, $C$L5:
    008021d0:   00002000            NOP           2
    26        			*resultMatrix[i][j] = sum;
    008021d4:   02987674            STW.D1T1      A5,*A6++[3]
    008021d8:   0003E05A            SUB.L2        B0,0x1,B0
    008021dc:   E0400000            .fphead       n, l, W, BU, nobr, nosat, 0000010b
    008021e0:   2FE08120     [ B0]  BNOP.S1       $C$DW$L$MultiplyMatrix$2$E (PC-64 = 0x008021a0),4
    008021e4:   8400                ADD.L1        A0,4,A0
              $C$DW$L$MultiplyMatrix$7$E, $C$DW$L$MultiplyMatrix$8$B:
    008021e6:   EC91                ADD.L2        B1,-1,B1
    008021e8:   4FD66120     [ B1]  BNOP.S1       $C$L1 (PC-84 = 0x0080218c),3
    008021ec:   81B2                MVK.S1        36,A3
    008021ee:   63F0                ADD.L1        A3,A7,A7
    008021f0:   8CAE     ||         ADDK.S1       0x12,A1
    29        }

    As you can see, there is a huge reduction in the number of instructions. So the above function, executed 1000 times, will have a much better profile time with -O3 enabled. I hope this is the only reason; please let me know if there could be other reasons.

    And by the way, are you using the RTOS execution graph just to get the profile time values? If so, I would suggest using the clock that is available in CCS. During debug, go to Run --> Clock --> Setup..., select CPU.cycle and then enable it.

    Just double-click on the clock icon and the counter will reset. It's simple and easy.

    Regards

    Sud

    Additionally, the compiler also performs software pipelining with the -O3 option (note the SPLOOPD/SPKERNEL instructions in the optimized listing above), which further increases the performance.
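
    For example, a minimal sketch of my own (not related to the code above) of the kind of hints that help the C6000 compiler software-pipeline a loop: restrict promises that the pointers do not alias, and MUST_ITERATE gives trip-count information.

    void ScaleVector(float * restrict dst, const float * restrict src, float k, int n)
    {
        /* trip count is at least 8 and a multiple of 8; together with restrict
           this makes it easier for -O3 to build an SPLOOP kernel */
        #pragma MUST_ITERATE(8, , 8)
        for (int i = 0; i < n; i++)
        {
            dst[i] = k * src[i];
        }
    }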

    Regards

    Sud

    So it will remain a mystery unless I dive into the assembly details. But you say that since the C6678 is a floating-point device, there should not be much difference in profile values between float and double multiplication.

    Thanks for all the useful information you have provided.

    The best and ultimate way of debugging any code is to look at the machine code that the compiler generates, since the result is partly compiler dependent.

    If you would like to know more about -O3, go through chapter 3 of the optimization guide (SPRU187U). It is not always advisable to use -O3 unless you fully understand what it does: the compiler may optimize code and assignment operations away in ways that do not yield the desired output result. Hence, the code has to be written carefully when using optimization level 3 (see the small sketch below).
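
    As a small, hypothetical illustration of this pitfall: if the result of a loop is never used, -O3 may legally delete the whole loop, and a benchmark built around it then measures nothing.

    float DeadBenchmark(const float *a, int n)
    {
        float sum = 0.0f;
        for (int i = 0; i < n; i++)
        {
            sum += a[i] * a[i];   /* may be removed entirely if sum is never used */
        }
        /* return sum;   <-- without consuming sum (returning it, or storing it to
           a volatile location), the loop above is dead code under -O3 */
        return 0.0f;
    }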

    Gökhan Özdoğan said:
    But you say since the C6678 is a floating-point device, there would not be much difference in profile values for float and double multiplication.

    Yes, and you can check it out by profiling a single multiply operation.

    Regards

    Sud