Optimizing loop for absolute difference

Pay Giesselmann

Hi,

I'm trying to optimize a loop on the C6657 that calculates the absolute difference of two input images. What I have so far is the following:

typedef union
{
    unsigned int raw;
    unsigned char b[sizeof(unsigned int)/sizeof(unsigned char)];
} u_source;

void absDiffImage(void* const restrict t_sourceA,
                  void* const restrict t_sourceB,
                  void* const restrict t_destination,
                  const int t_height,
                  const int t_width)
{
    const int a_loopMax = t_height * t_width;
    unsigned int* restrict a_sourceA = (unsigned int*)t_sourceA;
    unsigned int* restrict a_sourceB = (unsigned int*)t_sourceB;
    unsigned int* restrict a_destination = (unsigned int*)t_destination;
    u_source a_sourceA7_0, a_sourceB7_0, a_destination7_0;

    std::_nassert((int) a_sourceA % 8 == 0);
    std::_nassert((int) a_sourceB % 8 == 0);
    std::_nassert((int) a_destination % 8 == 0);

    #pragma MUST_ITERATE(1024,,32)
    #pragma UNROLL( 1 )
    for( int i = 0; i < a_loopMax/4; i++ )
    {
        // read data
        a_sourceA7_0.raw = a_sourceA[i];
        a_sourceB7_0.raw = a_sourceB[i];

        // compute absolute difference
        a_destination7_0.b[0] = a_sourceA7_0.b[0] > a_sourceB7_0.b[0]?
                                a_sourceA7_0.b[0] - a_sourceB7_0.b[0]:
                                a_sourceB7_0.b[0] - a_sourceA7_0.b[0];
        a_destination7_0.b[1] = a_sourceA7_0.b[1] > a_sourceB7_0.b[1]?
                                a_sourceA7_0.b[1] - a_sourceB7_0.b[1]:
                                a_sourceB7_0.b[1] - a_sourceA7_0.b[1];
        a_destination7_0.b[2] = a_sourceA7_0.b[2] > a_sourceB7_0.b[2]?
                                a_sourceA7_0.b[2] - a_sourceB7_0.b[2]:
                                a_sourceB7_0.b[2] - a_sourceA7_0.b[2];
        a_destination7_0.b[3] = a_sourceA7_0.b[3] > a_sourceB7_0.b[3]?
                                a_sourceA7_0.b[3] - a_sourceB7_0.b[3]:
                                a_sourceB7_0.b[3] - a_sourceA7_0.b[3];

        // write back result
        a_destination[i] = a_destination7_0.raw;
    }
}

Looking at the generated assembly, I see the following:

;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : ../source/main.cpp
;*      Loop source line                 : 115
;*      Loop opening brace source line   : 116
;*      Loop closing brace source line   : 137
;*      Loop Unroll Multiple             : 2x
;*      Known Minimum Trip Count         : 512
;*      Known Max Trip Count Factor      : 16
;*      Loop Carried Dependency Bound(^) : 18
;*      Unpartitioned Resource Bound     : 19
;*      Partitioned Resource Bound(*)    : 19
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     5        3
;*      .S units                    19*      18
;*      .D units                     1        3
;*      .M units                     0        0
;*      .X cross paths               8        7
;*      .T address paths             2        2
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical ops (.LS)           5        3     (.L or .S unit)
;*      Addition ops (.LSD)          4        4     (.L or .S or .D unit)
;*      Bound(.L .S .LS)            15       12
;*      Bound(.L .S .D .LS .LSD)    12       11
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 19 Schedule found with 3 iterations in parallel
;*
;*      Register Usage Table:
;*          +-----------------------------------------------------------------+
;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;*          |00000000001111111111222222222233|00000000001111111111222222222233|
;*          |01234567890123456789012345678901|01234567890123456789012345678901|
;*          |--------------------------------+--------------------------------|
;*       0: |* ******       * **            |*     ** *      * ***           |
;*       1: |* *** ***      * **            |*   * ** *      * ***           |
;*       2: |*   ******      * **            |*   * ****      * ***           |
;*       3: |* **** **      * **            |*   ******      * ***           |
;*       4: |* *******      * **            |*    *****      *****           |
;*       5: |* **** **      * **            |*   ******      *****           |
;*       6: |* **** *       * **            |*   ******      * ***           |
;*       7: |* ******       * *             |*   ******      * ***           |
;*       8: |* ******       * *             |*   ******      * ***           |
;*       9: |* * ** *       * **            |*   *****       * ***           |
;*      10: |*    ****         **            |*   *****       * ***           |
;*      11: |* * ****         **            |*   *****         ***           |
;*      12: |* *****          **            |*    *****        ***           |
;*      13: |* *****          **            |*   ******        ***           |
;*      14: |* ******         **            |*   *****         ***           |
;*      15: |* ******         **            |*    *****        ***           |
;*      16: |* *****          **            |*   *** **        ***           |
;*      17: |* *****          **            |*   ***         * ***           |
;*      18: |*   * * *       ****            |*   ****        * ***           |
;*          +-----------------------------------------------------------------+
;*
;*      Done
;*
;*      Epilog not removed
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 2
;*      Minimum required memory pad   : 0 bytes
;*
;*      For further improvement on this loop, try option -mh16
;*
;*      Minimum safe trip count       : 2 (after unrolling)
;*      Min. prof. trip count (est.) : 4 (after unrolling)
;*
;*      Mem bank conflicts/iter(est.) : { min 0.000, est 0.000, max 0.000 }
;*      Mem bank perf. penalty (est.) : 0.0%
;*
;*
;*      Total cycles (est.)         : 25 + trip_cnt * 19
;*----------------------------------------------------------------------------*
;*       SETUP CODE
;*
;*                  SUB             B0,1,B0
;*
;*        SINGLE SCHEDULED ITERATION
;*
;*        $C$C57:
;*   0              LDDW    .D2T2   *B19++,B7:B6      ; |118|
;*   1              LDDW    .D1T1   *A18++,A5:A4      ; |119|
;*   2              NOP             4
;*   6              EXTU    .S2     B6,24,24,B4       ; |122|
;*     ||           SHRU    .S1X    B6,24,A9          ; |125|
;*   7              EXTU    .S2     B6,8,24,B8        ; |124|
;*   8              EXTU    .S1     A4,24,24,A3       ; |122|
;*   9              SUB     .L1X    B4,A3,A7          ; |122|
;*     ||           EXTU    .S1     A4,16,24,A3       ; |123|
;*     ||           SHRU    .S2X    A4,24,B17         ; |125|
;* 10              ABS     .L1     A7,A6             ; |122|
;*     ||           EXTU    .S1     A4,8,24,A4        ; |124|
;*     ||           EXTU    .S2     B7,24,24,B9       ; |122|
;* 11              EXTU    .S1     A6,24,24,A6       ; |122|
;*     ||           SUB     .L1X    B8,A4,A4          ; |124|
;*     ||           SUB     .L2X    A9,B17,B8         ; |125|
;* 12              EXTU    .S2     B6,16,24,B6       ; |123|
;*     ||           ABS     .L1     A4,A7             ; |124|
;*     ||           EXTU    .S1     A5,24,24,A4       ; |122|
;* 13              EXTU    .S1     A7,24,24,A7       ; |124|
;*     ||           ABS     .L2     B8,B8             ; |125|
;* 14              EXTU    .S1     A7,24,8,A19       ; |124|
;*     ||           SHL     .S2     B8,24,B18         ; |125|
;*     ||           SUB     .L2X    B9,A4,B8          ; |122|
;* 15              SUB     .L2X    B6,A3,B4          ; |123|
;* 16              ABS     .L2     B4,B6             ; |123|
;*     ||           EXTU    .S1     A5,16,24,A3       ; |123|
;* 17              NOP             1
;* 18              EXTU    .S2     B7,16,24,B4       ; |123|
;* 19              EXTU    .S1     A5,8,24,A7        ; |124|
;* 20              ABS     .L2     B8,B8             ; |122|
;*     ||           SUB     .L1X    B4,A3,A8          ; |123|
;*     ||           EXTU    .S2     B7,8,24,B9        ; |124|
;* 21              ABS     .L1     A8,A5             ; |123|
;*     ||           SHRU    .S1     A5,24,A4          ; |125|
;*     ||           SHRU    .S2     B7,24,B4          ; |125|
;* 22              EXTU    .S2     B8,24,24,B16      ; |122|
;*     ||           SUB     .L1X    B9,A7,A7          ; |124|
;* 23              EXTU    .S2     B6,24,24,B5       ; |123|
;*     ||           EXTU    .S1     A5,24,24,A16      ; |123|
;*     ||           ABS     .L1     A7,A8             ; |124|
;* 24              CLR     .S1     A17,0,7,A7        ; |122| ^
;*     ||           EXTU    .S2     B5,24,16,B9       ; |123|
;*     ||           SUB     .L1X    B4,A4,A3          ; |125|
;* 25              OR      .D1     A6,A7,A3          ; |122| ^
;*     ||           ABS     .L1     A3,A7             ; |125|
;* 26              CLR     .S1     A3,8,15,A6        ; |123| ^
;* 27              SHL     .S2X    A7,24,B5          ; |125|
;* 28              NOP             1
;* 29              OR      .L2X    B9,A6,B4          ; |123| ^
;* 30              CLR     .S2     B4,16,23,B4       ; |124| ^
;* 31              OR      .L2X    A19,B4,B4         ; |124| ^
;* 32              EXTU    .S2     B4,8,8,B4         ; |125| ^
;* 33              OR      .D2     B18,B4,B4         ; |125| ^
;* 34              STW     .D2T2   B4,*++B20(8)      ; |136|
;*     ||           CLR     .S2     B4,0,7,B6         ; |122| ^
;*     ||           EXTU    .S1     A16,24,16,A7      ; |123|
;* 35              OR      .S2     B16,B6,B4         ; |122| ^
;* 36              CLR     .S2     B4,8,15,B9        ; |123| ^
;*     ||           EXTU    .S1     A8,24,24,A4       ; |124|
;* 37              EXTU    .S1     A4,24,8,A4        ; |124|
;* 38              OR      .L1X    A7,B9,A8          ; |123| ^
;*     ||   [ B0]   BDEC    .S2     $C$C57,B0         ; |115|
;* 39              CLR     .S1     A8,16,23,A3       ; |124| ^
;* 40              OR      .D1     A4,A3,A3          ; |124| ^
;* 41              EXTU    .S1     A3,8,8,A3         ; |125| ^
;* 42              OR      .D1X    B5,A3,A17         ; |125| ^
;* 43              STW     .D2T1   A17,*+B20(4)      ; |136|
;* 44              ; BRANCHCC OCCURS {$C$C57}        ; |115|
;*----------------------------------------------------------------------------*

I can't get rid of the Loop Carried Dependency Bound because I don't understand where it comes from. Can anybody help me improve this loop, either with a completely new approach or with improvements to the existing code?

Thanks for your help,

best regards

Pay Gießelmann

over 12 years ago
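One possible source of the loop-carried dependency, judging from the listing in the question: the `^`-marked chain starts with `CLR .S1 A17,0,7,A7`, and A17 is the value assembled and stored at the end of the previous iteration. Since `a_destination7_0` is declared outside the loop and written one byte at a time, each byte write looks like a read-modify-write of the previous word. The following is a minimal sketch in portable C, not the original poster's code; `lane_abs_diff` and `abs_diff_words` are hypothetical names. It builds every result word from scratch so that nothing is carried from one iteration to the next.

```c
#include <stddef.h>
#include <stdint.h>

/* Absolute difference of one byte lane, returned at its position in the word. */
uint32_t lane_abs_diff(uint32_t wa, uint32_t wb, unsigned shift)
{
    const uint32_t x = (wa >> shift) & 0xFFu;
    const uint32_t y = (wb >> shift) & 0xFFu;
    return (x > y ? x - y : y - x) << shift;
}

/* Per-byte |a - b| over n_words 32-bit words (hypothetical helper). */
void abs_diff_words(const uint32_t *restrict a,
                    const uint32_t *restrict b,
                    uint32_t *restrict dst,
                    size_t n_words)
{
    for (size_t i = 0; i < n_words; i++) {
        const uint32_t wa = a[i];
        const uint32_t wb = b[i];

        /* The word is assembled from the four lanes of this iteration only,
         * so it does not depend on the word written in the previous one. */
        dst[i] = lane_abs_diff(wa, wb, 0)
               | lane_abs_diff(wa, wb, 8)
               | lane_abs_diff(wa, wb, 16)
               | lane_abs_diff(wa, wb, 24);
    }
}
```

Whether the TI compiler then reports a dependency bound of 0 for this form is not something the thread confirms; the software pipeline information in the generated assembly is the place to check.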

0 Laurent Gauthier over 12 years ago

TI__Intellectual 1015 points

Hi,

I think you should check the C66 instruction set, specifically the SUBABS4 instruction (the _subabs4() intrinsic), which seems to be tailor-made for your case.
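For checking results on a host machine, the behaviour of SUBABS4 (four unsigned 8-bit lanes, each producing the absolute value of the difference) can be modelled in portable C. This is only a reference sketch; `subabs4_ref` is a hypothetical name and not part of the TI headers.

```c
#include <stdint.h>

/* Portable stand-in for the _subabs4() intrinsic (hypothetical name):
 * treats a and b as four packed unsigned bytes and returns the four
 * absolute differences, packed the same way. */
uint32_t subabs4_ref(uint32_t a, uint32_t b)
{
    uint32_t result = 0;

    for (unsigned lane = 0; lane < 4; lane++) {
        const unsigned shift = 8u * lane;
        const uint32_t x = (a >> shift) & 0xFFu;
        const uint32_t y = (b >> shift) & 0xFFu;

        /* Unsigned bytes: the difference always fits into 8 bits. */
        result |= (x > y ? x - y : y - x) << shift;
    }
    return result;
}
```

For example, `subabs4_ref(0x01020304, 0x04030201)` yields `0x03010103`.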

Here is the change I have made to your code:

#include <c6x.h>
#include <assert.h>
 
void absDiffImage(void* const restrict t_sourceA,
                  void* const restrict t_sourceB,
                  void* const restrict t_destination,
                  const int t_height,
                  const int t_width)
{
   const int a_loopMax = t_height * t_width;
   unsigned int* restrict a_sourceA = (unsigned int*)t_sourceA;
   unsigned int* restrict a_sourceB = (unsigned int*)t_sourceB;
   unsigned int* restrict a_destination = (unsigned int*)t_destination;

   _nassert(((int)a_sourceA % 8) == 0);
   _nassert(((int)a_sourceB % 8) == 0);
   _nassert(((int)a_destination % 8) == 0);
   #pragma MUST_ITERATE(1024,,32)
   for( int i = 0; i < a_loopMax/4; i++ ) { 
      // compute absolute difference
      a_destination[i] = _subabs4(a_sourceA[i], a_sourceB[i]);
   }
}

According to the assembly output, this loop is unrolled 2 times and has an ii (initiation interval) of 2 cycles.

Each unrolled iteration produces 8 absolute differences in those 2 cycles, so each absolute difference costs you 0.25 cycles in total.

Here is the assembly output I get:

;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : abs_diff.cc
;*      Loop source line                 : 19
;*      Loop opening brace source line   : 19
;*      Loop closing brace source line   : 22
;*      Loop Unroll Multiple             : 2x
;*      Known Minimum Trip Count         : 512                    
;*      Known Max Trip Count Factor      : 16
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        2*    
;*      .S units                     0        0     
;*      .D units                     1        2*    
;*      .M units                     0        0     
;*      .X cross paths               0        2*    
;*      .T address paths             1        2*    
;*      Long read paths              0        0     
;*      Long write paths             0        0     
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          0        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        1     
;*      Bound(.L .S .D .LS .LSD)     1        2*    
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 2  Schedule found with 4 iterations in parallel
;*
;*      Register Usage Table:
;*          +-----------------------------------------------------------------+
;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;*          |00000000001111111111222222222233|00000000001111111111222222222233|
;*          |01234567890123456789012345678901|01234567890123456789012345678901|
;*          |--------------------------------+--------------------------------|
;*       0: |   **                           |     ** **                      |
;*       1: |   ***                          |    ******                      |
;*          +-----------------------------------------------------------------+
;*
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      Minimum safe trip count       : 1 (after unrolling)
;*      Min. prof. trip count  (est.) : 2 (after unrolling)
;*
;*      Mem bank conflicts/iter(est.) : { min 0.000, est 0.250, max 1.000 }
;*      Mem bank perf. penalty (est.) : 11.1%
;*
;*      Effective ii                : { min 2.00, est 2.25, max 3.00 }
;*
;*
;*      Total cycles (est.)         : 6 + trip_cnt * 2        
;*----------------------------------------------------------------------------*
;*        SINGLE SCHEDULED ITERATION
;*
;*        $C$C21:
;*   0              LDDW    .D2T2   *B8++,B7:B6       ; |21| 
;*     ||           LDDW    .D1T1   *A3++,A5:A4       ; |21| 
;*   1              NOP             4
;*   5              SUBABS4 .L2X    B7,A5,B5          ; |21| 
;*   6              SUBABS4 .L2X    B6,A4,B4          ; |21| 
;*   7              STDW    .D2T2   B5:B4,*B9++       ; |21| 
;*     ||           SPBR            $C$C21
;*   8              ; BRANCHCC OCCURS {$C$C21}        ; |19| 
;*----------------------------------------------------------------------------*

I hope this helps, Laurent.

0 Laurent Gauthier over 12 years ago in reply to Laurent Gauthier

TI__Intellectual 1015 points

Oh, and specifying "#pragma UNROLL(4)" for the loop improves performance even further, down to 0.1875 cycles per absolute difference.
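The per-pixel figures can be reproduced from the software pipeline information. The sketch below uses hypothetical helper names. The 0.25 figure follows from the posted listing (ii = 2, unroll 2x, 4 bytes per source iteration). The listing for UNROLL(4) is not shown in the thread, so ii = 3 is an assumption here; it is simply the value that gives 3 / 16 = 0.1875.

```c
/* Cycles per output byte of a software-pipelined loop.
 * ii             : initiation interval, cycles per kernel iteration
 * unroll         : source iterations per kernel iteration
 * bytes_per_iter : bytes produced per source iteration (4 for _subabs4) */
double cycles_per_byte(unsigned ii, unsigned unroll, unsigned bytes_per_iter)
{
    return (double)ii / ((double)unroll * (double)bytes_per_iter);
}

/* Mirrors the "Total cycles (est.)" line: setup + trip_cnt * ii,
 * where trip_cnt counts kernel iterations (after unrolling). */
unsigned long estimated_cycles(unsigned setup, unsigned long trip_cnt, unsigned ii)
{
    return setup + trip_cnt * ii;
}
```

With the numbers from the listing, `cycles_per_byte(2, 2, 4)` gives 0.25. For a 256x256 image, 65536 bytes at 8 bytes per kernel iteration is 8192 iterations, so `estimated_cycles(6, 8192, 2)` gives 16390 cycles, not counting memory stalls.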

Cheers, Laurent.

0 Pay Giesselmann over 12 years ago in reply to Laurent Gauthier

Prodigy 250 points

Dear Laurent,

You don't know how much help that was! Thank you very much! I think on this basis I can go on to develop my own algorithms.

I improved the function a bit further, and now it's really fast enough.

void
TcMatrix::absDiffImage
(
    void* const restrict t_sourceA,
    void* const restrict t_sourceB,
    void* const restrict t_destination,
    const int t_height,
    const int t_width
)
{
    const int a_loopMax = t_height * t_width;
    unsigned int* restrict a_sourceA = (unsigned int*)t_sourceA;
    unsigned int* restrict a_sourceB = (unsigned int*)t_sourceB;
    unsigned int* restrict a_destination = (unsigned int*)t_destination;
    unsigned long long* restrict a_readA = (unsigned long long*)t_sourceA;
    unsigned long long* restrict a_readB = (unsigned long long*)t_sourceB;
    unsigned long long a_inA, a_inB;

    _nassert(((int)a_sourceA % 8) == 0);
    _nassert(((int)a_sourceB % 8) == 0);
    _nassert(((int)a_destination % 8) == 0);

    #pragma MUST_ITERATE(1024,,32)
    #pragma UNROLL( 2 )
    for( int i = 0; i < a_loopMax/8; i++ )
    {
        a_inA = a_readA[i];
        a_inB = a_readB[i];
        a_destination[2*i] = _subabs4((a_inA >> 32)&0xFFFFFFFF,(a_inB >> 32)&0xFFFFFFFF);
        a_destination[2*i+1] = _subabs4((a_inA)&0xFFFFFFFF,(a_inB)&0xFFFFFFFF);
    }
}

Pay Gießelmann

0 Laurent Gauthier over 12 years ago in reply to Pay Giesselmann

TI__Intellectual 1015 points

I think you could have kept the source code a bit simpler by using the original code I suggested with a #pragma UNROLL(4) on the loop.

I had posted a second reply, which you might not have seen, pointing out that UNROLL(4) improves performance further, down to 0.1875 cycles per absolute difference.

The compiler will do a good job of unrolling loops for you, and your code will be less verbose.

As a matter of style, if you want to unroll this loop manually, I would suggest the _hill() and _loll() intrinsics, as I think they make the code easier to read.

      a_destination[2*i] = _subabs4(_hill(a_inA),_hill(a_inB));
      a_destination[2*i+1] = _subabs4(_loll(a_inA),_loll(a_inB));
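One thing worth checking with either spelling of the manual unrolling is which half of the 64-bit value belongs to which output word. On a little-endian build, the low half of a 64-bit load holds the word at the lower address, so storing the high half to `a_destination[2*i]` would swap each pair of output words. The sketch below is a host-side check in portable C; `hi32`, `lo32` and `low_word_is_first` are hypothetical names that model `_hill()` and `_loll()`, not the intrinsics themselves.

```c
#include <stdint.h>
#include <string.h>

/* Models _hill(): upper 32 bits of a 64-bit value. */
uint32_t hi32(uint64_t v) { return (uint32_t)(v >> 32); }

/* Models _loll(): lower 32 bits of a 64-bit value. */
uint32_t lo32(uint64_t v) { return (uint32_t)(v & 0xFFFFFFFFu); }

/* Returns 1 if, on this machine, the low half of a 64-bit load is the
 * 32-bit word at the lower address (little-endian layout), else 0. */
int low_word_is_first(void)
{
    const uint32_t words[2] = { 0x11111111u, 0x22222222u };
    uint64_t v;

    memcpy(&v, words, sizeof v);   /* same bytes, viewed as one 64-bit value */
    return lo32(v) == words[0] && hi32(v) == words[1];
}
```

If `low_word_is_first()` returns 1 for the target configuration, the low half would go to index `2*i` and the high half to `2*i+1`; otherwise the order shown above applies.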

I'm glad this is helping you.

Best Regards, Laurent.

0 Pay Giesselmann over 12 years ago in reply to Laurent Gauthier

Prodigy 250 points

Hello,

Back again with a different function but almost the same topic. The function I wrote should multiply a matrix by a single-precision floating-point scalar. The input is interpreted as 16-bit unsigned and the output is 8-bit (the function is meant for calculating the average over multiple input images, so the output should be 8-bit again). What I have so far is the following:

void
matMultiply
(
    char* const restrict t_sourcePtr,
    char* const restrict t_destinationPtr,
    const float t_scale
)
{
    const int a_loopMax = 256*256;
    unsigned short* restrict a_readPtr = (unsigned short*)t_sourcePtr;
    unsigned char* const restrict a_dstPtr = (unsigned char*)t_destinationPtr;

    __x128_t a_read1_4;
    __x128_t a_scale1_4;
    __x128_t a_dst1_4;
    __float2_t a_read1_2, a_read3_4;
    __float2_t a_dst1_2, a_dst3_4;
    long long a_write1_2, a_write3_4;

    #pragma MUST_ITERATE(1024,,32)
    #pragma UNROLL( 1 )
    for( int i = 0; i < a_loopMax/4; i++ )
    {
        a_read1_2 = _dintspu( _itoll(a_readPtr[ 4*i ], a_readPtr[ 4*i +1 ]));
        a_read3_4 = _dintspu( _itoll(a_readPtr[ 4*i +2 ], a_readPtr[ 4*i +3 ]));
        a_read1_4 = _f2to128( a_read3_4, a_read1_2 );
        a_scale1_4 = _f2to128( _ftod( t_scale, t_scale ), _ftod( t_scale, t_scale ));

        a_dst1_4 = _qmpysp( a_read1_4, a_scale1_4 );

        a_dst1_2 = _lof2_128(a_dst1_4);
        a_dst3_4 = _hif2_128(a_dst1_4);

        a_write1_2 = _dspint(a_dst1_2);
        a_write3_4 = _dspint(a_dst3_4);

        a_dstPtr[ 4*i ] = (unsigned char)(a_write3_4 >> 32);
        a_dstPtr[ 4*i + 1 ] = (unsigned char)(a_write3_4);
        a_dstPtr[ 4*i + 2 ] = (unsigned char)(a_write1_2 >> 32);
        a_dstPtr[ 4*i + 3 ] = (unsigned char)(a_write1_2);
    }
}

The assembler output is as follows:

;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : ../source/main.cpp
;*      Loop source line                 : 169
;*      Loop opening brace source line   : 170
;*      Loop closing brace source line   : 189
;*      Known Minimum Trip Count         : 16384
;*      Known Maximum Trip Count         : 16384
;*      Known Max Trip Count Factor      : 16384
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 4
;*      Partitioned Resource Bound(*)    : 4
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     0        0
;*      .D units                     4*       4*
;*      .M units                     1        0
;*      .X cross paths               1        1
;*      .T address paths             4*       4*
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           5        1     (.L or .S unit)
;*      Addition ops (.LSD)          0        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             3        1
;*      Bound(.L .S .D .LS .LSD)     3        2
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 4  Schedule found with 5 iterations in parallel
;*
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      Minimum safe trip count       : 1
;*      Min. prof. trip count  (est.) : 2
;*
;*      Mem bank conflicts/iter(est.) : { min 1.000, est 1.250, max 3.000 }
;*      Mem bank perf. penalty (est.) : 23.8%
;*
;*      Effective ii                : { min 5.00, est 5.25, max 7.00 }
;*
;*      Total cycles (est.)         : 16 + min_trip_cnt * 4 = 65552
;*----------------------------------------------------------------------------*
;*       SETUP CODE
;*
;*                  MV              B17,A27
;*                  ADD             7,A27,A27
;*                  MV              B17,A28
;*                  ADD             6,A28,A28
;*                  MV              B17,B9
;*                  ADD             5,B9,B9
;*                  ADD             4,B17,B17
;*                  MV              B16,A3
;*                  ADD             4,A3,A3
;*                  MV              B16,A26
;*                  ADD             6,A26,A26
;*                  MV              B16,B8
;*                  ADD             2,B16,B16
;*
;*        SINGLE SCHEDULED ITERATION
;*
;*        $C$C28:
;*   0              LDHU    .D2T2   *B16++(8),B4      ; |174|
;*     ||           LDHU    .D1T1   *A26++(8),A20     ; |174|
;*   1              LDHU    .D2T2   *B8++(8),B5       ; |174|
;*   2              LDHU    .D1T1   *A3++(8),A21      ; |174|
;*   3              NOP             3
;*   6              DADD    .S1     0,A23:A22,A11:A10 ; |175|
;*   7              DINTSPU .S1X    B5:B4,A17:A16     ; |174|
;*     ||           DINTSPU .L1     A21:A20,A19:A18   ; |174|
;*   8              DADD    .S1     0,A23:A22,A9:A8   ; |175|
;*   9              NOP             1
;*  10              QMPYSP  .M1     A19:A18:A17:A16,A11:A10:A9:A8,A7:A6:A5:A4 ; |177|
;*  11              NOP             3
;*  14              DSPINT  .L1     A5:A4,A25:A24     ; |187|
;*  15              DSPINT  .L2X    A7:A6,B7:B6       ; |185|
;*  16              NOP             1
;*  17              STB     .D1T1   A24,*A27++(4)     ; |188|
;*  18              STB     .D2T2   B7,*B17++(4)      ; |185|
;*  19              STB     .D2T2   B6,*B9++(4)       ; |186|
;*     ||           STB     .D1T1   A25,*A28++(4)     ; |187|
;*     ||           SPBR            $C$C28
;*  20              ; BRANCHCC OCCURS {$C$C28}        ; |169|
;*----------------------------------------------------------------------------*

So far so good. Working out of L2 SRAM, I need about 122462 cycles to process a 256x256 array. Do you see anything in this function I could improve? And what about the general task of averaging, let's say, 10 incoming images?

Thanks again, best regards,

Pay Gießelmann

0 Laurent Gauthier over 12 years ago in reply to Pay Giesselmann

TI__Intellectual 1015 points

One thing you can improve a bit is to read the input 64 bits at a time and write the output 32 bits at a time.

I managed to get a 25% cycle count reduction this way. Note that the functionality of the code is not verified; specifically, on the two lines where I use the _hill() and _loll() intrinsics you might need to swap them. Also note that this gain in performance requires an UNROLL of at least 2. The code is provided below.

Another thing I suggest is that you try to get rid of these int->sp and sp->int conversions, either by going for an all-sp format or by switching to some form of simplified fixed-point arithmetic instead.
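The fixed-point route mentioned above could look roughly like the following scalar sketch in portable C. It is an illustration under assumptions, not a drop-in replacement: the scale factor is assumed to be non-negative and small enough to fit a Q16 value, `scale_to_q16` and `scale_pixel_q16` are hypothetical names, and the result is saturated to 255, whereas the plain casts in the posted code keep the low 8 bits.

```c
#include <stdint.h>

/* Converts a non-negative scale factor to Q16 fixed point.
 * Meant to be done once, outside the pixel loop. */
uint32_t scale_to_q16(float scale)
{
    return (uint32_t)(scale * 65536.0f + 0.5f);
}

/* out = round(in * scale), saturated to 0..255, integer operations only. */
uint8_t scale_pixel_q16(uint16_t in, uint32_t scale_q16)
{
    /* 64-bit product so that large scale factors cannot overflow. */
    const uint64_t product = (uint64_t)in * scale_q16;
    const uint64_t rounded = (product + 0x8000u) >> 16;   /* add 0.5, drop fraction */

    return (uint8_t)(rounded > 255u ? 255u : rounded);
}
```

For a scale of 0.5, `scale_to_q16` returns 32768 and `scale_pixel_q16(100, 32768)` returns 50. How well this maps onto the packed multiply instructions of the C66 would still have to be checked in the compiler output.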

Finally, I do not understand what you mean about getting the average of 10 incoming images... Do you mean that you do this mat_multiply for each of the 10 images and then create a 256x256 array which is the average of the 10 results of the mat_multiplys?

I hope this helps.

-- Laurent.

void
matMultiply
(
   char* const restrict t_sourcePtr,
   char* const restrict t_destinationPtr,
   const float t_scale
)
{
   const int a_loopMax = 256*256;
   unsigned short* restrict a_readPtr = (unsigned short*)t_sourcePtr;
   unsigned char* const restrict a_dstPtr = (unsigned char*)t_destinationPtr;
 
   __x128_t a_read1_4;
   __x128_t a_scale1_4;
   __x128_t a_dst1_4;
//   __float2_t a_read1_2, a_read3_4;
   __float2_t a_dst1_2, a_dst3_4;
   long long a_write1_2, a_write3_4;
   long long * restrict reader = (long long*)a_readPtr;
   uint32_t * restrict writer = (uint32_t*)a_dstPtr;
   long long input;
   
   #pragma MUST_ITERATE(1024,,32)
   #pragma UNROLL( 2 )
   for( int i = 0; i < a_loopMax/4; i++ )
   {   
      // a_read1_2 = _dintspu( _itoll(a_readPtr[ 4*i ], a_readPtr[ 4*i +1 ]));
      // a_read3_4 = _dintspu( _itoll(a_readPtr[ 4*i +2 ], a_readPtr[ 4*i +3 ]));
      // a_read1_4 = _f2to128( a_read3_4, a_read1_2 );
      input = *reader++;
      a_read1_4 = _f2to128(_dintspu(_unpkhu2(_hill(input))),_dintspu(_unpkhu2(_loll(input))));
      a_scale1_4 = _f2to128( _ftod( t_scale, t_scale ), _ftod( t_scale, t_scale ));
 
      a_dst1_4 = _qmpysp( a_read1_4, a_scale1_4 );
 
      a_dst1_2 = _lof2_128(a_dst1_4);
      a_dst3_4 = _hif2_128(a_dst1_4);
 
      a_write1_2 = _dspint(a_dst1_2);
      a_write3_4 = _dspint(a_dst3_4);
 
      *writer++ = _dspacku4(_pack2(_hill(a_write1_2),_loll(a_write1_2)),_pack2(_hill(a_write3_4),_loll(a_write3_4)));
      // a_dstPtr[ 4*i ] = (unsigned char)(a_write3_4 >> 32);
      // a_dstPtr[ 4*i + 1 ] = (unsigned char)(a_write3_4);
      // a_dstPtr[ 4*i + 2 ] = (unsigned char)(a_write1_2 >> 32);
      // a_dstPtr[ 4*i + 3 ] = (unsigned char)(a_write1_2);
   }
}
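Because the listing above is described as unverified, a plain scalar reference can help to check lane order and rounding on a small test image. The sketch below is portable C with a hypothetical name, `mat_multiply_ref`. It assumes a non-negative scale, rounds by adding 0.5 and truncating (the DSP conversion follows its configured rounding mode, so ties may differ), and keeps the low 8 bits like the casts in the first version of the function.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: dst[i] = low 8 bits of round(src[i] * scale).
 * Assumes scale >= 0 and products well below 2^32. */
void mat_multiply_ref(const uint16_t *src, uint8_t *dst, size_t count, float scale)
{
    for (size_t i = 0; i < count; i++) {
        const float product = (float)src[i] * scale;
        const uint32_t rounded = (uint32_t)(product + 0.5f);  /* round half up */

        dst[i] = (uint8_t)rounded;   /* wraps above 255, no saturation */
    }
}
```

Comparing the output of the optimized loop against this reference for an input such as 10, 20, 31, 600 shows quickly whether pairs of pixels are swapped and whether values above 255 wrap or saturate.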

0 Pay Giesselmann over 12 years ago in reply to Laurent Gauthier

Prodigy 250 points

Hey Laurent,

That's cool, thank you. I will try this code next week. My goal is to calculate the average of 10 incoming pictures, i.e. sum them all up and divide elementwise.
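As a scalar illustration of that goal (sum up, then divide elementwise), here is a minimal sketch in portable C. It assumes 8-bit input images and a 16-bit accumulator that starts at zero; `accumulate_image` and `finish_average` are hypothetical names, and nothing here is tuned for the C66.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds one 8-bit image into a 16-bit accumulator.
 * Up to 257 images fit without overflow (257 * 255 = 65535). */
void accumulate_image(uint16_t *restrict sum,
                      const uint8_t *restrict image,
                      size_t count)
{
    for (size_t i = 0; i < count; i++)
        sum[i] = (uint16_t)(sum[i] + image[i]);
}

/* Divides the accumulator by the number of images, rounding to nearest. */
void finish_average(const uint16_t *restrict sum,
                    uint8_t *restrict avg,
                    size_t count,
                    unsigned n_images)
{
    for (size_t i = 0; i < count; i++)
        avg[i] = (uint8_t)((sum[i] + n_images / 2u) / n_images);
}
```

Usage would be one `accumulate_image` call per incoming picture and a single `finish_average` call with `n_images` set to 10. The final division by a constant is the step that a multiply by a reciprocal would replace in an optimized version.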

Pay Gießelmann

0 Livio Lima over 12 years ago in reply to Pay Giesselmann

Prodigy 150 points

Dear Pay,

Thank you for this useful post. I tested your absolute difference function on my DM8168 platform, running the algorithm on the DSP side. Unfortunately, I got unexpected results: with 1280x1024 images the algorithm takes approximately 90 ms, which I don't think is acceptable.

Can you please tell me the execution time you experience?

Thank you

0 Pay Giesselmann over 12 years ago in reply to Livio Lima

Prodigy 250 points

Livio,

Sorry for the late answer, I wasn't notified about your post. I've run the absdiff for your image size. These bigger images don't fit into my L2, so I had to work out of DDR SDRAM, which is much slower. Nevertheless I got a runtime of 4546689 cycles, which means about 4.5 ms. I'm not familiar with the DM8168 and not sure whether the DSP inside supports the code. As you can see, I used the intrinsic _subabs4(). Maybe you could check whether it's available on your platform.
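To put the measured number in relation to the image size, the cycle count can be converted to cycles per pixel and to milliseconds. The helpers below are a small sketch with hypothetical names. The 1 GHz clock is an assumption: it is what the conversion from 4546689 cycles to about 4.5 ms implies, but the actual core clock is not stated in the thread.

```c
/* Average cost per pixel for a measured cycle count. */
double cycles_per_pixel(unsigned long cycles, unsigned width, unsigned height)
{
    return (double)cycles / ((double)width * (double)height);
}

/* Converts a cycle count to milliseconds for a given core clock in Hz. */
double cycles_to_ms(unsigned long cycles, double clock_hz)
{
    return 1000.0 * (double)cycles / clock_hz;
}
```

For the 1280x1024 run, `cycles_per_pixel(4546689, 1280, 1024)` comes out at roughly 3.47 cycles per pixel, compared with the 0.25 cycles per pixel of the loop kernel itself, which suggests that the DDR accesses dominate the runtime rather than the arithmetic.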

best regards

Pay
