c6748 intrinsics MPY



Hi all:
I'm a college student who is new to DSP optimization. The DSP we use is the C6748, which has the features of both the C64x+ and the C67x. Two questions confuse me:

Suppose the following case:

int8 a[1000] = {45,32,-33,9,-87.....}
int8 b[1000] = {1,-1,1,1,-1..............}

and so on. Both arrays are 64-bit aligned.

Goal: compute the packed product a x b.

1. How can I use intrinsics to implement the packed a x b? I only found "long long _mpysu4ll(int src1, unsigned src2)" and "long long _mpyu4ll(unsigned src1, unsigned src2)". Is there something like "long long _mpy4ll(int src1, int src2)"?

2. Is there an efficient way to change the signs (+/-) of the members of a signed 8-bit array, as the case above shows?

  • There does not appear to be an intrinsic that does exactly what you want.

    I recommend you check out the DSPLIB package, and see if it contains a function which does what you want, or something close.  More details on this idea are in this forum thread.

    Thanks and regards,

    -George
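
    Whichever route is taken (a DSPLIB routine or hand-written intrinsics), a plain scalar version is useful as a reference to check results against. The following is a minimal sketch in portable C; the name mpy8_ref and the choice of 16-bit outputs are assumptions made for illustration and do not come from DSPLIB or the compiler.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference (hypothetical helper): out[i] = a[i] * b[i].
   The product is widened to 16 bits so that the largest case,
   -128 * -128 = 16384, still fits. */
static void mpy8_ref(int16_t *out, const int8_t *a, const int8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = (int16_t)((int16_t)a[i] * (int16_t)b[i]);
    }
}
```

    An optimized version can then be compared against mpy8_ref element by element on the same input arrays.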

  • striker Qian,

    You might consider changing the sign array from

    int8 b[1000] = {1,-1,1,1,-1..............}

    to

    int8 b[1000] = {0x00,0xFF,0x00,0x00,0xFF..............}
    int8 c[1000] = {0,1,0,0,1..............}

    then do an XOR (32 bits) with b followed by ADD4 with c:

    d = a XOR b
    r = d ADD4 c

    In a well-pipelined loop, this can achieve single-cycle performance, since XOR and ADD4 can use different execution units. Ideally, you could achieve 8 bytes sign-changed per cycle, but I regret that will be an exercise in optimization left for you. Let us know if you try it or if you get good results.

    Regards,
    RandyP
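
    The XOR-then-ADD4 idea is a per-byte two's-complement negation: XOR with 0xFF inverts the byte and adding 1 completes the negation, while XOR with 0x00 and adding 0 leaves the byte unchanged. The following is a minimal sketch in portable C so it can be checked on a host machine. add4_emul and flip_signs are hypothetical names; add4_emul only imitates what ADD4 does (four independent byte additions with no carry between bytes) and is not the TI intrinsic.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for ADD4: four independent 8-bit additions
   packed in one 32-bit word, with no carry between bytes. */
static uint32_t add4_emul(uint32_t x, uint32_t y)
{
    /* Add the low 7 bits of every byte; the sum cannot leave its byte. */
    uint32_t low7 = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu);
    /* Fold in the top bit of every byte without a carry-out. */
    return low7 ^ ((x ^ y) & 0x80808080u);
}

/* r[i] = -a[i] where mask[i] == 0xFF, r[i] = a[i] where mask[i] == 0x00.
   one[i] must be 1 where mask[i] == 0xFF and 0 elsewhere.
   n must be a multiple of 4. Note that -128 wraps to -128. */
static void flip_signs(int8_t *r, const int8_t *a,
                       const uint8_t *mask, const uint8_t *one, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        uint32_t va, vm, vo, vr;
        memcpy(&va, a + i, 4);      /* load 4 data bytes   */
        memcpy(&vm, mask + i, 4);   /* load 4 mask bytes   */
        memcpy(&vo, one + i, 4);    /* load 4 "+1" bytes   */
        vr = add4_emul(va ^ vm, vo);
        memcpy(r + i, &vr, 4);      /* store 4 result bytes */
    }
}
```

    With a = {45, 32, -33, 9}, mask = {0x00, 0xFF, 0x00, 0x00} and one = {0, 1, 0, 0}, the result is {45, -32, -33, 9}, which matches multiplying by {1, -1, 1, 1}.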

  • Hi,RandyP:

    Thanks for your kind help! I've tried this to optimize my code:

     
    #pragma UNROLL(4)
    for (i = 0; i < 10000; i = i + 4)
    {
        _amem4(r+i) = _add4((_amem4(a+i)) ^ (_amem4(b+i)), _amem4(c+i));
    }


    The software pipeline information is as follows:

    ;*      Loop source line                 : 147
    ;*      Loop opening brace source line   : 148
    ;*      Loop closing brace source line   : 150
    ;*      Loop Unroll Multiple             : 4x
    ;*      Known Minimum Trip Count         : 625                   
    ;*      Known Maximum Trip Count         : 625                   
    ;*      Known Max Trip Count Factor      : 625
    ;*      Loop Carried Dependency Bound(^) : 0
    ;*      Unpartitioned Resource Bound     : 4
    ;*      Partitioned Resource Bound(*)    : 4
    ;*      Resource Partition:
    ;*                                A-side   B-side
    ;*      .L units                     2        2    
    ;*      .S units                     0        0    
    ;*      .D units                     4*       4*   
    ;*      .M units                     0        0    
    ;*      .X cross paths               2        3    
    ;*      .T address paths             4*       4*   
    ;*      Long read paths              0        0    
    ;*      Long write paths             0        0    
    ;*      Logical  ops (.LS)           0        0     (.L or .S unit)
    ;*      Addition ops (.LSD)          1        3     (.L or .S or .D unit)
    ;*      Bound(.L .S .LS)             1        1    
    ;*      Bound(.L .S .D .LS .LSD)     3        3    
    ;*
    ;*      Searching for software pipeline schedule at ...
    ;*         ii = 4  Schedule found with 3 iterations in parallel
    ;*
    ;*      Register Usage Table:
    ;*          +-----------------------------------------------------------------+
    ;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
    ;*          |00000000001111111111222222222233|00000000001111111111222222222233|
    ;*          |01234567890123456789012345678901|01234567890123456789012345678901|
    ;*          |--------------------------------+--------------------------------|
    ;*       0: |   ******         *             |    **** *        ****          |
    ;*       1: |   *****        ***             |    ******      *   **          |
    ;*       2: |   *******      * *             |    ******        ****          |
    ;*       3: |   *  ****        *             |    ******      ******          |
    ;*          +-----------------------------------------------------------------+
    ;*
    ;*      Done
    ;*
    ;*      Loop will be splooped
    ;*      Collapsed epilog stages       : 0
    ;*      Collapsed prolog stages       : 0
    ;*      Minimum required memory pad   : 0 bytes
    ;*
    ;*      For further improvement on this loop, try option -mh16
    ;*
    ;*      Minimum safe trip count       : 1 (after unrolling)
    ;*      Min. prof. trip count  (est.) : 2 (after unrolling)
    ;*
    ;*      Mem bank conflicts/iter(est.) : { min 0.000, est 1.000, max 4.000 }
    ;*      Mem bank perf. penalty (est.) : 20.0%
    ;*
    ;*      Effective ii                : { min 4.00, est 5.00, max 8.00 }
    ;*
    ;*
    ;*      Total cycles (est.)         : 8 + min_trip_cnt * 4 = 2508       
    ;*----------------------------------------------------------------------------*
    ;*       SETUP CODE
    ;*
    ;*                  MV              B4,A3
    ;*                  ADD             8,A3,A3
    ;*                  MV              B5,B7
    ;*                  ADD             8,B7,B7
    ;*                  MV              A18,A7
    ;*                  ADD             8,A7,A7
    ;*                  MV              B6,A6
    ;*                  ADD             8,A6,A6
    ;*
    ;*        SINGLE SCHEDULED ITERATION
    ;*
    ;*        $C$C308:
    ;*   0              LDDW    .D2T2   *B4++(16),B21:B20 ; |149|
    ;*   1              LDDW    .D2T2   *B5++(16),B19:B18 ; |149|
    ;*     ||           LDDW    .D1T1   *A3++(16),A9:A8   ; |149|
    ;*   2              LDDW    .D2T2   *B7++(16),B17:B16 ; |149|
    ;*   3              LDDW    .D1T1   *A7++(16),A5:A4   ; |149|
    ;*   4              LDDW    .D1T1   *A18++(16),A17:A16 ; |149|
    ;*   5              NOP             2
    ;*   7              XOR     .L1X    B17,A9,A8         ; |149|
    ;*     ||           XOR     .L2X    B16,A8,B9         ; |149|
    ;*   8              XOR     .L2     B19,B21,B16       ; |149|
    ;*     ||           XOR     .S2     B18,B20,B8        ; |149|
    ;*     ||           ADD4    .L1     A8,A5,A5          ; |149|
    ;*   9              ADD4    .L2X    B16,A17,B9        ; |149|
    ;*     ||           ADD4    .L1X    B9,A4,A4          ; |149|
    ;*  10              ADD4    .L2X    B8,A16,B8         ; |149|
    ;*     ||           STDW    .D1T1   A5:A4,*A6++(16)   ; |149|
    ;*  11              STDW    .D2T2   B9:B8,*B6++(16)   ; |149|
    ;*     ||           SPBR            $C$C308
    ;*  12              ; BRANCHCC OCCURS {$C$C308}       ; |147|
    ;*----------------------------------------------------------------------------*


    It seems I have successfully changed the signs of 4 bytes per cycle (16 bytes per unrolled iteration at ii = 4), disregarding any "Mem bank perf. penalty".

    However, how can I make it better? Do I need to write Linear Assembly instead?

  • Thanks a lot. I'll try to check out the C674x DSPLIB.

  • striker Qian said:
    seems like I have successfully changed 4 bytes signs per cycle

    If that was your goal, then you are done. Anything more is research, and it may depend on the goals of your university project.

    striker Qian said:
    Do I need to write Linear Assembly instead?

    Do you see any way to improve the code generated by the compiler? If not, do not take the time to try other optimizations. If you do see ways to improve it, try other optimization techniques. The TI Wiki Pages have many articles, as well as a C6000 Optimization Workshop, that you can study to learn how to do more.

    Regards,
    RandyP
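
    One possible direction, not measured on hardware: the pipeline report above marks the .D units and .T address paths as the bottleneck, because each result needs three loads (a, b, c) and one store. The c array is fully determined by b, since 0xFF & 0x01 = 1 and 0x00 & 0x01 = 0, so the third load stream can be replaced by an AND. The following is a minimal sketch in portable C; add4_emul and flip_signs_mask_only are hypothetical names, and add4_emul only imitates ADD4 on a host machine. Whether this actually lowers the ii on the C6748 is an assumption that would need to be checked in the compiler's pipeline report.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for ADD4: four independent 8-bit additions
   packed in one 32-bit word, with no carry between bytes. */
static uint32_t add4_emul(uint32_t x, uint32_t y)
{
    uint32_t low7 = (x & 0x7F7F7F7Fu) + (y & 0x7F7F7F7Fu);
    return low7 ^ ((x ^ y) & 0x80808080u);
}

/* Same result as XOR with b followed by ADD4 with c, but the "+1"
   word is derived from the mask instead of being loaded from memory.
   mask[i] must be 0xFF (negate) or 0x00 (keep); n a multiple of 4. */
static void flip_signs_mask_only(int8_t *r, const int8_t *a,
                                 const uint8_t *mask, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        uint32_t va, vm, vr;
        memcpy(&va, a + i, 4);
        memcpy(&vm, mask + i, 4);
        /* vm & 0x01010101 yields 1 in every byte to be negated. */
        vr = add4_emul(va ^ vm, vm & 0x01010101u);
        memcpy(r + i, &vr, 4);
    }
}
```

    This trades one load per word for one logical operation, so it only helps if the logical units have spare capacity once the memory accesses are reduced.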