Problem engaging DUAL (parallel) MAC unit c5517



Hello everybody,

I'm having trouble engaging the parallel (dual) MAC unit on my C5517 DSP. Here is the C code of my fast multiplication (it is used in an IIR filter):

**************************************************************************************************************************************************

inline void uber_mult(onchip Q15 *ca, onchip Q15 *cb, onchip Q15 *x, t_Audio * restrict s, t_Audio * restrict r){

onchip Q15 *cal = ca + 1;
onchip Q15 *cbl = cb + 1;
onchip Q15 *xl = x + 1;

int40_t p1a = _smacsui(0, *(ca), *xl);
int40_t p1b = _smacsui(0, *(cb), *xl);

int40_t p2a = _llsmacsui(p1a, *x, *(cal));
int40_t p2b = _llsmacsui(p1b, *x, *(cbl));

p1a = _llsshl(p2a, -15);
p1b = _llsshl(p2b, -15);

*r = _smac(p1a, *ca, *x);
*s = _smac(p1b, *cb, *x);

}

********************************************************************************************************************************
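
For readers without the C55x toolchain, the arithmetic that the intrinsics above appear to perform can be modelled in portable C. This is only a host-side sketch under stated assumptions, not TI's definition of the intrinsics: it assumes each 32-bit value is stored most-significant word first (so `*ca` is the signed upper word and `*(ca + 1)` the unsigned lower word), that the `_smacsui`/`_llsmacsui` steps are plain signed-by-unsigned multiply-accumulates, and that the final `_smac` adds the doubled product of the two upper words. Saturation is deliberately left out. The names `asr_floor`, `q31_mul_model` and `q31_mul_exact` are hypothetical.

```c
#include <stdint.h>

/* Arithmetic shift right with floor rounding, written out so the result
   does not depend on how the host compiler shifts negative values. */
int64_t asr_floor(int64_t v, unsigned n)
{
    if (v >= 0)
        return v >> n;
    return -((-v + ((int64_t)1 << n) - 1) >> n);
}

/* Host-side model of one branch of uber_mult (assumed semantics,
   no saturation): a 32x32 fractional multiply without the low*low term. */
int32_t q31_mul_model(int32_t c, int32_t x)
{
    int32_t  c_hi = (int32_t)asr_floor(c, 16);  /* signed upper word, like *ca         */
    uint32_t c_lo = (uint32_t)c & 0xFFFFu;      /* unsigned lower word, like *(ca + 1) */
    int32_t  x_hi = (int32_t)asr_floor(x, 16);  /* signed upper word, like *x          */
    uint32_t x_lo = (uint32_t)x & 0xFFFFu;      /* unsigned lower word, like *(x + 1)  */

    /* The two signed-by-unsigned cross products, accumulated in a wide register. */
    int64_t cross = (int64_t)c_hi * (int64_t)x_lo
                  + (int64_t)x_hi * (int64_t)c_lo;

    /* Shift the cross terms right by 15, then add the doubled high*high product. */
    int64_t acc = asr_floor(cross, 15);
    acc += 2 * (int64_t)c_hi * (int64_t)x_hi;
    return (int32_t)acc;
}

/* Reference: full 64-bit product shifted right by 31. */
int32_t q31_mul_exact(int32_t c, int32_t x)
{
    return (int32_t)asr_floor((int64_t)c * (int64_t)x, 31);
}
```

Because the low-by-low partial product is dropped, the model can fall short of the full-precision result by at most 2 least significant bits, which is the usual trade-off of this kind of split multiplication.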

And here is the assembler generated by the optimizer: (from the corresponding .asm file)

********************************************************************************************************************************

$C$DW$L$_alg_fpIIR_perform$4$B:
;** 560 ----------------------- p1a = _smacsui(0L, *(onchip int *)ca, (unsigned)*((int *)data+1)); // [7]
;** 561 ----------------------- p1b = _smacsui(0L, *(onchip int *)cb, (unsigned)*((int *)data+1)); // [7]
;** 563 ----------------------- p2a = _llsmacsui(p1a, *(int *)data, (unsigned)*((onchip int *)ca+1)); // [7]
;** 564 ----------------------- p2b = _llsmacsui(p1b, *(int *)data, (unsigned)*((onchip int *)cb+1)); // [7]
;** 172 ----------------------- y = _lsadd(y, _smac((long)_llsshl(p2b, (-15)), *(onchip int *)cb, *(int *)data));
;** 173 ----------------------- y$60 = _lsadd(k, _smac((long)_llsshl(p2a, (-15)), *(onchip int *)ca, *(int *)data));
;** 173 ----------------------- k = y$60;
;** 174 ----------------------- if ( (--data) >= delay_base ) goto g6;
;** 174 ----------------------- data += 11;
;** -----------------------g6:
;** 170 ----------------------- ++cb;
;** 170 ----------------------- ++ca;
;** 165 ----------------------- if ( --L$2 != -1 ) goto g12;
MOV *AR4(short(#1)) << #16, AC0

BCLR ST1_FRCT
|| MOV *AR4(short(#1)) << #16, AC2

SFTL AC2, #0, AC2 ; |561|
|| MOV #0, AC1 ; |560|

AMAR *AR4, XCDP
|| SFTL AC0, #0, AC0 ; |560|

MOV #0, AC0 ; |561|
|| MACM *AR2, AC0, AC1 ; |560|

MACM *AR3, AC2, AC0 ; |561|
MOV *AR2(short(#1)) << #16, AC2

BSET ST1_M40
|| SFTL AC2, #0, AC2 ; |563|

MACM *AR4, AC2, AC1 ; |563|

MOV *AR3(short(#1)) << #16, AC2
|| BCLR ST1_M40

BSET ST1_M40
|| SFTL AC2, #0, AC2 ; |564|

MACM *AR4, AC2, AC0 ; |564|

CMPU AR4 >= T2, TC1 ; |174|
|| ASUB #2, AR4 ; |174|

SFTS AC0, #-15, AC0 ; |172|
|| BSET ST1_FRCT

XCC !TC1 ||
AADD #22, AR4 ; |174|

BCLR ST1_M40
|| SFTS AC1, #-15, AC2 ; |173|

MAC *AR3, *CDP, AC0 :: MAC *AR2, *CDP, AC2
ADD dbl(*SP(#4)), AC0, AC0 ; |172|
MOV AC0, dbl(*SP(#4)) ; |172|

ADD dbl(*SP(#0)), AC2, AC0 ; |173|
|| AADD #2, AR3 ; |170|

.dwpsn file "../src/DSP/decimation_filter.c",line 176,column 0,is_stmt

MOV AC0, dbl(*SP(#0)) ; |173|
|| AADD #2, AR2 ; |170|

**********************************************************************************************************************************************************************

As you can see, there is only one MAC :: MAC operation, which most likely corresponds to the two consecutive _smac calls in the C code.

The rest doesn't look very well optimized as far as MAC units are concerned. There are some MACM operations that are followed by a MOV, but there is no "::" in between, so they are not considered a MACM::MOV parallelized operation.

Even then, moving data back to memory (MOV) doesn't seem the best thing to do for speed; I would have preferred the whole function to be completed using only registers.

Does anybody understand what I'm doing wrong?

Thank you very much for your attention.

  • Vittorio,

    I'm not sure if this is the right forum for your question, so let me see if I can move this post to another forum which can help you.

  • Please see if this wiki article on C5500 compiler optimization is helpful.

    Thanks and regards,

    -George

  • Hello George,

    That article is exactly where I started reading up on this topic about four days ago. The problem is that, although it looks like I did exactly what the presentation suggests, the compiler still seems to deliver sub-optimal assembly, and I don't understand why.

    Regards,

    V.

  • Please submit a test case I can compile, by preprocessing the source file which contains the function uber_mult, and attaching it to your next post.  Also show the version of the compiler and the build options as the compiler sees them.

    Thanks and regards,

    -George

  • Hello George, thanks for the answer.

    Here is a simple example you can compile and run (CCS 5.5):

    *****************************************************************************************************************************

    #include <stdint.h>
    #include <stdio.h>

    inline void uber_mult(onchip int *ca, onchip int *cb, onchip int *x, long * restrict s, long * restrict r){

    onchip int *cal = ca + 1;
    onchip int *cbl = cb + 1;
    onchip int *xl = x + 1;

    int40_t p1a = _smacsui(0, *(ca), *xl);
    int40_t p1b = _smacsui(0, *(cb), *xl);

    int40_t p2a = _llsmacsui(p1a, *x, *(cal));
    int40_t p2b = _llsmacsui(p1b, *x, *(cbl));

    p1a = _llsshl(p2a, -15);
    p1b = _llsshl(p2b, -15);

    *r = _smac(p1a, *ca, *x);
    *s = _smac(p1b, *cb, *x);

    }

    int main(){

    long operand = 0x7FFFFFFF;
    long coeff_a = 0x12345678;
    long coeff_b = 0x87654321;

    long r, s;

    int i;

    for(i=0;i<10;i++){
    uber_mult((int *)&coeff_a, (int *)&coeff_b, (int *)&operand, &s, &r);
    coeff_a++;
    coeff_b--;
    printf("\n%ld\n%ld", s, r); /* s and r are long, so %ld is the matching format */
    }

    while(1);

    }

    *****************************************************************************************************************

    I don't know why, CCS says there is a syntax error in the uber_mult definition, but then the compiler runs without any problem. The resulting assembly is slightly different from what my own code produces, but I'm sorry, I can't make that code public. In my project at least one MAC::MAC operation is generated, whereas here no parallelism at all seems to take place on the MAC unit.

    I am using compiler version v4.4.1.

    Here is the compiler command line:

    -v5517 --memory_model=small -O3 --include_path="C:/ti/ccsv5/tools/compiler/c5500_4.4.1/include" --define=c5517 --display_error_number --diag_warning=225 --align_functions --ptrdiff_size=16 --optimizer_interlist --remove_hooks_when_inlining --opt_for_speed=5 --gen_opt_info=2


    I hope that helps.

    Regards,
    V.
  • I am unable to get a dual MAC from this C code.  I suspect it is not possible.  Consider rewriting this function in assembly.  Or see if some function from C55x DSPLIB can be used.  

    Vittorio Pascucci said:
    I don't know why, CCS says there is a syntax error in the uber_mult definition, but then the compiler runs without any problem.

    I never saw any syntax error.

    I did see the compiler complain about your use of the switch -v5517.  That device was added after the compiler was released.  The compiler cares about the CPU core in the device, and not any peripherals.  Based on this list of technical documents for the C5517, I surmise that it uses a version 3.X CPU core.  Please confirm that by starting a thread on the C5000 device forum.  If that is correct, then you should change -v5517 to -vcpu:3.X .

    Thanks and regards,

    -George

    That's right: using v5515 the compiler doesn't complain any more. I usually use that option, but I forgot to change it in the small project I created for this test case. Anyway, as you can see in my first post, with similar code (which I can't make public) I was able to trigger at least one dual MAC operation.

    There the uber_mult function was running in a loop very similarly to the code I just posted.

    Is there a way to know what this depends on? How can I direct the behavior of the compiler more effectively? I wrote my function following the rules outlined in the "tips and tricks" you suggested as well, always writing two consecutive multiplications that share an operand, so I don't understand what is wrong.

    In addition, would it be possible to have more documentation about the intrinsics? In particular, the corresponding assembly code would be welcome. In the compiler guide only the _smac intrinsic is documented, and I couldn't find any information about _smacsui, for example (I figured out what it does, so I use it, but I need more control over my code).

    Rewriting all in assembler would be the worst option, because I don't have much time.

    Regards,

    V.

  • Vittorio Pascucci said:
    I wrote my function following the rules outlined in the "tips and tricks"

    That is the best information we have on the topic.  My guess is that you are not doing anything wrong, but you are experiencing a performance deficiency in the compiler.  
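
For reference, the rule mentioned in this exchange (two consecutive multiplications that share an operand) can be sketched in plain C as a loop with two accumulators that consume the same coefficient in every iteration. This is a minimal sketch, not code from the wiki article or from the original project: `dual_dot` and its parameters are hypothetical names, the shared operand is marked `onchip` to mirror the function in the first post, and `__TMS320C55X__` is assumed to be the macro the TI compiler predefines. Nothing here guarantees that the compiler emits MAC :: MAC for it.

```c
#include <stdint.h>

/* `onchip` is a TI C55x keyword; define it away when building on a host.
   __TMS320C55X__ is assumed to be the macro the TI compiler predefines. */
#ifndef __TMS320C55X__
#define onchip
#endif

/* Two multiply-accumulates per iteration that share one operand (coef[i]).
   Each accumulator has its own data stream, and the two updates have no
   dependency on each other. */
void dual_dot(const int16_t *xa, const int16_t *xb,
              onchip int16_t *coef, int n,
              int32_t *sum_a, int32_t *sum_b)
{
    int32_t acc_a = 0;
    int32_t acc_b = 0;
    int i;

    for (i = 0; i < n; i++) {
        acc_a += (int32_t)xa[i] * coef[i];  /* first MAC            */
        acc_b += (int32_t)xb[i] * coef[i];  /* second MAC, same coef */
    }

    /* Results leave the accumulators only once, after the loop. */
    *sum_a = acc_a;
    *sum_b = acc_b;
}
```

Keeping both sums in local variables until the loop ends is a deliberate choice in this sketch: it avoids a store to memory inside the loop body, which is one of the things the first post complains about in the generated assembly.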

    Vittorio Pascucci said:
    would it be possible to have more documentation about the intrinsics?

    They are discussed in the section titled Descriptions of C55x Intrinsics in the C55x compiler manual.

    Vittorio Pascucci said:
    Rewriting all in assembler would be the worst option, because I don't have much time.

    Please reconsider my suggestion to use C55x DSPLIB.

    Thanks and regards,

    -George

  • Ok, I am very grateful for your help.

    By the way, from a careful analysis of the assembly, I believe that the compiler is actually generating the best possible code for the operation I am running. In fact, even if I can't see many (or sometimes any) dual MAC, the parallelism is good anyway.

    The abundance of local variables in my function (and in the loop) is probably a problem, considering the limited number of registers that can be used as arguments for a MAC::MAC (AC0-3 and T0-1). My suspicion is that preparing the right registers with the right values (considering also the inter-dependencies) to generate MAC::MACs at all costs might take longer than running just one MAC unit and executing other instructions in parallel, as it does now.

    Thank you very much for your support, for the time being we will accept the performance we have now. Should we still run short of computational power we will switch to assembler for good.

    Kind regards,

    V.