TMS320F28377D: Dual MAC Required Build Options

Part Number: TMS320F28377D

Hi,

My customer wants to speed up multiply-accumulate loops like the following because of execution-time problems.

for (i = 0; i < length; i++) result += (int32_t) p1[i] * p2[i];

As we know, using DMAC requires --opt_level to be set to 2 or higher. However, to ensure the reliability of the application, the customer is unwilling to increase the optimization level (--opt_level=0). Therefore, I would like to ask whether DMAC can be enabled through configuration in this case. If not, is there a better solution?

  • The compiler supports an intrinsic named __dmac.  Using it causes a DMAC instruction to be generated, even if you build with --opt_level=0.  However, this one change is unlikely to make a significant improvement in performance.  For more information about this intrinsic, please search the C28x compiler manual for the sub-chapter titled __dmac Intrinsic.  Or, see pages 25-30 of the C28x optimization presentation available in the article TI Compiler Presentations.
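
    As a rough illustration of what the intrinsic computes, the sketch below models one __dmac step in portable C so that it can be run on a host PC. It assumes a shift of 0 and 16-bit signed inputs packed two per 32-bit word. The names dmac_model and dot_i16_pairs are hypothetical, not TI APIs, and the sketch says nothing about cycle counts on the device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side model of one __dmac step (hypothetical helper, not a TI API).
 * Each 32-bit operand is treated as two packed 16-bit signed values, and
 * one product is added to each accumulator. A shift of 0 is assumed. */
static void dmac_model(const int16_t *src1, const int16_t *src2,
                       int32_t *acc1, int32_t *acc2)
{
    *acc1 += (int32_t)src1[1] * src2[1];  /* one half of the 32-bit word */
    *acc2 += (int32_t)src1[0] * src2[0];  /* the other half */
}

/* Dot product of two int16_t arrays, consumed two elements at a time. */
int32_t dot_i16_pairs(const int16_t *p1, const int16_t *p2, size_t length)
{
    int32_t acc1 = 0;
    int32_t acc2 = 0;
    size_t i;

    assert(length % 2 == 0);  /* elements are consumed in pairs */

    for (i = 0; i < length; i += 2)
        dmac_model(&p1[i], &p2[i], &acc1, &acc2);

    /* The two partial sums are combined once, after the loop. */
    return acc1 + acc2;
}
```

    For example, with p1 = {1, 2, 3, 4} and p2 = {5, 6, 7, 8}, dot_i16_pairs(p1, p2, 4) returns 70, the same value the plain loop in the question produces.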

    Some other methods to consider ...

    Use the DSP libraries available in C2000Ware.

    If your program is organized as a CCS project, then use File Specific Options to increase the setting of --opt_level for only 1 or 2 files in the project.

    Use #pragma FUNCTION_OPTIONS to increase the optimization level of just one function in a file.  To learn more about it, search for FUNCTION_OPTIONS in the C28x compiler manual.
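
    A minimal sketch of the pragma approach, assuming the hot loop has been moved into its own function. The function name dot_product and its parameters are hypothetical. The pragma is guarded so that the same file still compiles with a non-TI host compiler, where it simply has no effect.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __TI_COMPILER_VERSION__
/* TI compilers only: request a higher optimization level for this one
 * function. The file itself is assumed to be built with the optimizer
 * enabled (at least --opt_level=0). */
#pragma FUNCTION_OPTIONS(dot_product, "--opt_level=2")
#endif

/* Hypothetical hot function: 16-bit x 16-bit multiply-accumulate loop. */
int32_t dot_product(const int16_t *p1, const int16_t *p2, int16_t length)
{
    int32_t result = 0;
    int16_t i;

    assert(length >= 0);

    for (i = 0; i < length; i++)
        result += (int32_t)p1[i] * p2[i];

    return result;
}
```

    Whether the compiler then emits a repeated DMAC for this loop still has to be checked in the generated assembly.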

    Thanks and regards,

    -George

  • Hi George,

    Thanks for your reply!  This is a good solution, I will give you feedback after trying it!

  • Hi George,

    As you suggested, I used the __dmac intrinsic. For the performance comparison, I wrote the following two loops:

    • for (i = 0; i < 3; i++)
      {__dmac(p1_32[i], p2_32[i], acc1, acc2, 0);}

    • for (i = 0; i < 3; i++)
      {re2 += (int32_t) p1_32[i] * p2_32[i];}

    I used a 28388 for testing and found that both loops took 76 cycles. I checked the assembly, and the latter does not generate DMAC instructions. So this result does not show any efficiency gain from DMAC. Please help!

  • A DMAC instruction by itself offers little improvement.  To get the improvement you expect, the surrounding code, especially the loop control, must be optimized too.  You want the compiler to ultimately generate something similar to ...

            RPT       AR5
    ||      DMAC      ACC:P,*XAR4++,*XAR7++
            ADDL      P,ACC
    

    Because of this constraint ...

    the customer is unwilling to increase the optimization level (--opt_level=0)

    ... the compiler will never generate that code.  Generating that code requires, among other things, use of --opt_level=2 or higher.

    Thanks and regards,

    -George

  • Fine. Recently, the customer agreed to change the optimization level of a single function. However, they ran into a new problem: when they use #pragma FUNCTION_OPTIONS(func , " --opt_level=2") to raise the function's optimization level, the function's execution time hardly decreases.

    It is worth mentioning that this function uses many loops and ternary operators. I have asked the customer to inspect the assembly code, but there is no reply yet. What are the possible reasons for this result?

  • For this ...

    #pragma FUNCTION_OPTIONS(func , " --opt_level=2")

    ... to work, you have to compile with at least --opt_level=0.

    In addition, please be sure you are taking all the steps described in pages 25-30 of the C28x optimization presentation available in the article TI Compiler Presentations.

    Thanks and regards,

    -George

  • Thanks for the reply!

    The project is compiled with --opt_level=0 now. Optimizer Assistant is a good tool; I will try it.

    So George, back to the first question: when I compile with --opt_level=3, the C code is as follows:

    int32_t a[6]={1,2,3,4,5,6};
    int32_t b[6]={1,2,3,4,5,6};

    long res = 0;
    long temp = 0;
    for (i=0; i < 3; i++) // N does not have to be a known constant
        __dmac(((long *)a)[i], ((long *)b)[i], res, temp, 0);
    res += temp;

    The generated assembly code is as follows:

    So why can't I get the following assembly code, even when I compile with --opt_level=3?

            RPT       AR5
    ||      DMAC      ACC:P,*XAR4++,*XAR7++
            ADDL      P,ACC

  • Unfortunately, I am unable to generate the same code you do.  I must be doing something different.  To be sure I am doing everything the same, for the source file which contains the DMAC loop, please follow the directions in the article How to Submit a Compiler Test Case.

    Thanks and regards,

    -George