TMS320F28377D: Dual MAC Required Build Options

Part Number: TMS320F28377D
Other Parts Discussed in Thread: C2000WARE


My customer wants to speed up multiply-accumulate loops like the following because of run-time constraints.

for (i = 0; i < length; i++) result += (int32_t) p1[i] * p2[i];

As we know, using DMAC requires --opt_level to be set to 2 or higher. However, to ensure the reliability of the application, the customer is unwilling to increase the optimization level (--opt_level=0). Can DMAC be enabled through configuration in this case? If not, is there a better solution?

  • The compiler supports an intrinsic named __dmac.  Using it causes a DMAC instruction to be generated, even if you build with --opt_level=0.  However, this one change is unlikely to make a significant improvement in performance.  For more information about this intrinsic, please search the C28x compiler manual for the sub-chapter titled __dmac Intrinsic.  Or, see pages 25-30 of the C28x optimization presentation available in the article TI Compiler Presentations.
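    As a rough illustration of what the intrinsic computes, here is a minimal sketch that uses __dmac when built with the C28x compiler and falls back to a plain C model elsewhere. The names dot16 and dmac_model are hypothetical. The mapping of upper halves to the first accumulator and lower halves to the second is an assumption based on the DMAC instruction description, the shift is fixed at 0, and the little-endian packing in the fallback is also an assumption. On the device, both arrays must be 32-bit aligned.

```c
#include <assert.h>
#include <stdint.h>

#if !defined(__TMS320C28XX__)
/* Host-side model of one __dmac step with shift = 0 (assumption: the upper
 * 16-bit halves feed acc1 and the lower 16-bit halves feed acc2). */
static void dmac_model(uint32_t src1, uint32_t src2,
                       int32_t *acc1, int32_t *acc2)
{
    int16_t hi1 = (int16_t)(src1 >> 16);
    int16_t lo1 = (int16_t)(src1 & 0xFFFFu);
    int16_t hi2 = (int16_t)(src2 >> 16);
    int16_t lo2 = (int16_t)(src2 & 0xFFFFu);

    *acc1 += (int32_t)hi1 * hi2;
    *acc2 += (int32_t)lo1 * lo2;
}
#endif

/* Dot product of n 16-bit samples, processed two samples per step.
 * n must be even. */
int32_t dot16(int16_t *p1, int16_t *p2, unsigned n)
{
    int32_t acc1 = 0;
    int32_t acc2 = 0;
    unsigned i;

    assert((n % 2u) == 0u);

    for (i = 0; i < n / 2u; i++) {
#if defined(__TMS320C28XX__)
        /* Each long holds two adjacent 16-bit samples. */
        __dmac(((long *)p1)[i], ((long *)p2)[i], acc1, acc2, 0);
#else
        /* Pack two adjacent samples the way a little-endian 32-bit read would. */
        uint32_t s1 = ((uint32_t)(uint16_t)p1[2u * i + 1u] << 16) | (uint16_t)p1[2u * i];
        uint32_t s2 = ((uint32_t)(uint16_t)p2[2u * i + 1u] << 16) | (uint16_t)p2[2u * i];
        dmac_model(s1, s2, &acc1, &acc2);
#endif
    }

    /* The two partial sums are combined once, after the loop. */
    return acc1 + acc2;
}
```

    Note that each __dmac call consumes two 16-bit samples, so the loop runs n / 2 times and the two accumulators are added only at the end.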

    Some other methods to consider ...

    Use the DSP libraries available in C2000Ware.

    If your program is organized as a CCS project, then use File Specific Options to increase the setting of --opt_level for only 1 or 2 files in the project.

    Use #pragma FUNCTION_OPTIONS to increase the optimization level of just one function in a file.  To learn more about it, search for FUNCTION_OPTIONS in the C28x compiler manual.
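    A minimal sketch of the FUNCTION_OPTIONS approach, assuming the loop from the original question is moved into its own function. The function name dot_product and its signature are hypothetical. The pragma is guarded so that other compilers skip it, and it is placed ahead of the function definition. Whether the optimizer actually emits RPT || DMAC for this loop depends on the rest of the build settings, so please check the generated assembly.

```c
#include <stdint.h>

#if defined(__TI_COMPILER_VERSION__)
/* Ask the TI compiler to build only this function at --opt_level=2. */
#pragma FUNCTION_OPTIONS(dot_product, "--opt_level=2")
#endif

/* Sum of products of two 16-bit arrays, accumulated in 32 bits. */
int32_t dot_product(const int16_t *p1, const int16_t *p2, uint16_t length)
{
    int32_t result = 0;
    uint16_t i;

    for (i = 0; i < length; i++) {
        result += (int32_t)p1[i] * p2[i];
    }
    return result;
}
```

    The rest of the file keeps the project-wide optimization level, so only this one function needs to be re-verified.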

    Thanks and regards,


  • Hi George,

    Thanks for your reply!  This is a good solution, I will give you feedback after trying it!

  • Hi George,

    As you suggested, I used the __dmac intrinsic. For the performance comparison, I wrote the following two loops:

    • for (i = 0; i < 3; i++)
      {__dmac(p1_32[i], p2_32[i], acc1, acc2, 0);}

    • for (i = 0; i < 3; i++)
      {re2 += (int32_t) p1_32[i] * p2_32[i];}

    I tested on a 28388 and found that both loops took 76 cycles. I checked the assembly, and the latter loop does not generate DMAC instructions. So this result does not demonstrate the efficiency of DMAC. Please help!

  • A DMAC instruction by itself offers little improvement.  To get the improvement you expect, the surrounding code, especially the loop control, must be optimized too.  You want the compiler to ultimately generate something similar to ...

            RPT       AR5
    ||      DMAC      ACC:P,*XAR4++,*XAR7++
            ADDL      P,ACC

    Because of this constraint ...

    the customer is unwilling to increase the optimization level(--opt_level ==0)

    ... the compiler will never generate that code.  Generating that code requires, among other things, use of --opt_level=2 or higher.

    Thanks and regards,


  • Recently, the customer agreed to change the optimization level of a single function. However, they ran into a new problem: when they use #pragma FUNCTION_OPTIONS(func , " --opt_level=2") to raise the function's optimization level, the function's execution time hardly decreases.

    It is worth mentioning that this function uses many loops and ternary operators. I have asked the customer to inspect the assembly code, but have not received a reply yet. What are the possible reasons for this result?

  • For this ...

    #pragma FUNCTION_OPTIONS(func , " --opt_level=2")

    ... to work, you have to compile with at least --opt_level=0.

    In addition, please be sure you are taking all the steps described in pages 25-30 of the C28x optimization presentation available in the article TI Compiler Presentations.

    Thanks and regards,


  • Thanks for the reply!

    The project is compiled with --opt_level=0 now. Optimizer Assistant is a good tool; I will try it.

    So George, back to the first question. When I compile with --opt_level=3, the C code is as follows:

    int32_t a[6]={1,2,3,4,5,6};
    int32_t b[6]={1,2,3,4,5,6};

    long res = 0;
    long temp = 0;
    for (i=0; i < 3; i++) // N does not have to be a known constant
        __dmac(((long *)a)[i], ((long *)b)[i], res, temp, 0);
    res += temp;

    The generated assembly code is as follows:

    So why can't I get the following assembly code, even when I compile with --opt_level=3?

            RPT       AR5
    ||      DMAC      ACC:P,*XAR4++,*XAR7++

  • Unfortunately, I am unable to generate the same code you do.  I must be doing something different.  To be sure I am doing everything the same, for the source file which contains the DMAC loop, please follow the directions in the article How to Submit a Compiler Test Case.

    Thanks and regards,


  • Hi George,

    I followed the directions in the article How to Submit a Compiler Test Case, but I still cannot generate the desired assembly code. I have attached my compiler configuration. Please help and give some suggestions.

    If everything looks OK, could you attach your code sample so that I can refer to it? Thanks!

  • I followed the directions in the article How to Submit a Compiler Test Case, but I still cannot generate the desired assembly code.

    I apologize.  The point of the article was for you to prepare a test case you submit to me.  Then I would use that test case to reproduce your results.  Once I can do that, I can probably tell you the changes to make to generate the faster code I see.  

    So, please follow the directions in that article so that I have the code, and other details, that allow me to reproduce your results.

    Thanks and regards,


  • "C:/ti/ccs1040/ccs/tools/compiler/ti-cgt-c2000_20.2.5.LTS/bin/cl2000" -v28 -ml -mt --cla_support=cla2 --float_support=fpu64 --idiv_support=idiv0 --tmu_support=tmu0 --vcu_support=vcrc -O2 --opt_for_speed=5 --fp_mode=relaxed --include_path="C:/Users/a0488871/workspace_v10/empty_driverlib_project" --include_path="C:/Users/a0488871/workspace_v10/empty_driverlib_project/device" --include_path="C:/ti/c2000/C2000Ware_3_04_00_00/driverlib/f2838x/driverlib" --include_path="C:/ti/ccs1040/ccs/tools/compiler/ti-cgt-c2000_20.2.5.LTS/include" --define=DEBUG --define=CPU1 --preproc_with_comment --preproc_with_compile --diag_suppress=10063 --diag_warning=225 --diag_wrap=off --display_error_number --gen_func_subsections=on --abi=eabi --include_path="C:/Users/a0488871/workspace_v10/empty_driverlib_project/CPU1_RAM/syscfg" "../empty_driverlib_main.c"

    compiler version=  20.2.5.LTS

  • To see how this got resolved, please visit this forum thread.

    Thanks and regards,