TMS320F280039C: Assembly optimization issues when using the memset function for more than 256 words

Part Number: TMS320F280039C

Hi Team,

A customer has raised the following issue and needs your help:

Compiler version: TI v22.6.0.LTS

I need to clear a large array or structure. Until now I have simply called the memset function, for example:

memset((void *)&test1,0,sizeof(test1));

However, I found that the compiler generates more efficient assembly code using the RPT instruction when the size is at most 256 words, and calls the memset library function directly when the size exceeds 256 words. For example, I defined three float arrays for testing, with sizes of 127, 128 and 129, and cleared them:

volatile float test1[127];
volatile float test2[128];
volatile float test3[129];

memset((void *)&test1,0,sizeof(test1));         //line 209
memset((void *)&test2,0,sizeof(test2));         //line 210
memset((void *)&test3,0,sizeof(test3));         //line 211

The generated assembly code is:

	.dwpsn	file "../main.c",line 209,column 5,is_stmt,isa 0
        MOVL      XAR4,#||test1||       ; [CPU_ARAU] |209| 
        RPT       #253
||     MOV       *XAR4++,#0            ; [CPU_ALU] |209| 
	.dwpsn	file "../main.c",line 210,column 5,is_stmt,isa 0
        MOVL      XAR4,#||test2||       ; [CPU_ARAU] |210| 
        RPT       #255
||     MOV       *XAR4++,#0            ; [CPU_ALU] |210| 
	.dwpsn	file "../main.c",line 211,column 5,is_stmt,isa 0
        MOV       ACC,#258              ; [CPU_ALU] |211| 
        MOVB      XAR5,#0               ; [CPU_ALU] |211| 
        MOVL      XAR4,#||test3||       ; [CPU_ARAU] |211| 
$C$DW$325	.dwtag  DW_TAG_TI_branch
	.dwattr $C$DW$325, DW_AT_low_pc(0x00)
	.dwattr $C$DW$325, DW_AT_name("memset")
	.dwattr $C$DW$325, DW_AT_TI_call

        LCR       #||memset||           ; [CPU_ALU] |211| 
        ; call occurs [#||memset||] ; [] |211| 

Clearing test1 and test2 uses the RPT instruction and takes 257 and 259 clock cycles respectively, an average of about 1 clock cycle per byte. Clearing test3 calls the memset library function and takes 4914 clock cycles, an average of about 19 clock cycles per byte. The C code and assembly code of the library file are as follows:

Even with level 2 compiler optimization turned on, the compiler handles memset the same way. Is there a better way to clear arrays or structures larger than 256 words?

Thanks & Regards,

Ben

  • Correction: "per byte" in the penultimate paragraph of my post is a typo; it should read "per word".

    In addition, I wrote a piece of C code myself and turned on level 2 compiler optimization. The compiler generates different code depending on the array size, as shown below.

    volatile float test3[130];
    float *pt = (float *)&test3;
    for(n=0;n<sizeof(test3)/2;n++)
    {
        *pt = 0;
        pt++;
    }

            MOVB      XAR6,#25              ; [CPU_ALU] 
    	.dwpsn	file "../main.c",line 211,column 9,is_stmt,isa 0
            ZERO      R1H                   ; [CPU_FPU] |211| 
            ZERO      R0H                   ; [CPU_FPU] |211| 
            ZERO      R2H                   ; [CPU_FPU] |211| 
    	.dwpsn	file "../main.c",line 208,column 15,is_stmt,isa 0
            MOVL      XAR4,#||test3||       ; [CPU_ARAU] |208| 
    	.dwpsn	file "../main.c",line 209,column 13,is_stmt,isa 0
            RPTB      ||$C$L43||,AR6        ; [CPU_ALU] |209| 
            ; repeat block starts ; [] 
    ||$C$L42||:    
    	.dwpsn	file "../main.c",line 211,column 9,is_stmt,isa 0
            MOV32     *XAR4++,R1H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R0H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R2H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R1H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R0H           ; [CPU_FPU] |211| 
            ; repeat block ends ; [] 

    volatile float test3[131];
    float *pt = (float *)&test3;
    for(n=0;n<sizeof(test3)/2;n++)
    {
        *pt = 0;
        pt++;
    }

            MOVB      XAR6,#64              ; [CPU_ALU] 
    	.dwpsn	file "../main.c",line 211,column 9,is_stmt,isa 0
            ZERO      R1H                   ; [CPU_FPU] |211| 
            ZERO      R0H                   ; [CPU_FPU] |211| 
    	.dwpsn	file "../main.c",line 208,column 15,is_stmt,isa 0
            MOVL      XAR4,#||test3||       ; [CPU_ARAU] |208| 
    ||$C$L42||:    
    ; Peeled loop iterations for unrolled loop:
    	.dwpsn	file "../main.c",line 211,column 9,is_stmt,isa 0
            MOV32     *XAR4++,R1H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R0H           ; [CPU_FPU] |211| 
    	.dwpsn	file "../main.c",line 209,column 13,is_stmt,isa 0
            BANZ      ||$C$L42||,AR6--      ; [CPU_ALU] |209| 

    Depending on the size of the array, the compiler sometimes uses the RPTB instruction and sometimes the BANZ instruction. The loop takes about 0.5 clock cycles per word with RPTB and about 1.7 clock cycles per word with BANZ.
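
    To put the figures reported in this thread side by side, here is a small sketch that computes the average cost per 16-bit word. The function names are hypothetical helpers, not part of any TI library, and the inputs are only the measurements quoted above (257 cycles for 254 words, 259 for 256, 4914 for 258).

    ```c
    #include <assert.h>

    /* Average cost in clock cycles per 16-bit word (hypothetical helper). */
    double cycles_per_word(unsigned long cycles, unsigned long words)
    {
        assert(words > 0);
        return (double)cycles / (double)words;
    }

    /* Rough estimate of the cycles needed to clear `words` words at a
     * given per-word cost; ignores setup overhead (hypothetical helper). */
    double estimate_cycles(unsigned long words, double per_word)
    {
        return (double)words * per_word;
    }
    ```

    With the quoted measurements, cycles_per_word(4914, 258) comes out at roughly 19 for the memset call and cycles_per_word(257, 254) at roughly 1 for RPT. Passing the 0.5 and 1.7 figures to estimate_cycles gives a feel for how much the RPTB/BANZ choice matters for a larger buffer.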

    To ensure that the compiler uses the RPTB instruction, you can pad the array or structure so that its size in 32-bit words is an integer multiple of 6 or 8. Assembly code for a multiple of 6 (414 floats):

            MOVB      XAR6,#68              ; [CPU_ALU] 
    	.dwpsn	file "../main.c",line 211,column 9,is_stmt,isa 0
            ZERO      R2H                   ; [CPU_FPU] |211| 
            ZERO      R1H                   ; [CPU_FPU] |211| 
            ZERO      R0H                   ; [CPU_FPU] |211| 
    	.dwpsn	file "../main.c",line 208,column 15,is_stmt,isa 0
            MOVL      XAR4,#||test3||       ; [CPU_ARAU] |208| 
    	.dwpsn	file "../main.c",line 209,column 13,is_stmt,isa 0
            RPTB      ||$C$L43||,AR6        ; [CPU_ALU] |209| 
            ; repeat block starts ; [] 
    ||$C$L42||:    
    	.dwpsn	file "../main.c",line 211,column 9,is_stmt,isa 0
            MOV32     *XAR4++,R2H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R1H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R0H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R2H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R1H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R0H           ; [CPU_FPU] |211| 
            ; repeat block ends ; [] 

    Assembly code for a multiple of 8 (408 floats):

            MOVB      XAR6,#50              ; [CPU_ALU] 
    	.dwpsn	file "../main.c",line 211,column 9,is_stmt,isa 0
            ZERO      R1H                   ; [CPU_FPU] |211| 
            ZERO      R0H                   ; [CPU_FPU] |211| 
            ZERO      R2H                   ; [CPU_FPU] |211| 
    	.dwpsn	file "../main.c",line 208,column 15,is_stmt,isa 0
            MOVL      XAR4,#||test3||       ; [CPU_ARAU] |208| 
    	.dwpsn	file "../main.c",line 209,column 13,is_stmt,isa 0
            RPTB      ||$C$L43||,AR6        ; [CPU_ALU] |209| 
            ; repeat block starts ; [] 
    ||$C$L42||:    
    	.dwpsn	file "../main.c",line 211,column 9,is_stmt,isa 0
            MOV32     *XAR4++,R1H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R0H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R2H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R1H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R0H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R2H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R1H           ; [CPU_FPU] |211| 
            MOV32     *XAR4++,R0H           ; [CPU_FPU] |211| 
            ; repeat block ends ; [] 
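
    The padding idea above could be sketched as follows. The macro and the names N_USED, N_PADDED, test3_padded and clear_padded are hypothetical, and the array is declared without volatile for simplicity. Whether the compiler actually emits RPTB for a given padded size is an assumption that should be checked in the generated assembly.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Round n up to the next multiple of m (m > 0). */
    #define ROUND_UP_TO_MULTIPLE(n, m) ((((n) + (m) - 1) / (m)) * (m))

    #define N_USED   129u   /* elements actually needed (hypothetical) */
    #define N_PADDED ROUND_UP_TO_MULTIPLE(N_USED, 8u)   /* 136 elements */

    float test3_padded[N_PADDED];

    /* Clear the whole padded array with a plain counted loop. */
    void clear_padded(void)
    {
        float *pt = test3_padded;
        size_t n;

        assert(N_PADDED % 8u == 0u);   /* padded length is a multiple of 8 */
        for (n = 0; n < N_PADDED; n++)
            *pt++ = 0;
    }
    ```

    The loop bound is an element count rather than sizeof(...)/2, so it does not depend on sizeof reporting 16-bit words as it does on C28x.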

    However, this method cannot be used on arrays and structures that are not 32-bit aligned, and BANZ is still sometimes generated when the array length is not padded. Is there a more efficient or convenient way to get the compiler to generate efficient assembly code, or are there assembly instructions that can be used directly?

  • I focused on the case where the number of words to write is greater than 256.  I can find no combination of compiler options or source changes that causes a call to memset to be inlined.  So I filed EXT_EP-11505 to have this investigated.  Note this entry does not report a bug, but a performance issue.  You are welcome to follow it with that link.

    The rest of this post suggests a workaround.  Write functions that specifically work on one type.  In a header file have code similar to ...

    /* hdr.h */
    
    #include <stddef.h>  /* for size_t */
    
    inline void clear_floats(float *ptr, size_t length)
    {
        while (length--)
            *ptr++ = 0;
    }
    
    inline void clear_longs(long *ptr, size_t length)
    {
        while (length--)
            *ptr++ = 0;
    }

    In one source file in the application, include that header file and have code similar to ...

    #include "hdr.h"
    
    extern inline void clear_floats(float *ptr, size_t length);
    extern inline void clear_longs(long *ptr, size_t length);

    This code ensures that, no matter the level of optimization used, callable versions of these functions are generated as needed.  For details, please see this article (not from TI).

    For this to work, you have to build with --c99 or --c11.  It does not work with the default setting of --c89.  While it works with the default --abi=coffabi, the underlying implementation can waste memory.  It works well with --abi=eabi.
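
    For illustration, here is a single-file sketch of how the workaround might be used. It folds the header and the source file into one translation unit, and the array test3 and the wrapper clear_test3 are hypothetical names chosen to mirror the question. It assumes C99 or later inline semantics.

    ```c
    #include <assert.h>
    #include <stddef.h>  /* for size_t */

    /* Inline definition, as it would appear in hdr.h. */
    inline void clear_floats(float *ptr, size_t length)
    {
        while (length--)
            *ptr++ = 0;
    }

    /* In exactly one source file: request a callable (external) definition. */
    extern inline void clear_floats(float *ptr, size_t length);

    float test3[129];  /* hypothetical array, same size as test3 in the question */

    void clear_test3(void)
    {
        /* Pass an element count, not sizeof(test3). */
        clear_floats(test3, sizeof(test3) / sizeof(test3[0]));
        assert(test3[0] == 0.0f && test3[128] == 0.0f);
    }
    ```

    Note that length is a number of elements. Dividing by sizeof(test3[0]) keeps the count correct on C28x, where sizeof is measured in 16-bit words, as well as on targets with 8-bit bytes.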

    Thanks and regards,

    -George