Compiler/TMS320F28379D: memcpy for >256 words does not generate RPT || PREAD but does lcr memcpy => how to avoid

Thomas Wappler

Part Number: TMS320F28379D

Tool/software: TI C/C++ Compiler

Hi,

In my code I have to copy some fairly large structs, some of them larger than 256 words. For the smaller ones the expected combination of RPT and PREAD is generated, and this performs well. For the larger structs a call to memcpy (lcr memcpy) is generated, which takes more than 10 times longer to execute.

I read in the compiler manual, in the section about --rpt_threshold, that under some conditions multiple RPTs may be generated, but whatever options I tried, I never got this. Is there a way to force the compiler to generate as many RPTs as it takes and avoid memcpy?

I can get it to work when I put multiple memcpy lines in the code and copy the struct manually in sections this way. But as my code is generated by Simulink, this is not a good solution.
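For illustration, the manual workaround described above could look like the following minimal sketch. It assumes a struct of 150 floats (300 16-bit words on C28x, where sizeof counts 16-bit words); the function name copy_big and the two size macros are hypothetical and not taken from the generated code. Each memcpy gets a compile-time constant size well below 256 words, which is the condition under which the inline RPT || PREAD copy was observed here. Whether a particular compiler version inlines both calls is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed example type: 150 floats, i.e. 300 words on C28x. */
typedef struct {
    float a[150];
} big_type;

/* Hypothetical split sizes, both compile-time constants.
   On C28x these are 150 words each; on a byte-addressed host, 300 bytes each. */
#define FIRST_PART_SIZE  (sizeof(big_type) / 2)
#define SECOND_PART_SIZE (sizeof(big_type) - FIRST_PART_SIZE)

/* Copies *src to *dst in two sections instead of one struct assignment. */
static void copy_big(big_type *dst, const big_type *src)
{
    unsigned char *d = (unsigned char *)dst;
    const unsigned char *s = (const unsigned char *)src;

    assert(dst != NULL && src != NULL);

    memcpy(d, s, FIRST_PART_SIZE);
    memcpy(d + FIRST_PART_SIZE, s + FIRST_PART_SIZE, SECOND_PART_SIZE);
}
```

A call such as copy_big(&big1_struct, &big2_struct) would then replace the plain struct assignment. The drawback stays the same as stated above: the split has to be written by hand, which does not fit well with generated code.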

I use version 18.1.4.LTS.

Thanks,
Thomas.

over 5 years ago

George Mock, over 5 years ago

Please see this forum thread.

Thanks and regards,

-George

Thomas Wappler, over 5 years ago, in reply to George Mock

Hello George,

I know this thread and read it again but I could not find an answer to my question.

Here is a small example code of my issue (basically a smaller and a bigger struct are copied):

typedef struct {
   float a;
} small_type ;

typedef struct {
   float a[150];
} big_type ;

big_type big1_struct;
big_type big2_struct;
small_type small1_struct;
small_type small2_struct;

int main(void)
{
    big1_struct = big2_struct;
    small1_struct = small2_struct;

    return 0;
}

The resulting assembly is:

_main:
	.dwcfi	cfa_offset, -2
	.dwcfi	save_reg_to_mem, 26, 0
	.dwpsn	file "../main.c",line 22,column 5,is_stmt,isa 0
;----------------------------------------------------------------------
;  22 | big1_struct = big2_struct;                                             
;----------------------------------------------------------------------
        MOVL      XAR4,#300             ; [CPU_ARAU] |22| 
        MOVL      ACC,XAR4              ; [CPU_ALU] |22| 
        MOVL      XAR5,#_big2_struct    ; [CPU_ARAU] |22| 
        MOVL      XAR4,#_big1_struct    ; [CPU_ARAU] |22| 
$C$DW$6	.dwtag  DW_TAG_TI_branch
	.dwattr $C$DW$6, DW_AT_low_pc(0x00)
	.dwattr $C$DW$6, DW_AT_name("_memcpy")
	.dwattr $C$DW$6, DW_AT_TI_call

        LCR       #_memcpy              ; [CPU_ALU] |22| 
        ; call occurs [#_memcpy] ; [] |22| 
	.dwpsn	file "../main.c",line 23,column 5,is_stmt,isa 0
;----------------------------------------------------------------------
;  23 | small1_struct = small2_struct;                                         
;----------------------------------------------------------------------
        MOVL      XAR4,#_small1_struct  ; [CPU_ARAU] |23| 
        MOVL      XAR7,#_small2_struct  ; [CPU_ARAU] |23| 
	.dwpsn	file "../main.c",line 25,column 2,is_stmt,isa 0
;----------------------------------------------------------------------
;  25 | return 0;                                                              
;----------------------------------------------------------------------
        MOVB      AL,#0                 ; [CPU_ALU] |25| 
	.dwpsn	file "../main.c",line 23,column 5,is_stmt,isa 0
        RPT       #1
||     PREAD     *XAR4++,*XAR7         ; [CPU_ALU] |23| 
$C$DW$7	.dwtag  DW_TAG_TI_branch
	.dwattr $C$DW$7, DW_AT_low_pc(0x00)
	.dwattr $C$DW$7, DW_AT_TI_return

        LRETR     ; [CPU_ALU] 
        ; return occurs ; []

As you can see, the smaller struct is copied by an inlined memcpy using RPT and PREAD, while the bigger one is copied by a function call to memcpy. Performance-wise, it would be better to use multiple RPTs.

According to the compiler manual the code may be generated using multiple RPTs, but I could not find out how:

--rpt_threshold=k
Generates RPT loops that iterate k times or less (k is a constant
between 0 and 256). Multiple RPT’s may be generated for the same
loop, if iteration count is more than k and if code size does not
increase too much. Using this option when optimizing for code size
disables RPT loop generation for loops whose iteration count can be
greater than k.

This is the compiler command that generated the cited code:

"C:/ti/ccsv8/tools/compiler/ti-cgt-c2000_18.1.4.LTS/bin/cl2000" -v28 -ml -mt --cla_support=cla1 --float_support=fpu32 --tmu_support=tmu0 --vcu_support=vcu2 -O2 --opt_for_speed=5 --include_path="D:/repos/merlin-gansolarwr/CCS_workspace/test_struc_copy" --include_path="C:/ti/ccsv8/tools/compiler/ti-cgt-c2000_18.1.4.LTS/include" -g --c99 --diag_warning=225 --diag_wrap=off --display_error_number -k --asm_listing --c_src_interlist --asm_cross_reference_listing --preproc_with_compile --preproc_dependency="main.d_raw" "../main.c"

Thanks,
Thomas.

George Mock, over 5 years ago, in reply to Thomas Wappler

You make a good point. It is similar to the other thread, but not the same. I investigated some more. It turns out that, if you limit the size of the structure to 255 words (127 floats), then memcpy is not called. The operation is done with an RPT loop.
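The size arithmetic behind that limit can be sketched as follows. This is only an illustration: SIZE_IN_WORDS, MAX_INLINE_COPY_WORDS, copies_inline and check_copy_sizes are hypothetical names, the value 255 is the limit observed in this thread for 18.1.4.LTS rather than a documented constant, and float is assumed to be 32 bits wide.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: object size in 16-bit words.
   On C28x CHAR_BIT is 16, so this is the same value as sizeof. */
#define SIZE_IN_WORDS(type) ((sizeof(type) * CHAR_BIT) / 16u)

/* Limit observed in this thread; an assumption, not a documented constant. */
#define MAX_INLINE_COPY_WORDS 255u

typedef struct { float a[127]; } fits_type;  /* 127 floats = 254 words */
typedef struct { float a[150]; } big_type;   /* 150 floats = 300 words */

/* Returns nonzero if a copy of this many words stays within the observed limit. */
static int copies_inline(size_t words)
{
    return words <= MAX_INLINE_COPY_WORDS;
}

/* Run-time sanity check of the two example types. */
static void check_copy_sizes(void)
{
    assert(copies_inline(SIZE_IN_WORDS(fits_type)));
    assert(!copies_inline(SIZE_IN_WORDS(big_type)));
}
```

With 2 words per float, 127 floats occupy 254 words and stay under the limit, while the 150-float struct from the example occupies 300 words and falls back to the memcpy call.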

There is no good reason for this limitation. So, I filed the entry CODEGEN-6165 in the SDOWP system to request a change in the compiler. You are welcome to follow it with the SDOWP link below in my signature. This entry does not report a bug, but requests the compiler be improved to generate faster code for this case.

Thanks and regards,

-George
