TMS320F28379D: CLA Loop unrolling injects unnecessary NOP instructions

Part Number: TMS320F28379D

Tool/software:

I'm comparing a few methods of passing data between the CLA co-processor and the C28x main processor. I'd like the data transfer to be extensible, so I want to use a memcpy-like routine to perform the copy. However, I've run into the expected issues with the CLA and its lack of an RPTB instruction. The methods I'm looking at are:

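/* Method 1: copy each field explicitly */
void element_by_element_copy(struct_t *in, struct_t *out){
    out->field_1 = in->field_1;
    ...
    ...
    out->field_n = in->field_n;
}
/* Method 2: copy 32 bits at a time in a loop.
   sizeof counts 16-bit words on C28x/CLA, so sizeof / 2 is the 32-bit word count. */
void memcpy(struct_t *in, struct_t *out){
    uint32_t *src = (uint32_t *)in;
    uint32_t *dst = (uint32_t *)out;
    for (uint16_t i = 0; i < (sizeof(struct_t) / 2); i++){
        dst[i] = src[i];
    }
}
/* Method 3: same loop, fully unrolled via the UNROLL pragma */
void unroll_memcpy(struct_t *in, struct_t *out){
    uint32_t *src = (uint32_t *)in;
    uint32_t *dst = (uint32_t *)out;
    #pragma UNROLL(sizeof(struct_t) / 2)
    for (uint16_t i = 0; i < (sizeof(struct_t) / 2); i++){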
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

As expected, the element-by-element copy is the most performant, but I noticed that using the UNROLL pragma with a for-loop results in nearly the same code. The only difference seems to be that, when unrolling the loop, the CLA compiler duplicates the MNOP instructions that the branch needed, even though they are now unnecessary. See sample assembly output below:

/* Assembly generation - Element by Element Copy */
; MAR0 assigned to dst_buff;
; Copy first element
MMOV32 MR0,@src_buff ; [CPU_FPU]
MMOV32 *MAR0,MR0 ; [CPU_FPU]
; Copy second element
MMOV32 MR0,@src_buff+2 ; [CPU_FPU]
MMOV32 *MAR0+[#2],MR0 ; [CPU_FPU]
...
...
; Copy nth element
MMOV32 MR0,@src_buff+n ; [CPU_FPU]
MMOV32 *MAR0+[#n],MR0 ; [CPU_FPU]
/* Assembly generation - Unrolled For Loop */
; MAR0 assigned to dst_buff;
; Copy 1st 32 bits
MMOV32 MR0,@src_buff ; [CPU_FPU]
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

So my question is: is there any way to force the CLA compiler to remove these MNOP instructions when unrolling the loop? I'm assuming the NOPs are injected in case the loop is only partially unrolled, but in our use case it's preferable to fully unroll the loop, since this needs to be optimized for speed rather than code size.

  • I tried to reproduce your results.  But I don't see ...

    the element-by-element copy is the most performant, but I noticed that using the UNROLL pragma with a for-loop results in nearly the same code.

    Please attach one source file I can build down to assembly.  Note it does not have to run.  I only need to inspect the generated assembly.  Copy and paste the text of the compiler options exactly as the compiler sees them.  Do not use a screenshot.  Also tell me the version of the compiler.

    Thanks and regards,

    -George

  • Hi George,

    Compiler version: v22.6.0.LTS

    Compiler invocation:

    cl2000 -v28 -ml -mt --cla_support=cla1 --float_support=fpu32 \
    --idiv_support=none --isr_save_vcu_regs=off --tmu_support=tmu0 \
    --vcu_support=vcu2 -O2 --opt_for_speed=2 --fp_mode=relaxed --fp_reassoc=off \
    --include_path="C:/ti/ccs1200/ccs/tools/compiler/ti-cgt-c2000_22.6.0.LTS/include" \
    --advice:performance=none -g --symdebug:dwarf_version=4 --c99 --relaxed_ansi \
    --float_operations_allowed=all --fp_single_precision_constant --diag_warning=225 \
    --diag_wrap=off --display_error_number --issue_remarks --quiet --abi=eabi \
    --cla_background_task=off --cla_signed_compare_workaround=off \
    --silicon_errata_fpu1_workaround=on --disable_inlining -k --parallel=8 \
    --obj_directory="source/manual" "copy_from_msgram.cla"
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    Not sure why, but I don't seem to be able to upload a file. I've placed the contents of my test file in the code block below instead. I've confirmed that it compiles and that the generated assembly still shows the issue on my machine; hopefully it does on yours as well.

    Setting the COPY_METHOD macro to ELEMENT_BY_ELEMENT, FOR_LOOP, UNROLL, or MANUAL_UNROLL shows the behavior I'm observing.

    File: copy_from_msgram.cla

    // Sample file for reproducing issues with CLA compiler for-loop UNROLLs
    #include <stdint.h>
    #include <float.h>
    typedef struct struct_group1 {
    float val_1;
    float val_2;
    float val_3;
    float val_4;
    } group1_t;
    typedef struct struct_group2 {
    float val_1;
    float val_2;
    float val_3;
    } group2_t;
    typedef struct struct_c28x_to_cla {
    uint16_t val_1;
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    Thanks for looking into this with me,

    - Taylor

     

  • I found a workaround.  But I can only give a partial explanation for it.

    Add the restrict keyword to the definitions of the pointers:

    uint32_t * restrict dst = (uint32_t *)data_out;
    uint32_t * restrict src = (uint32_t *)&global_buff;
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

    Without that, the compiler must presume expressions like dst[0] and src[1] could reference the same address.  If they refer to the same address, then the extra MNOP instructions are required to allow the write to dst[0] to complete before the read of src[1] can begin.  I don't know the pipeline behavior of CLA well enough to give more specific details.
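To show where the qualifier goes in a complete routine, here is a minimal sketch of a restrict-qualified, fully unrolled word copy. The names `COPY_WORDS` and `copy_words` are hypothetical and not taken from the test file, the word count of 4 is an assumption, and the literal in the pragma is assumed to have to match it. The pragma is guarded so the same source also builds with non-TI compilers. Whether the MNOPs actually disappear should still be checked in the generated assembly for your compiler version and options.

```c
#include <stdint.h>

/* Hypothetical: number of 32-bit words to copy. On C28x/CLA, sizeof counts
 * 16-bit words, so for a struct this would be sizeof(struct) / 2. */
#define COPY_WORDS 4u

/* restrict promises the compiler that dst and src never refer to the same
 * memory, so a write to dst[i] cannot affect a later read of src[i + 1].
 * The caller must make sure the two buffers really do not overlap. */
void copy_words(uint32_t *restrict dst, const uint32_t *restrict src)
{
    uint16_t i;

#ifdef __TI_COMPILER_VERSION__
    /* TI-specific request to unroll the loop; assumed to match COPY_WORDS */
    #pragma UNROLL(4)
#endif
    for (i = 0; i < COPY_WORDS; i++) {
        dst[i] = src[i];
    }
}
```

In the thread's setup, the caller would pass the struct addresses cast to `uint32_t *`, as in the two pointer definitions above. The sketch takes word pointers directly so that it stays valid C on any host.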

    Thanks and regards,

    -George