I'm comparing a few methods of passing data between the CLA co-processor and the C28x main processor. I'd like the data transfer to be extensible, so I want to use a memcpy-like routine to perform the copy. However, I've run into the expected issues with the CLA and its lack of a RPTB instruction. The methods I'm looking at are:
void element_by_element_copy(struct_t * in, struct_t * out){
    /* One explicit assignment per field */
    out->field_1 = in->field_1;
    ...
    ...
    out->field_n = in->field_n;
}
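void memcpy(struct_t * in, struct_t * out){
    /* sizeof counts 16-bit words on this target, so sizeof / 2 is the number of 32-bit words */
    uint32_t *src = (uint32_t *)in;
    uint32_t *dst = (uint32_t *)out;
    for (uint16_t i = 0; i < (sizeof(struct_t) / 2); i++){
        dst[i] = src[i];
    }
}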
void unroll_memcpy(struct_t * in, struct_t * out){
    uint32_t *src = (uint32_t *)in;
    uint32_t *dst = (uint32_t *)out;
    /* Ask the compiler to unroll once per 32-bit word, i.e. fully */
    #pragma UNROLL(sizeof(struct_t) / 2)
    for (uint16_t i = 0; i < (sizeof(struct_t) / 2); i++){
        dst[i] = src[i];
    }
}
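For anyone trying the word-wise copy off-target: the `sizeof(struct_t) / 2` loop bound relies on `sizeof` counting 16-bit words on the C28x/CLA, so it would copy too many words on a host with 8-bit bytes. Below is a minimal host-side sketch that writes the bound as `sizeof(struct_t) / sizeof(uint32_t)` instead, which gives the same count on both. The struct fields, `STRUCT_WORDS32`, `copy_words32`, and `copy_roundtrips` are hypothetical names for illustration, and `static_assert` assumes a C11 compiler. It says nothing about the MNOPs the CLA compiler emits.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>   /* memcmp, used only by the self-check helper */

/* Hypothetical stand-in for the post's struct_t; the fields are made up.
   All fields are uint32_t so the word-wise access below stays simple. */
typedef struct {
    uint32_t field_1;
    uint32_t field_2;
    uint32_t field_3;
} struct_t;

/* Number of 32-bit words in the struct, independent of how wide a
   "byte" (the unit of sizeof) is on the target. */
#define STRUCT_WORDS32 (sizeof(struct_t) / sizeof(uint32_t))

/* The word-wise copy only covers the whole struct if its size is a
   multiple of the word size. */
static_assert(sizeof(struct_t) % sizeof(uint32_t) == 0,
              "struct_t must be a whole number of 32-bit words");

/* Same loop as the memcpy-like routine, with the portable bound. */
void copy_words32(const struct_t *in, struct_t *out)
{
    const uint32_t *src = (const uint32_t *)in;
    uint32_t *dst = (uint32_t *)out;
    for (uint16_t i = 0; i < STRUCT_WORDS32; i++) {
        dst[i] = src[i];
    }
}

/* Self-check: returns 1 if the destination matches the source. */
int copy_roundtrips(void)
{
    const struct_t in = { 0x11111111u, 0x22222222u, 0x33333333u };
    struct_t out = { 0u, 0u, 0u };
    copy_words32(&in, &out);
    return memcmp(&in, &out, sizeof(struct_t)) == 0;
}
```

Calling `copy_roundtrips()` from a host test harness is enough to confirm that the pointer setup and loop bound are right before looking at the generated CLA assembly.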
As expected, the element-by-element copy is the most performant, but I noticed that using the UNROLL pragma with a for-loop results in nearly the same code. The only difference seems to be that, when unrolling the loop, the CLA compiler keeps the MNOP instructions that were needed for the loop branch in every unrolled iteration, even though they are no longer necessary. See sample assembly output below:
/* Assembly generation - Element by Element Copy */
; MAR0 assigned to dst_buff;
; Copy first element
MMOV32 MR0,@src_buff ; [CPU_FPU]
MMOV32 *MAR0,MR0 ; [CPU_FPU]
; Copy second element
MMOV32 MR0,@src_buff+2 ; [CPU_FPU]
MMOV32 *MAR0+[#2],MR0 ; [CPU_FPU]
...
...
; Copy nth element
MMOV32 MR0,@src_buff+n ; [CPU_FPU]
MMOV32 *MAR0+[#n],MR0 ; [CPU_FPU]
/* Assembly generation - Unrolled For Loop */
; MAR0 assigned to dst_buff;
; Copy 1st 32 bits
MMOV32 MR0,@src_buff ; [CPU_FPU]
MMOV32 *MAR0,MR0 ; [CPU_FPU]
MNOP ; [CPU_FPU]
MNOP ; [CPU_FPU]
MNOP ; [CPU_FPU]
; Second 32 bits
MMOV32 MR0,@src_buff+2 ; [CPU_FPU]
MMOV32 *MAR0+[#2],MR0 ; [CPU_FPU]
MNOP ; [CPU_FPU]
MNOP ; [CPU_FPU]
MNOP ; [CPU_FPU]
...
...
; Copy nth 32 bits
MMOV32 MR0,@src_buff+n ; [CPU_FPU]
MMOV32 *MAR0+[#n],MR0 ; [CPU_FPU]
So my question is: is there any way to force the CLA compiler to remove these MNOP instructions when unrolling the loop? I'm assuming the MNOPs are inserted in case the loop is only partially unrolled, but in our use case it's preferable to fully unroll the loop, since this needs to be optimized for speed rather than code size.