Hello,
I am trying to run an optimized (as fast as possible) convolution on an EVM C6472 board. I am using CCS v4.2 and SYS/BIOS 6. The convolution code is very straightforward:
for (i = sig_loop_lower_bound; i <= sig_loop_upper_bound; i += 1) {
    sum = 0;
    for (j = 0; j < h_len; j++) {
        if (j > i)
            continue;
        sum += ((h[j]) * (x[i - j]));
    }
    y[i] = sum;
}
Here, h is the filter, x is the input sequence, and y is the output.
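For comparison, here is a minimal sketch of the same computation with the `if (j > i)` test moved out of the inner loop by clamping the inner bound to min(h_len, i + 1). The function name `conv_clamped`, the 16-bit input and 32-bit output types, the `restrict` qualifiers, and the requirement that the lower bound is non-negative are all assumptions made for illustration, not details of the project above. Whether this form leads the TI compiler to generate a better schedule is not something this sketch demonstrates.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: produces the same y[i] as the loop above,
 * but with no conditional inside the inner loop.
 * Assumes 0 <= lo, and that h, x and y do not overlap in memory. */
static void conv_clamped(const int16_t *restrict h, int h_len,
                         const int16_t *restrict x,
                         int32_t *restrict y,
                         int lo, int hi)
{
    assert(lo >= 0 && h_len >= 0);

    for (int i = lo; i <= hi; i++) {
        /* The original loop skips every j > i, so only the first
         * min(h_len, i + 1) taps contribute to y[i]. */
        int taps = (i + 1 < h_len) ? (i + 1) : h_len;
        int32_t sum = 0;

        for (int j = 0; j < taps; j++)
            sum += (int32_t)h[j] * x[i - j];

        y[i] = sum;
    }
}
```

With this form the inner loop has a trip count that is known before the loop starts and a body that is a plain multiply-accumulate, which is the shape the original code expresses with the `continue`.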
I expected the TI compiler to optimize this loop very efficiently. There is a dot-product (multiply-and-accumulate) instruction, DDOTPL2, that I thought would be used to compute the "sum" variable, and I expected several instructions to be issued in parallel thanks to the chip's pipelining capabilities. However, that is not what I see in the disassembler listing. I couldn't find a single instance of DDOTPL2, and the whole loop has many more instructions than I think it should. It is eating up a lot of clock cycles, as my running time measurements show. I am posting the disassembler output below (it interleaves the C source lines with the corresponding assembly).
*** This project uses the -O3 optimization setting. ***
83 for (i = sig_loop_lower_bound; i <= sig_loop_upper_bound; i++) {
0x00888C20: 7586 MV.L1X B11,A3
0x00888C22: A587 || MV.L2 B11,B5
0x00888C24: 0028A8FA CMPGT.L2 B5,B10,B0
0x00888C28: 20422120 [ B0] BNOP.S1 C$DW$L$_conv$30$E (PC+132 = 0x00888ca4),1
0x00888C2C: 0400016E LDW.D2T2 *+B14[1],B8
0x00888C30: 02000428 MVK.S1 0x0008,A4
0x00888C34: 02001069 MVKH.S1 0x200000,A4
0x00888C38: 021540FA || SUB.L2 B10,B5,B4
0x00888C3C: E0200001 .fphead n, l, W, BU, nobr, nosat, 0000001
0x00888C40: 04107A41 ADDAH.D1 A4,A3,A8
0x00888C44: 2611 || ADD.L2 B4,1,B1
0x00888C46: 61F0 || ADD.L1 A3,A3,A7
85 sum = 0;
C$DW$L$_conv$25$B, C$L9:
0x00888C48: 0180A358 MVK.L1 0,A3
86 for (j = 0; j < h_len; j++) {
0x00888C4C: 00200ADA CMPLT.L2 0,B8,B0
0x00888C50: 30286123 [!B0] BNOP.S2 C$DW$L$_conv$28$E (PC+80 = 0x00888c90),3
0x00888C54: 22000028 || [ B0] MVK.S1 0x0000,A4
0x00888C58: 22004068 [ B0] MVKH.S1 0x800000,A4
0x00888C5C: E0400004 .fphead n, l, W, BU, nobr, nosat, 0000010
0x00888C60: 221C8078 [ B0] ADD.L1 A4,A7,A4
88 continue;
C$DW$L$_conv$26$B, C$DW$L$_conv$25$E:
0x00888C64: 0210002A MVK.S2 0x2000,B4
0x00888C68: 020043EA MVKH.S2 0x870000,B4
86 for (j = 0; j < h_len; j++) {
0x00888C6C: 06A003A2 MVC.S2 B8,ILC
0x00888C70: 4C6E NOP 3
C$L10, C$DW$L$_conv$26$E:
0x00888C72: 0CE6 SPLOOP 2
C$DW$L$_conv$28$B, C$L11:
0x00888C74: 03103445 LDH.D1T1 *A4--[1],A6
0x00888C78: 1E7D || LDH.D2T2 *B4++[1],B7
0x00888C7A: 2CE6 SPMASK L2
0x00888C7C: EA132000 .fphead p, l, W, H, nobr, nosat, 1010000
0x00888C80: 029CDC81 || MPY.M1X A6,B7,A5
0x00888C84: D1C7 || ^ MV.L2X A3,B6
0x00888C86: CEA9 CMPGT.L2 B6,B5,B0
0x00888C88: 31946079 [!B0] ADD.L1 A3,A5,A3
0x00888C8C: 2761 || ADD.L2 B6,1,B6
0x00888C8E: 9CE7 SPKERNEL 3,1
83 for (i = sig_loop_lower_bound; i <= sig_loop_upper_bound; i += 1) {
C$DW$L$_conv$30$B, C$L13, C$L12, C$DW$L$_conv$28$E:
0x00888C90: EC91 ADD.L2 B1,-1,B1
0x00888C92: 15A2 || SHR.S1 A3,0x10,A3
0x00888C94: 26AF || ADDK.S2 0x1,B5
0x00888C96: 47F0 || ADD.L1 A7,2,A7
0x00888C98: 4FE4A121 [ B1] BNOP.S1 C$L9 (PC-56 = 0x00888c48),5
0x00888C9C: E7400700 .fphead n, l, W, BU, nobr, nosat, 0111010
0x00888CA0: 01A03654 || STH.D1T1 A3,*A8++[1]
It looks like the outer loop is split into two parts (probably due to -O3, though I am not sure why), but why are there so many instructions for a simple multiply-accumulate statement?
My first question is whether I am doing something wrong in the C code, or missing a CCS setting that would enable further optimization. Secondly, where could I find an example of optimized TI code for convolution (in C), together with its compiler-generated assembly?
Many thanks,
Andrey