Convolution with a multiply-and-add (DDOTPL2) instruction

Andrey56137

Hello,

I am trying to run an optimized (as fast as possible) convolution on an EVM C6472 board. I am using CCS v4.2 and SYS/BIOS 6. The convolution code is very straightforward:

for (i = sig_loop_lower_bound; i <= sig_loop_upper_bound; i += 1) {
    sum = 0;
    for (j = 0; j < h_len; j++) {
        if (j > i)
            continue;
        sum += ((h[j]) * (x[i - j]));
    }
    y[i] = sum;
}

Here, h is the filter, x is the input sequence, and y is the output.
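One thing that can be tried with a loop like this is to fold the `j > i` test into the inner loop's upper bound, so the inner loop body is a plain multiply-accumulate with no branch. The following is only a sketch: the function name `conv_clamped` is made up, and the element types (16-bit inputs, 32-bit sums) and the non-overlap promise implied by `restrict` are assumptions, since the post does not show the declarations.

```c
/* Hypothetical helper: gives the same results as the posted loop,
   but the "j > i" test is folded into the inner loop's upper bound.
   Assumes lo >= 0 and that h, x and y do not overlap. */
void conv_clamped(const short *restrict h, int h_len,
                  const short *restrict x,
                  int *restrict y, int lo, int hi)
{
    int i, j;
    for (i = lo; i <= hi; i++) {
        /* the posted loop only uses j = 0 .. min(h_len - 1, i) */
        int jmax = (i + 1 < h_len) ? i + 1 : h_len;
        int sum = 0;
        for (j = 0; j < jmax; j++)
            sum += h[j] * x[i - j];
        y[i] = sum;
    }
}
```

Whether this changes the generated schedule depends on the compiler version and options, so it is worth checking the assembly output again after such a change.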

I expected the TI compiler to optimize this very efficiently. There is a multiply-and-accumulate instruction, DDOTPL2, that I thought would compute the “sum” variable in a single step. I also expected some instructions to run in parallel, given TI’s pipelining capabilities. However, that is not what I see in the disassembler listing. I couldn’t find a single instance of DDOTPL2, and the loop has many more instructions than I think it should. It is eating up a lot of clock cycles, as my run-time measurements show. I am posting the disassembly below (C source lines interleaved with the corresponding assembly). *** This project uses the -O3 optimization setting. ***

83 for (i = sig_loop_lower_bound; i <= sig_loop_upper_bound; i++) {

0x00888C20: 7586 MV.L1X B11,A3

0x00888C22: A587 || MV.L2 B11,B5

0x00888C24: 0028A8FA CMPGT.L2 B5,B10,B0

0x00888C28: 20422120 [ B0] BNOP.S1 C$DW$L$_conv$30$E (PC+132 = 0x00888ca4),1

0x00888C2C: 0400016E LDW.D2T2 *+B14[1],B8

0x00888C30: 02000428 MVK.S1 0x0008,A4

0x00888C34: 02001069 MVKH.S1 0x200000,A4

0x00888C38: 021540FA || SUB.L2 B10,B5,B4

0x00888C3C: E0200001 .fphead n, l, W, BU, nobr, nosat, 0000001

0x00888C40: 04107A41 ADDAH.D1 A4,A3,A8

0x00888C44: 2611 || ADD.L2 B4,1,B1

0x00888C46: 61F0 || ADD.L1 A3,A3,A7

85 sum = 0;

C$DW$L$_conv$25$B, C$L9:

0x00888C48: 0180A358 MVK.L1 0,A3

86 for (j = 0; j < h_len; j++) {

0x00888C4C: 00200ADA CMPLT.L2 0,B8,B0

0x00888C50: 30286123 [!B0] BNOP.S2 C$DW$L$_conv$28$E (PC+80 = 0x00888c90),3

0x00888C54: 22000028 || [ B0] MVK.S1 0x0000,A4

0x00888C58: 22004068 [ B0] MVKH.S1 0x800000,A4

0x00888C5C: E0400004 .fphead n, l, W, BU, nobr, nosat, 0000010

0x00888C60: 221C8078 [ B0] ADD.L1 A4,A7,A4

88 continue;

C$DW$L$_conv$26$B, C$DW$L$_conv$25$E:

0x00888C64: 0210002A MVK.S2 0x2000,B4

0x00888C68: 020043EA MVKH.S2 0x870000,B4

86 for (j = 0; j < h_len; j++) {

0x00888C6C: 06A003A2 MVC.S2 B8,ILC

0x00888C70: 4C6E NOP 3

C$L10, C$DW$L$_conv$26$E:

0x00888C72: 0CE6 SPLOOP 2

C$DW$L$_conv$28$B, C$L11:

0x00888C74: 03103445 LDH.D1T1 *A4--[1],A6

0x00888C78: 1E7D || LDH.D2T2 *B4++[1],B7

0x00888C7A: 2CE6 SPMASK L2

0x00888C7C: EA132000 .fphead p, l, W, H, nobr, nosat, 1010000

0x00888C80: 029CDC81 || MPY.M1X A6,B7,A5

0x00888C84: D1C7 || ^ MV.L2X A3,B6

0x00888C86: CEA9 CMPGT.L2 B6,B5,B0

0x00888C88: 31946079 [!B0] ADD.L1 A3,A5,A3

0x00888C8C: 2761 || ADD.L2 B6,1,B6

0x00888C8E: 9CE7 SPKERNEL 3,1

83 for (i = sig_loop_lower_bound; i <= sig_loop_upper_bound; i += 1) {

C$DW$L$_conv$30$B, C$L13, C$L12, C$DW$L$_conv$28$E:

0x00888C90: EC91 ADD.L2 B1,-1,B1

0x00888C92: 15A2 || SHR.S1 A3,0x10,A3

0x00888C94: 26AF || ADDK.S2 0x1,B5

0x00888C96: 47F0 || ADD.L1 A7,2,A7

0x00888C98: 4FE4A121 [ B1] BNOP.S1 C$L9 (PC-56 = 0x00888c48),5

0x00888C9C: E7400700 .fphead n, l, W, BU, nobr, nosat, 0111010

0x00888CA0: 01A03654 || STH.D1T1 A3,*A8++[1]

Looks like the outer loop is split into two parts (probably due to -O3, not sure why), but why are there so many instructions for a simple multiply-accumulate statement?

My first question is whether I am doing something wrong in the C code, or missing some CCS setting that would enable further optimization. Second, where can I find an example of optimized TI code for convolution (in C), along with its compiler-generated assembly?

Many thanks,

Andrey

over 15 years ago

Jeff Brower73 over 15 years ago

Genius 3420 points

TI Compiler Experts-

Andrey and I are part of the Signalogic team working on this, so I would like to make a follow-up post. I take it from the lack of reply to Andrey's post that in fact there is no C source code that will cause the TI C64x+ compiler to generate a DDOTPL2 (multiply-and-accumulate) instruction. I find that surprising since efficient MAC has been a TI bread-and-butter for many years.

If that's not correct can you please give us the C source example and compiler options.

Also, is there an app note about writing optimized C source code newer than this one:

http://focus.tij.co.jp/jp/lit/ug/spru425a/spru425a.pdf

Please let us know. Thanks.

-Jeff

PS. As Andrey mentioned, we're using EVMC6472, CCS 4.2, BIOS6, and CGT 7.0.3.

RandyP over 15 years ago in reply to Jeff Brower73

TI__Guru* 84110 points

The Optimizing C Compiler does a really good job of generating tight code when you give it as much information as you have about your code and data. There are several optimization switches and keywords that need to be used, notably -o3, restrict, _nassert, and MUST_ITERATE, that improve the optimization immensely. Unfortunately, your code snippet does not show enough information for us to tell you what you could do to improve it. Does the code shown return the values you expect?
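As a rough illustration of how those keywords are typically attached to a loop, here is a sketch of a dot-product kernel. The function name `dot_hinted`, the 8-byte alignment, and the "multiple of 4" trip count are assumptions made up for the example; they are promises to the compiler and must actually hold for your data. The TI-specific lines are guarded by `_TMS320C6X` so the same file also builds with a host compiler.

```c
/* Hypothetical dot-product kernel. Assumes n is a positive multiple
   of 4 and that a and b are 8-byte aligned and do not overlap. */
int dot_hinted(const short *restrict a, const short *restrict b, int n)
{
    int i, sum = 0;
#ifdef _TMS320C6X
    _nassert(((int)a & 7) == 0);   /* promise: a is 8-byte aligned */
    _nassert(((int)b & 7) == 0);   /* promise: b is 8-byte aligned */
    #pragma MUST_ITERATE(4, , 4)   /* promise: at least 4 trips, multiple of 4 */
#endif
    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

If any of these promises is false at run time, the optimized code can produce wrong results, so they should only state what is guaranteed by the caller.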

A couple of good places to look for information on optimization techniques from the TI Wiki Pages are:

TI Technical Training's TMS320C6000_DSP_Optimization_Workshop

Optimization_Techniques_for_the_TI_C6000_Compiler

Getting the best possible performance from the C64x+ architecture usually requires going through the steps outlined in the information above, usually in this order:

1) Write the program in simple C with no optimization and make sure it works correctly
2) Turn on optimization by switching to the Release configuration and see if you get the performance you need
3) Apply the keywords and pragmas and additional compiler switches to give the optimizer more information about your data, and see if you get the performance you need
4) Use intrinsics to get quick entitlement to the specialized instructions, and see if you get the performance you need
5) Write tight loops in assembly
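To make the intrinsics step concrete, here is a minimal sketch of a dot product that uses the `_dotp2` intrinsic (two 16x16 multiplies plus an add per call) on the TI compiler and falls back to plain C elsewhere. The function name `dot_pairs` is made up, and n is assumed to be even. Note that a convolution reads one operand backwards (`x[i - j]`), so using this for convolution would additionally assume that one of the two arrays is stored in reversed order.

```c
#ifdef _TMS320C6X
#include <c6x.h>   /* TI compiler header */
#endif

/* Hypothetical helper: dot product of two arrays of 16-bit samples,
   processed two samples at a time. Assumes n is even. */
int dot_pairs(const short *restrict a, const short *restrict b, int n)
{
    int i, sum = 0;
    for (i = 0; i < n; i += 2) {
#ifdef _TMS320C6X
        /* _mem4_const: unaligned 32-bit load of two packed shorts */
        sum += _dotp2(_mem4_const(&a[i]), _mem4_const(&b[i]));
#else
        /* host stand-in: what DOTP2 computes for one pair of shorts */
        sum += a[i] * b[i] + a[i + 1] * b[i + 1];
#endif
    }
    return sum;
}
```

Keeping a plain C fallback next to the intrinsic version also gives a reference to compare results against while tuning.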

The TI Compiler Experts do not routinely check the hardware forums, so please wait a couple of minutes for me to move this thread to the Compiler Forum.

If this answers your question, please click the Verify Answer button below. If not, please reply back with more information.

George Mock over 15 years ago in reply to Jeff Brower73

TI__Guru**** 251180 points

I recommend you learn the techniques described in http://processors.wiki.ti.com/index.php/C6000_CGT_Optimization_Lab_-_1 . I think that will give you the improvement you seek.

Thanks and regards,

-George

Jeff Brower73 over 15 years ago in reply to George Mock

Genius 3420 points

George-

Thanks very much for your reply.

The techniques you mention are a good starting point. However, they're not sufficient to generate code as fast as the TI benchmark, nor do they generate code using the most optimal C64x+ instructions, for example DDOTPL2.

There are some very informative threads ongoing about this on the C6x Yahoo Group; I suggest that you read them (look for "efficient C64x+ code generation and DDOTPL2 instruction", "Mismatch in Timing measurements", and "BIOS profiling not matching cycle counts"). There is discussion of specific code generation techniques, memory usage, and cycle measurement methods -- especially where EVM C6472 is concerned -- that I suggest should be added to the TI Wiki page that you mention.

-Jeff

RandyP over 15 years ago in reply to Jeff Brower73

TI__Guru* 84110 points

Andrey and Jeff,

Have you reached a conclusion on this effort? What results have you achieved?

Regards,
RandyP

Jeff Brower73 over 15 years ago in reply to RandyP

Genius 3420 points

Randy-

Here is a summary:

1) v7.0.3 C64x+ compiler can generate DOTP2 instructions, but not DDOTPL2.

2) Enabling cache for LL2 mem (both per-core local mem and shared mem) is crucial; otherwise performance can be 10 to 15x slower than benchmark code. It's important to understand that LL2 mem is "outside the core" even though it is physically onchip mem.

3) With our EVM6472 board, both Clock_getTicks() and TSC_read() functions work well to profile code. We created both non-BIOS and BIOS6 projects and are seeing similar performance results.

4) It's important to develop accurate cycle count formulas. We found that we had to splice together information from different comments in intermediate asm lang source files and also by code examination. Formulas in intermediate asm lang source comments cannot be used as-is.

5) BIOS6 has a ways to go. For example to create SIMD code that does core-specific processing, we have to access the DNUM register directly; the API fails. We suspect there are other cases such as this.

Overall, our current performance results are promising. Without intrinsics we can get within 3x to 5x of benchmark code; with intrinsics, we estimate we can get within 1.5x to 1.7x. Variations are due to length of arrays and which mem areas are used.

If anyone wants to look at our data comparing C6472 vs. quad-core x86 (Penryn) for C code annotated with OpenMP pragmas, or our expected results using C66x, please contact me privately and mention CIM (Compute Intensive Multicore). Thanks.

-Jeff

RandyP over 15 years ago in reply to Jeff Brower73

TI__Guru* 84110 points

Jeff,

Jeff Brower said:
1) v7.0.3 C64x+ compiler can generate DOTP2 instructions, but not DDOTPL2.

I am not surprised, although technically we (meaning you mostly) have not found the way to do it. I doubt there is a way to write C code that will use the DEAL and SHFL instructions, either. DDOTPL2 seems more likely, but it places a lot of restrictions on the code. It was likely implemented with a particular algorithm in mind, and that would be written in assembly or C with intrinsics.

We are proud of our C compiler's skill, but it is not perfect. Even with optimization on, it does silly things with assignments sometimes. And sometimes I just think they are silly and end up getting an assembly error when I try to optimize them out.

Jeff Brower said:
2) Enabling cache for LL2 mem (both per-core local mem and shared mem) is crucial; otherwise performance can be 10 to 15x slower than benchmark code. It's important to understand that LL2 mem is "outside the core" even though it is physically onchip mem.

Yes, cache is crucial to getting the performance that the C64x+ core entitles you to. I am not sure if you are mixing Local-L2, Global-L2, and Shared-L2, but do not forget to make the tradeoffs for L1D cache vs. SRAM. You can also get a lot of performance lift by using EDMA3 and IDMA1 to get data moving to the best spot while you are working on other data. The easiest scheme is max L1D cache and max L2 cache, but other combinations can turn out better for some applications or algorithms.

L1P, L1D, and LL2 are all part of the C64x+ Megamodule. We often use C64x+ and Megamodule and core interchangeably, and it is okay to do that because these parts are tightly coupled. L1P and L1D have no wait state delays in most cases, but L2 does have some wait state delays incurred, mitigated by bus widths and the memory controller's efficiency.

Jeff Brower said:
5) BIOS6 has a ways to go. For example to create SIMD code that does core-specific processing, we have to access the DNUM register directly; the API fails. We suspect there are other cases such as this.

If you have a chance, please report the specific API call you used, along with the BIOS6 version, CGT version, and CCS version you are using, by posting to the BIOS Forum. They would know if this is a known bug or not. I like using the DNUM register reference that you get when you #include <c6x.h>. The same goes for TSCL/TSCH: I just figured out that you can get the 64-bit value efficiently using _itoll(TSCH,TSCL), with optimization on or off. Hopefully, newer compiler releases do not change the order in which the two registers get read.
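If the read order is a concern, one option is to read the two halves into locals explicitly and combine them afterwards. This is only a sketch: `join_halves` and `read_tsc` are made-up names, and it relies on my understanding that TSCL should be read before TSCH and that the counter has to be started once by writing to TSCL. The TI-specific part is guarded by `_TMS320C6X` so the combining helper can be checked on a host compiler.

```c
#ifdef _TMS320C6X
#include <c6x.h>   /* TSCL, TSCH, DNUM control registers */
#endif

/* Combine two 32-bit halves into one 64-bit count. */
unsigned long long join_halves(unsigned int hi, unsigned int lo)
{
    return ((unsigned long long)hi << 32) | lo;
}

#ifdef _TMS320C6X
/* Hypothetical helper: read the 64-bit time stamp counter with an
   explicit order, low half first. Assumes the counter was started
   earlier, for example with "TSCL = 0;". */
unsigned long long read_tsc(void)
{
    unsigned int lo = TSCL;   /* read the low half first */
    unsigned int hi = TSCH;   /* then the high half */
    return join_halves(hi, lo);
}
#endif
```

A cycle count for a code section is then the difference between two `read_tsc()` values taken before and after it.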

Jeff Brower said:
Overall, our current performance results are promising. Without intrinsics we can get within 3x to 5x of benchmark code; with intrinsics, we estimate we can get within 1.5x to 1.7x. Variations are due to length of arrays and which mem areas are used.

Be sure to take a look at that optimization workshop. There are memory bank issues that can be managed to improve performance, and other good ideas. And I think (but have not used them recently) that the simulator and/or profiler offer some good performance analysis tools.

RandyP
