Optimizing Assembly Code through Software Pipelining

xingdou yan

I am studying the "TMS320C6000 Programmer’s Guide".

While working through chapter 5.4, which shows how to unroll the C code in assembly, I ran into a problem.

The C code is as follows:

int dotp(short a[], short b[])
{
    int sum0, sum1, sum, i;

    sum0 = 0;
    sum1 = 0;
    for (i = 0; i < 100; i += 2) {
        sum0 += a[i] * b[i];
        sum1 += a[i + 1] * b[i + 1];
    }
    sum = sum0 + sum1;
    return (sum);
}

The assembly code is as follows (page 183 of the TMS320C6000 Programmer’s Guide):

{

LDW .D1 *A4++,A2 ; load ai & ai+1 from memory
|| LDW .D2 *B4++,B2 ; load bi & bi+1 from memory
|| MVK .S1 50,A1 ; set up loop counter
|| ZERO .L1 A7 ; zero out sum0 accumulator
|| ZERO .L2 B7 ; zero out sum1 accumulator
[A1] SUB .S1 A1,1,A1 ; decrement loop counter
|| LDW .D1 *A4++,A2 ;* load ai & ai+1 from memory
|| LDW .D2 *B4++,B2 ;* load bi & bi+1 from memory
[A1] SUB .S1 A1,1,A1 ;* decrement loop counter
||[A1] B .S2 LOOP ; branch to loop
|| LDW .D1 *A4++,A2 ;** load ai & ai+1 from memory
|| LDW .D2 *B4++,B2 ;** load bi & bi+1 from memory
[A1] SUB .S1 A1,1,A1 ;** decrement loop counter
||[A1] B .S2 LOOP ;* branch to loop
|| LDW .D1 *A4++,A2 ;*** load ai & ai+1 from memory
|| LDW .D2 *B4++,B2 ;*** load bi & bi+1 from memory
[A1] SUB .S1 A1,1,A1 ;*** decrement loop counter
||[A1] B .S2 LOOP ;** branch to loop
|| LDW .D1 *A4++,A2 ;**** load ai & ai+1 from memory
|| LDW .D2 *B4++,B2 ;**** load bi & bi+1 from memory
MPY .M1X A2,B2,A6 ; ai * bi
|| MPYH .M2X A2,B2,B6 ; ai+1 * bi+1
||[A1] SUB .S1 A1,1,A1 ;**** decrement loop counter
||[A1] B .S2 LOOP ;*** branch to loop
|| LDW .D1 *A4++,A2 ;***** ld ai & ai+1 from memory
|| LDW .D2 *B4++,B2 ;***** ld bi & bi+1 from memory
MPY .M1X A2,B2,A6 ;* ai * bi
|| MPYH .M2X A2,B2,B6 ;* ai+1 * bi+1
||[A1] SUB .S1 A1,1,A1 ;***** decrement loop counter
||[A1] B .S2 LOOP ;**** branch to loop
|| LDW .D1 *A4++,A2 ;****** ld ai & ai+1 from memory
|| LDW .D2 *B4++,B2 ;****** ld bi & bi+1 from memory
LOOP:
ADD .L1 A6,A7,A7 ; sum0 += (ai * bi)
|| ADD .L2 B6,B7,B7 ; sum1 += (ai+1 * bi+1)
|| MPY .M1X A2,B2,A6 ;** ai * bi
|| MPYH .M2X A2,B2,B6 ;** ai+1 * bi+1
||[A1] SUB .S1 A1,1,A1 ;****** decrement loop counter
||[A1] B .S2 LOOP ;***** branch to loop
|| LDW .D1 *A4++,A2 ;******* ld ai & ai+1 fm memory
|| LDW .D2 *B4++,B2 ;******* ld bi & bi+1 fm memory
; Branch occurs here
ADD .L1X A7,B7,A4 ; sum = sum0 + sum1

}

I want to know why the assembly code writes to the same registers (A2, B2) in successive cycles, before the data already loaded into A2 (or B2) has been used in a computation.

I look forward to your answers.

Thanks

over 15 years ago

0 RandyP over 15 years ago

TI__Guru* 84110 points

You are off to a good start with the Programmer's Guide, which teaches the methods for writing and optimizing your code.

If you want to understand what the assembly code is doing, you will need to find the CPU & Instruction Set User's Guide for the device you are interested in using. There are slight differences from the C6000 to the C6400 to the C64x+ to the C674x DSP cores, but the code you show above is going to be the same for them all.

What you have is the tightest possible code for executing your dotp function, so you may be very happy with the performance you get. The execute packet at LOOP: contains 8 parallel instructions, which means the DSP executes all 8 of them in one (1) clock cycle, and that is the best that can be achieved. Newer DSP cores may extend the instruction set so that even better performance becomes possible, but the code you have is very good.
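A rough way to check the "best that can be achieved" claim is to count load slots. The sketch below assumes what the listing suggests: two load units (.D1 and .D2), one LDW per unit per cycle, and two 16-bit elements per LDW. The function names are made up for illustration, and the 7 + 50 + 1 breakdown is read off the listing (seven execute packets before LOOP:, fifty passes through LOOP:, one final ADD) rather than quoted from the guide.

```c
#include <assert.h>
#include <stdio.h>

/* Ceiling division for positive integers. */
static int ceil_div(int n, int d)
{
    assert(d > 0);
    return (n + d - 1) / d;
}

/* Lower bound on loop cycles imposed by the load units alone.
 * Assumption: both arrays must be read in full, each load fetches
 * elems_per_load elements, and each unit issues one load per cycle. */
int min_kernel_cycles(int n_elems, int elems_per_load, int load_units)
{
    int loads = 2 * ceil_div(n_elems, elems_per_load); /* a[] and b[] */
    return ceil_div(loads, load_units);
}

/* Total cycles of a pipelined loop: ramp-up, steady state, wrap-up. */
int listing_cycles(int prolog, int kernel, int final_ops)
{
    return prolog + kernel + final_ops;
}

void print_bounds(void)
{
    printf("load-bound kernel cycles: %d\n", min_kernel_cycles(100, 2, 2));
    printf("cycles in the listing:    %d\n", listing_cycles(7, 50, 1));
}
```

With 100 elements per array, the two load units need at least 50 cycles just to fetch the data, and the kernel at LOOP: runs exactly 50 times, so the steady-state loop already sits at that bound.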

You will need to study in detail the sections of the CPU & Instruction Set UG that come before the listings of the individual instructions. They explain how the C6000 unprotected pipeline works. It is complex and not something that can be explained easily in a forum posting. That is why the whole UG was written: to explain how the architecture works to meet your DSP requirements.

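For the LDW instructions, you will learn at what point in the pipeline the values are actually written to the A2 and B2 registers. LDW has four delay slots, so a loaded value only becomes available to instructions issued five cycles after the LDW; until then, A2 and B2 still hold the previously loaded values. From that, you will understand that everything works well and works cleanly.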
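To make the timing concrete, here is a minimal C model of the listing's schedule. It is a sketch under stated assumptions, not TI code: DelayedReg, reg_issue, reg_tick, and dotp_model are hypothetical names, LDW is modeled with four delay slots and MPY/MPYH with one, the data is assumed little-endian (a[i] in the low half of the word), and loads past the end of the arrays return 0 in the model whereas the real listing simply over-reads.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define N          100        /* elements per array, as in dotp() */
#define PAIRS      (N / 2)    /* one LDW fetches two 16-bit elements */
#define LDW_SLOTS  4          /* delay slots assumed for LDW */
#define MPY_SLOTS  1          /* delay slots assumed for MPY / MPYH */
#define MAX_SLOTS  4

/* Hypothetical model of a register whose writes land after delay slots. */
typedef struct {
    int32_t value;                   /* what instructions read this cycle */
    int32_t pending[MAX_SLOTS + 1];  /* pending[0] lands at end of this cycle */
    int     valid[MAX_SLOTS + 1];
} DelayedReg;

/* Issue a write that becomes visible slots + 1 cycles after this one. */
static void reg_issue(DelayedReg *r, int slots, int32_t v)
{
    assert(slots >= 0 && slots <= MAX_SLOTS);
    r->pending[slots] = v;
    r->valid[slots] = 1;
}

/* End of cycle: land the oldest write, then age the others by one cycle. */
static void reg_tick(DelayedReg *r)
{
    if (r->valid[0])
        r->value = r->pending[0];
    for (int k = 0; k < MAX_SLOTS; k++) {
        r->pending[k] = r->pending[k + 1];
        r->valid[k] = r->valid[k + 1];
    }
    r->valid[MAX_SLOTS] = 0;
}

/* Pack x[2p] (low half) and x[2p+1] (high half) into one 32-bit word.
 * Model assumption: reads past the array yield 0. */
static int32_t load_pair(const short x[], int pair)
{
    if (pair >= PAIRS)
        return 0;
    uint32_t lo = (uint16_t)x[2 * pair];
    uint32_t hi = (uint16_t)x[2 * pair + 1];
    return (int32_t)((hi << 16) | lo);
}

static int32_t lo16(int32_t w) { return (int16_t)((uint32_t)w & 0xFFFFu); }
static int32_t hi16(int32_t w) { return (int16_t)((uint32_t)w >> 16); }

/* Replays the schedule: an LDW pair every cycle, MPY/MPYH from cycle 6,
 * ADDs from cycle 8 (the LOOP: packet), then one final ADD. */
int dotp_model(const short a[], const short b[], int *cycles)
{
    DelayedReg A2 = {0}, B2 = {0}, A6 = {0}, B6 = {0};
    int32_t A7 = 0, B7 = 0;
    const int last = 7 + PAIRS;   /* 7 ramp-up cycles + 50 loop cycles */

    for (int cycle = 1; cycle <= last; cycle++) {
        if (cycle >= 8) {         /* ADD reads the products that have landed */
            A7 += A6.value;
            B7 += B6.value;
        }
        if (cycle >= 6) {         /* MPY reads the pair loaded 5 cycles ago */
            reg_issue(&A6, MPY_SLOTS, lo16(A2.value) * lo16(B2.value));
            reg_issue(&B6, MPY_SLOTS, hi16(A2.value) * hi16(B2.value));
        }
        /* A new load targets A2/B2 every cycle without disturbing them yet. */
        reg_issue(&A2, LDW_SLOTS, load_pair(a, cycle - 1));
        reg_issue(&B2, LDW_SLOTS, load_pair(b, cycle - 1));

        reg_tick(&A2);
        reg_tick(&B2);
        reg_tick(&A6);
        reg_tick(&B6);
    }
    *cycles = last + 1;           /* the final ADD of sum0 and sum1 */
    return A7 + B7;
}
```

Calling dotp_model(a, b, &cycles) on two 100-element arrays should return the same value as the original dotp(). In the model, up to five loads are in flight toward A2 at once, yet each MPY sees exactly the pair that was loaded five cycles earlier, which is why reusing A2 and B2 is safe.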

You can also try loading this program into the Cycle Accurate Simulator in Code Composer Studio and do an assembly single-step through the code to see when values show up and how they get loaded, multiplied, and added. Open a register window and follow along with each instruction to see the accurate pipeline stages in action.

0 one and zero over 15 years ago in reply to RandyP

TI__Mastermind 18256 points

Hi,

This is very well covered in the C6000 Optimization Workshop. You can find the material here:

http://processors.wiki.ti.com/index.php/TMS320C6000_DSP_Optimization_Workshop

Chapter 8 (page 216) covers software pipelining and walks through all the concepts needed to understand the mechanism.

0 Sid over 15 years ago in reply to one and zero

Genius 4530 points

Hi,

This seems to be a case of loop unrolling on a VLIW architecture.

As you can see, a VLIW core has higher issue bandwidth, so up to 8 instructions can be issued in parallel. Since a load instruction takes several cycles to deliver its result (LDW has four delay slots on the C6000), the pipeline is not stalled; instead, further operations are scheduled into those cycles by reordering. At the same time, each new LDW delivers its values to the registers only after the multiply for the previous iteration has read the earlier values.

This is generally done on VLIW architectures to reduce the CPI (clocks per instruction) by executing more instructions in parallel, which improves performance.
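The overlap can be tabulated per cycle. The sketch below assumes LDW has four delay slots and MPY has one, so iteration 0 issues its LDW in cycle 1, its MPY in cycle 6, and its ADD in cycle 8. The enum and function names are hypothetical. The asterisks in the listing's comments appear to count the same iteration numbers: at LOOP: the LDW carries seven, the MPY two, and the ADD none.

```c
#include <assert.h>
#include <stdio.h>

#define LDW_DELAY_SLOTS 4   /* assumed for LDW */
#define MPY_DELAY_SLOTS 1   /* assumed for MPY / MPYH */

enum Stage { STAGE_LDW, STAGE_MPY, STAGE_ADD };

/* Cycle (1-based) in which iteration 0 issues the given stage. */
static int first_cycle(enum Stage s)
{
    int ldw = 1;
    int mpy = ldw + LDW_DELAY_SLOTS + 1;  /* loaded data is readable here */
    int add = mpy + MPY_DELAY_SLOTS + 1;  /* product is readable here */
    return s == STAGE_LDW ? ldw : (s == STAGE_MPY ? mpy : add);
}

/* Iteration (0-based) whose stage s issues in the given cycle,
 * or -1 if that stage has not started yet. */
int iteration_in_flight(enum Stage s, int cycle)
{
    int it = cycle - first_cycle(s);
    return it >= 0 ? it : -1;
}

/* Print one row per cycle; "-" means the stage is still idle. */
void print_schedule(int cycles)
{
    static const char *names[] = { "LDW", "MPY", "ADD" };

    for (int c = 1; c <= cycles; c++) {
        printf("cycle %2d:", c);
        for (int s = STAGE_LDW; s <= STAGE_ADD; s++) {
            int it = iteration_in_flight((enum Stage)s, c);
            assert(it >= -1);
            if (it < 0)
                printf("  %s -", names[s]);
            else
                printf("  %s %d", names[s], it);
        }
        printf("\n");
    }
}
```

Calling print_schedule(8) lists the ramp-up and the first kernel cycle, where three different iterations are being worked on at once.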

Regards,

Sid

0 RandyP over 15 years ago in reply to Sid

TI__Guru* 84110 points

Interesting how we have 3 different views of this issue presented. I hope one or more of these have managed to answer your question.

The best description, if you want to learn software pipelining, is in the training material that one and zero referenced. Sid's description is a good qualitative summary and explains the basic concept of software pipelining quite well, but please be aware that the referenced training material is more accurate.
