6678 linear assembly delay slots

Hi

I have written this simple linear assembly code to calculate a dot product:

.global _dotp
_dotp: .cproc pm, pn
	.reg m, n, prod, sum
	ZERO sum
	MVK .S1 100, A1	
	
loop:
 	LDH .D1 *pm++, A2
       ||LDH .D2 *pn++, B2
	SUB .S1 A1, 1, A1
 [A1] B .S2 loop
	NOP 2	
	MPY .M1X A2, B2, A6
	NOP
	ADD .L1 2, sum, sum	
	.return sum
	.endproc
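
For reference, the computation this routine appears to be aiming at can be sketched in C as below. This is only a sketch under stated assumptions: the name `dotp_ref` and the macro `N` are hypothetical, the trip count of 100 is taken from the `MVK` in the listing, the inputs are treated as signed 16-bit values because they are loaded with `LDH`, and the final `ADD` is assumed to be meant to accumulate the `MPY` result.

```c
#include <stdint.h>

#define N 100  /* trip count taken from "MVK .S1 100, A1" in the listing */

/* Hypothetical C reference for the intended dot product.
 * pm and pn each point to N signed 16-bit values (LDH loads halfwords). */
int32_t dotp_ref(const int16_t *pm, const int16_t *pn)
{
    int32_t sum = 0;  /* corresponds to "ZERO sum" */

    for (int i = 0; i < N; i++) {
        /* 16x16 -> 32-bit signed multiply, then accumulate */
        sum += (int32_t)pm[i] * (int32_t)pn[i];
    }
    return sum;
}
```

Comparing the value returned by the assembly routine against a reference like this for a known input is one way to check how many times the multiply and add were actually executed.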


Because of the delay slots of the branch instruction, the MPY and ADD instructions should fall inside the loop, so they should execute 100 times.
But when I compile this linear assembly code with CCSv5, the result shows that the MPY and ADD instructions execute only once.
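
The delay-slot reasoning above can be made concrete with a small C model. It is a sketch only: it assumes the listing is read as hand-scheduled assembly with one execute packet per cycle and the five delay slots of a C6000 branch, and the names `count_mpy`, `enum op` and `BRANCH_DELAY_SLOTS` are hypothetical. It models the expectation stated in the question, not what the assembly optimizer does with linear assembly input.

```c
#define BRANCH_DELAY_SLOTS 5  /* a C6000 branch has five delay slots */

/* One entry per cycle of the loop body as written in the listing. */
enum op { OP_LOAD, OP_SUB, OP_BRANCH, OP_NOP, OP_MPY, OP_ADD, OP_END };

/* Returns how many times the MPY packet executes for a given trip count,
 * under the hand-scheduled reading described above. */
int count_mpy(int trip_count)
{
    static const enum op body[] = {
        OP_LOAD,    /* LDH || LDH            */
        OP_SUB,     /* SUB A1, 1, A1         */
        OP_BRANCH,  /* [A1] B loop           */
        OP_NOP,     /* NOP 2, first cycle    */
        OP_NOP,     /* NOP 2, second cycle   */
        OP_MPY,     /* MPY                   */
        OP_NOP,     /* NOP                   */
        OP_ADD,     /* ADD                   */
        OP_END
    };
    int counter = trip_count;  /* models A1 */
    int pc = 0;
    int pending = -1;          /* delay-slot cycles left; -1 = no branch in flight */
    int mpy_count = 0;

    while (body[pc] != OP_END) {
        int issued_branch = 0;

        switch (body[pc]) {
        case OP_SUB:
            counter--;
            break;
        case OP_BRANCH:
            if (counter != 0) {            /* branch taken only while A1 != 0 */
                pending = BRANCH_DELAY_SLOTS;
                issued_branch = 1;
            }
            break;
        case OP_MPY:
            mpy_count++;
            break;
        default:
            break;
        }

        pc++;
        if (!issued_branch && pending > 0) {
            pending--;                     /* this cycle used up one delay slot */
            if (pending == 0) {            /* all five slots done: jump to loop */
                pc = 0;
                pending = -1;
            }
        }
    }
    return mpy_count;
}
```

In this model, the five cycles after the branch (the two `NOP` cycles, `MPY`, `NOP`, `ADD`) all run before control returns to `loop`, which is why the multiply and add would be expected once per iteration if the code were scheduled by hand.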

I compiled this code at optimization level O3.
Are any other optimization settings required?