l1:
// Increment r1
ADD r1, r1, 1
// Store the new r1 value into the address pointed to by r2
ST32 r1, r2
// Jump back to l1 and repeat
JMP l1
The code above seems to take four cycles per loop iteration, not three. (The ST32 is really "SBBO src,dst,#0x00,4".)
What takes the extra cycle? Is it the jump? The TRM doesn't seem to publish per-instruction cycle counts, probably because all but a few instructions take one cycle.
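One way to narrow this down might be to time a single pass with the PRU's own cycle counter. The following is only a rough sketch, not tested code. It assumes PRU0 on an AM335x, where (as far as I can tell from the TRM) the PRU0 control block sits at local address 0x00022000, with CONTROL at offset 0x00 (bit 3 = COUNTER_ENABLE) and CYCLE at offset 0x0C. The registers r10 to r13 and the label "done" are arbitrary choices for illustration, and r2 is assumed to already point at valid memory.

```asm
// ASSUMPTION: PRU0 control block at local address 0x00022000 (AM335x)
MOV  r10, 0x00022000
// Set COUNTER_ENABLE (bit 3) in CONTROL (offset 0x00)
LBBO r11, r10, 0x00, 4
SET  r11, r11, 3
SBBO r11, r10, 0x00, 4
// Read CYCLE (offset 0x0C) before the code under test
LBBO r12, r10, 0x0C, 4
// Code under test: one pass through the loop body
ADD  r1, r1, 1
SBBO r1, r2, #0x00, 4
JMP  done
done:
// Read CYCLE again and take the difference
LBBO r13, r10, 0x0C, 4
SUB  r13, r13, r12
// r13 now holds the elapsed cycles, including the cost of the counter reads
```

The difference includes the overhead of the two LBBO reads, so the number only means something relative to a baseline: run the same sequence with the three instructions under test removed, then drop them back in one at a time and see which one adds two cycles instead of one.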
--Chris