MSP430F5438 problem with MOVA instruction

Other Parts Discussed in Thread: MSP430F5438

I am observing a problem with the execution of the MOVA instruction in the following
example (using an MSP430F5438):

    mova    0xA4B6(R15), R15

Before the above instruction executes, the value of R15 is 0x6. My understanding is that
after the instruction has executed, R15 should hold the value stored at address
0xA4B6 + 0x6 = 0xA4BC, which, according to the memory dump, is 0x0bb8c. However, instead
of that value, I observe that R15 gets the value 0xF3FFF.

I would like to understand the above behavior. I understand that there is an erratum related to
the mova instruction, but I do not see how this erratum (below) can explain the behavior.

Thanks

  • Actually your observation is what I would expect if there weren't the erratum CPU16 (which actually is no erratum but just a duplication of the description of the indexed mode in lower 64k, chapter 4.4.2.1).

    Normally, I'd expect the above instruction to subtract 0x5b4a from R15 (the index is signed, and 0xa4b6 as a signed 16-bit value is -0x5b4a). As this is a 20-bit calculation, the resulting address is 0xfa4bc. Of course there is no flash at that address, and the read returns 0xf3fff, as one would expect when reading from a vacant memory location (two reads of 0x3fff, the default return value).

    Yet if the register has the upper 4 bits clear, the result should be truncated to 16 bit and the addressed location should be 0x0a4bc, which is of course no vacant space.
    So either CPU16 and the user's guide are wrong, or something else is going on.

    Are you sure that you have an MSP430F5438 and not an XMS430F5438? The XMS devices are premature demo versions and often differ from the docs.

    Also, what you observed will happen if the code is executed from RAM while the flash controller is busy writing or erasing. It will then return 0x3fff to any read request. The memory dump will of course reveal the correct contents, as it is taken when the flash controller is no longer busy.

    What you can try is to change the instruction to
    mova 0x0006(R15), R15
    where R15 is loaded with 0xa4b6 before. Here 6 is added to R15, giving the expected 0x0a4bc address in any case.

    While normal programming experience leads to the assumption that the register contains the variable offset to a fixed address (such as an array base address), the assembly language description tells that the fixed part is the (signed) offset and the register contains the base address.
    This makes no difference on a plain 16 bit system due to the automatic rollover, but it does when the address range is larger than 16 bit and no rollover happens.
    Note that the MOVX instruction has a 20 bit offset part, so the automatic rollover applies again.
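
    The arithmetic in this reply can be checked with a small model. This is only a sketch of the address calculation as described above, not of the silicon: the helper names are made up for illustration, and it assumes that a 20-bit read is assembled from two word reads, with the low 4 bits of the second word forming bits 19:16.

```python
def sign_extend_16_to_20(word):
    """Treat a 16-bit value as signed and return it as a 20-bit two's-complement value."""
    word &= 0xFFFF
    if word & 0x8000:
        word |= 0xF0000
    return word


def indexed_address_20bit(index16, base20):
    """Signed 16-bit index plus 20-bit base register, wrapping at 1M (hypothetical helper)."""
    return (sign_extend_16_to_20(index16) + base20) & 0xFFFFF


def read_20bit(low_word, high_word):
    """Assumption: a 20-bit value is built from two word reads (low word, then bits 19:16)."""
    return ((high_word & 0xF) << 16) | (low_word & 0xFFFF)


# Original instruction: mova 0xA4B6(R15), R15 with R15 = 0x6
original = indexed_address_20bit(0xA4B6, 0x00006)
# Suggested workaround: mova 0x0006(R15), R15 with R15 = 0xA4B6
swapped = indexed_address_20bit(0x0006, 0x0A4B6)
# Vacant memory returns 0x3fff for each word read
vacant = read_20bit(0x3FFF, 0x3FFF)

print(hex(original), hex(swapped), hex(vacant))
```

    Under these assumptions the original form lands at 0xfa4bc, the swapped form at 0x0a4bc, and a vacant 20-bit read yields 0xf3fff, which matches the value the original poster saw in R15.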

  • Thanks, Jens-Michael

  • Jens-Michael, you nailed it very nicely except for your assertion: 'Yet if the register has the upper 4 bits clear, the result should be truncated to 16 bit and the addressed location should be 0x0a4bc'. I cannot think of any MSP430 CPUX instruction behavior that executes differently based on the top 4 bits 15:19 being 0 or not. Some instructions do or do not affect the top 4 bits (for example SWPB.A Rn or SWP.W Rn), but that is deterministic, never conditional on an operand's data itself. This thread question is a "Read the user manual !" type.
    Section : Addressing Modes / Address Instructions with Indexed Mode/
  • that was bits 16:19 , not 15:19
  • "I cannot think of any MSP430 CPUX instruction behavior that executes differently based on the top 4 bits 15:19 being 0 or not."

    That is the CPU16 erratum. "With indexed addressing mode and instructions calla, mova and bra, it is not possible to reach memory above 64k if the register content is <64k"
    It means, if bits 16..19 of the index register (which is actually NOT the INDEX register but the BASE ADDRESS register, while the immediate part of the instruction is the signed index) are zero, the CPU is designed to do a wrap-around in 64k address range. Intentionally or (rather) not, maybe a side-effect from making the CPU backwards compatible to plain 16bit code.
    However, this is apparently not the whole truth. If the immediate value (index) is <0x8000 and the register is <0x10000, then you'll get an overflow that wraps around on 64k. However, if the immediate value is >= 0x8000, then you might get an underflow that doesn't wrap around to 0x0ffff but rather to 0xfffff.
    This behavior gave some headache when using indirect jumptables (e.g. jumping indirectly using a function pointer array) created by the compiler. Well, the compiler couldn't know that the linker will put the table above 0x8000. But that was another thread. And the compiler has been fixed long ago (this thread is 7 years old)
    It is, however, not the CPU's fault that people (including compiler coders) think of the immediate part as the unsigned base address and the register as the index. And on the 16 bit CPU core, it didn't make any difference anyway.

    " This thread question is a "Read the user manual !" type. "

    Yes and no. Yes, it is in the manual (and the errata sheet), but no, it is hidden deep in the details and not obvious even if you read the manual. It takes quite some time to digest the users guide to an extent that makes answering this question possible.
  • Jens, after reading your last post, this thread again, and then the manual and the CPU16 erratum, I find that I was wrong. Indexed addressing does indeed behave differently depending on whether bits 16:19 of the base register are zero for simple instructions (4.4.2.1), but supposedly not for address instructions in indexed mode (4.4.2.4), as you pointed out. I now also see that for the original example in this thread, mova 0xA4B6(R15), R15, the result the user found in R15 after execution seems to come from address 0xFA4BC, which is what you'd expect for an address instruction if there weren't the CPU16 erratum. The CPU16 erratum states that "if the register content is <64k it is not possible to reach memory above 64k". Put simply, CPU16 describes the bug as the address-instruction indexed mode wrongly behaving like the simple-instruction indexed mode, which is not what happens in this example.

    Sometimes the description of a bug makes incorrect simplifications when trying to explain its modus operandi. The bug may not be fully understood. You are saying that the wrap-around follows CPU16's description (unintentionally wrapping around at 64k) when a register <64k plus the index would exceed 64k, but that underflowing the register follows the user manual and wraps around at 1024k (1M). Based on these observations, CPU16 might be rewritten as follows (unless I am making a mistake):

    a) If the register address is below 64k, the register is sign-extended from 16 bits to 20 bits and added to the signed index extended to 20 bits. Or, using a cast and sign-extension notation:

    adr = (X.W)->.A + (R).W->.A

    b) If the register address is >= 64k, the index is sign-extended to 20 bits and added to the register:

    adr = (X.W)->.A + R

    example1: mova A4B6h(R5), dst   with R5=00006h  gives source address  (A4B6h)->.A + (00006h).W->.A = FA4B6h + 00006h = FA4BCh  (underflowing the register when <64k wraps around 1M)

    example2: mova 0006h(R5), dst   with R5=0FFFEh  gives source address  (0006h)->.A + (0FFFEh).W->.A = 00006h + FFFFEh = 00004h  (exceeding 64k when register <64k wraps around 64k)

    example3: mova 0006h(R5), dst   with R5=1A4B6h  gives source address  (0006h)->.A + 1A4B6h = 00006h + 1A4B6h = 1A4BCh  (register >64k)

    example4: mova 0006h(R5), dst   with R5=FFFFEh  gives source address  (0006h)->.A + FFFFEh = 00006h + FFFFEh = 00004h  (overflowing the register wraps around 1M)
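
    The two-case rule proposed in this post can be written down as a short model and checked against the four examples. This is a sketch of the hypothesis from this thread, not of documented or verified hardware behavior, and the function names are invented for illustration.

```python
MASK20 = 0xFFFFF


def sext16(value):
    """Sign-extend the low 16 bits of value to a 20-bit two's-complement value."""
    value &= 0xFFFF
    if value & 0x8000:
        value |= 0xF0000
    return value


def mova_indexed_address(index16, reg20):
    """Source address of 'mova X(R), dst' under the rule hypothesized in this thread."""
    reg20 &= MASK20
    if reg20 < 0x10000:
        base = sext16(reg20)  # case a): register below 64k is sign-extended from 16 bits
    else:
        base = reg20          # case b): register at or above 64k is used as is
    return (sext16(index16) + base) & MASK20


examples = [
    (0xA4B6, 0x00006),  # example1
    (0x0006, 0x0FFFE),  # example2
    (0x0006, 0x1A4B6),  # example3
    (0x0006, 0xFFFFE),  # example4
]
for index16, reg20 in examples:
    print(f"X={index16:04X}h R={reg20:05X}h -> {mova_indexed_address(index16, reg20):05X}h")
```

    With this model, example2 lands at 00004h only because a register value in the range 08000h to 0FFFFh is sign-extended in case a), which is the range singled out later in the thread as the one to avoid.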

  • I think this pretty much sums it up.

    I'd say the safe way is to use the MOVX instruction rather than MOVA for indexed addressing. It is a word longer but does what it is intended to do if you operate in extended address space, while the good old MOV is sufficient if you want to remain in the lower 64k.

    The orthogonal design of the MSP430 instruction set made it seem mandatory to implement indexed mode for MOVA, but due to the inherent ambiguity of the parameters (which cannot be detected or resolved at compile time) it shouldn't be used. The TI engineers tried to resolve the ambiguity problem, but every attempt that succeeds in one case will inherently cause failure in the other. And so did the solution they implemented.
    In their situation, I would have dropped MOVA z16(R) (and the opposite MOVA R,z16(R)) entirely. Every tool that can handle MOVA instructions can just as well use MOVX.

    BTW, CALLA x(R) also fails miserably if x is >0x7fff. Early versions of the MSP430X compilers used this when calling a function through an indexed function pointer array (e.g. call function #X from a dynamically loaded library with a jumptable above 0x8000). EDIT: looking back, it was MOVA too, not CALLA (which doesn't have indexed mode at all), that was used by the compiler in this case, and it failed.

  • And yes, you're right, from reading CPU16 it seems that the wrap-around would happen in both directions. Yet it only does so on 16 bit overflow. Not on underflow.
    Well, no MSP so far uses full 1MB address range, so CPU16 is technically correct -> you cannot reach the memory above 64k. You can only reach unused address space right beneath 1MB - and you do so unexpectedly. :)
  • MOVX x(Ri), Rj is 1 cycle slower than MOVA x(Ri), Rj; it is a convenient choice nevertheless.

    One might never use a C or C# compiler but only MSP430 assembly language tools (assembler, linker, archiver, etc.). Bugs like CPU16 make using a compiler even less attractive. Time spent guessing what the compiler will do or has done, and why, is time lost on one's own code. Since this is an embedded application, staying closer to the hardware will be the most robust and the most transparent to the programmer.

    For this bug, it seems that the only case that would lead to a practical application failure (you mentioned a dynamically loaded library) is when the register is between 0x08000 and 0x0FFFF. With MOVA x(R), the index x is always sign-extended to 20 bits (bug or not) and therefore should never be made to contain the base address, as you pointed out (which this thread's original example seems to be doing). But suppose one is coding in assembly language (no compiler). One could then circumvent this by restricting x <= 0x7FFF (the extension to 20 bits leaves it unchanged) and letting the register serve as a positive or negative 20-bit index, which would look like either 0xFxxxx or 0x0xxxx. That still has potential for failure on positive values > 0x07FFF. This is because the bug apparently sign-extends R a15..a0 to 20 bits (example2), resulting in a wrap-around at 64k that one does not expect with MOVA x(R). So the only absolutely safe choice within practical uses (unless I am making a complete reasoning mistake, which is possible :)) would be for R to avoid the address range from 0x08000 to 0x0FFFF when using MOVA.

    This thread's original example demonstrates the underflow behavior of the true bug (which CPU16 misses) but remains a bad use of MOVA, with the base address apparently held in the index instead of the register. You pointed that out right from the beginning. I think this leaves, as the only practical case for the bug (with no user error), the one where R is between 0x08000 and 0x0FFFF.

    Completely unrelated: why is the indirect autoincrement addressing mode @R+ no longer available for the destination register on the MSP430? I was reading an earlier coding hints note and saw lines with MOV x(Ri), @Rj+. Could it be that doing an explicit ADDA #00001h, Rj was deemed an acceptable alternative speed-wise? BTW, can one write the constant generator addressing modes directly when using an assembler (let's forget the compiler altogether)? For example:

    MOVA R3, Rj for MOVA #00000h,  Rj   or   MOVA @R3+, Rj  for ADDA #FFFFFh, Rj   ?

  • Yes, MOVX is a word longer, so it is a cycle slower.

    If this is important (to most of the MSP users it isn't, especially not to those who use float variables where integers would be sufficient, such as index counters), then you'd use assembly or at least inline assembly.

    I once worked in espire (an object-oriented assembler language) and did assembly work for the C64, Spark and the 8031 processors. But for more complex tasks I prefer high-level languages. Today, most compilers produce assembly output anyway instead of producing binary code directly.

    Nevertheless, unless you have a toolchain where the assembler directly produces the final output, or you make an additional effort for exact segment placement, you'll never know where the code ends up in memory space. That's not a matter of using a compiler or an assembler; it is the linker that puts the code in place. So the critical instruction may well end up where you don't want it (>0x7fff), no matter whether you use a compiler or an assembler. So you can't easily prevent the immediate part from becoming >0x7fff and, depending on your application, it might not be possible to ensure the register stays below 0x7fff either. Even though this is rather unlikely, as most people don't use tables >32k. Well, to be honest, I once did, on a 1611 with 40k ram. This code couldn't work on any MSP430X while using MOVA.

    Using the immediate index as base and the base address register as index is what was always done when the address space was limited to 16bit. It didn't make a difference. And that's understandable, because you usually have a fixed base address and a variable index. However, the MSP core was designed the opposite way. Maybe due to internal optimizations. I don't know. Only the original designers know.
     When going to 20 bit address range, this didn't work anymore. Sometimes you have to pay the price for additional functionality. Such as using MOVX for having more than 64k memory.
    The user's guide states: "address instructions are instructions that support 20-bit operands but have restricted addressing modes." and  "Restricting the addressing modes removes the need for the additional extension-word op-code". That's true. However, the restrictions obviously extend beyond just not providing some addressing modes. IMHO it would have been better to simply not provide the indexed mode at all. Especially since MOVA is the only one providing it.

    To answer your other question regarding indirect register/autoincrement for the destination: this was never part of the MSP instruction set. The orthogonal structure of the instruction word leaves only 3 bits for the source and destination addressing modes. Since an immediate value for the destination makes no sense, two bits were assigned to the source and 1 bit to the destination. In fact, there are only 4, not 7, different addressing modes: register, indexed, indirect register and autoincrement.
    Immediate mode is implemented by using autoincrement mode on the PC register, absolute is implemented by using indexed mode with SR (= constant generator returning 0), and symbolic is done as indexed on PC.
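
    The mapping described above can be sketched as a small decoder for the source operand of a 16-bit double-operand instruction word. This is an illustration only: it assumes the usual layout with the source register in bits 11:8 and As in bits 5:4, the constant table is the standard R2/R3 constant generator, and the function name is made up.

```python
# Constants produced by the constant generator, keyed by (source register, As).
CONSTANTS = {
    (2, 2): 4, (2, 3): 8,                         # R2 (SR) in indirect / autoincrement mode
    (3, 0): 0, (3, 1): 1, (3, 2): 2, (3, 3): -1,  # R3 in all four modes
}

BASIC_MODES = ["register", "indexed", "indirect register", "indirect autoincrement"]


def source_mode(instruction_word):
    """Name the effective source addressing mode of a double-operand instruction word."""
    src = (instruction_word >> 8) & 0xF
    as_bits = (instruction_word >> 4) & 0x3
    if (src, as_bits) in CONSTANTS:
        return f"constant #{CONSTANTS[(src, as_bits)]}"
    if src == 0 and as_bits == 3:
        return "immediate"   # autoincrement on PC
    if src == 0 and as_bits == 1:
        return "symbolic"    # indexed on PC
    if src == 2 and as_bits == 1:
        return "absolute"    # indexed on SR, which reads as 0
    return BASIC_MODES[as_bits]


for word in (0x4405, 0x4315, 0x4035, 0x4215):
    print(f"{word:04X}h: {source_mode(word)}")
```

    The four example words correspond to MOV R4,R5, MOV #1,R5, MOV #imm,R5 and MOV &abs,R5 respectively. The decoder does not apply to the address instructions (MOVA, ADDA, ...), which use a different encoding.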

  • Yes, I mean the complete set of assembly tools, including linker directives, and doing one's own restrictive segment placement in memory without having to dictate everything exactly. IMO using a compiler is signing up for heavy back-end debugging work and increases the potential for sub-optimal, badly nested and wasteful use of hardware resources (most people will remedy this by choosing a faster, more power-hungry processor to accommodate a cluttered coding style that relies heavily on the compiler). I recognize the need for a high-level language to develop any application, but I think one should think of the hardware as soon as possible and stay away from esoteric object-oriented structures in embedded applications, which tend to prize saving power through efficient use of limited resources. There is nothing wrong with compilers, but they make it easy for the programmer to err toward writing code that is adverse to efficient machine implementation or that costs extra hardware resources.
    I used to code for the M68000 CPU, and not much since.

    TI's offering of x16s(R20) with MOVA did not put any restriction on its usage, but the bug does. Or is the bug an inexorable consequence of a deeper, problematic architectural choice? Either way, I will stay away from it in assembly.

    The funny part: if autoincrement on a destination never existed, I had to go back and check where I had seen it. Well, it was not an instruction. It was a macro with two arguments, –4(pQR), @pWI+, that lined up perfectly under previous lines that used true MOV or SUB mnemonics. pQR and pWI are both .eq to some registers.

    I hear your paragraph on addressing modes with As, Ad being limited to only 2 and 1 bits respectively for a total of 4 non-emulated source addressing modes and 2 destination modes. But it did not answer my question if an assembler would take :

    ADD @R2+ , R4 ; 2 cycles or 1 cycle ?
    MOV 0(R3) , R5 ; 3 cycles or 1 cycle ?

    for

    ADD #8 , R4 ; 2 cycles
    MOV #1 , R5 ; 2 cycles

    Do you know the execution speed of the constant generator? The user manual omits it, but since it does not require a memory access, is it equivalent to register addressing in execution speed?
  • Using constants will take same cycles as with registers, regarding 16-bit instructions. With 20-bit instructions, number of cycles can be different.

            adda R4, R5        ; 1
            adda #0, R4        ; 2
            adda #1, R4        ; 2
            adda #2, R4        ; 1
            adda #4, R4        ; 2

            suba R4, R5        ; 1
            suba #0, R4        ; 2
            suba #1, R4        ; 2
            suba #2, R4        ; 1
            suba #4, R4        ; 2

            cmpa R4, R5        ; 1
            cmpa #0, R4        ; 1
            cmpa #1, R4        ; 2
            cmpa #2, R4        ; 2
            cmpa #4, R4        ; 2

  • zrno soli, that would be a letdown for address instructions. What is your source, and what about mova? mova with a true immediate source takes 2 cycles. Though your syntax relies on the compiler and you did not spell out use of the constant generator, the quoted cycle times are quicker than true immediate to register (3 cycles for adda, suba and cmpa), so I assume the compiler made the proper substitutions.

  • I was working on an algorithm in assembler that used 20-bit data and was cycle-aligned. I noticed some strange things and measured the number of cycles for each assembler instruction. In the end, everything ran fine. I forgot mova in my last post; here it is...

            mova R4, R5        ; 1
            mova #0, R4        ; 2
            mova #1, R4        ; 2
            mova #2, R4        ; 1
            mova #4, R4        ; 1

  • @ Howard Handsum:
    As a rule-of-thumb, the MSP takes one cycle for each physical memory access. If source or destination are a register, they don't take a cycle, except if the destination is R0. The reason for the latter is that the MSP loads the next instruction while it writes to the destination register. But when the destination is R0 (==PC), then this read goes to the wrong address and must be discarded and repeated.
    On MSPs with FRAM, the number of cycles depends on the location of the instruction and the type of instruction. This is because FRAM requires waitstates at higher clock frequencies, which are sometimes but not always circumvented by the 32-bit read cache. When the instruction writes to FRAM (which will always cause waitstates), or the instruction is a 3-word instruction or a 2-word instruction that is not dword-aligned, or a jump/branch (modifying R0), the resulting cache miss will result in an additional waitstate. On slower MCLK, no waitstates apply and things are like with the flash-based MSPs.

    @ zrno soli:
    are you sure about your clock cycles? I know you usually are, but I'm a bit puzzled.

    ADDA, SUBA and CMPA only know immediate mode (2 cycles) and register mode (1 cycle). However, the only constant you can get in register mode is 0, with R3 as source.
    So I'd expect xxxA #0,Rx to take 1 cycle (as it is effectively xxxA R3, Rx) and all other immediate values to be 2-cycle instructions.
    Regarding MOVA, I'd expect all 6 constants from the constant generator to be used (including -1), as MOVA has all addressing modes available to get all 6 constants.
    Well, probably the reason is that MOVA doesn't have the normal As field (and has the additional MOVA Rx, &abs20), and neither do the other xxxA instructions. Therefore the constant generator only works for some of the constants.
    In any case, those instructions that take 2 cycles should also be 2 words long. :)
  • Jens-Michael Gross said:
    On slower MCLK, no waitstates apply and things are like with the flash-based MSPs.

    My assembler cycle aligned code executed on any 2xx / 5xx I tried without problems. I was surprised that it failed on FRAM, even it was on low MCLK (1 MHz) without wait states. To save some space I used JMP $+2 instead of double NOP's, and that was the problem. When I replaced back all JMP $+2 with double NOP's it started to work on FRAM, too.

    I measured instruction cycles with a timer for all instructions whose cycle count is not noted in the family datasheet, and for some that I had doubts about, and noted them on the side for later use. I also measured the number of cycles for executing some parts of code that did not behave as per the datasheet, to clear things up. For example (if I remember right), these two blocks take 24 cycles on 2xx, but not on 5xx, where instruction order has an impact on the number of cycles.

        rra.b @R5         ; 3        rra.b @R5         ; 3
        nop               ; 1        rra.b @R5         ; 3
        rra.b @R5         ; 3        rra.b @R5         ; 3
        nop               ; 1        rra.b @R5         ; 3
        rra.b @R5         ; 3        rra.b @R5         ; 3
        nop               ; 1        rra.b @R5         ; 3
        rra.b @R5         ; 3        nop               ; 1
        nop               ; 1        nop               ; 1 
        rra.b @R5         ; 3        nop               ; 1
        nop               ; 1        nop               ; 1
        rra.b @R5         ; 3        nop               ; 1
        nop               ; 1        nop               ; 1
    -------------------------    -------------------------
    Total number of cycles 24    Total number of cycles 27
  • @Jens Michael Gross

    The user manual gives a binary description of extended instructions (4.6.1). All MOVA combinations independently of addressing mode for source and destination occupy 1 word. This appears to put the cycle counts measured with zrno soli's timer approach in contradiction with the assumption: "instructions that take 2 cycles should also be 2 words long." Looking closely at that binary description table for MOVA, the values given at the usual As (bits 5:4) and Ad (bit 7) positions are not what one would expect for the addressing mode when compared to the non-address instruction binary description (table 4.22). You must be right that it does not have the normal As and Ad fields. On the other hand, it is not missing MOVA Rsrc,&abs20, which appears in the table.

    @zrno soli

    Have you tried to time non-extended instructions (MOV, ADD, SUB, CMP) ? Since they do not have the same addressing restrictions imposed on xxxA instructions it would clarify a few things. I expect 1 cycle MOV for all constant listed in the constant generator to a register destination, but if not, definitively for R3. If you do not get 1 cycle for MOV #0 , Rdst, there could be something else going on with the timer approach.

    The manual says NOP is an emulated instruction for MOV #0, R3. In the CGR section it also says, quite contradictorily: "Registers R2 and R3, used in the constant mode, cannot be addressed explicitly; they act as source-only registers". This is equivocal for MOV #0, R3. They also say it can be used to set defined waiting times, yet never give cycle times for a CG source.

  • Howard Handsum said:

    Have you tried to time non-extended instructions (MOV, ADD, SUB, CMP) ? Since they do not have the same addressing restrictions imposed on xxxA instructions it would clarify a few things. I expect 1 cycle MOV for all constant listed in the constant generator to a register destination, but if not, definitively for R3. If you do not get 1 cycle for MOV #0 , Rdst, there could be something else going on with the timer approach.

    The manual says NOP is an emulated instruction for MOV #0, R3. In the CGR section it also says, quite contradictorily: "Registers R2 and R3, used in the constant mode, cannot be addressed explicitly; they act as source-only registers". This is equivocal for MOV #0, R3. They also say it can be used to set defined waiting times, yet never give cycle times for a CG source.

    As already noted...

    zrno soli said:

    Using constants will take same cycles as with registers, regarding 16-bit instructions. With 20-bit instructions, number of cycles can be different.

    Regarding the number of cycles, 16-bit instructions (using the constant generator or not) behave as in the datasheet. MOV.W #0, R4 will take one cycle. If you have doubts about my measurement, you can check for yourself.

    As I see in 5xx /6xx family manual NOP is emulated by MOV R3,R3.

  • That's right, you said that earlier; I forgot. Still, strange things are happening with your results, and you seem to demand absolute faith in your word without submitting code to peer analysis, or at least a sufficient verbal description of your cycle alignment method (timer mode, timer initialization, interrupts, etc.). In such conditions, people are left with having to trust you or not because you said so, which is not a constructive or useful situation for filling the gaps in the user manual. Your results may very well be correct; that is not the point here.

  • I can't attach the source, because my USB assembler stack is part of it. The program prints the number of cycles on a PC serial (CDC) terminal.

    The algorithm for measuring the number of cycles doesn't use any interrupts.

    1) Clear the timer and start it as a counter clocked by (S)MCLK.

    2) Execute mov.w R5, R6 256 times (one instruction, or a part of assembler code, copied 256 times, no looping).

    3) Stop the timer.

    4) The number of cycles is then stored in the higher byte of the timer register (TAxR). The few cycles used for stopping the timer are irrelevant compared to the (at least) 256 cycles of the main block.
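
    The reason step 4 works can be shown with a little arithmetic. The sketch below only models the counting, not the timer hardware; the function name and the overhead value of 6 cycles are hypothetical.

```python
def cycles_per_instruction(timer_count, copies=256):
    """Recover the per-instruction cycle count from a raw timer reading.

    Assumes the start/stop overhead is smaller than `copies` cycles, so that
    integer division discards it. With copies=256 this equals the high byte
    of a 16-bit timer value.
    """
    return timer_count // copies


OVERHEAD = 6  # hypothetical number of cycles spent starting and stopping the timer

one_cycle_reading = 256 * 1 + OVERHEAD    # e.g. 256 copies of a 1-cycle instruction
three_cycle_reading = 256 * 3 + OVERHEAD  # e.g. 256 copies of a 3-cycle instruction

print(cycles_per_instruction(one_cycle_reading))
print(cycles_per_instruction(three_cycle_reading))
print(three_cycle_reading >> 8)  # same value read directly as the high byte
```

    Copying the instruction 256 times instead of looping keeps branch cycles out of the measurement, and the power-of-two count means no division is needed on the target.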

  • Howard Handsum said:
    The manual says NOP is an emulated instruction for MOV #0, R3.


    zrno soli said:
    As I see in 5xx /6xx family manual NOP is emulated by MOV R3,R3.

    Since #0 is emulated by R3 in register mode, this is the very same.

    Howard Handsum said:
    Registers R2 and R3, used in the constant mode, cannot be addressed explicitly; they act as source-only registers"

    Well, R2 in register mode is the status register. It can be read and written in this mode. In all other modes it is write protected and reads the constants. R3 might be writable, but since it always returns #0 in register mode (and other constants in other modes), it would be a write-only register :)

    The MSP430 does not throw an exception if an instruction is used in an illegal way. The memory controller throws an exception if instructions are fetched from vacant memory or such, but not the processor. It silently ignores the illegal state, continuing as if it were legal. If you use word instructions on an odd address, it silently ignores the LSB. If you write to R3, it will silently ignore that R3 can't be written to. Or at least the write won't have an effect. I guess that R3 very well exists physically, like all other registers (straight design), even though you cannot read it by any means. Likely, the difference would be in the access signal logic.

  • "All MOVA combinations independently of addressing mode for source and destination occupy 1 word"
    You didn't count the additional word for the #20 or z16 parameter. The binary description only tells the internal bit combination for the opcode, the src/dst register number or the upper 4 bit of the #20 parameter. Reading the lower 16 bit of #20 or z16 requires a second cycle.
    Following zrno soli's measurements, the usage of the constant generator does not match the same on 16bit instructions, so some of the MOVA #20,Rx will have a literal #20 parameter rather than using the constant generator. Which explains the second cycle.
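
    The split described above, with the upper 4 bits of #20 carried in the opcode word and the lower 16 bits in a second word, can be illustrated as follows. This is a sketch of the split only, with a made-up function name; it does not reproduce the full opcode encoding.

```python
def split_imm20(imm20):
    """Split a 20-bit immediate into (upper 4 bits, lower 16 bits)."""
    imm20 &= 0xFFFFF
    return (imm20 >> 16) & 0xF, imm20 & 0xFFFF


upper, lower = split_imm20(0x1A4B6)
print(f"upper nibble in opcode word: {upper:X}h, second word: {lower:04X}h")
```

    Fetching that second word is the extra memory access, which fits the rule of thumb of one cycle per physical memory access mentioned earlier in the thread.
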
  • " I was surprised that it failed on FRAM, even it was on low MCLK (1 MHz) without wait states. To save some space I used JMP $+2 instead of double NOP's, and that was the problem. "

    Maybe the fact that jmp $+2 actually modifies PC (even if to the value it already has) and therefore breaks pipelining of the instruction (hence the second clock cycle), but maybe also invalidates the FRAM cache, is the reason for the irregular waitstates.
    Instead of the double nop (mov R3,R3), you could use mov @R3,R3, which takes just one word but has an additional read to 0x0002, which is no FRAM and therefore takes 1 cycle without messing with the cache. Well, as long as reading 0x0002 has no side-effects. Any other register (or other values from the constant generator) pointing to sram or peripheral registers (not sensitive to read) would do the job too.

    Does mov R0,R0 work? Writing to R0 will result in a waitstate, but perhaps this undoes the autoincrement of R0, resulting in an endless loop? Also, I think I remember a CPU erratum regarding (IIRC) single-stepping over instructions modifying R0. So maybe not a good idea.

    Nevertheless, I don't have an explanation for the different results in your rra.b/nop experiment (I remember it from a much earlier discussion). Especially since the difference is 3 and not 5 cycles, which makes no sense at all. If you extend the code from 6 to 12 instructions each, do the counts exactly double? And is there a difference if you use rra.w?
