MSP430FR5994: Execution cycles for extended (MSP430X) PUSHM.W and POPM.W don't appear to match Family User's Guide

Part Number: MSP430FR5994


I'm using Timer A configured to count cycles for me. (The CCS cycle counter is worthless -- seriously useless -- so I have to use a hardware timer/counter to get results that are meaningful.)

In my testing, I'm finding that the following pair:

    pushm #3, R11
    popm #3, R11

takes 11 cycles.

(According to Timer A. There are a lot of other details required to get that value -- turning the timer on and off occupies 6 cycles despite the fact that each instruction is 4 cycles in length so should take 8 cycles. But I have to accept that the cycle point where the "turn-on" and "turn-off" takes place isn't necessarily obvious. So a lot of testing has taken me to the point where I can actually calibrate out the results I'm reporting. It's been a process. Trust me, these values are accurate. I've put a lot of effort into ensuring that fact.)

The documentation in Table 4-17 of the MSP430FR58xx, MSP430FR59xx, and MSP430x6xx Family User's Guide shows that the cycle count should be 5 cycles for each, for a total of 10 cycles, not 11 cycles.

I'm running at 1 MHz, so I don't think this is any kind of wait-state issue with the FRAM, and it certainly isn't a wait-state issue with the SRAM, where the stack is located.

So. Any idea where the extra cycle is coming from? Is the table simply wrong?

If so, this would NOT surprise me. The PUSH Rx instruction takes 3 cycles but the POP Rx instruction is actually a MOV @SP+, Rx (Format I instruction) and so takes 2 cycles (see Table 4-10), not 3. Consequently, a PUSH followed by a POP is 5 cycles and that's expected.

That said though, I do NOT expect 11 cycles given the above table which suggests 10 cycles for the pair (where N=3.)
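For reference, the expected figures above can be worked through in a small Python sketch. It assumes the table gives PUSHM.W and POPM.W as 2 + n cycles each, which is consistent with the 5 cycles quoted above for n = 3; the function names are made up for illustration only.

```python
def pushm_w_cycles(n: int) -> int:
    """Documented cycles for PUSHM.W #n (assumed to be 2 + n)."""
    return 2 + n


def popm_w_cycles(n: int) -> int:
    """Documented cycles for POPM.W #n (assumed to be 2 + n)."""
    return 2 + n


N = 3
expected_pair = pushm_w_cycles(N) + popm_w_cycles(N)  # documented total for the pair
measured_pair = 11                                    # Timer A result reported above
unexplained = measured_pair - expected_pair           # cycles the table does not account for

# Single-register case from the post: PUSH Rx (3) + POP Rx as MOV @SP+, Rx (2)
push_pop_pair = 3 + 2

print(expected_pair, measured_pair, unexplained, push_pop_pair)
```

Under those assumptions the documented total is 10 cycles, so exactly one measured cycle is left unexplained.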

Any help would be appreciated. I'd like to fully understand what's going on here.

Thanks,

Jon

  • Hi Jon

    Thanks for sharing!

    As a first step, I used the Run >> Clock function in CCS to check the machine cycle count of pushm #3, R11 and popm #3, R11.

    It reports 1 machine cycle for pushm #3, R11 and 6 machine cycles for popm #3, R11.

    So far, I have no idea whether this result is correct.

    I need to try your Timer A method, and I will also check whether IAR has a machine cycle counter feature.

    Could you please share your test code using Timer A?

    Thanks!

  • Starting with the timer: I assume you are using read/modify/write instructions (bic, bis) to start and stop the timer. The exact time the timer starts and stops depends on when the write part of that instruction happens, which will be its last cycle, plus whatever latency or synchronization happens within the timer.

    I don't think I would start and stop the timer. Reading TAR would seem the better choice, with the address of TAR held in a register (mov @Rx,Ry) for fastest operation.

    That timer latency/synchronization is where I think your trouble lies. Try repeating that pushm/popm pair multiple times. If you do it 10 times and count 110 cycles, you might have something.

  • No, I'm not using BIS or BIC to start or stop the timer. Here's the methodology I'm using:

        ; Calibration:
        mov.w   #(TASSEL_2 | ID_0 | MC_0 | TAIE_0 | TAIFG_0), &TA0CTL   ; uses 4 cycles
        mov.w   &TA0R, R5       ; clock is stopped, read freely.
        mov.w   #(TASSEL_2 | ID_0 | MC_3 | TAIE_0 | TAIFG_0), &TA0CTL   ; uses 4 cycles
        mov.w   #(TASSEL_2 | ID_0 | MC_0 | TAIE_0 | TAIFG_0), &TA0CTL   ; uses 4 cycles
        mov.w   &TA0R, R4       ; read it a 2nd time while clock is stopped.
        ; R4 - R5 = 0x000E - 0x000B = (3 cycles -- calibration value)
    
        ; Test instruction pair:
        mov.w   #(TASSEL_2 | ID_0 | MC_0 | TAIE_0 | TAIFG_0), &TA0CTL   ; uses 4 cycles
        mov.w   &TA0R, R5       ; clock is stopped, read freely.
        mov.w   #(TASSEL_2 | ID_0 | MC_3 | TAIE_0 | TAIFG_0), &TA0CTL   ; uses 4 cycles
        pushm.w #3, R11
        popm.w  #3, R11
        mov.w   #(TASSEL_2 | ID_0 | MC_0 | TAIE_0 | TAIFG_0), &TA0CTL   ; uses 4 cycles
        mov.w   &TA0R, R4       ; read it a 2nd time while clock is stopped.
        ; R4 - R5 = 0x0019 - 0x000B = (14 cycles, less the above 3 cycles = 11 cycles)
    
        ; Test one dozen instruction pairs:
        mov.w   #(TASSEL_2 | ID_0 | MC_0 | TAIE_0 | TAIFG_0), &TA0CTL   ; uses 4 cycles
        mov.w   &TA0R, R5       ; clock is stopped, read freely.
        mov.w   #(TASSEL_2 | ID_0 | MC_3 | TAIE_0 | TAIFG_0), &TA0CTL   ; uses 4 cycles
        pushm.w #3, R11
        popm.w  #3, R11
        pushm.w #3, R11
        popm.w  #3, R11
        pushm.w #3, R11
        popm.w  #3, R11
        pushm.w #3, R11
        popm.w  #3, R11
        pushm.w #3, R11
        popm.w  #3, R11
        pushm.w #3, R11
        popm.w  #3, R11
        pushm.w #3, R11
        popm.w  #3, R11
        pushm.w #3, R11
        popm.w  #3, R11
        pushm.w #3, R11
        popm.w  #3, R11
        pushm.w #3, R11
        popm.w  #3, R11
        pushm.w #3, R11
        popm.w  #3, R11
        pushm.w #3, R11
        popm.w  #3, R11
        mov.w   #(TASSEL_2 | ID_0 | MC_0 | TAIE_0 | TAIFG_0), &TA0CTL   ; uses 4 cycles
        mov.w   &TA0R, R4       ; read it a 2nd time while clock is stopped.
        ; R4 - R5 = 0x0092 - 0x000B = (135 cycles, less the above 3 cycles = 132 cycles)

    (I simply repeated the TA0CTL mov.w instruction, one with MC_3 to start and one with MC_0 to stop.)

    These two instructions, placed back to back (one immediately following the other), yield a baseline difference value. If I now place any instruction in between these two, I can verify whether or not the timing of the inserted instruction matches the theory expressed in the datasheet. And in every case so far (except the one I am presenting), the timing exactly matches what the datasheet says about it. So the method works fine.

    However, to deal with your objection I did as you asked (as shown above) and repeated the pairing a dozen successive times and my argument remains intact. It's exactly 11 cycles per pair.
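The arithmetic in the code comments above can be replayed as a short Python sketch. The register values are the ones reported in those comments; the helper name `body_cycles` is hypothetical.

```python
# Start/stop overhead measured with an empty body: 0x000E - 0x000B
CALIBRATION = 0x000E - 0x000B


def body_cycles(r5: int, r4: int, calibration: int = CALIBRATION) -> int:
    """Cycles attributed to the code placed between timer start and stop."""
    return (r4 - r5) - calibration


one_pair = body_cycles(0x000B, 0x0019)      # single pushm.w/popm.w pair
twelve_pairs = body_cycles(0x000B, 0x0092)  # twelve pairs back to back
per_pair = twelve_pairs / 12                # average cost per pair

print(CALIBRATION, one_pair, twelve_pairs, per_pair)
```

Because the twelve-pair total divides evenly by twelve and matches the single-pair result, a fixed start/stop latency cannot account for the extra cycle: any constant offset would be spread across the repetitions rather than appear in each pair.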

  • The following code computes the cycle count of the inserted body of assembly code:

                        mov.w   #(TASSEL_2 | ID_0 | MC_0 | TAIE_0 | TAIFG_0), &TA0CTL
                        mov.w   &TA0R, R5
                        mov.w   #(TASSEL_2 | ID_0 | MC_3 | TAIE_0 | TAIFG_0), &TA0CTL
                        mov.w   #(TASSEL_2 | ID_0 | MC_0 | TAIE_0 | TAIFG_0), &TA0CTL
                        mov.w   &TA0R, R6
                        mov.w   #(TASSEL_2 | ID_0 | MC_3 | TAIE_0 | TAIFG_0), &TA0CTL
                        ; INSERT CODE BODY HERE -- CYCLE COUNT WILL BE COMPUTED INTO R5 LATER
                        mov.w   #(TASSEL_2 | ID_0 | MC_0 | TAIE_0 | TAIFG_0), &TA0CTL
                        mov.w   &TA0R, R7
                        rla.w   R6
                        add.w   R7, R5
                        sub.w   R6, R5      ; cycle count is now in R5.
                        nop                 ; set breakpoint here and examine R5.

    (I added a few extra instructions to make it easier to work out the cycle count of any instruction or set of contiguous instructions.)
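As a rough model of what the last few instructions of that harness compute, here is a Python sketch. It mirrors rla.w / add.w / sub.w with 16-bit wraparound; the timer readings passed in are made-up values chosen to be consistent with the 3-cycle overhead and 11-cycle body reported earlier in the thread, not captured data.

```python
MASK = 0xFFFF  # registers are 16 bits wide for .w operations


def harness_result(r5: int, r6: int, r7: int) -> int:
    """Model of: rla.w R6 ; add.w R7, R5 ; sub.w R6, R5."""
    r6 = (r6 << 1) & MASK  # rla.w R6   -> 2 * R6
    r5 = (r5 + r7) & MASK  # add.w R7, R5
    r5 = (r5 - r6) & MASK  # sub.w R6, R5
    return r5


# Hypothetical readings: R5 before the first start, R6 after the empty
# start/stop pair, R7 after start + body + stop (the timer is never reset).
r5, r6, r7 = 0x000B, 0x000E, 0x001C

result = harness_result(r5, r6, r7)
overhead = r6 - r5                        # empty start/stop interval
body_plus_overhead = r7 - r6              # interval that contains the body
print(result, body_plus_overhead - overhead)
```

The point of the sketch is that R5 + R7 - 2*R6 is algebraically the same as (R7 - R6) - (R6 - R5), that is, the measured interval minus the calibration interval, so an empty body yields zero.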

  • So. Any idea where the extra cycle is coming from? Is the table simply wrong?

    The cycle-count table in the family datasheet is 100% correct for the 2xx flash family. For CPUXv2, and for FRAM especially, only your method (hundreds of repetitions of an instruction sequence in assembler, measured with a timer) can give a precise answer. This is nothing new: I was playing with this 10 years ago and found that on the 5xx flash family (not FRAM) even a different ordering of the same instructions can execute in a different number of cycles...

        rra.b @R5         ; 3        rra.b @R5         ; 3
        nop               ; 1        rra.b @R5         ; 3
        rra.b @R5         ; 3        rra.b @R5         ; 3
        nop               ; 1        rra.b @R5         ; 3
        rra.b @R5         ; 3        rra.b @R5         ; 3
        nop               ; 1        rra.b @R5         ; 3
        rra.b @R5         ; 3        nop               ; 1
        nop               ; 1        nop               ; 1
        rra.b @R5         ; 3        nop               ; 1
        nop               ; 1        nop               ; 1
        rra.b @R5         ; 3        nop               ; 1
        nop               ; 1        nop               ; 1
    -------------------------    -------------------------
    Total number of cycles 24    Total number of cycles 27
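To make the comparison explicit, the following Python sketch sums the per-instruction annotations for both orderings and compares them with the reported totals. The nominal counts (3 and 1) and the totals (24 and 27) are taken from the listing above; the variable names are illustrative.

```python
# Per-instruction cycle counts as annotated in the listing
NOMINAL = {"rra.b @R5": 3, "nop": 1}

interleaved = ["rra.b @R5", "nop"] * 6           # left-hand column
grouped = ["rra.b @R5"] * 6 + ["nop"] * 6        # right-hand column


def nominal_total(sequence):
    """Sum of the annotated per-instruction cycle counts."""
    return sum(NOMINAL[instr] for instr in sequence)


reported = {"interleaved": 24, "grouped": 27}    # totals stated in the listing

extra_interleaved = reported["interleaved"] - nominal_total(interleaved)
extra_grouped = reported["grouped"] - nominal_total(grouped)

print(nominal_total(interleaved), nominal_total(grouped))
print(extra_interleaved, extra_grouped)
```

Both orderings contain the same instructions and therefore the same nominal sum, so the 3 extra cycles in the grouped ordering cannot be explained by per-instruction counts alone; they depend on the order of execution.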

  • That datasheet is for the FRAM family, as specifically mentioned already: MSP430FR58xx, MSP430FR59xx, and MSP430x6xx Family User's Guide. (Unless the MSP430x6xx itself is the culprit here?)

    I'm first off kind of annoyed that CCS is useless in terms of counting cycles in the lower right corner of the IDE. That's got almost no value to me. Which is what forced my hand in using a hardware timer for testing. I almost cannot believe that this IDE that has existed for so long can be so terribly bad at counting cycles. It's shocking.

    But then I'm finding I cannot trust what I read in the datasheet, either. I have to instead take the time to check, manually. I'm not all that happy, right now.

    Yes, I get it that documentation is one thing and the hardware itself is another and it is possible these things get out of sync. I expect better. But I can get the fact that it happens, anyway.

    But it's a bit of a pain that I cannot trust any software tool or hardware documentation and that I have to independently verify every single thing I read.

    I could wish for better.

    Thanks for your thoughts and interesting presentation. I will see how that performs here, as well, on this device. Worth a shot.

    Just tested it. I get 24 cycles for your left side case and I get 27 cycles for the right side case, just as you illustrated!

    I'm seriously annoyed, now.

    I want to have the complete internal description of the processor, now. (I used to work at Intel on the BX chipset for the Pentium II. I can read.)

    The only remaining question I have is this: I'm getting a consistent 5 cycles for one and 6 cycles for the other. It doesn't matter if I pair them up or isolate them. I get the same results. So unlike your example, it appears to be consistent here. I do need to do things such as inserting NOP instructions, though. I haven't yet tried that. But now this is getting into more complex territory, too. I'll spend a little time on this and I very much thank you for your contribution. I will probably select it as an answer, soon, as I don't think anyone else is going to add a contribution nearly as useful as yours.

    Added: Done and selected. Thanks! I really owe you one for this contribution! (I'm in the process of writing a full-featured BASIC interpreter, including matrix operations, for the MSP430X. It is entirely in assembly code. I have a lot of problems to face, and this is yet another one that is getting in my face right now; Chebyshev and non-linear minimax are only the least of them. But I'm not particularly happy at this moment with TI. The only saving grace is that the 5994 launchpads are cheap and readily available and of very good quality. The file I/O system is what necessitates the FRAM right now. Sadly, this means FLASH is out of the question. But that's okay for my target audience.)

  • The file I/O system is what necessitates the FRAM right now. Sadly, this means FLASH is out of the question. But that's okay for my target audience.

    Just to add that MSP430F5xx/6xx have really fast flash writing (compared to some other vendors). Long word block write with smart bit enabled (write 128 bytes at once) executed from RAM can go over 250 Kbyte/sec.

  • Thanks for that. But it is the endurance that's an issue for the emulator. Not the speed.

    Added (as it appears the moderators want to combine my questions when they are about the same project I'm working on -- I'm new here, though. So I may be misinterpreting the meaning when they combined this question with my earlier one.):

    I appreciated your thoughts and you earned my respect. If you have thoughts on the following, I'd very much appreciate hearing them.

    I'm interested in documentation on the JTAG interface. Not the silly "erase, program, and verify" stuff. But the bit by bit details on the entire JTAG chain as well as what debugging support exists inside the chip via JTAG and how to use those functional groups for debugging purposes. This means both scan-chain debug as well as run-stop debug information. Do you know, one way or another, if this has been released to the public?

    I'm also interested in a good JTAG debugger tool (since, apparently, the CCS toolchain does NOT include one or else it doesn't work very well and is useless to me.) I've concluded that the CCS assembler and linker are good enough for my needs. But I'm very disappointed in its debugging features. My budget could exceed some thousands of dollars for a good toolset, if needed. So money isn't likely a barrier. I just want a good tool that actually works correctly and can help me identify silicon bugs, when necessary. (I've encountered and reported some of these to Microchip in the past because they do have some very good bond-out ICE systems.)
