MSP430FR5994: Strange/Undocumented behavior when disabling/enabling individual timer interrupts

Filip Dimitrov

Part Number: MSP430FR5994
Other Parts Discussed in Thread: MSP-EXP430FR5994, , MSP430F6659, MSP430FR2633, MSP430FR4133, MSP430FR5969, MSP430FR6989

Hello all,

Hardware : MSP430FR5994 LaunchPad Development Kit (MSP-EXP430FR5994)
CSS compiler : CCS Version: 9.3.0.00012
CPU : 78C85QW G4 / MSP430FR5994 / REV C
More detailed information about the setup, in the attached project.

I have this very strange prolem related to a simple tick counter I've implemented for MSP430FR5994.
Setup is very simple, I'm starting a timer (TA2) with interrupt (on CCR0). Inside ISR I just increment a counter (UINT64).
When I need to get counter value from main code, I disable interrupts for that specific timer and get the value.
Afterwards I re-enable interrupt for it.

Long story short : this causes wild execution of code, hangs, and all sorts of similar problems.
And I don't have any rational explanation for all of this. Already checked errata, family datasheet, datasheet, forum post etc...and I can't find any reasonable explanation.
The only closest thing that I can find is in MSP430 family user guide, section 1.3.4.1 Interrupt Acceptance, where there is a note on enabling/disabling global interrupt flag.
But this is related to global inerrupt flag, which is claimed as fixed when using compiler intrinsics __disable_interrupt()/__enable_interrupt() (not tested/checked explicitly).

After 3 days fight with my large project, which suddenly start to reset the device after adding some variables, I've managed to boil-down the problem to a simple test project.
I've already checked assembly code (seems quite ok), linker map file (sections seems ok, stack is not overriden by my value) and anything else I can think of.
I'm attaching it in this thread. In the begining of main.c, there is more detailed description of setup/build/used devicess & revisions etc.
Entire test case is in main.c, supplied with proper comments. Other files are just supplementary for initializations etc.
I know this sample project have logical problem related to reading correct counter value (because of instruction pre-fetch mechanisms inside CPU),
but this should not cause any of observed effects.

So test project is setup to blink the green led if everything is currently working, and to turn on the red one if improper reset is detected.
Led blink may hang, this also indicate a problem, long WDT (~ 40 sec) is also setup, so device will reset eventually and red led will be turned on.
Depending on conditions, problem appear right away (no led on) or after short time (1 - 10 sec).
Also a startup delay (4 sec) is in place so if there is reset of any kind it will be visible with naked eye.
Dev board reset button will clear red status.
If initialization fails, device will hang forever (no WDT reset), so this is also easily detected.
It is very easy to test and see when the problem appear, no need for debugger or any communications attached, that may affect the test.

The related "things", that affects whether this problem will appear and how, according to my tests are :
- Counter position in memory.
- Size of counter (UINT16/UINT32/UINT64).
- Various NOPs (or else) placed arount function that gets counter.
- Various NOPs (or else) placed inside ISR.
- Removing desable/enable interrupt makes problem disapear (check main.c comments)

In general my problem is not with the problem itself, but with lack of information about it.
I'm using other "lock-free" techniques in various parts in my code, and if such undocummented problems (?) appears now and then, I can't really trust this CPU.
So my questions (finally) :
- Solving this particular problem is apparently easy, by just adding simple NOP between disable interrupt and reading counter value in main code. But is it correct ?
- Are other peripherals have same problem ? What is their solution ?
- Is there any other place except errata document & forums, where such problems are described ? Where ?

I'm involved in a very serious project, and I can't afford any hidden "surprises" in the final product.
So please advise.

Thank you for your support,
FilipCrashTest.zip

over 5 years ago

0 Gary Gao over 5 years ago

TI__Mastermind 40890 points

Hi Filip

Some comments from my side:

1. I think the way you use have a bug here. If the counter value overflow, the time delay is not accurate and the counter value will overflow.

2. When you want to stop the timer to read the counter value, stop the timer by "TA0CTL = TASSEL__SMCLK | MC__STOP;" will be better than clear the CCIE.

3.If you can make your code more sample to reproduce the problem will be better.

4.I think we can focus on what's the function do you want to realize, Why you want to add the counter value in the Timer's ISR?

0 Filip Dimitrov over 5 years ago in reply to Gary Gao

Prodigy 80 points

Hello Gary,

Thank you for your answer.
However, question here is not for the logic of this test program, but to random code execution after sequence :

1. Disable peripheral (any ?) interrupt.
2. Not putting NOP

And where this is explained as a problem in official documents.

Regards,
Filip

0 Chester Gillon over 5 years ago

Guru 92251 points

I can repeat the crash with your example program, on a MSP-EXP430FR5994, albeit with a X430FR5994 rev A device (i.e. prototype silicon).

Since the program seems to fail with an out of range PC, I set a break point for a PC off the end of the range used by the program. I.e. "Break After Program Address" for 0x10576.

When the breakpoint is hit the following conditions occur, on multiple attempts:

The PC is 0x01067A which is off the end of the program
The global g_nTVal variable is 1255
The TA2CCTL0 CCIE bit is zero

I haven't yet determined the cause, and can't see any errata which explain the problem. The MSP430FR5994 has a S version of the EEM, which doesn't support trace, and so far haven't been unable to capture the last valid instruction executed.

While section 1.3.4.1 Interrupt Acceptance of the MSP430FR58xx, MSP430FR59xx, and MSP430FR6xx Family User's Guide mentions the need to insert a NOP instruction in front of the EINT instruction and a NOP instruction after the DINT, it is not clear if the same applies when a peripheral interrupt is enabled/disabled while global interrupts remain enabled (which is what the TickGet16 function does).

Perhaps someone from TI could comment on this.

0 Bruce McKenney over 5 years ago in reply to Chester Gillon

Guru 101915 points

I saw something similar, (I used the MPU to catch it) though of course my addresses are different. The evidence I see is:

1) T2 has just presented an IRQ (CCIFG=1, TA2R=TA2CCR0); since CCIE=0, the ISR hasn't executed.

2) TickGet16 has executed its first instruction at 0x10470 (the BIC) and maybe its second instruction at 0x10476 (MOV to R12), but not its third instruction at 0x1047a (MOV to R15).

3) Control then jumped directly to 0x10730 -- this is the failure. The difference between the "old" and "new" PCs appears to be a constant (~0x2C0), since if I add instructions elsewhere both move the same amount. I don't see this address or this offset anywhere nearby (including the stack recent history).

One thing I found interesting was a few cases where the ISR was executing but CCIE=0. This tells me that the IRQ was queued sometime in the middle of the "BIC CCIE" instruction (which takes 8 MCLKs), and was allowed to run even though CCIE=0 by the time the instruction completed.

And yet, in the failure case the IRQ was evidently presented in the middle of the BIC instruction but was not allowed to run.I wonder if the Verilog has a rule like "if IRQ is presented before instruction clock 4, do it, but after clock 4 don't do it" or something.

But yes, the finger seems to point to turning off CCIE "simultaneous" (fuzzy term) with the timer presenting a CCIFG IRQ. It takes a little while for the meta-cycles to match up exactly.

[Edit: I forgot to mention: SMCLK=MCLK=2MHz (DCO=16MHz, /8) and NWAITS=1. The code path is probably large enough to not fit in the FRAM cache.]

0 David Schultz over 5 years ago in reply to Chester Gillon

Guru 25485 points

The stated reason for putting a NOP after a DINT is to make sure that it is protected from interrupts. If that instruction doesn't need protection, you don't need a NOP. The reason for a NOP before an EINT is to avoid the problems caused be certain other interrupt affecting instructions that might occur just before it.

None of that applies here. The manual does recommend stopping the timer before changing its operation. With a notable exception for the interrupt enable. (Note use of singular rather than plural) It isn't clear if that is only TAIE or includes CCIE.

0 Bruce McKenney over 5 years ago in reply to Bruce McKenney

Guru 101915 points

As Filip mentioned, the location of the global -- referenced in the instruction following the BIC -- matters. Lowering its address appears to move the "phantom branch target" by 2x the change in the global's address.

At around 0x2008 (starting at 0x214c) the target falls into the "normal" code (0x10490) and becomes difficult to catch. Following the line, if the global's address were 0x1FF8, the offset would be 0, making the effect disappear (go into hiding at least).

0 Chester Gillon over 5 years ago in reply to Chester Gillon

Guru 92251 points

Chester Gillon said:
I can repeat the crash with your example program, on a MSP-EXP430FR5994, albeit with a X430FR5994 rev A device (i.e. prototype silicon).

I have created a new example which can run on multiple different devices, using one source file and a different CCS build configuration for each tested device.

It is a simplified version of the original version in which the following changes have been made:

Leave the CPU clock at the device default at reset, which is nominally 1 MHz.
Only use a single output pin which should be toggled to indicate the program is still running.
Leave the watchdog disabled, and add a trap of reading the SYSRSTIV at start-up and halt if has got reset source which appears to be following a crash.
Change from using TA2 to TA0 to be able to run on smaller devices.
Use a structure where the padding field allows the address of the nTVal variable incremented by the TIMER0_A0_ISR to be adjusted, instead of the #pragma LOCATION used in the original example. This still allows the linker to check the RAM allocation when attempting to port the example to other devices.

The results from testing on different devices are:

Device	Result
MSP430F6659 Rev A	Does not crash
MSP430FR2633	Does not crash
MSP430FR4133 Rev B	Does not crash
MSP430FR5969	Crashes. PC changes to invalid address 0x01034e, and then device resets with SYSRSTIV "PMMPW PMM password violation (PUC)"
X430FR5944 Rev A	Crashes. PC changes to invalid address 0x01034e, and then device resets with SYSRSTIV "PMMPW PMM password violation (PUC)"
MSP430FR6989 Rev C	Crashes. PC changes to invalid address 0x01034e, and then device resets with SYSRSTIV "PMMPW PMM password violation (PUC)"

I.e. the crash has been repeated on more than one type of device. As the crash has already been shown to be sensitive to the addresses used, not sure if that is the reason why the crash didn't happen on all devices tests (since the devices used have different amounts of memory).

The project is attached MSP430_crash_on_enable_disable_timer_int.zip

0 Chester Gillon over 5 years ago in reply to Bruce McKenney

Guru 92251 points

Bruce McKenney47378 said:
Lowering its address appears to move the "phantom branch target" by 2x the change in the global's address.

I can repeat that effect with the MSP430_crash_on_enable_disable_timer_int.zip example attached on the previous post. Some tests on changing the address of the globals.nTVal field by adjusting the amount of preceding padding when running on a MSP430FR6989:

globals.nTVal address	Phantom branch target
0x0020F6	0x010346
0x0020F8	0x01034A
0x0020FA	0x01034E
0x0020FC	0x010352
0x0020FE	0x010356

I haven't yet determined why moving the address of the global changes the phantom branch target, nor what actually causes the phantom branch target.

0 Bruce McKenney over 5 years ago in reply to Chester Gillon

Guru 101915 points

Thanks for writing this, though I probably won't get to it tonight.

I tried the original code on a Launchpad with a Rev C FR5994 (as did Filip), and it failed the same way.

The picture that seems to be forming is that the timer is presenting an IRQ, but at some critical moment the CPU expects to see one thing (IRQ number?) on the address bus but instead sees the source address from the next (pipelined) instruction. So it dutifully takes the low 8 bits, multiplies by 2x and adds to the PC.

Why now? I suppose it has to do with that same instruction making a RMW request to the (same) timer. This succeeds most of the time, so I suspect it matters when in the 8 MCLKs the IRQ is actually presented.

But this is all speculation. Nobody gave me the keys to the Verilog.

If what's happening even vaguely resembles this, that suggests that the appropriate workaround is the NOP (after the BIC) that Filip mentioned, since that has no source address in it.

0 Filip Dimitrov over 5 years ago in reply to Bruce McKenney

Prodigy 80 points

Hi all,

Thank you for your effort and posts, I'm still waiting for official response from TI here.

One more point, If we put a NOP after disabling peripheral INT, and compiler optimizations are "on", how can we be sure NOPs will stay in compiled code and not optimized out by the compiler optimizer?
Since "__disable_interrupt" is compiler intrinsic, which contains NOP, compiler "know" not to optimize it's NOP.
But in our case I expect NOPs to be removed now and then.
Not tried/checked in any way, since the fix should be again some official statement from TI.
We can't afford to work with guesses and unofficial try-error tests (which are limited at best).

So as I see it, we also should disable compiler optimizations for this fix to work reliably.
Correct me if I'm wrong.

Cheers,
Filip

0 Chester Gillon over 5 years ago in reply to Filip Dimitrov

Guru 92251 points

Filip Dimitrov said:
But in our case I expect NOPs to be removed now and then.

I tried looking at the latest MSP430 Optimizing C/C++ Compiler v20.2.0.LTS User's Guide but couldn't find a definitive statement the compiler optimizer handling of the __no_operation intrinsic.

On a test with TI MSP430 compiler v20.2.1, with the maximum optimization level of 4 "whole program optimizations" then the following __no_operation still resulted in a NOP in the program:

uint16_t TickGet16()
{
    TA0CCTL0 &= ~CCIE;
    __no_operation ();

    uint16_t nRet = (uint16_t) globals.nTVal;

    TA0CCTL0 |= CCIE;

    return nRet;
}

As shown by the following in the debugger:

 61         TA0CCTL0 &= ~CCIE;
        $C$L5:
010082:   F0B2 FFEF 0342      AND.W   #0xffef,&Timer0_A3_TA0CCTL0
 62         __no_operation ();
010088:   4303                NOP     
 64         uint16_t nRet = (uint16_t) globals.nTVal;
01008a:   421F 20FA           MOV.W   &0x20fa,R15
01008e:   421D 20FC           MOV.W   &0x20fc,R13
010092:   421D 20FE           MOV.W   &0x20fe,R13
010096:   421D 2100           MOV.W   &0x2100,R13
 66         TA0CCTL0 |= CCIE;
01009a:   D0B2 0010 0342      BIS.W   #0x0010,&Timer0_A3_TA0CCTL0
123             if(nTmrNow - globals.nBlinkTimer > 250)
0100a0:   0FCD                MOVA    R15,R13
0100a2:   8E0D                SUB.W   R14,R13
0100a4:   903D 00FB           CMP.W   #0x00fb,R13
0100a8:   2BEC                JLO     (0x0082)
125                 LED_OPERATIONS_PORT ^= LED_OPERATIONS_MASK;
0100aa:   E3D2 0202           XOR.B   #1,&Port_A_PAOUT
126                 globals.nBlinkTimer = nTmrNow;
0100ae:   0FCE                MOVA    R15,R14
0100b0:   4E82 1C00           MOV.W   R14,&globals
0100b4:   3FE6                JMP     (0x0082)

In the above example, the addition of the __no_operation in the source was sufficient to stop the crash.

I will notify one of the TI compiler experts to see if can get an official statement about the potential for the compiler optimizer to remove __no_operation intrinsics specified in the source code.

0 Bruce McKenney over 5 years ago in reply to Chester Gillon

Guru 101915 points

I usually use 'asm volatile (" nop ");' since I forget how to spell __no_operation(). I suppose that since it accepts "volatile" here it treats it appropriately, though now that I look this isn't mentioned in SLAU132U.

0 Bruce McKenney over 5 years ago in reply to Filip Dimitrov

Guru 101915 points

I agree that diagnosis won't fix anything, but that, along with Chester's purpose-built test case, is a way to get TI's attention.

I share with TI a certain skepticism when hearing about a previously unknown silicon bug, having seen so many such reports that turned out to be just incorrect coding/design. But I have seen a few real ones go by (maybe 3 in 12 years), and this one looks pretty suspicious.

0 David Schultz over 5 years ago in reply to Bruce McKenney

Guru 25485 points

This behaviour of volatile is documented in gcc. (6.47.2.1) But, it states that all asm statements are intrinscly volatile so it does nothing. It does have an effect in extended asm.

0 Bruce McKenney over 5 years ago in reply to David Schultz

Guru 101915 points

I dug around a little, and it appears IAR (origin of __no_operation) also supports "asm volatile". In a curious twist, that makes the "asm" version More portable than __no_operation.

Meantime, I hope someone from TI shows up soon.

0 Chester Gillon over 5 years ago in reply to Bruce McKenney

Guru 92251 points

Bruce McKenney47378 said:
At around 0x2008 (starting at 0x214c) the target falls into the "normal" code (0x10490) and becomes difficult to catch.

The default is that the unused parts of the FRAM or FLASH will be in the erased state, which over 32-bits decodes as the following instruction:

FFFF FFFF           AND.B   @R15+,0xffff(R15)

The Indirect Autoincrement addressing mode of the above instruction means that after "phantom branch target" is that of erased FRAM or FLASH, then as the PC runs though the erased FRAM or FLASH R15 is incremented and addresses are written to. If attempts to write to a peripheral which is protected with a password then will trigger a reset. This explains why the test can get reset due to PMMPW PMM password violation (PUC)".

Changing the linker command file to specify a fill of 0x3fff (same value as read from vacant memory) which is "jmp $" means that if the "phantom branch target" is unused FRAM or FLASH the PC stops incrementing which stops the failure from causing a reset. E.g. in the MEMORY regions:

    FRAM                    : origin = 0x4000, length = 0xBF80, fill = 0x3FFF
    FRAM2                   : origin = 0x10000,length = 0x34000, fill = 0x3FFF

However, I still haven't determined the exact point at which the "phantom branch" occurs.

0 Bruce McKenney over 5 years ago in reply to Chester Gillon

Guru 101915 points

Yes, I was using the @R15+ to trigger the MPU, since R15 (most of the time) contains *WDTCTL=0x5Axx. At 0x2008, it jumped instead into "abort()"; that was caught by accident. I was pretty sure I wouldn't be able to reliably catch it if it branched to some random place in the library.

I don't have my materials here, but as I recall R15 either contains (*WDTCTL) or one of the words of the time stamp. At the failure, R15 contained the former, so I concluded the first MOV to R15 hadn't happened. The preceding MOV to R12 was ambiguous, since R12 contains the first word of the time stamp pretty much all the time.

0 Chester Gillon over 5 years ago in reply to Bruce McKenney

Guru 92251 points

Bruce McKenney47378 said:
At the failure, R15 contained the former, so I concluded the first MOV to R15 hadn't happened. The preceding MOV to R12 was ambiguous, since R12 contains the first word of the time stamp pretty much all the time.

I have adjusted my example to leave more "clues" about the point of failure:

Make the linker command file fill FRAM with 0x3fff (jmp $) to halt when a phantom branch into unused space occurs.
Initialise the 64-bit counter to the value 0xFEEDABBADEADBEEF to help identify how far (if any) the MOV to R15 has happened.
Add some assembler statements to the start of TickGet16 to set R12 and R15

The assembler for the TickGet16 function is:

 61         asm volatile (" MOV.W &TA0CCTL0, R12");
        TickGet16():
01012e:   421C 0342           MOV.W   &TA0_TA0CCTL0,R12
 62         asm volatile (" MOV.W &TA0CCTL0, R15");
010132:   421F 0342           MOV.W   &TA0_TA0CCTL0,R15
 64         TA0CCTL0 &= ~CCIE;
010136:   F0B2 FFEF 0342      AND.W   #0xffef,&TA0_TA0CCTL0
 66         uint16_t nRet = (uint16_t) globals.nTVal;
01013c:   421C 20FA           MOV.W   &0x20fa,R12
010140:   421F 20FC           MOV.W   &0x20fc,R15
010144:   421F 20FE           MOV.W   &0x20fe,R15
010148:   421F 2100           MOV.W   &0x2100,R15
 68         TA0CCTL0 |= CCIE;
01014c:   D0B2 0010 0342      BIS.W   #0x0010,&TA0_TA0CCTL0
 71     }
010152:   0110                RETA

The following breakpoint is used to detect when the "phantom branch" failure has jumped to the unused space filled with 0x3fff:

At the point of failure on the MSP430FR5994 the global variables were:

globals.nTVal	unsigned long long	0xFEEDABBADEADC2DF (Hex)	0x0020FA	
globals.nBlinkTimer	unsigned int	0xC2DB (Hex)	0x001C00

The core registers were:

PC	0x01033A	Core	
SP	0x002BF4	Core	
SR	0x0009	Core	
R3	0x000000	Core	
R4	0x00FFFF	Core	
R5	0x00FFFF	Core	
R6	0x00FFFF	Core	
R7	0x00A55A	Core	
R8	0x00FFFF	Core	
R9	0x000112	Core	
R10	0x001AF8	Core	
R11	0x00FFFF	Core	
R12	0x00C2DF	Core	
R13	0x002102	Core	
R14	0x000000	Core	
R15	0x000010	Core

The TA0CCTL0 was 0x0001 (CCIE = 0, CCIFG = 1).

Annotating the sampled register values onto the instructions for TickGet16:

010132:   421F 0342           MOV.W   &TA0_TA0CCTL0,R15 /* R15 has 0x0010 which is the TA0CCTL0 value with CCIE = 1 and CCIFG = 0 */
 64         TA0CCTL0 &= ~CCIE;
010136:   F0B2 FFEF 0342      AND.W   #0xffef,&TA0_TA0CCTL0 /* TA0CCTL0 has CCIE = 0 so this has been executed */
 66         uint16_t nRet = (uint16_t) globals.nTVal;
01013c:   421C 20FA           MOV.W   &0x20fa,R12 /* R12 has 0xC2DF which matches the least significant word in globals.nTVal so this has been executed */
010140:   421F 20FC           MOV.W   &0x20fc,R15 /* R15 doesn't have the value at this address (0xDEAD) so hasn't been executed */

Given that R15 sampled A0CCTL0 with CCIE = 1 and CCIFG = 0 immediately prior to the instruction which clears CCIE and after the crash TA0CCTL0 had CCIE = 0, CCIFG = 1 this backs up the theory the crash occurs when CCIE is cleared around the time a timer interrupt becomes active.

The updated example is attached 3175.MSP430_crash_on_enable_disable_timer_int.zip

0 George Mock over 5 years ago in reply to Filip Dimitrov

TI__Guru**** 252560 points

Filip Dimitrov said:
If we put a NOP after disabling peripheral INT, and compiler optimizations are "on", how can we be sure NOPs will stay in compiled code and not optimized out by the compiler optimizer?

Suppose you write this sequence of intrinsics ...

__disable_interrupts();
__no_operation();

The TI MSP430 compiler will always emit the corresponding instructions, in that order. This is true without regard to the optimization level.

Thanks and regards,

-George

0 Filip Dimitrov over 5 years ago in reply to George Mock

Prodigy 80 points

Hello George,

Thank you for your response.

We are interesting in disabling specific peripheral interrupt and on next ASM line to have NOP...something like this :

asm(" BIC.W #xxx, &yyy");
asm(" NOP");

asm(" BIC.W #xxx, &yyy");
__no_operation();

or any combinations of these.

Which variant is more "robust"...i.e. not likely to reorder or optimized out by compiler optimizer?

Regards,
Filip

0 George Mock over 5 years ago in reply to Filip Dimitrov

TI__Guru**** 252560 points

Always use asm() statements as a last resort.

Thanks and regards,

-George

0 Filip Dimitrov over 5 years ago in reply to George Mock

Prodigy 80 points

Hi all,

Still waiting for official clarification from TI about questions in my first post (hope not all guys from TI are caught by corona virus :) :

- What is the official solution for this problem?
- Are other peripherals (except timers) suffer from it ? What is their solution ?
- Which CPUs/revisions are affected ?
- Is there any other place except errata document & forums, where such problems are described ? Where ?

Regards,
Filip

0 Gary Gao over 5 years ago in reply to Filip Dimitrov

TI__Mastermind 40890 points

Hi Filip

Have you try disable the CCS optimization ? Or using the IAR ?

Best regards

Gary

0 Filip Dimitrov over 5 years ago in reply to Gary Gao

Prodigy 80 points

Hello Gary,

All optimizations are disabled, and as I noted, I've check generated assembly code, which seems ok.
So this is not a compiler problem.

Regards,
Filip

0 Gary Gao over 5 years ago in reply to Filip Dimitrov

TI__Mastermind 40890 points

Hi Filip

I have create a code example that based on your situation. But there is no problem with it.

Fullscreen msp430fr599x_ta0_01.c Download

/* --COPYRIGHT--,BSD_EX
 * Copyright (c) 2015, Texas Instruments Incorporated
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 *
 * *  Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 *
 * *  Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * *  Neither the name of Texas Instruments Incorporated nor the names of
 *    its contributors may be used to endorse or promote products derived
 *    from this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
 * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
 * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
 * EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
 * PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
 * OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
 * WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR
 * OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
 * EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 *******************************************************************************
 *
 *                       MSP430 CODE EXAMPLE DISCLAIMER
 *
 * MSP430 code examples are self-contained low-level programs that typically
 * demonstrate a single peripheral function or device feature in a highly
 * concise manner. For this the code may rely on the device's power-on default
 * register values and settings such as the clock configuration and care must
 * be taken when combining code from several examples to avoid potential side
 * effects. Also see www.ti.com/grace for a GUI- and www.ti.com/msp430ware
 * for an API functional library-approach to peripheral configuration.
 *
 * --/COPYRIGHT--*/
//******************************************************************************
//  MSP430FR5x9x Demo - Timer0_A3, Toggle P1.0, CCR0 Cont Mode ISR, DCO SMCLK
//
//  Description: Toggle P1.0 using software and TA_0 ISR. Timer0_A is
//  configured for continuous mode, thus the timer overflows when TAR counts
//  to CCR0. In this example, CCR0 is loaded with 50000.
//  ACLK = n/a, MCLK = SMCLK = TACLK = default DCO = ~1MHz
//
//           MSP430FR5994
//         ---------------
//     /|\|               |
//      | |               |
//      --|RST            |
//        |               |
//        |           P1.0|-->LED
//
//   William Goh
//   Texas Instruments Inc.
//   October 2015
//   Built with IAR Embedded Workbench V6.30 & Code Composer Studio V6.1
//******************************************************************************
#include <msp430.h>

unsigned long long TickGet16(void);


#pragma LOCATION(g_nTVal, 0x214c)               // This is important for reproducing problem
volatile unsigned long long g_nTVal,g_nBlinkTimer;                        // Different sizes, causes different hangs effects (seems UINT16 works...at least for 30 min)

int main(void)
{
    WDTCTL = WDTPW | WDTHOLD;               // Stop WDT

    // Configure GPIO
    P1DIR |= BIT0;
    P1OUT |= BIT0;
    g_nBlinkTimer=0;
    // Disable the GPIO power-on default high-impedance mode to activate
    // previously configured port settings
    PM5CTL0 &= ~LOCKLPM5;
    TA2EX0 = TAIDEX_7;      // TAIDEX = /8
    TA2CCTL0 = CCIE;                        // TACCR0 interrupt enabled
    TA2CCR0 = 50;
    TA2CTL = TASSEL__SMCLK | ID__8 | MC__UP | TACLR; // SMCLK, continuous mode

    __bis_SR_register( GIE);     // Enter LPM0 w/ interrupt
    while(1)
    {
        unsigned long long nTmrNow = TickGet16();

        // Blink on 250 timer ticks
        if(nTmrNow - g_nBlinkTimer > 250)
        {
            P1OUT ^= BIT0;
            g_nBlinkTimer = nTmrNow;
        }

    }

}

// Timer0_A0 interrupt service routine
#if defined(__TI_COMPILER_VERSION__) || defined(__IAR_SYSTEMS_ICC__)
#pragma vector = TIMER2_A0_VECTOR
__interrupt void Timer0_A0_ISR (void)
#elif defined(__GNUC__)
void __attribute__ ((interrupt(TIMER0_A0_VECTOR))) Timer0_A0_ISR (void)
#else
#error Compiler not supported!
#endif
{
    g_nTVal++;
}


unsigned long long TickGet16()
{
//    __NOP();                              // Adding stuff here seems to fix the problem.
//    __NOP();                              // Adding more stuff makes problem appear again (obviously timer tick syncs with CPU execution, they run on same clocks, this is expected)

 //   CLR_BIT_ATOMIC_W(TA2CCTL0, CCIE);       // Without disabling interrupts everything seems to work !!!
    TA2CCTL0 &= ~CCIE;                        // TACCR0 interrupt enabled
//  TA2CCTL0 &= ~CCIE;                      // Same as above, produces same effects

//  __NOP();                                // Uncomment this and problem goes away for any other arrangements

    unsigned long long nRet = g_nTVal;

 //   SET_BIT_ATOMIC_W(TA2CCTL0, CCIE);
    TA2CCTL0 = CCIE;                        // TACCR0 interrupt enabled
    return nRet;
}

Could you help to make some change to reproduce the issue that you faced?

Best regards

Gary

0 Chester Gillon over 5 years ago in reply to Gary Gao

Guru 92251 points

Gary Gao said:
Could you help to make some change to reproduce the issue that you faced?

I have previously created an example, based upon that from Filip, which also fails. See the project attached to https://e2e.ti.com/support/microcontrollers/msp430/f/166/p/895839/3313155#3313155. Are you able to repeat the failure with the code in the referenced post?

Also, as noted in that referenced post I got the failure on several different MSP430FR devices.

0 Gary Gao over 5 years ago in reply to Chester Gillon

TI__Mastermind 40890 points

Hi Chester

I can't import your code due to I don't have the compiler MSP430v20.2. What's the CCS version do you use?

I use the CCS 9.3 with compiler version TI v18.12.4.LTS. And I can't find the MSP430v20.2 in online.

Could you make few change on my code that is very simple to reproduce this issue?

Best regards

Gary

0 Chester Gillon over 5 years ago in reply to Gary Gao

Guru 92251 points

Gary Gao said:
What's the CCS version do you use?

That project was created using CCS 10.0, which installed MSP430 compiler v20.2.

I have updated the example project to use compiler TI v18.12.4.LTS, which I used in CCS 9.3

8105.MSP430_crash_on_enable_disable_timer_int.zip

0 Gary Gao over 5 years ago in reply to Chester Gillon

TI__Mastermind 40890 points

I just make few change that make the code more simple but the no issue happen.

Fullscreen MSP430_crash_on_enable_disable_timer_int.c Download

/*
 * This program has been created to demonstrate a crash on a MSP430 device when a timer interrupt
 * is enabled / disabled while global interrupts remain enabled.
 *
 * It is a simplified version of the original version in https://e2e.ti.com/support/microcontrollers/msp430/f/166/t/895839
 * in which the following changes have been made:
 * a. Leave the CPU clock at the device default at reset, which is nominally 1 MHz
 * b. Only use a single output pin which should be toggled to indicate the program is still running.
 * c. Leave the watchdog disabled, and add a trap of reading the SYSRSTIV at start-up and halt if has got
 *    a reset source which appears to be following a crash.
 * d. Change from using TA2 to TA0 to be able to run on smaller devices.
 */

#include <msp430.h>
#include <stdint.h>
#include <stdbool.h>

/* The pin toggled to indicate the program is still running */
#define LED_OPERATIONS_PORT                 P1OUT
#define LED_OPERATIONS_PORT_DIR             P1DIR
#ifdef __MSP430FR2633__
// CAPTIVATE-FR2633 LED1
#define LED_OPERATIONS_MASK                 BIT7
#else
#define LED_OPERATIONS_MASK                 BIT0
#endif

/* Structure where the padding field allows the address of the nTVal variable incremented by the TIMER2_A0_ISR
 * to be adjusted, instead of the #pragma LOCATION used in the original example.
 * This still allows the linker to check the RAM allocation when attempting to port the example to other devices.
 */

uint16_t g_nTVal,g_nBlinkTimer;

void Tick_Init()
{
    TA0CTL = 0;         // Disable clock

//    globals.nTVal = 0;        // Clear current value

    // Configure all registers before start
    TA0CCTL0 = TA0CCTL1 = 0;
    TA0R = 0;
    TA0EX0 = TAIDEX_7;      // TAIDEX = /8

    TA0CCR0 = 50;      // for tests

    TA0CTL = TASSEL__SMCLK | ID__8 | MC__UP | TACLR; // | TAIE;         // TASSEL = SMCLK, ID = /8, MC = UP & reset, clear, int enable, intf clear - will start

    // The TAxCCR0 CCIFG interrupt flag is set when the timer counts to the TAxCCR0 value
    // Enable interrupt & clear flags
    TA0CCTL0 = CCIE;
}

uint16_t TickGet16()
{
    TA0CCTL0 &= ~CCIE;

  //  uint16_t nRet = (uint16_t) globals.nTVal;
    uint16_t nRet = g_nTVal;

    TA0CCTL0 |= CCIE;

    return nRet;
}

#pragma vector=TIMER0_A0_VECTOR
__interrupt void TIMER0_A0_ISR()
{
 //   globals.nTVal++;
    g_nTVal++;
}

int main(void)
{
    volatile uint16_t reset_cause;
    volatile bool halt_on_unexpected_reset_cause = true;

    WDTCTL = WDTPW | WDTHOLD;   // stop watchdog timer
/*
    do
    {
        reset_cause = SYSRSTIV;
        switch (reset_cause)
        {
        case SYSRSTIV_NONE:
        case SYSRSTIV_BOR:
        case SYSRSTIV_RSTNMI:
           // Continue with the program, as probably a power up
            break;

        default:
            // Unexpected reset, halt to allow inspection in the debugger
            while (halt_on_unexpected_reset_cause)
            {
            }
            break;
        }
    } while (reset_cause != 0);
*/
    // Enable I/O if currently locked.
    // The conditional test allows operation on other than MSP430FR series devices.
    if ((PM5CTL0 & LOCKLPM5) != 0)
    {
        PM5CTL0 &= ~LOCKLPM5;
    }

    LED_OPERATIONS_PORT &= ~LED_OPERATIONS_MASK;                     // to off
    LED_OPERATIONS_PORT_DIR |= LED_OPERATIONS_MASK;                 // to output direction

    Tick_Init();
    __enable_interrupt ();

    // Main loop
    for (;;)
    {
        uint16_t nTmrNow = TickGet16();

        // Blink on 250 timer ticks
        if(nTmrNow - g_nBlinkTimer > 250)
        {
            LED_OPERATIONS_PORT ^= LED_OPERATIONS_MASK;
            g_nBlinkTimer = nTmrNow;
        }
    }
}

The reason to make this change is that I think we just use a global variable will make this more simple.

0 Chester Gillon over 5 years ago in reply to Gary Gao

Guru 92251 points

Gary Gao said:
The reason to make this change is that I think we just use a global variable will make this more simple.

I agree that making that simplification makes the failure "go away".

As Filip noted the failure seems sensitive to the location of variables / the code in memory.

E.g. in the example attached to https://e2e.ti.com/support/microcontrollers/msp430/f/166/p/895839/3340896#3340896 the following global structure was used:

struct
{
    uint16_t nBlinkTimer;
    uint8_t padding[1272];
    volatile uint64_t nTVal;
} globals;

If the padding field is commented out, which has the effect of changing the address of the nTVal field in RAM, then the failure no longer occurs.

I think what Filip is looking for is an explanation for why the failure is sensitive to the the location of variables / the code in memory, to be able to ensure that the failure can be avoided in a deterministic way.

Is it possible that the failure case has exposed some previously unknown device errata?

0 Bruce McKenney over 5 years ago in reply to Gary Gao

Guru 101915 points

I was able to demonstrate -- even tinker with -- this symptom using CCS v8.3 (compiler v18.1.5.LTS), using either Filip's original code or Chester's instrumented code. (I couldn't import Chester's project due to that CCS bug, but a copy/paste of the source sufficed.) This was using a Rev C "M"SP430FR5994 on a Launchpad.

To echo Chester's point: This was ordinary code written (as near as I can tell) to the User Guide specification, which uses a condoned feature [Ref UG (SLAU367O) Sec 25.2.1 "Note:"] and it failed. A confounding effect was that the symptom was arcane, albeit catastrophic. This could happen to anyone.

My reconstruction is that it requires two things to happen within a single instruction: (a) TAx triggers an enabled EQU0 and (b) the CPU issues a RMW operation to TAx to disable that interrupt. When this happens one thing (IRQ#?) is expected on the bus, but something different (source operand from the next instruction?) arrives, and the CPU branches off into nowhere. It probably matters (1) what order these things happen and (2) [maybe] which specific sub-instruction clock these events take place.

A race? So it seems. Improbable? Sure. But I've diagnosed enough obscure races to know that if you encounter one enough times you will eventually lose.

0 Gary Gao over 5 years ago in reply to Chester Gillon

TI__Mastermind 40890 points

Hi Chester

I have add the struct that you mentioned before and still no issue happen.

struct
{
uint16_t nBlinkTimer;
uint8_t padding[1272];
volatile uint16_t nTVal;
} globals;

The reason I change the "volatile uint64_t nTVal;" to "volatile uint16_t nTVal;" is that we don' t need to do the conversion uint64_t to uint16_t in TickGet16().

This is my test code

Fullscreen 0513.MSP430_crash_on_enable_disable_timer_int.c Download

/*
 * This program has been created to demonstrate a crash on a MSP430 device when a timer interrupt
 * is enabled / disabled while global interrupts remain enabled.
 *
 * It is a simplified version of the original version in https://e2e.ti.com/support/microcontrollers/msp430/f/166/t/895839
 * in which the following changes have been made:
 * a. Leave the CPU clock at the device default at reset, which is nominally 1 MHz
 * b. Only use a single output pin which should be toggled to indicate the program is still running.
 * c. Leave the watchdog disabled, and add a trap of reading the SYSRSTIV at start-up and halt if has got
 *    a reset source which appears to be following a crash.
 * d. Change from using TA2 to TA0 to be able to run on smaller devices.
 */

#include <msp430.h>
#include <stdint.h>
#include <stdbool.h>

/* The pin toggled to indicate the program is still running */
#define LED_OPERATIONS_PORT                 P1OUT
#define LED_OPERATIONS_PORT_DIR             P1DIR
#ifdef __MSP430FR2633__
// CAPTIVATE-FR2633 LED1
#define LED_OPERATIONS_MASK                 BIT7
#else
#define LED_OPERATIONS_MASK                 BIT0
#endif

/* Structure where the padding field allows the address of the nTVal variable incremented by the TIMER2_A0_ISR
 * to be adjusted, instead of the #pragma LOCATION used in the original example.
 * This still allows the linker to check the RAM allocation when attempting to port the example to other devices.
 */
uint16_t test_;
struct
{
    uint16_t nBlinkTimer;
    uint8_t padding[1272];
    volatile uint16_t nTVal;
} globals;


void Tick_Init()
{
    TA0CTL = 0;         // Disable clock

    globals.nTVal = 0;        // Clear current value

    // Configure all registers before start
    TA0CCTL0 = TA0CCTL1 = 0;
    TA0R = 0;
    TA0EX0 = TAIDEX_7;      // TAIDEX = /8

    TA0CCR0 = 50;      // for tests

    TA0CTL = TASSEL__SMCLK | ID__8 | MC__UP | TACLR; // | TAIE;         // TASSEL = SMCLK, ID = /8, MC = UP & reset, clear, int enable, intf clear - will start

    // The TAxCCR0 CCIFG interrupt flag is set when the timer counts to the TAxCCR0 value
    // Enable interrupt & clear flags
    TA0CCTL0 = CCIE;
}

uint16_t TickGet16()
{
    TA0CCTL0 &= ~CCIE;

 //   uint16_t nRet = (uint16_t) globals.nTVal;
    uint16_t nRet = globals.nTVal;
    test_=nRet;

    TA0CCTL0 |= CCIE;

    return nRet;
}

#pragma vector=TIMER0_A0_VECTOR
__interrupt void TIMER0_A0_ISR()
{
    globals.nTVal++;
//    g_nTVal++;
}


int main(void)
{
    volatile uint16_t reset_cause;
    volatile bool halt_on_unexpected_reset_cause = true;

    WDTCTL = WDTPW | WDTHOLD;   // stop watchdog timer

/*
    do
    {
        reset_cause = SYSRSTIV;
        switch (reset_cause)
        {
        case SYSRSTIV_NONE:
        case SYSRSTIV_BOR:
        case SYSRSTIV_RSTNMI:
           // Continue with the program, as probably a power up
            break;

        default:
            // Unexpected reset, halt to allow inspection in the debugger
            while (halt_on_unexpected_reset_cause)
            {
            }
            break;
        }
    } while (reset_cause != 0);
*/
    // Enable I/O if currently locked.
    // The conditional test allows operation on other than MSP430FR series devices.
    if ((PM5CTL0 & LOCKLPM5) != 0)
    {
        PM5CTL0 &= ~LOCKLPM5;
    }

    LED_OPERATIONS_PORT &= ~LED_OPERATIONS_MASK;                     // to off
    LED_OPERATIONS_PORT_DIR |= LED_OPERATIONS_MASK;                 // to output direction

    Tick_Init();
    __enable_interrupt ();

    // Main loop
    for (;;)
    {
        uint16_t nTmrNow = TickGet16();

        // Blink on 250 timer ticks
        if(nTmrNow - globals.nBlinkTimer > 250)
        {
            LED_OPERATIONS_PORT ^= LED_OPERATIONS_MASK;
            globals.nBlinkTimer = nTmrNow;
        }
    }
}

0 Gary Gao over 5 years ago in reply to Bruce McKenney

TI__Mastermind 40890 points

Hi Bruce

Thank you for your advice. I have some doubt with your points, could you help me to make it clear?Thanks

1. I have read the "[Ref UG (SLAU367O) Sec 25.2.1 "Note:"]" and that there are few risk there to disable and enable the CCIE. What the meaning of "A confounding effect was that the symptom was arcane, albeit catastrophic"

2. What's the meaning of " the CPU issues a RMW". And I am not fully understand your point here if you can give more detail description that will be great. Thanks

Best regards

Gary

0 Bruce McKenney over 5 years ago in reply to Gary Gao

Guru 101915 points

1) "TI recommends stopping the timer before modifying its operation (with exception of the interrupt enable, interrupt flag, and TACLR) to avoid errant operating conditions." (emphasis mine). The exception would appear to cover CCIE. Keep in mind that modifying CCIFG with the timer running is more or less required in many cases, so the CCIE inference seems warranted.

The symptom is a random(-ish) branch from code which contains no branches. There's an IRQ in here somewhere, but the branch target is not the contents of the relevant Vector (nor any other Vector, nor in the stack, nor in any register). Chester's research suggests that the branch target's PC offset depends (*2?) on the source address (low byte?) in the instruction following the BIC that clears CCIE.

This code exercises the test case heavily. If such an odd symptom showed up say once per month, it would probably be written off as an electrical glitch or something. (If you're designing a system to run for a year without interruption, these sorts of oddities get your attention.)

2) During the instruction (BIC) of interest, the CPU does a Read-Modify-Write to change TA0CCTL0 in the timer peripheral. This seems to be a significant part of the puzzle. As I mentioned earlier in the thread, I saw a number of cases where execution reached the ISR, but CCIE=0. That suggests that the CCIE clearing and the IRQ both occurred during the BIC instruction. So sometimes this works, and sometimes not -- thus the suggestion that ordering and/or specific sub-cycle matters. (The BIC takes at least 6 clocks.)

Nobody gave me the keys to the Verilog, so I can only report what can be seen from the outside. Someone with a cycle-accurate simulator might be able to say more. Chester's test case rarely runs more than 2 seconds on my Launchpad, so it wouldn't take a lifetime to track down.

0 Bruce McKenney over 5 years ago in reply to Bruce McKenney

Guru 101915 points

Also as an "amicus brief" of sorts, I'll point out this thread. It pertains to the FR57 series, which I think is at least a cousin of the FR59:

https://e2e.ti.com/support/microcontrollers/msp430/f/166/p/722730/2700430

The takeaway there seemed to be that the timer peripheral's operations aren't atomic as observed by the CPU.

0 David Schultz over 5 years ago in reply to Bruce McKenney

Guru 25485 points

It isn't clear to me that the exception on changing timer operation applies to CCIE. The note uses the singular tense so at best it is poorly written and at worst it doesn't apply to CCIE.

Looking at 25-15 I think that with bad luck it would be possible to generate a short pulse to the interrupt logic if CCIE were cleared just after an interrupt was generated. How does the interrupt system deal with a pulsed input? Will it start an interrupt service cycle and then run into trouble when it needs to fetch the ISR address?

0 Bruce McKenney over 5 years ago in reply to David Schultz

Guru 101915 points

1) That was kind of my point in mentioning CCIFG, since the language is a bit murky. A prohibition on modifying CCIFG with the timer running would invalidate maybe 20 years of precedent (every program that ever contained an A1 ISR), so I'm supposing the (literal) singular is intended to apply to both CCIFG and CCIE.

2) That's pretty much what I think is happening.

0 Chester Gillon over 5 years ago in reply to Gary Gao

Guru 92251 points

Gary Gao said:
Or using the IAR ?

I created an IAR project for a MSP430FR5994, using the same source file as with CCS.

The program still fails when built with IAR, in this case the failure causes the PC to jump to invalid address 0x020010 :

IAR for MSP430 Kickstart v7.20.1 was used to build the attached project (the .zip file contains IAR and CCS project files, which both use the same MSP430_crash_on_enable_disable_timer_int.c).

6888.MSP430_crash_on_enable_disable_timer_int.zip

0 Gary Gao over 5 years ago in reply to Chester Gillon

TI__Mastermind 40890 points

Thank you Burce and Chester for your great work and advice there. I will do more test about it when I get the time.

**Attention** This is a public forum

MSP low-power microcontrollers

MSP low-power microcontroller forum

MSP430FR5994: Strange/Undocumented behavior when disabling/enabling individual timer interrupts