MSP432P4111: Possible bug reading TimerA0 TAR when timer runs off 32768Hz oscillator

Julian Kolodko

Part Number: MSP432P4111

G'Day all,

Sorry, for this long complicated email. We've been working on a difficult bug in our system for a while now and think we've finally tracked the root cause. We think its an issue with SYSBIOS but we don't discount the possibility we are abusing SYSBIOS is some way that we don't understand. Given that we think this is a SYSBIOS issue, things get a bit complicated....

For a while now (e.g. see this old forum post of ours, and this more recent post) we've had a problem where SYSBIOS occasionally gets upset and does not properly process time delays (e.g. clock functions set to run some time in the future or Task_sleep()).

The exact manifestation of this varies (our older post illustrates 3 variations using execution graphs) - but at the core what we see is that is that there are extended blocks of time (our testing says its about 2 seconds) where anything that is waiting on SYSBIOS timing doesn't run even though we expect it should. This is usually preceded by a period of time where Swi.ti_sysbios_knl_Clock_workFuncDynamic__E() runs for what seems an excessively long period.

We think we have found the root cause of this and would appreciate if you could confirm our observations using the attached example project that illustrates the issue.

Long story short:

If we configure our project with the "Internally configure a Timer to periodically call Clock_tick()" option selected in the clock module settings, the project is built using TimerA0 with a 32768Hz clock input. In this case, we see the problem occur.
If we configure our project with the "Application code calls Clock_tick()" option selected (and we call Clock_tick() in our own timer interrupt running off TimerA3 with a 1MHz clock input synchronous to MCLK) we never see the problem occur.

Our suspicion is that something goes wrong with the Timer_getCount() function in SYSBIOS when an asynchronous clock is used to drive the underlying timer.

The example project is simply a clock function and a timer HWI running with fairly short intervals toggling pins.

runs on a MSP-EXP432P4111 Launchpad.
Built using CCS9 compiler v18.12.4.LTS (and also CCS10, compiler v20.2.2.LTS)
Built against Simplelink v3.40.1.02, XDCTools 3.60.2.34

Now

Build the project as given (note MCLK = 48MHz). In this case the project is configured with the "Internally configure a Timer to periodically call Clock_tick()" option selected.
freeze all timer modules (CCS -> Scripts menu -> MSP432 Debug Clock Control - > Freeze all timer modules)
Open the file [SIMPLELINK]\kernel\tirtos\packages\ti\sysbios\family\arm\msp432\Timer.c and put a breakpoint at line 252
Start debugging and wait until you hit the breakpoint - this can take a few seconds or a few hours.
A screen shot of debugger output when we hit the breakpoint is attached below.

The code at the breakpoint is intended to deal with timer rollovers, but it's been triggered when the timer is well away from rolling over.

The variable first comes from the Timer_getCount() function call at line 245. The function is inlined by the optimizer so first is in scope at line 252.
The variable now is equal to the output of Timer_getCount() (which returns the value of first) as would be expected.
The register TACCR0 has been set to the value of variable next as expected (per line 185).
The variable previous is equal to the last value of the TACCR0 register.

If previous stores when the last value of the CCR0 register, then by definition now should be larger (unless a roll over has occurred). But when we hit the breakpoint now is smaller than previous even though no rollover has occurred. This seems wrong, and when this happens we get our behavior where SYSBIOS does not run anything waiting on a timer.

If you change our code over so that Clock_tick() is called by our interrupt (see comments at line 173 in hello.c) you'll see that you never hit the breakpoint (unless you happen to catch a rollover) even after many hours of running, so whatever the problem is, its seems specific either to TimerA0 or specific to a TimerA module using an asynchronous clock input.

Cheers

Julian

Example.zip

over 4 years ago

0 Cash Hao11 over 4 years ago

TI__Mastermind 46625 points

Hi Julian,

I do not find any related information in the device errata. However, based on your description, I think it is credible that the issue is caused by the asynchronous clock resource on the timer. Which power mode you used when the timer is running? Could you try always let the code runs at active mode?

Best regards,

Cash Hao

0 Bruce McKenney over 4 years ago

Guru 99415 points

I don't know much about SYSBIOS, but Refer TRM (SLAU356I) Sec 19.2.1 ("NOTE", second paragraph).

It appears that getCount does the recommended fetch-voting based on the "synchronous" flag, set according to a parameter (presumably from the user program) to Timer_Instance_init(). Is that flag set to 0?

[Edit: Just to be clear, that code isn't trying to deal (primarily) with rollovers, rather to deal with overshoots, where -- due presumably to interrupt latency-- the entire next tick has passed by the time you try to update it.]

0 Julian Kolodko over 4 years ago in reply to Bruce McKenney

Expert 2265 points

G'Day Bruce,

>>I don't know much about SYSBIOS, but Refer TRM (SLAU356I) Sec 19.2.1 ("NOTE", second paragraph).

Thank you, yes, I'm aware of the need for voting for async timer clocks. We are not reading TAR in our own code here - that's been done by SYSBIOS. Its interesting to compare how SYSBIOS handles this compared to DriverLib. Differences are highlighted in orange.

SYSBIOS
- [SIMPLELINK]\kernel\tirtos\packages\ti\sysbios\family\arm\msp432\Timer.c -> Timer_getCount()
- If timer clock is synchronous, TAR is read and returned.
- If timer clock is asynchronous, TAR is read twice.
  - If secondReading - firstReading <= 10 counts TAR is considered stable and firstReading is returned to the caller.
  - Otherwise TAR is keeps being read until the two readings are within 10 counts

MSPWARE/DRIVERLIB
- [SIMPLELINK]\source\ti\devices\msp432p4xx\driverlib\timer_a.c -> Timer_A_getCounerValue()
- TAR is read twice regardless of whether the timer clock is asynchronous or not
  - Difference between firstReading and secondReading is calculated as largerReading - smallerReading.
  - If the difference <= 50 counts, TAR is considered stable and secondReading is returned to the caller.
  - Otherwise TAR keeps being read until the two readings are within 50 counts.

I wonder why SYSBIOS returns firstReading, not secondReading. In my mind secondReading is likely to have more correct bits because TAR has had longer to settle.

Also, I don't understand why a 10/50 count band is being allowed here. Why don't these functions keep reading the TAR until they get two identical readings? Is this just to speed things up? I worry allowing a band can be problematic in some circumstances. Consider this example -

The current TAR value (truncate to 8 bits for simplicity) is 0b00001011 and just about to count up.
Timer_getCount() is run as TAR transitions to its new value.
It is possible that the first read of TAR happens when just the two LSBs have transitioned - i.e. firstReading = 0b00001000
Then the second read get the final settled value - i.e. secondReading = 0b00001100
Timer_getCount() calculates secondReading - firstReading = 4
The different is less than 10 so Timer_getCount() returns firstReading (=0b00001000) to the caller
This is a problem because firstReading is less than the TAR value we started with

>> It appears that getCount does the recommended fetch-voting based on the "synchronous" flag, set according to a parameter (presumably from the user program) to Timer_Instance_init(). Is that flag set to 0?

This is all handled automatically by the tools. The tools setup TimerA0 for use by SYSBIOS with a 32768Hz oscillator and TACCTL0.SCS set to 0 (Asynchronous). You can see this passes through correctly to Timer_getCount() by putting a breakpoint at line 252 of [SISBIOS]\.....\Timer.c waiting for the breakpoint to be hit then examining the obj->synchronous flag in the debugger.

>> [Edit: Just to be clear, that code isn't trying to deal (primarily) with rollovers, rather to deal with overshoots, where -- due presumably to interrupt latency-- the entire next tick has passed by the time you try to update it.]

Thank you yes, that's right.

Cheers

Julian

0 Bruce McKenney over 4 years ago in reply to Julian Kolodko

Guru 99415 points

Based on the comment, I think the guard band around the fetched time(s) is intended to accommodate a fast (asynchronous) clock, i.e. close to MCLK, which would be expected to advance while the fetches were executing. (My first thought is that this is a very unusual case -- someone who went to the trouble to design-in a fast crystal would probably use it for MCLK as well. It seems like the guard works against you with a slow timer.) Another observation is that in the case of a (small) mismatch it returns first (closer to the call) rather than second (closer to the return).

Are you fairly certain that your breakpoint indicates a failure case? Your symptom --a two-second stall-- is consistent with missing an overshoot and waiting for the 32kHz timer to wrap around again; in this code that would mean not setting CCIFG when it is needed. But based on your values, it seems that it should have matched case "C".

I agree that this looks somewhat odd. Unfortunately, I can't seem to build your project so all I can do is inspect. I have in the past written code to do something analogous, and there are usually many corner cases.

0 Julian Kolodko over 4 years ago in reply to Bruce McKenney

Expert 2265 points

>> ....is intended to accommodate a fast (asynchronous) clock, i.e. close to MCLK, which would be expected to advance while the fetches were executing....

That makes sense, thank you for the explanation. As you say, if the timer clock is much slower than MCLK, this band check is not really need.

>> Another observation is that in the case of a (small) mismatch it returns first (closer to the call) rather than second (closer to the return).

Yes, and I think that's part of the problem because the first value read is more likely to be a transient state than the second value read (when timer clock is slow compared to MCLK).

I notice that TI hasn't been consistent in which value they return: SYSBIOS (Timer_getCount()) returns the first read value, while the equivalent DriverLib/MSPWare call (Timer_A_getCounerValue()) returns the second read value.

>> Are you fairly certain that your breakpoint indicates a failure case?

Yes. Given how simple our example project is (a HWI and a Clock function, both doing almost no work) it should be impossible to fall into case C with the now value (the value of TAR reported byTimer_getCount()) being one less than the previous CCR0 setting.

In order to fall into case C (i.e. now less than previous), we'd need a two second delay (full rollover of the TAR).
There is nothing in the code that could cause a 2 second delay. The code barely does any work at all.
now is always exactly one less than previous when we hit the breakpoint. I don't see how some external delay (power management?? debugger?? other???) could interact with the code in such a way as to always have now exactly one less than previous.
Although now is less than previous, the TAR register itself is actually greater than previous so there is nothing wrong with the timer count, only with the now value.

So, given the above, our conjecture is that the now value is incorrect. And the reason that its incorrect is because reading of the TAR register to set now (a) allows a band of 10 between reads and (b) returns the first read value (which is more likely to be a transient state of the TAR register) rather than the second when the difference between reads is <10.

>> But based on your values, it seems that it should have matched case "C".

Well spotted! Yes, I've only explained the first bug so far, but we think there is another - when the above bug occurs, it causes the SYBIOS timer interrupt to get into a loop triggering itself.

The chain of events is roughly as follows:

The SYSBIOS TimerA0 CCR0 HWI fires (identified as HWI.ti_sysbios_family_arm_msp432_Timer_periodicStub_E() in execution graphs, function Timer_periodicStub() in Timer.c)
It advances the CCR0 register (line 185 Timer.c)
Timer_setNextTick() runs and finds now is less than previous (due to first bug - now being read incorrectly) causing "Case C" to run.
Case C sets the CCR0 interrupt flag.
The SYSBIOS CCR0 HWI completes, then runs again immediately because of step 4 (this is second bug, HWI triggering itself)
Now we have a loop
- per step 1, Timer A0 CCR0 HWI fires
- per step 2, CCR0 is advanced,
- per step 3 case C is triggered - this time now because of bug 1, but because we've shifted the CCR0 ahead of where it really should be.
- per step 4, CCR0 IFG is set,
- per step 5, the HWI completes then immediately fires again....
Loop continues until we catch our tail
Not quite sure how this translates to a 2 second delay....... but if you refer back to our older forum post I think (though I haven't quite got the complete mechanism down) this loop does explain why Swi.ti_sysbios_knl_Clock_workFuncDynamic_E() runs for so long in some of our execution graphs.

Cheers

Julian

0 Julian Kolodko over 4 years ago in reply to Cash Hao11

Expert 2265 points

G'Day Cash,

I've never worked with the power management stuff. How do I know if a device is always running in active mode?

I do know our main project has .enablePolicy set to false in the power manager stuff (PowerMSP432_ConfigV1). Is that it? Is it possible to see what the power management state is at run time? I don't see anything in ROV.

Cheers

Julian

0 Cash Hao11 over 4 years ago in reply to Julian Kolodko

TI__Mastermind 46625 points

Hi Julian,

You can check the PCMCTL0 register to get the current power state information of your MCU.

Best regards,

Cash Hao

0 Bruce McKenney over 4 years ago in reply to Julian Kolodko

Guru 99415 points

I was kind of hoping a TI-RTOS wizard would show up, but if I were stuck on this, I would do the two indicated experiments:

1) Set the guard band (READ_LATENCY) to 0, which is what it really should be in your configuration.

2) Change getCount to return second rather than first, which in theory is benign.

It sounds like your experiments are fairly expensive, so you may want to do both, then work backwards if it fixes the symptom.

0 Julian Kolodko over 4 years ago in reply to Bruce McKenney

Expert 2265 points

G'Day Bruce,

Thank your for the time you have put into this for us.

We have a bit of a phobia of modifying SYSBIOS as a solution to issues we are having, both for fear of unforseen consequences and the difficulty of maintaining our modifications as SYSBIOS is updated over time.

So we've gone with a different work around. Rather than letting the tools setup a timer and use that to call Clock_tick() on our behalf, we call it in ourselves in a mSec timer interrupt that was already in our code base. This seems to keep SYSBIOS out of the problematic code path. We are still testing, but current indications are that this is a solution for us.

Is there anything I can do to help you get the project I attached running? It would be good to get independent verification that what we think we are seeing is real. With a bit of luck, that might help spark up some interest from the wizards.

Cheers

Julian

0 Bruce McKenney over 4 years ago in reply to Julian Kolodko

Guru 99415 points

I installed CCSv9 and was able to build your project.

I was able to generate your symptom; it took maybe a minute or so. In each of maybe 4-5 occurrences, the indication was that the difference between first and second was (always) 3. This would seem to point to a (long!) carefully-timed pre-emption.

I tried experiment (2) above. and it took maybe 1-2 hours (I wasn't watching closely) to fail. At the breakpoint, CCS seemed to be telling me that second was less than first. (I suspect an illusion, but I haven't found it.) In any case, second was less than (old) CCR0. This seems to point away from a pre-emption as a trigger. It also says that this would not be an effective fix.

I also tried a trivial experiment -- just running a timer from ACLK (i.e. async) and seeing if TAR read at a CCIFG (ever) differs from CCR0. That hasn't failed in 30 minutes or so. This seems to point away from a clock-boundary glitch (reading stale value from timer peripheral).

I have to get up for work in the morning, so I probably won't get further before tomorrow. I suspect there are some clues here, but I don't yet know what they mean.

0 Julian Kolodko over 4 years ago in reply to Bruce McKenney

Expert 2265 points

G'Day Bruce,

Thank you for your effort on this. There is some comfort in receiving confirmation that this is a real, reproducible problem.

Given I have a work around, I'm not in a hurry for a solution, but I would really appreciate if you could investigate further when you have the time and keep me up to date with what you find.

We've been working on this bug on and off for two years. It would be great to definitively squash it!

Cheers

Julian

0 Jace H over 4 years ago in reply to Julian Kolodko

TI__Guru 56685 points

All,

Thank you for the effort of the investigation throughout this thread. We will summarize this thread and feed it back to our internal bug tracking so this can potentially be addressed in a future release.

In the mean time, I'm glad you were able to find a workaround for the issue. At this time, I'm going to mark the reply where you discuss your workaround as "TI Thinks Resolved" for those who stumble across this thread and are in need of the workaround as well.

0 Julian Kolodko over 4 years ago in reply to Jace H

Expert 2265 points

G'Day Jace,

Thank you for your feedback.

We do have a work around for our particular situation where a cascade of at least two bugs messes up SYSBIOS timing for things like Task_sleep(), clock functions, semaphore timeouts etc. [This sentence added to help with keyword searches!]

I'm glad to hear that you've added this to your bug tracking because I worry the primary bug might have broader impact than just our somewhat esoteric SYSBIOS timing problem - its a bug in a commonly used SYSBIOS & DriverLib API across processors.

The primary bug looks to be that both the SYBIOS Timer_getCount() function and the DriverLib Timer_A_getCounerValue() functions can't necessarily be trusted for reading timers with asynchronous clocks, and the timer clock is very different to MCLK (e.g. the default setup for SYSBIOS seems to be MCLK = 48MHz, SYSBIOS timer peripheral clock = 32768Hz)

If my guess for the root cause about this is correct, then I think the SYSBIOS function will have an error of up to 10 counts while the DriverLib function an error of up to 50 (but with a lower probability [though non-zero probability given Bruce's comments above] of an erroneous value being reported), and in both cases, the reported timer count can be LESS than the real value (which is where SYSBIOS's use of Timer_getCount() comes unstuck)

I haven't checked the implementation of these functions for other processors, but I imagine they use the same voting process (which requires two readings of the timer register to be within a tolerance). If they do, then this same problem may well exist for more than just the MSP432.

Cheers

Julian

0 Bruce McKenney over 4 years ago in reply to Jace H

Guru 99415 points

I'll just throw in my subsequent (inconclusive) observations:

1) I tried experiment (1) -- setting the tolerance=0 -- and it ran without fail for 6.5 hours (then I had to use the equipment for something else). This isn't too surprising since it would have caught the particular case where TAR seems to "jump" from ccr0-1 to ccr0+2.

2) My scope showed the GPIO toggle-burst right on schedule up to the breakpoint. This suggests that the timer is not actually counting all the way around to ccr0-1 (2 seconds), though that's how the code interprets it.

3) Mis-matches (second != first) happen moderately often. I put in a little histogram and (a) mismatches were counted maybe 5 times per second over a few minutes and (b) maybe 1% of these show a difference > 1 (up to at least 10). The code of interest is running IRQ-enabled, so this wouldn't be startling except in how often it happens -- right between 2 consecutive instructions. In the failure case the RTOS trace didn't show any "surprise" interrupts.

4) The overt symptom isn't quite a "stall", rather the code sets CCIFG repeatedly to cause the RTOS code to step very fast (maybe a half-second) around the timer cycle to get to (original) ccr0-1 (as Julian described above). The scope shows non-stop GPIO toggle-bursts from pin_toggle during that time.

I focused on Timer.c. I'm not familiar enough with TI-RTOS to speculate what other parts of the system might be involved.

0 Julian Kolodko over 4 years ago in reply to Bruce McKenney

Expert 2265 points

Thank you for this additional testing Bruce!

It's a great independent analysis and confirmation of the chain of issues I'm experiencing.

Based on your testing I think the right solution for the first bug is to set the tolerance to 0 in Timer_getCount() / Timer_A_getCounerValue() [will work for timer peripherals with asynchronous clock, much slower than MCLK].

The work around I'm using (not letting SYSBIOS use a timer with async clock) is a bit heavy handed, and doesn't solve the issue for people using the Timer_getCount() / Timer_A_getCounerValue() in their own code.

I *think* once the first bug is fixed (so that the count read from the timer can longer longer be less than the actual register value), the second bug can't be triggered.

I guess now we wait to see if the issue is fixed in the next release.

Thanks again Bruce.

Cheers

Julian

Arm-based microcontrollers

Arm-based microcontrollers forum

MSP432P4111: Possible bug reading TimerA0 TAR when timer runs off 32768Hz oscillator