MSP432P4111: SYSBIOS Clock Function SWI appears to be blocking all tasks from running

Julian Kolodko

Part Number: MSP432P4111
Other Parts Discussed in Thread: SYSBIOS

Hi,

I have a problem with SYSBIOS (SimpleLink 3.2) Clock Functions that I hope you can help me debug.

The discussion here is with respect to this Execution Graph, which is typical of our problem. We seem to randomly reach a point where the Clock Function SWI blocks all tasks from running.

On the left of the capture our code is working correctly - a mixture of HWIs, SWIs (including the clock function SWI), Tasks, and a couple of zero latency interrupts (not visible here), are running.

In our system, the clock function SWI runs as the lowest priority SWI, with a tick period of 310uSec [well, given the underlying 32768Hz clock, its actually around 305uSec], unnecessary ticks suppressed. This is a fast tick, but with unnecessary ticks suppressed it doesn't seem to place an unnecessary load on the processor. All our clock functions are dynamically created (i.e. in the code, not in the GUI).

What we are finding is that at some random point in time, the Clock Function SWI starts running all the time, preventing tasks from running, though higher priority SWIs and HWIs continue on. We have not been able to find a reliable trigger - but happens seconds to many minutes after startup.

This transition from normal operation to blocked-by-clock-SWI operation is visible in the execution graph where the cyan traces start becoming very long.

Also notice all the little green flags in the Semaphore row of the execution graph. Each of those flags indicates that one particular Semaphore has been posted. That semaphore is only posted by a clock function (the clock function is a single line of code that posts the semaphore) with a 2mSec period. Here we the semaphore being posted about every 50uSec. This is real - see the CRO trace below.

What this feels like is that all of a sudden the tick for the Clock Function SWI has been reduced from 310uSec to something very small so it monopolizes the processor but I don't see how that can happen.

The Clock Function uses TimerA0 which is configured to run off ACLK (which runs off LFXT). I've output ACLK to a pin and it looks fine (32.768kHz). Settings for TimerA0 look right.
There is nothing to indicate there have been any task or HWI stack overflows (per ROV Task and HWI reports)
Our code does not call Clock_tick() anywhere
None of our clock functions run for more than maybe 100uSec.
None of our clock function block - the fact that the semaphore is being repeatedly posted seems to confirm this .
ROV (see pictures below - I assume ROV retrieves data directly from live SYSBIOS data structures in the micro) seems to indicate Clock tick rate remain at 310uSec, and that clock functions have the right timeouts
- pTask_clockFunc_TriggerTask - this is the clock function that does nothing but set the semaphore. Its timeout of 6 ticks (near enough to 2mSec) is correct.
- zeroXing_ClockFunc_update_periods - this clock function does some background processing roughly every 4 mSec (13 ticks is near enough)

These two screen grabs were takenfrom ROV at the same time as the above Execution graph.

I've setup three of the four active clock functions to set a pin high when they start and go low when they finish. CRO caputure is below. You can clearly see where the clock functions are running at the normal expected cadence and then two of them suddenly start running very quickly.

Yellow = pTask_ClockFunc_triggerTask
Green = zeroXing_ClockFunc_update_periods
Purple = LED_ClockFunc_tick

Just for interest, I let the micro run for a while after it entered this state, then paused execution and looked at ROV again. Here is the clock function information - the tRemaining fields seem to be confused.

Related issues I've found

https://e2e.ti.com/support/legacy_forums/embedded/tirtos/f/355/p/501165/1815469 - I don't think my clock tick is unreasonably slow - particularly with suppress unnecessary ticks enabled. None of my clock functions block.

https://e2e.ti.com/support/legacy_forums/embedded/tirtos/f/355/p/399069/1412086 - All my clock functions run within about 100uSec.

I'd appreciate any help in debugging this issue!

Thank you for your time

Julian

over 6 years ago

0 Julian Kolodko over 6 years ago

Expert 2265 points

I've found a couple more hints relating to this problem.

First, I've shown above that the Clock Function SWI seems to permanently get stuck. I've now also seen a few instances where the SWI stalls briefly and recovers, without going crazy posting the semaphore. e.g. see the long green trace in the execution graph below.

Second - I've taken a closer look at TimerA0 and I **think** (trying to investigate this more) that I've noticed odd behaviour between the TAR and CCR0 registers. I think I've seen TAR get significantly ahead of CCR0 on occasion, and I think that when the Clock function SWI goes crazy, CCR0 stops being updated.

Third - We have some zero latency interrupts in our system. Could these interact badly with the timing around Clock Functions?

Thanks again for you help.

Julian

0 Julian Kolodko over 6 years ago in reply to Julian Kolodko

Expert 2265 points

I can confirm that this problem is not directly related to ZLI. As a test I changed by ZLIs to HWIs and still occasionally saw the Clock Function SWI hang up.

I can also confirm that CCR0 stops being updated when the Clock Function SWI gets "hung up".

I've traced this second issue into [SIMPLELINK]kernel\tirtos\packages\ti\sysbios\knl\Clock.c Clock_workFuncDynamic()

When the Clock Function SWI hangs, the code gets stuck in the following loop:

    while (ticksToService >= distance) {
        serviceTick = serviceTick + distance;
        ticksToService -= distance;
        distance = Clock_walkQueueDynamic(TRUE, serviceTick);
    }

ticksToService is a very large number near 2^32 when the clock function SWI stalls, so we are going to be in this loop for a long time
The call to Clock_walkQueueDynamic() runs my clock functions, explaining why I see them run over and over again very quickly and why no other lower priority threads run
Further down in function Clock_workFuncDynamic() is a call to Clock_scheduleNextTick() which updates the timer's CCR0. Since we are stuck in the loop, CCR0 is never updated.

As for why we get stuck in this loop. I don't have a definite answer yet but seeing ticksToService around 2^32 is unusual

The underlying time being used here is a 16 bit timer so the count can never naturally get that big.
I suspect the line ticksToService -= distance; is to blame here. If distance is bigger than ticksToSerivce an overflow occurs, explaining the very large value for ticksToService.

I suspect some additional checking is required here....

distance is calcualted in Clock_walkQueueDynamic() - some comments below, but I suspect my problem is that (a) I have a short clock function tick period (b) sometimes two clock functions will be scheduled on consecutive ticks (c) maybe sometimes higher priority tasks starve the clock function code for once tick. If that happens I think the code that checks if a clock function has timed out can fail and consequently lead to the behaviour I've seen. Will do some more experiments to check.

UInt32 Clock_walkQueueDynamic(Bool service, UInt32 thisTick)
{
    UInt32 distance = ~(0U);
    Queue_Handle clockQ;
    Queue_Elem  *elem;
    Queue_Elem  *clockQElem;
    Clock_Object *obj;
    UInt32 delta;

    /* Traverse clock queue */
    clockQ = Clock_Module_State_clockQ();
    elem = Queue_head(clockQ);
    clockQElem = Queue_prev(elem);

    while (elem != clockQElem) {

        /* MISRA.CAST.OBJ_PTR_TO_OBJ_PTR.2012 */
        obj = (Clock_Object *)elem;
        elem = Queue_next(elem);

        /* if  the object is active ... */
        if (obj->active == TRUE) {

            /* optionally service if tick matches timeout */
            if (service == TRUE) {

                /* if this object is timing out update its state */

****** MY COMMENT - If, for some reason, for some clock function, then is never true, then ****** 
****** obj->currTimeOut is never updated, obj->active is always true, and consequently bad ****** 
****** things can happen below                                                             ****** 
****** E.g. I have 310uSec ticks.  What if (a) a clock function is scheduled to run on the next ******
****** tick and (b) higher priority tasks starve this code for for over 310uSec                             ******

                if (obj->currTimeout == thisTick) {

                    if (obj->period == 0U) { /* oneshot? */
                        /* mark object idle */
                        obj->active = FALSE;
                    }
                    else {                      /* periodic */
                        /* refresh timeout */
                        obj->currTimeout += obj->period;
                    }

                    /* MISRA.ETYPE.INAPPR.OPERAND.BINOP.2012 MISRA.CAST.FUNC_PTR.2012 */
                    Log_write2(Clock_LM_begin, (UArg)obj, (UArg)obj->fxn);

                    /* call handler */
                    obj->fxn(obj->arg);
                }
            }

            /* if object still active update distance to soonest tick */
            if (obj->active == TRUE) {

****** MY COMMENT - See comment above - this calculation goes bad (underflows) if some some reason a clock function isn't properly timed out ****** 
****** The if statement below won't catch such an error                                                                                      ****** 

                delta = obj->currTimeout - thisTick;

                /* if this is the soonest tick update distance to soonest */

****** MY COMMENT - Is this redundant?? distance is always 0xffffffff, and delta comes from a 16bit timer.  This can never be false ****** 
******              It is meant to catch the error described above?  I don't think it will                                          ****** 

                if (delta < distance) {     
                    distance = delta;
                }

            }
        }
    }

    return (distance);
}

0 ToddMullanix over 6 years ago in reply to Julian Kolodko

TI__Guru* 96960 points

Sorry. This thread fell thru a crack. Someone will look at it today or tomorrow morning. In the meantime, can you enable logging for the Clock module. I'd like to see if a clock function is hogging the CPU.

Todd

0 Alan DeMars over 6 years ago in reply to ToddMullanix

TI__Mastermind 30830 points

Are you able to use a 1ms Clock.tickPeriod just to see if the problem still exists?

Can you try using Clock.TickMode_PERIODIC to see if the problem still exists?

Also, can you try using the default value for Clock.swiPriority?

I'm wondering if the Clock Swi processing is occasionally being held off too long by other Swis, resulting in an interrupt being missed.

Alan

0 ToddMullanix over 6 years ago in reply to Alan DeMars

TI__Guru* 96960 points

Julian,

Any update on this?

Todd

0 Julian Kolodko over 6 years ago in reply to ToddMullanix

Expert 2265 points

G'Day Todd,

Sorry for the delay - been away a couple of days.

Things I've tried/noticed

I tried changing my two most frequently running clock functions (running on 2mSec (6ticks) and 4mSec (13 ticks) cadence) into tasks - still saw the clock function stall. So issue is perhaps not related to those particular clock functions or frequently running clock functions in general.

I'm trying to put together an example program to illustrate the problem. No success so far, but in this work I've noticed that Swi.ti_sysbios_knl_Clock_workFuncDynamic__E() runs in programs that have no clock functions but do have tasks with Task_sleep() calls. I guess the BIOS scheduler uses "workFuncDynamic" when the RTOS needs to time something??? Perhaps this gives a clue that the whatever I'm doing wrong is not directly related to clock functions at all.

I'll try Alan's suggestions and get back to you later in the day.

Cheers

Julian

0 Alan DeMars over 6 years ago in reply to Julian Kolodko

TI__Mastermind 30830 points

Julian,

All BIOS timeouts (Task_sleep(), Semaphore_pend(), Event_pend()) are implemented internally using a dynamically constructed Clock object. When the dynamic Clock times out, the Clock object is destroyed.

Alan

0 Julian Kolodko over 6 years ago in reply to Alan DeMars

Expert 2265 points

G'Day Alan,

I've tried changing my tick rate from 305uSec to 1mSec, and in about 48hrs of testing I have not seen the Clock Function SWI hang.

This leads me to a few questions

Have I actually fixed the problem or just made it much less likely to occur?? I'm going to test this by going the other way, and making the tick rate faster. Hypothesis is that the clock function SWI should then be more likely to hand and it might be easier to find root cause.
- I've tried a tick of around 91uSec (3 ticks of the 32768Hz oscillator) and I'm not seeing the Clock Function SWI hang much more often than with a 305uSec tick.
I'm not entirely convinced yet the root cause here is missed interrupts. In each case where I've caught the Clock function SWI hanging, the micro has had plenty of idle time in the time leading up to the hang. This is clear in the execution graphs above.

My code had a few 1mSec task sleep calls. With a tick rate of 305uSec, this was effectively done using Task_sleep(3). Now my tick rate is 1mSec the call changes to Task_sleep(1). What does that actually do?? What is the minimum resulting sleep time? Is it guaranteed to be "at least 1 tick" (and up to 2) or is it "up to 1 tick" (depending on the relative timing of the Task_sleep() call and the next clock tick)? What does Task_sleep(0) do?

Cheers

Julian

0 Alan DeMars over 6 years ago in reply to Julian Kolodko

TI__Mastermind 30830 points

Julian,

I think you MAY have fixed the problem. I have a hunch that the issue is a pile up of Swis of higher priority than the Clock Swi delaying the processing of the Clock Swi by more than a tickPeriod. This can lead to the results you were getting. By increasing the tickPeriod, that condition may never arise again.

To be more confident about this, try setting tickPeriod back to the 305us value and let the Clock.swiPriority assumes its usual highest Swi priority value. This will insure that the Clock Swi will execute before all other Swis.

To answer about Task_sleep(1): As you suspect, the actual sleep time is "up to" 1ms and no more. But it could be considerably less if the Clock interrupt fires immediately after Task_sleep(1) is called.

Alan

0 Julian Kolodko over 6 years ago in reply to Alan DeMars

Expert 2265 points

G'Day Alan,

Thanks for your feedback.

While acknowledge that a 1mSec tick seems to make things better, I'm still a bit nervous because I don't understand the real root cause here. I can't prove whether the 1mSec tick fixes the issue, or just makes it much less likely to occur - when released our device will have to run reliably for years at a time.

I'm not quite ready to accept you idea that there is a pile up of SWIs for a couple of reasons

The execution graph in my initial post is typical of what I see when the Clock Function SWI hangs, and there is plenty of idle time there where SWIs could run if they needed to. Indeed as part of trying to resolve this issue I've reduced how often my main routine runs (roughly doubling idle time) with no reduction in hangs.
I've tried your suggestion of a 305uSec tick with Clock.swiPriority at 15 (with everything else below that). This did not help - still saw the Clock Function SWI hang.
Running with an RTOS tick rate of 91uSec doesn't make the problem noticeably worse (i.e. doesn't reduce average time to hang)

I've tried reverting from SimpleLink3.20.0.6/XDCTools3.55.2.22 back to SimpleLink 2.30.0.14/XCDTools3.8.50.24 with no improvement (I tried this because we updated versions in the last month or two so there was a concern the bug might have been introduced with that change).

Now that I've pulled out my top two most frequently running clock functions, the execution graph looks a bit different when things hang up.

The clock function SWI runs for longer than normal, then stops running at all.
The code is still stuck in the Clock.c -> Clock_workFuncDynamic() -> while (ticksToService >= distance) loop with a very large value for ticksToService. Given this, I don't understand why Clock_workFuncDynamic() stops showing up in the execution graph.
Overall the symptoms are as before - everything higher priority than the Clock function SWI runs but nothing lower priority runs except the IDLE task does run (where previously Clock_workFuncDynamic() would have been running).
- I think this makes sense - Clock_workFuncDynamic() does the scheduling - if its not running properly it probably thinks there are no user tasks etc to run, so it allows the idle task.
This execution graph generated with system tick at 305uSec

I still have not been able to create a simple demo app to illustrate the problem. At the moment I assume I'm mistreating the RTOS somehow, but can't quite figure out how and so I would really appreciate your continued help on this. If the new information above sparks any new insights or you can think of some way to dig deeper into this I'd love to hear. I'm starting to run out of ideas for how to track this down.

Cheers

Julian

0 Alan DeMars over 6 years ago in reply to Julian Kolodko

TI__Mastermind 30830 points

Julian,

I agree that my theory isn't holding up after further experimental results.

What evidence do you have that the Idle task is running?

Clock_workFuncDynamic() only schedules Clock object function execution, not tasks.

If the processor is stuck in Clock_workFuncDynamic(), it should never be returning to task mode.

Since Clock_workFuncDynamic() runs in the context of a Swi, any Swis posted of higher priority than the Clock Swi during the execution of Clock_workFuncDynamic() (ie a Swi posted by a Hwi) will run.

Is there any chance you have a non-BIOS managed interrupt that is calling BIOS APIs such as Semaphore_post()? This might explain how the Idle task is able to run while the processor is stuck in the Clock_workFuncDynamic() function.

Tell me about Hwi.timerA3_CCR0_HWI(). The extended workFuncDynamic occurs when that interrupt occurs while it is executing. What priority is assigned to it?

Alan

0 Julian Kolodko over 6 years ago in reply to Julian Kolodko

Expert 2265 points

G'Day Alan,

>> What evidence do you have that the Idle task is running?

See the execution graph in my previous post - the blue traces are knl_idle_loop__E() running. On the right hand side of the execution graph you can see workFuncDynamic() has stopped running (together with anything of lower priority than workFuncDynamic()) but idle does appear to run.

>> Clock_workFuncDynamic() only schedules Clock object function execution, not tasks.

This confuses me a little. You said earlier that "All BIOS timeouts (Task_sleep(), Semaphore_pend(), Event_pend()) are implemented internally using a dynamically constructed Clock object. When the dynamic Clock times out, the Clock object is destroyed."

I read this to mean that (an instance of) Clock_workFuncDynamic() would run and this seemed to be confirmed when I looked at execution graphs (even the one in my previous post - a bit to the left of the "Normal Operation" arrow). Any task that wakes up after being put to sleep always has the following sequence of events -

#Hwi.ti_sysbios_family_Arm_msp432_Timer fires
#Swi.ti_sysbios_knl_Clock_workFuncDynamic() fires
My task that just woke up runs

>> If the processor is stuck in Clock_workFuncDynamic(), it should never be returning to task mode. Since Clock_workFuncDynamic() runs in the context of a Swi, any Swis posted of higher priority than the Clock Swi during the execution of Clock_workFuncDynamic() (ie a Swi posted by a Hwi) will run.

I see what you are saying. Yes, it unclear why the idle task runs now.

FYI, I have three idle tasks - two from the RTOS (power management and stack monitoring) and one of my own (watchdog management).

>> Is there any chance you have a non-BIOS managed interrupt that is calling BIOS APIs such as Semaphore_post()? This might explain how the Idle task is able to run while the processor is stuck in the Clock_workFuncDynamic() function.

Here are two of my three zero latency interrupts. In my current setup, the third one never fires so is not contributing to the issue.

Nothing here but MSPware calls, unless there are BIOS calls hiding under the covers???

// Timer interrupt that triggers DMA based transfer via UART
void masterSampleTimer_ZLI(UArg arg0)
{
  // This ISR is for CCR0 interrupt only - so just clear that interrupt
  MAP_Timer_A_clearCaptureCompareInterrupt(MASTERSAMPLE_TIMER_MODULE, MASTERSAMPLE_TIMER_CCR);

  // Clear out interrupts on the optical interrupt pin and enable the interrupt
  MAP_GPIO_clearInterruptFlag(PIN_UART_TX_PORT, PIN_UART_TX_PIN);
  MAP_GPIO_enableInterrupt(PIN_UART_TX_PORT, PIN_UART_TX_PIN);

  // Prepare timer for next shot
  refCCR_value_u16 = MAP_Timer_A_getCaptureCompareCount(MASTERSAMPLE_TIMER_MODULE, MASTERSAMPLE_TIMER_CCR);
  nextCCR_value_u16 = refCCR_value_u16 + g_samplePeriod_16;  // Will roll over properly
  MAP_Timer_A_setCompareValue(MASTERSAMPLE_TIMER_MODULE, MASTERSAMPLE_TIMER_CCR, nextCCR_value_u16);
  
  // Start DMA
  if (!(BITBAND_PERI(EUSCI_A_CMSIS(UART_EUSCI_PORT)->IFG, EUSCI_A_IFG_TXIFG_OFS)) && 
      !(BITBAND_PERI(EUSCI_A_CMSIS(UART_EUSCI_PORT)->STATW, EUSCI_A_STATW_BUSY_OFS)))
  {
    // e2e.ti.com/.../622597
    // If the micro is reporting TXIFG is clear, but the UART is not busy then we have a problem.
    // To fix it, set the TXIFG flag.  This will ensure the DMA can start when the enable channel
    // command is given below.
    BITBAND_PERI(EUSCI_A_CMSIS(UART_EUSCI_PORT)->IFG, EUSCI_A_IFG_TXIFG_OFS) = 1;
  }

  g_uart_dirn_et = e_uart_Dirn_Tx;

  MAP_DMA_enableChannel(UART_TX_DMA_CHANNEL_NUMBER);  
  MAP_Timer_A_startCounter(DMA_OPERATION_TIMER_MODULE, TIMER_A_UP_MODE);  // Start DMA timeout timer

}

// GPIO interrupt detecting first falling edge at the start of a UART message
// Should only trigger once per message
void startOfMessage_ZLI(uintptr_t arg0)
{    
  uint16_t intFlags_u16 = MAP_GPIO_getEnabledInterruptStatus(PIN_UART_TX_PORT);

  if (intFlags_u16 & PIN_UART_TX_PIN)
  {	
    // Don't clear interrupt flag here.  This is a single shot interrupt and anyway the flag
	// will keep being reset as the message comes through.  Flag is cleared prior to start of 
	// transmission of a new message
	
    // Begin the Conversion for the next ADC Sample
    ASSERT(!MAP_ADC14_isBusy());  // Notify developer of ADC busy error
    MAP_ADC14_toggleConversionTrigger();

    // To ensure this ISR triggers once per message, turn off interrupts,
	// clear any pended interrupts.
    MAP_GPIO_disableInterrupt(PIN_UART_TX_PORT, PIN_UART_TX_PIN);
    MAP_Interrupt_unpendInterrupt(INT_PORT1 + (PIN_UART_TX_PORT - 1));
  }
}

>>Tell me about Hwi.timerA3_CCR0_HWI(). The extended workFuncDynamic occurs when that interrupt occurs while it is executing. What priority is assigned to it?

timerA3_CCR0_HWI() is the TimerA3 CCR0 interrupt - priority 2. Runs once a mSec. Generate a 32bit mSec count and a 32bit seconds count. There is no direct relationship between this HIW and workFuncDynamic().

Cheers

Julian

0 Alan DeMars over 6 years ago in reply to Julian Kolodko

TI__Mastermind 30830 points

Julian,

1) When the Clock module's timer interrupt goes off, it posts a dedicated Swi.

The Swi function for that Swi is Clock_workFuncDynamic().

Clock_workFuncDynamic() processes all of the active Clock objects looking for timeouts and calling their respective Clock functions accordingly.

The transient Clock objects constructed by the various APIs that support timeouts (Task_sleep(), Semaphore_pend(), etc) are respectively configured to invoke functions that handle the corresponding timeout, at which time the Clock object is then destroyed.

2) What does "priority 2" mean? Is that the literal value you've passed to Hwi_create() for the priority?

The valid values of priority are 0xe0, 0xc0, 0xa0, 0x80, 0x60, 0x40, 0x20, 0x00.

Any value below 0x20 (the default Hwi.disablePriority setting) will be treated as a zero latency interrupt and not be routed through the BIOS interrupt dispatcher.

Any Hwi function not handled by the BIOS interrupt dispatcher is prohibited from invoking any BIOS APIs so as to avoid corrupting the behavior of the Task scheduler.

What priority values are you using for all of the interrupts in your application?

Alan

0 Julian Kolodko over 6 years ago in reply to Alan DeMars

Expert 2265 points

Sorry yes, I #defined the HWI priorities for my own use. Priority 2 = 0x40

Zero latency interrupts -> I've configured the RTOS so that the two highest priority levels are both for ZLIs. I have three ZLI, two at priority HWI_PRIORITY_0 (they effectively run sequentially - one triggers the next - and both need to have minimal jitter) and the other at HWI_PRIORITY_1 (which is used for timing certain events as accurately as possible).

I have 15 HWIs and I'm not sure how much deal you want about those. Their priorities are scattered between 2 and 7.

Cheers

Julian

0 Alan DeMars over 6 years ago in reply to Julian Kolodko

TI__Mastermind 30830 points

Julian,

What value did you set Hwi.disablePriority to?

Are you modifying any of the Hwi priorities dynamically using Hwi_setPriority()? If you inadvertently set a BIOS-dispatcher-managed-Hwi priority to a number less than Hwi.disablePriority, then Hwi_disable() will no longer mask that interrupt and that could violate critical section code within BIOS, leading to state corruption.

Are you able to share your CCS project? If not, then maybe just a .out file with its corresponding app_pem4f.rov.xs file (which can be found in the Debug/configPkg/package/cfg directory)? I can try running it on my local MSP432P4111 launchpad and see if I can recreate the symptom.

Alan

0 Julian Kolodko over 6 years ago in reply to Alan DeMars

Expert 2265 points

G'Day Alan,

I have Hwi.disablePriority set to 64 and I'm not dynamically changing priorities.

I'm not really able to share the source and I'm not sure our code will run on a Launchpad - its very dependent on our hardware. I'll see what I can put together.

That said, I have now run a unit for a week with a 1mSec tick rate and no problems encountered. Although I don't really understand the root cause here this seems good enough for the moment, at least until I can get a more specific handle on what triggers the problem.

Cheers

Julian

0 Alan DeMars over 6 years ago in reply to Alan DeMars

TI__Mastermind 30830 points

One thing I failed to notice in your initial post is all those Semaphore_posts going on while the system is stalled in workFuncDynamic():

Do you know what thread is doing all the Semaphore posting? I don't see a Hwi function label nor anything hinting that a transition from workFuncDynamic to another thread has taken place.

Is there a chance that the system is being clobbered by an interrupt storm that fuels itself under certain conditions?

Alan

0 Julian Kolodko over 6 years ago in reply to Alan DeMars

Expert 2265 points

G'Day Alan,

FYI, I notice I now have two different links to this thread, and each link shows a different subset of comments. Not quite sure what is going on there...

(This set of comments) https://e2e.ti.com/support/microcontrollers/msp430/f/166/p/843943/3142049

(Another set of comments) https://e2e.ti.com/support/microcontrollers/msp430/f/166/t/843943

Back to the semaphore posting. I don't think this is an interrupt storm.

In the code I had at that time, I had a tick of 305uSec, and two clock functions with fairly high cadence (2mSec and 4mSec). The 4mSec clock function did some house keeping, and all the 2mSec clock function did was post a semaphore so that a task could be triggered periodically (I did this so that task would run with a regular cadence - a Task_sleep() does not guarantee a regular cadence if that task is being preempted).

What you see in the execution graph is that semaphore being posted over and over again, implying that the 2mSec clock function was being run over and over again (about every 50uSec). Further digging showed that both the 2mSec and 4mSec clock functions were being run over and over again (but the 4mSec clock function does not do anything that leaves a trace in the execution graph).

I traced this issue into [SIMPLELINK]\kernel\tirtos\packages\ti\sysbios\knl\Clock.c Clock_workFuncDynamic()

When the the semaphore starts being hammered, I find that the above function is stuck in this loop because ticksToService is a very large (near 2^32) number.

    while (ticksToService >= distance) {
        serviceTick = serviceTick + distance;
        ticksToService -= distance;
        distance = Clock_walkQueueDynamic(TRUE, serviceTick);
    }

It seems as though ticksToService overflowed (not sure how). That causes the loop to run indefinitely (as long as no higher priority SWIs/HWIs want to run), which in turn causes my clock functions (run by Clock_walkQueueDynamic()) to run over and over again.

I tried switching the two clock functions to tasks with a Task_sleep() [priority of these tasks was low, and having an accurate cadence wasn't really necessary] and I still saw the code getting stuck in the Clock_workFuncDynamic() loop, but in this case the execution graph (I've included an example execution graph in one of my posts in the "other" link) shows the scheduler dropping back to the idle task (and not running any task/thread with priority above idle and below Clock functions) where previously it was stuck running Clock_walkQueueDynamic().

Cheers

Julian

0 Alan DeMars over 6 years ago in reply to Julian Kolodko

TI__Mastermind 30830 points

Julian,

Is the LED_ClockFunc_Tick always the affected Clock object (tRemaining is huge)?

When and where does the LED_ClockFunc_Tick Clock object get started? Is this being started from another Clock object?

Alan

0 Julian Kolodko over 6 years ago in reply to Alan DeMars

Expert 2265 points

I didn't pay close attention to that. My recollection (will check next week) is that once this problem starts occurring, any clock function that runs ends up with an invalid tRemaining. In the ROV screenshots above, the LED clock function is clearly invalid, but so are all the other clock functions where tRemaining is marked in red (e.g. the ZeroXing clock function should running every 4mSec, and Ptask should be running every 2mSec and the tick period is 305uSec tick - so 8277247 for tRemaining or there abouts can't be right).

All my clock functions are created programmatically at startup. The LED clock function is a single shot clock function created in the stopped mode. To start the LED clock function running I have the following function

if (alreadyBlinking) return;

key = Hwi_disable();

LED_turn_off();
Clock_stop(g_LED_Tick_clockFunc_handle);

// Setup new LED indication

Clock_setTimeout(g_LED_Tick_clockFunc_handle, LED_MIN_DELAY_TICKS);
Clock_start(g_LED_Tick_clockFunc_handle);

Hwi_restore(key);

The LED clock function itself is a a statemachine where each state does its job then sets up the clock function to run again a some point in the future to perform the next part of the operation until the operation completes.

0 Alan DeMars over 6 years ago in reply to Julian Kolodko

TI__Mastermind 30830 points

Julian,

After some closer looking, it appears that the Clock servicing logic may be vulnerable to the impact of zero latency interrupts.

As an experiment, we'd like you to try changing all your zero latency interrupts into normal dispatched interrupts and see if the application is stable with a Clock.tickPeriod of 305us.

I realize that your application will not perform optimally due to the jitter introduced by this experiment but we'd like to determine if our hunch is correct.

Alan

0 ToddMullanix over 6 years ago in reply to Alan DeMars

TI__Guru* 96960 points

Julian,

Did this get resolved? If not, can you try the things Alan recommended?

Todd

0 Julian Kolodko over 6 years ago in reply to ToddMullanix

Expert 2265 points

Sorry, yes, I've tried Alan's suggestion and I still get the problem when my ZLI's are changed to HWI's

Working through this problem I've been able to establish that its not a straight forward case of certain HWI's or SWI's being starved of execution time leading to corruption of SYSBIOS state, and I know its not related to our use of ZLI's. What does seem to have worked is changing the tick rate from 305uSec to 976uSec as described elsewhere in this thread. I've had a unit with this change running for over 2 weeks now with no problems. My only concern is that I don't understand exactly why this is a solution so I'm not sure whether the problem is solved, masked or just must less likely to happen.

Could you answer one last question, then we can close this thread off - I'll open a new thread if this problem gets in my way again.

This is a long shot, first because I can't find a way my code could do this, and second, because I think I did try making my clock functions the highest priority SWI (they are currently the lowest priority SWI).

In any case, I'm curious.

What's the worst that can happen if I accidentally do a Clock_setTimeout(hndl, 0)? Looking at the source, SYSBIOS doesn't prevent this (but it does prevent Task_sleep(0)). In particular, I used to have a tick rate of 305uSec, but I also had a (higher priority) SWI that ran for over 400uSec (i.e. longer than one tick). What if I did a Clock_setTimeout(hndl, 0), then before that clock function could run, that other SWI ran holding off the clock function until the next RTOS tick. In this case, two ticks would pass by before the clock function stuff could run. Could that cause corruption of RTOS state?

Cheers

Julian

0 ScottG over 6 years ago in reply to Julian Kolodko

TI__Mastermind 26780 points

Hi Julian,

At first look I don't see that this would be a problem. Can you describe the execution sequence you are suspecting a bit more? Are you calling Clock_start() immediately after Clock_setTimeout(x,0), and then the other Swi triggers and preempts? And what execution context are these Clock calls made from?

I have to leave the office now but will look at this tomorrow...

Regards,
Scott

0 Julian Kolodko over 6 years ago in reply to ScottG

Expert 2265 points

G'Day Scott,

Yes, in the context of a clock function I'd run Clock_setTimeout(x, 0) and then Clock_start() on the next line, something like the code below (remembering clock functions are the lowest priority SWI in my code base)

What if between the Clock_setTimeout() and then Clock_setTimeout(x, 0) call, OR immediately after the Clock_start() call, a higher priority SWI or HWI preempts my clock function (and by extension, the SYSBIOS clock function handler) , such that, by the time the SYSBIOS clock function handler gets a chance to run, the scheduled start time for my clock clock function has passed.

By the way, I understand Clock_setTimeout(x, 1) means that my clock function is scheduled to run in *up to* one tick. What does Clock_setTimeout(x, 0) mean?

#define USEC_PER_TICK 305

clk_fxn0()

{

static uint32_t uSecInterval = 10;

// Do some work - uSecInterval will be set based on the result of that work

doStuff();

uSecInterval = outcomeOfStuff(); // What if this returns 0

// Setup next execution of this clock function (which is configured as a one-shot)

Clock_setTimeout(clk_fxn0_hndl, uSecInterval / USEC_PER_TICK);

Clock_start();

}

Cheers

Julian

0 Julian Kolodko over 6 years ago in reply to Julian Kolodko

Expert 2265 points

G'Day Scott,

I've tried to put together a little test program to illustrate my point, but no matter how I abuse the operating system with 0 timeouts and preempting my clock function at inopportune times, I can't get my program to misbehave. The operating system is absolutely solid.

Whatever I'm doing to upset the clock function handler appears to be far more subtle than just an invalid clock function time out.

I'm going to close off this thread now. I'm satisfied that changing my tick rate from 305uSec to 1mSec seems to have solved my problem even if it doesn't explain it.

Will start a new thread if I come up with anything more specific in future.

Thank you to you and to Alan and to Todd for your support. It is much appreciated.

Cheers

Julian

0 ScottG over 6 years ago in reply to Julian Kolodko

TI__Mastermind 26780 points

Julian,

Ok, thank you for the update, and for closing this out. I’ve been reviewing the sources and haven’t found any sequence where a timeout of zero would be a problem. I had a couple of ideas along the way but those did not pan out. I’m not quite done yet, but will finish my review while I’m at it. I will re-open this thread if I find anything.

Cheers,
Scott

**Attention** This is a public forum

MSP low-power microcontrollers

MSP low-power microcontroller forum

MSP432P4111: SYSBIOS Clock Function SWI appears to be blocking all tasks from running