
GateMutexPri not properly locking out task

Other Parts Discussed in Thread: SYSBIOS

Hi,

I am having a problem where a GateMutexPri doesn't seem to stop a task from entering a critical section of code when another task (seemingly) already has the gate.

Software/Hardware:
- TI-RTOS 1.01.00.25 (SYS/BIOS 6.34.04.22)
- XDC Tools 3.24.6.63
- Code Composer Studio 5.5
- TI Compiler 5.0.6
- LM3S9D92 custom board

Details of the problem:
The problem manifests when two tasks end up calling Event_pend() on the same event, causing an XDC runtime error. This should never happen, as the code that calls Event_pend() is protected with a GateMutexPri.

Of the two tasks in question, one is priority 16 (task_high) and one is priority 12 (task_low). When the problem occurs, according to the ROV window, task_low has the gate and task_high is pending on the event. Under normal operation the same task that is pending should also have the gate (and this does normally work).

I've used the TI provided Logger in order to get more information regarding what is going on in the system. The code also fails the same way without any logging instrumentation.

Here is what happens:
Initial state of the tasks:
task_high is pending on an event (EVT_inverter_cntl)
task_low is running

1. task_low calls the function
2. task_low logs that it is going to enter the gate
2.1 Somewhere in here, the hardware timer interrupt that runs the Clock tick fires.
2.2 The tick also runs a Clock function that posts the EVT_inverter_cntl event; the scheduler runs, task_high becomes ready, and a switch to task_high occurs.
3. task_high calls the function
4. task_high logs that it is going to enter the gate
5. task_high indicates through the log that it entered the gate
6. task_high pends on EVT_SBLink_ORQ, causing a switch so that task_low runs
7. task_low indicates through the log that it entered the gate
- This shouldn't happen if task_high had properly acquired the gate. I would expect task_low to block here and task_high to run until it leaves the gate.
8. task_low pends on EVT_SBLink_ORQ; this causes the XDC runtime error and, ultimately, application failure.

There is no way to pend on the event without acquiring the gate and there is no way to return from the function without leaving the gate.

If it is necessary I could probably post the actual code in question.

In order to replicate the failure I have to leave the application running overnight at the office. I've tried making a simple example application to replicate the failure to no avail.

The code looks something like this:
function {
    log - about to enter gate
    key = GateMutexPri_enter(gate)
    log - entered gate

    log - about to pend on event
    Event_pend(event, Event_Id_NONE, EVT_SBLink_ORQ, BIOS_WAIT_FOREVER)
    log - pended on event

    GateMutexPri_leave(gate, key)
    log - left gate
}


Any insight into the problem would be helpful. Thanks!

Edit: Is it possible that there is a critical section of code in GateMutexPri_enter that is not being protected in this case?

  • Hi Mike --

    Do you know if Assert checking is enabled for this build? I wonder if you are somehow calling GateMutexPri_enter/leave from an Hwi or Swi thread. We have an Assert within the GateMutexPri code to check for this. You could add a bogus call to GateMutexPri_enter() from one of your ISRs to make sure that the Assert is active.
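    For reference, Assert checking is typically controlled from the .cfg file; a minimal fragment that forces asserts on for all modules might look like this (module names assume the stock xdc.runtime setup):

```js
/* Force Assert checking on for all modules (build-time config sketch) */
var Defaults = xdc.useModule('xdc.runtime.Defaults');
var Diags = xdc.useModule('xdc.runtime.Diags');
Defaults.common$.diags_ASSERT = Diags.ALWAYS_ON;
```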

    I just reviewed the GateMutexPri code and I don't see any obvious flaws, but I'll ask one of the developers to take another look.

    Regards,
    -Karl-

  • Hi Karl

    Thanks for the reply, I did double check that the assert is enabled by calling GateMutexPri_enter() in one of my HWIs. The call did raise an XDC runtime error at the assertion.


    Edit:

    I noticed GateMutexPri_enter calls Task_disable(), will this actually stop the scheduler from switching tasks?

  • Yes.   Task_disable() disables the scheduler.


    I reviewed the code and I cannot understand why you are seeing the behavior you are seeing.

    Have you reviewed your debug code to make sure the prints make sense?  Sometimes with task switches, the switch can occur right after your print and confuse you.   If you use the "BIOS.libType = BIOS.LibType_Custom" option in your .cfg file, you can edit GateMutexPri.c and add some instrumentation there. 
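    For reference, the custom kernel build is selected in the .cfg file; a minimal fragment:

```js
/* Build the SYS/BIOS kernel from source so local edits to files
 * like GateMutexPri.c get compiled into the application. */
var BIOS = xdc.useModule('ti.sysbios.BIOS');
BIOS.libType = BIOS.LibType_Custom;
```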

    It would really help to have a test case, but I understand that the nature of the failure would make it difficult to make one.

    -Karl-

  • I'll see what I can do with modifying GateMutexPri.c to gather more information. Is there anything specifically I should look at? I suppose I can look into the Task module and make sure that task switches are really being disabled around the critical section.

    I used the Log_info* macros, and each log included Task_self() so I can tie it back to the running task. Also, I used the gate handle to look into the ti_sysbios_gates_GateMutexPri_Object struct for the mutexCnt and owner. I used this logging in addition to the Task, Swi, and Hwi logs in order to determine the flow.

    I'll continue to try recreating with a simple test case.

  • Mike,

    If Tasks or Swis are disabled prior to calling GateMutexPri_enter(), the calling thread will not block as expected.

    Newer versions of GateMutexPri_enter() Assert that the Task and Swi schedulers were enabled upon entry, so that the task will block correctly within the Task_restore() call rather than simply return.

    Try replacing your copy of GateMutexPri.c in your BIOS installation with the latest version attached. Then rebuild your application using BIOS.libType = BIOS.LibType_Custom and see if you hit the Assert.

    Alan

    1464.GateMutexPri.c

  • Thanks for the suggestion Alan. I went ahead and made the changes you suggested. I will post here with my findings when the XDC runtime error is thrown.

  • Got into the office this morning to find that the same problem happened using the GateMutexPri.c that Alan attached.

    After looking through the logs it has failed in the same way I described above.

  • It looks like I didn't properly integrate the changes Alan suggested. After taking another look it appears I need more than just that .c file. I'm looking into what else needs to be changed in order to add this extra assert.

  • Mike,


    That was my bad. Can you try replacing this:

            Assert_isTrue(((tskKey == 0) && Swi_enabled()),
                         GateMutexPri_A_enterTaskDisabled);

    with this:

            Assert_isTrue(((tskKey == 0) && Swi_enabled()),
                         NULL);

    Alan

  • Alan,

    Yesterday I ended up using GateMutexPri_A_badContext instead of NULL. Regardless, the software failed overnight and the new asserts were not hit. It is still failing on the Event_pend assert.

    Is there some section of code in GateMutexPri_enter where an Event_post running at the Swi level could cause problems?

    Mike

  • Is there some way I can instrument the GateMutexPri functions with Log_write or Log_info? If so, what files do I need to modify?

  • Would System_printf() work for you?

  • Would that require me to have a JTAG debugger hooked up or could I obtain the information after the failure happens with the debugger?

    If I can obtain the information after the failure happens by using a debugger then that would work. Also, preferably I could tie it to the same time stamps used in the Logger for completeness.

  • Tell me more about the hardware timer that runs the Clock module.

    Are you using your own timer for this?

    If so, can you show me how you're configuring this timer?

    I'm wondering if the Clock function that posts the event is being run from a Hwi that is not being invoked from the Hwi dispatcher. This would cause serious Task scheduling issues.


    Alan

    If you use SysMin as your System provider, the System_printf() output goes into the SysMin output buffer, which is viewable with ROV after you attach to the device.

     var System = xdc.useModule('xdc.runtime.System');
     var SysMin = xdc.useModule('xdc.runtime.SysMin');
     System.SupportProxy = SysMin;
     SysMin.bufSize = 2048;
     SysMin.flushAtExit = false;

    Alan

    The Clock module has the "Use Timer to automatically call Clock_tick()" radio button selected, which I believe uses one of the general-purpose timers on the microcontroller. The tick period is set to 1 ms. A Clock instance is created at runtime using the following code:

    Clock_Params clkParams;

    Clock_Params_init(&clkParams);
    clkParams.period = 1;
    clkParams.startFlag = TRUE;
    clkParams.arg = (UArg)0x5555;
    Clock_create(prd_timer_1ms, 1, &clkParams, NULL);

    The function prd_timer_1ms keeps a count, and every 60th time through it posts the EVT_inverter_cntl event I mentioned in the original post. I believe prd_timer_1ms is supposed to run at the Swi level, not the Hwi level.

    I'll work on adding some prints in the GateMutexPri_enter code so I can see what's going on inside the function when the problem happens.

    Edit:

    Looks like SYS/BIOS sets up interrupt 35 for this timer and it is of type 'Dispatched'. That interrupt calls ti_sysbios_knl_Clock_doTick__I, which does in fact post a Swi that runs the Clock functions as necessary.

  • Alan,

    The application has been running for 4 days now and I haven't seen a failure. I wonder if the System_printfs are preventing the failure. Any ideas on this?

    Is there a less intrusive way to do more logging in GateMutexPri_enter instead of System_printf?

  • Still looking for some help on this. It has been a week straight of running the code (on 5 boards) and I cannot make it fail with System_printf in the GateMutexPri_enter function. Before adding the System_printf calls multiple boards would fail within a 24 hour window.

    I would like to get to the root cause of the problem in order to understand what is going on so it can be fixed for this application and avoid the problem in the many other applications we have running similar implementations.

    The fact that it won't fail now still makes me think that there is some timing issue with a critical section code not being properly protected.

    Help would be greatly appreciated, thanks.

  • Mike,

    My apologies. My attention got diverted for a few days. I'm back on this today.

    Please tell me more about the interrupts in your application. How are they being created? I know about the Timer interrupt used for the Clock module. But are there others that you are explicitly configuring?

    Alan

  • Hi Alan, thanks for getting back.

    Hardware Interrupts:
    - CAN0 (in cfg file)
    - CAN1 (in cfg file)
    - SSI0 (in cfg file)
    - EMAC from TI-RTOS EMAC drivers
    - I2C from TI-RTOS I2C drivers
    - The two timers that SYS/BIOS sets up

    Software Interrupts:
    - Clock function
    - EMAC_handleRx

  • Thanks. Can you share the config file so I can have a look around?

    What kind of BIOS APIs are being invoked from your interrupt handlers?

    Alan

  • Alan,

    We use very few BIOS APIs from the interrupts. I've attached the configuration file.

    Hardware Interrupts:
    - Mailbox_post (BIOS_NO_WAIT)
    - Event_post
    EMAC and I2C are TI provided so I assume you would know more about them than I.

    Software Interrupts:
    - Event_post

    8551.bios_config.cfg

  • Can you try commenting out this line in your config file:

      HwiM3.dispatcherAutoNestingSupport = false;

    For your application, since no interrupts are created with a different priority than any other interrupt, nesting of interrupts is already prevented by the M3's internal NVIC.

    When HwiM3.dispatcherAutoNestingSupport is set to 'false', the code in the interrupt dispatcher that invokes the user's Hwi function expects that global interrupts are explicitly disabled upon returning from the Hwi function. If this is NOT the case, then critical section code executed immediately afterwards is UNPROTECTED from interrupts.

    Alan

  • This sounds promising Alan. I'll give it a shot.

    If HwiM3.dispatcherAutoNestingSupport is set to 'false', is the ISR supposed to disable global interrupts?

  • By default, HwiM3.dispatcherAutoNestingSupport is true, which means the Hwi functions are called with interrupts enabled.

    The idea behind HwiM3.dispatcherAutoNestingSupport = false is to allow the user to decide if and when they want interrupts enabled within their Hwi function.

    The problem is that certain BIOS APIs (such as all of the XXX_post() APIs) have the hidden side effect of enabling interrupts. This is admittedly not a well documented feature.

    On the other hand, to date, I think you may be the first customer to have ever set HwiM3.dispatcherAutoNestingSupport to false...

    Alan

    Interesting. I set this a few years ago, so I can't remember exactly why I did so. However, the documentation does say, "Set this parameter to false if you don't need interrupts enabled during the execution of your Hwi functions," which I speculate I did because my application doesn't need nested interrupts. I'll load up all my boards, let them run overnight, and see what happens.

    I might be jumping the gun on this one but would there be a good way to verify that this is in fact the problem that I'm running into?

  • Alan,

    I had 2 of my 4 boards fail overnight in the same way that they did in the original post. I took out all of the printfs in GateMutexPri_enter. I have full JTAG access and logs for the Task, Event, Hwi, Swi, and application layers.

    Is there anything that might be of more help to you accessible through JTAG? Something in the ROV window?

    If necessary I can interleave the logs together to make a flow outlining the events.

    This problem turned out to be a very subtle code-reordering issue in the TI code generation tools. It was resolved by careful use of the 'volatile' attribute on several kernel-internal variables. The fix will be included in the upcoming BIOS 6.41 release, due out within a week, and will also be included in several older BIOS point releases, the soonest being 6.34.06, which will also be released within a week.

    If anyone needs a temporary workaround, they can edit their ti/sysbios/knl/Task.c file and use the BIOS.libType = BIOS.LibType_Custom option to rebuild their SYS/BIOS kernel with the fix in place.

    Change this:

    UInt Task_disable()
    {
        UInt key = Task_module->locked;

        Task_module->locked = TRUE;
        return (key);
    }

    to this:

    UInt Task_disable()
    {
        UInt hwiKey = Hwi_disable();
        UInt key = Task_module->locked;

        Task_module->locked = TRUE;
        Hwi_restore(hwiKey);
        return (key);
    }

    While you're at it, you might as well fix another issue we uncovered within GateMutexPri. It has to do with how Tasks that unsuccessfully attempt to enter a gate are queued. They are supposed to be queued up in task priority order so that when the owner task calls GateMutexPri_leave(), the highest priority task will be given the gate first. Sadly, the queueing mechanism is broken. The fix for this is simple.
    In ti/sysbios/gates/GateMutexPri.c, change this:
            /* Tasks of equal priority will be FIFO, so '>', not '>='. */
            if (newPri > Task_getPri((Task_Handle)qelem)) {
                /* Place the new element in front of the current qelem. */

    to this:
            /* Tasks of equal priority will be FIFO, so '>', not '>='. */
            if (newPri > Task_getPri(((Task_PendElem *)qelem)->task)) {
                /* Place the new element in front of the current qelem. */

    Sorry for any trouble.
    Alan
  • To follow up on this thread and this bug, here's some additional info.

    This bug has only been seen with the TI/Arm compiler and the TI/C28 floating-point compiler, although it might be present for other compilers depending on version and optimization levels. There's a variable in the kernel that should be declared 'volatile' to keep the optimizer from reordering accesses to it. This bug only affects applications whose user code uses GateMutexPri; GateMutexPri is not used internally by SYS/BIOS. This bug has been fixed in the following releases: 6.33.08, 6.34.06, 6.35.06, 6.37.05, 6.40.04, and 6.41.00. We recommend that you update to one of these releases (or later) if you are using GateMutexPri.
    http://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/bios/sysbios/index.html