crash due to fdCloseSession() in Task's hooks.deleteFxn

Other Parts Discussed in Thread: SYSBIOS

Hi,

My program doesn't crash with bios_5_41_07_24 (CCS 4.2.1) but does crash with bios_6_33_04_39 (CCS5). After a call to Task_delete() the crash happens in the exit handler I registered in my .cfg file as hooks.deleteFxn. The reported crash message is "stack overflow".

The problem is that my cleanup code calls fdCloseSession() there. After analysing it, it turns out that my code accesses the stack, but with SYS/BIOS 6 the stack has already been deleted by the time hooks.deleteFxn is called. This means the crash message "stack overflow" is just a follow-on error. The whole thing works with SYS/BIOS 5 because it releases the stack only after hooks.deleteFxn has run.

How can I fix this problem?

  • Use the exit hook instead of the delete hook to call fdCloseSession().

    Mark

  • I registered both, hooks.exitFxn and hooks.deleteFxn. But I've seen while debugging that the exit hook is never called after my control task killed the worker task with Task_delete(). I need a way for one task to force the exit of another task, and Task_delete() is the only way I found to do that.

    So I suspect your hint (to put fdCloseSession() in the exit hook) does not work, because only hooks.deleteFxn is called, right?

    P.S.: I also use these settings in the .cfg file:

           var Task = xdc.useModule('ti.sysbios.knl.Task');

           ...

           Task.deleteTerminatedTasks = true;
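
For completeness, here is a hedged sketch of what such a .cfg might look like in full. The hook-function names are placeholders (not from this thread); Task.addHookSet() is the documented way to register Task hooks:

```javascript
var Task = xdc.useModule('ti.sysbios.knl.Task');

/* Automatic cleanup of terminated tasks (required by NDK 2.21 and later). */
Task.deleteTerminatedTasks = true;

/* deleteFxn runs when Task_delete() is called on a task;
 * exitFxn runs only when a task returns or calls Task_exit() itself. */
Task.addHookSet({
    deleteFxn: '&myDeleteHook',  /* placeholder name */
    exitFxn:   '&myExitHook'     /* placeholder name */
});
```

Note the asymmetry spelled out in the comments: a task killed from outside via Task_delete() never runs its exitFxn, which matches the behavior observed above.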

  • Which version of the NDK are you using?

    If you are using NDK 2.20 or earlier, you should not set Task.deleteTerminatedTasks to true, as this conflicts with some internal task-delete logic used within the NDK.

    Starting with NDK 2.21, you need to set Task.deleteTerminatedTasks to true to get internal NDK threads deleted after they terminate.

    Since NDK 2.21 depends on this new configuration parameter for its internal cleanup, it will issue a warning if the parameter is not set to true.

    -Karl-

  • Karl Wechsler said:

    Which version of the NDK are you using?

    Starting with NDK 2.21, you need to set Task.deleteTerminatedTasks to true

    ndk_2_21_00_32, and as already mentioned, Task.deleteTerminatedTasks is set to true. So I guess I'm doing it right. I already noticed that I run into an "out of memory" condition if it is set to false, because my webserver starts new tasks with every new connection, so I need that automatic task cleanup of SYS/BIOS.

    Another part of my application (not the webserver) consists of a control task and a worker task. The control task kills the worker task with Task_delete(), this crashes in the deleteFxn hook of the worker task as reported above, and the exitFxn of the worker task is never called.

    What do you suggest?

  • Can you please send the contents of the function you registered as hooks.deleteFxn?   What is this function doing?

    Can you also please send the entire "stack overflow" error string?     I assume that it is this string, but I want to be sure:

    "E_stackOverflow: Task 0x%x stack overflow."


    This error happens if the stack for the running task has overflowed. Can you use ROV to check the size of your control task stack and see if you are close to the limit? If the control task calls Task_delete(), which calls your deleteFxn, and your deleteFxn uses a lot of stack, then this could be the cause of the overflow.

    Do you crash in the deleteFxn?   Does your deleteFxn block (i.e., call Semaphore_pend() or give up the CPU for some other reason)?

    Or do you return from the deleteFxn and crash sometime after that?

    Sorry for the 100 questions, but I need more info.

    Thanks,
    -Karl-

  • Karl Wechsler said:

    Can you please send the contents of the function you registered as hooks.deleteFxn?   What is this function doing?

    ...

    Do you crash in the deleteFxn?   Does your deleteFxn block (i.e., call Semaphore_pend() or give up the CPU for some other reason)?

    About the contents of deleteFxn: the function is rather complex and calls several other functions, so it makes no sense to post the code. I think the most interesting actions are:

    • Several Memory_free() calls to free memory that was allocated by the task to be killed
    • A call to fdCloseSession() with the task handle of the task to be killed as its argument
    • Maybe I should mention that we guard setting or clearing some bits in a status variable by calling Task_disable() + Hwi_disable() and Hwi_enable() + Task_restore().
    • And there are several calls to Task_yield() (legacy code of our application; it makes no difference if I remove them)

    It looks like the crash (when it occurs) happens in fdCloseSession(): I can see my logging before this function is called, but not my logging after it returns.

    Karl Wechsler said:

    Can you also please send the entire "stack overflow" error string?

    The error string is:

    ti.sysbios.knl.Task: line 334: E_stackOverflow: Task 0x834a9fd8 stack overflow.

    0x834a9fd8 is the task handle of the Task for which I called Task_delete().

    But I’m very sure there is no real stack overflow. A look into the SYS/BIOS code confirmed my assumption: the stack of the task to be deleted is freed before hooks.deleteFxn is called. (BIOS 5 did it the other way around: it first called the delete hook and freed the stack afterwards; that’s why the same code didn’t crash with BIOS 5.)

    I can see in the memory view that the start of the stack at 0x847A8188 is properly filled with lots of 0xBEBEBEBE before I call Task_delete(), and at the same place there is “0x849C1AF0 0x00200000” followed by lots of 0xBEBEBEBE when hooks.deleteFxn is called. As 0x00200000 is the size of the stack of the task being killed, this looks like free-block bookkeeping (the size of a free memory block and the address of the next free block) written by Mem_free().

    The stack overflow is detected on a task switch (by Task_checkStacks(), which is registered as hooks.switchFxn by default) when the scheduler tries to switch to the task for which I called Task_delete() while hooks.deleteFxn is still running, I think especially while it is in fdCloseSession(). I wonder why SYS/BIOS tries to switch to a task that is currently being deleted by Task_delete().
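
For readers following the analysis: SYS/BIOS fills task stacks with 0xBE, and the switch-hook check essentially inspects whether that watermark is still intact. A hedged, self-contained sketch of the idea (our simplification, not the actual Task_checkStacks() code) shows why a heap free-block header written over a freed stack trips the very same check:

```c
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the SYS/BIOS stack check: stacks are filled
 * with 0xBE at task creation, and the check treats an overwritten
 * watermark at the stack base as an overflow. */
static int stackOverflowed(const uint8_t *stackBase)
{
    return stackBase[0] != 0xBE;
}

int demo(void)
{
    uint32_t stackWords[16];
    uint8_t *stack = (uint8_t *)stackWords;

    memset(stack, 0xBE, sizeof(stackWords)); /* fresh task stack */

    if (stackOverflowed(stack)) return 1;    /* healthy: no overflow */

    /* After Task_delete() frees the stack, the heap may write its
     * free-block bookkeeping (e.g. next pointer + block size, like the
     * 0x849C1AF0 0x00200000 observed above) over the first words: */
    stackWords[0] = 0x849C1AF0u;  /* next-free-block address (observed value) */
    stackWords[1] = 0x00200000u;  /* block size              (observed value) */

    /* The same check now reports a (false) stack overflow. */
    if (!stackOverflowed(stack)) return 2;
    return 0;
}
```

This is why the error is a follow-on symptom: the check is correct for a live stack but is being run against memory that has already been returned to the heap.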

    Karl Wechsler said:

    Can you use ROV to check the size of your control task stack

    Well, I have some trouble using ROV, maybe because we use Makefile-based projects instead of managed projects.

    Anyway, I’m very sure the stack of the control task is not actually overflowed, and (as mentioned above) the stack overflow is reported for the task I'm trying to delete, not for the task that calls Task_delete().

  • Are you able to switch to using the latest SYS/BIOS release (6.34.02.18)?

        http://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/bios/sysbios/index.html

    There is a bug fix regarding a race condition in Task_delete() that sounds like what you are running into.

    Alan

  • Sorry, I've updated to 6.34.02.18 but it behaves exactly as before. No effect, so to speak.

  • Hmm.

    Are you absolutely sure you rebuilt your application with 6.34.02.18?

    There was a hole in the implementation of Task_delete() in SYS/BIOS 6.34.01.14 that allowed interrupts to go off while the task was being deleted.

    The theory is that the interrupt that posts the semaphore that readies the task you're deleting goes off somewhere during the execution of the Task_delete() function.

    With the 6.34.01.14 version of Task_delete() this would result in the very behavior you were seeing: the Task scheduler would attempt to switch to the just-deleted task. The task switch hook that checks for stack overflow would then discover that the switched-to task's stack appears to have overflowed.

    I believe the implementation of Task_delete() in 6.34.02.18 prevents this fatal race condition from occurring.

    Alan

  • Hello all,

    I'm a colleague of F. Brettschneider and involved in debugging this problem as well.

    Triggered by the advice in this discussion I updated to SYS/BIOS 6.34.02.18 and XDCTools 3.24.03.33 and use them together with NDK 2.21.00.32 and CCS 5.2.0.00069. I removed all other versions of SYS/BIOS, the NDK and XDCTools from my PC and rebuilt our application. So I'm very sure I use the recommended version of SYS/BIOS.

    The result is exactly the same:

    ti.sysbios.knl.Task: line 359: E_stackOverflow: Task 0x834aa048 stack overflow.

    The circumstances are as described before: for some reason (which I don't understand so far) the system tries to switch to the task for which Task_delete() is currently in progress and whose delete hook function is currently running inside fdCloseSession(). The task-switch hook function checks the stack and interprets the already-freed stack memory block as an overflowed stack.

    Some more things I found out so far:

    It looks like the task ready queues got a bit weird. The task I am currently deleting is not listed in the ROV -> Task -> Basic tab, but the ROV -> Task -> ReadyQs tab shows some obviously weird data:

    The task for which Task_delete() was called (handle 0x834aa048) is listed twice for priority 9 and twice for priority 15 (it usually has priority 9). In all lines "next" and "prev" point to the task (0x834aa048) itself, and it is in mode "Blocked". I checked this ROV view under normal operation; in that case the task is properly listed once for priority 9.

    Another interesting piece of information: in the case of a crash, the ROV -> LoggerBuf -> Records -> RTASystemLog always looks like this (the interesting data reproduced in a more readable form):

    ti.sysbios.knl.Semaphore, LM_pend: sem: 0x8127d1c8, count: 1, timeout: -1
    ti.sysbios.knl.Semaphore, LM_post: sem: 0x8127d1c8, count: 0
    ti.sysbios.knl.Semaphore, LM_pend: sem: 0x82443f50, count: 1, timeout: -1
    ti.sysbios.knl.Semaphore, LM_post: sem: 0x82443f50, count: 0
    ti.sysbios.knl.Semaphore, LM_pend: sem: 0x82443f50, count: 1, timeout: -1
    ti.sysbios.knl.Semaphore, LM_post: sem: 0x82443f50, count: 0
    ti.sysbios.knl.Semaphore, LM_pend: sem: 0x82443f50, count: 1, timeout: -1
    ti.sysbios.knl.Semaphore, LM_post: sem: 0x82443f50, count: 0
    ti.sysbios.knl.Semaphore, LM_pend: sem: 0x82443f50, count: 1, timeout: -1
    ti.sysbios.knl.Semaphore, LM_post: sem: 0x82443f50, count: 0
    ti.sysbios.knl.Task,      LM_setPri: tsk: 0x8127d4a0, func: 0x802b8ff4, oldPri: 13, newPri 15
    ti.sysbios.knl.Semaphore, LM_post: sem: 0x834aa0f0, count: 0
    ti.sysbios.knl.Semaphore, LM_pend: sem: 0x8127d1c8, count: 1, timeout: -1
    ti.sysbios.knl.Semaphore, LM_post: sem: 0x8127d1c8, count: 0
    ti.sysbios.knl.Task,      LM_setPri: tsk: 0x8127d4a0, func: 0x802b8ff4, oldPri: 15, newPri 13

    I think the semaphore posts/pends at the beginning (sems 0x8127d1c8, 0x82443f50) come from some Memory_free() calls. Finally the task with handle 0x8127d4a0 (the one that calls Task_delete()) raises its priority from 13 (its normal priority) to 15 (there are no tasks of priority 14 or 15 in our application) and finally switches back to 13. As far as I've understood from other logging, this happens inside fdCloseSession().

    For me it looks like, when the task that called Task_delete() does the priority switch from 15 back to 13, the ready-queue data is broken (for reasons I don't understand), and this finally causes the crash.

    I think we did not yet mention explicitly that this crash happens only occasionally, with a probability of about 1:1000. This means our code (Task_delete() with fdCloseSession() in the delete hook, etc.) works fine some hundred times before it crashes. I haven't yet figured out which circumstances determine whether the crash happens or not.

    Any idea???

    Lars

  • This is extremely helpful.

    Can you confirm that the semaphore posted in this log entry:

        ti.sysbios.knl.Semaphore, LM_post: sem: 0x834aa0f0, count: 0

    is the semaphore that the task being deleted usually pends on?

    Alan

  • Hello Alan,

    Alan DeMars said:

    Can you confirm that the semaphore posted in this log entry:

        ti.sysbios.knl.Semaphore, LM_post: sem: 0x834aa0f0, count: 0

    is the semaphore that the task being deleted usually pends on?

    Unfortunately it's not that easy. The posted semaphore 0x834aa0f0 is the FDTABLE::hSem of the NDK stack, posted by fdint_signaltimeout() called from fdCloseSession(). (BTW: I now understand that the priority switches 13 -> 15 and 15 -> 13 come from llEnter()/llExit() in the NDK's fdCloseSession().)

    The task being deleted should not normally pend on it, and it looks like it in fact doesn't: I set a conditional breakpoint in Semaphore_pend() for sem == 0x834aa0f0; it was never reached, and the application crashed anyway.

    From the context I've seen that the task being deleted is not blocked. It is running before the scheduler switches to the task calling Task_delete(). Here is my analysis of the RTASystemLog:

    LD_block: tsk: 0x8127d4a0, func: 0x802b8ff4
    LM_switch: oldtsk: 0x8127d4a0, oldfunc: 0x802b8ff4, newtsk: 0x834aa048, newfunc: 0x802b8ff4 // switch from ctrl task to task being deleted later
    LM_post: sem: 0x83513858, count: 0 // IRQ handler wakes up ctrl task after "logical timeout"
    LD_ready: tsk: 0x8127d4a0, func: 0x802b8ff4, pri: 13
    LM_switch: oldtsk: 0x834aa048, oldfunc: 0x802b8ff4, newtsk: 0x8127d4a0, newfunc: 0x802b8ff4 // switch to ctrl task (task being deleted later is currently running)
    LM_pend: sem: 0x8127d1c8, count: 1, timeout: -1 // ctrl task calls Mem_free() several times
    LM_post: sem: 0x8127d1c8, count: 0 // ctrl task calls Mem_free() several times
    LM_pend: sem: 0x82443f50, count: 1, timeout: -1 // ctrl task calls some application code protected by semaphore several times
    LM_post: sem: 0x82443f50, count: 0 // ctrl task calls some application code protected by semaphore several times
    ...
    LM_setPri: tsk: 0x8127d4a0, func: 0x802b8ff4, oldPri: 13, newPri 15 // llEnter() in fdCloseSession() called in delete hook on Task_delete()
    LM_post: sem: 0x834aa0f0, count: 0 // Semaphore_pend(FDTABLE::hSem) in fdint_signaltimeout() in fdCloseSession() called in delete hook on Task_delete()
    LM_pend: sem: 0x8127d1c8, count: 1, timeout: -1 // Mem_free() probably called by fdint_freefdt( pfdt ) in fdCloseSession() called in delete hook on Task_delete()
    LM_post: sem: 0x8127d1c8, count: 0 // Mem_free() probably called by fdint_freefdt( pfdt ) in fdCloseSession() called in delete hook on Task_delete()
    LM_setPri: tsk: 0x8127d4a0, func: 0x802b8ff4, oldPri: 15, newPri 13 // llExit() in fdCloseSession() called in delete hook on Task_delete()

    Frankly spoken - I've still no idea what's going on. Can you figure out more from this information?

    Thanks, Lars

  • When the condition occurs, take a snapshot of the Semaphore module's ROV view. I suspect you'll see a task in the 'pendedTasks' column of Semaphore 0x834aa0f0.

    The confused readyQ ROV view you provided previously is symptomatic of incorrect usage of the Task_disable()/Task_restore() APIs. If a blocking API (i.e. Semaphore_pend()) is called within a Task_disable/Task_restore region, the task that called the blocking API will be marked as BLOCKED and removed from its ready Q but will not actually block, because the Task scheduler was disabled at the time. If that same still-running task is then preempted by another task, it will be placed back on its ready Q while still being marked as BLOCKED (as the ROV view shows). At this point, the task is simultaneously in a Semaphore's pendedTasks Q AND in a Task ready Q: a fatal condition. From there, many different failure scenarios can occur.
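
As an aside, the next == prev == self picture in that ReadyQs view is exactly what a circular doubly linked queue produces when the same element is enqueued a second time without first being removed. A self-contained toy sketch (our own minimal queue, not the actual SYS/BIOS Queue module) reproduces it:

```c
#include <stddef.h>

/* Minimal circular doubly linked queue in the style of a ready queue:
 * the queue head is itself an element, and an empty queue points to
 * itself. This is an illustration only, not the SYS/BIOS Queue code. */
typedef struct Elem { struct Elem *next, *prev; } Elem;

static void Queue_init(Elem *q) { q->next = q->prev = q; }

/* Append e at the tail of q. Assumes e is NOT already queued. */
static void Queue_put(Elem *q, Elem *e)
{
    e->next = q;
    e->prev = q->prev;
    q->prev->next = e;
    q->prev = e;
}

int demo(void)
{
    Elem readyQ, task;
    Queue_init(&readyQ);

    /* Task made ready once: links are consistent. */
    Queue_put(&readyQ, &task);
    if (task.next != &readyQ || task.prev != &readyQ) return 1;

    /* Task made ready AGAIN without having been removed (the
     * blocked-but-still-running scenario described above): the element
     * now links to itself, the next == prev == self picture from the
     * ReadyQs ROV view, and the queue is corrupt. */
    Queue_put(&readyQ, &task);
    if (task.next != &task || task.prev != &task) return 2;

    return 0;
}
```

Once an element links to itself, any later traversal or removal walks corrupted pointers, which is why so many different failure modes can follow.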

    You mentioned earlier that you "guard setting or clearing some bits ... by calling Task_disable() ... Task_restore().."

    Can you confirm that you are NOT calling any blocking APIs within any Task_disable() / Task_restore() thread of code?

    Alan

  • Hello Alan,

    Alan DeMars said:
    When the condition occurs, take a snapshot of the Semaphore module's ROV view. I suspect you'll see a task in the 'pendedTasks' column of Semaphore 0x834aa0f0.

    Unfortunately not. I looked for it several times before and checked once again just now. I've never seen the semaphore 0x834aa0f0 (posted just before the crash) in the semaphores list. I think the reason is quite obvious. Some code snippets from the NDK:

    void fdCloseSession( HANDLE hTask )
    {
        FDTABLE *pfdt;
        llEnter();
        pfdt = fdint_getfdt( hTask );
        // If the pointer is NULL, the session may already be closed
        if(!pfdt)
        {
            llExit();
            return;
        }
        // fClosing will prevent the table from being accessed again
        pfdt->fClosing = 1;
        if( hTask != TaskSelf() )
        {
            // Not closing our own session. Signal the session
            // to allow the owner to exit gracefully
            fdint_signaltimeout( pfdt );
        }
        // Now clear the fdt out of the environment pointer
        TaskSetEnv(hTask, 0, 0);
        // If not in use, free the fd table
        if( !--pfdt->RefCount )
            fdint_freefdt( pfdt );
        llExit();
    }

    void fdint_signaltimeout( FDTABLE *pfdt )
    {
        pfdt->fEvented = 0;
        SemPost( pfdt->hSem );
    }

    void fdint_freefdt( FDTABLE *pfdt )
    {
        // Kill type for debug
        pfdt->Type = 0;
        // Free the semaphore
        SemDelete( pfdt->hSem );
        // Free the table
        mmFree( pfdt );
    }

    fdCloseSession() calls fdint_signaltimeout(), which posts the semaphore pfdt->hSem, and after that calls fdint_freefdt(), which deletes the semaphore pfdt->hSem. Finally it calls llExit(), which reduces the priority from 15 to 13 and finally causes the crash.

    As mentioned in my previous post: I set a "conditional breakpoint" in Semaphore_pend() for sem==0x834aa0f0 and this breakpoint has never been reached, no task calls pend() on this semaphore at any time.

    Alan DeMars said:

    You mentioned earlier that you "guard setting or clearing some bits ... by calling Task_disable() ... Task_restore().."

    Can you confirm that you are NOT calling any blocking APIs within any Task_disable() / Task_restore() thread of code?

    The code I meant looks like this:

    static bool bTaskRestoreKeyIsSet;
    static unsigned int taskRestoreKey;

    void globalDisable()
    {
        if (BIOS_getThreadType() == BIOS_ThreadType_Task) {
            taskRestoreKey = Task_disable();
            bTaskRestoreKeyIsSet = true;
        }
        Hwi_disable();
    }

    void globalEnable()
    {
        Hwi_enable();
        if (BIOS_getThreadType() == BIOS_ThreadType_Task && bTaskRestoreKeyIsSet) {
            bTaskRestoreKeyIsSet = false;
            Task_restore(taskRestoreKey);
        }
    }

    static int nBar;
    void foo()
    {
        ....
        globalDisable();
        nBar |= 0x02;
        globalEnable();
        ...
    }

    Actually it is implemented as some nested #defines but I think this more readable representation gets the point.

    Our framework tries to ensure that the task being deleted is in an "uncritical" section of code, i.e. that it holds no system resources and is not waiting for any resources. And the task-priority architecture should ensure that the task being deleted stays in such a state while the framework deletes it.

    Can you give any additional hint from this information?

    Meanwhile I'll try to figure out more about the state of the task being killed (i.e. whether it has blocked or something like that). If I find out something new, I'll post it immediately.

    Thanks, Lars

  • The use case you provide for globalDisable() / globalEnable() does not benefit from the embedded Task_disable() / Task_restore() invocations.

    Can you try removing those calls from the implementations of globalEnable() and globalDisable() and see how it affects behavior?

    Alan

  • Please, could you explain what you mean by:

    Alan DeMars said:

    ... does not benefit from ... invocations.

  • In the example use case of globalDisable() and globalEnable() you provided, the Hwi_disable()/Hwi_enable() calls are sufficient to completely guarantee thread safety.

    Calling Task_disable() and Task_restore() provides no additional thread protection.

    Alan

  • Hello Alan and all other readers,

    I think I could solve the problem. One could argue whether the issue was in my code or in SYS/BIOS...

    Here are the details:

    In our application we reduce the priority of the so-called "worker task" from time to time, if it is too busy, to give other parts of the application a chance to get some CPU time as well. The priority of the "worker task" is restored to its original value by a timer callback.

    This is the pseudo code demonstrating the issue:

    static Task_Handle s_thWorker; // global var holding the handle of the "worker task"

    void timerCallback()
    {
        // WORKER_PRI stands for the worker's normal priority
        if (s_thWorker != NULL) Task_setPri(s_thWorker, WORKER_PRI);
    }

    void ctrlTask()
    {
        while (1) {
            s_thWorker = Task_create(...);
            ... // wait for timeout condition
            Task_delete(&s_thWorker);
        }
    }

    This timer callback occasionally was invoked while Task_delete() was busy. More precisely: it was invoked after the task had been removed from the ready queue but before the task handle s_thWorker had been set to 0. In that case:

    • Task_delete() (Task_Instance_finalize() in particular) had already removed the worker task from the ready queue.
    • After that, Task_setPri() returned the task to the ready queue. Now the handle of the task to be deleted was in the ready queue even though it shouldn't be.
    • Task_create() created and started the new worker task (with the same handle, probably because the same memory blocks were allocated). Now the handle of the worker task existed twice in the ready queues.
    • From then on it was probably a matter of luck how long things still worked. All operations that should have used one and the same task handle probably used one of the two entries by accident. I saw the task handle appear in the ready queues of different priorities. But at some point:
    • The control task called Task_delete() the next time, and it removed the task to be deleted (the worker task) from the ready queue, but of course only once. Thus the handle of the deleted task was still in the ready queues.
    • In the case where the crash actually happened, the remaining orphaned task handle was in the ready queue of the highest priority, so the scheduler tried to switch to this task even though it shouldn't.
    • Now the scheduler accessed outdated data (i.e. the freed stack, as described before). Not to mention that this wasn't a good thing... ;-)

    I don't know why I didn't see such a crash with SYS/BIOS 5. Maybe the Task module itself checked the task handle in TSK_setpri(), maybe the timing was different and I was lucky not to meet these timing conditions...

    Well, and of course one could discuss whether Task_setPri() should check the given task handle and refuse to change the priority of a task that is being deleted...

    I just changed my code so that the timer callback no longer tries to call Task_setPri() after Task_delete() has been started. The test now runs fine for about 3000 task deletes, but I've sometimes seen such a number of successful deletes in previous tests and still got a crash later. I'll report on progress.
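
The interlock described above can be sketched stand-alone. In this hedged example the SYS/BIOS call is replaced by a stub counter so the guard logic can be exercised by itself; the names (s_workerDying, Task_setPri_stub, ctrlTask_deleteWorker) are ours, not from the actual application, and in real code the flag would need to be set and tested atomically with respect to the timer interrupt:

```c
#include <stdbool.h>
#include <stddef.h>

static void *s_thWorker;             /* worker task handle (opaque here)      */
static volatile bool s_workerDying;  /* set before Task_delete() is started   */
static int s_setPriCalls;            /* counts calls that reach Task_setPri() */

/* Stand-in for Task_setPri(s_thWorker, WORKER_PRI). */
static void Task_setPri_stub(void *h) { (void)h; s_setPriCalls++; }

/* Timer callback: restore the worker's priority, but only if the worker
 * exists and is not currently being torn down. */
void timerCallback(void)
{
    if (s_thWorker != NULL && !s_workerDying) {
        Task_setPri_stub(s_thWorker);
    }
}

/* Control-task side of the fix: disarm the callback BEFORE the deletion
 * window opens, forget the handle, then clear the flag. */
void ctrlTask_deleteWorker(void)
{
    s_workerDying = true;
    /* ... Task_delete(&s_thWorker) would run here; a timer tick in this
     * window no longer re-queues the dying task: */
    timerCallback();                 /* simulate a tick mid-deletion */
    s_thWorker = NULL;
    s_workerDying = false;
}

int demo(void)
{
    static int dummy;
    s_thWorker = &dummy;
    s_setPriCalls = 0;

    timerCallback();                 /* normal operation: call goes through */
    if (s_setPriCalls != 1) return 1;

    ctrlTask_deleteWorker();         /* tick during deletion: suppressed */
    if (s_setPriCalls != 1) return 2;
    if (s_thWorker != NULL) return 3;
    return 0;
}
```

The key design point is ordering: the flag goes up before Task_delete() starts, so there is no window in which the timer can see a valid handle for a task whose teardown is already in progress.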

    Thanks again for the valuable hints that finally made me check the right things in our code.

    Lars

  • Lars,

    Great detective work! Thank you for providing the detailed explanation of the failure scenario.

    Carefully orchestrating the deletion of a task so that all the involved threads (i.e. the timer callback) are safely disarmed prior to Task_delete() being called is arguably the job of the application.

    Having said that, we will study the details of this particular scenario and see if a carefully placed Assert might help reveal/avoid this kind of pitfall in the future.

    At a minimum, we will enhance the Task_delete() API documentation to emphasize the need for care when calling it.

    I filed an internal bug report referencing this forum thread for follow up.

    Alan

  • Alan DeMars said:

    Carefully orchestrating the deletion of a task so that all the involved threads (i.e. the timer callback) are safely disarmed prior to Task_delete() being called is arguably the job of the application.

    I wonder why it works with bios_5_41_07_24; that implementation seems more robust than the newer versions. Secondly, IMO the operating system itself should refuse to execute Task functions on task objects scheduled for deletion; that's like a null-pointer check.

    Anyway, thank you very much for your patience and help. I've closed this thread.