Assertion failed: Task scheduler disabled

Other Parts Discussed in Thread: SYSBIOS

Hello,


I am using SysBios 6.37.02.27 on a dual Cortex-M4 in SMP mode (Jacinto6, i.e. DRA7xx). Occasionally, we see the following error message:

cat /d/remoteproc/remoteproc0/trace0
[t=0x00000001:3376844b] ti.sysbios.knl.Task: ERROR: line 1817: assertion failure: A_sleepTaskDisabled: Cannot call Task_sleep() while the Task scheduler is disabled.
ti.sysbios.knl.Task: line 1817: assertion failure: A_sleepTaskDisabled: Cannot call Task_sleep() while the Task scheduler is disabled.
xdc.runtime.Error.raise: terminating execution


The error varies slightly; sometimes it complains about other blocking functions like Semaphore_pend(). However, we use neither Task_disable() nor Swi_disable() in our code (both of which disable the task scheduler). The only synchronization primitives that we use are Hwi_disable() and semaphores.

This is a list of the API functions that we call within a Hwi_disable() section:

malloc
free
memset
vsnprintf

System_printf
Event_create
Event_post
Event_getPostedEvents
Event_pend with last parameter BIOS_NO_WAIT -> OK according to the documentation!
Event_delete
Error_check

Is there anything suspicious?
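
For reference, the call pattern looks roughly like this (a minimal sketch; BUF_SIZE, hEvent and the variable names are placeholders, not our real code):

    UInt key = Hwi_disable();         /* enter critical section, interrupts masked */
    void *buf = malloc(BUF_SIZE);     /* heap allocation with interrupts disabled */
    memset(buf, 0, BUF_SIZE);
    Event_post(hEvent, Event_Id_00);  /* wake up a waiting task */
    free(buf);
    Hwi_restore(key);                 /* leave critical section */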

I already searched the forum, but did not find a satisfying answer. The example posted here () is trivial, as it is documented behavior.

Thus, I have the following questions:

- Which SysBios API functions actually disable the Task scheduler? So far I know of Task_disable() and Swi_disable(). Are there any other functions that might do this (maybe as a side effect)?

- What is the meaning of "disabling the Task scheduler" when running in SMP mode? Does this only disable scheduling on the core that called this function, or does it disable scheduling on all cores?
- If the answer is "on all cores", how is a task running on the other core supposed to know that somebody else has disabled the scheduler? I think in that case SysBios would be fundamentally broken, as you could NEVER safely call any blocking API: you could never be sure that, at just that moment, some other code running on the other core had disabled the scheduler.
(This is why I assume it only disables scheduling on the core that called Task_disable() or Swi_disable().)
- If the answer is "only on the core that called it", then if we get a stack trace (or set a breakpoint in the Error function), there must be a call to some task-disabling function in the call chain of this particular task, correct? As far as I understand, it cannot be in any other task. Could you please confirm whether this understanding is correct?

- Is there any SysBios configuration that needs to be changed when enabling SMP mode? Maybe the Task_disable() call resides in one of the SysBios modules (e.g. the Logging module) because it is configured to use a Gate type that is not suitable for SMP? Would this be detected somehow?
Background: I spotted this here:

"Use these SMP-aware clone modules in place of their xdc.runtime equivalents:
• SysMin, SysStd, LoggerBuf (in ti.sysbios.smp package)"

Found in http://processors.wiki.ti.com/images/1/14/Public_SmpBiosSlides.pdf

Is this mandatory or just optional?

Our SysBios configuration can be found here: http://pastebin.com/CsWsrBga

Thanks.

Best Regards,

Matthias

  • Matthias,

    I will have to research several of your questions and get back later...

    I cannot access the link you sent to the .cfg file (that site is blocked from access from within TI).  Can you please attach the .cfg file directly to this forum thread so I can see it?

    Thanks,
    Scott

  • Hello Scott,

    Sorry, I did not know that I could upload files here. I hope this works:

    1682.Ipu1.cfg

    Thanks for your help!

    Best Regards,

    Matthias

  • Sorry, I have to correct myself: We're using SysBios 6.41.03.51.

    Best Regards,

    Matthias
  • Hi Matthias,

    OK, thanks for clarifying.

    I’m trying to understand the problem more.  In your original post you give a list of API functions called within a Hwi_disable() section.  Why did you indicate that list?  

    Within that list, both Event_pend() and Event_post() will call Task_disable().   Semaphore_pend() and Semaphore_post() will too.
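    Roughly, the internal pattern in those APIs is a short matched disable/restore pair around the kernel queue manipulation (a paraphrase, not the actual kernel source):

        UInt tskKey = Task_disable();   /* lock the scheduler while kernel state is updated */
        /* ... manipulate the event/semaphore wait queues ... */
        Task_restore(tskKey);           /* unlock; may trigger a reschedule */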

    Are you explicitly invoking Task_sleep() somewhere in your code?  Or is the call that is asserting a mystery?  If it is a mystery, can you set a breakpoint in that assert code in Task_sleep() and then look at the call stack?

    I know I’m not answering a lot of your questions yet, but I’m still trying to understand your application structure and flow, and the specific problem.  I will have to talk to one of the SMP experts tomorrow to confirm answers for your more general questions.

    Best regards,
    Scott

  • Hi Scott,

    > I’m trying to understand the problem more.  In your original post you give a list of API functions called within a Hwi_disable() section.  Why did you indicate that list? 

    I would like to rule out the possibility that we're calling something that we are not supposed to call in a particular context, like calling a blocking function after the task scheduler has been disabled. However, as we're using neither Task_disable() nor Swi_disable(), I turned my attention towards Hwi_disable(). That's why I looked through our code to find all functions that are called after Hwi_disable() and before Hwi_enable() or Hwi_restore().

    For example, I think that calling Hwi_disable() and then Task_sleep() would be a bad idea (we don't do that!), as Hwi_disable() acquires the inter-core spinlock within SysBios. Thus, any action on the other core that tries to acquire the spinlock would spin there for a very long time, until the sleeping task wakes up again and calls Hwi_enable() or Hwi_restore().
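
    Sketched out, the anti-pattern I mean would look like this (illustrative only; as said, we don't do this):

        UInt key = Hwi_disable();   /* in SMP mode this also takes the inter-core spinlock */
        Task_sleep(100);            /* BAD: this task blocks while holding the spinlock... */
        Hwi_restore(key);           /* ...so the other core spins until we get here        */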

    I don't know if this is the right direction to look in. I just thought it might help us here. If you don't see any problem with calling any of the functions from the list while Hwi_disable() is in effect, then we can forget about this.

    > Are you explicitly invoking Task_sleep() somewhere in your code?

    Yes, we do. However, as far as I can see, we never do this while holding any semaphore or within any Hwi_disable() section.

    I hope this helps. Thanks,

    Matthias

  • Hello,

    we managed to reproduce the issue twice, with stack traces. Here are the screenshots:

    The stack traces are unfortunately not complete, but further investigation shows that the location of the call to ipcMgr_ipcStartup() and the location of the call to operator new are different. What is common is that both use GateMutex, and the call to Semaphore_pend() fails internally.

    I hope this helps. Best Regards,

    Matthias

  • Hi Matthias,

    Thanks for the updates.  As you were sending this I was meeting with a coworker to confirm my responses to your previous questions…

    From your first post you’d sent the list of functions you are calling from within a Hwi_disable() section.  Calls to malloc, free, Event_create, and Event_delete can all cause problems.  These functions use the default heap (which is HeapMem), and HeapMem uses GateMutex by default.  GateMutex calls Semaphore_pend() (with a wait forever), and this results in a call to Task_disable().  So a call to Task_sleep() on the other core will result in the assertion you first reported.  This could potentially lead to a deadlock or app failure.
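
    To illustrate the failing combination, here is a hypothetical two-core scenario (not your actual code):

        /* Core 0 */
        UInt key = Hwi_disable();
        Event_Handle ev = Event_create(NULL, NULL);  /* allocates from HeapMem, so internally:
                                                        GateMutex_enter() ->
                                                        Semaphore_pend(..., BIOS_WAIT_FOREVER) ->
                                                        Task_disable() */
        Hwi_restore(key);

        /* Core 1, running concurrently */
        Task_sleep(10);   /* raises A_sleepTaskDisabled if it lands in the window
                             where core 0 still has the scheduler disabled */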

    I don’t know if this also explains what your new stack traces show.  But maybe it does, if core 0 was calling one of these as part of the Hwi function(?)  But there isn’t sufficient trace to clarify that.

    Some of your other questions…

    Matthias Rosenfelder said:

    What is the meaning of "disabling the Task scheduler" when running in SMP mode? Does this only disable scheduling on the core that called this function, or does it disable scheduling on all cores?

    Yes, calling Task_disable() on one core will also lock Task rescheduling on the other core.
     

    Matthias Rosenfelder said:

    I think in that case SysBios would be fundamentally broken, as you could NEVER safely call any blocking API.

    This doesn’t mean that you cannot call a blocking API on the other core.  But if you do, the API will stay blocked until the other core releases the disable.  The kernel APIs were carefully written to allow a mix of calls on both cores.  But Task_sleep() has a special check in it to make sure that Task scheduling is enabled, so that the included Task_restore() call will indeed block.  This API is a special case: calling it in an SMP environment needs to be done carefully, and only when Task scheduling is indeed enabled.
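
    In other words, only call Task_sleep() when you know scheduling is enabled, for example:

        UInt tskKey = Task_disable();
        /* Task_sleep(10); -- calling it here would trigger A_sleepTaskDisabled */
        Task_restore(tskKey);

        Task_sleep(10);   /* fine here: the scheduler is enabled again */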

    Matthias Rosenfelder said:

    "Use these SMP-aware clone modules in place of their xdc.runtime equivalents:
    • SysMin, SysStd, LoggerBuf (in ti.sysbios.smp package)"

    Is this mandatory or just optional?

    Yes, you really should use these three SMP-aware modules.  If you don't, there will still be output, but the output from both cores will be jumbled together.

    Can you try removing the Hwi_disable() around the four noted APIs and see if you still see failures?  You should also remove any calls to other functions that would be allocating from the heap while Hwis are disabled.

    Thanks,
    Scott

  • Hi Scott,

    > Calls to malloc, free, Event_create, and Event_delete can all cause problems.

    Ok, I changed our code to either move these calls out of the critical sections formed with Hwi_disable/enable/restore, or use a different synchronization mechanism.
    This change alone did not have any effect; we're still seeing crashes. (However, read on...)


    > This doesn’t mean that you cannot call a blocking API on the other core.  But if you do, the API will stay blocked until the other core releases the disable.

    As far as I understand the code in Task_disable(), it calls Hwi_disable(), which acquires the inter-core spinlock and disables IRQs. It then calls Core_hwiRestore(), which ONLY re-enables IRQs, so it returns from Task_disable() with the spinlock still held. And this is what you mean by "the API will stay blocked": any such API first calls Hwi_disable(), which tries to acquire the inter-core spinlock. So another core trying to call a blocking API will keep spinning until the first core releases the spinlock again via a call to Task_enable() (which calls Hwi_enable() and thus releases the spinlock) and also resets the disabled flag in Task_enable().
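
    In pseudocode, my reading of the SMP Task_disable() path is the following (paraphrased; the field name is made up, this is not the actual SysBios source):

        UInt Task_disable(Void)
        {
            UInt hwiKey  = Hwi_disable();          /* takes inter-core spinlock + masks IRQs */
            UInt taskKey = Task_module->locked;    /* remember previous scheduler state */
            Task_module->locked = TRUE;            /* mark the scheduler as disabled */
            Core_hwiRestore(hwiKey);               /* re-enables IRQs ONLY; spinlock stays held */
            return taskKey;
        }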

    Thus, in summary, this code relies on the correctness of the spinlock implementation. However, there is a fix in SysBios 6_45_01_29 that is described as

    "Cortex-M3/M4 GateSmp_enter() function has a bug"


    Doing a diff on packages/ti/sysbios/family/arm/ducati/GateSmp.c reveals that in the function GateSmp_enter() the two lines

                gateBytePtr = (volatile UInt8 *)&gate->gateWord;
                gateBytePtr = &gateBytePtr[coreId];

    have been added. We don't have these two lines in our 6.42.02.29 version, so I tried adding them manually.

    It turns out that the crashes vanished. We're still testing here, but this might have been it. Thanks, everybody.

    Best Regards,


    Matthias

    btw: One last question: what were the reasons to implement the spinlock the way you did, rather than using ARMv7 atomic instructions like LDREX and STREX? Your implementation relies on the fact that any store is immediately visible to every other core, i.e. you are assuming a sequentially consistent memory model (which ARMv7 does not have, but which the Cortex-M3/M4 seems to implement in its microarchitecture). This does not work on any microarchitecture with a store buffer, i.e. where each core sees ITS OWN store instructions immediately (through store-to-load forwarding), but any other core might see them only later (i.e. there is no global ordering of store instructions in the system).

    This seems risky and I doubt that it works e.g. on a Cortex A15.
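
    The classic illustration of the store-buffer hazard I mean is a Dekker-style handshake; without a barrier between the store and the load, both cores can end up in the critical section at the same time:

        volatile int flag[2] = { 0, 0 };

        void enter(int me)              /* me is 0 or 1 */
        {
            int other = 1 - me;
            flag[me] = 1;               /* may sit in this core's store buffer... */
            while (flag[other]) { }     /* ...while this load still reads a stale 0 */
            /* critical section: on a weakly ordered machine both cores can get
               here unless a DMB separates the store from the load */
        }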

    And what about performance of this implementation in comparison to LDREX/STREX?

    infocenter.arm.com/.../index.jsp

  • Hi Matthias,

    Matthias Rosenfelder said:

    btw: One last question: what were the reasons to implement the spinlock the way you did, rather than using ARMv7 atomic instructions like LDREX and STREX? Your implementation relies on the fact that any store is immediately visible to every other core, i.e. you are assuming a sequentially consistent memory model (which ARMv7 does not have, but which the Cortex-M3/M4 seems to implement in its microarchitecture). This does not work on any microarchitecture with a store buffer, i.e. where each core sees ITS OWN store instructions immediately (through store-to-load forwarding), but any other core might see them only later (i.e. there is no global ordering of store instructions in the system).

    This seems risky and I doubt that it works e.g. on a Cortex A15.

    And what about performance of this implementation in comparison to LDREX/STREX?

    The dual Cortex-M3/M4 sub-system (IPUs) on multi-core parts like Jacinto 6 does not include a global monitor. Therefore exclusive access instructions like LDREX/STREX cannot be used. This is the reason we implemented the GateSmp lock using a SW locking mechanism.

    On the Cortex-A15s, exclusive access instructions can be used and therefore our GateSmp implementation for Cortex-A15 uses LDREX/STREX instructions.
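
    For reference, an LDREX/STREX-based lock acquire looks roughly like this (an illustrative sketch, not the actual GateSmp source for Cortex-A15):

        static inline void spin_lock(volatile unsigned int *lock)
        {
            unsigned int tmp;
            __asm__ volatile(
                "1:  ldrex   %0, [%1]      \n"   /* load-exclusive the lock word        */
                "    cmp     %0, #0        \n"   /* already held?                       */
                "    bne     1b            \n"   /* spin until it looks free            */
                "    strex   %0, %2, [%1]  \n"   /* try to claim it; %0 = 0 on success  */
                "    cmp     %0, #0        \n"
                "    bne     1b            \n"   /* lost the exclusive access, retry    */
                "    dmb                   \n"   /* barrier before entering the region  */
                : "=&r" (tmp)
                : "r" (lock), "r" (1U)
                : "cc", "memory");
        }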

    Best,

    Ashish