scheduling while atomic fault in Syslink, OMAPL138

Taran Tripathi

Hi,

We are using OMAP L138 and ipc_1_25_03_15, bios_6_35_04_50, xdctools_3_25_03_72, syslink_2_21_02_10.

We have our network driver on the ARM side that talks to SysLink and presents the DSP as a node in the network from the ARM side. The driver also strips the network headers when sending over SysLink and adds them back when feeding packets up the network stack.

In this setup the ARM and DSP talk over simplex and duplex channels. For each channel the ARM creates the heap. No separate gate is created for each heap. Each core creates its own Rx MessageQ and registers with the heap. Each core opens the remote Tx MessageQ. The notify driver and transport mechanism used are:

SYSLINK_NOTIFYDRIVER=NOTIFYDRIVERSHM
SYSLINK_TRANSPORT=TRANSPORTSHM

With this setup we get a scheduling while atomic fault.

We referred to the post: http://e2e.ti.com/support/embedded/tirtos/f/355/t/250517.aspx?pi199400

and changed the NOTIFY driver to use NotifyDriverCirc and Transport to use TransportShmNotify but that also did not make any difference.

The dump is attached:

0268.Dump.txt

We have our arm_dsp_driver that provides a read/write abstraction over SysLink. The my_network_driv provides the network interface and talks to the arm_dsp_driver.

We are able to create TCP and UDP sockets and send and receive over them until this fault occurs.

How can we solve this problem and move ahead?

over 10 years ago

0 Ramsey over 10 years ago

TI__Genius 12025 points

Taran,

I just came across this post. I'm not a Linux expert, so I have limited understanding of the stack dump. But from a high level, it seems that the kernel has taken an interrupt and has called an ISR. The ISR calls into the TCP stack, which calls into your arm_dsp driver, which calls into SysLink MessageQ_alloc. Ultimately, SysLink tries to take a mutex that seems to be unavailable, which invokes the scheduler. However, it is illegal to call schedule from an ISR.

It seems that the ISR is doing too much work. Typically, an ISR should wake a kernel thread to do work which might block (i.e. take a mutex).

Let me know if you have any new development on this issue.

~Ramsey

0 Taran Tripathi over 10 years ago in reply to Ramsey

Prodigy 190 points

Hi Ramsey,

The kernel is preemptible.

SysLink is a kernel module.

When I follow the stack trace, it leads me to OsalMutex_enter, where it calls mutex_lock_interruptible. This makes me believe that if the kernel is preemptible, then taking the mutex can sleep if it is not available.

If this is not allowed in the kernel, then OsalMutex_enter should not be using mutex_lock_interruptible.

When I look at the function OsalMutex_enter() in syslink_2_21_02_10\packages\ti\syslink\utils\hlos\knl\osal\Linux\OsalMutex.c, I see:

- Gate_enterSystem()

- Check whether the mutex is owned; if not, take ownership (set the owner).

- If it is owned, and not by the current PID, then set blocked=true and call Gate_leaveSystem().

- If blocked, go back to Gate_enterSystem().

This loop exits when blocked is set to FALSE.

If not blocked and the mutex owner is set to current PID then why not just mutex_lock?

Or am I looking at the wrong OsalMutex.c file?

Regards,

Taran Tripathi

0 Ramsey over 10 years ago in reply to Taran Tripathi

TI__Genius 12025 points

Taran,

Taran Tripathi said:
The kernel is preemptible.

From your attached kernel dump file, I see that you are running Linux 2.6.37. In the 2.6 kernel, the Linux kernel became preemptive. I assume this is what you mean? However, this does not mean you can call schedule from anywhere. Here is a quote from Robert Love's book, "Linux Kernel Development", Chapter 4, Section Kernel Preemption:

In the 2.6 kernel, however, the Linux kernel became preemptive: It is now possible to preempt a task at any point, so long as the kernel is in a state in which it is safe to reschedule.

The kernel can be executing in "process context" or "interrupt context". When in process context, it is executing on behalf of a user thread. For example, when executing a system function like read. But, when an interrupt is taken by the CPU, the kernel switches to interrupt context. It has preempted the task running in process context, and it has invoked an interrupt service routine (ISR). At this point, the kernel is running in interrupt context.

When the kernel is running in interrupt context, it cannot sleep. Again, from "Linux Kernel Development", Chapter 6, Section Interrupt Context:

Interrupt context, on the other hand, is not associated with a process. The current macro is not relevant (although it points to the interrupted process). Without a backing process, interrupt context cannot sleep--how would it ever reschedule? Therefore, you cannot call certain functions from interrupt context. If a function sleeps, you cannot use it from your interrupt handler--this limits the functions that one can call from an interrupt handler.

The error reported in your dump file (BUG: scheduling while atomic) tells me that the kernel is running in interrupt context and someone called schedule. Looking at the call traceback, you can see this.
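As a rough illustration of why the fault only shows up some of the time, here is a small user-space model of the check. It is not kernel code: `atomic_depth`, `model_schedule`, `model_mutex_lock`, and the other names are hypothetical stand-ins, and the assumption is simply that a sleeping lock only reaches the scheduler when the lock is contended.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model only. All names here are hypothetical stand-ins
 * for kernel concepts; none of them are real kernel or SysLink APIs. */
static int atomic_depth;   /* > 0 while "in interrupt (atomic) context" */
static int bug_reports;    /* number of "scheduling while atomic" reports */

static void model_irq_enter(void) { atomic_depth++; }
static void model_irq_exit(void)  { atomic_depth--; }

/* Stand-in for the scheduler entry point: it records a report when it
 * is reached while the model is in atomic context. */
static void model_schedule(void)
{
    if (atomic_depth > 0)
        bug_reports++;     /* the kernel would print the BUG message here */
}

/* Stand-in for a sleeping lock: it only reaches the scheduler when the
 * lock is already held by someone else. */
static void model_mutex_lock(bool contended)
{
    if (contended)
        model_schedule();
}

/* Exercises both contexts and returns how many reports were raised. */
int count_atomic_violations(void)
{
    atomic_depth = 0;
    bug_reports = 0;

    model_mutex_lock(true);    /* process context, contended: allowed */

    model_irq_enter();
    model_mutex_lock(false);   /* interrupt context, uncontended: no sleep */
    model_mutex_lock(true);    /* interrupt context, contended: report */
    model_irq_exit();

    return bug_reports;
}
```

In this model only the contended lock taken in interrupt context raises a report, which would be consistent with traffic flowing normally until the lock happens to be busy.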

It looks to me like arm_dsp_get_buffer called MessageQ_alloc. Since the call to MessageQ_alloc might ultimately block, it is not safe to call this function from an ISR. You need to schedule a kernel task to do this work. Chapter 7 of the book explains how to write "a bottom-half" of an ISR for doing deferred work which is allowed to sleep.

Taran Tripathi said:
If not blocked and the mutex owner is set to current PID then why not just mutex_lock?

Your description of OsalMutex_enter looks correct. Regarding your question, are you asking about mutex_lock vs. mutex_lock_interruptible near the end of the function? Well, I'm not sure why it was written this way. Both these calls might sleep if the lock is unavailable. But the preceding code would guarantee nobody else has the mutex, so the calls should never sleep. The call to mutex_lock_interruptible means that if the thread is sleeping (waiting for the lock to become available) and the process receives a signal, the thread will wake so it can handle the signal.

~Ramsey

0 Taran Tripathi over 10 years ago in reply to Ramsey

Prodigy 190 points

Hi Ramsey,

Thanks.

We changed our design to defer the calls to SysLink: the soft IRQ now only reads from the network stack, and the send over SysLink is deferred until later. This resolved the scheduling while atomic problem.
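For readers who want a picture of this kind of split, below is a minimal user-space sketch under stated assumptions. It is not our driver code: `rx_enqueue`, `deferred_drain`, `model_syslink_send`, and `RING_SLOTS` are hypothetical names. The soft IRQ side only places packets in a fixed-size ring and never blocks; a separate drain step, which in a real driver would run from a workqueue or kernel thread in process context, performs the call that may sleep.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 8           /* hypothetical queue depth */

struct pkt { int id; };        /* hypothetical packet descriptor */

static struct pkt ring[RING_SLOTS];
static size_t head, tail;      /* head: next write slot, tail: next read slot */
static int sent_count;         /* packets handed to the blocking path */

/* Soft IRQ side: never blocks. Returns -1 (packet dropped) when the
 * ring is full; one slot stays empty to tell full from empty. */
int rx_enqueue(struct pkt p)
{
    size_t next = (head + 1) % RING_SLOTS;
    if (next == tail)
        return -1;
    ring[head] = p;
    head = next;
    return 0;
}

/* Stand-in for the path that may sleep (in the real driver this is
 * where MessageQ_alloc would be called). */
static void model_syslink_send(struct pkt p)
{
    (void)p;
    sent_count++;
}

/* Process-context side: allowed to sleep. Drains everything queued so
 * far and returns the number of packets sent. */
int deferred_drain(void)
{
    int n = 0;
    while (tail != head) {
        model_syslink_send(ring[tail]);
        tail = (tail + 1) % RING_SLOTS;
        n++;
    }
    return n;
}
```

This sketch is single-threaded; when the producer and consumer really run in different contexts, the ring indices would need a spinlock or suitable memory barriers.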

Is it possible to document in the SysLink User Guide that SysLink uses mutexes and that a mutex lock can cause the caller to sleep, so SysLink should not be used in IRQ context?

Regards,
Taran Tripathi
