NDK locks up waiting on semaphore

I am pulling my hair out on this one. There are a couple of unexpected things.

The first is that when sitting at main() and ready to start debugging, NDK StackThread has priority 5 and is READY. The function is ti_ndk_config_Global_stackThread. It is not weird but the idle thread is sitting READY as well and these are the only tasks in the system.

I have several tasks that get created with priority 3. I set the UDP task to priority 7. The NDK is configured with priorities of 10, 8, 6, and 4 in the config file.

I hit F8 to run the system and all of my code runs. The UDP task is stopped at a semaphore (probably in a call to select()). The NDK Stack Thread now has a priority of 8 and is blocked at a semaphore. My task that communicates with the UDP task is stopped at its input Mailbox_pend(). I have sent a message to the UDP task from its computer, but it appears that the NDK did not receive the packet.

Another weirdness is that the ROV data in the Detailed tab is all red when I pause the system.

If I comment out the call to create my communication task that pends on the mailbox, the system runs more or less correctly (except the part where it should be getting a message from the UDP task). The communication task pends on a mailbox from my NDK task communicating via UDP and then sends a message back through another mailbox to the UDP task.

Anyone have any ideas why adding a task that pends on a mailbox would interfere with network communications?

I must admit that I have not followed the NDK rules just yet: I used the SYS/BIOS Task_create() call rather than the NDK TaskCreate() call. It didn't seem like it would be necessary, since the system runs until I add a task using mailboxes. Switching is the next step, but I don't hold out much hope.

  • OK, so, using TaskCreate() instead of Task_create() really is important. They aren't kidding!!

    I created my task that pends on a mailbox with priority 7 and everything works. I suppose our other two NDK projects got really lucky. Of course, those projects do not use mailboxes.
  • Glad you found the answer to your own question.

    Please mark the thread answered so it can be closed!

  • I spent most of yesterday with the problem alternately appearing and disappearing. I posted about the NDK Task API after making just that one change and having it start working. I am now convinced that did not solve the issue, based on looking at the code in ndk/os/task.c.

    If you look at task.c, you will find that there is zero difference between calling TaskCreate() and calling Task_create() after setting up your parameters. I am unclear why the NDK user manual would suggest not to use the SYS/BIOS call. I got the implication from the manual that the NDK API provided additional parameter checking and functionality to ensure that the priorities of tasks using the NDK stack would work correctly.

    As a side note, I suspect that the intent for the NDK API for TaskSleep() was that the parameter really would be milliseconds. The shim layer does not, however, translate from OS timer ticks to milliseconds. The manual and code don't match.
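To make the TaskSleep() units mismatch concrete, here is a minimal sketch in C of the conversion a caller would need if the argument is treated as raw Clock ticks but the intent is milliseconds. The helper name ms_to_ticks and its parameters are hypothetical and not part of the NDK or SYS/BIOS; the tick period is assumed to be whatever the Clock module is configured for.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not an NDK or SYS/BIOS API): convert a delay in
 * milliseconds to Clock ticks for a given tick period in microseconds.
 * Rounds up so the resulting delay is never shorter than requested. */
uint32_t ms_to_ticks(uint32_t delay_ms, uint32_t tick_period_us)
{
    assert(tick_period_us > 0u);

    /* Widen to 64 bits so large delays do not overflow during scaling. */
    uint64_t delay_us = (uint64_t)delay_ms * 1000u;

    return (uint32_t)((delay_us + tick_period_us - 1u) / tick_period_us);
}
```

With a 1 ms tick the two units happen to coincide, which may be why the mismatch goes unnoticed on a default configuration; with a 100 us tick, a 10 ms delay corresponds to 100 ticks.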

    The stack seems to be particularly sensitive to task priorities. I spread out the four NDK priorities to be 2, 6, 10, and 13. I had been running some tasks at priority 3 since they were not especially urgent. Those were also the ones using mailboxes for messaging. I ended up setting my task doing UDP to 7 and the mailbox tasks to 6, 8, and 9. Moving them from 3 to above the middle of the NDK priorities seems to have gotten things going.


    The part that is hard to get past is that these tasks do not interface with the NDK, so their relative priorities and SYS/BIOS resources should not interfere with the NDK running. It is also interesting that when the stack is blocked, it has priority 8 rather than 6 or 10. Changing the XDC settings for the four levels does not alter the fact that the NDK task gets set to 8. The initial NDK priority has stayed at 5 in the config file.

    It may be that the process of switching from priority protection to semaphore protection is not happening correctly. I have useSemLibs set to false. I have it working for now, so I am not going to risk losing more time. If this problem comes back, I will be turning on the semaphore version of the libraries to give that a shot!

    If I ever get a chance, I'll try to figure out why the NDK task gets blocked at the semaphore and give instructions how to avoid the issue. I suspect that there is some issue with using mailboxes (perhaps they use semaphores under the covers?). Thankfully, TI gives us all source code for the OS and for the stack. It is so useful to be able to trace issues into the code to see how things are behaving badly.

  • First, let me say massive amounts of thanks to TI for shipping all the code to the RTOS and NDK with the products. It *really* helps with tracking this stuff down.

    So, I *think* I finally found the issue, but maybe not. It has come and gone so often after seemingly unrelated changes.

    When we started this new project, we (I) changed the timer tick from 50 us to 100 us. That should be a pretty innocuous change since nothing in the system is especially time critical. Plus, a tick every 50 us would chew up a lot of CPU if it really wasn't doing anything. Apparently, the NDK stack needs a really short timer period in order to process events, which appear to be timer driven rather than Ethernet NIC interrupt driven. It looks like the semaphore where the stack gets hung is the STKEVENT semaphore, and the Semaphore_post() calls on it are not happening in the correct order. It seems strange that a tick period that is only moderately different could cause what appears to be a deadlock.

    One important piece of information is that setting the stack to semaphore operation or task priority for llEnter() does not change the symptoms.

    Can someone verify that the timer tick value is extremely critical?

    Thanks.
  • Which timer tick are you referring to? How did you change the value?

    Which version of SYS/BIOS and NDK are you using and on which device?
  • Todd:


    This has been a truly strange problem. I am using tirtos_tivac_2_00_01_23, bios_6_40_01_15, ndk_2_23_01_01, TivaWare_C_Series-2.1.0.12573c, and xdctools_3_30_01_25_core.

    There is a timer tick that is set up in TI_RTOS->Products->SYSBIOS->Scheduling->Clock-Module Settings. We had been running with a 50 us period; I lengthened it to 100 us and was really looking towards lengthening it further to 500 us.
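For a rough sense of what these tick periods cost, the arithmetic can be sketched in C as below. This is only an illustration: the function names are hypothetical, and the cycles-per-tick figure is a placeholder assumption to be replaced with a measured value, not something taken from SYS/BIOS.

```c
#include <stdint.h>

/* Number of Clock tick interrupts per second for a tick period in microseconds.
 * Returns 0 for an invalid (zero) period. */
uint32_t ticks_per_second(uint32_t tick_period_us)
{
    if (tick_period_us == 0u) {
        return 0u;
    }
    return 1000000u / tick_period_us;
}

/* Approximate CPU load of the tick, in parts per thousand, given an ASSUMED
 * cost per tick in CPU cycles and the CPU clock in Hz. */
uint32_t tick_load_permille(uint32_t tick_period_us,
                            uint32_t cycles_per_tick,
                            uint32_t cpu_hz)
{
    if (cpu_hz == 0u) {
        return 0u;
    }
    uint64_t cycles_per_sec =
        (uint64_t)ticks_per_second(tick_period_us) * cycles_per_tick;
    return (uint32_t)(cycles_per_sec * 1000u / cpu_hz);
}
```

For example, assuming 300 cycles per tick on a 120 MHz core, a 50 us period works out to 20000 ticks per second and roughly 5% of the CPU, while a 1 ms period is 1000 ticks per second and well under 1%.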

    When I was debugging the semaphore lock up for NDK, I found that the STKEVENT system uses the timer tick to schedule the stack operation in some fashion. I quickly found that the timer value does not seem to be the direct problem, but it could still have an impact based on what shows in the code.

    I think I have narrowed down how to get the problem to appear and then disappear. It seems to be position dependent (which is truly scary), since adding text to System_printf() calls seems to make it appear and also disappear. Adding other code occasionally has caused it to appear as well. The quick fix is to simply add one or more System_printf() calls and the system starts running again. The fact that it is position dependent seems to point to an uninitialized or otherwise defective pointer, or some other sort of memory issue. It is really bizarre since the issue never makes it crash and burn, but simply makes the stack hang at the STKEVENT semaphore.

    I tried to move the .text and .const sections around in the linker .cmd file, but something forces .text to always be the first section loaded in the executable part of the image. It is unclear where that is happening. Perhaps we have no control over that.

    I can make progress now that I know I can add or delete debug calls, but the lock up is troubling. I haven't had a chance to track down the sequence or otherwise look at the stack to see why the events are not being handled properly. Once I get this software delivered, I can get back to looking at where in the stack things go wonky. Fortunately, this is internal code rather than customer code, so it just has to work in the "happy case".

    As I re-read before sending I noticed that I mentioned System_printf() as a big correlation. Is there a possibility that there is an interaction between the System subsystem and the NDK? So far I have only been running in the debugger with semihosting turned on. Obviously, running a live system will not have semihosting running or at least not in the same way.

  • The default Clock period is 1 ms. Do you really need such fine granularity for this timer? The NDK does not need it. It works fine with 1 ms.

    Which System provider are you using? Look in the .cfg...are you using SysMin, SysStd, SysCallback, etc?

    Note: have you read this page? processors.wiki.ti.com/.../TI-RTOS_TM4C129_Emac_Issues
    If you are getting some communication in or out of the device, it is not related. Since you are using 2.00.01.23, I thought I'd better mention it.
  • Todd:
    I have been up to my eyeballs trying to get this project out the door, so that is why no reply.

    Thanks for the words of wisdom on the tick value. 1 ms is a value that makes sense from a system loading perspective.

    We ran into the EMACSnow interrupt issue over a year ago and have the appropriate code changes in place. I think we are on production silicon now, so that is likely no longer an issue anyway.

    I have been able to make fair progress until just minutes ago by simply adding or deleting System_printf() calls every time it locks up. Until just recently, the symptom was that the system would simply hang waiting at the Event semaphore in NDK.
    Just minutes ago it started halting at a non-existent breakpoint. The latest place is line 136 of lltimer.c. It stopped at a mystery breakpoint somewhere else a while ago so I added more System_printf() calls which changed it to the lltimer.c place. The next step is to reduce the number of System_printf() calls to get back to a working system.

    I sure hope that this is not something weird in my code, but it could be. 90% of my code is library calls to do things like setting GPIO pins or communicating with the various peripherals using the Tiva driver library. As I said in an earlier post, this seems to have started occurring when I started using the Mailbox system, but that is an educated guess.

    This is a very real problem, but almost impossible for y'all to troubleshoot unless I can get you an exact copy of my development environment and the actual hardware. I presume y'all are either in the Dallas area or Houston. I get to both areas on a regular basis. If you find this compelling, send me an email outside the forum and I can arrange to get the appropriate stuff to y'all to see just what is happening. It might also be appropriate to wait until I get this project delivered and I actually find where the issue is. At least then we can look at exactly what is causing the issue.

    Ray
  • More data on the weird breakpoint. I have been taking System_printf() calls out one at a time. I just went from being blocked at the Event semaphore when running to having the initial breakpoint at main() show up in Timer.c at line 624. I guess this points to a tool issue of some sort.
  • Hi Raymond,

    Actually the TI-RTOS team is in Santa Barbara, CA. Can you start a new thread for this break-point issue? This thread is starting to cover too many things. In the new thread include the version of TI-RTOS, device, CCS version and which compiler you are using.

    One thing to look at though is the description of System_printf in the TI-RTOS User Guide. Specifically the "Generating printf Output" section. This talks about what happens in System_printf (e.g. CIO breakpoint when using CCS, TI compiler and SysStd).


    Todd
  • I will post an issue in the RTOS forum later today.

    I wish it was as simple as a CIO problem with SysStd. The way I read the documentation is that SysMin should not have this kind of issue and we are using SysMin.

  • Todd:

    Yesterday the symptoms changed again. We are able to get a few packets through and then it hangs at the STKEVENT semaphore.

    I did some more digging elsewhere on the forum and found some new messages regarding the prefetch anomaly that was present in early silicon. For some background, we started all of our development using the original silicon from the LaunchPad boards. Probably in May, we ran into the prefetch issue and made the changes to EMACSnow.c. I went to the forum yesterday just to double/triple check that we were using the correct workaround. I found a thread that described turning off the prefetch queue early in the process and then turning it back on in EMACSnow_NIMUInit(). This is in addition to what we implemented last year. BTW, my search did not reveal the original prefetch fix thread.

    My symptoms sure do seem like the prefetch issue. It has just been really weird that System_printf() seemed to have a precipitating effect.

    Our initial run of boards for prototyping used XM4C1294NCPDT12 silicon, and our boards now use TM4C129NCPDT13 silicon. Our production boards of that original design still seem to be working OK, but this new design uses a lot more RTOS resources and uses the mailbox functionality more heavily.

    I saw that some of these issues have been resolved in 2.00.02.36. We are looking at the issues surrounding doing an update. I plan to download, but not install 2.00.02.36 to compare against 2.00.01.23 later today. We are too close to launch on a product to risk wholesale changes to the code base.

    Is this likely to solve the issue? Any other words of wisdom?

    Ray

  • Todd:
    How do we close this?

    Can you feel the heat from my glowing face even where you are? It is quite red right now. I am so incredibly embarrassed that it took me so long to find it.

    It turns out the issue was entirely in our code. I had a rogue array index (working too late at night and not enough coffee) that was happily writing over the .data section. It would only trash 16 bytes at a constant offset from our own data which is why it looked very much like the prefetch issue that we patch in EMACSnow.c. Adding other data would move the affected data in or out of the danger zone!
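For anyone chasing a similar position-dependent hang, here is a minimal sketch in C of two cheap defenses against this class of bug: a bounds check on every indexed store, and guard words around the array that can be polled to detect an overrun. All names here (GuardedSamples, guarded_store, and so on), the buffer size, and the guard pattern are hypothetical and chosen only for illustration; this is not the actual project code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SAMPLE_COUNT 8u          /* hypothetical buffer length */
#define GUARD_WORD   0xDEADBEEFu /* arbitrary pattern, unlikely in real data */

/* An array bracketed by guard words so that a stray write just past either
 * end is detectable instead of silently corrupting a neighbor in .data. */
typedef struct {
    uint32_t guard_before;
    uint16_t samples[SAMPLE_COUNT];
    uint32_t guard_after;
} GuardedSamples;

/* Zero the samples and arm both guard words. */
void guarded_init(GuardedSamples *g)
{
    memset(g->samples, 0, sizeof g->samples);
    g->guard_before = GUARD_WORD;
    g->guard_after  = GUARD_WORD;
}

/* Store a value only if the index is in range; report failure otherwise. */
bool guarded_store(GuardedSamples *g, size_t index, uint16_t value)
{
    if (index >= SAMPLE_COUNT) {
        return false; /* rogue index rejected, nothing is written */
    }
    g->samples[index] = value;
    return true;
}

/* True while neither guard word has been overwritten. */
bool guarded_intact(const GuardedSamples *g)
{
    return g->guard_before == GUARD_WORD && g->guard_after == GUARD_WORD;
}
```

Calling guarded_intact() periodically, for example from a low-priority task, would flag the corruption close to where it happens rather than wherever the damaged neighbor is next used. A debugger data watchpoint on the corrupted address is another option when the offset is constant, as it was here.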

    I am still perplexed why the symptom *always* showed up as NDK blocked at its semaphore. Everything else in the system would still happily keep running. Maybe the semaphore was actually getting written to the wrong value by the rogue code.

    Thanks again for all of your excellent support while I struggled with this. I cannot say enough good things about the support that TI provides on this forum.

    Ray
  • Hi Ray,

    Glad this got resolved. I love getting emails like this on the first day back from vacation (reason I did not reply to the previous post).

    Todd