Potential Bug in 1.11.1 Host Driver - Stuck in hci_event_handler

Alan

Other Parts Discussed in Thread: CC3200, CC3100

It seems that if I do not have a line of debug in my code, I can become stuck in hci_event_handler. I believe I have found the culprit to be a race condition in which tSLInformation.usRxEventOpcode is set too late - after the IRQ arrives. Unfortunately the SPI data receive handler performs a key check on this variable, (evnt_handler.c:639) which will scrap the received data if it hasn't been set. This happens if the interrupt happens before we get into the SimpleLinkWaitEvent call, (evnt_handler.c:816)

/* The following shows a race condition due to setting tSLInformation.usRxEventOpcode
   too late. (Only code of note shown.)
   CC3000 Host Driver: 1.11.1 */

/* Call to wlan_start */
wlan.c:256          wlan_start()
                        /* Stuff ... */
wlan.c:306              hci_command_send(HCI_CMND_READ_BUFFER_SIZE,
                                         tSLInformation.pucTxCommandBuffer, 0);
wlan.c:307              SimpleLinkWaitEvent(HCI_CMND_READ_BUFFER_SIZE, 0);
                            /* If an interrupt should happen before this point, 
                               it will result in the host driver getting stuck
                               in a loop */
evnt_handler.c:816          tSLInformation.usRxEventOpcode = usOpcode;


/* The following is called after IRQ pin drops, SPI data read in */
wlan.c:219          void SpiReceiveHandler(void *pvBuffer)
wlan.c:224              hci_unsolicited_event_handler();
evnt_handler.c:673          if (hci_unsol_event_handler((char *)pucReceivedData) == 1)
evnt_handler.c:639              if ((event_type != tSLInformation.usRxEventOpcode) 
                                     && (event_type != HCI_EVNT_PATCHES_REQ))
                                {
                                    return(1); 
                                }
                            /* Now considering what happens if this returns 1... */
evnt_handler.c:673          if (hci_unsol_event_handler((char *)pucReceivedData) == 1)
                            {
                                /* This ultimately causes this command to be
                                   lost... */
                                tSLInformation.usEventOrDataReceived = 0;
                                
                                res = 1;
                                SpiResumeSpi();
                            }

/* Once the IRQ finishes we continue processing */
evnt_handler.c:816          hci_event_handler(pRetParams, 0, 0);
                                /* Since we set this to zero, we loop on this 
                                   forever... */
                                while (1)
                                {
evnt_handler.c:241                  if (tSLInformation.usEventOrDataReceived != 0)

Thoughts and suggested fixes?

From searching this forum I think there is a good chance the following posts are describing this problem.

http://e2e.ti.com/support/low_power_rf/f/851/t/305178.aspx

http://e2e.ti.com/support/low_power_rf/f/851/t/295158.aspx

http://e2e.ti.com/support/low_power_rf/f/851/t/292783.aspx

http://e2e.ti.com/support/low_power_rf/f/851/t/303733.aspx

EDIT for some clarification: My host is running a RTOS (ChibiOS) so rather than handling the interrupt in an ISR it is passed onto a thread. This problem can manifest itself in in the following way:

CC3000 Control Thread	IRQ Processing Thread
Send SPI Command
	Receive Reply
	Unexpected opcode - delete data
Set expected reply opcode

My solution to this problem was to insert a wait in the IRQ processing thread, before the call to the registered processing callback is triggered. This wait will permit the control thread to execute a bit further, hopefully setting usRxEventOpcode:

over 10 years ago

Thomas L. over 10 years ago

Prodigy 250 points

I think my issue is somewhat related: http://e2e.ti.com/support/low_power_rf/f/851/t/312136.aspx

Jimmy Pedersen over 10 years ago

Intellectual 430 points

Hi.

Anyone had a look at this?

If this is correct and could be fixed it would probably help alot on the strange errors I'm sometimes getting.

Kind

regards

Jimmy

Markus Siegert over 10 years ago in reply to Jimmy Pedersen

Prodigy 50 points

Another bump from me.

I am seeing the same behaviour. My CC3000 gets stuck in the while(1) loop in hci_event_handler.

My application connects to the AP via a stored profile and the just sends udp packets to my server via sendto.

If I only send less than 5 packets per second, everything works OK, but if I increase the frequency of the sendto calls to 10 or 20 per second, I normally get stuck in the while loop more or less right away (sometimes I can send 3-4 packets, but never more). And I get even stuck faster if I send to a broadcast address instead to a normal address (most of the time I only can send one packet).

So TI, please comment on this.

Thanks,

Markus

Marie Maurer over 10 years ago in reply to Markus Siegert

Genius 4895 points

Hello Markus et all,

has someone of you already created debug logs of the two debug pins and can create debug logs of this case?

I think also SPI logs could be helpful in this case?

Best regards,

Martin

Alan over 10 years ago in reply to Marie Maurer

Expert 2260 points

My current fix for this problem was inserting a 100 ms wait into my IRQ handler (obviously undesired).

I don't currently have any intention of purchasing a 1.8 V UART cable, unless I run into other problems. To be honest I don't believe monitoring the debug pins would be particularly useful for this problem.

If I can find the time, I will try to recreate this problem when a logic analyser is attached.

Diagnostics should require an extra GPIO to be set/toggled before this line is executed: tSLInformation.usRxEventOpcode = usOpcode;.

It should be possible to see if an IRQ response comes in before this line is executed. However, just from looking at the source code it is possible to say that this is at least potentially a problem, and even if it seldom occurs it is enough of a problem to lock the host up.

Marie Maurer over 10 years ago in reply to Alan

Genius 4895 points

I had captured the debug pins via logic analyzer (for a different problem). It was working.

So this could be a way to get traces without buying such a 1.8V cable.

Alan over 10 years ago in reply to Marie Maurer

Expert 2260 points

Hmm I didn't consider that option, though my logic analyser is a knock off, so I have low expectations of what it can do below 3 V logic :) I shall keep that tip in mind for the future though.

This problem though exists solely in the host driver, so I doubt the CC3000 debug log will be of much use.

Basically the opcode of the expected reply is being set after the spi write. So if the CC3000 fires its interrupt quickly with the reply, this expected opcode variable will not be set in time. Any unexpected data will be will be deleted.

Markus Siegert over 10 years ago in reply to Marie Maurer

Prodigy 50 points

Just some more observations: If I add a delay of > 200 ms after the first sendto, it works with much shorter delays afterwards. So if I do a:

while(1) { 
   sendto(ulSocket, buffer, bufferSize, 0, &tSocketAddr, sizeof(sockaddr));
   sleepMilliseconds(100);
}

this hangs after a few (<5) sendTo's

But if I do instead:

while(1) { 
   sendto(ulSocket, buffer, bufferSize, 0, &tSocketAddr, sizeof(sockaddr));
   sleepMilliseconds(300);
}


or a:

sendto(ulSocket, buffer, bufferSize, 0, &tSocketAddr, sizeof(sockaddr));
sleepMilliseconds(300);
while(1) { 
   sendto(ulSocket, buffer, bufferSize, 0, &tSocketAddr, sizeof(sockaddr));
   sleepMilliseconds(10);
}

those work without hanging...

And BTW. if I try to send broadcast packages, it hangs all the time after the first package...

Alan over 10 years ago in reply to Markus Siegert

Expert 2260 points

Looking at this example, I don't believe the problem you are seeing is being caused by the issue I reported.

Markus Siegert over 10 years ago in reply to Alan

Prodigy 50 points

I don' t know, but for sure it's hanging at the same point. I have the feeling this is due to the module is still handling the ARP messages for the first sendto when the next sendto is happening. After the ARP is done, I can send as fast as I want... This is just guessing, I don't even know if there are ARP packages sent or not, but I would expect so.

A response to an ARP request could cause an interrupt which is happening right where you suspect the race condition?

Alan over 10 years ago in reply to Markus Siegert

Expert 2260 points

I'm not even sure ARP results in an interrupt request. You could add some debug or read through the event handler to see what it expects. Using a logic analyser on the lines might be a good idea.

While it may be getting stuck in the same place, the waits you are inserting to fix your problem would unlikely prevent the problem I described.

Glenn Vassallo over 10 years ago in reply to Alan

Guru 21435 points

I am experiencing this issue also. In that it gets stuck in the hci_event_handler loop.

When I look at the stack, it seems to be the select() statement that initiated the issue.

My setup is as follows:-

I have 2 UDP sockets. 1 is a regular socket listening on the CC3000 IP address. The other socket is a broadcast socket listening on 0.0.0.0.

My first regular socket does not have issues, I am receiving packets at about 20 per second. It does not however perform any sendto() after this, as it processes the data internally and sends out to attached hardware via USB or SPI.

My broadcast socket is set up to echo data back to the device that sent it the broadcast packet. It does use a sendto() to do this. This is the socket that is having issues.

I am on the latest service pack 1.24. I am using everything TI, including the Tiva C LaunchPad, the CC3000 BoosterPack and CCS 5.5.

I have tried a number of different configuration setups for my broadcast socket, including timeouts, buffer size etc, the only thing that reduces the speed with which this issue occurs is to reduce the number of packets being sent to the CC3000.

This issue also seems to occur if I am not even sending broadcast packets, but just my regular packets used by my regular socket (even though it does not perform a sendto - need to verify this scenario is not an unrelated issue with my App). I have 2 separate selects for each socket in my main program loop. I was hoping that my code was the issue, but it seems from this post, that it is unlikely the cause.

I will try the delay solution suggested by Markus and report back. As mentioned above my packets are broadcast packets, so we will see how it goes.

Glenn Vassallo over 10 years ago in reply to Glenn Vassallo

Guru 21435 points

Here are what my additional tests reveal.

I have tried a delay of 500ms after the sendto() and this has made no difference when sending broadcast packets.

Sending packets to my regular socket that does not execute a sendto() does not cause the issues. But it does seem to speed up the process.

It takes a lot longer for the issue to occur when I have it attached to a debugger. When not attached, I am sending a broadcast packet every 0.5 seconds and it then takes about 30 minutes for the issue to occur.

My app does not usually send these broadcast packets, it sends the packets till it finds the CC3000 device, and then stops sending broadcast packets, so I will not usually be hitting it so hard, but I am concerned this issue could still cause instability over longer periods. I am also wanting to put in place a heartbeat, so it can check every minute that the device can still be found.

Here is the stack backtrace:-

This time it seemed to be trigger by the wlan_ioctl_statusget() in my code, it is included in the main program loop and will make the LED flash if the CC3000 loses connection with the Wifi Router. It was connected and fine at the time of the issue.

wlan_ioctl_statusget() does contain hci_command_send(...)

Even though I am experiencing the same conditions as posted by others. I am going to investigate whether this is my code. I need some verification that my method of configuring 2 sockets (1 regular and 1 broadcast) and the way I use select() in the main program loop is the correct way to go about things.

Oh and some further info, similar issues have been raised in this forum a number of times:-

Sometimes hang in hci_event_handler - http://e2e.ti.com/support/low_power_rf/f/851/t/295158.aspx

Event_handler_hci in cc3000 - http://e2e.ti.com/support/low_power_rf/f/851/t/303733.aspx

recv api hangs erratically - http://e2e.ti.com/support/low_power_rf/f/851/t/316867.aspx (This may provide details on a work around)

Vishal Talwar over 10 years ago in reply to Glenn Vassallo

Intellectual 635 points

Hi Glenn,

Ok, based on this I've found a few differences in our socket setups:

I only have one "broadcast" (0.0.0.0) UDP socket, which I call my command socket - commands are parsed from here and responses are sent through here.
Sometimes, a TCP socket is created for temporary streams of data, like firmware updates!! - this is usually when problems arise for me, and firmware updates are where problems can really make you hurt.
With the TCP issues, reducing the transfer sizes reduced the problems immensely but I haven't yet quantified the rate or the reduction - it may still be problematic for production.
I realize I'm not using select() for recv(), which is probably a bug on my part, but I do have a 20ms timeout set on recv() calls.

Other than that, I'm using a Freescale microcontroller and the CodeWarrior IDE without an RTOS but with a state machine framework, so some major differences exist there.

-Vishal

Glenn Vassallo over 10 years ago in reply to Vishal Talwar

Guru 21435 points

Hi Vishal,

It does sound like you are experiencing the "Host Driver - Stuck in hci_event_handler" issue as you have outlined in this post - http://e2e.ti.com/support/low_power_rf/f/851/p/313896/1109538.aspx#1109538

For you it is related to TCP connections. I have seen it related to select() and even wlan_ioctl_statusget(). So it looks like there is definite issues with this part of the host driver. And as you suggest, I also suspect a race condition, as attaching the debugger causes a heisenbug...or at least it takes a lot lot longer to reproduce.

I would like to avoid your workaround of using a timeout inside hci_event_handler() that will force the module to reset if it's been busy-waiting for more than a second. I was looking into setting up a watchdog timer, but that is more for unforeseen issues rather than ones I already know exist.

So it looks like this issue is real and needs to be looked at and resolved.

Alan over 10 years ago in reply to Glenn Vassallo

Expert 2260 points

I spent a few hours looking at this again.

I may well be seeing it with such frequency due to my implementation/RTOS, though I still think it is a valid bug.

Initial Observations:

This happens directly after a KEEPALIVE event.

Quick Fix:

Mask out KEEPALIVE events- which does seem to work, at least for the small time I tested it.

GDB Observations

Unfortunately I cannot see what is going on in my threads (I couldn't get OpenOCD to work properly) so GDB is somewhat limited. To view this bug I conducted the following steps:

Make event_type in hci_unsol_event_handler() static (to easily watch with GDB)
Break on the first KEEPALIVE: (gdb) "break evnt_handler.c :545"
Run, and wait for the breakpoint. Then:
watch event_type
watch tSLInformation.usRxEventOpcode
Continue
GDB breaks evnt_hander:510 with event_type = 519
Continue
GDB breaks evnt_handler.c:820 with tSLInformation.usRxEventOpcode = 519

Step 7 shows the event coming in before we expect in in step 9. This doesn't always work due to the nature of the race condition. Ideally you need to enable the two hardware breakpoints before the next KEEPALIVE comes it.

Potential Permanent Fix

I didn't go looking to see if this covers all scenarios, so this may well only work for a handful, but:

Moving the setting of tSLInformation.usRxEventOpcode from SimpleLinkWaitEvent() to hci_command_send() just before SpiWrite() seems to work for my test.

EDIT: This doesn't appear to work particularly reliably outside my test case for reasons I am currently not sure of.

Test

void test_bug(void)
{

#include "nvmem.h"
#include "hci.h"
#include "wlan.h"
    uint8_t patchVer;

    PRINT("before wlan_start :%u", NULL);
    wlan_start(0);
    PRINT("after wlan_start :%u", NULL);

#if 0
    wlan_set_event_mask(HCI_EVNT_WLAN_KEEPALIVE);
#endif

    while(1)
    {
        nvmem_read_sp_version(&patchVer);
        PRINT("ver:%u", patchVer);
    }
}

Alan over 10 years ago in reply to Alan

Expert 2260 points

I ran into another issue there, where closesocket() was hanging.

Here is a saleae trace of the closesocket() call. https://dl.dropboxusercontent.com/u/2866686/closesocketproblem.logicdata

So it can be seen that immediately after calling closesocket() there was a CLOSE_WAIT event.

In this trace, you can clearly see the ~100ms delay I have added to try and fix the above problem.

After GDB'ing it, it looks like somehow the above problem was still occurring, so that makes me think there is some issue with my implementation, since that 100ms sleep should have ensured that the thread initiating the closesocket() call should have continued to the point where it sets the expected rx data.

The fix I suggested above: http://e2e.ti.com/support/low_power_rf/f/851/p/312391/1148240.aspx#1148240 appears to work for this problem. I am not sure if this is all somehow a coincidence or not, since I cannot understand how this is a problem in this case, but I thought it was worth reporting.

Derek Knapp over 10 years ago in reply to Alan

Prodigy 240 points

I am seeing the same issue. I am not even getting past wlan_start(0); it hangs every time in

SimpleLinkWaitEvent(HCI_CMND_READ_BUFFER_SIZE, 0);

I modified the code in wlan.c line 307 to first set the usRxEventOpcode before caling the hci_command_send. This seems to allow me to get past the wlan_start, but I encounter the same problem on other wlan_* commands.

	tSLInformation.usRxEventOpcode = HCI_CMND_READ_BUFFER_SIZE;
	hci_command_send(HCI_CMND_READ_BUFFER_SIZE, tSLInformation.pucTxCommandBuffer, 0);
	SimpleLinkWaitEvent(HCI_CMND_READ_BUFFER_SIZE, 0);

The solution posted by the OP does not really work for me, since I am not using a RTOS, and both are running in the same thread.

I assume this is being exaggerated for me due to the number of debug messages I have added to the code to solve a problem I was having previously, or possibly the speed of my chip (LPC1114).

I am going to try to remove all my debug messages and see if this happens less frequently, but this is clearly a flaw in the code, which can lead to never ending loops.

Alan over 10 years ago in reply to Derek Knapp

Expert 2260 points

Note that becoming stuck in this loop can happen for a number of reasons, including an incorrect SPI driver implementation.

Since setting usRxEventOpcode prior to sending gets you past a command it does sound like this is at least part of the problem.

I haven't went back to try and determine why setting this variable first didn't appear to be a permanent fix. To my understanding it should have been unless an asynchronous event can somehow interfere... Did you try disabling the keep alive events?

I have found that a logic analyser is more or less necessary for debugging the CC3000.

Mahesh Sutariya over 9 years ago in reply to Alan

Intellectual 805 points

Hi Alan,

I am also getting similar problem. I upgraded my CC3000 sp version to 1.11.1. But code never comes out of wlan_set_event_mask function call. I tried masking out keep alive event but nothing changed.

I also tried latest CC3000 sp version 1.13. But same things happen.

When I comment following lines

if ((event_type != tSLInformation.usRxEventOpcode) && (event_type != HCI_EVNT_PATCHES_REQ))
   {
       return(1);
   }

in hci_unsol_event_handler function call. It do comes out of wlan_set_event_mask and everything works as it was working in 1.11(atleast)

Did you find any solution on this?

Derek Knapp over 9 years ago in reply to Mahesh Sutariya

Prodigy 240 points

I have given up on the CC3000. After spending some time searching the internet, I have come to the conclusion that the CC3000 is riddled with bugs that go deeper than firmware can fix. This seems to be confirmed with the release of the new CC3100 and CC3200.

I would give the new CC3100 a shot, I haven't had any issues with it yet.

Glenn Vassallo over 9 years ago in reply to Derek Knapp

Guru 21435 points

Alan,

I too recommend giving the CC3100/CC3200 a go. I have migrated my code from a Tiva C/CC3000 to a Tiva C/3100 solution and there really is not all that much work involved. Additionally you get so many more features, not just stability, that it is worth the effort. Yes, I know a custom board will need to be redesigned, but this should not be a big issue, you just replace the CC3000 with the CC3100 (different foot print and pins) and add an external SFlash, and you should be good to go.

More information here - http://e2e.ti.com/support/wireless_connectivity/f/851/t/348514.aspx

Glenn.

Jim Laskowski over 8 years ago

Intellectual 755 points

I came across the same problem and fixed it, but not by slowing down interrupt processing with a wait. The problem is exactly as Alan described...

The main thread calls hci_command_send() which sends the command to the CC3000 via SPI, then it calls SimpleLinkWaitEvent(). SimpleLinkWaitEvent sets the tSLInformation.usRxEventOpcode variable so that when the interrupt hits for the response, the interrupt code knows which op code is being waited for. Then SimpleLinkWaitEvent goes into a loop waiting for the interrupt code to indicate that the response message has arrived.

More specifically, when the interrupt comes, it calls the unsolicited ("non-host initiated") interrupt handler first to see if the message received is one of the known unsolicited messages. If not, it then compares the opCode of the incoming message to tSLInformation.usRxEventOpcode to see if it's something the main thread is waiting for. If it doesn't match tSLInformation.usRxEventOpcode, it blows off the message entirely.

The problem, as Alan described, is that the main thread sends the command to the CC3000 then the interrupt happens before the main thread gets a chance to set tSLInformation.usRxEventOpcode. So, the interrupt handler blows off the message and the main thread remains in an infinite loop.

The solution is to set tSLInformation.usRxEventOpcode BEFORE hci_command_send() or hci_data_send() send the commands to the CC3000. So, comment this out in SimpleLinkWaitEvent() in evnt_handler.c, and add it instead to the top of hci_command_send() in hci.c:

   tSLInformation.usRxEventOpcode = usOpcode;

Unfortunately, hci_data_send() doesn't get the opCode, so you can't add it there, but it's only called once in simple_link_send() in socket.c, so change the code at the bottom from this:

hci_data_send(opcode, ptr, uArgSize, len,(UINT8*)to, tolen);

if (opcode == HCI_CMND_SENDTO)
   SimpleLinkWaitEvent(HCI_EVNT_SENDTO, &tSocketSendEvent);
else
   SimpleLinkWaitEvent(HCI_EVNT_SEND, &tSocketSendEvent);

to this:

if (opcode == HCI_CMND_SENDTO) {

   tSLInformation.usRxEventOpcode = HCI_EVNT_SENDTO;

   hci_data_send(opcode, ptr, uArgSize, len,(UINT8*)to, tolen);

   SimpleLinkWaitEvent(HCI_EVNT_SENDTO, &tSocketSendEvent);

} else {

   tSLInformation.usRxEventOpcode = HCI_EVNT_SEND;

   hci_data_send(opcode, ptr, uArgSize, len,(UINT8*)to, tolen);

   SimpleLinkWaitEvent(HCI_EVNT_SEND, &tSocketSendEvent);

}

Wi-Fi

Wi-Fi forum

Potential Bug in 1.11.1 Host Driver - Stuck in hci_event_handler