CC3000 and drivers intermittent problem

Other Parts Discussed in Thread: MSP430F5510, MSP430G2553, TM4C123GH6PM, MSP430F5310

Hi

I have ported the CC3000 drivers to run on an MSP430F5510. Initially I had a lot of problems with the driver 'hanging' waiting for the CC3000 IRQ line to go high to generate an interrupt. (see http://e2e.ti.com/support/low_power_rf/f/851/p/283515/1011290.aspx#1011290)

This problem was resolved with patch release 11.1, where I was able to make a WiFi connection using SimpleLink and send and receive UDP packets. However, I still had an intermittent re-occurrence of this same problem.

My interrupt service routine is dealing with two interrupts on the same port, and when I prioritised the interrupt from the CC3000, the problem improved considerably, although I still have some more long term testing to do to ensure the problem is truly resolved.

From my own experience and the many other postings I have come across (on similar if not quite identical topics), my conclusion is that there is a fundamental timing problem between the CC3000 internal code and the TI drivers / IRQ handling. Various users have 'fixed' this by making software changes, but my feeling is that these changes only alter the timing slightly, masking the problem so that it appears to go away.

I cannot be certain about this as I do not have access to the CC3000 code, or sufficient understanding of its internal behaviour. Clearly, the examples generated by TI and using the TI development boards work fine and do not (as far as I know) show this problem. However, there seems to be some evidence of people like me having problems when code is ported to 'real' product boards.

I would appreciate a response from the TI team on their view of my comments above.

Best Regards,

Dave Smith

  • +1 to this. I'm also dissatisfied with my "fix" for the intermittent hanging behavior I was seeing, mostly due to the section of code within SpiWrite() in spi.c responsible for de/asserting CS and disabling/enabling IRQ handling. Something in there just seems fishy and prone to race conditions. The latest code added a section after asserting CS and enabling IRQ with the following comment:

    // check for a missing interrupt between the CS assertion and enabling back the interrupts
    if (tSLInformation.ReadWlanInterruptPin() == 0) {
    ... 

    So clearly TI recognizes a class of problems related to missed interrupts, which is great, but what I don't understand is if the code can get that far without the interrupt handler firing, isn't there a race condition now between the IRQ low condition check and the interrupt controller? Also, why can't IRQ be enabled before CS is asserted?

    I would love an official response on this, and for others out there just beginning to port the driver code, I think some guidance on that section of code and interrupt handling in general would be much appreciated.

    Thanks!!
    Vishal

  • +1 Bumping this thread. Anyone from TI still paying attention?

  • Vishal Talwar said:
    // check for a missing interrupt between the CS assertion and enabling back the interrupts
    if (tSLInformation.ReadWlanInterruptPin() == 0) {
    ... 

    I removed this check for missed interrupts. On my platform, interrupts are not missed, and this check on occasion was causing the write to occur at this point. The interrupt handler, oblivious to this, then continued with handling the detected interrupt incorrectly treating it as a read (reading no useful information).
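
    The double handling described above can be walked through with a small host-side model. This is a sketch, not the TI driver: the eSPI_STATE_* names mirror spi.c, but `spi_model_t`, `polled_first_write()`, `spi_isr()` and `enable_interrupts()` are hypothetical names introduced for illustration. It assumes the interrupt controller latches the edge while interrupts are disabled, and it shows one possible guard, which is to discard the latched edge once the polled path has serviced the write. Whether that is safe on a given MCU would need to be checked.

```c
#include <stdbool.h>

/* Host-side model only; everything except the eSPI_STATE_* names is hypothetical. */
typedef enum { eSPI_STATE_IDLE, eSPI_STATE_WRITE_IRQ } spi_state_t;

typedef struct {
    spi_state_t state;
    bool irq_pending;    /* edge latched by the interrupt controller */
    int  writes;         /* packets written to the module */
    int  reads_started;  /* reads started by the ISR */
} spi_model_t;

/* ISR behaviour as described in the thread: in WRITE_IRQ it finishes the
 * write, in IDLE it assumes the module has data and starts a read. */
static void spi_isr(spi_model_t *m)
{
    m->irq_pending = false;
    if (m->state == eSPI_STATE_WRITE_IRQ) {
        m->writes++;
        m->state = eSPI_STATE_IDLE;
    } else {
        m->reads_started++;  /* a stale edge ends up here */
    }
}

/* Polled "missed interrupt" path. With clear_stale_edge set, the latched
 * edge is discarded after the write has been serviced here. */
static void polled_first_write(spi_model_t *m, bool irq_pin_low, bool clear_stale_edge)
{
    m->state = eSPI_STATE_WRITE_IRQ;
    if (irq_pin_low) {
        m->writes++;
        m->state = eSPI_STATE_IDLE;
        if (clear_stale_edge) {
            m->irq_pending = false;
        }
    }
}

/* Re-enabling interrupts delivers any edge that is still latched. */
static void enable_interrupts(spi_model_t *m)
{
    if (m->irq_pending) {
        spi_isr(m);
    }
}
```

    Starting from `{ eSPI_STATE_IDLE, true, 0, 0 }`, calling `polled_first_write(&m, true, false)` and then `enable_interrupts(&m)` leaves one write and one spurious read; with `clear_stale_edge` set to true the spurious read does not occur.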

    This check is present in the MSP430G2553 example code for a reason I don't understand. As I mentioned in another post, http://e2e.ti.com/support/low_power_rf/f/851/p/312136/1086858.aspx#1086858, I don't believe MSP430G2553s can miss this interrupt.

  • Looking for a response from TI Employee...

  • Hi all,


    I removed the check for missed interrupts as well (I'm using a Freescale K10). By the way, when I saw the code saying to check for missed interrupts, alarm bells started ringing. What is this? You handle interrupts, but just in case, you poll anyway. Very poor.

    Ciarán

  • Hi All,

    I too have this problem. In my case I have a number of custom boards for a commercial product, and we have 1000 boards, half of them already sold, waiting for firmware. The problem manifests itself as random intermittent resets: it gets caught in a loop somewhere and the watchdog does its thing.

    The problem seems to be more or less frequent depending on the individual units, some boards are worse than others.

    I need a graceful way of recovering from one of these failures, or a step by step guide to: this is what causes the failure and this is how to avoid the problem in the first place.

    The boss is getting more and more anxious about things not going right.

    I have tried implementing the various fixes from the other posts on this forum; things seem improved but not fixed.

    This is more a bump of this thread rather than any useful information. (or to let TI know we will be looking for alternatives for the first hardware revision)

    Trevor

  • Alan, that's really interesting. It does seem that interrupts are "unmissable" at a low level in the NVIC world - they are just in a pending state until cleared or enabled. I think this is pretty much what you point out in the thread you linked to. Maybe this check has more to do with platform-independence or resilience to different kinds of interrupt-enabling application code. For example, I'm just calling a library function to re-enable the edge-triggered interrupt inside SpiResume() and I have no idea if the pending interrupt is being cleared or what - something I will need to go re-examine.

    I really, REALLY wish TI would step in and clear up some of these issues. Despite all our attempts to iron out hangups and stalls, we've had to resort to attaching a watchdog to the CC3000 and hard reset it when it misbehaves.

    (Source for the pending interrupts: "If a trigger flag is set, but the interrupts are disabled (I=1), the interrupt level is not high enough, or the flag is disarmed, the request is not dismissed. Rather the request is held pending, postponed until a later time, when the system deems it convenient to handle the requests." - http://users.ece.utexas.edu/~valvano/Volume1/E-Book/C12_Interrupts.htm)

  • +1 on this!

    My team and I have been battling this problem for the last 3 weeks now. The only solution we came up with was to monitor the CC3000 and reset it when it hangs. On average we are doing 10 resets per hour. However this "workaround" is a huge problem for us: we are forced to call methods that write to the EEPROM on each reset. The method we need to call is wlan_ioctl_set_scan_params(...), and without calling it, it is not possible to initiate a scan for open APs. At this rate we would exhaust the EEPROM's write cycles in about 1 year.

    I asked one of my engineers to see what happens when the CC3000's EEPROM has been exhausted by continuously restarting and calling wlan_ioctl_set_scan_params(...). He was able to restart it every 1.2 seconds in a loop. He discovered that the CC3000 failed after about 40 hours, or ~117,000 restarts.

    Obviously we cannot go into production with this, since our devices would start failing after a little over a year, which would definitely ruin our company.
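
    The lifetime estimate above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this post (10 resets per hour, a restart every 1.2 seconds for about 40 hours, roughly 117,000 restarts before failure) and assumes one EEPROM write per reset at a constant rate; the function names are made up for illustration.

```c
/* Days until a write-cycle budget is used up at a steady reset rate,
 * assuming exactly one EEPROM write per reset. */
static double days_until_exhausted(double write_budget, double resets_per_hour)
{
    return write_budget / (resets_per_hour * 24.0);
}

/* Number of restarts that fit into a test of the given length. */
static double restarts_in_test(double hours, double seconds_per_restart)
{
    return hours * 3600.0 / seconds_per_restart;
}
```

    With these inputs, `restarts_in_test(40.0, 1.2)` gives 120,000, in line with the observed ~117,000, and `days_until_exhausted(117000.0, 10.0)` gives 487.5 days, i.e. a little over a year.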

    Since I don't think the underlying issue can be easily found or fixed, I believe TI should allow the device to be restarted frequently without exhausting its EEPROM. However all my queries about how to file a bug or when the next patch will be available have remained unanswered.

    Considering how much time and money we have already invested in this product, the negative experiences and lack of response from TI have raised a very dark suspicion in my mind:

    Is TI planning to drop this product?

  • Vishal Talwar said:

    I really, REALLY wish TI would step in and clear up some of these issues. Despite all our attempts to iron out hangups and stalls, we've had to resort to attaching a watchdog to the CC3000 and hard reset it when it misbehaves.

    This is something I have also implemented, whereby the MCU resets to exit out of a stuck event handler loop.

    The only "solution" I can think of for this problem is that, instead of looping in the host driver, the event handler checks should be in the user's program, within their own loop. This would make exiting a bad state easier by resetting the CC3000 instead of everything.

    To do this the user would have to register their own callback function at init or call a function after every command... the former is probably preferred.

    In their implemented looping function they call a new API which is a one-shot reply checker, increment some form of error check counter and sleep. They can exit with the reply from the CC3000 or a timeout error code.

    The IRQ handler still reads the data in and determines if the event was asynchronous, in which case it is still handled in the async handler; otherwise it sets a variable to be picked up by the API function in the loop.

    I think that the Host Driver's ability to completely hang the MCU is something which should be addressed; after all, I have seen people interested in interfacing the CC3000 to a BeagleBone. Of course with the host driver being open source people can add their own exit conditions as they see fit, but I don't think that is a particularly elegant solution. I believe such a driver should not need to be customised in such a way.
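
    The one-shot reply checker proposed above could look roughly like the following. This is a minimal sketch under stated assumptions, not part of the TI host driver: `reply_waiter_t`, `reply_wait_begin()`, `reply_post_from_irq()` and `reply_poll_once()` are hypothetical names, the opcode values are arbitrary, and the "error check counter" is modelled as a simple poll budget.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { REPLY_PENDING, REPLY_READY, REPLY_TIMEOUT } reply_status_t;

typedef struct {
    volatile bool     reply_ready;     /* set from the IRQ handler */
    volatile uint16_t reply_opcode;    /* opcode of the reply that arrived */
    uint16_t          expected_opcode; /* opcode the caller is waiting for */
    uint32_t          polls_left;      /* remaining polls before giving up */
} reply_waiter_t;

/* Arm the waiter right after a command has been sent. */
static void reply_wait_begin(reply_waiter_t *w, uint16_t opcode, uint32_t max_polls)
{
    w->reply_ready = false;
    w->expected_opcode = opcode;
    w->polls_left = max_polls;
}

/* Called by the IRQ handler for synchronous (non-async) replies. */
static void reply_post_from_irq(reply_waiter_t *w, uint16_t opcode)
{
    w->reply_opcode = opcode;
    w->reply_ready = true;
}

/* One-shot check: never blocks, so the caller's own loop stays in control
 * and can sleep, feed a watchdog, or reset the CC3000 on REPLY_TIMEOUT. */
static reply_status_t reply_poll_once(reply_waiter_t *w)
{
    if (w->reply_ready && w->reply_opcode == w->expected_opcode) {
        w->reply_ready = false;
        return REPLY_READY;
    }
    if (w->polls_left == 0) {
        return REPLY_TIMEOUT;
    }
    w->polls_left--;
    return REPLY_PENDING;
}
```

    The user's loop would call `reply_poll_once()` once per iteration, sleep between calls, and treat `REPLY_TIMEOUT` as the cue to reset only the CC3000.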

  • Hi All,

    Are the majority of us having a problem related to the block below?

    if (tSLInformation.ReadWlanInterruptPin() == 0)
    {
        SpiWriteDataSynchronous(sSpiInformation.pTxPacket, sSpiInformation.usTxPacketLength);

        sSpiInformation.ulSpiState = eSPI_STATE_IDLE;
        DEASSERT_CS();
    }

    Looks like this is interpreted and handled differently on different MCUs. As far as I understand from the above description, we have both the ISR and the above section of code executing and then changing the states unexpectedly. Is that right?

    Hi Ivor,
    We understand your concern, but it is a very specific use-case, and separating out scan and scan-settings would call for customized changes. The ideal way would be to understand what is causing this hang and then the reset.

    Thanks & Regards,
    Raghavendra

  • Hi Raghavendra,

    I'm not 100% sure if that is the source of the problem, as Alan and I both think interrupts should not be missable on our platforms. Where it manifests most obviously for us is as an infinite loop in hci_event_handler(). The most common place we would see such things happen is in socket activity, especially with TCP reads/writes/connects but also with UDP reads/writes when a TCP socket was also active.

    Perhaps someone who is still battling this problem and hasn't yet implemented a hacky workaround could give you more details.

    Thanks,
    Vishal

  • Alan said:

    I really, REALLY wish TI would step in and clear up some of these issues. Despite all our attempts to iron out hangups and stalls, we've had to resort to attaching a watchdog to the CC3000 and hard reset it when it misbehaves.

    The only "solution" I can think of for this problem is instead of looping in the host driver, I think the event handler checks should be in the user's program, within their own loop. This would make exiting a bad state easier by resetting the CC3000 instead of everything.

    To do this the user would have to register their own callback function at init or call a  function after every command... the former is probably preferred.

    We actually do this by injecting code in hci_event_handler that times out the while loop after a few seconds have elapsed. If there's a hang that long, we assume the driver/module will never recover and reset the CC3000 and driver state. As you suggest, we exit with a "timeout" event code that has to then be handled and propagated all the way up the stack. Future calls to the driver have to be aware that the module is in a reset state. It's a bundle of hacks - but, uh, a bundle is less brittle than a single hack? Not really, this shouldn't be necessary.
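
    A bounded wait of this kind can be sketched as follows. This is an illustration, not the code injected into hci_event_handler: `deadline_t`, `deadline_start()` and `deadline_expired()` are hypothetical helpers, and a free-running 32-bit millisecond tick counter is assumed. The unsigned subtraction keeps the comparison correct when the counter wraps.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t start;    /* tick value when the wait began */
    uint32_t timeout;  /* allowed duration in ticks */
} deadline_t;

static deadline_t deadline_start(uint32_t now, uint32_t timeout)
{
    deadline_t d = { now, timeout };
    return d;
}

/* Unsigned subtraction yields the elapsed ticks even after the tick
 * counter has wrapped past 0xFFFFFFFF. */
static bool deadline_expired(const deadline_t *d, uint32_t now)
{
    return (uint32_t)(now - d->start) >= d->timeout;
}

/* Usage inside a wait loop (get_ticks_ms, event_received and WAIT_TIMEOUT
 * are placeholders for whatever the platform and driver provide):
 *
 *   deadline_t d = deadline_start(get_ticks_ms(), 3000);
 *   while (!event_received) {
 *       if (deadline_expired(&d, get_ticks_ms())) {
 *           return WAIT_TIMEOUT;   // caller resets the CC3000 and driver state
 *       }
 *   }
 */
```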

  • Hi Raghavendra,

    The problem is not in this block, since commenting it out does not fix the issue, although it seems to make it about 50% less frequent.

    The problem is inside the CC3000 firmware. There is a bug causing the CC3000 to hang. And without access to the firmware code, there is nothing we can do, other than detect the condition and reset the CC3000. This is the only possible workaround.

    However you are saying that having to reset the CC3000 is a very specific use case and you don't want to make it less damaging to the CC3000. I am saying it is not, it is the only workaround, therefore if you guys cannot find the issue, you should at least provide a usable workaround.

    Also I don't understand why it is so complicated to simply modify the existing wlan_ioctl_set_scan_params(...) to write to the EEPROM only if the values have changed. We are not talking about adding another API or modifying an existing one - simply read the EEPROM first and if the values are the same as the ones the user is providing - don't re-write them. We are talking about 3-5 lines of C code here.
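
    The read-before-write idea can be illustrated with a host-side model. How the CC3000 firmware actually stores scan parameters is not known here, so `fake_nvmem_t`, `SCAN_PARAM_BYTES` and `nvmem_update_if_changed()` are assumptions made up for this sketch; it only shows that identical values need not cost a write cycle.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SCAN_PARAM_BYTES 16  /* arbitrary size chosen for illustration */

typedef struct {
    uint8_t  cells[SCAN_PARAM_BYTES];  /* stand-in for the stored parameters */
    uint32_t write_count;              /* number of write cycles consumed */
} fake_nvmem_t;

/* Writes only when the new parameters differ from the stored ones.
 * Returns true if a write cycle was actually spent. */
static bool nvmem_update_if_changed(fake_nvmem_t *nv, const uint8_t *params, size_t len)
{
    if (len > sizeof nv->cells) {
        return false;  /* does not fit: refuse rather than overflow */
    }
    if (memcmp(nv->cells, params, len) == 0) {
        return false;  /* identical values: skip the write */
    }
    memcpy(nv->cells, params, len);
    nv->write_count++;
    return true;
}
```

    In this model, repeating the same call after every reset leaves `write_count` unchanged after the first write.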

  • Ivan,

    As a point of clarification, you don't necessarily have to reset power to the CC3000, right? You just have to reset your host processor and go through the start-up sequence again. That is to say, just using a watchdog timer in the host processor seems to restore operation on host reset. Not excusing the CC3000 behavior, but wanting to characterize the problem being noted in this thread correctly.

    Regards,

    Vic

  • Surely you have to toggle the WLAN EN pin at some point? I would imagine it likely happens as your MCU goes through reset?

  • Hi All,


    Just to add that I have the same problem - the stack was hanging in the event handler. I implemented a timeout which improved the problem but this exposed another weakness in the stack. If an event arrives other than the one which the higher layer (say hci) is waiting for, it crashes. Here's why: the higher layers of the stack provide a buffer all the way down to where the received data is handled in the event handler. But some events/responses have lots of data, others only a byte or so. If an event arrives with lots of data, and the higher layer is waiting on an event for which it expects only a byte or two, then the higher layer buffer is blown. In general I find the stack terribly unrobust.
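
    One common defence against this kind of overrun is to pass the destination capacity down with the buffer and clamp the copy. The sketch below is not taken from the host driver; `copy_event_payload()` and `copy_result_t` are hypothetical names used only to show the length check.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum { COPY_OK, COPY_TRUNCATED } copy_result_t;

/* Copies at most dst_cap bytes of an event payload and reports whether the
 * payload was larger than the buffer the higher layer provided. */
static copy_result_t copy_event_payload(uint8_t *dst, size_t dst_cap,
                                        const uint8_t *src, size_t src_len,
                                        size_t *copied)
{
    size_t n = (src_len < dst_cap) ? src_len : dst_cap;
    if (n > 0) {
        memcpy(dst, src, n);
    }
    *copied = n;
    return (src_len > dst_cap) ? COPY_TRUNCATED : COPY_OK;
}
```

    A caller that receives `COPY_TRUNCATED` knows an unexpected or oversized event arrived and can discard it instead of corrupting memory.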


    Back to my timeout - the stack was coded with no timeout, now I have a timeout. The question is, what is the appropriate level of timeout before considering that no event response will come back. Trial and error - since I don't know the internals of the CC3000. What I now see is that from time to time I get a timeout before an event arrives (missed event), and that is ok. But then, from time to time the part seems to stop responding to any event and a recovery procedure is required.


    TI - again, please comment.

    Ciarán

  • Hi All,

    I have solved my problems or at least (thanks to information from these forums) added enough workarounds to get a result, whether it is an acceptable result only time will tell.


    The main side effect of this problem was a watchdog timeout that reset the running process as well as the processor. I added a timeout to the event handler and set it to be a bit less than the watchdog timer.

    From another post, I increased the SPI clock rate from 1MHz to 8MHz.

    I also added the TI fix of checking the gpio pin when resuming the SPI interrupt.

    I added the Adafruit method of checking the gpio pin when resuming the SPI interrupt. (this made the biggest improvement)

    All of these steps improved things in small degrees, so that I was getting one watchdog timeout in 24 hours. The final piece of the puzzle was that I was using the UART to print out a number of debug and status messages; when I disabled the majority of the messages I finally got to a 24 hour period without a reset.

    The hanging problem showed up as a result of a firmware update to the CC3000 module. It was working when the module reported 1.07 as its version; after the update it reports 1.26. We updated it to use the better smartconfig support.

    I wish you all the best of luck with your efforts to get this thing working in your application.

    Trevor

  • Hi Trevor,


    Thanks for sharing! A couple of questions:

    1. What is "TI fix of checking the gpio pin when resuming the SPI interrupt"?

    2. What is "the Adafruit method of checking the gpio pin when resuming the SPI interrupt"?

    Thanks!

  • Hi Trevor,

    Very interesting. Can you provide a reference for the 8MHz post?

    Ciarán

  • Hi Ciarán,

    The reference to SPI clock rates came from this post, where they were talking about 1.5 and 3MHz. I measured mine at 1MHz, so I doubled it and things improved, so I doubled it again and again with no noticeable improvement. It still worked, so I just left it at 8MHz; the CC3000 is rated to 16MHz, so I'm not pushing anything.

    http://e2e.ti.com/support/wireless_connectivity/f/851/p/283515/997876.aspx#997876

    This is also a very good post on this subject.

    http://e2e.ti.com/support/wireless_connectivity/f/851/t/265783.aspx

    Hi Ivor,

    The TI fix is discussed here and is in the latest demo code

    http://e2e.ti.com/support/wireless_connectivity/f/851/p/260521/916032.aspx#916032

    The Adafruit reference came from here

    http://e2e.ti.com/support/wireless_connectivity/f/851/p/265783/1107422.aspx#1107422

    I assume that this is some sort of race condition that differs between individual hardware assemblies; I have one that works so well that I took it out of the testing pool. Now I have to try them at different temperatures.

    Trevor

  • I spoke too soon: 2 failures in 48 hours. This is extremely frustrating.

    Trevor

  • Yes, I was fearing this. I am also getting similar failure rates in my lab. However as soon as I take it out of the lab and put it in my car, the reset rate goes up by a factor of 20. When driving through an area with many open APs, I can get up to 5 resets per hour.

    I don't mind resetting it often if it wasn't for the issues of exhausting the EEPROM. You see my device connects to open APs, and in order to initiate the scan I have to call a method that writes to the EEPROM. So after each reset the EEPROM has one less write cycle. And after about 120,000 to 130,000 resets, the CC3000 can no longer boot, since its EEPROM is dead. At this point the CC3000 is bricked for good and so is my device.

    I am hoping that TI has fixed this race condition in their latest patch, which is coming out in a couple of weeks. Otherwise I don't see how anyone in their right mind would even contemplate putting the CC3000 in a production device.

  • I'm facing the same problem, but for me it hangs in the while loop every 30-60 seconds. Some people suggested that this happens only on non-TI chips, but I'm using a TI LaunchPad (tm4c123gxl) with the example code from the latest TivaWare and it still happens.

    I found a few things that seem to drastically increase the hangs:

    1) increasing the cpu speed to 80mhz

    2) increasing the amount of data sent over the wifi. (I'm using a while(1){send(...)} which continuously sends large amounts of data - that's where it hangs.) 

    3) increasing the load on the mcu, by putting some sd card writes, etc.

    So when I do the above 3, it starts hanging every 30-60 seconds, if I lower them it can run for up to 15mins or so.

    And I was under the impression that this mcu (tm4c123gh6pm) cannot miss an interrupt? If an interrupt happens while interrupts are disabled, the flag gets set and when they are reenabled - the interrupt fires, no?

    If it is so then the only way to miss an interrupt would be to clear the int flag before they are reenabled? I'm starting to wonder if it's really a missed interrupt...

    I also tried continuously calling a function to check for a missed interrupt, but it doesn't help. The code looks like this:

    if (isInIrq == 0 && tSLInformation.ReadWlanInterruptPin() == 0)
    {
        SpiIntHandler();
    }

    The TI fix from this post http://e2e.ti.com/support/wireless_connectivity/f/851/p/260521/916032.aspx#916032 doesn't work in my case - cc3000 fails to initialize when I apply it. Probably because my driver version seems to be quite different (uses DMA, etc.). 

    I already tried everything I could think of (I'm not an expert though) with no effect, so I'll just wait for some time to see if a fix comes up, otherwise I'll have to go looking for another chip. Unfortunately restarting in my case is not an option and that bug makes the product unusable.

  • The issue is definitely not limited to non-TI MCUs - I am using MSP430F5310 for example. But it is great that you were able to reproduce it so frequently and with TI's own Tiva C series LaunchPad. Hopefully this post will help the TI engineers repro and fix it.

    Reading through all the posts regarding this issue suggests that the anomaly happens inside the CC3000 module with a certain frequency. All the different tricks proposed to either increase or decrease the period between hangs work by changing the frequency of your MCU's main loop, so that it falls in or out of phase with the CC3000's anomaly frequency. But the two can never be synchronized perfectly, so sooner or later there will be a hang.

    I haven't given up on the CC3000 yet and I am hoping that TI will fix this issue. Btw, I live in Sofia as well, so it's good to see that I am not the only crazy guy in town working with this module.

  • @Ivor ha it's nice to hear of fellow sofians here :)

    I think I came upon a workaround by Alan in another topic. I'm not sure if it completely fixes it in my case, but it's been running for ~90 mins already (it usually hangs within 1-2 mins):

    http://e2e.ti.com/support/wireless_connectivity/f/851/p/312391/1108120.aspx#1108120

    It's inserting a delay at the end of SpiIntHandler(), Alan is suggesting 100ms, but I put about 20-30 cycles I believe (ROM_SysCtlDelay(SysCtlClockGet() / 10000000)) and it seems to have an effect.

    That probably means that indeed it's not a missed interrupt, but more like a race condition, at least in my case. In other cases, there may be both? I'll keep running it to see how long it lasts.
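
    For reference, the length of that delay can be worked out directly. The TivaWare documentation states that the SysCtlDelay() loop takes 3 cycles per iteration; the sketch below applies that figure to the expression used above, assuming an 80 MHz system clock and ignoring call overhead. The helper names are made up for illustration.

```c
#include <stdint.h>

/* TivaWare documents the SysCtlDelay() loop as 3 CPU cycles per iteration. */
#define CYCLES_PER_DELAY_LOOP 3u

/* Cycles spent in SysCtlDelay(clock_hz / divisor), excluding call overhead. */
static uint32_t delay_cycles(uint32_t clock_hz, uint32_t divisor)
{
    return (clock_hz / divisor) * CYCLES_PER_DELAY_LOOP;
}

/* The same delay expressed in nanoseconds. */
static uint32_t delay_nanoseconds(uint32_t clock_hz, uint32_t divisor)
{
    uint64_t cycles = delay_cycles(clock_hz, divisor);
    return (uint32_t)(cycles * 1000000000ull / clock_hz);
}
```

    At 80 MHz, `delay_cycles(80000000u, 10000000u)` is 24 cycles, or 300 ns, which matches the "20-30 cycles" estimate and is several orders of magnitude shorter than 100ms.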

  • The fix with the delay that I mentioned above no longer seems to work. It probably had a full moon yesterday or whatever that kept it running for so long, but today it's back to the usual hanging, no matter how I tweak it.

  • Yep, in a lab environment with a strong AP signal it is possible to time the MCU code in such a way that the period between hangs becomes extremely long. However as soon as the device is taken into the unpredictability of the real world (by simply driving around with it), the problem reappears.

    In fact for me it is very difficult to reproduce the issue in the lab. But while driving it happens all the time. And I have tried at least 5 different "fixes" proposed by various people - none of them pass the driving test.

    Any densely populated area has a large number of APs with various signal strengths, which is the real WiFi environment and is ideal for testing. Also, when using a car it is very easy to vary the signal strength by simply driving away from the AP. Using this approach I discovered that the CC3000 frequently hangs when it is sending data while the connection to the AP is severed due to a weak signal.