CC3100 sl_Recv() hangs when used simultaneously with sl_Send

Other Parts Discussed in Thread: CC3100BOOST, CC3100

Hello all,

I am using the following setup:

CC3100BOOST connected to my custom motherboard via SPI interface.
SPI bitrate = 1 Mbps. CC3100BOOST is flashed with latest firmware.

I have the following observations:

1. Calling only sl_Send periodically, every 100 or 200 ms, works fine.

2. Calling only sl_Recv periodically, every 100 or 200 ms in non-blocking mode, works fine. In this case the other end sends data to the CC3100 every 5 or 8 s.

3. When sl_Send and sl_Recv are interleaved every 200 ms, execution hangs in sl_Recv -> _SlNonOsSemGet -> _SlNonOsMainLoopTask after a random time, generally less than 5 minutes.

My code for data reception is:

int RecvData(void)
{
    _i16 len;
    _i16 more;

    RMSRecvData_EntryCnt++;

    sl_Recv_mark = 0xaa;

    /* check if data is present (non-blocking read of the 5-byte prefix) */
    len = sl_Recv(RmsInfo_t.sockID, RmsInfo_t.recvData, 5, SL_MSG_DONTWAIT);

    sl_Recv_mark = 0xbb;

    /* if we have data */
    if(len > 0)
    {
        RMSRecvData_HaveDataCnt++;

        /* check if the data contains our serial number */
        if(CMP_MACH_SERNUM(RmsInfo_t.recvData))
        {
            sl_Recv_mark = 0x11;
            /* read the remaining data; a negative return is an error code,
               so add it to len only when bytes were actually received */
            more = sl_Recv(RmsInfo_t.sockID, &RmsInfo_t.recvData[5], 66, SL_MSG_DONTWAIT);
            if(more > 0)
            {
                len += more;
            }

            sl_Recv_mark = 0x22;
            RmsInfo_t.recvDataSz = len;
        }
    }
    RMSRecvData_ExitCnt++;

    return len;
}

The hang as mentioned in 3 occurs consistently. Every time the hang occurs,
sl_Recv_mark shows 0xaa, indicating that sl_Recv has not returned.

Can anyone shed some light on a possible reason
for sl_Recv to hang when working simultaneously with sl_Send?
Thanks
Ajit

  • Hi Ajit,

    Can you please provide a bit more information on whether you are using an OS environment? Also, are you using a single-threaded or multi-threaded environment?

    Thanks,
    Alon

  • Hi Alon,

    1. No, I am not using any OS. The code is part of a typical single threaded non-OS application.

    2. The CC3100 Send and Receive events are scheduled periodically using a system timer. Also note that the send and receive are 100% interleaved. For example, every 200 ms send and receive are scheduled alternately. So I believe there is a sufficient time gap between successive socket calls.

    3. Can you also provide a bit more information about the SPI flow control between the host and CC3100 ?

    Let me know if you need more information.

    Thanks,

    Ajit

  • Hi Ajit,

    I'm having a pretty similar problem. Take a look at:

    http://e2e.ti.com/support/embedded/tirtos/f/355/t/386050.aspx

    http://e2e.ti.com/support/wireless_connectivity/f/968/p/384881/1361499.aspx#1361499

    Please let me know if you find some workaround. Thanks! I will let you know if I find something too. 

  • Hello Alon, Fernando,


    ** I have attached 2 screen shots (sl_Recv_hang.jpg in this post and sl_Recv_good.jpg in next post) of my logic analyzer. The one named sl_Recv_hang shows the SPI activity for the sl_Recv() call that hung.

    The main observations are:

    1. The host began by sending the WRITE SYNC word + (Opcode + Len) + descriptor.

    2. In response CC3100 raised the host IRQ.

    3. Then the host sent the READ SYNC and started sending 0xFF bytes.

    4. Meanwhile CC3100 de-asserted the host IRQ. BUT then raised it again !

    After this the master stopped sending the 0xFF bytes and the SPI communication hung.

    ** The screen shot sl_Recv_good shows the SPI activity for the sl_Recv() call that returned ok.

    ** The main difference between the two is that in sl_Recv_good, the host had finished sending four 0xFF bytes after the READ SYNC before the CC3100 de-asserted the IRQ. In the call that hung, the host had not finished sending the four 0xFF bytes when the CC3100 de-asserted the IRQ (note that the CC3100 then raised the IRQ again). In this latter case, the software probably went into the IRQ handler and then the code hung.

    I do not know, if this difference is really significant. Does CC3100 have a timeout in this case ?

    Thanks.

    P.S could not add both files in this post. so the next post contains the 2nd file.

  • this post has sl_Recv_good.jpg.

  • Hi Ajit,

    We are looking into it.
    We will update you as soon as we have some more information.

    Thanks,
    Alon

  • Hi,


    We have had the same problem for a few months: http://e2e.ti.com/support/wireless_connectivity/f/968/p/371929/1310740.aspx

    At first the devices hung about once every 5 h, but since we increased the "sl_RecvFrom" calls from once a second to once every 100 ms, the devices get stuck about every 5 min.

    Best regards,

    David

  • Hello all,

    i have attached one more log from my logic analyzer. i am calling sl_Send and sl_Recv alternately every 50 millisec. the first few sl_Recv() calls work fine. and then the sl_Recv() socket call fails. the main observation for the failed sl_Recv call is as follows:

    1. the host sent the Write Sync (0x43211234) + SL_RECV opcode etc.

    2. cc3100 raised the interrupt.

    3. host sent the Read sync (0x87655678) and then began waiting for the D2H Sync. But the device NEVER sent the D2H sync ! i just see all 0x00 on the miso line.

    Please have a look into the logic analyzer data. i am unable to understand why cc3100 suddenly stopped responding with the D2H sync. is there something in the previous transactions that lead to the failure of cc3100 ?

    i could not attach the logic file for the Saleae Logic 1.1.15. so i have attached a screen shot.

    if somebody from TI support can provide an email id, would be happy to share the Saleae logic analyzer file since that will be more useful for analysis.

    Thanks

    Ajit

  • Hi Ajit


    Good post! I think this should help TI to find the issue.

    By the way, when I set the "always on" policy on the device, it fails less often:

    sl_WlanPolicySet(SL_POLICY_PM, SL_ALWAYS_ON_POLICY, NULL, 0);

    best regards,

    David

  • Hi Ajit,

    I have similar issues. I'm interested in the solution.

    Best,

    Dennis

  • Hi Guys,

    I am back with one more log. There is a slightly different scenario this time. In this case, for the failed sl_Recv() call, the following happened:

    1. host sends Write Sync + SL_RECV opcode

    2. cc3100 raised the irq.

    3. host sends read Sync.

    4. cc3100 lowered the irq and just kept on sending 0xBEDCCDAB. and the host just continued sending 0xffffffff.

    i could not find the file attach link here. will send the dropbox link to the file.

    @David, Dennis:

    1. what is your observation of the SPI host interface activity when the device hangs ?

    2. if you have logic analyser data, can you share it ?

    Regards,

    Ajit
  • Hi guys,
    want to share 1 more observation:

    in one failure scenario, I see that cc3100 raises the irq line a second time when sl_Recv is in progress. in this case,
    it sends the opcode SL_OPCODE_DEVICE_DEVICEASYNCDUMMY = 0x0063. the exact byte stream I see on the miso line is:
    63 00 04 00 17 01 00 00 f5 ff 10 c0 f5 ff 10 c0 be dc cd ab 0a......bf dc cd ab 00 63 04 00 17 01 00 00
    then the master stops sending the clock. i just hope this extra info helps TI to debug the problem.

    also have a look at this post. this is also the same failure scenario:

    http://e2e.ti.com/support/wireless_connectivity/f/968/p/387269/1367982#1367982

    TI, it seems that a lot of your customers are facing this issue. we would appreciate if you could be a bit more proactive in addressing this issue. we would be happy to share our logic traces and also to perform test cases which could help to gather data and debug the issue quickly. the only reason that we are persisting with this product is that we have spent quite a bit of effort working on this chip. i am sure, that just like us, other customers might also be thinking of shifting to other modules. i just hope that we are able to get cc3100 working ASAP.

    Regards
    Ajit
  • Hi Ajit,

    Sorry for the delayed response, but I want to assure you that we are looking into that.
    Can you please send me your entire code so I can review it?
    In addition, can you please check that you are not calling any driver APIs from within an interrupt callback function?
    I suspect that you might be calling a driver API from within a driver context (driver IRQ CB). This is not allowed and might cause issues as you are describing in your post.
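A minimal sketch of the pattern described above, in which the interrupt handler only records that work is due and the foreground loop makes the driver call. The names `timer_isr`, `app_receive`, and `main_loop_step` are hypothetical and are not part of the SimpleLink driver; `app_receive` stands in for whatever routine wraps `sl_Recv` in the application.

```c
#include <stdbool.h>
#include <stdint.h>

/* Set by the ISR, consumed by the main loop. volatile because it is
 * shared between interrupt and foreground context. */
static volatile bool g_rx_tick_pending = false;

/* Hypothetical timer ISR: it only records that work is due.
 * No sl_* driver call is made from interrupt context. */
void timer_isr(void)
{
    g_rx_tick_pending = true;
}

/* Hypothetical stand-in for the application's receive routine
 * (for example a wrapper around sl_Recv); it only counts calls here. */
static uint32_t g_recv_calls = 0;
static void app_receive(void)
{
    g_recv_calls++;
}

/* One pass of the foreground loop: run the deferred work, if any.
 * Returns true if the receive routine was called. */
bool main_loop_step(void)
{
    if (!g_rx_tick_pending) {
        return false;
    }
    g_rx_tick_pending = false;
    app_receive();   /* driver API is called from foreground context only */
    return true;
}
```

With this split, a timer tick that arrives while a socket call is in progress is simply picked up on the next pass of the loop.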

    Thanks,
    Alon
  • Hi Ajit,
    I can't capture the whole SPI exchange. I'm using Saleae Logic to capture the SPI activity but it takes too much memory; my test PC slows down before the CC3100 and host MCU stop.
    Regards,
    Dennis
  • Hi Alon,

    Thanks for the reply.

    1. Following is the dropbox link to the software module "rms" that implements the socket io.

    https://www.dropbox.com/s/yz95phqt11gng8u/rms.zip?dl=0

    Please read the readme.txt. it contains comments that will help you to understand the basic code flow.

    2. Also, I am not calling a driver API from within the host irq isr. Request you to please
    verify the same. I am listing the failure modes again:

    i. cc3100 does not send the D2H sync after raising the irq.
    ii. cc3100 just keeps on sending the D2H sync and does not send the opcode+status etc.
    iii. cc3100 raises the irq in quick succession and sends SL_OPCODE_DEVICE_DEVICEASYNCDUMMY

    i would assume that the above failure modes are in a way independent of whether there is
    a driver api call from an irq context.

    Let me know if you need more information.

    Thanks,

    Ajit

  • Hi Ajit,

    Thanks for the detailed debug info.

    I'll try to reproduce the issues you are experiencing.

    What TCP server have you used on the other side? Is it something generic or something you implemented yourself?

    I'm asking since I want to build the same setup.

    Regards,

    Shlomi

  • Hi Shlomi,

    the TCP server at the other end is not generic. it is implemented for our application needs.
    it continuously receives data from my CC3100 based board. and whenever the user selects,
    it sends data to my board.

    regards
    Ajit
  • Ok. So I'll use a simple TCP server that sends every 5 seconds or so (just as you started on your first post) and continuously receive from CC3100 packets every ~200mSec.

  • Hi Ajit,

    In order to build a setup as similar as I can to what you have (and without diving into your code at the first stage), I used an MSP5529LP platform with non-OS as well.

    Then, I use CC3100 as TCP client and my PC as TCP server doing send and receive every 100mSec. Receive is non blocking.

    Bottom line, I could not reproduce it.

    I'll try to look at your code to understand more.

    Just as a reference, please have a look at my BsdTcpClient().

     

    static _i32 BsdTcpClient(_u16 Port)
    {
        SlSockAddrIn_t Addr;
        _u16 idx = 0;
        _u16 AddrSize = 0;
        _i16 SockID = 0;
        _i16 RetStatus = 0;
        _u16 LoopCount = 0;
        _i16 recvSize = 0;

        /* fill the transmit buffer with a known pattern */
        for (idx = 0; idx < BUF_SIZE; idx++)
        {
            uBuf.BsdBuf[idx] = (_u8)(idx % 10);
        }

        Addr.sin_family = SL_AF_INET;
        Addr.sin_port = sl_Htons((_u16)Port);
        Addr.sin_addr.s_addr = sl_Htonl((_u32)IP_ADDR);
        AddrSize = sizeof(SlSockAddrIn_t);

        SockID = sl_Socket(SL_AF_INET, SL_SOCK_STREAM, 0);
        if( SockID < 0 )
        {
            CLI_Write(" [TCP Client] Create socket Error \n\r");
            ASSERT_ON_ERROR(SockID);
        }

        RetStatus = sl_Connect(SockID, (SlSockAddr_t *)&Addr, AddrSize);
        if( RetStatus < 0 )
        {
            sl_Close(SockID);
            CLI_Write(" [TCP Client] TCP connection Error \n\r");
            ASSERT_ON_ERROR(RetStatus);
        }

        while (LoopCount < NO_OF_PACKETS)
        {
            RetStatus = sl_Send(SockID, uBuf.BsdBuf, BUF_SIZE, 0 );
            if( RetStatus <= 0 )
            {
                CLI_Write(" [TCP Client] Data send Error \n\r");
                RetStatus = sl_Close(SockID);
                ASSERT_ON_ERROR(TCP_SEND_ERROR);
            }
            CLI_Write(" [TCP Client] Data transmitted \n\r");

            Delay(100);

            /* non-blocking receive: keep reading until the buffer is full
               or the device reports that no more data is available */
            recvSize = BUF_SIZE;
            do
            {
                RetStatus = sl_Recv(SockID, &(uBuf.BsdBuf[BUF_SIZE - recvSize]), recvSize, SL_MSG_DONTWAIT);
                if( (RetStatus <= 0) && (RetStatus != SL_EAGAIN) )
                {
                    sl_Close(SockID);
                    CLI_Write(" [TCP Client] Data recv Error \n\r");
                    ASSERT_ON_ERROR(TCP_RECV_ERROR);
                }
                if( RetStatus > 0)
                {
                    recvSize -= RetStatus;
                    CLI_Write(" [TCP Client] Data recvieved \n\r");
                }
                if( RetStatus == SL_EAGAIN)
                {
                    break;
                }
            }
            while(recvSize > 0);

            LoopCount++;
        }

        RetStatus = sl_Close(SockID);
        ASSERT_ON_ERROR(RetStatus);

        return SUCCESS;
    }

     

    Regards,

    Shlomi

  • Hi Shlomi,

    Thanks! Your support is appreciated.

    your test looks similar to my use case. just to repeat, my most recent use case:
    the cc3100 is the client which does sl_Send() and sl_Recv() alternately every 50 millisec.
    the tcp server receives the client data and sends data to it, every 100 millisec.

    is the cc3100 interfaced to the MSP5529LP over SPI?

    1. if the interface is spi, could i request you to please capture the spi traffic using a logic analyser and send it to me ?
    a comparison with my trace might give some additional information.
    2. does the cc3100 require some minimum time interval between successive socket calls ?
    3. could there be an issue if cc3100 receives data over air when a spi transfer is in progress ?

    i will share my Saleae logic traces via dropbox tomorrow. i request you to pls. have a look at those.
    i am also looking forward to your comments on my code.

    Thanks
    Ajit
  • Hi Ajit,

    Yes, CC3100 is connected to MSP5529LP over SPI.

    I am still trying to reproduce your use case; this time I implemented a TCP server using Python. Per your request, it sends and receives in an interleaved manner every 50 ms.

    I'll try to capture some SPI traffic but if the issue does not occur you would not see anything interesting (it can be used as a reference for a good case but you can also capture it locally).

    There is no connection between sending/receiving packets over the air and the communication over SPI. Each one is independent and has its own flow control mechanism. Over the air, it uses the TCP flow control mechanism; for the SPI interface there is a proprietary flow control on the TX buffer pool.

    Will update later.

    Shlomi

  • Hi Ajit,

    One more observation: when I look at your logic figures, it seems that many times I could see that data lines (MISO, MOSI) are transmitting data that is not aligned with the clock. Is it possible that you have some noise/jitters on the SPI lines? What I did is captured the SPI lines in my case and I could always see that the clock and data lines are perfectly aligned. Have you tested your use case on TI's EVB as well?

    Regards,

    Shlomi

  • Hi Shlomi,

    Thanks for the feedback. I have one more SPI slave connected on the bus. So you are
    probably seeing its data traffic. You will observe that the cc3100 CS is high during this
    time and the clock is 1 MHz (cc3100 works at 2 MHz). however, I am also analysing whether there is
    some noise introduced on the spi bus due to the multi slave config (it seems that this other slave
    is selected and de-selected very fast during its operation cycle, so I suspect that some noise is
    induced on the bus due to the transients). will update as soon as I have something conclusive.

    Regards
    Ajit
  • me too.... see http://e2e.ti.com/support/wireless_connectivity/f/968/t/387269 for details....

    i have saleae logic trace file that i can share....
  • hi Bob,
    can you share your saleae logic trace ?

    Ajit
  • here is a zip file, which contains a saleae data dump:  simplink-fail.zip

    the extra glitch on the WIFI-REQ line occurs at 24.8s....

    decoding the messages over SPI MOSI/MISO, you'll see that i am invoking sl_Send() every second and sl_Recv() every 200ms....

  • Hi Ajit, Bob,

    Thanks for the extra info from Bob. I went through the logic logs (I believe Ajit is seeing the same behavior) and I believe I can shed some light on what you are experiencing.

    The way it is supposed to work in your use case is as follows (I assume non-OS):

    • sl_Recv() is invoked and the command is sent to the device
    • At the end of this function, the host waits on SyncObj
    • Inside SyncObj, _SlNonOsMainLoopTask is invoked, waiting for an interrupt to occur
    • When an interrupt is triggered, the registered _SlDrvMsgReadSpawnCtx CB is invoked, which sends a CNYS to the device and reads the rest of the data
    • Sending CNYS to the device causes the interrupt line to be deasserted

    Interrupt line would stay asserted until the device gets a CNYS pattern so in your case you probably didn't get the pattern. The question is why?

    Well, reading your post some more, it appears that you are using an edge-triggered interrupt but also masking the interrupt line. In this case, you obviously miss the 2nd interrupt you get from the device (this one could be an unsolicited event). Since you miss this interrupt, when you call the sl_Recv() API again later on, the device will not assert the corresponding interrupt again, because the line is already asserted. The response to sl_Recv() remains pending inside the device until you service the pending interrupt by sending CNYS.

    The question is: why are you masking the interrupt? If you were using a level-triggered interrupt I could understand, but since you are using edge triggering, you may miss interrupts. Can you try removing the masking of interrupts and see if it works for you?
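The missed-edge scenario can be illustrated with a toy model. This is only a sketch under stated assumptions: the host pin is rising-edge triggered and does not latch edges that arrive while it is masked (many real interrupt controllers do latch a pending flag, so check the MCU datasheet). All names here (`irq_model_t`, `irq_set_line`, and so on) are made up for the illustration.

```c
#include <stdbool.h>

/* Toy model of a host IRQ pin configured for rising-edge detection. */
typedef struct {
    bool line_high;    /* current level driven by the device       */
    bool masked;       /* host has masked the interrupt            */
    unsigned handled;  /* number of times the handler has run      */
} irq_model_t;

/* The device drives the line; a rising edge runs the handler
 * only if the interrupt is not masked at that moment. */
void irq_set_line(irq_model_t *m, bool level)
{
    bool rising = level && !m->line_high;
    m->line_high = level;
    if (rising && !m->masked) {
        m->handled++;
    }
}

void irq_mask(irq_model_t *m)
{
    m->masked = true;
}

/* Plain unmask: an edge that arrived while masked is lost for good. */
void irq_unmask(irq_model_t *m)
{
    m->masked = false;
}

/* Unmask that also samples the level, so a line that was left
 * asserted while masked still gets serviced. */
void irq_unmask_and_check_level(irq_model_t *m)
{
    m->masked = false;
    if (m->line_high) {
        m->handled++;
    }
}
```

In this model, an edge that arrives between `irq_mask` and `irq_unmask` never reaches the handler, while the level-checking variant services it. Not masking the line at all, as suggested above, avoids the window entirely.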

    Regards,

    Shlomi

  • Appears that I'm having the same issues as Ajit and don't appear to be masking the interrupt. I found, as others mentioned, that adding some delay between _SlDrvMsgWrite() (at the end of the function) and _SlDrvMsgReadSpawnCtx() (at the beginning) appears to help, but does not fix the issue. I've tried everything that I can think of to recover from the error, but only a reset of the CC3100 appears to lower the interrupt line.
  • without the masking of the REQ interrupt, things are much better -- but not perfect....

    at this time, i'm experiencing an occasional glitch on the SPI connection that causes the 0xBEDCCDAB sync header from the slave to become garbled -- at which point the host has no real way to recover, as far as i can tell....

    but, assuming this is an electrical connection issue, i no longer see the "missed" REQ interrupt -- which is a positive step forward....
  • My biggest issue at the moment is that after I fail this line in _SlDrvRxHdrRead():

    ORIGINAL VERIFY_PROTOCOL(SyncCnt < SL_SYNC_SCAN_THRESHOLD);

    I'm unable to recover. Is there any way to handle this error and recover without a reset?
  • Thanks Bob,

    Let me know if you find something that can make noise on SPI lines.

    Shlomi

  • Shlomi,

    I seem to be running into a related error pretty frequently. In the function _SlDrvRxHdrRead() I fail the test: VERIFY_PROTOCOL(SyncCnt < SL_SYNC_SCAN_THRESHOLD); inside the while loop.

    The function _SlDrvMsgReadSpawnCtx() eventually leads to this error. In practice, I've been able to reduce the frequency of the error by including a delay of 25 to 50 ms at the start of _SlDrvMsgReadSpawnCtx(), but this slows down our application and doesn't eliminate the errors. It appears that calling _SlDrvMsgWrite() and then quickly calling _SlDrvMsgReadSpawnCtx() may be the issue or at least related.

    It appears I'm missing a CNYS pattern and simply hanging?

    My only solution so far has been to reset the CC3100 and start the application from scratch, is there a way to prevent the error or at least gracefully recover and allow me to handle the error without a reset?

    Thanks!
  • Hi All,

    I would like to try and summarize all the users and use cases so we better understand how we can fix it.

    Generally, I've tried many setups as similar as I can to what you have but unfortunately could not reproduce the issue. I'll continue trying but I need you to fill in the ? entries in the table below.

    The fastest way would be to reproduce on one of TI's EVBs and if it is reproducible, then just share the code and we can take it from this point on.

    Also, it seems that the suggestions I made to not use masking of interrupt helped Bob. Bob, can you tell where you stand at the moment?

    Also, Ajit, any feedback on the shared SPI? What happens if you completely remove the other SPI device and leave only the CC3100 connected to your host? Is it possible to test?

    The table is as follows:

    User: Ajit
    Description:
    • Sl_recv() non blocking
    • Sl_recv() and sl_send() interlaced every 100mSec or so
    • execution hangs in sl_Recv -> _SlNonOsSemGet -> _SlNonOsMainLoopTask in generally few minutes
    • 4 scenarios reported:
    1. CNYS sent from host causes IRQ to deassert and then another IRQ is asserted but never deasserted again
    2. After sending CNYS by host, interrupt is deasserted but the device never send D2H signal back
    3. After sending CNYS by host, interrupt is deasserted, the device send D2H signal back but never the rest of opcode+status
    4. Device send dummy packet and the host stops generating SPI clock
    Versions (CC31xx device and service pack): SP? CC3100?
    Platform: custom
    OS/Non-OS: Non-OS
    Shared SPI: yes
    SPI rate: 1MHz

    User: Bob
    Description:
    • Sl_recv() non blocking
    • Sl_recv() and sl_send() interlaced every 100mSec or so
    • execution hangs in sl_Recv -> _SlNonOsSemGet -> _SlNonOsMainLoopTask in generally few minutes
    • scenario reported is: CNYS sent from host causes IRQ to deassert and then another IRQ is asserted but never deasserted again
    Versions (CC31xx device and service pack): SP 1.0.0.1.1, CC3100?
    Platform: ?
    OS/Non-OS: Non-OS
    Shared SPI: ?
    SPI rate: ?

    User: Evan
    Description:
    ??? scenario is not very clear.
    Seems to get stuck at: _SlDrvRxHdrRead():
    ORIGINAL VERIFY_PROTOCOL(SyncCnt < SL_SYNC_SCAN_THRESHOLD);
    Versions (CC31xx device and service pack): ?
    Platform: ?
    OS/Non-OS: ?
    Shared SPI: ?
    SPI rate: ?

    Regards,

    Shlomi

  • some new information based on additional testing these last two days..... my original issue with the "extra REQ" pulse has been resolved; that was an oversight on our part.... but we still are seeing two (somewhat unrelated) issues....

    first, the SPI connection: originally running at 1mbps over a dedicated SPI connection, i was seeing "corruption" of the 0xBxCDDCAB response from the slave; this in turn would lead to the host continually reading to no end.... to help here, we did several things: 1) we've improved the physical connection between our two dev boards, using shorter wires that were directly soldered to the connectors; 2) we've inserted a 10us delay between the time we assert SS and we perform the first SPI transfer; 3) we've inserted some spacing between the individual bytes of the SPI transfer; and we've lowered the data rate to 200kbps.... i've yet to see any failure on my setup after running for several hours; my colleague, however, has observed failures of this type on a different set of boards....

    a more difficult issue, however, is that we've seen several cases (after running for 30-60 minutes) of the system essentially getting stuck in an infinite polling mode.... here's the scenario: 1) we receive a REQ intr; 2) we call the driver.c handler, which "spawns" a task; 3) we return to our main loop, which is alternately calling sl_Recv() and the main sl scheduling loop.... the problem is that the latter never seems to run the spawned task!!!!

    looking through driver.c and nonos.c (and having lots of compiler experience), i can certainly imagine a number of "race conditions" that might be occurring.... there are global variables that should be (at least) declared volatile and possibly manipulated atomically with interrupts disabled.... a prime example are the RxIsrCnt and RxDoneCnt variables -- which are compared to see if tasks need to be run.... we've compiled your C code using a very aggressive form of inter-procedure optimization; and we happen to be targeting an *8-bit* MCU!!!!! the latter implies that even reading/writing a 16-bit int is more than one instruction; that is, it is *not* atomic.... and tight loops where you are constantly polling shared variables in the face of interrupts can also lead to "missing" a software signal.....
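A minimal sketch of the kind of protection described here: both counters are read inside a short critical section so the comparison works on a consistent snapshot. This is not the driver's code. The counter names, their 16-bit width, and the `port_irq_disable`/`port_irq_restore` hooks are assumptions for illustration; on a real target the hooks would save, clear, and restore the global interrupt-enable bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical port hooks, stubbed so the sketch builds on a PC.
 * The depth counter only lets us check that calls are balanced. */
static int g_irq_disable_depth = 0;
static void port_irq_disable(void) { g_irq_disable_depth++; }
static void port_irq_restore(void) { g_irq_disable_depth--; }

/* Counters shared between ISR and main loop (illustrative names). */
static volatile uint16_t g_isr_cnt  = 0;  /* written only in the ISR       */
static volatile uint16_t g_done_cnt = 0;  /* written only in the main loop */

/* Interrupt context: note that one more message is waiting. */
void rx_isr(void)
{
    g_isr_cnt++;
}

/* Main loop: take a consistent snapshot of both counters with
 * interrupts disabled, then compare the local copies. */
bool rx_work_pending(void)
{
    uint16_t isr;
    uint16_t done;

    port_irq_disable();
    isr  = g_isr_cnt;
    done = g_done_cnt;
    port_irq_restore();

    return isr != done;
}

/* Main loop: mark one message as processed. */
void rx_work_done(void)
{
    port_irq_disable();
    g_done_cnt++;
    port_irq_restore();
}
```

On an 8-bit MCU the multi-byte reads inside the critical section can no longer be split by the ISR, which is the failure mode suspected above.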

    short of "solving these problems" (which we'll continue to investigate), we will probably institute some form of "soft recovery" or "software healing" strategy in our own code.... at the end of the day, we have the source code; and we can easily detect when either of these issues has occurred.... ideally, we'd like to recover somewhat gracefully -- which in the worst case entails a reset of the cc3100....
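One possible shape for such a "soft recovery" is a bounded wait that falls back to a recovery hook instead of spinning forever. This is a generic sketch, not the SimpleLink API: `wait_or_recover`, the callback types, and the `demo_*` helpers are hypothetical, and what the recovery hook does (for example power-cycling and re-initialising the cc3100) is left to the application.

```c
#include <stdbool.h>
#include <stdint.h>

typedef bool (*poll_fn_t)(void *ctx);     /* true once the response has arrived   */
typedef void (*recover_fn_t)(void *ctx);  /* e.g. reset and re-initialise device  */

/* Poll at most max_polls times. If the response never arrives,
 * invoke the recovery hook and report failure to the caller. */
bool wait_or_recover(poll_fn_t poll, recover_fn_t recover,
                     void *ctx, uint32_t max_polls)
{
    uint32_t i;

    for (i = 0; i < max_polls; i++) {
        if (poll(ctx)) {
            return true;
        }
    }
    recover(ctx);
    return false;
}

/* Demo device used only to exercise the helper above. */
typedef struct {
    uint32_t polls_left;  /* polls that still return "not ready" */
    uint32_t resets;      /* how many times recovery was invoked */
} demo_dev_t;

static bool demo_poll(void *ctx)
{
    demo_dev_t *d = ctx;
    if (d->polls_left == 0) {
        return true;
    }
    d->polls_left--;
    return false;
}

static void demo_recover(void *ctx)
{
    demo_dev_t *d = ctx;
    d->resets++;
}
```

A poll budget is used here instead of a wall-clock timeout only to keep the sketch free of platform timer calls; a real implementation would more likely compare against a system tick.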
  • Bob,

    Are these two observations (mainly the 2nd one) captured or just theoretical?

    Regarding #1, the D2H sync pattern always looks as you stated, i.e. 0xBxCDDCAB, where x carries a 2-bit wrap-around counter that indicates the sequence number. Are you saying that you miss the sequence number?

    Regarding #2, I do not see such use case happening for several reasons: 

    1. RxIsrCnt is always modified from an interrupt context
    2. the interrupt context always gets priority over the driver context and preempts it whenever an interrupt is received
    3. RxIsrCnt is defined as volatile

    I'll address this post to R&D for further look but just to make sure, is the 2nd observation theoretical or did you actually see it happening?

    Shlomi

  • Shlomi,

    After reading Bob's reply to your post I think this condition is basically what we're seeing:

    "first, the SPI connection: originally running at 1mbps over a dedicated SPI connection, i was seeing "corruption" of the 0xBxCDDCAB response from the slave; this in turn would lead to the host continually reading to no end.... "

    For us, the failure occurs here: In the function _SlDrvRxHdrRead() I fail the test: VERIFY_PROTOCOL(SyncCnt < SL_SYNC_SCAN_THRESHOLD); inside the while loop.

    I'll try out some of Bob's suggestions, but we have experienced a decrease in errors when slowing down the SPI transfer or pausing between transfers, though this is not a good fix for our application. It does appear that we're either getting a corrupted response or simply missing it. I'll try some more experimentation today.

    To complete your table: We're using the CC3100 w/ version 1.0.0 of the SDK and SP 1.0.0.1.2. It's a custom Non-OS platform with a dedicated SPI at 4 MHz.

    If you're unable to reproduce errors that would lead to the failure Bob and I are seeing, is it possible to advise us on a way to overcome this error by retrying the transfer? I've tried numerous snippets of code and have never been able to recover short of resetting the CC3100.

    thanks!
  • Just did a quick test printing out the received SPI data and got the same error. It occurs during a series of consecutive sl_Send() calls.

    After a long sequence of successfully transmitted data, we get the following:
    ...
    Read: 0xBDDCCDAB
    Read: 0x63000400
    Read: 0x14010000
    Read: 0xBEDCCDAB
    Read: 0x63000400
    Read: 0x15010000
    Read: 0xBFDCCDAB
    Read: 0x63000400
    Read: 0x14010000
    Read: 0x2E020000
    Read: 0x2E020000
    Read: 0x2E020000
    Read: 0x2E020000
    Read: 0x2E020000
    Read: 0x2E020000
    ... Repeats well over 100 times, more like SL_SYNC_SCAN_THRESHOLD times I assume?
    Read: 0x2E020000
    ERROR

    Then I fail the line ORIGINAL VERIFY_PROTOCOL(SyncCnt < SL_SYNC_SCAN_THRESHOLD); in _SlDrvRxHdrRead()

    I also hit the same error from the line above in sl_Recv() immediately after finishing the last sl_Send() call. The read data looked like:

    Read: 0xBDDCCDAB
    Read: 0x63000400
    Read: 0x15010000
    -- Finished sl_Send(), starting sl_Recv() --
    Read: 0x26020000
    Read: 0x26020000
    Read: 0x26020000
    Read: 0x26020000
    ... Repeats 100s of times
    Read: 0x26020000
    ERROR

    So maybe it's a different issue than Bob's after all... Does 0x2E020000 or 0x26020000 signify anything? It looks to me like the SDK is looking for a sync pattern that never comes due to the repeated 0x2E020000 or 0x26020000 transfers?
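The sync words in the log above (0xBDDCCDAB, 0xBEDCCDAB, 0xBFDCCDAB) differ only in the low two bits of the first byte, which is consistent with the 2-bit sequence counter mentioned earlier in the thread. A small sketch for classifying logged words on that basis follows. The constants are inferred from the values printed in this thread, not taken from the driver headers, and the function names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Sync word as printed in the logs above, with the sequence bits zeroed.
 * Inferred from the traces in this thread, not from driver headers. */
#define D2H_SYNC_BASE      0xBCDCCDABu
/* The low two bits of the first printed byte carry the sequence number. */
#define D2H_SYNC_SEQ_MASK  0x03000000u

/* True if the logged word is a sync pattern with any sequence number. */
bool d2h_is_sync(uint32_t word)
{
    return (word & ~D2H_SYNC_SEQ_MASK) == D2H_SYNC_BASE;
}

/* Extract the 2-bit sequence number from a logged sync word. */
uint8_t d2h_seq(uint32_t word)
{
    return (uint8_t)((word & D2H_SYNC_SEQ_MASK) >> 24);
}
```

By this check the repeated 0x2E020000 and 0x26020000 words are not sync patterns, which matches the observation that the scan loop runs until it reaches the threshold.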

    Any suggestions as to how to fix the error or gracefully retry the send? Hope this helps...
  • After looking at it a bit more, I appear to be getting 0x2xxxxxxx from the CC3100 instead of the sync pattern and am forced to reboot after this occurs...
  • regarding #1, after processing a REQ intr, the host has sent 0x65877856 and the NWP responded with 0x79B99B57; at this point, the host keeps trying and eventually gives up.... i believe this was one of the "async" REQ intrs, in that it appears to have occurred very quickly after finishing an sl_Recv() cycle....

    regarding #2, we have two captured traces of this.... the host sends a recv command to the NWP; the NWP responds with a REQ intr; the host takes the interrupt and calls the registered Rx handler (presumably spawning a task); and then nothing happens.... the host never sends the 0x65877856 pattern; and the NWP of course never lowers the REQ line....

    regarding atomicity, the real issue is when the polling loop *reads* RxIsrCnt and then compares it to RxDoneCnt.... again, on an 8-bit machine that is *not* atomic; an interrupt can occur *between* the instructions for just reading RxIsrCnt -- giving the driver context a false reading....

    furthermore, there are other "shared variables" -- specifically, the table holding the various spawn entries.... having had some experience with kernels through my career (i am "bios.bob" === SPOX | DSP/BIOS | SYS/BIOS | TI-RTOS), the trick is ensuring that *scheduler decisions* are made with shared variables in a stable and consistent state....

    my plan was to add some critical sections around any code that reads (and/or writes) any shared variables also manipulated in the ISR context....

    at this point, we've added some more debug toggles to our code -- just to verify that we are somehow dropping a spawned task, hence spinning forever....
  • Hi Bob,

    #1 The pattern you got as a response from NWP (0x79B99B57) is a real pattern, but it was shifted left by 1 bit.
    Shifting it right you should get 0x3CDCCDAB which is the pattern we are expecting from NWP.
    In this case you should try to shield the lines and keep your traces as short as possible.
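The arithmetic above can be checked with a short helper. This is only a worked example: `undo_left_shift1` is a made-up name, and because the original top bit is shifted out of a 32-bit capture, the caller has to supply it.

```c
#include <stdint.h>

/* Undo a one-bit left shift of a 32-bit word. The original most
 * significant bit is lost by the shift, so it is passed in (0 or 1). */
uint32_t undo_left_shift1(uint32_t shifted, uint32_t lost_msb)
{
    return (shifted >> 1) | ((lost_msb & 1u) << 31);
}
```

`undo_left_shift1(0x79B99B57, 0)` gives 0x3CDCCDAB, the value quoted above. With the lost top bit set, `undo_left_shift1(0x79B99B57, 1)` gives 0xBCDCCDAB, which has the same form as the 0xBxDCCDAB sync words seen in the other traces.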

    #2 Can you send the traces?
    The irq counts we use are both u8 (RxDoneCnt & RxIsrCnt). The RxIsrCnt is incremented only during ISR context, and being read on the main loop.
    In case the systems gets into the state where the 0x65877856 pattern is never sent - can you extract the counters values?

    Thanks,
    Yaki
  • hi yaki,

    this .zip file contains a saleae trace of case #2:  SimpLinkFail.zip

    as you'll see, the last activity on the host was receipt of the REQ intr....   the 'Debug B' line is bracketing my own ISR; the 'Debug A' pulse occurs immediately after driver.c increments the RxIsrCnt....

    in the "working" cycles that precede the last, 'Debug A' will pulse either 3 or 4 times when RxDoneCnt is incremented in driver.c -- 3 times in the readCtx; 4 times in the spawnCtx....

    this morning, i added an atomic critical section to the MainLoop itself -- besides protecting the read of the various Cnt variables, it ensures that the table of spawn entries are read/cleared consistently....   remember -- the rxIsr *also* calls spawn(), which has to write several bytes into different fields of a spawn struct....  the scheduler needs to atomically read/clear these entries....
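A rough sketch of what an atomic read-and-clear of a spawn entry could look like. This is not the code from nonos.c: the single-slot table, `crit_enter`/`crit_exit`, and all other names are assumptions for illustration, and the critical-section hooks are stubbed so the example builds on a PC.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*spawn_fn_t)(void *arg);

/* A one-entry "spawn table" shared between ISR and main loop. */
typedef struct {
    spawn_fn_t fn;
    void *arg;
} spawn_slot_t;

static volatile spawn_slot_t g_slot = { NULL, NULL };

/* Hypothetical critical-section hooks; on a real MCU these would
 * disable and restore interrupts. */
static unsigned g_crit_depth = 0;
static void crit_enter(void) { g_crit_depth++; }
static void crit_exit(void)  { g_crit_depth--; }

/* Called from the ISR: publish the argument first, then the function. */
void spawn_from_isr(spawn_fn_t fn, void *arg)
{
    g_slot.arg = arg;
    g_slot.fn  = fn;
}

/* Called from the main loop: take and clear the entry as one atomic
 * step, then run the task outside the critical section. */
bool spawn_run_pending(void)
{
    spawn_fn_t fn;
    void *arg;

    crit_enter();
    fn  = g_slot.fn;
    arg = g_slot.arg;
    g_slot.fn  = NULL;
    g_slot.arg = NULL;
    crit_exit();

    if (fn == NULL) {
        return false;
    }
    fn(arg);
    return true;
}

/* Demo task used only to exercise the scheduler step. */
static void demo_task(void *arg)
{
    int *hits = arg;
    (*hits)++;
}
```

Taking a local copy and clearing the slot in the same critical section means an ISR that spawns again right afterwards writes into an empty slot instead of having its entry wiped by a late clear.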

    at this point, i'm continuing to run tests -- and haven't seen a failure yet....

    regarding #1, i agree there is a noise issue here; i also noticed the 1-bit shift of the data....

  • Thanks Bob.

    In the last transaction, your MOSI writes 1 byte (of 4) of the sync message and the next write (3 more bytes) is delayed by ~10 ms?

    After this state you get an IRQ. Upon getting an IRQ, you register to spawn and should read the message by sending 0x65877856.

    In this exact state, can you send the values of the following parameters: RxIsrCnt, RxDoneCnt, g__SlNonOsCB.SpawnEntries[] ?

    Thanks,

    Yaki

    i've added some code into _SlNonOsMainLoopTask() which makes the scheduling decision *and* the data-structure update atomically.... once i added this code, i haven't seen a failure here!!!!

    failures caused by noise on the SPI lines do occur; but those can be solved with a better physical connection than i'm currently using....

    to help verify that the scheduler changes are robust, i'll do some testing using an MSP430 launchpad that is directly connected to the cc3100 booster-pack; this will hopefully give me a more secure SPI connection, allowing me to focus on verifying my scheduler change....
  • Thanks Bob.
    We will contact you for getting more technical details about this issue.
  • Bob, Ajit, all,

    Thanks for the debug effort.

    Yaki is in contact with Bob offline to bring this to closure.

    For better followup, I suggest closing this thread for now. Please make sure on your side whether the issue still persists.

    If it does, please open a new post and link to this one.

    Regards,

    Shlomi

  • I'm having the same problem. Can you tell me what the solution is?
  • Hello,

    Could a solution be provided in the forum for others experiencing the same problem?

    During normal operation of our product the cc3100 host driver occasionally hangs at:

    VERIFY_PROTOCOL(SyncCnt < SL_SYNC_SCAN_THRESHOLD);

    Within _SlDrvRxHdrRead.

    Currently our only solution is to completely reboot the cc3100 to recover, which obviously is non-ideal.  Can a solution be provided to gracefully recover from this error?

    Thanks,

    Ryan

  • Any news about that issue? I'm having the same troubles. The CC3100 is hung after like 3 minutes of connectivity. Sometimes it takes 10 seconds and sometimes it takes 5 minutes. What is the workaround for this issue?
  • I'm having similar issues.

    Is there a solution to this problem?

    Thanks,

    Paul