LAUNCHXL-CC2640R2: Dequeue of application message queue interrupted by GAP callbacks

Part Number: LAUNCHXL-CC2640R2


Hello,

We are currently working on a project with the following details:

Device: CC2640R2LAUNCHXL
SDK: simplelink_cc2640r2_sdk_4_40_00_10
IDE: IAR 8.32.2
Reference project: multi role project

In our application, we are looping through the following actions:

1. Post an application event to call GAPRole_StartDiscovery with the whitelist enabled for a specific device.
2. On receiving GAP_DEVICE_INFO_EVENT, initiate a connection with the found device by posting another application event to call GAPRole_EstablishLink.
3. On receiving GAP_LINK_ESTABLISHED_EVENT, post an event to call GAPRole_CancelDiscovery and, in parallel, post an event to disconnect the link.

After step 3, we receive no GAP callback for GAPRole_CancelDiscovery (which returned SUCCESS), and the application queue message to disconnect (which was enqueued without any error) is never dequeued. The stack message received after this stall is HCI_BLE_HARDWARE_ERROR_EVENT_CODE. Does this mean that the heap has failed in some way?

Also, please note that we are using dynamic allocation for the application queue, and the enqueuing succeeds without any heap failures.

After some time, the device under test resets.
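The loop described in this post can be sketched as a small host-side simulation of the application queue. This is only an illustration under stated assumptions: AppEvt, AppQueue, app_enqueue, app_dequeue and on_link_established are hypothetical names, not SDK symbols, the GAPRole calls appear only in comments, and a fixed-size ring buffer stands in for the dynamically allocated queue mentioned above. It shows that step 3 places two events in the queue back to back, so the scan cancel and the disconnect would be processed in quick succession.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical application events; names are illustrative, not SDK symbols. */
typedef enum {
    APP_EVT_START_SCAN,   /* task would call GAPRole_StartDiscovery  */
    APP_EVT_CONNECT,      /* task would call GAPRole_EstablishLink   */
    APP_EVT_CANCEL_SCAN,  /* task would call GAPRole_CancelDiscovery */
    APP_EVT_DISCONNECT    /* task would terminate the link           */
} AppEvt;

#define APP_QUEUE_LEN 8u

/* Assumption: a fixed-size ring buffer instead of a heap-allocated queue. */
typedef struct {
    AppEvt items[APP_QUEUE_LEN];
    size_t head;   /* index of the oldest entry */
    size_t count;  /* number of queued entries  */
} AppQueue;

/* Returns 1 on success, 0 if the queue is full. */
int app_enqueue(AppQueue *q, AppEvt evt)
{
    assert(q != NULL);
    if (q->count == APP_QUEUE_LEN) {
        return 0;
    }
    q->items[(q->head + q->count) % APP_QUEUE_LEN] = evt;
    q->count++;
    return 1;
}

/* Returns 1 and stores the oldest event in *out, or 0 if the queue is empty. */
int app_dequeue(AppQueue *q, AppEvt *out)
{
    assert(q != NULL && out != NULL);
    if (q->count == 0) {
        return 0;
    }
    *out = q->items[q->head];
    q->head = (q->head + 1) % APP_QUEUE_LEN;
    q->count--;
    return 1;
}

/* Step 3: on link establishment, two events are posted back to back. */
int on_link_established(AppQueue *q)
{
    return app_enqueue(q, APP_EVT_CANCEL_SCAN) &&
           app_enqueue(q, APP_EVT_DISCONNECT);
}
```

In this sketch the task loop would call app_dequeue until it returns 0 and act on each event in order. A queue that still holds APP_EVT_DISCONNECT after the stall would match the symptom reported above, where the message was enqueued but never dequeued.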

  • Hi,

    Thank you for reaching out.

    You are suggesting a heap issue. I think this could definitely be the source of the problem.

    I would recommend pursuing the debugging in this direction to confirm or rule out a heap issue. You can, for example, use the debugging guide available here.

    In parallel, could you please provide the minimal changes to implement on the multi_role example to reproduce the behavior you are observing?

    Best regards,

  • Hello,

    Thank you for the recommendation. We observed the heap metrics in debug mode when the issue was reproduced. Below are the heap observations from Live Watch after the issue occurred:

    As you can see, the heap memory failure variable was 0 when the issue occurred. Would you recommend a better way to find the root cause?

    Regards,
    Niranjan

  • Hi Niranjan,

    Even if heapmgrMemFail is 0, the remaining amount of heap looks quite limited.

    Could you please reproduce the issue once more and verify whether you get similar heap metrics?

    Best regards,

  • Hello Clement,

    Apologies for the delay in response.

    We monitored the heap in debug mode again and got the same observations as mentioned above.
    On further analysis of HCI_BLE_HARDWARE_ERROR_EVENT_CODE, we also found that the hardwareCode in the pmsg of type hciEvt_HardwareError_t is HW_FAIL_UNABLE_TO_SCHEDULE (0x8F).
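As a rough illustration of the check described above, the following host-side sketch maps the reported code to a printable name. It makes several assumptions: HwErrorEvt is a hypothetical stand-in that models only the hardwareCode field of the event message, hw_error_name is a hypothetical helper, and the 0x8F value is taken from this post (on the target, the symbol would come from the stack headers instead of the local define).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value quoted in this thread; on the target it comes from the stack headers. */
#ifndef HW_FAIL_UNABLE_TO_SCHEDULE
#define HW_FAIL_UNABLE_TO_SCHEDULE 0x8F /* 143 decimal */
#endif

/* Hypothetical stand-in for the event message; only the field discussed here. */
typedef struct {
    uint8_t hardwareCode;
} HwErrorEvt;

/* Hypothetical helper: returns a printable name for the reported code. */
const char *hw_error_name(const HwErrorEvt *evt)
{
    assert(evt != NULL);
    switch (evt->hardwareCode) {
    case HW_FAIL_UNABLE_TO_SCHEDULE:
        return "HW_FAIL_UNABLE_TO_SCHEDULE";
    default:
        return "unrecognized hardware code";
    }
}
```

Printing the name together with the raw value makes it easier to correlate log output that shows the code in decimal (143) with the hexadecimal value used in the headers (0x8F).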

    Here is the flow of events in our application:
     ScanAndConnect.pdf

    Hope this helps.

    Regards,
    Niranjan

  • Hi Niranjan,

    Thank you for the additional details provided.

    HW_FAIL_UNABLE_TO_SCHEDULE is raised when scanning or advertising did not get any scheduled time, meaning that scanning or advertising was starved because other RF operations kept taking the scheduling time.
    Based on your description, I wonder if the Bluetooth LE controller (and in particular the scheduler) could be confused by receiving so many state changes (almost) simultaneously: connection, scan stop, and disconnection, while still advertising.

    I would recommend trying these two things:

    • Slightly delay the link termination. Instead of triggering the link termination when receiving GAP_LINK_ESTABLISHED_EVENT, you could use a clock and trigger the disconnection a few hundred milliseconds later.
    • Verify you have enough air time available. To do so, you could enable the RF output (see here). If relevant, you could then try to reduce the amount of RF operations, for example by increasing the advertising interval.
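The first suggestion can be sketched as a one-shot timer that is armed when the link is established and posts the disconnect event only when it expires. This is a host-side model under stated assumptions: OneShot, oneshot_start and oneshot_tick are hypothetical names, the timer is advanced by explicit tick calls, and on the target an RTOS clock object would play this role instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-shot timer, advanced by explicit tick calls. */
typedef struct {
    uint32_t remaining_ms; /* time left before expiry */
    int      armed;        /* 1 while counting down   */
} OneShot;

/* Arms the timer; in this sketch it would be called on link establishment. */
void oneshot_start(OneShot *t, uint32_t delay_ms)
{
    assert(t != NULL);
    t->remaining_ms = delay_ms;
    t->armed = 1;
}

/* Advances the timer; returns 1 exactly once, on the tick where it expires. */
int oneshot_tick(OneShot *t, uint32_t elapsed_ms)
{
    assert(t != NULL);
    if (!t->armed) {
        return 0;
    }
    if (elapsed_ms >= t->remaining_ms) {
        t->remaining_ms = 0;
        t->armed = 0;
        return 1; /* caller would post the disconnect event here */
    }
    t->remaining_ms -= elapsed_ms;
    return 0;
}
```

For example, calling oneshot_start with 300 (an illustrative value within the "few hundred milliseconds" range) and ticking every 100 ms reports expiry on the third tick, at which point the application would enqueue its disconnect event rather than doing so directly in the link-established handler.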

    I hope this will help,

    Best regards,

  • Hello Clement,

    Thank you for your information and suggestions.

    As per your suggestions, I tried the following and here are my observations:

    1. Because you suggested that this issue may be due to too many RF operations, I removed all advertising and connection operations and left only the scan operation running. The same HW_FAIL_UNABLE_TO_SCHEDULE error came up.

    2. I further simplified the scan flow by removing all HCI whitelist APIs and just toggling between scan cancel and start operations as follows:


      With the above logic running, the error HW_FAIL_UNABLE_TO_SCHEDULE was produced.

    3. I replicated the same flow on the multi role project from simplelink_cc2640r2_sdk_4_40_00_10. I have attached the multi role C file with the above scanning loop simulated: 8508.multi_role_simulated_Scan_Loop.c
      The same HW_FAIL_UNABLE_TO_SCHEDULE error was observed.

      Also, the time from the start of the above loop until the HW error is not fixed. We saw the error after 2-3 minutes in one test cycle and after 20 minutes in another.

    Regards,
    Niranjan

  • Hi Niranjan,

    Thank you for the thorough testing and all the details provided.

    We will have the team reproduce the issue and comment.

    I'll be back to you at the end of next week.

    Best regards,

  • Hello,

    We are waiting for your response to proceed with our implementation. Any leads on this issue?
    We would appreciate it if we could get a solution at the earliest since we are in a critical state with our project.

    Regards,
    Niranjan

  • Hi Niranjan,

    I have tried running the simulated scan loop, but have failed to reproduce the issue. Are there any dependencies needed to run the loop and reproduce the issue?

    Best,

    Nima

  • Hello Nima,

    Thank you for your response.
    Please use the multi role source file attached to this post to generate the hex file, which is to be flashed on both the client (scanner) and the peripheral (advertiser) so that the client can loop through the scan start and cancel process.

    multi_role_simulated_Scan_Loop_ClientAndPeripheral.c

    Other than this, there are no other dependencies.

    Regards,
    Niranjan

  • Hi Niranjan,

    I was able to build the project and flash it to two devices. I noticed that mr_doScan() sets the nxtState variable to START_SCAN and queues that event to be processed. However, once in ScanM_ProcessInternalEvents(), nxtState has changed to SCAN_CANCEL without anything setting it to SCAN_CANCEL, so scanning would never start. I changed nxtState to a global variable, which fixed the issue: scanning started successfully. However, after running the example for 20 minutes, I did not run into the HW_FAIL_UNABLE_TO_SCHEDULE error, nor did I see anything wrong with the heap. Am I missing something?
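The attached source is not reproduced in this thread, so the following is not the author's code. It is a minimal sketch of one way to make the scan loop independent of where the state variable lives: the requested state travels inside the queued message by value, and the handler returns the state to enqueue next. ScanMsg and scan_process are hypothetical names; SCAN_START and SCAN_CANCEL follow the case names mentioned in the discussion.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SCAN_START, SCAN_CANCEL } ScanState;

/* Hypothetical queue message: the requested state travels inside it by value. */
typedef struct {
    ScanState state;
} ScanMsg;

/* Handles one message and returns the state to enqueue next. */
ScanState scan_process(const ScanMsg *msg)
{
    assert(msg != NULL);
    switch (msg->state) {
    case SCAN_START:
        /* On the target, the scan would be started here. */
        return SCAN_CANCEL;
    case SCAN_CANCEL:
    default:
        /* On the target, the scan would be cancelled here. */
        return SCAN_START;
    }
}
```

Because the message is copied when it is enqueued, the handler sees the value that was set at enqueue time regardless of whether the sender kept it in a local, static, or global variable.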

    Here is the modified 8802.multi_role.c.

    Best,

    Nima Behmanesh

  • Hello Nima,

    Just to clarify the procedure:
    1. Launchpad A (client) and Launchpad B (peripheral) were flashed with the same hex file generated from the source file above (multi_role_simulated_Scan_Loop_ClientAndPeripheral.c), without any changes to the source.

    2. 'A' was put in scan mode and 'B' was kept in advertising mode. In this scenario, did you see the HW_FAIL_UNABLE_TO_SCHEDULE error on Launchpad A?

    3. If I run the code as is (nxtState is a local variable), I get the following UART print statements (set in the code) on my Launchpad A (scanner):

    *Main Menu
    < Next Item
     Scan >
    HCI status: 14
    HCI cmpl: 8195
    0x0C61CFA2FBD9
    Connected to 0
    Initialized
    Advertising
    HCI status: 255
    doscan...                 
    Scan started
    dev found
    Scan Cancelled GAP
    Scan started
    dev found
    Scan Cancelled GAP
    Scan started

    As you can see, after doscan I get the print statement "Scan started", which is inside the SCAN_START case and is printed if GAPRole_StartDiscovery is posted successfully. This observation is consistent. I do not see any scenario where the queued data (SCAN_START) sent to 'multi_role_enqueueMsg' would change after the event is dequeued and processed.

    4. As mentioned earlier, there is no fixed time at which the error appears during the loop, so I would suggest leaving the setup running with Launchpad 'A' going through the loop for as long as possible, perhaps for more than 20 minutes.

    Regards,
    Niranjan

  • Hi Niranjan,

    Thank you for the clarification. I will test again today and let it run for longer. I will update you with my results.

    Best,

    Nima Behmanesh

  • Hi Niranjan,

    I'm still unable to reproduce the reported behavior using the example you provided. I followed the procedure you laid out, but to no avail (I ran the example for over an hour without any unexpected behavior).

    Do you mind sending over the hex file? 


    Best,
    Nima

  • Hello Nima,

    Please find the zipped file containing the hex file to be flashed on both the central and peripheral launchpads.

    ble_multi_role_ScanLooping_CentralAndPeripheral.zip

    Hope to hear from you soon.

    Regards,
    Niranjan

  • Hi Niranjan,

    I have tried the hex file you provided. I was unable to reproduce the issue after some brief initial testing; however, my team and I will troubleshoot today and tomorrow and update you with the results. Thank you for providing the hex files; they will be helpful in debugging this issue. In the meantime, if you are doing any debugging on your side and come across any additional information, please share it with us so that our troubleshooting and debugging process is as efficient as possible.

    Best,

    Nima

  • Hello Niranjan,

    Thank you so much for your patience. We have run some more tests and were able to identify some things that were keeping us from reproducing the issue. We are now able to detect advertisements. We are going to continue debugging this issue, and we will get back to you tomorrow with updates.

    Best,

    Nima

  • Hello Nima,

    Looking forward to your response.

    Thank you,
    Niranjan

  • Hi Niranjan,

    Thank you for your patience. We are still testing and will need a couple days. I will update you with more information as we get it.

    Best,

    Nima

  • Hi Niranjan,

    We are now able to reproduce the issue and have made some observations. It appears that the time between enabling and disabling scanning may be causing an issue in the stack. When a delay is added between enabling and disabling, the issue appears to go away. We will need to test this further on our end, but in the meantime, try adding a delay between enabling and disabling scanning. Please let me know how that affects the issue on your end.

    Best,

    Nima

  • Hello Nima,

    I am glad you could reproduce the issue.
    The scan and cancel test loop was run with delays of 500 ms, 1 s, and 2 s introduced between the device found callback (after the scan start event) and the scan cancel event.

    Hence, there was a considerable delay between the two.

    Unfortunately, the same HW error code 143 was observed.

    Also, note that with the 1 s and 2 s delays, the HW error occurred within 40-50 minutes.

    Regards,
    Niranjan

  • Hi Niranjan,

    Thank you for the information, that is very helpful. I will run some more tests and get back to you.

    Best,

    Nima

  • Hi Niranjan,

    I wanted to notify you that we are still working on this but will need more time to figure out the issue. Thank you for your patience!

    Best,

    Nima

  • Hello,

    Any updates on this issue?

    Regards,
    Niranjan

  • Hi Niranjan,

    I believe we should discuss this problem further offline so we can align on this issue.

    Best,

    Nima