LAUNCHXL-CC2640R2: Dequeue of application message queue interrupted by GAP callbacks

Niranjan Hegde

Hello,

We are currently working on a project with the following details:

Device	CC2640R2LAUNCHXL
SDK	simplelink_cc2640r2_sdk_4_40_00_10
IDE	IAR8.32.2
Reference Project	multi role project

In our application, we are looping through the following actions:

1. Post an application event to call GAPRole_StartDiscovery with the whitelist enabled for a specific device.
2. On receiving GAP_DEVICE_INFO_EVENT, initiate a connection with the discovered device by posting another application event to call GAPRole_EstablishLink.
3. On receiving GAP_LINK_ESTABLISHED_EVENT, post an event to call GAPRole_CancelDiscovery and, in parallel, post an event to disconnect the link.

After step 3, we receive no GAP callback for GAPRole_CancelDiscovery (which returned SUCCESS), and the application queue message to disconnect (which was enqueued without any error) is never dequeued. The stack message received after this stall is HCI_BLE_HARDWARE_ERROR_EVENT_CODE. Would that mean the heap has some failure?

Also, please note that we are using dynamic allocation for the application queue, and the enqueuing succeeds without any heap failures.

After some time, the device under test resets.

over 3 years ago

0 Clément over 3 years ago

TI__Guru** 101460 points

Hi,

Thank you for reaching out.

You are suggesting a heap issue. I think this could definitely be the source of the problem.

I would recommend continuing to debug in this direction in order to confirm or rule out a heap issue. You can, for example, use the debugging guide available here.

In parallel, could you please provide the minimal changes needed on the multi_role example to reproduce the behavior you are observing?

Best regards,

0 Niranjan Hegde over 3 years ago in reply to Clément

Intellectual 650 points

Hello,

Thank you for the recommendation. We observed the heap metrics in debug mode when the issue was reproduced. Below are the heap observations from Live Watch after the issue occurred:

As you can see, when the issue occurred, the heap memory failure variable was 0. Would you recommend a better way to find the root cause?

Regards,
Niranjan

0 Clément over 3 years ago in reply to Niranjan Hegde

TI__Guru** 101460 points

Hi Niranjan,

Even if heapmgrMemFail is 0, it looks like the remaining amount of heap is pretty limited.

Could you please reproduce the issue one more time and check whether you get similar heap metrics?

Best regards,

0 Niranjan Hegde over 3 years ago in reply to Clément

Intellectual 650 points

Hello Clement,

Apologies for the delay in response.

We monitored the heap in debug mode again and got the same observations as mentioned above.
On further analysis of HCI_BLE_HARDWARE_ERROR_EVENT_CODE, we also found that the hardwareCode from the pmsg of type hciEvt_HardwareError_t is HW_FAIL_UNABLE_TO_SCHEDULE (0x8F).
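
For readers following along, a minimal sketch of the check described above. It only illustrates comparing the reported hardwareCode against the 0x8F value quoted in this thread. The `demo_` struct and function names are hypothetical stand-ins, not SDK definitions; on the target, the SDK headers provide the real event type and constant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Value quoted in the thread for HW_FAIL_UNABLE_TO_SCHEDULE. The SDK defines
 * the real constant; it is repeated here under a DEMO_ name only so that the
 * sketch compiles on its own. */
#define DEMO_HW_FAIL_UNABLE_TO_SCHEDULE 0x8F

/* Hypothetical stand-in for the hardware error event payload. */
typedef struct {
    uint8_t hardwareCode;
} demo_hw_error_evt_t;

/* Returns true when the event carries the "unable to schedule" code. */
bool demo_is_unable_to_schedule(const demo_hw_error_evt_t *evt)
{
    assert(evt != NULL);
    return evt->hardwareCode == DEMO_HW_FAIL_UNABLE_TO_SCHEDULE;
}
```

Note that 0x8F is 143 in decimal, so a log that prints the code as a decimal number refers to the same error.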

Here is the flow of events in our application:
ScanAndConnect.pdf

Hope this helps.

Regards,
Niranjan

0 Clément over 3 years ago in reply to Niranjan Hegde

TI__Guru** 101460 points

Hi Niranjan,

Thank you for the additional details provided.

HW_FAIL_UNABLE_TO_SCHEDULE is raised when scanning or advertising did not get any scheduled time, meaning that scanning or advertising was starved because the other RF operations keep taking the scheduling time.
Based on your description, I wonder if the Bluetooth LE controller (and in particular the scheduler) could actually be confused by receiving so many state changes (almost) simultaneously: connection, scan stop, and disconnection, all while still advertising.

I would recommend trying these two things:

1. Slightly delay the link termination. Instead of triggering the link termination when receiving GAP_LINK_ESTABLISHED_EVENT, you could use a clock and trigger the disconnection a few hundred milliseconds later.
2. Verify that you have enough air time available. To do so, you could enable the RF output (see here). If relevant, you could then try to reduce the number of RF operations, for example by increasing the advertising interval.
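
The first suggestion can be sketched roughly as follows. This is a host-side simulation under stated assumptions, not code from the multi_role example: the `demo_` one-shot timer, the tick function, and the 300 ms delay are all hypothetical. On the target, an RTOS clock object would play the role of the timer, and the callback would post the application's disconnect event rather than increment a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical delay: "a few hundred milliseconds" after link establishment. */
#define DEMO_DISCONNECT_DELAY_MS 300u

typedef void (*demo_timer_cb_t)(void *arg);

/* Hypothetical one-shot timer driven by simulated time. */
typedef struct {
    uint32_t remaining_ms;
    bool armed;
    demo_timer_cb_t cb;
    void *arg;
} demo_oneshot_t;

void demo_oneshot_start(demo_oneshot_t *t, uint32_t delay_ms,
                        demo_timer_cb_t cb, void *arg)
{
    assert(t != NULL && cb != NULL);
    t->remaining_ms = delay_ms;
    t->cb = cb;
    t->arg = arg;
    t->armed = true;
}

/* Advance simulated time; the callback fires at most once per start. */
void demo_oneshot_tick(demo_oneshot_t *t, uint32_t elapsed_ms)
{
    assert(t != NULL);
    if (!t->armed) {
        return;
    }
    if (elapsed_ms < t->remaining_ms) {
        t->remaining_ms -= elapsed_ms;
        return;
    }
    t->armed = false;
    t->cb(t->arg);
}

typedef struct {
    demo_oneshot_t disconnect_timer;
    int disconnect_requests;
} demo_app_t;

/* Timer callback: real code would post the disconnect event here. */
void demo_request_disconnect(void *arg)
{
    demo_app_t *app = arg;
    app->disconnect_requests++;
}

/* Link-established handler: arm the timer instead of disconnecting now. */
void demo_on_link_established(demo_app_t *app)
{
    assert(app != NULL);
    demo_oneshot_start(&app->disconnect_timer, DEMO_DISCONNECT_DELAY_MS,
                       demo_request_disconnect, app);
}
```

The point of the design is that the link-established handler returns immediately and the disconnect request is issued later from the timer callback, which spreads the connection, scan stop, and disconnection requests out in time.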

I hope this will help,

Best regards,

0 Niranjan Hegde over 3 years ago in reply to Clément

Intellectual 650 points

Hello Clement,

Thank you for your information and suggestions.

Following your suggestions, I tried the following, and here are my observations:

Since you suggested that this issue may be due to too many RF operations, I removed all advertising and connection operations and kept only the scan operation running. The same HW_FAIL_UNABLE_TO_SCHEDULE error came up.
I further simplified the scan flow by removing all HCI whitelist APIs and just toggling between the scan cancel and scan start operations as follows:

With the above logic running, the HW_FAIL_UNABLE_TO_SCHEDULE error was produced.
I then replicated the same flow on the multi role project from simplelink_cc2640r2_sdk_4_40_00_10. I have attached the multi role C file with the above scanning loop simulated: 8508.multi_role_simulated_Scan_Loop.c
The same HW_FAIL_UNABLE_TO_SCHEDULE error was observed.

Also, the time from the start of the above loop to the HW error is not fixed. We saw the error after 2-3 minutes in one test cycle and after 20 minutes in another.

Regards,

Niranjan

0 Clément over 3 years ago in reply to Niranjan Hegde

TI__Guru** 101460 points

Hi Niranjan,

Thank you for the thorough testing and all the details provided.

We will have the team reproduce the issue and comment.

I'll get back to you at the end of next week.

Best regards,

0 Niranjan Hegde over 3 years ago in reply to Clément

Intellectual 650 points

Hello,

We are waiting for your response to proceed with our implementation. Any leads on this issue?
We would appreciate it if we could get a solution at the earliest since we are in a critical state with our project.

Regards,
Niranjan

0 Nima over 3 years ago in reply to Niranjan Hegde

TI__Genius 15025 points

Hi Niranjan,

I have tried running the simulated scan loop, but have failed to reproduce the issue. Are there any dependencies needed to run the loop and reproduce the issue?

Best,

Nima

0 Niranjan Hegde over 3 years ago in reply to Nima

Intellectual 650 points

Hello Nima,

Thank you for your response.
Please use the multi role source file attached to this post to generate the hex file. Flash it on both the client (scanner) and the peripheral (advertiser) so that the client can loop through the scan start and cancel process.

multi_role_simulated_Scan_Loop_ClientAndPeripheral.c

Other than this, there are no other dependencies.

Regards,
Niranjan

0 Nima over 3 years ago in reply to Niranjan Hegde

TI__Genius 15025 points

Hi Niranjan,

I was able to build the project and flash it to two devices. I noticed that mr_doScan() sets the nxtState variable to START_SCAN and queues that event to be processed. However, once in ScanM_ProcessInternalEvents() the nxtState variable is changed to SCAN_CANCEL without anything setting it to SCAN_CANCEL, so scanning would never start. I changed the nxtState variable to a global variable and this fixed the issue so that it would start scanning and this was successful. However, after running the example for 20 minutes I did not run into the HW_FAIL_UNABLE_TO_SCHEDULE error nor did I see anything wrong with the heap. Am I missing something?
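
The thread does not show how nxtState reaches the queue, so the following is only a sketch under assumptions. It illustrates one pattern that makes the lifetime of a local variable irrelevant: copying the state into the queued message by value at enqueue time. The `demo_` queue, the event ID, and the enum values are hypothetical and are not taken from the attached file or from multi_role_enqueueMsg.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scan states; nonzero so a zeroed slot is never mistaken for one. */
typedef enum {
    DEMO_SCAN_START = 1,
    DEMO_SCAN_CANCEL = 2
} demo_scan_state_t;

#define DEMO_EVT_SCAN    0x0001u  /* hypothetical application event ID */
#define DEMO_QUEUE_DEPTH 8u

typedef struct {
    uint16_t event;
    demo_scan_state_t state;  /* stored by value, not as a pointer */
} demo_msg_t;

typedef struct {
    demo_msg_t slots[DEMO_QUEUE_DEPTH];
    size_t head;
    size_t count;
} demo_queue_t;

/* Copies the state into the next free slot; fails when the queue is full. */
bool demo_enqueue(demo_queue_t *q, uint16_t event, demo_scan_state_t state)
{
    assert(q != NULL);
    if (q->count == DEMO_QUEUE_DEPTH) {
        return false;
    }
    size_t tail = (q->head + q->count) % DEMO_QUEUE_DEPTH;
    q->slots[tail].event = event;
    q->slots[tail].state = state;
    q->count++;
    return true;
}

/* Removes the oldest message; fails when the queue is empty. */
bool demo_dequeue(demo_queue_t *q, demo_msg_t *out)
{
    assert(q != NULL && out != NULL);
    if (q->count == 0) {
        return false;
    }
    *out = q->slots[q->head];
    q->head = (q->head + 1) % DEMO_QUEUE_DEPTH;
    q->count--;
    return true;
}

/* Plays the role of mr_doScan() here: the state lives in a local variable,
 * but the message keeps its own copy after this function returns. */
bool demo_do_scan(demo_queue_t *q)
{
    demo_scan_state_t nxtState = DEMO_SCAN_START;
    return demo_enqueue(q, DEMO_EVT_SCAN, nxtState);
}
```

With this pattern, whatever happens to the local variable after the enqueuing function returns cannot affect the dequeued message, whether the variable is local or global.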

Here is the modified 8802.multi_role.c.

Best,

Nima Behmanesh

0 Niranjan Hegde over 3 years ago in reply to Nima

Intellectual 650 points

Hello Nima,

Just to clarify the procedure:
1. Launchpad A (client) and Launchpad B (peripheral) were flashed with the same hex file generated from the source file above (multi_role_simulated_Scan_Loop_ClientAndPeripheral.c) without any changes to the source.

2. 'A' was put in scan mode and 'B' was kept in advertising mode. In this scenario, did you see the HW_FAIL_UNABLE_TO_SCHEDULE error on Launchpad A?

3. If I run the code as is (nxtState is a local variable), I get the following UART print statements (set in the code) on my Launchpad A (scanner):

*Main Menu
< Next Item
Scan >
HCI status: 14
HCI cmpl: 8195
0x0C61CFA2FBD9
Connected to 0
Initialized
Advertising
HCI status: 255
doscan...
Scan started
dev found
Scan Cancelled GAP
Scan started
dev found
Scan Cancelled GAP
Scan started

As you can see, after doscan I get the print statement "Scan started", which is printed inside the case SCAN_START when GAPRole_StartDiscovery is posted successfully. This observation is consistent. I do not see any scenario where the queued data (SCAN_START) sent to 'multi_role_enqueueMsg' would change after dequeuing and processing the event.

4. As mentioned earlier, since there is no fixed time at which the error comes up during the loop, I would suggest leaving the setup running with Launchpad 'A' going through the loop for as long as possible, perhaps for more than 20 minutes.

Regards,
Niranjan

0 Nima over 3 years ago in reply to Niranjan Hegde

TI__Genius 15025 points

Hi Niranjan,

Thank you for the clarification. I will test again today and let it run for longer. I will update you with my results.

Best,

Nima Behmanesh

0 Nima over 3 years ago in reply to Niranjan Hegde

TI__Genius 15025 points

Hi Niranjan,

I'm still unable to reproduce the reported behavior using the example you provided. I have also followed the procedure you laid out, to no avail (I ran the example for over an hour without any unexpected behavior).

Do you mind sending over the hex file?

Best,
Nima

0 Niranjan Hegde over 3 years ago in reply to Nima

Intellectual 650 points

Hello Nima,

Please find the zipped file containing the hex file to be flashed on both the central and peripheral launchpads.

ble_multi_role_ScanLooping_CentralAndPeripheral.zip

Hope to hear from you soon.

Regards,
Niranjan

0 Nima over 3 years ago in reply to Niranjan Hegde

TI__Genius 15025 points

Hi Niranjan,

I have tried the hex file you provided. I was unable to reproduce the issue after some brief initial testing. However, my team and I will troubleshoot today and tomorrow and update you with the results. Thank you for providing the hex files; they will be helpful in debugging this issue. In the meantime, if you are doing any debugging on your side and come across any additional information, please share it with us so that our troubleshooting and debugging process is as efficient as possible.

Best,

Nima

0 Nima over 3 years ago in reply to Niranjan Hegde

TI__Genius 15025 points

Hello Niranjan,

Thank you so much for your patience. We have run some more tests and were able to identify some things that were keeping us from reproducing the issue. We are now able to detect advertisements. We are going to continue debugging this issue, and we will get back to you tomorrow with updates.

Best,

Nima

0 Niranjan Hegde over 3 years ago in reply to Nima

Intellectual 650 points

Hello Nima,

Looking forward to your response.

Thank you,
Niranjan

0 Nima over 3 years ago in reply to Niranjan Hegde

TI__Genius 15025 points

Hi Niranjan,

Thank you for your patience. We are still testing and will need a couple days. I will update you with more information as we get it.

Best,

Nima

0 Nima over 3 years ago in reply to Niranjan Hegde

TI__Genius 15025 points

Hi Niranjan,

We are now able to reproduce the issue and have made some observations. It appears that the time between enabling and disabling scanning may be causing an issue in the stack. With a delay added between enabling and disabling, the issue appears to go away. We will need to test this further on our end, but in the meantime, try adding a delay between enabling and disabling scanning. Please let me know how that affects the issue on your end.

Best,

Nima

0 Niranjan Hegde over 3 years ago in reply to Nima

Intellectual 650 points

Hello Nima,

I am glad you could reproduce the issue.
We ran the scan/cancel test loop with delays of 500 ms, 1 s, and 2 s between the device found callback (after the scan start event) and the scan cancel event.

Hence, there was a considerable delay between the two.

Unfortunately, the same HW error code 143 (0x8F, HW_FAIL_UNABLE_TO_SCHEDULE) was observed.

Also, do note that with the 1 s and 2 s delays, the HW error occurred within 40-50 minutes.

Regards,
Niranjan

0 Nima over 3 years ago in reply to Niranjan Hegde

TI__Genius 15025 points

Hi Niranjan,

Thank you for the information, that is very helpful. I will run some more tests and get back to you.

Best,

Nima

0 Nima over 3 years ago in reply to Niranjan Hegde

TI__Genius 15025 points

Hi Niranjan,

I wanted to notify you that we are still working on this but will need more time to figure out the issue. Thank you for your patience!

Best,

Nima

0 Niranjan Hegde over 3 years ago in reply to Nima

Intellectual 650 points

Hello,

Any updates on this issue?

Regards,
Niranjan

0 Nima over 3 years ago in reply to Niranjan Hegde

TI__Genius 15025 points

Hi Niranjan,

I believe we should discuss this problem further offline so we can align on this issue.

Best,

Nima
