CC2640R2F-Q1: Recovering from some HCI error codes

Part Number: CC2640R2F-Q1
Other Parts Discussed in Thread: CC2640R2F

Tool/software:

Hello,

My customer faces an issue in a production system and wants to understand how the system should recover from the error messages. This issue is very rare and could not be reproduced in the lab. The SDK is 3_40_00_10.

The issue is that the BLE advertisement stops unexpectedly. The master controller directs the BLE device by always enabling the advertisement (even after a BLE connection when the advertisement is stopped inherently by the protocol, the advertisement is requested to be resumed).

The error codes are HCI_BLE_HARDWARE_ERROR_EVENT_CODE and HCI_DATA_BUFFER_OVERFLOW_EVENT, but the customer is not certain that either of them is linked to the above issue, since the issue cannot be reproduced in the lab.

Questions:

  • If there is an advertisement ongoing, can one of those 2 error codes actually stop the advertisement?
  • Is the BLE stack the only source of these error codes, or could they be generated by e.g. the OS...?
  • Are these 2 events also GAP events? (with ICALL_SERVICE_CLASS_BLE as source and HCI_GAP_EVENT_EVENT as event type)
  • What are the ways to reduce the likelihood of their occurrence? This post makes a couple of suggestions; are there others?
  • If one of those 2 errors occurs, is the OS still working, scheduling tasks and refreshing the watchdog? (They ask because they tested the watchdog: it correctly resets the device on an assert/fault, and the host controller is able to detect the BLE reset and restart the advertisement. But if the OS is still running and only the BLE stack is stuck, the watchdog will not reset, correct?)
  • If these errors occur, can the BLE stack recover by itself, or should the host controller catch the error codes and order a reset of the BLE device?

Thank you.


Best regards,
François.

  • Hi François,

    Here are some preliminary answers, and they may change as I do some more research into this.

    If there is an advertisement ongoing, can one of those 2 error codes actually stop the advertisement?

    It's possible. HCI_BLE_HARDWARE_ERROR_EVENT_CODE seems to be triggered when the heap is full, or if there is not enough heap. I see that this is returned in two cases:

    1. Reading the local P256 public key once it has been generated.

    2. The callback that informs the controller that the DH key is done generating.

    In either case, memory is allocated, and if there isn't enough heap to allocate that memory then this code is returned.

    I'm not sure if this could affect ongoing advertisements. The reason I'm not sure is that this regards the heap, so if the heap is full or there is not enough memory, that alone could have effects on advertising. I think the more important question is why is the heap exhausted in the application? Are they sending a lot of GATT notifications/data with a small connection interval? Memory leaks? There's a couple of paths I would go down for debugging this, but this is definitely memory related.

    As for the second code, I'm not actually seeing it used often in the stack code. I will have to look into this a bit further.

    Is the BLE stack the only source of these error codes, or could they be generated by e.g. the OS...?

    The BLE stack is the source of these error codes, but they could be caused by something in the OS. For example, HCI_BLE_HARDWARE_ERROR_EVENT_CODE triggers when there is not enough heap memory. Memory is allocated via an OS abstraction layer. You can think of it this way:

    1. BLE function is called.

    2. Function tries to call malloc (malloc being from the OS).

    3. malloc fails or returns NULL (OS doesn't have enough heap).

    4. BLE function returns HCI_BLE_HARDWARE_ERROR_EVENT_CODE.

    So it's a bit of both. 
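
    The four steps above can be sketched in plain C as follows. This is only an illustration under stated assumptions: the names demo_status_t, demo_alloc_fn, demo_stack_operation and demo_alloc_exhausted are hypothetical and are not SDK APIs, and the status values are placeholders rather than the real HCI codes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical status values for illustration only (not the SDK's codes). */
typedef enum {
    DEMO_STATUS_OK = 0,
    DEMO_STATUS_HW_ERROR = 1 /* stands in for HCI_BLE_HARDWARE_ERROR_EVENT_CODE */
} demo_status_t;

/* Hypothetical allocator hook standing in for the OS abstraction layer. */
typedef void *(*demo_alloc_fn)(size_t size);

/* Steps 1 and 2: a stack-like function that needs a temporary buffer. */
demo_status_t demo_stack_operation(demo_alloc_fn alloc, size_t len)
{
    uint8_t *buf = alloc(len);
    if (buf == NULL) {
        /* Steps 3 and 4: the allocation failed, so the error is reported upward. */
        return DEMO_STATUS_HW_ERROR;
    }
    memset(buf, 0, len); /* placeholder for the real work on the buffer */
    free(buf);           /* every successful allocation is paired with a free */
    return DEMO_STATUS_OK;
}

/* Allocator that always fails, simulating an exhausted heap. */
void *demo_alloc_exhausted(size_t size)
{
    (void)size;
    return NULL;
}
```

    Calling demo_stack_operation(malloc, 32) returns DEMO_STATUS_OK on a healthy heap, while passing demo_alloc_exhausted returns DEMO_STATUS_HW_ERROR, which mirrors how an OS-level shortage surfaces as a stack-level error code.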

    Are these 2 events also GAP events? (with ICALL_SERVICE_CLASS_BLE as source and HCI_GAP_EVENT_EVENT as event type)

    Based on the brief code analysis, yes they are HCI_GAP_EVENT_EVENTs.

    What are the ways to reduce the likelihood of their occurrence? This post makes a couple of suggestions; are there others?

    I think that post provides a specific scenario, but in general what I think is to look at two things:

    1. How much memory am I allocating and freeing? Can I reduce the amount of memory being allocated?

    2. Scheduling operations after an operation ends, and not before it ends. What I mean here is what Clement stated, wait until you get an event saying the current operation is over before scheduling another one. This will free up the queue, and thus result in less memory usage.
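
    Point 2 can be sketched as a small sequencer that keeps at most one operation in flight and only dispatches the next one when the completion event arrives. This is a minimal sketch with hypothetical names (demo_sequencer_t, demo_seq_request, demo_seq_on_complete); the place where a real application would send the command to the stack is only marked by a comment.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_QUEUE_LEN 4 /* hypothetical depth of the pending-command queue */

typedef struct {
    int pending[DEMO_QUEUE_LEN]; /* hypothetical command ids waiting to be sent */
    size_t head;                 /* index of the oldest pending command */
    size_t count;                /* number of pending commands */
    bool in_flight;              /* true while an operation awaits its completion event */
    int sent;                    /* how many commands were handed to the stack so far */
} demo_sequencer_t;

void demo_seq_init(demo_sequencer_t *s)
{
    s->head = 0;
    s->count = 0;
    s->in_flight = false;
    s->sent = 0;
}

/* Hand the oldest pending command to the stack, but only if nothing is in flight. */
static void demo_seq_dispatch(demo_sequencer_t *s)
{
    if (!s->in_flight && s->count > 0) {
        /* A real application would send s->pending[s->head] to the stack here. */
        s->head = (s->head + 1) % DEMO_QUEUE_LEN;
        s->count--;
        s->in_flight = true;
        s->sent++;
    }
}

/* Queue a command; returns false if the local queue is full. */
bool demo_seq_request(demo_sequencer_t *s, int cmd)
{
    if (s->count == DEMO_QUEUE_LEN) {
        return false;
    }
    s->pending[(s->head + s->count) % DEMO_QUEUE_LEN] = cmd;
    s->count++;
    demo_seq_dispatch(s);
    return true;
}

/* Call this when the event reporting the end of the current operation arrives. */
void demo_seq_on_complete(demo_sequencer_t *s)
{
    s->in_flight = false;
    demo_seq_dispatch(s);
}
```

    With this shape, a second request made while the first is still running stays in the small local queue instead of piling up inside the stack.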

    If one of those 2 errors occurs, is the OS still working, scheduling tasks and refreshing the watchdog? (They ask because they tested the watchdog: it correctly resets the device on an assert/fault, and the host controller is able to detect the BLE reset and restart the advertisement. But if the OS is still running and only the BLE stack is stuck, the watchdog will not reset, correct?)

    The watchdog is a driver, which is based on interrupts. If the OS is running, then the watchdog should be able to function regardless of the BLE stack.

    If these errors occur, can the BLE stack recover by itself, or should the host controller catch the error codes and order a reset of the BLE device?

    That's a good question, and for full transparency, without being able to reproduce it, it's hard to answer. From a code analysis perspective, when these error codes occur, they are just notifying the controller, but it doesn't seem that an exception handler is triggered. In theory, yes it could recover, but I would say that these errors indicate an issue that should be addressed via software design. 

    What is the heap usage? Profiling the application and its memory usage might show that we are using too much memory. We might not be going over all the time, but some optimizations might lead us to never see this issue. 

    All that to say that it depends on what the customer is comfortable with. Though, I believe that some memory profiling might be helpful.

    I hope this helps!

    Best,

    Nima Behmanesh 

  • Dear Nima Behmanesh,

    Thank you for your detailed answer, it helped us understand these errors much better.

    The issue in fact is that we see that the advertisement stops on the BLE device and we are searching for a root cause (which is difficult since the issue is rare, not reproducible and without any execution logs).

    This is why we focused our attention on the aforementioned HCI_BLE errors, thinking that if one of them occurs, it can affect the advertisement and also "freeze" the BLE stack. So we cannot be sure at this moment that these errors actually occur, and we are trying to force them by making intentional heap leaks.

    We will continue our analysis and debug efforts. I kindly ask you to leave this ticket open in case we need further support.

    Thank you,

    Valentin Vasiu

  • Dear Nima Behmanesh,

    We advanced our analysis and are evaluating our recovery features (in case the BLE chip resets, faults, freezes or the communication between BLE chip and master chip loses data).

    We have the following questions:
    - Do you consider any fixes between SDK 3_40_00_10 and the latest SDK relevant for a potential issue within the SDK that stops the advertisement without the stop itself being requested?
    - Is there any HCI command that we can use from the master controller to poll the BLE chip, checking the advertisement status? (e.g. HCI command with the meaning "is the advertisement running or not?")

    Thank you,

    Valentin Vasiu

  • Hello,

    Do you consider any fixes between SDK 3_40_00_10 and the latest SDK relevant for a potential issue within the SDK that stops the advertisement without the stop itself being requested?

    Considering that this SDK is fairly old, there may be many tickets that could be related to advertisements. What I can do is look for any fixes that relate to advertising but on a much smaller span of SDKs. Additionally, any fixes that were made into the newer SDKs may not be applicable to older SDKs. 

    I'll see what I can find regarding any related tickets.

    - Is there any HCI command that we can use from the master controller to poll the BLE chip, checking the advertisement status? (e.g. HCI command with the meaning "is the advertisement running or not?")

    The only way I can think of polling the status would be calling GapAdv_enable with the handle of the advertisement and checking for bleAlreadyInRequestedMode.
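
    On the master side, the response to such an enable request could be interpreted roughly like this. It is a sketch only: demo_adv_state_t and demo_classify_enable_response are hypothetical names, and the two status constants are placeholder values that should be replaced by the definitions from the SDK headers.

```c
#include <stdint.h>

/* Placeholder status values (assumptions); take the real ones from the SDK headers. */
#define DEMO_SUCCESS                   0x00u
#define DEMO_ALREADY_IN_REQUESTED_MODE 0x11u /* stands in for bleAlreadyInRequestedMode */

typedef enum {
    DEMO_ADV_WAS_RUNNING, /* the enable request was redundant */
    DEMO_ADV_WAS_STOPPED, /* advertising had stopped and has just been restarted */
    DEMO_ADV_UNKNOWN      /* any other status needs separate handling */
} demo_adv_state_t;

/* Use the status returned by an "enable advertising" request as a status poll. */
demo_adv_state_t demo_classify_enable_response(uint8_t status)
{
    switch (status) {
    case DEMO_ALREADY_IN_REQUESTED_MODE:
        return DEMO_ADV_WAS_RUNNING;
    case DEMO_SUCCESS:
        return DEMO_ADV_WAS_STOPPED;
    default:
        return DEMO_ADV_UNKNOWN;
    }
}
```

    A DEMO_ADV_WAS_STOPPED result outside of the expected post-connection restart would be worth logging, since it marks an occurrence of the unexpected stop.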

    Somewhat unrelated question, but are you using GATT notifications/indications in your software? If so, how often are these notifications/indications being sent?

    Best,

    Nima Behmanesh

  • Hello,

    Thank you for your answer.

    Regarding GapAdv_enable (HCI_EXT_GAP_MAKE_DISCOVERABLE), we are only using it once when starting the advertisement (and after every link established, to restart it), and periodically we are using HCI_EXT_GAP_UPDATE_ADV_DATA to update the advertisement data. But it is a good idea to restart the advertisement after updating the data (or during a periodic keep-alive), even though normally it shouldn't be needed. This way, if there is any problem in the BLE stack that leads to the advertisement being stopped, it will be restarted by the master controller. And we already handle the bleAlreadyInRequestedMode response.

    Secondly, we have implemented a recovery feature, to recover the CC controller in case it is frozen (and we simulated the frozen scenario by your suggestion, leaking dynamic memory until the UART stack is no longer responding to the master). What we are considering now is to recover a freeze of the CC controller in the middle of handling a command, so while the SRDY pin is low.
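
    One possible shape for that last check is a timeout on how long SRDY stays low, sketched below. The names demo_srdy_watch_t and demo_srdy_stuck are hypothetical, the pin state and the millisecond tick are assumed to be supplied by the master's own drivers, and the timeout value is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool was_low;          /* SRDY was already low at the previous sample */
    uint32_t low_since_ms; /* tick at which SRDY was first seen low */
} demo_srdy_watch_t;

/*
 * Sample the SRDY line periodically. Returns true once the line has been
 * continuously low for at least timeout_ms, which the master can treat as
 * a frozen command and answer with a reset of the BLE chip.
 */
bool demo_srdy_stuck(demo_srdy_watch_t *w, bool srdy_low,
                     uint32_t now_ms, uint32_t timeout_ms)
{
    if (!srdy_low) {
        w->was_low = false; /* handshake released, nothing to supervise */
        return false;
    }
    if (!w->was_low) {
        w->was_low = true;  /* falling edge: start timing */
        w->low_since_ms = now_ms;
        return false;
    }
    /* Unsigned subtraction stays correct across a tick counter wrap-around. */
    return (uint32_t)(now_ms - w->low_since_ms) >= timeout_ms;
}
```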

    To answer your question, we don't use GATT notifications/indications in our software.

    Bottom line, since this issue is very rare and we could not reproduce it at our end, we are focusing more on implementing a robust recovery mechanism on the master side that could handle everything (from CC chip reset, to freeze, to missing communication, to handshake pin left blocking, ...).

    PS: Don't hesitate to come back to us with a list of fixes on the SDK side which could fix an unexpected stop of the advertisement feature.

    Thank you for your support,

    Valentin Vasiu

  • Hello,

    Do you have any update on a list of tickets that could potentially fix issues with the advertisement stopping unexpectedly between SDK 3_40_00_10 and newer SDK versions? Even if you find something in a much newer version of the SDK (breaking changes due to major version increase), if there's a porting guide, we can consider upgrading the SDK.

    Thank you,

    Valentin Vasiu

  • Hello Valentin,

    All versions of the CC2640R2F SDK are available here: https://www.ti.com/tool/download/SIMPLELINK-CC2640R2-SDK/4.10.00.10

    In each version, you may follow the following links to list all changes and fixes brought to that version: Release Notes > What's new > Change Log

    However, as Nima said, any fixes that were made into the newer SDKs may not be applicable to older SDKs.

    I guess that at this stage, we should suggest a robust recovery mechanism on the master side since that's what the customer is ultimately looking for. Please let us know your thoughts.


    Best regards,
    François.

  • Hello,

    Regarding GapAdv_enable (HCI_EXT_GAP_MAKE_DISCOVERABLE), we are only using it once when starting the advertisement (and after every link established, to restart it), and periodically we are using HCI_EXT_GAP_UPDATE_ADV_DATA to update the advertisement data. But it is a good idea to restart the advertisement after updating the data (or during a periodic keep-alive), even though normally it shouldn't be needed. This way, if there is any problem in the BLE stack that leads to the advertisement being stopped, it will be restarted by the master controller. And we already handle the bleAlreadyInRequestedMode response.

    How are you updating the advertisement? Is advertising stopped when you update the data? When you update the data and look at the memory allocated on the heap, does it go up by how much you expect? When you change the data to a smaller payload, do you see the heap go down?

    I guess that at this stage, we should suggest a robust recovery mechanism on the master side since that's what the customer is ultimately looking for. Please let us know your thoughts.

    One such recovery method would be to monitor the size of allocated data on the heap. If it reaches above a certain threshold (say 70% of the total memory) then issue a reset. Of course, that threshold is up to the developer to determine, but since the issue causes the memory to eventually fill, this would be the most robust way of recovery.
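
    The threshold check itself could look like the sketch below. The function name demo_heap_over_threshold is hypothetical, and how the total and free byte counts are obtained on the target is deliberately left out; the 70% figure is just the example value from above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when heap usage is strictly above threshold_percent of the
 * total heap. Inconsistent inputs (zero total, or more free than total)
 * are also reported as true, on the assumption that broken heap statistics
 * are themselves a reason to recover.
 */
bool demo_heap_over_threshold(uint32_t total_bytes, uint32_t free_bytes,
                              uint32_t threshold_percent)
{
    if (total_bytes == 0 || free_bytes > total_bytes) {
        return true;
    }
    uint32_t used_bytes = total_bytes - free_bytes;
    /* 64-bit arithmetic avoids overflow of the multiplications. */
    return (uint64_t)used_bytes * 100u >
           (uint64_t)total_bytes * threshold_percent;
}
```

    For example, with a 1000-byte heap and a 70% threshold, 400 bytes free gives false (60% used) and 200 bytes free gives true (80% used).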

    I would make sure that for every allocation of dynamic memory that there is a free, especially when updating data for advertisements.

    Best,

    Nima Behmanesh

  • Hello Nima,

    How are you updating the advertisement? Is advertising stopped when you update the data? When you update the data and look at the memory allocated on the heap, does it go up by how much you expect? When you change the data to a smaller payload, do you see the heap go down?

    We are updating the advertisement data periodically, while the advertisement is ongoing, with the HCI_EXT_GAP_UPDATE_ADV_DATA command. This works fine as far as we could test. The advertisement is only restarted intentionally by the master controller after the link established notification is received (and the BLE stack implicitly stops the advertisement).

    As per your suggestion, we are considering adding a periodic restart of the advertisement with the HCI_EXT_GAP_MAKE_DISCOVERABLE command (either after updating the data with the previous command or on a keep-alive timer). If there is no issue with the BLE stack, it will respond with bleAlreadyInRequestedMode, which won't be an issue for us, but if the advertisement has stopped for whatever reason, we will restart it. So this would be a robust recovery feature.
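
    The keep-alive side of this can be sketched as a counter of consecutive missed responses that escalates to a chip reset. Everything here is illustrative: demo_keepalive_t, demo_keepalive_tick and the limit of three misses are hypothetical, and sending the command and detecting the response are assumed to happen elsewhere in the master firmware.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_MISSED 3u /* hypothetical number of tolerated consecutive misses */

typedef enum {
    DEMO_ACTION_NONE,
    DEMO_ACTION_RESET_CHIP
} demo_action_t;

typedef struct {
    uint8_t missed; /* consecutive keep-alive periods without a response */
} demo_keepalive_t;

/*
 * Call once per keep-alive period, after the periodic command has been sent
 * and its response window has elapsed. Any response, including a
 * bleAlreadyInRequestedMode status, counts as proof that the chip is alive.
 */
demo_action_t demo_keepalive_tick(demo_keepalive_t *k, bool got_response)
{
    if (got_response) {
        k->missed = 0;
        return DEMO_ACTION_NONE;
    }
    k->missed++;
    if (k->missed >= DEMO_MAX_MISSED) {
        k->missed = 0; /* start counting afresh after the reset */
        return DEMO_ACTION_RESET_CHIP;
    }
    return DEMO_ACTION_NONE;
}
```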

    Regarding the heap usage, we didn't monitor it. The product runs fine for hours and even days, and we only have 2 reproductions of the "adv stop" on our customer side.

    One such recovery method would be to monitor the size of allocated data on the heap.

    This is a good idea to add a new recovery method on top of our own. We forced heap leaks and saw with enough leaks, the UART comm is down also, and this is recovered by our keep-alive feature (CC chip will not respond anymore to a keep-alive cmd so we reset). Any guess if the UART comm would "always" be down when the heap leaks and becomes full? Or better add a monitoring of the heap on top of our recoveries?

    One last point about the changelog, I found these 2 tickets:

    Do you think these fixes could help in any way to overcome the sudden advertisement stops we are experiencing? Meaning, should we consider upgrading the SDK?

    Thank you,

    Valentin Vasiu

  • Hello,

    I always think that upgrading the SDK is a good option if you have the time and resources to migrate. The latest SDKs often have many fixes that can improve your application.

    Considering those two tickets you've linked, I can take a look further into them, but I believe that changing the data while the advertisement is running could cause some undefined behavior. The fact that you have not observed this in your testing may mean that it just happens very rarely. Additionally, since the issue can't be reproduced, I would say that adding a patch that we are not sure will have an effect may cause more issues down the line.

    Best,

    Nima Behmanesh 

  • Hello Valentin,

    Do you have additional questions on this topic?


    Best regards,
    François.

  • Hello Francois,

    No further questions on this topic.

    Thank you for all your support!

    Best Regards,

    Valentin Vasiu

  • Hello Valentin,

    Very well. Thank you for the feedback. Closing this discussion.


    Best regards,
    François.