CC2642R: Peripheral generating MIC errors

Part Number: CC2642R

Hello,

We've been tracking an issue in our peripheral project where the BLE connection is terminated by the central due to a MIC error. I was hoping that porting from SDK 1.6 to 2.2 for the 2642 would resolve the issue, but it still seems to be present.

The error is reproducible, but seems to occur somewhat randomly. It occurs most frequently during OTAs, when our BLE connection is under its heaviest load. That said, I've also seen it occur during normal run-time message exchanges.

I'm not sure there's much I can do to debug this issue, since the encryption is handled by the stack / controller portion of the system. I'm highly confident the problem is in the peripheral, as I've tested it against both our production-intent central (Qualcomm) and our peripheral hardware (TI) acting as the central. In both cases the central reports the 0x3D error code as the reason for the disconnection. I've also captured the error on a BLE sniffer, which likewise points at the issue coming from the peripheral. Here's a screen cap from the sniffer:

Any ideas on how we can troubleshoot this? We can limp along for development with link encryption turned off, but it's a must for production.

Thanks,

Josh

  • Hi Josh,

    I've seen some interesting MIC failure issues in the past, most of which were determined to be an issue in smartphone Broadcom chips when the phone is used as a peripheral and the TI device initiates encryption. Note that the BLE protocol can report a MIC failure even when there isn't an actual MIC failure; see the Bluetooth 5 spec, Vol 6, Part B, Sections 5.1.3.1 and 5.1.3.2 (last paragraph of each).

    Can you describe exactly what is happening?
    What device is the central in this case?
    Is this MIC failure caused when encryption is being started, or after the LLCP Encryption Start procedure is complete?
    Is this MIC failure caused when the encryption is being paused and a new encryption key is being generated/used?
    Who is initiating the encryption? Does the peripheral send a slave security request, or is the central device enforcing the encryption?
    Can you provide a sniffer capture of the failing case and point to where it occurs? Is your sniffer BLE5 capable?
    After the connection is terminated, can you reconnect and encrypt using the previously generated keys or does it trigger a new pairing session?
  • Hey Evan,

    Thanks for the quick response, I can answer a lot of your questions straight away. As far as a sniffer log goes, can I send it to you directly? It's not something I'd want to post on a public forum as it exposes our proprietary GATT service.

    Can you describe exactly what is happening?

    After a random amount of time the central stops initiating connection events, and reports a connection termination due to MIC failure. The peripheral will disconnect a few seconds later from a connection timeout. The issue seems to happen more frequently when the BLE connection is being more heavily loaded, e.g. when I push an OTA in parallel with the normal run time messages. We use a 7.5 ms connection interval, with 0 slave latency.
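
    For reference, the connection parameters boil down to something like this (define names here are just illustrative; intervals are in 1.25 ms units per the spec):

        // Connection parameters in use on this link (illustrative names).
        #define CONN_INTERVAL_MIN    6   // 6 * 1.25 ms = 7.5 ms
        #define CONN_INTERVAL_MAX    6   // fixed 7.5 ms connection interval
        #define CONN_SLAVE_LATENCY   0   // peripheral listens at every connection event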

    I can't prove this, but it seems like once one MIC failure occurs in a power cycle they become more frequent on subsequent connections. I'll do some more testing to see if I can back this up.

    What device is the central in this case?

    It's always the central reporting the MIC failure. This leads me to believe the peripheral is sending a message with a corrupted MIC, or the encrypted payload is being corrupted after the MIC has been calculated.

    The peripheral is always a CC2642. I've tested against two centrals: a board with a Qualcomm chip that's part of our central device, and a project I set up to act as the central, which runs on our peripheral hardware so I have more control for instrumentation and debugging. The issue occurs no matter which central is being used; in both cases the 0x3D disconnect code is seen.

    Is this MIC failure caused when encryption is being started, or after the LLCP Encryption Start procedure is complete?

    It happens after the connection is established and encryption has started. Sometimes a few seconds later, sometimes many minutes later.

    Is this MIC failure caused when the encryption is being paused and a new encryption key is being generated/used?

    I don't think new keys are being generated. I have our peripheral's bond manager instrumented to print the link's LTK on UART whenever a connection is established. I've been using the same key in the sniffer for all of the testing, and it's able to decrypt the traffic.

    Who is initiating the encryption? Does the peripheral send a slave security request or is the central device enforcing the encryption?

    Central always initiates, peripheral is configured for GAPBOND_PAIRING_MODE_WAIT_FOR_REQ.
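
    For context, the relevant peripheral-side bond manager setup is roughly the following (a simplified sketch; the real init code sets a few more parameters):

        // Peripheral GAPBondMgr configuration: never initiate, just wait for
        // the central's pairing/encryption request.
        uint8_t pairMode = GAPBOND_PAIRING_MODE_WAIT_FOR_REQ;
        uint8_t bonding  = TRUE;
        GAPBondMgr_SetParameter(GAPBOND_PAIRING_MODE, sizeof(uint8_t), &pairMode);
        GAPBondMgr_SetParameter(GAPBOND_BONDING_ENABLED, sizeof(uint8_t), &bonding);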

    Can you provide a sniffer capture of the failing case and point to where it occurs? Is your sniffer BLE5 capable?

    The sniffer only supports BT 4.2 (Frontline BPA Low Energy). The BLE stack on our central is only 4.2 as well. The "simulated central" firmware I run on our peripheral hardware has a BT 5 stack, but doesn't use any of the BT 5 features.

    After the connection is terminated, can you reconnect and encrypt using the previously generated keys or does it trigger a new pairing session?

    Our central automatically reconnects, and the link is encrypted with the same keys as before. Things work normally until the next MIC failure occurs.

  • Josh,

    Strange. Allow me some time to look into it a bit. I sent you a friend request on E2E so we can message on there, but let's try to keep the majority of the conversation on the forums if possible. Thanks! Your descriptions are helpful, and I don't believe I've seen this particular issue before; it's not related to the other MIC failure issues I've seen.
  • Evan,

    Have you had a chance to look at the log files I sent over yet?

    I've done some more testing where I changed the size and frequency of some of the messages being exchanged on our BLE connection. I was hoping to find a way to affect the frequency of the error, but it seems to be pretty random.

    Thanks,
    Josh
  • Hey Josh,

    I was out of the office yesterday for a personal matter and have not had a chance to dive deep into it. I will dig into it today and early next week.

    As a side question, have you found a reliable way to reproduce the issue using the example OTA application? Have you done any testing with that?
  • Hello Evan,

    Thanks for the update. I have not tried to reproduce this issue with any of the example applications. Our application has a custom OTA scheme that runs much faster than TI's OAD example.

    My guess is that it's some sort of subtle timing issue that only shows up under certain conditions. I can look into trying to tweak some example projects to reproduce it, but I don't know if it will be possible. There's a lot going on in our application that affects timing and is missing from the example projects, some of it asynchronous to the BLE activity. I also don't currently have a 2642 LaunchPad; I'm still using one of the old 2640 ones.

    -Josh

  • Josh,

    Which of the two sniffer logs that you sent show the event you showed in your picture? I'm having a hard time locating them or understanding your trace.

    These MIC issues can take a while to track down as an FYI. It's interesting that the slave starts transmitting data in the middle of a multi-packet CE. This may not be related, but I was hoping to see if there was a pattern of some sort to see if this happens each time you see a failure.
  • Hello Evan,

    Both of the *.cfa files I sent should have captured a MIC error leading to a disconnection. The "Cap4_62493_TI_Central.cfa" file is the one I posted the screenshot of. Open it, go to the "LE Data" tab, and scroll down to Frame # 62493; you should see the packet that caused the disconnection.

    In the Cap4_14818_UnableToDecrypt.cfa capture file, go to the "LE Data" tab, and scroll down to "Frame #" 14818 to see the corrupted packet. 

    Thanks,

    Josh

  • Josh,

    Can you tell me a bit more about what's going on in your application at the time of the failure?

    Are you using any SWIs, HWIs, or hardware encryption accelerators at this time? If so, what's going on with them?

    What's in your callback for this? It looks like the slave is processing incoming data and sending a start event during the multi-packet connection event after processing some data. From the logs alone it's not possible to tell what's happening.

    To give you an idea, the last MIC failure I saw was because of a nasty SWI lockout that would occur in very rare conditions. That was fixed though in 2.20.

    Are you able to reproduce this issue on some sample application that we could use for further debugging by chance?
  • Evan,

    It's hard to say exactly what's going on in our application at the time of the MIC failure since it occurs somewhat randomly. It certainly seems like some sort of timing issue / race condition given the behavior I've observed thus far. Here's a high-level summary of our application:

    1. The application regularly outputs log messages on UART; we're using the TI-provided UART driver, which I believe uses a HWI and SWI whenever we transmit (which is often)
    2. The co-processor on our PCB has a GPIO and SPI connection to the CC2642:
      1. The GPIO connection is used by the co-processor to trip an interrupt on the CC2642 when the co-processor has data it wants to send to the CC2642 over SPI (CC2642 is the SPI master)
      2. The co-processor sends ~30 messages per second to the CC2642, generating a GPIO interrupt for each message 
        1. These messages are sent regardless of the BLE state (connected/disconnected)
      3. The GPIO interrupt handler is thin: it puts a "service SPI" event into the application queue and then returns (see the ISR sketch after this list)
      4. We're using TI's SPI driver for communication with the co-processor and we're running it in blocking mode; the SPI driver also appears to use a HWI and SWI
    3. In addition to the ~30 messages the co-processor sends to the CC2642 every second, the CC2642 also sends certain BLE characteristic writes to the co-processor via SPI
      1. Our central writes to this characteristic about 7 times per second
      2. When the CC2642 App receives this write it transfers the data to the co-processor via SPI
      3. The co-processor will process this data, and send a response message back to the 2642 which forwards the message on BLE as a notification
      4. What this adds up to is ~30 SPI messages per second when BLE is disconnected and ~37 when BLE is active; when BLE is active, all of the SPI messages result in a notification
    4. We're also using 4 PWM / GPT modules to control LEDs on our board, but I don't believe this results in any HWIs firing. We do have a separate RTOS task that handles updating the LED duty cycles, and it runs with a higher priority than our main task ("main task" is our equivalent of SimplePeripheral_taskFxn())
    5. There's a third task, which is lower priority than both the main and LED tasks; all it does is process UART input. We have a basic CLI implemented for debug and test. This task is almost always pended, waiting for a UART character to be received
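
    To make item 2.3 concrete, the interrupt path looks roughly like this (App_enqueueEvt, APP_SPI_SERVICE_EVT, and CONFIG_GPIO_COPROC_IRQ are placeholder names, not our real identifiers):

        // Thin GPIO ISR: no SPI work happens in HWI context; the handler just
        // queues a "service SPI" event for the application task and returns.
        static void coprocIrqCallback(uint_least8_t index)
        {
            App_enqueueEvt(APP_SPI_SERVICE_EVT);   // placeholder for our queue helper
        }

        // During init, the co-processor's "data ready" line is hooked up with
        // something like the TI GPIO driver calls:
        //   GPIO_setCallback(CONFIG_GPIO_COPROC_IRQ, coprocIrqCallback);
        //   GPIO_enableInt(CONFIG_GPIO_COPROC_IRQ);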

    That's the application at a high level. Because of the number of confounding factors, I think it might be difficult to reproduce this issue in an example project. If I knew I could do it I'd pursue it, but it seems like something I could sink weeks into without success.

    Describing the application did give me a few ideas for trying to zero in on the issue. I can disable the SPI interface and just send dummy values to keep the BLE notification rate similar to the "full" application. If the error still occurs, I'd be a lot more confident I could reproduce it on a dev board, since that would indicate it's not something caused by the interrupt / timing jitter that the co-processor injects. I'll plan to give this a shot early next week and will report back.
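
    Roughly what I have in mind for the simulation, assuming the standard Util clock helper from util.c (fakeSpiClockCb, App_enqueueEvt, APP_SPI_SERVICE_EVT, and sendSimulatedCoprocMsg are placeholders):

        // Periodic clock standing in for the co-processor's GPIO interrupt,
        // firing ~30 times per second so the notification rate stays the same.
        static Clock_Struct fakeSpiClock;

        static void fakeSpiClockCb(UArg arg)
        {
            // SWI context: just signal the app task, same as the real ISR path.
            App_enqueueEvt(APP_SPI_SERVICE_EVT);
        }

        // In init: 33 ms initial timeout, 33 ms period, started immediately.
        Util_constructClock(&fakeSpiClock, fakeSpiClockCb, 33, 33, true, 0);

        // In the app task, APP_SPI_SERVICE_EVT now calls sendSimulatedCoprocMsg()
        // with a canned payload instead of doing a real SPI transfer.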

    -Josh

  • Josh,

    Thanks for the very detailed description of what is going on in your application. I agree with you that this will be rather difficult to track down.

    Have you been able to make any progress with the tests you've been working on?

    Also, we have released SDK 2.30. May be worth testing to see if the issue is seen on 2.30.

    To point 4, you aren't running this at a higher priority than the ICall-enabled tasks, correct? What priority are you running your ICall_createRemoteTasks at?
  • Hello Evan,

    Thanks for getting back to me. It looks like we're using the default priority of 5 for the remote tasks, from icall_addrs.h: #define ICALL_TASK_PRIORITIES { 5 }

    Our application tasks are running at priorities 1, 2, and 3 for the UART, main, and LED tasks, respectively.
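
    So the layout looks roughly like this (a sketch with placeholder names; the stack / ICall task stays at the icall_addrs.h default of 5, above everything else):

        // Application task priorities relative to the stack task (priority 5).
        #define UART_TASK_PRIORITY   1   // CLI / UART input task
        #define MAIN_TASK_PRIORITY   2   // our equivalent of SimplePeripheral_taskFxn()
        #define LED_TASK_PRIORITY    3   // LED duty-cycle update task

        Task_Params taskParams;
        Task_Params_init(&taskParams);
        taskParams.priority  = MAIN_TASK_PRIORITY;
        taskParams.stack     = mainTaskStack;
        taskParams.stackSize = sizeof(mainTaskStack);
        Task_construct(&mainTaskStruct, mainTaskFxn, &taskParams, NULL);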

    I have spent some more time working on the issue, here's the update:

    • Overrode our logging function, which effectively takes UART out of the mix
    • Simplified our writeAttrCB(): for our OTA data characteristic we were directly calling an app function from the callback rather than enqueuing an event. We were careful to make the app function quick to execute and non-blocking, but I was still suspicious of it. All other CBs were enqueuing an event like the example projects do (see the sketch after this list)
    • Cut our co-processor out of the mix by disabling the SPI interrupts; the BLE application now simulates the messages from the co-processor to keep the BLE throughput about the same
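
    The writeAttrCB() change in the second bullet basically defers everything to the app task, the same as the TI examples do. A simplified sketch (the profile and queue helper names are placeholders, not our real identifiers):

        // OTA data characteristic write callback after the change: copy the
        // payload and queue it for the application task; nothing else runs in
        // the stack's context.
        static bStatus_t otaProfile_WriteAttrCB(uint16_t connHandle,
                                                gattAttribute_t *pAttr,
                                                uint8_t *pValue, uint16_t len,
                                                uint16_t offset, uint8_t method)
        {
            App_enqueueEvtWithData(APP_OTA_DATA_EVT, pValue, len);
            return SUCCESS;
        }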

    Even with these changes I'm still seeing the MIC error. The next step is to build up a version of the simple_peripheral application that is as slim as possible but implements our GATT profile. If I can reproduce the issue in that application, I'm confident you can reproduce it on a LaunchPad with just a few changes to the board file. This work is currently in progress; hopefully I'll have something in the next couple of days.

    Thanks,

    Josh

  • Hey Josh,

    Thanks for the update and continued effort on your end to isolate. Once you have been able to create a version of it that is as slim as possible on simple_peripheral and is sharable, I'm more than happy to help reproduce on my end and see if I can help debug it as well.

    Do try 2.30 as well. I'm not sure there's anything specific in 2.30 that would fix this, but it couldn't hurt.

    Let me know and keep me updated!
  • Hello Evan,

    I've spent some more time trying to isolate this issue, and it doesn't seem to be going away. I updated to SDK 2.30 and started with a fresh simple_peripheral project. I merged in the bare minimum functionality to get our normal run-time & OTA BLE message exchanges occurring with the central. Most of the extra complexity from our application is gone: no UART, SPI, PWM, etc. I can still generate the MIC failure reliably when the peripheral is connected to our Qualcomm central.

    I also took a fresh simple_central project and set it up with our GATT profile and some minimal business logic. So far I can't generate the MIC failure using the modified simple_central, which is interesting. I'll try to examine differences between the simple_central project vs. our actual central to see if anything jumps out. Hopefully that will turn something up.

    We've ordered a couple of CC26X2 LaunchPads, and I'll create a build config for our modified simple_peripheral project that runs on that hardware. Hopefully I can get the simple_central project to start tripping the MIC failure as well, and then you should be able to reproduce the problem for troubleshooting.

    Thanks,
    Josh
  • Josh & Readers,

    This thread has been taken internal for now. When we have resolved this, I will post the update on this E2E thread.
  • Didn't mean to mark as TI thinks resolved. Can't undo that now, but I'll make sure to follow up with a resolving post when the issue is identified.
  • I've captured some more sniffer traces of the MIC issue and I've noticed something that's worth passing along. The L2CAP PDU Length field looks like it's corrupted in all of the offending packets. Here's an example:

    The sniffer is also flagging the packet as fragmented; I'm guessing this is caused by the bad PDU length, as our peripheral should never send a PDU large enough to require fragmentation. I captured 4 of these events, and the corrupted L2CAP header data seems to be random; I've seen lengths of 54819, 9254, 10293, and 63539.
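
    For anyone following along, the field in question is the 2-byte length at the start of the L2CAP basic header, which is the first thing inside the encrypted payload:

        // L2CAP basic header as it appears at the start of the (decrypted)
        // payload; both fields are little-endian. The length field is the one
        // showing the bogus values (54819, 9254, ...).
        typedef struct
        {
            uint16_t length;   // payload length in bytes
            uint16_t cid;      // channel ID; 0x0004 for ATT traffic
        } l2capBasicHeader_t;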

    In the simple_central project I created, I'm also occasionally seeing HCI_BLE_HARDWARE_ERROR_EVENT_CODE events with error code HW_FAIL_PKT_LEN_EXCEEDS_PDU_SIZE. I tried to correlate when I saw this event occur to a sniffer trace, and sure enough there was a packet the sniffer failed to decrypt with a corrupted L2CAP PDU length around the same time the central logged the error. I'm guessing the same underlying issue that causes the MIC disconnection with our actual central is causing this error on the TI central. On the TI central the packet is probably being discarded for other reasons before decryption, and so it doesn't trigger a MIC failure and subsequent disconnection. 
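
    For anyone wanting to catch the same thing, the hardware error event shows up in simple_central's stack-message handler; here's a sketch of logging it instead of halting (logHwError() is a placeholder, and the stock example calls AssertHandler() here):

        // Simplified stack-message handling in the central application.
        static void processStackMsg(ICall_Hdr *pMsg)
        {
            switch (pMsg->event)
            {
                case HCI_GAP_EVENT_EVENT:
                    if (pMsg->status == HCI_BLE_HARDWARE_ERROR_EVENT_CODE)
                    {
                        // The event payload carries the failure code, e.g.
                        // HW_FAIL_PKT_LEN_EXCEEDS_PDU_SIZE in the case above.
                        logHwError(pMsg);
                    }
                    break;

                default:
                    break;
            }
        }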

    -Josh

  • Just posting to keep this thread from locking since we are communicating internally.
  • This thread has been taken internal, and I'm purposely clicking "TI thinks resolved" so that the thread will close for now and won't count against internal metrics. I will update the public forum with the solution when we find it.