RTOS/CC2642R: Stack crash in Simple Peripheral when connecting multiple devices

Part Number: CC2642R
Other Parts Discussed in Thread: BLE-STACK

Tool/software: TI-RTOS

We have been seeing the problems below since July 2018, and TI still has not fixed them even though we have reported the bug multiple times.

Stack crash when connecting with multiple devices

Setup

  • SDK 2.40 (January 2019)
  • Simple_Peripheral example from this release
  • Changed DEFAULT_ADDRESS_MODE to ADDRMODE_PUBLIC
  • Added the code snippet below (borrowed from the forum) at the end of the SimplePeripheral_removeConn function to print some memory information (with the #include at the top of the SimplePeripheral source file).

#include <xdc/cfg/global.h> // This is included to access cfg file variables

// Get the HeapSize
ICall_heapStats_t stats;
ICall_getHeapStats(&stats);

if((HEAPMGR_CONFIG & 0x03) == 0x00)
{
 Display_printf(dispHandle, 14,0, "Using Heap: OSAL");
}
else if ((HEAPMGR_CONFIG & 0x03) == 0x01)
{
 Display_printf(dispHandle, 14,0, "Using Heap: HeapMem");
}
else if((HEAPMGR_CONFIG & 0x03) == 0x02)
{
 Display_printf(dispHandle, 14,0, "Using Heap: HeapMem + HeapTrack");
}

Display_printf(dispHandle, 15,0, "Heap Size total: %d", stats.totalSize);
// Use a separate row so the total size printed above is not overwritten
Display_printf(dispHandle, 16,0, "Heap Size free: %d", stats.totalFreeSize);
  • 8 ESP32 modules flashed with the below Arduino code (note: it's very dirty code)
/**
   A BLE client example that is rich in capabilities.
   There is a lot new capabilities implemented.
   author unknown
   updated by chegewara
*/

#include "BLEDevice.h"
//#include "BLEScan.h"

// The remote service we wish to connect to.
static BLEUUID serviceUUID("0000fff0-0000-1000-8000-00805f9b34fb");
// The characteristic of the remote service we are interested in.
static BLEUUID    charUUID("beb5483e-36e1-4688-b7f5-ea07361b26a8");

const int ledPin = 21;
long myTime;

static boolean doConnect = false;
static boolean connected = false;
static boolean doScan = false;
static BLERemoteCharacteristic* pRemoteCharacteristic;
static BLEAdvertisedDevice* myDevice;

static void notifyCallback(
  BLERemoteCharacteristic* pBLERemoteCharacteristic,
  uint8_t* pData,
  size_t length,
  bool isNotify) {
  Serial.print("Notify callback for characteristic ");
  Serial.print(pBLERemoteCharacteristic->getUUID().toString().c_str());
  Serial.print(" of data length ");
  Serial.println(length);
  Serial.print("data: ");
  Serial.println((char*)pData);
}

class MyClientCallback : public BLEClientCallbacks {
    void onConnect(BLEClient* pclient) {
    }

    void onDisconnect(BLEClient* pclient) {
      connected = false;
      Serial.println("onDisconnect");
      ESP.restart();
    }
};

bool connectToServer() {
  Serial.print("Forming a connection to ");
  Serial.println(myDevice->getAddress().toString().c_str());

  BLEClient*  pClient  = BLEDevice::createClient();
  Serial.println(" - Created client");

  pClient->setClientCallbacks(new MyClientCallback());

  // Connect to the remote BLE Server.
  pClient->connect(myDevice);  // if you pass BLEAdvertisedDevice instead of address, it will be recognized type of peer device address (public or private)
  Serial.println(" - Connected to server");

  // Obtain a reference to the service we are after in the remote BLE server.
  BLERemoteService* pRemoteService = pClient->getService(serviceUUID);
  if (pRemoteService == nullptr) {
    Serial.print("Failed to find our service UUID: ");
    Serial.println(serviceUUID.toString().c_str());
    pClient->disconnect();
    return false;
  }
  Serial.println(" - Found our service");

  //    // Obtain a reference to the characteristic in the service of the remote BLE server.
  //    pRemoteCharacteristic = pRemoteService->getCharacteristic(charUUID);
  //    if (pRemoteCharacteristic == nullptr) {
  //      Serial.print("Failed to find our characteristic UUID: ");
  //      Serial.println(charUUID.toString().c_str());
  //      pClient->disconnect();
  //      return false;
  //    }
  //    Serial.println(" - Found our characteristic");
  //
  //    // Read the value of the characteristic.
  //    if(pRemoteCharacteristic->canRead()) {
  //      std::string value = pRemoteCharacteristic->readValue();
  //      Serial.print("The characteristic value was: ");
  //      Serial.println(value.c_str());
  //    }
  //
  //    if(pRemoteCharacteristic->canNotify())
  //      pRemoteCharacteristic->registerForNotify(notifyCallback);

  connected = true;
  digitalWrite (ledPin, HIGH);  // turn on the LED

  ////////////// disconnect ////////////////
  Serial.println("Resetting in 9...");
  delay(1000);
  Serial.println("Resetting in 8...");
  delay(1000);
  Serial.println("Resetting in 7...");
  delay(1000);
  Serial.println("Resetting in 6...");
  delay(1000);
  Serial.println("Resetting in 5...");
  delay(1000);
  Serial.println("Resetting in 4...");
  delay(1000);
  Serial.println("Resetting in 3...");
  delay(1000);
  Serial.println("Resetting in 2...");
  delay(1000);
  Serial.println("Resetting in 1...");
  delay(1000);
  delay(1000);
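  ESP.restart();
  return true;  // Not reached (ESP.restart() reboots the chip); gives the bool function a return value on this path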
  //////////////////////////////////////////

}
/**
   Scan for BLE servers and find the first one that advertises the service we are looking for.    
*/
class MyAdvertisedDeviceCallbacks: public BLEAdvertisedDeviceCallbacks {
    /**
        Called for each advertising BLE server.
    */
    void onResult(BLEAdvertisedDevice advertisedDevice) {
      Serial.print("BLE Advertised Device found: ");
      Serial.println(advertisedDevice.toString().c_str());

      // We have found a device, let us now see if it contains the service we are looking for.
      if (advertisedDevice.haveServiceUUID() && advertisedDevice.isAdvertisingService(serviceUUID)) { // Found our server
        BLEDevice::getScan()->stop();
        myDevice = new BLEAdvertisedDevice(advertisedDevice);
        doConnect = true;
        doScan = true;
      }


    } // onResult
}; // MyAdvertisedDeviceCallbacks


void setup() {
  Serial.begin(115200);
  Serial.println("Starting Arduino BLE Client application...");
  BLEDevice::init("");

  // Retrieve a Scanner and set the callback we want to use to be informed when we
  // have detected a new device.  Specify that we want active scanning and start the
  // scan to run for 5 seconds.
  BLEScan* pBLEScan = BLEDevice::getScan();
  pBLEScan->setAdvertisedDeviceCallbacks(new MyAdvertisedDeviceCallbacks());
  pBLEScan->setInterval(100);
  pBLEScan->setWindow(99);
  pBLEScan->setActiveScan(true);
  pBLEScan->start(5, false);

  pinMode (ledPin, OUTPUT);
  digitalWrite (ledPin, LOW);  // turn off the LED

  myTime = millis();
} // End of setup.


// This is the Arduino main loop function.
void loop() {

  // If the flag "doConnect" is true then we have scanned for and found the desired
  // BLE Server with which we wish to connect.  Now we connect to it.  Once we are
  // connected we set the connected flag to be true.
  if (doConnect == true) {
    if (connectToServer()) {
      Serial.println("We are now connected to the BLE Server.");
    } else {
      Serial.println("We have failed to connect to the server; let's reset and try again.");
      ESP.restart();
    }
    doConnect = false;
  }

  // If we are connected to a peer BLE Server, update the characteristic each time we are reached
  // with the current time since boot.

  if (doScan) BLEDevice::getScan()->start(0);  // this is just an example of restarting the scan after a disconnect; there is most likely a better way to do it in Arduino

  if (millis() > myTime + 5000 && connected == false) {
    Serial.println("No device found, resetting...");
    ESP.restart();
  }

  delay(1000); // Delay a second between loops.
} // End of loop

Procedure

  • Connect all ESP32 modules to an 8-port switch
  • Turn on the switch, then randomly turn it off and on. The Arduinos will also randomly start and stop by themselves.

Result

  • Within a few minutes, the stack will crash. See the below screenshot.
  • If you let about 5 modules connect, and then disconnect them all, you can see the memory footprint growing and growing.

Stack stuck, no crash (or maybe just not yet)

We tried something else.

Setup

  • We changed DEFAULT_DESIRED_MIN_CONN_INTERVAL to 12, DEFAULT_DESIRED_MAX_CONN_INTERVAL to 12, and BTM_BLE_CONN_TIMEOUT_DEF to 600.
  • We display the parameters with which devices initially connect by adding the following line in SimplePeripheral:
// Display the address of the connection update
Display_printf(dispHandle, SP_ROW_STATUS_2, 0, "Link Param Updated: %s __ Interval: %d Latency: %d Timeout: %d",
Util_convertBdAddr2Str(linkInfo.addr), pPkt->connInterval, pPkt->connLatency, pPkt->connTimeout);

  • And also at the end of the AddConn function:
Display_printf(dispHandle, SP_ROW_STATUS_2, 0, "Interval: %d Latency: %d Timeout: %d", linkInfo.connInterval, linkInfo.connLatency, linkInfo.connTimeout);
  • We added the following code to our Arduino project, just before pClient->connect:
esp_bd_addr_t addr;
memcpy(addr,myDevice->getAddress().toString().c_str(),6);
esp_ble_gap_set_prefer_conn_params(addr, 12,12, 0, 600);

This will cause the Arduino ESP32 example to connect without any parameter updates happening. We were hoping this might fix the issue (we've had issues with parameter updates in the past).
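One detail of the snippet above is worth checking: the memcpy takes its six bytes from the textual form of the address. Assuming toString() returns colon-separated hex text such as "24:0a:c4:01:02:ff", that copies the characters "24:0a:" and not the six raw address bytes. The following is a minimal sketch in plain C of converting such text to raw bytes. The function name parse_bd_addr is hypothetical, and the byte order (first printed octet stored first) is an assumption that should be verified against the ESP32 library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse a textual Bluetooth address of the form "aa:bb:cc:dd:ee:ff"
 * into six raw bytes (first printed octet goes to out[0]).
 * Returns 0 on success, -1 if the text is malformed.
 * Hypothetical helper for illustration; not part of any SDK.
 */
int parse_bd_addr(const char *text, uint8_t out[6])
{
    unsigned int b[6];
    char trailing;  /* only filled when extra characters follow the address */
    int matched;
    int i;

    assert(text != NULL && out != NULL);

    /* Exactly six hex fields must match; a seventh match means trailing junk. */
    matched = sscanf(text, "%2x:%2x:%2x:%2x:%2x:%2x%c",
                     &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &trailing);
    if (matched != 6) {
        return -1;
    }

    for (i = 0; i < 6; i++) {
        out[i] = (uint8_t)b[i];
    }
    return 0;
}
```

The resulting six bytes are what a raw address parameter such as esp_bd_addr_t expects, rather than the first six characters of the printable string.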

Procedure

  • Let the ESP32 connect a few times (it resets a few times automatically). You'll notice that the serial output sometimes shows 3 connections while only 1 device is actually connecting. Previously we reached a maximum of 2, but now 3. This probably happens because of the increased timeout value of 600 instead of 300.

Result

The result of this was that the stack got stuck. It didn't get into an Abort, but it stopped advertising and we were not able to connect anymore.

We think something in the stack goes wrong when handling multiple connections to the same device. Please help us: in 2 weeks we're starting a demo with our customers, and this issue happens often.

  • Hello,
    When you perform these tests, are you simply turning off the power supply for each BLE Central device, or do you allow them to disconnect gracefully?

    Do you have multiple BLE Centrals being connected at the same time?

    As a side note, could you please share the previous E2E threads where you have reported this?
  • Hi Joakim,

    Thanks for your swift reply. We turn off the power (or let the devices reset), so not a graceful disconnect. This obviously also happens if people lose connectivity when they are -just- out of range, and then pickup the signal again.

    As said, the first test was with 8 devices. The second test used only one device: after changing the timeout from 300 to 600 and using a low connection interval, the stack became unresponsive within a minute (no Abort(), but advertising stops).

    We previously suspected it had to do with parameter updates (also considering other bugs there that were solved in the 2.40 SDK). But I'm starting to believe it might be because Android phones simply set different parameters at the start (possibly higher timeout values). iPhones seem to adhere to the advertised preferred connection parameters, while Android and also our ESP32 seem to ignore them.

    Below the list of the previous posts (but tested with older SDKs, and also reported other issues that might be, or not, related to this issue).

    - e2e.ti.com/.../2840209
    - e2e.ti.com/.../2726014
    - e2e.ti.com/.../2714998

    Thanks, hope you can help us out!
  • Hi,
    So let's narrow down the test case. It sounds like the real problem is that the BLE-Stack does not start advertising after the supervision timeout, which is odd (assuming that it's configured to do so). Are you able to sniff the connection (with Frontline or Ellisys) so that we can take a look at the air traffic?

    To confirm, you do not see any issues with iOS devices? (only with Android and ESP32)
  • I just got verification that the referred fix never went into the SDKv2.40.

    Ref. BLESTACK-4564 - Known issue where queued param updates in slave device cause application assert
  • Hi Joakim,

    Just to be clear, I found two problems, that might be related:

    1. An abort() when connecting/disconnecting multiple devices.

    2. A BLE stack that becomes unresponsive even when not sending any parameter updates. I think it's not just the advertising that stops, because in the past one of our other BLE devices was not able to connect anymore, and that device was not scanning for advertisement, but just trying to connect. Also, I see memory growing bigger. So, I'm suspecting something in the stack goes wrong.

    About BLESTACK-4564: I think it did make it into the SDK, because I've seen the code, and we have previously found and fixed that issue ourselves. The function is called SimplePeripheral_clearPendingParamUpdate and it didn't solve our issue unfortunately.

    Also, as I mentioned in my initial post, we have disabled parameter updates, and the issue persists, especially after increasing connection timeout. So testing with iPhones will likely give the same results - it's just harder to reproduce (because I can't power cycle them easily all at once).

    I think the best way for you to replicate the issue is to let 8 other TI BLE devices connect to the CC26x2 and then power cycle them a few times. You'll soon see a crash or a stuck stack (without parameter updates).

    In practice (without power cycling devices) it still doesn't make total sense to me, because both the master and slave should have the same timeout, so I wouldn't expect the stack to become unresponsive. But we've seen this issue in the past with Android devices (sometimes it just took an hour before the stack got stuck), so who knows what's going on. Maybe there is even a third bug, but without fixing this one, we'll not be able to pinpoint it.

    Currently I don't have tools to sniff air traffic. Whether iOS devices work or not work: we really need Android and ESP32 devices to work, so I hope you can forward this issue to the development team. Thanks!

  • Hi Jeroen,

    I'll see if I can replicate it here locally. Could you advise on

    • Advertising Interval for the Peripheral Device?
    • Connection Parameters (Conn. Interval, Slave Latency, Supervision Timeout) for the Central Devices?
    • Do you use Pairing/Bonding features?
    • Is Service Discovery performed on each connect? (should not if you use bonding)

  • Hi,

    • Advertising interval peripheral: I didn't change it. It's the SimplePeripheral default.
    • Connection parameters central device
      • Interval min max: 12, 12 (we put min and max to the same)
      • Slave latency: 0
      • Timeout: 600
    • Connection parameters peripheral
      • Interval min max: 6, 12 but also 12, 12 (both resulting in 12). I've tested in the past with the default ones of the SimplePeripheral project as well.
        Note: my goal was to prevent any Parameter Updates from happening, to make sure those were not the issue. Because of a bug in our ESP32 Arduino central device's BLE stack, the central always connects with interval 12 and timeout 600. Therefore I changed it to these settings on the Ti as well. Previously I have tested with interval 24 and timeout 300 (on a ESP32 with non-Arduino code). Also in that case, the firmware would get into an abort(). So I doubt the interval is the issue.

      • Slave latency 0
      • Timeout: 600 (it seems that the error occurred faster then. Yesterday I tried a large setting again and it didn't get stuck as fast). I suspect that the "ghost connections" (one device rebooting and connecting, while cc26x2 thinks 3 different devices are still connected sometimes) might cause some memory not to be freed.

    • No pairing/bonding features used.
    • Service discovery: I'm not entirely sure. We look for a service UUID. 

    The fastest way I managed to crash it was by manually cutting power to all centrals just as they were connecting, and then restoring power quickly.

    Feel free to contact me directly via Skype (pierre-oord) or phone (+31 6 23497020). I'm in timezone GMT+1. Thanks!

  • Jeroen,

    Let's keep the Peripheral Connection parameter update out of the picture for a while.

    Do you really need such a small connection interval? I do not see how the Peripheral can be connected to multiple Centrals reliably. The Central mandates the connection event anchor point, which could potentially conflict with other connections as they are established. There is no way the Peripheral can aid in that, per specification, as far as I know.

    Have you tried larger connection intervals, i.e. 200ms?

    The Peripheral device will assume that the Central is connected for the entire Supervision timeout period (unless gracefully disconnected), as defined by the core specification. So the Peripheral will keep waking up for those connection events, listening for the Central device. As you turn the supply on and off on multiple Central devices within a short amount of time, I understand you are stressing the Peripheral device.
  • Hi Joakim,

    Thanks again for the swift reply.

    I've now changed the connection intervals as follows:
    - Peripheral (Ti cc26x2): min 80, max 104, slave 0, timeout 2000
    - Central (ESP32 with Arduino library): min 80, max 104, slave 0, timeout 2000

    Due to a bug in the Arduino library, it will however connect with interval 12 and timeout 600. Then, after 6 seconds, a parameter update will happen (resulting in a 104 latency and timeout of 2000).

    I let 3 ESP32 connect at the same time. Right after the parameter update (104 latency, 2000 timeout), I kill power of the centrals. I restore power. Boom, crash. Sometimes you've to repeat the procedure a few times.

    I hope this proves that even though the ESPs run on a 104 latency, the crash (ABORT - ABNORMAL PROGRAM TERMINATION. CURRENTLY JUST HALTS EXECUTION) will still happen. I'm not sure if the "stopping of advertising" has anything to do with this bug, but I suspect it has. Or it's something else. So after fixing this abort, let's keep trying to connect and disconnect with multiple devices for one hour, to see if the stack keeps responsive.

    Oh, when I'm talking about latency and timeout, I mean the connection interval and the supervision timeout: 104 = 130 ms (units of 1.25 ms) and 2000 = 20 seconds (units of 10 ms), which is the hard-coded default of most Android phones.
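The raw values quoted throughout this thread can be converted to milliseconds with a small helper. This is a minimal sketch based on the BLE units of 1.25 ms per connection-interval step and 10 ms per supervision-timeout step; the function names are hypothetical, and the asserted ranges are the limits the specification allows for these fields.

```c
#include <assert.h>
#include <stdint.h>

/* Connection interval: raw value in units of 1.25 ms (valid range 6..3200). */
double conn_interval_ms(uint16_t raw)
{
    assert(raw >= 6 && raw <= 3200);
    return raw * 1.25;
}

/* Supervision timeout: raw value in units of 10 ms (valid range 10..3200). */
uint32_t supervision_timeout_ms(uint16_t raw)
{
    assert(raw >= 10 && raw <= 3200);
    return (uint32_t)raw * 10u;
}
```

With these helpers, an interval of 104 maps to 130 ms and a timeout of 2000 maps to 20000 ms, while the interval of 6 that appears in the Android traces corresponds to 7.5 ms.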


    Pierre

  • Hi Pierre,
    We are setting up some tests to reproduce, with short connection intervals. I'll keep you posted on our progress.

    When you say "Abort", I suppose you refer to a condition on the ESPs? I am not familiar with these devices although it concerns me that they are connecting with minimum connection interval per default.
  • Hi,

    I see the same issue with devices that initially connect with longer intervals as well (Android phones, ESPs with the non-Arduino SDK). However, it's then much harder to reproduce. If you look at one of my older topics, you will find an Android app that connects and disconnects. If you can find about 7 Android phones that the app works with (some devices' BLE stacks don't work well with this app), you should be able to reproduce the crash as well (but it can take 20-60 minutes).

    When I'm talking about an Abort, I'm talking about the Ti cc26x2 that halts while being in debugging mode (by clicking the "bug" icon in Code Composer Studio). In my opening post of this topic, I have attached a screenshot of this crash (notice the abort() where I'm talking about). The ESP's are doing just fine.

  • Hi Jeroen,
    Sorry for not catching the abort reference.

    We have a setup with 3 centrals connecting to a peripheral device, where we power cycle without being able to get any abort. We are using SDK v2.30 and Rev C versions, although according to your comments it should not matter. On some occasions the Peripheral is not able to advertise after 2 connections, which is probably because its radio is busy with connection events (those will always be prioritised). Note that we do not perform any data transfers (service discovery etc.), which might be one of the reasons you observe memory challenges.

    Have you been able to reproduce this with TI launchpads as central? It's critical that we are able to reproduce this in order to figure out a solution.

    I am very surprised that longer connection intervals can cause the same problem. Does this problem (unresponsive peripheral) also occur when the peripheral is not in debug?
  • Hi Joakim,

    No worries :)

    You mention that at some point it didn't advertise after 2 connections. Would it ever restart advertising after the connections were dropped? When I see that advertising stops, it never restarts.

    Strange that you're not able to reproduce it. Could I maybe simply send you our CC26x2 Dev Kit (with correct mac address) and some ESP32 centrals so you can reproduce it with those? Please send me the address and I'll send a complete package by mail for you to test. This issue is of huge importance to us, I'd even take a flight to test it together with you, if that would help.

    Today I'm not at the office, tomorrow I can check if debug mode makes any difference. But mostly we see the "abort()" issue now, only sometimes the unresponsiveness. Therefore I'd say let's start with fixing the abort, and hopefully we fix the unresponsiveness with that as well.

  • Hi,

    You are correct that advertising should start as soon as all connections are dropped, which is something we are looking into. I am still uncertain how the stack is able to handle a large set of devices connecting with minimum connection interval. I found this information in the readme for MultiRole project, which might be equally applicable to your setup;

    If the project is configured for too many connections (via the MAX_NUM_BLE_CONNS preprocessor define) and also security, it is possible for heap allocation failures to occur which will break the stack. Therefore, the project should be stress-tested for its intended use case to verify that there are no heap issues by including the HEAPMGR_METRICS preprocessor define. See the Debugging section of the BLE5-Stack User’s Guide for more information on how to do this.

    When at least one connection is already formed, in order to allow enough processing time to scan for a new connection, the minimum possible connection interval (in milliseconds) that can be used is:

    12.5 + 5*n

    where n is the number of current connections. For example, if there are currently four connections, all four connections must use a minimum connection interval of 12.5 + 5*4 = 32.5 ms in order to allow scanning to occur to establish a new connection.
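The readme's rule of thumb can be checked numerically. This is a minimal sketch that only encodes the formula 12.5 + 5*n quoted above; the function names are hypothetical, and the result is a guideline from the MultiRole readme, not a guarantee about stack behavior.

```c
#include <assert.h>

/* Minimum connection interval in ms for n existing connections: 12.5 + 5*n. */
double min_conn_interval_ms(int n)
{
    assert(n >= 0);
    return 12.5 + 5.0 * n;
}

/*
 * Largest n for which 12.5 + 5*n <= interval_ms, i.e. how many existing
 * connections the rule of thumb permits at a given connection interval.
 * Returns 0 when the interval is too short for any existing connection.
 */
int max_connections_for_interval(double interval_ms)
{
    if (interval_ms < 12.5) {
        return 0;
    }
    return (int)((interval_ms - 12.5) / 5.0);  /* truncation rounds down */
}
```

For four connections the first function gives 32.5 ms, and for a 100 ms interval the second gives 17 connections.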

    By applicable, I mean that the device must be able to maintain connections and advertise while connections are being dropped as well. Since the Peripheral has to obey the master's anchor points, it might be even more crucial.

    You should not have to send us hardware. Could you please confirm the version of the launchpad and chipset? 

  • To confirm our setup: if the device doesn't start advertising after 2 or more connections, it starts to advertise as soon as a connection drops. So this is related to available open radio slots. This is not a real issue, as the BLE Stack is able to recover as time slots open up again. At this point we still cannot reproduce the problem you are seeing.

    If you do not use pairing/bonding, the central devices will (and should) perform service discovery (as mandated by the specification). As you power cycle devices during one or multiple service discovery phases, I'm sure it's possible to stress the system enough for an abort. Could you try disabling the service discovery and reproducing it?
  • Hi Joakim,

    Thanks for the information. Following your formula, an interval of 100ms should be fine for at least 17 devices – and I’ve never had more than 7 devices connected (usually around 4).
    Today, I have tested again. I finally found a way that always makes it crash, no luck required. I tested with:

    • Both the Launchpad (Rev C? See pictures; there is no Rev number on the MCU itself) and Rev E silicon
    • On Rev E, also without debugging mode (I added a counter to see if the board was still alive)

    Board

    Info about our ESP32 devices

    • Configured to connect at a certain (see later) connection interval and timeout. No param updates will happen.

    Info about our Android device #1

    • Model: Xiaomi Mi6
    • Android version: 8.0.0
    • Has BLE5 support (PHY updates)
    • When it connects to SimplePeripheral (as sole central), we see the following happen:
      • Initial android connection: 36 (=45ms) interval, 500 timeout
      • PHY Update to 2M
      • Link param update to interval 6 (=7,5ms), timeout 500
      • Link param update to interval 36 (=45ms), timeout 500
      • Link param update to interval 102 (=127,5ms), timeout 2000
    • I assume it discovers services and characteristics

    Info about our Android device #2

    • Model: Motorola G5+
    • Android version: 8.1.0
    • Has no BLE5 support (no PHY updates)
    • I assume it discovers services and characteristics

    Source files (you can see my changes with WinMerge or so)

    Please read the below test very carefully, I sometimes made slight adjustments to intervals.

    Test 1.0: 5 ESPs with Android, timeout 2000 - abort()

    Setup

    • SDK 2.40
    • SimplePeripheral configured to interval min 80=100ms, interval max 104=130ms, 2000 timeout.
    • Centrals will connect with interval 104=130ms, timeout 2000.

    Procedure

    • Connect 5 ESP32 centrals, wait 2 seconds before connecting each one. Centrals are NOT discovering characteristics or services.
    • Connect Android phone with NRFConnect app.

    Result

    • Centrals all connect. No parameter updates happen.
    • When connecting Android
      • First we see it connect with interval 36 (=45ms) interval, timeout 500
      • I see the param update of interval 6 (=7,5ms), timeout 500
      • Right at that moment, or just afterwards, it crashes (abort).

    I did it multiple times, and I can reproduce it exactly. Note that Android completely ignores the peripheral's min/max interval (it doesn't care about what's in the advertisement packet). So we cannot change the interval of 6 that Android uses.

    Test 1.1: 5 ESPs with Android, timeout 300 - abort()

    Setup

    • Same as test 1.0, but:
    • SimplePeripheral configured to interval min 80=100ms, interval max 104=130ms, 300 timeout. I used this lower timeout setting, otherwise advertising would often stop due to >=8 devices.
    • Centrals will connect with interval 104=130ms, timeout 300 (This I could earlier not show, because we developed non-Arduino code for the ESP32).

    Procedure

    • Connect 5 ESP32 centrals, wait 2 seconds before connecting each one. Centrals are NOT discovering characteristics or services.

    Result

    • Centrals all connect. No parameter updates happen.
    • When connecting Android
      • First we see it connect with interval 36 (=45ms) interval, timeout 500
      • I see the param update of interval 6 (=7,5ms), timeout 500
      • Right at that moment, or just afterwards, it crashes (abort).

    I did it multiple times, and I can reproduce it exactly. Note that Android completely ignores the peripheral's min/max interval (it doesn't care about what's in the advertisement packet). So we cannot change the interval of 6 that Android uses.

    Test 1.2: 5 ESPs no Android, timeout 300 - stuck in a simple_peripheral_spin()

    Setup

    • Same as test 1.1, but:
    • Test without Android phone
    • Centrals will connect with interval 104=130ms, timeout 300.

    Procedure

    • Connect 5 ESP32 centrals, wait 2 seconds before connecting each one. Centrals are NOT discovering characteristics or services.
    • Randomly connect and disconnect devices, sometimes all at once. 
      Please see this video of how I do it. I think I've reconnected each device about 20 times.

    Result

    • Centrals all connect. No parameter updates happen.
    • After a while, the counter stops. No abort().
    • When I pause the program, I see it's stuck in a spinner. See image below:

    I guess this bug might be unrelated to our main issue, but please also put it on your list.

    I was actually expecting to see an abort() as in previous tests, but it didn't happen. Maybe the abort() only happens easily when the timings are smaller, like I tested previously with. It's still a bug, as Android does connect with such low timings, and half the world is using it... Read on!

    Test 2: 5 ESPs, 6th ESP with low latency 6=7,5ms like Android param update, timeout 2000 - works fine

    Setup

    • Same as test 1.0, but:
    • I added a 6th ESP32 instead of an Android phone, and changed settings to interval 6=7,5ms, timeout 500.

    Result

    • No abort(). I disconnected and reconnected it, still no abort().
    • I connected my Xiaomi Android phone: directly an abort(). 

    Could it be that the 36=45ms latency causes a problem? Let's test.

    Test 3: 5 ESPs, 6th ESP with low latency 36=45ms like Android param update, timeout 2000 - works fine


    Setup

    • Same as test 2, but now with 6th ESP32 and settings to interval 36=45ms, timeout 500.

    Result

    No abort().

    Test 4: 5 ESPs, 6th ESP with service and characteristics discovery, timeout 2000 - works fine

    Setup

    • Same as test 3, but now with 6th ESP32 configured to discover services and characteristics (I did not verify if this works correctly, I'm not an ESP32 expert)

    Result

    No abort().

    Note: I have not been able to completely verify if the ESP32 is indeed doing service discovery and characteristics discovery, there was no debug output.

    Test 5: 5 ESPs with Android non-BLE5 , timeout 2000 - abort()

    Setup

    • Same as test 2, but now with Android device without BLE 5 (so no PHY update) (Motorola G5+)

    Result

    Directly an abort()

    Test 6: 5 ESPs with Android, timeout 2000, no debugger - abort()

    Setup

    • Same as test 1, but now without running in debugger

    Result

    Directly an abort()

    Test 7: 5 ESPs with/without Android, timeout 2000, old 2.30 SDK - abort()

    Setup

    • SDK 2.30 (!!!!)
    • SimplePeripheral configured to interval min 80=100ms, interval max 104=130ms, timeout 2000 (=20 s).
    • Back ported the issues fixed in 2.40 SDK (ampersand, clear pending parameter updates)

    Result

    • Often when the 5th ESP32 connected, it already crashed. If we were lucky, it stayed stable.
    • Then, when we connected an Android phone, it again crashed like with the 2.40 SDK.

    Conclusions

    • The number of connections seems to matter. I used 5 ESPs and 1 Android in this test case. Fewer will likely also make it crash. I did one test with 3 ESPs and 1 Android, and it crashed, but it took a few more seconds.
    • The parameter update on Android seems to cause problems. Or it's something else that happens when an Android phone connects, and this just happens to be one of the messages. We have used Android 8.0.0 and 8.1.0. It would be interesting to let a parameter update happen the way Android phones do it (3 times, in quick succession), but I don't have such a setup ready. It is interesting that Android initially connects with very low intervals (like I was previously doing as well). We cannot stop this, as it's the Android BLE stack. I also think it's OK with the BLE spec (phones don't have to look at the broadcast suggested parameters, like iOS does).
    • The PHY update (BLE 5) is unrelated to this problem.
    • The timeout of 2000ms or 300ms (both on simple_peripheral and on ESP32s) doesn't seem to influence anything really.
    • Debug mode seems to have nothing to do with it.
    • SDK 2.30 seems to be a lot more sensitive to these crashes: it often crashes even without an Android phone connecting, even though I backported the fixes to Simple_peripheral.c. So, please use 2.40, as it seems there are differences.
    • Multiple bugs?
      • In my previous posts I also had issues without having seen any parameter updates happening, for example when I configured very low intervals once. As seen in 1.1, I also had some trouble this time reproducing this issue with larger intervals, although at some point it did get stuck in a spinner.
      • It could be that the stopping of advertising is unrelated to this issue.

    Again, I can send you a few ESP32 modules if you like, or I can send you the ESP32 source code, which you can use to connect to any board. It's easy to set up and to have it connect to a MAC address. Let me know if you have any questions. Hope you'll be able to reproduce it now!

  • Hi Joakim,

    Any update on this? :)

  • Hi Jeroen,
    I have assigned one of the BLE Stack experts to take a look at this. Sorry for the delay in response.

    Side question: why haven't you migrated to SDK v2.40?
  • Hi Joakim,

    We have migrated to 2.40. But I did one test with 2.30, because you asked me to... If I didn't mention it in my reply, it was SDK 2.40.

    Thanks, hope to hear back from them soon!

    Pierre
  • Hi. Thank you for the very detailed and concise analysis. I agree that the issue(s) appear to be triggered by the Android phone. I can try to reproduce this using TI devices as centrals and whatever Android device I can find. However, there are so many variables here that it is very likely not to be an equivalent test.

    So in parallel, I think you should send us your devices. I will send you a PM with directions.
  • Hi Tim,

    I've sent out a box containing:
    - 5x ESP32 WROVER module
    - 5x USB cable
    - 1x USB hub with external power
    - 1x Android phone (Motorola G5+)
    - 1x printed return address (Loqed B.V. - Donauweg 10 - 1043 AJ Amsterdam - The Netherlands)

    Also
    - At this link you can download the source test software (for Linux) for the ESP32: www.loqed.com/.../connection_tester.zip
    - At this link you can read a quick howto of how to get it running: www.loqed.com/.../Ti-setup-instructions.pdf

    Let me know if there are any questions, hope we have a fix soon!
  • Hi. I received the hardware today. I should be able to set this up tomorrow. I'll let you know if I run into any problems.
  • Well, this is not going well.
    I can't install the driver required for these devices because it is blocked by our IT. I'll try to get this resolved, but until then I am blocked.
  • That's too bad, hope you find a solution soon! It's a major issue for us.

  • I was able to get around the IT limitation. However, the driver doesn't work with the device. Do you know where I can find a Windows driver for these boards?
  • Hi Tim,

    The board is a "Wemos LOLIN D32 Pro V2 - ESP32 - CH340C - 4MB Flash - 4MB PSRAM". The Windows driver for the USB->Serial can be found here: http://www.wch.cn/download/CH341SER_EXE.html

    The setup of your ESP32 environment will also be slightly different. This page should get you started.

    Feel free to give me a call if there is anything. +31 6 23497020

  • Hi Tim,

    It's very quiet, how is it going? We're in serious need of a patch for this issue!

  • Well, the setup has been quite painful. I've made it to the make flash step now but am getting the following error:

    C:\msys32\opt\xtensa-esp32-elf\bin\xtensa-esp32-elf-objdump.exe: vfs.o: File format not recognized
    make: *** [esp-idf/make/project.mk:438: /home/a0221118/connection_tester/build/vfs/libvfs.a.sections_info] Error 1

    Honestly, it might be easier for you to just send me pre-flashed devices. I can easily change the address of my peripheral.
  • Hi Tim,

    On the below link you can find the connection tester binaries, ready to flash. I included some spaces in the URL to ensure the message is posted directly, instead of getting into the review process.

    www . loqed . com / Loqed_Uploads / connectiontest_binaries . zip

    Below is how you can flash them (no need to build). I think most steps you have already taken, maybe you only need to do the final step to start flashing. The binaries will try to connect to a device with MAC 80:6F:B0:EE:EA:13.

    1. Install Python
    2. Install PIP for Python (not required for newer Python versions): www.makeuseof.com/.../
    3. Install ESPTool pypi.org/.../
    4. Ensure the serial port is available (change the COM number in the below command if required) and then:
    esptool.py --chip esp32 --port COM19 --baud 2000000 --before default_reset --after hard_reset write_flash -z --flash_mode dio --flash_freq 40m --flash_size detect 0xd000 ota_data_initial.bin 0x1000 bootloader.bin 0x10000 LoqedBridge.bin 0x8000 partitions.bin

    As said before: if you see dots while it looks for the device to flash and it doesn't start, then try to connect PIN 0 to ground a few times. The flash process will then start (although on my Windows machine, I actually had no issues at all compared to my virtual Linux machine).

    Let me know if you run into any issues! You can also call +31 6 23497020, I'm available till 22:00 every day (GMT +1).

    Pierre

  • Thanks, this looks promising. I will try this out on Monday.
  • It appears that I am able to flash with this method. However, I am missing two of the binaries above. Just to be safe, can you send me the binaries you are using?
  • Hi Tim,

    Which binaries are you referring to? I've included all you need (I tried flashing myself on a different machine).

    I just tried to call you, but they had some trouble locating you. Maybe it's good if you give me a call, so we can more quickly find a solution. My number is +31 6 23497020

  • OK, I must have deleted them somehow when I was building.

    I will try flashing again just using the original .zip you sent me.

    Assuming this works, how do I know which address these precompiled binaries are trying to connect to? I will set my peripheral device to this address.
  • I unzipped the connection_tester.zip that you sent me and it does not include the following files:
    - ota_data_initial.bin
    - loqedbridge.bin
  • Hi Tim,

    In my post, I linked to a new file that contains all the files you need. Also, in the same post, I mentioned the MAC address that the devices connect to. Below is a copy-paste of that post.

    Hi Tim,

    On the below link you can find the connection tester binaries, ready to flash. I included some spaces in the URL to ensure the message is posted directly, instead of getting into the review process.

    www . loqed . com / Loqed_Uploads / connectiontest_binaries . zip

    Below is how you can flash them (no need to build). I think most steps you have already taken, maybe you only need to do the final step to start flashing. The binaries will try to connect to a device with MAC 80:6F:B0:EE:EA:13.

    1. Install Python

    2. Install PIP for Python (not required for newer Python versions): www.makeuseof.com/.../

    3. Install ESPTool pypi.org/.../

    4. Ensure the serial port is available (change the COM number in the below command if required) and then:

    esptool.py --chip esp32 --port COM19 --baud 2000000 --before default_reset --after hard_reset write_flash -z --flash_mode dio --flash_freq 40m --flash_size detect 0xd000 ota_data_initial.bin 0x1000 bootloader.bin 0x10000 LoqedBridge.bin 0x8000 partitions.bin

    As said before: if you see dots while it looks for the device to flash and it doesn't start, then try to connect PIN 0 to ground a few times. The flash process will then start (although on my Windows machine, I actually had no issues at all compared to my virtual Linux machine).

    Let me know if you run into any issues! You can also call +31 6 23497020, I'm available till 22:00 every day (GMT +1).

    Pierre

  • Hi Tim,

    Any update? Perhaps it would be good to call?

    Pierre

  • Hi, sorry I haven't been able to work on this in a while. I see now the link you posted to the connection tester binaries; I missed it before.
  • Hi Tim,

    I'd like to make an appointment with you in Norway/San Diego to debug the issue together there. Our company (startup) will not survive if we don't solve this issue, and we've been working on this for 8 months now. Our investors are giving up on us as we are not able to solve this issue.

    As I told you before, apart from the crash, it's even more severe: if it doesn't crash, the BLE stack simply stops working sometimes. It happens often with multiple phones connected (2 iOS, 3 Android, 1 external device), but -also- with just one iOS phone (just less often). It's clearly something bad happening in the stack - it's happening with the Simple Peripheral project as well. The application is still running, but the BLE stack will never return.

    I'd like to book a flight to Norway/San Diego to fix this issue with you together. Would any day this week work for you?

  • I'm marking this as resolved for tracking purposes for now since we are communicating via email. Once we have final fixes, I will post them here.