CC2530: OTA upgrade issues when more end devices connected at the same time

Part Number: CC2530
Other Parts Discussed in Thread: Z-STACK

Hi everyone,

We are building a Zigbee system using the ESP8266 and CC2530. On the coordinator, the ESP8266 serves as the Wi-Fi module and host SoC. When an OTA process begins, it downloads the OTA package from a cloud server and sends the file piece by piece over UART to the CC2530 Zigbee master on the same board. The CC2530 master, running Z-Stack, then forwards the data over Zigbee to the CC2530 end device.

So the flow is as follows:
Server --> [ESP8266 UART to CC2530] -> CC2530 Zigbee device

After some effort, OTA runs smoothly when only one CC2530 end device is connected to the coordinator, and it still works with 4 or 5 end devices connected at the same time. We designed the process to upgrade only one end device at a time, even when multiple end devices are connected. With 4 or 5 devices connected, all of them can be upgraded successfully one by one.

The problem appears when more than 6 or 7 devices are connected. In that condition we start to see one of the errors below at random, and they prevent OTA from finishing. Although the OTA process retries automatically, these two errors mean it can take more than 8 hours for one end device to complete OTA, and sometimes it has still not succeeded after 12 hours.

Error 1: After the Zigbee end device sends Image Block Requests, the ESP8266 replies with an Image Block Response over UART, but the Zigbee end device does not receive it and eventually replies with Abort.
Error 2: After the whole OTA image file has been transmitted without Error 1 occurring, the Zigbee end device replies with invalid image (#define ZCL_STATUS_INVALID_IMAGE 0x96).

Because everything works perfectly when doing OTA one-to-one or with only a few end devices connected, I am wondering whether there is any limitation in Z-Stack that could cause the problems above. Or does anyone have similar experience and can share some thoughts?

Thank you in advance.

  • I suggest you use a sniffer to check what happens over the air.

  • Hi,

    As YK mentioned, please attach a sniffer log.

    Which Z-Stack are you using?

    Error 1: Do you mean that, for [ESP8266 UART to CC2530], the CC2530 is unable to receive the block response via UART from the ESP (and therefore the end device never receives the Image Block Response)? Or that the CC2530 receives the block response via UART from the ESP and sends it over the air, but the Zigbee end device does not receive it?

    Error 2: Can you debug on the end device to see where exactly in the code that status is being returned?

    Besides Zigbee OTA, how often are the end devices communicating with the coordinator (the CC2530+ESP)?

    Regards,
    Toby

  • Hi Toby, YK,

    Thanks a lot for your reply. Please see my reply below:

    1) We are using Z-Stack-Home-1.2.2a.44539

    2) When Error 1 happens, the CC2530 master on the coordinator is not sending data to the CC2530 device over Zigbee. We can see the CC2530 device keep retrying to read data, but it gets no response until the abort error occurs.

    3) For Error 2, we found that 0x96 occurs after the download has completed and the CRC is checked. It is in "zcl_ota.c": the ZCL_STATUS_INVALID_IMAGE (0x96) error is returned after HalOTAChkDL runs.

    4) Communication between the end devices and the ESP+CC2530 master includes the following:
        a. Keep-alive: each end device sends a keep-alive every 20 seconds to the CC2530 master, which passes it over UART to the ESP8266.
        b. End devices other than the one doing OTA send a req image command every 30 seconds to the CC2530 master, which passes it over UART to the ESP8266.
        c. All end devices poll the CC2530 master constantly: the one doing OTA polls at a 2000 ms rate, and the others, not doing OTA, poll at a 100 ms rate.
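
To put the intervals in items a to c in perspective, the sketch below adds up the frames per second that end devices initiate toward the coordinator. It is only arithmetic on the figures quoted above. The function names are hypothetical, it assumes every device sends keep-alives and that only the non-OTA devices send the 30-second req image command, and it ignores MAC acknowledgements, retries, and the coordinator's responses.

```c
/* Hypothetical helper: frames per second produced by `count` devices that
 * each send one frame every `period_ms` milliseconds. */
double msgs_per_sec(unsigned count, unsigned period_ms)
{
    if (period_ms == 0) {
        return 0.0;
    }
    return (double)count * 1000.0 / (double)period_ms;
}

/* Estimated device-initiated frames per second at the coordinator when
 * `total` end devices are joined and exactly one of them is doing OTA.
 * Intervals are the ones quoted in this thread. */
double coordinator_load(unsigned total)
{
    unsigned ota  = (total > 0) ? 1u : 0u;         /* the single device in OTA */
    unsigned idle = (total > 0) ? total - 1u : 0u; /* devices not doing OTA    */

    return msgs_per_sec(idle, 100)      /* data polls, non-OTA devices (100 ms) */
         + msgs_per_sec(ota, 2000)      /* data polls, OTA device (2000 ms)     */
         + msgs_per_sec(total, 20000)   /* keep-alive, every device (20 s)      */
         + msgs_per_sec(idle, 30000);   /* req image, non-OTA devices (30 s)    */
}
```

With 11 devices joined, the ten non-OTA devices polling every 100 ms account for 100 polls per second on their own, which dwarfs the other terms. This does not identify a root cause; it only shows which interval dominates the load.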

    In addition to the information above, we are preparing the sniffer log now and will provide it for your reference as soon as we can. Any help will be appreciated.

    BR,
    Jacky

  • Let's wait for your sniffer log to check the issue.

  • Hi Toby, YK,

    We've managed to record a log while those two errors happened; please check the sniffer log at the link below. This capture was taken with 11 end devices connected to the coordinator.

    In the log, please filter on panid = 0xd035; that is the system on which we ran the OTA process. The log contains 4 OTA processes, with short addresses ac0f / 95cf / 95cf / ac0f respectively. Please check the two OTA runs of 95cf, because those 2 attempts contain the complete process from req image to the final MT_OTA_STATUS_IND. One of the 95cf OTA results is 0x96 and the other is 0x95 (Abort), as described in our problem earlier.
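
When going through a capture like this, the two status values reported in this thread can be mapped to readable labels. This is a minimal sketch, not Z-Stack code: `ota_status_name` is a hypothetical helper, 0x96 is the ZCL_STATUS_INVALID_IMAGE define quoted earlier, and the macro name ZCL_STATUS_ABORT for 0x95 is assumed to follow the same naming.

```c
#include <stdint.h>

/* Values as quoted in this thread; the ABORT macro name is an assumption. */
#define ZCL_STATUS_ABORT          0x95
#define ZCL_STATUS_INVALID_IMAGE  0x96

/* Hypothetical helper: map an OTA result status to a short label for logs. */
const char *ota_status_name(uint8_t status)
{
    switch (status) {
    case ZCL_STATUS_ABORT:
        return "ABORT";          /* Error 1: block responses never arrive   */
    case ZCL_STATUS_INVALID_IMAGE:
        return "INVALID_IMAGE";  /* Error 2: image rejected after download  */
    default:
        return "OTHER";          /* anything not discussed in this thread   */
    }
}
```

Calling `ota_status_name(0x96)` on the result of the first complete 95cf attempt yields the label for Error 2, and `ota_status_name(0x95)` yields the label for Error 1.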

    In the log for panid = 0xd035 you can also see the communication between the coordinator and the other end devices that were not doing OTA. As mentioned, we upgrade one end device at a time to increase the success rate, but we are not sure whether this regular traffic between the coordinator and the other end devices also affects the OTA process.

    Another point we would like to mention is that the whole data transfer becomes very slow with 11 end devices connected. We ran this setup overnight and only 2 OTA attempts ran to completion, ending with the 0x95 and 0x96 errors above respectively.

    We'd be very grateful if any help can be provided to solve the problem, thank you.

  • According to your sniffer log, neither ac0f nor 95cf finished requesting/receiving the OTA image completely. I see that your ZEDs' poll rate is usually less than 1 second. I would suggest increasing POLL_RATE on your devices and doing OTA to one device at a time to see whether the OTA process improves.

  • Hi YK,

    Thanks for the quick reply. In fact we suspected this factor too, so we had already tested it. We compared the current poll rate of 100 ms with the default of 8 s; however, it does not make much difference to the OTA result.

    It may slightly increase the overall process speed, but with more devices connected, such as 11 or the maximum of 15, all the problems remain: 0x96, 0x95, and slow file transfer still occur.

    Sorry for pressing, but can you find any other suspicious points in the sniffer log we provided?

    Thanks for your help.

    BR,

    Jacky

  • Are you able to increase the heap size?

    See section 3.3 Heap Size of this: http://www.ti.com/lit/wp/swra635/swra635.pdf

  • Hi, our current heap sizes are as follows:

    • coordinator(zigbee master): INT_HEAP_LEN 3072
    • end device: INT_HEAP_LEN 2048

    We will try adjusting this and see if it helps the situation. Thanks for the feedback.
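
For reference, a heap size like the ones listed above can be supplied as a preprocessor symbol. The guard below is a minimal sketch, assuming the project lets a compiler-defined INT_HEAP_LEN override a header default; where that default lives in this particular project is not shown in the thread. The CC2530 has only 8 KB of SRAM, so any increase competes with the stack and static data.

```c
/* Sketch only: fall back to the coordinator value reported in this thread
 * unless the build already defines INT_HEAP_LEN (for example as
 * INT_HEAP_LEN=<bytes> in the compiler's defined symbols).
 * The end device in this thread currently uses 2048. */
#ifndef INT_HEAP_LEN
#define INT_HEAP_LEN 3072
#endif
```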