SIMPLELINK-CC13X2-26X2-SDK: Firmware not stable since SDK 6.20.00.29

Part Number: SIMPLELINK-CC13X2-26X2-SDK
Other Parts Discussed in Thread: Z-STACK, ARM-CGT, CC2652RB, SYSCONFIG, CC2652P, CC2652P7, CC1352P

With recent SDKs, many users are reporting stability issues (total crashes, MAC errors, NWK table full errors, devices dropping off). This seems to affect 6.20.00.29 and up, since 6.10.01.01 works well for many users. To figure this out, I've compiled 2 firmwares where only the SDK version was different. See an overview of all the results in this spreadsheet.

I do not see any significant changes in the changelog of 6.20.00.29.

My question: what has been changed in 6.20.00.29 and up which could cause these issues?

  • Hi Koen,

    I have reviewed the Z-Stack source code and found no significant changes between the v6.10 and v6.20 SDK, as you've already noticed is reflected in the changelog.  I have also not been observing similar reports from other customers using the newer SDKs.  The v6.20 SDK Release Notes reflect many global changes which affect Z-Stack:

    • Removed support for TI ARM-CGT compiler examples and libraries, in favor of the TI Clang Compiler
    • Deprecated and removed support for PIN Driver in favor of GPIO++
    • Deprecated and removed support for UART driver, in favor of UART2 in all cases except in BLE5-Stack’s NPI.

    What dependencies are you using to build the v6.20 Z-Stack project, how did you migrate your resources from v6.10, and what small changes have persisted?  Note that ZNwkTableFull typically relates to MAX_RTG_ENTRIES/MAX_RREQ_ENTRIES/MAX_RTG_SRC_ENTRIES and recall that there have been changes to default heap allocation during these SDKs.

    Regards,
    Ryan

  • - How can I see the dependencies? I've compiled both the working (6.10) and non working (6.20) firmware with CCS 12.4.0

    - I've migrated the changes from 6.10 to 6.20 by creating a patch file of all the changes

    - Here is a link to this patch (for just the CC26XR1, but the patch for CC2652RB/CC1352 is the same): patch . Here you can review all the table sizes. Apart from some file hashes that changed, the patch for 6.10 is the same.

    - I'm aware of MAX_RTG_ENTRIES and MAX_RTG_SRC_ENTRIES, but not of MAX_RREQ_ENTRIES! Thanks for pointing me to it. Could you also review the other table sizes I'm using? You can find them in the preinclude.h in the patch I linked. 

          - For example, I'm still unsure what value to use for SRC_RTG_EXPIRY_TIME/ROUTE_EXPIRY_TIME. I know 255 disables route removal, but what happens if the table gets full? On the other hand, a value of 10 would expire the route in 10 seconds, but AFAIK it will not be removed until the table gets full. So I don't understand why someone would ever use 255 here.

    - Regarding the default heap allocation, this was already changed in 6.10 (which works fine), so this cannot be the issue.

    - The ARM-CGT compiler also cannot cause this issue, I was already using the TI clang compiler for my 6.10 firmware

  • Project Properties -> CCS General -> Project/Products tabs (for compiler, SDK, and SysConfig dependencies).  Also CCS Build -> Environment tab.  I don't think MAX_RREQ_ENTRIES should make a difference, I just listed all definitions which could return a NWK table full error.  You may also consider CONFLICTED_ADDR_TABLE_SIZE.  We have reviewed most, if not all, of the table sizes listed in the patch previously, but the sheer number of changes does make this difficult to further track and quantify.  It is true that expired active routes are not removed until necessary, and if route removal is disabled then a typically-expired route would remain even if the table gets full.  This would be the implicit decision of the developer to enact, regardless of purpose.

    Is the most prevalent commonality in each instance that the NWK table is full, new devices cannot join/rejoin, and perhaps the firmware crashes? Can any sniffer logs (with important packets highlighted) or debugging logs be provided?  I will try to ask the Software Development Team for their opinions but the best support will be available if the issue can be replicated with a reduced Z-Stack patch which still causes the issue.

    Edit: Software Development also does not know of any significant differences between the SDK versions.

    Regards,
    Ryan

  • I've compiled a new firmware for users to test with the table sizes adjusted as you mentioned: Z-Stack_3.x.0 coordinator 20231111/20231112 feedback · Koenkk/Z-Stack-firmware · Discussion #483 (github.com)

    if route removal is disabled then a typically-expired route would remain even if the table gets full.

    If route removal is disabled (SRC_RTG_EXPIRY_TIME = 255), the table gets full (MAX_RTG_SRC_ENTRIES) and a new route is discovered, will the table overflow and cause e.g. the fw to crash? Or will new routes not be added anymore?

    Hereby all the dependencies I use to compile the 6.20 SDK:

  • You may consider using the dependencies listed in the Z-Stack Release Notes for v6.20, but I do not expect this would make a significant difference with the topic at hand.

    If MAX_RTG_SRC_ENTRIES is met, the routing layer will exit with RTG_SRC_TBL_FULL without further action.  Please let me know the consensus of the new image which is being tested when it is available.

    Regards,
    Ryan
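
The behaviour described in the replies above (expired routes are only reclaimed when a slot is needed, and a full table rejects new routes instead of overflowing) can be illustrated with a toy model. This is only a sketch under stated assumptions, not Z-Stack source: every `toy_` name is hypothetical, `TOY_TABLE_SIZE` merely stands in for MAX_RTG_SRC_ENTRIES, and `TOY_TBL_FULL` stands in for RTG_SRC_TBL_FULL.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_TABLE_SIZE      3    /* hypothetical stand-in for MAX_RTG_SRC_ENTRIES */
#define TOY_EXPIRY_DISABLED 255  /* stand-in for SRC_RTG_EXPIRY_TIME = 255 */

typedef enum { TOY_OK, TOY_TBL_FULL } toy_status_t;

typedef struct {
    bool     used;
    uint16_t dst;  /* destination short address */
    uint8_t  age;  /* seconds since the route was last refreshed */
} toy_route_t;

typedef struct {
    toy_route_t entries[TOY_TABLE_SIZE];
    uint8_t     expiry;  /* seconds, or TOY_EXPIRY_DISABLED */
} toy_table_t;

/* Age every used entry by one second. Expired entries are NOT removed here. */
void toy_tick(toy_table_t *t)
{
    for (int i = 0; i < TOY_TABLE_SIZE; i++) {
        if (t->entries[i].used && t->entries[i].age < 254) {
            t->entries[i].age++;
        }
    }
}

/* Add or refresh a route. An expired entry is reused only when no free slot
 * exists; with expiry disabled and a full table, the request is rejected. */
toy_status_t toy_add(toy_table_t *t, uint16_t dst)
{
    int free_slot = -1;
    int expired_slot = -1;

    for (int i = 0; i < TOY_TABLE_SIZE; i++) {
        toy_route_t *e = &t->entries[i];
        if (e->used && e->dst == dst) {
            e->age = 0;  /* route already known: refresh it */
            return TOY_OK;
        }
        if (!e->used && free_slot < 0) {
            free_slot = i;
        }
        if (e->used && t->expiry != TOY_EXPIRY_DISABLED &&
            e->age >= t->expiry && expired_slot < 0) {
            expired_slot = i;
        }
    }

    int slot = (free_slot >= 0) ? free_slot : expired_slot;
    if (slot < 0) {
        return TOY_TBL_FULL;  /* nothing is overwritten, nothing overflows */
    }
    t->entries[slot] = (toy_route_t){ .used = true, .dst = dst, .age = 0 };
    return TOY_OK;
}
```

In this model, a table with expiry set to 255 stays full forever once every slot is taken, while a finite expiry lets stale entries be recycled on demand.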

  • Hi Ryan,

    A user captured a log + sniff of a crash with a 6.20 fw with a minimal amount of changes.

    - All changes of this fw: diff.patch (since these changes look so minimal and standard to me, I think we can rule them out as the culprit)

    - Link to the log + sniff: link, some things that caught my attention:

      - Various MEM_ERROR (0x10) can be seen in the log (zigbee-herdsman:adapter:zStack:znp:SRSP <-- AF - dataRequest - {"status":16})

      - Various BUFFER_FULL (0x11) can be seen in the log (zigbee-herdsman:adapter:zStack:znp:SRSP <-- AF - dataRequest - {"status":17})

      - At some point it completely crashes (failed (SRSP - AF - dataRequest after 6000ms))

      - I've asked for the network key such that the sniff can be decrypted
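
For readers scanning similar logs, the two status values quoted above can be decoded with a small helper. This is a minimal sketch: the function names are hypothetical, only the 16 (0x10, MEM_ERROR) and 17 (0x11, BUFFER_FULL) mappings come from the log excerpts in this post, and treating 0x00 as success is an assumption based on the usual Z-Stack convention.

```c
#include <stdbool.h>
#include <stdint.h>

/* Map the "status" field of an AF dataRequest SRSP to a readable name.
 * Only the values discussed in this thread are covered. */
const char *af_status_name(uint8_t status)
{
    switch (status) {
    case 0x00: return "SUCCESS";      /* assumed: usual Z-Stack success value */
    case 0x10: return "MEM_ERROR";    /* {"status":16} in the log */
    case 0x11: return "BUFFER_FULL";  /* {"status":17} in the log */
    default:   return "UNKNOWN";
    }
}

/* True for the two resource-exhaustion statuses seen before the crash. */
bool af_status_is_resource_error(uint8_t status)
{
    return status == 0x10 || status == 0x11;
}
```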

  • Hello Koen,

    Ryan is out for the moment (holidays), and I wanted to let you know that I took a look at your links. We can't open your second link, the log + sniff on Google Drive (not a problem on your end, we can't open that format). Could you include your results in a zipped file perhaps so we can take a look at it? We may be a bit delayed due to the holidays; I apologize for any inconvenience.

    Thanks,
    Alex F

  • Hi Alex

    Attached the files, I think the first pointer is the MEM_ERROR.

    - Why does the SDK generate this error?

    - What can be done to prevent it?

    Archive.zip

  • Hey Koen,

    ZMemError can be returned if there is not enough heap memory to complete the request, and ZBufferFull could indicate that the NWK/MAC buffers are temporarily full.  Both can be searched within the znp project.  I have provided some comments in this relevant E2E response concerning ways to alleviate these issues.  I apologize if we had not addressed this previously, or perhaps it was missed in the minimal patch.

    Regards,
    Ryan  

  • We are also facing instability with SDK 7.10 on the CC2652P, primarily the ZCL report command API not working reliably: sometimes it works and sometimes it gets stuck in a policy error for loop.

  • Hi Dhanraj,

    Please start a new thread with all details including your observations, stack changes, sniffer/debug logs, and versions tested.

    Regards,
    Ryan

  • It took some time but I finally managed to get a sniff + log of a crash. This user tested various firmwares:

    - 20230922 (= 6.10 SDK + all my changes): This firmware works fine, with no crashes and good performance

    - 20230923 (= 6.20 SDK + all my changes): This firmware crashes and performs badly

    - 20231221 (= 6.20 SDK + minimal changes): This firmware crashes and performs badly

    Links to my changes:

    All changes

    Minimal changes

    To make sure the crash is not because of "all my changes", the sniff + log below is from 20231221, so a firmware with the minimal changes.

    Log + sniff: nick_crash_sniff_log_20231221.zip

    Notes:

    - I see that there are a lot of route requests on the network, maybe this contributes to the crash?

    - The last message sent by the coordinator is #2992071

    - In the log, the ZNP stops communicating at "2023-12-23T22:24:00.450Z", (search for "failed (SRSP - AF - dataRequest after 6000ms)")

    - To get all the communication between ZNP and Z2M, filter on "zigbee-herdsman:adapter:zStack:znp"
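
The two search strings from the notes above can be turned into simple line predicates for filtering a large host log. This is a sketch only: the function and macro names are hypothetical, and the matching is plain substring search on the strings quoted in this post.

```c
#include <stdbool.h>
#include <string.h>

/* Substrings quoted in the notes above. */
#define ZNP_LOG_TAG     "zigbee-herdsman:adapter:zStack:znp"
#define ZNP_TIMEOUT_TAG "SRSP - AF - dataRequest after 6000ms"

/* True if the log line is part of the ZNP <-> Z2M communication. */
bool is_znp_line(const char *line)
{
    return strstr(line, ZNP_LOG_TAG) != NULL;
}

/* True if the log line marks the point where the ZNP stopped answering. */
bool is_znp_timeout(const char *line)
{
    return strstr(line, ZNP_TIMEOUT_TAG) != NULL;
}
```

Reading the log line by line and keeping only lines for which `is_znp_line` returns true, up to the first line matching `is_znp_timeout`, yields the ZNP traffic leading up to the crash.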

  • Hello E2E community member,

    Thank you for asking your question concerning TI's SimpleLink Devices on the E2E Forum! The subject expert who can best address your inquiry is out of office for the holidays. After returning in early January, they will review your post and provide an initial response within 24 hours.

    Regards,
    Jan

  • Hi Koen, 

    Thank you for the sniffer and host logs.  How many devices are actively communicating on the network when the failure occurs, and is there a correlation between the number of active devices and the stability of the ZNP?  This is a lot of information to process, however it does appear to reinforce the idea that MTO route requests are leading to a heap memory overflow.  Have you been able to debug an active session in which the ZNP has crashed in order to review the call stack?  And does the device recover if soft reset?  Compiler migration or changes to the lower-level 15.4-Stack MAC could be resulting in larger heap requirements.  Excuse me if I do not remember, but have you evaluated v7.10 yet?  You could try increasing the heap again (buffers should be sufficient at their current size), however this will reach a ceiling soon with only 88 kB of RAM available.  Some updated devices have more RAM, such as the CC2652P7 with 144 kB of RAM, which may be worth evaluating.  There are other considerations which may not have been accounted for involving routing and discovery times, see Table 1 of SWRA650 for details.  I understand that the greatest difficulty is replicating and observing the issue given the requirement to have many devices connected and communicating.  Is there a specific reason that you need to upgrade SDKs?  The Z-Stack solution on v6.10 is stable, and newer versions do not include many memorable updates or bug fixes to be concerned with.

    Regards,
    Ryan

  • Hi Koen and Ryan,

    Happy New Year.

    In our testing, a network of 30+ devices (CC1352P) is more stable with the ZNP from SDK v4.40 than with the ZNP from SDK v6.40.00.13.

    Best regards,

    David

  • Hi David,

    Happy New Year to you as well.  Can you please share your observations, including stack changes and sniffer/debug logs, which resulted in this conclusion?

    Regards,
    Ryan

  • Hi Ryan,

    Happy new year!

    I understand that there are a lot of variables, however I don't expect increasing the heap will fix it. This issue also occurs on a CC2652P7 running 7.10, I would like to stick to 6.10 but then we cannot support the P7 (since it is not supported by 6.10).

    Given the 6.10 vs 6.20 diff you sent me earlier, it's unlikely that any of these changes causes this issue. Therefore I expect there is a bug in one of the lower-level libraries. Is it possible to use e.g. the 6.10 15.4 Stack MAC with 6.20?

  • I would like to stick to 6.10 but then we cannot support the P7 (since it is not supported by 6.10)

    Can you please explain this further?  For the v6.10 SDK, the CC2652P7 is listed in the Release Notes and there are CC1352P7 examples which directly support the CC2652P7.

    Is it possible to use e.g. the 6.10 15.4 Stack MAC with 6.20?

    I will ask the Software Development Teams about this, I appreciate further patience as several experts are still out of office at the moment.

    Regards,
    Ryan

  • Hi Ryan,

    I dove a bit deeper into this and found out it's possible to use the libs from 6.10. First I tried with the 15.4 stack from 6.10. I did this by replacing the contents of 'simplelink_cc13xx_cc26xx_sdk_6_20_00_29/source/ti/ti154stack/lib/ticlang/m4f' with 'simplelink_cc13xx_cc26xx_sdk_6_10_01_01/source/ti/ti154stack/lib/ticlang/m4f'. The firmware still crashes after this.

    Then I tried using the closed source zstack libs from 6.10 (under 'source/ti/zstack/lib/ticlang/m4f'), after this the firmware does not crash anymore! So it seems one of the changes in the closed source libs of zstack causes this regression. What changed in these libs compared to 6.10?

  • Thanks for the update Koen.  I've alerted the R&D Teams accordingly.  Are you able to replace just the zstack source libraries or are both (i.e. zstack and ti154stack) required?

    Regards,
    Ryan

  • Only zstack is enough. I'm now going to test the same for the latest SDK (so 7_10_02_23 + 6.10 zstack libs).

  • Hello Koen,

    I have aligned with R&D on this issue and submitted a bug ticket so that this can be further explored internally.  However, I cannot currently provide a timeline for the results of such an investigation.

    Regards,
    Ryan

  • Hi Koen,

    TI R&D has requested that you replicate this issue using SimpleLink F2 SDK 7.10.02.23 with TI ARM Clang Compiler v2.1.2.LTS in accordance with the Release Notes, and in doing so confirm that a compiler version difference (albeit minor) does not cause the issue.

    Thanks,
    Ryan

  • Hi,

    Maybe you could try the dynamic heap configuration used in the 4.40 SDK in the app.cfg file instead of the static configuration being used in 6.10+


    /*
    * Heap Configuration defines the type of Heap you want to use for the system (application + Stack)
    * Only one Heap buffer will be allocated. This heap will be shared by the system and the stack through
    * one manager (HeapMem, HeapMem+HeapTrack or OSAL)
    * You can still decide to create several heaps if you want, but at least one heap needs to be created.
    * The stack must have a Heap to run.
    * The different Heap manager available are :
    * OSAL HEAP: legacy Heap manager provided with all BLE sdk. By default, this Heap manager is used.
    * HeapMem: heap manager provided by TI-RTOS (see TI-RTOS user guide for properties)
    * HeapTrack: module on top of HeapMem allowing an easy debugging of memory allocated through HeapMem.
    *
    * The heap manager to use is selected by setting HEAPMGR_CONFIG to the corresponding value (see below)
    * 0 = osal Heap manager, size is static.
    * 0x80 = osal Heap manager, with auto-size: The remainning RAM (not used by the system) will be fully assign to the Heap.
    * 1 = HeapMem with Static size
    * 0x81 = HeapMem with auto-size. The remainning RAM (not used by the system) will be fully assign to the Heap.
    * 2 = HeapTrack (with HeapMem) with fixe size
    * 0x82 = HeapTrack (with HeapMem) with auto-size: The remainning RAM (not used by the system) will be fully assign to the Heap.
    *
    * If HEAPMGR_CONFIG is not defined, but the configuration file ble_stack_heap.cfg is used, then the value
    * HEAPMGR_CONFIG = 0x80 is assumed.

    Thanks,

    Akhilesh

  • Hi Koen,

    TI's Zigbee Test Team applied your minimal changes to run a network for 24 hours with 43 devices. This included a mix of ZR's/sleepy ZEDs/nonsleepy ZEDs (maximum 6 children per router and 4 hops total), default poll rates, and sending a custom "Large Network Test" packet with various payload sizes every 60s. The ZC (or other devices for that matter) never crashed.  Do you have any suggestions which could help cause the issue within this setup?

    Regards,
    Ryan

  • Hi Ryan,

    I've now compiled a 7.10.02.23 fw + TI ARM Clang Compiler v2.1.2.LTS + CCS 12.2.0; the firmware still crashes after somewhere between a couple of hours and a couple of days. I've verified this with 2 users.

    Regarding the reproducibility, I still don't know what triggers the crashing. Given that there are many variables, I think this is very complex to figure out. As noted before, 7.10 also works stable for many people (including me).

    Previously I mentioned that the 6.20 SDK (the SDK where this issue was first introduced) could be made stable by using the 6.10 zstack libs. It turns out this was not true; it seems the combination of the 6.20 SDK with just the 6.10 ti154 libs makes it stable. I'm testing further with users to see whether this is indeed the case.

    Once this is confirmed (6.20 SDK + 6.10 ti154 libs being stable), would it be possible to get some insight into the changes between the 6.10 and 6.20 ti154 libs? My expectation is that the bug is still present in the 7.10 ti154 libs.

  • Hi Koen,

    I will update the TI Software Development Team with your latest feedback and ask about the differences between the 15.4-Stack libraries in SDK versions 6.10 and 6.20.

    Regards,
    Ryan

  • Hi Ryan,

    It seems the crash occurs right after the `AssocGetWithAddress` call. Note that many calls to `AssocGetWithAddress` were made before the crash (3000+). Could this function maybe have a memory leak?

  • Hi Koen,

    That API returns the association table entry, and there haven't been any assoc/neighbor table updates in years (especially around v6.10 <-> v6.20+).  So the issue is likely unrelated to those tables specifically.

    Regards,
    Ryan

  • Hi Ryan,

    I see, but isn't it very suspicious that, out of the many different requests, it consistently crashes on this one? Could it be that e.g. another method writes a corrupt entry to this table, which then causes the crash upon retrieval?

    What I will do next is disable this call from z2m and see if the crash still occurs.

  • I have sent your observation to the Software Development Team for further review.

    Regards,
    Ryan

    We tried running the firmware without the assoc get calls. It seems the firmware stays up longer (a couple of days) but in the end it still crashes. Do you have some more insight into the ti154 changes between 6.10 and 6.20?

    I still really would like to start using the 7.10 sdk, since we can currently not support the new P10 chips.

  • There is no insight to share concerning TI 15.4-Stack changes between SDK versions.  The Test Team also has not been able to replicate the behavior with the test conditions provided.  Are you not able to use the v6.10 ti154 source on v7.10 project builds?

    Regards,
    Ryan

  • ti154 from 6.10 is not compatible with 7.10 (getting invalid param errors). 

    What do you mean with “no insight to share”, isn’t TI willing to share the changes to debug this issue or are there no changes?

  • I apologize for the vague response, I mean that these representatives do not see any differences between the source code versions which could explain the behavior you are observing.

    Regards,
    Ryan

  • Could you benefit from the investigation by ?  The R&D team will further investigate possible SRC match table changes which could be causing this.

    Regards,
    Ryan

  • Yes, definitely! Although I think it won't fix the crashes.

  • Hi Ryan, I noticed the 6.10 and 6.20 SDKs have been removed from the download site: SIMPLELINK-LOWPOWER-F2-SDK Software development kit (SDK) | TI.com 

    Why has this happened and can they be put back?

  • Hi Koen, thanks for reporting this.  I've notified the correct stakeholders so that they can resolve this error.

    Regards,
    Ryan

  • The missing versions have been restored to https://www.ti.com/tool/download/SIMPLELINK-LOWPOWER-F2-SDK 

    Regards,
    Ryan

  • Many thanks!

    In the meantime we did some more testing. It turns out my previous statement that using the 6.10 TI154/zstack libraries with 6.20 fixes the 6.20 stability issues was wrong; unfortunately, this does not fix the issue.

    We also tested using the UART driver (instead of the UART2 driver) with 6.20; this also does not fix the issue.

    Do you or the R&D team have any clue in what direction to look next? I'm still committed to finding the root cause of this, since I don't want to get stuck on the 6.10 SDK. I'm compiling the firmware on a Mac, while the release notes recommend Windows. Could this potentially cause the issue?

  • Hi Koen,

    The OS should not matter so long as you are following all of the dependencies listed in the Release Notes.  I will message you privately to continue our discussion.

    Edit:

    Koen was able to resolve the issue offline:

    After a lot of trial and error, we finally found out what causes the 6.20 firmware to break, it's not 1 but 2 things. This + the fact that it takes 3-7 days until the firmware crashes is the reason why it took so long. The following 2 changes make the 6.20 firmware stable:

    - Define `NVOCMP_RECOVER_FROM_COMPACT_FAILURE` (this was added in 6.20).

    - Reverting to the UART driver (instead of UART2).

    Regards,
    Ryan
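
The first of the two changes in the resolution above is a compile-time define. A minimal sketch, assuming the symbol is added to the project's preinclude.h (the thread does not say where it was defined; adding it as a predefined symbol in the compiler options should be equivalent):

```c
/* preinclude.h (assumed location; not confirmed by the thread) */
#ifndef NVOCMP_RECOVER_FROM_COMPACT_FAILURE
#define NVOCMP_RECOVER_FROM_COMPACT_FAILURE
#endif
```

The second change, reverting from UART2 to the UART driver, is not sketched here because the thread does not describe the steps involved.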