This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

CC2530: osal_malloc buffer released in NLDE_DataReq suffered from bufferoverrun - ZSTACK 3.0.2

Part Number: CC2530
Other Parts Discussed in Thread: Z-STACK

Hi

I instrumented the ZNP code to detect buffer overruns on the ZNP.  See OSAL_Memory.c in the attached ZIP file for the code.

To know when it happens, put a breakpoint on i++ as shown in the screenshot.

The screenshot shows where it happens through the Call stack.

The memory capture shown can also help understand which Zigbee communication caused this, together with the sniffer log.

The serial log is also attached, the serial communication stopped in the extract below:

[2020-11-24 15:27:46.805410] > [ZDO/AREQ] **END_DEVICE_ANNCE_IND** <132F-000d6f000e8a756f> Src:132F     (SYS:5/TYPE:40/CMD:C1) ( 18)::fe 0d 45 c1 2f 13 2f 13 6f 75 8a 0e 00 6f 0d 00 80 f5
[2020-11-24 15:27:46.807380] > [ZDO/AREQ] **END_DEVICE_ANNCE_IND** <132F-000d6f000e8a756f> Src:132F     (SYS:5/TYPE:40/CMD:C1) ( 18)::fe 0d 45 c1 2f 13 2f 13 6f 75 8a 0e 00 6f 0d 00 80 f5
[2020-11-24 15:27:46.809369] > [ZDO/AREQ] **MSG_CB_INCOMING**   (SYS:5/TYPE:40/CMD:FF) ( 25)::fe 14 45 ff 2f 13 01 13 00 00 b2 00 00 2f 13 6f 75 8a 0e 00 6f 0d 00 80 72
[2020-11-24 15:27:46.812482] > [ZDO/AREQ] **MSG_CB_INCOMING**   (SYS:5/TYPE:40/CMD:FF) ( 25)::fe 14 45 ff 2f 13 01 13 00 00 b3 00 00 2f 13 6f 75 8a 0e 00 6f 0d 00 80 73
[2020-11-24 15:27:47.691534]  3217        0x132f → 0x0000       IEEE 802.15.4 12 Data Request^M

And the last packet sent by the ZNP heppened somewhat later:

[2020-11-24 15:27:56.758657]  3282               →              IEEE 802.15.4 5 Ack^M
[2020-11-24 15:27:57.060278]  3283        0xc9a3 → 0x0000       IEEE 802.15.4 12 Data Request^M
[2020-11-24 15:27:57.060760]  3284               →              IEEE 802.15.4 5 Ack^M
[2020-11-24 15:27:57.750243]  3285        0x132f → 0x0000       IEEE 802.15.4 12 Data Request^M
[2020-11-24 15:27:57.750741]  3286               →              IEEE 802.15.4 5 Ack^M
[2020-11-24 15:27:57.759978]  3287        0x0000 → 0x132f       ZigBee ZDP 57 Extended Address Response, Status: Success, Nwk Addr: 0x0000 = TexasIns_00:10:22:82:77^M
[2020-11-24 15:27:57.760107]  3288               →              IEEE 802.15.4 5 Ack^M
[2020-11-24 15:27:57.768251]  3289        0x132f → 0x0000       ZigBee ZDP 58 Extended Address Request, Nwk Addr: 0x0000^M
[2020-11-24 15:27:57.768373]  3290               →              IEEE 802.15.4 5 Ack^M
[2020-11-24 15:27:57.772629]  3291        0x132f → 0x0000       ZigBee ZDP 58 Extended Address Request, Nwk Addr: 0x0000^M
[2020-11-24 15:27:57.772752]  3292               →              IEEE 802.15.4 5 Ack^M
[2020-11-24 15:27:57.776888]  3293        0x132f → 0x0000       ZigBee 45 APS: Ack, Dst Endpt: 0, Src Endpt: 0^M
[2020-11-24 15:27:57.778522]  3294               →              IEEE 802.15.4 5 Ack^M
[2020-11-24 15:27:58.756343]  3295        0x132f → 0x0000       IEEE 802.15.4 12 Data Request^M

The ZNP stopped communicating because of the breakpoint.

The method for detecting the overflow/overrun is the following:

1. In osal_mem_alloc reserve 1 extra byte and put a known value in the last byte of the buffer.

2. In osal_mem_free, check the last byte of the buffer.  Enter an infinite loop if the value is not the expected value.

I encourage you to do further tests using this method ;-) , and of course share the fixes as soon as possible.

ZNPBufferOverflow.zip

  • Hi Mario,

    Thank you for the detailed explanation and providing the results of your development efforts.  You could use HAL_ASSERT to call halAssertHandler if a buffer overflow is detected.

    Regards,
    Ryan

  • I know about HAL_ASSERT but it was faster to write the loop than look it up and that way the debugger stays in the context as well (easier to look at variables) without adding another item on the stack.

    I sincérely hope that this will help to put an end to the "insufficient memory" issues.

  • There is more than HAL_ASSERT than meets the eye, even more than what was said in a past reply .

    In fact the compile options for the ZNP mention  ASSERT_RESET which is without effect, but could have been usefull.  Right now all those asserts are NOOPs.  Resetting the ZNP in case any of the asserts fail is cerrtainly a more secure option.

    Enabling the asserts will require 7 bytes to CODE memory when reset is enabled, 

  •  I can't fix or examine this myself any further because it is in the nwk_* , NDLE* or APS* functions.

    I hope that you understood that.

    So please do examine this and provide a fix .

  •  

    Will the TI Team investigate this ?  If yes, what is the expected delay?

    Or, should I interpret your answer as "TI suggests that a ZNP reset in case of a detected overflow is the appropriate fix" ?  That implies that I need to add this buffer overflow to the ZNP production code, and use up extra HEAP memory.

    Complementary question: to my knowledge the Watchdog is not enabled in the ZNP.  As its duration is limited, this would only be applicable for Coordinators and routers and always on ZEDs.  Has TI made a proposal of how to use the Watchdog in the ZNP?  If yes, can you provide a pointer to this suggestoin?

  • Hi Mario,

    My apologies with the response delay.  You have provided ample debug information but unfortunately it is not enough for me to find the root cause of the memory corruption.  As already stated the problem lies inside the Z-Stack pre-compiled libraries which you do not have access to in order to further determine the source of the issue.  I am attempting to find time with which to replicate and further debug this behavior.  Do you have any recommendations on how it could be most easily reproduced (preferably with as few devices possible and using modified TI examples)?

    You are correct that the ZNP does not incorporate the WD peripheral and TI has not made any considerations towards adding WD functionality to the ZNP project.  As such there are no recommendations currently available.

    Regards,
    Ryan

  • Hi  

    Currently I cant give very precise instructions on how to replicate this.

    I did some further analysis of the logs, and I provide an updated ZIP file - I amended my scripts with filtering based on a start and end time, so the combined log is trimmed as well as the capture files that are snipped down to a "pcapng" file limited to the selected time frame.
    For this test case, there is also "full.log" which corresponds to the detailed decoding of the RF packets with hex bytes.

    The sequence "48 02 2f 13 00 00 1e " in memory looks very much like what we find for several frames.
    Given the last RS232 communication from the ZNP, the issue could have been triggered by Frame 41 and onwards (counter from the attached log):

    Frame 41: 45 bytes on wire (360 bits), 45 bytes captured (360 bits) on interface unknown, id 0
    Interface id: 0 (unknown)
    Interface name: unknown
    Encapsulation type: IEEE 802.15.4 Wireless PAN (104)
    Arrival Time: Nov 24, 2020 16:27:47.699921000 Paris, Madrid
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1606231667.699921000 seconds

    Here is some "unformatted" analysis of the information:

    The memory check fails at address 0x123D, which is a detection byte.
    hdr is set to 0x1219 and the allocation has a length of 0x25 (header, requested memory and guard byte included).
    So that memory that is being freed ends just before 0x1219+0x25=0x123E (consistent with the guard byte at 0x123D).
    The header is 0x25 0x80, the latter byte indicates that this memory is used.
    The guard byte is 0xDC, which helps find allocated blocks.

    It is filled with (hex):

    25 80 - Header, 0x25 bytes allocated, including header and guard byte.
    48 02 2f 13 00 00 1e 34 34 34 cd cd cd cd cd cd # 0x0248 NWK:0x132f
    22 12 cd d2 22 12 c1 39 f9 8e 82 8f 83 22 12 cd
    e0 a3 22
    f0 - Violated guard byte @0x123D
    85 08 82 85 09 83 a3 a3 a3 a3 a3 22 ee 12 d0 76 - Possible other written bytes
    If 0x0885 would the length, then the header at the of the heap would be 0x123E+0x0885 = 1AC3.
    That's beyond the memory allocated for the heap.

    It was probably the first block filled so far down the heap, I cant find another 0xDC marker.

    Corresponds partially to:
    Arrival Time: Nov 24, 2020 16:27:48.700276000 Paris, Madrid

    Frame (45 bytes): xx xx xx xx xx xx xx
    0000 61 88 cb 5c ec 2f 13 00 00 48 02 2f 13 00 00 1e a..\./...H./....
    0010 1e 28 f0 46 01 00 77 82 22 10 00 4b 12 00 00 0b .(.F..w."..K....
    0020 f8 3b b2 54 c9 2a fe 8f 5f a7 a0 0d ec .;.T.*.._....
    Decrypted ZigBee Payload (8 bytes):
    0000 02 01 00 05 04 01 01 57 .......W

    The previous block (prior to the above) holds:
    24 80 - Header, 0x24 bytes allocated.
    2f 13 00 00 00 00 00 00 00 02 5c ec 02 e4 00 00 # 0x2f13 = NWK 0x132F
    00 00 00 00 00 00 00 00 00 00 38 5d 06 9b 47 00
    61 d0
    16 - Guard byte, violated!

    The one before that was freed:
    06 00
    00 20 96
    dc - Guard byte

    My guess is that the ZNP fails on sequence 38 or 39 of the report attributes


    With improved decoding of the serial
    [2020-11-24 15:27:29.344217] > [AF/AREQ] 40|04:81 **INCOMING_MSG** <EB68/GRP:0000/CLSTR:0500/EP:01> Seq:0 DstEP:01 SrcMAC:0619 LQI:118 SECURE:0 TIMSTMP:1597083 DATALEN:9 RADIUS:25 ( 34)::fe1d44810000000568eb01010076009b5e18000009190600000000ff000068eb1d82
    [2020-11-24 15:27:46.594626] > [ZDO/AREQ] 40|05:CA **TC_DEVICE_IND** <132F-000d6f000e8a756f> Parent:0000 ( 17)::fe0c45ca2f136f758a0e006f0d00000043
    [2020-11-24 15:27:46.802572] > [AF/AREQ] 40|04:81 **INCOMING_MSG** <132F/GRP:0000/CLSTR:0001/EP:01> Seq:0 DstEP:01 SrcMAC:2718 LQI:63 SECURE:0 TIMSTMP:1651458 DATALEN:7 RADIUS:24 ( 32)::fe1b4481000001002f130101003f0002331900000718270a2000201c2f131dfb
    [2020-11-24 15:27:46.805409] > [ZDO/AREQ] 40|05:C1 **END_DEVICE_ANNCE_IND** <132F-000d6f000e8a756f> Src:132F ( 18)::fe0d45c12f132f136f758a0e006f0d0080f5
    [2020-11-24 15:27:46.807379] > [ZDO/AREQ] 40|05:C1 **END_DEVICE_ANNCE_IND** <132F-000d6f000e8a756f> Src:132F ( 18)::fe0d45c12f132f136f758a0e006f0d0080f5

    Based on the Sequence numbers, these are the device announcements.

    [2020-11-24 15:27:46.809369] > [ZDO/AREQ] 40|05:FF **MSG_CB_INCOMING** #178 Src:132F Clstr:0013 BROADCAST NOSEC MacDst:0000 Data:2f136f758a0e006f0d0080 ( 25)::fe1445ff2f1301130000b200002f136f758a0e006f0d008072

    => From full sniffer log
    Decrypted ZigBee Payload (20 bytes):
    08 # Frame control
    00
    13 00 # Cluster
    00 00 #
    00 55 # Counter
    b2 # Sequence
    ## MGG_CB_INCOMING payload starts here.
    2f 13 # Src
    6f 75 8a 0e 00 6f 0d 00 # Extended address
    80 # Capability Information

    [2020-11-24 15:27:46.812482] > [ZDO/AREQ] 40|05:FF **MSG_CB_INCOMING** #179 Src:132F Clstr:0013 BROADCAST NOSEC MacDst:0000 Data:2f136f758a0e006f0d0080 ( 25)::fe1445ff2f1301130000b300002f136f758a0e006f0d008073

    #179 - Similar as before with counter 0x56


    After this, communication ends on the serial interface.

    I believe memory might have been prefilled with CD during this test, which means that CD likely corresponds to unfilled locations.

    4405.ZNPBufferOverflow.zip

  • Hi Mario,

    Looking through the sniffer log, I have noticed some odd behavior. The ZED (Ember device) sends a Match Descriptor Request with APS ACK requested, two back-to-back Device Announces with incremented counters/sequence numbers, and four ZCL messages all with APS ACK requested. This all happens within a short (ms) period of time. The ZC proceeds with a re-broadcast of both Device Announces then an APS ACK of the Match Descriptor (out of original order), and one more APS ACK to the first of the four ZCL Messages. Then it drops off without the three other APS ACKs. The call stack in the image provided confirms that the memory error occurs during an APS ACK. Meanwhile the Ember device starts re-sending the ZCL messages two seconds after the first try and not receiving the APS ACK.  The MAC buffer sizes might be exceeded and APS ACK messages are cancelled without deallocating memory inside the pre-compiled Z-Stack libraries.  I recommend testing with MAC_CFG_RX_MAX/MAC_CFG_TX_DATA_MAX (located in mac_cfg.c) increased beyond the default value of two.

    Regards,
    Ryan

  •  It looks like you are on a good track here.

    I'll probably first try to reproduce this keeping the current configuration - hopefully it is somewhat repeatable and then I can validate that it does not occur again.

    However the firmware should not have buffer overruns and deallocate memory and degrade gracefully in these cases.

    The detected issue is a buffer overrun - maybe there are also unreleased memory buffers too.  Both should be fixed.

    Increasing MAC_CFG_RX_MAX/MAC_CFG_TX_DATA_MAX may help for another issue you seem to have identified (unreleased memory), but does it help with the buffer overrun?

    Hopefully you can get to the bottom of this and propose a library update.

    Thank you

    Mario

  • I've sent you a Friend Request so that we can further discuss next steps privately.

    Regards,
    Ryan