Other Parts Discussed in Thread: Z-STACK
Analysis of misbehaving or unresponsive ZNP devices has shown that buffer overruns and similar memory corruption occur; insufficient memory could also be a contributing cause.
Once dynamic memory has been corrupted, depending on what got corrupted, NV memory could eventually also be impacted.
Therefore, I propose the following measures to limit the impact of invalid memory write accesses:
1. Implementing guard bytes;
2. Modifying the memory layout;
3. Adding safety checks in osal_mem_alloc;
4. Resetting after N failing mallocs, to recover from fragmentation and other issues.
1. Guard bytes
I define a guard byte as a byte that can be monitored for corruption.
We could have fixed location guard bytes and dynamic location guard bytes.
Fixed-location guard bytes could be placed between the different "memory types": stack, heap and variables. A dynamic guard byte could be placed at the end of the used heap (osal_mem_alloc); using one on the stack would be more complex and inefficient.
On system initialisation and memory allocation (osal_mem_alloc), known values have to be written to the guard bytes.
They should then be checked at regular occasions: in the main loop and/or from an ISR. Using an ISR would be safer.
Whenever a guard byte no longer has the expected value, a system reset should be performed in a production setup; an infinite loop could be entered when debugging (HAL_ASSERT).
Guard bytes should be initialised to a known value that is unlikely to be written to memory - which excludes 0x00, 0xFF, small values, 0xFE, etc. Doing statistics on an actual memory dump can give an idea of "the best" value.
The dynamic guard byte(s) in the heap help detect a failure well before the heap is actually full. The idea is not to add a guard byte to every malloc chunk, but just one past the (currently) highest allocated memory location.
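The dynamic guard byte idea can be sketched as follows. This is an illustrative sketch only: the names (GUARD_VALUE, heap_image, guard_set, guard_check) are hypothetical and not actual Z-Stack/OSAL symbols, and the heap is simulated by a plain array.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical names, not Z-Stack identifiers. */
#define GUARD_VALUE 0xA5u          /* chosen to be an unlikely data value */
#define HEAP_SIZE   64u

static uint8_t heap_image[HEAP_SIZE];  /* stand-in for the OSAL heap */
static size_t  guard_index;            /* dynamic guard: one past highest allocation */

/* Place the dynamic guard byte just past the highest allocated location
   (would be called from the allocator after each allocation). */
static void guard_set(size_t highest_alloc_end)
{
    guard_index = highest_alloc_end;
    heap_image[guard_index] = GUARD_VALUE;
}

/* Called from the main loop or a timer ISR:
   returns 0 if the guard is intact, -1 if it was overwritten
   (production: reset; debugging: HAL_ASSERT-style infinite loop). */
static int guard_check(void)
{
    return (heap_image[guard_index] == GUARD_VALUE) ? 0 : -1;
}
```

A fixed-location guard between the memory regions would work the same way, except that its address is set once at system initialisation instead of on every allocation.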
2. Memory Layout
Currently the memory layout in the CC2530 ZNP, and likely also in the other ZNPs, is the following: STACK - HEAP - VARIABLES.
I observed in one of the situations that there was a memory overrun at the end of the heap, writing over critical variables (OSAL_Timer management variables).
The risk of the effects of a breakdown could be limited by reorganising memory as follows: STACK - VARIABLES - HEAP.
Buffer overruns on the heap are IMHO not detected in the current ZNP implementations because they most often happen in the most recently allocated, highest memory chunk. On the next osal_mem_alloc, that chunk is coalesced with the next free block, which is the rest of the heap, or what the code calls "the wilderness". Therefore, in case of a buffer overrun, the computed free heap lands somewhere between the size of the last allocated block and that size incremented by whatever value was found in the len field of the corrupted wilderness block header.
The buffer overrun issue is therefore only visible when:
- The end of the heap is reached;
- Permanent mallocs (e.g., blocks ending up in a linked list) are placed on the heap after dynamic mallocs big enough to hold future mallocs that are subject to overruns (thereby corrupting the "permanent mallocs").
This may explain issues that I observed while building a network or removing devices. (Removing devices frees up "permanent mallocs").
With the proposed memory organisation, the stack grows downwards, so it is less prone to overwriting the variables; and as the heap grows upwards, a buffer overrun there is less likely to corrupt variables.
It is also likely that corruption of variables gets noticed, as an overrun of one variable will impact its upper neighbour (whereas in the heap, the most recently allocated chunk likely has no upper neighbour).
3. Safety check on osal_mem_alloc
a. Verify that 'hdr+len' (of a coalesced block) is not beyond the end of the heap (if not, HAL_ASSERT or reset).
This could also be implemented as a guard check in an ISR routine (on ff1).
b. Implement many "almost free" guard bytes: use the free-block header as a guard check by writing a guard byte just after the header (this could also be considered a dynamic guard byte). Verify this byte when coalescing blocks or when reusing free blocks.
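Check (a) could look something like the sketch below. The names mirror the allocator's concepts but are hypothetical, not the actual OSAL_Memory.c identifiers; the heap is again a plain array for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-ins for the OSAL heap and block header length. */
#define HEAP_LEN 128u
static uint8_t heap[HEAP_LEN];

/* Returns 0 if the (possibly coalesced) block starting at 'hdr' with
   length 'len' still fits inside the heap, -1 if its length field would
   run past the heap end - which indicates header corruption, so the
   caller should HAL_ASSERT or reset. */
static int block_in_bounds(const uint8_t *hdr, uint16_t len)
{
    const uint8_t *heap_end = heap + HEAP_LEN;

    if (hdr < heap || hdr >= heap_end)
        return -1;                               /* header pointer itself is invalid */
    if ((size_t)(heap_end - hdr) < len)
        return -1;                               /* overflow-safe form of hdr+len > heap_end */
    return 0;
}
```

Writing the comparison as `heap_end - hdr < len` avoids computing `hdr + len`, which could wrap around on a corrupted length value.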
4. Reset after N failing mallocs, to recover from fragmentation and other issues
As memory can get fragmented, especially during the process of adding and removing devices, a simple method to recover from that is to implement a reset after N failing memory allocations.
The reset will recompact the heap as the network configuration will be reconstructed from scratch.
The downcounter could be implemented in osal_mem_alloc such that when it reaches 0, either HAL_ASSERT or a user-definable function is called. The user could then implement some functionality to ensure a clean shutdown (e.g., there could also be a new MT message to notify the host, resetting the counter to a non-zero value again and therefore letting the host decide on the reset).
In case one wants to allow a given number of failing memory allocations over a given time period (e.g., to protect against invalid incoming packets), the counter can be incremented in an OSAL task up to the preset maximum.
Implementation
It would be nice if TI proposed the "official" implementation (linker configuration + OSAL_Memory.c updates).
If that is not going to happen, I'll implement a private one.