Other Parts Discussed in Thread: Z-STACK
Analysis of misbehaving or unresponsive ZNP devices has shown that buffer overruns and similar memory corruption occur; insufficient memory could also be a contributing cause.
Once dynamic memory has been corrupted, depending on what got corrupted, NV memory could eventually also be impacted.
Therefore, I propose the following measures to limit the impact of invalid memory write accesses:
1. Implementing guard bytes;
2. Modifying the memory layout;
3. Adding safety checks in osal_mem_alloc;
4. Resetting after N failing mallocs, to recover from fragmentation and other issues.
1. Guard bytes
I define a guard byte as a byte that can be monitored for corruption.
We could have fixed location guard bytes and dynamic location guard bytes.
Fixed-location guard bytes could be placed between the different "memory types": stack, heap and variables. A dynamic guard byte could be placed at the end of the used heap (osal_mem_alloc); using one on the stack would be more complex and inefficient.
On system initialisation and memory allocation (osal_mem_alloc), known values have to be written to the guard bytes.
They should then be checked at regular occasions: in the main loop and/or from an ISR. Using an ISR would be safer.
Whenever a guard byte no longer has the expected value, a system reset should be performed in a production setup; an infinite loop could be entered when debugging (HAL_ASSERT).
Guard bytes should be initialised to a known value that is unlikely to be written to memory - which excludes 0x00, 0xFF, small values, 0xFE, etc. Doing statistics on an actual memory dump can give an idea of "the best" value.
The dynamic guard byte(s) in the heap help detect a failure well before the heap is actually full. The idea is not to add a guard byte to every malloc chunk, but just one past the (currently) highest allocated memory location.
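The dynamic guard byte idea can be sketched as follows. This is an illustrative sketch only: the names (GUARD_VALUE, heap_image, guard_set, guard_check) are hypothetical and not actual Z-Stack/OSAL symbols, and the heap is simulated by a plain array.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical names, not Z-Stack identifiers. */
#define GUARD_VALUE 0xA5u          /* chosen to be an unlikely data value */
#define HEAP_SIZE   64u

static uint8_t heap_image[HEAP_SIZE];  /* stand-in for the OSAL heap */
static size_t  guard_index;            /* dynamic guard: one past highest allocation */

/* Place the dynamic guard byte just past the highest allocated location
   (would be called from the allocator after each allocation). */
static void guard_set(size_t highest_alloc_end)
{
    guard_index = highest_alloc_end;
    heap_image[guard_index] = GUARD_VALUE;
}

/* Called from the main loop or a timer ISR:
   returns 0 if the guard is intact, -1 if it was overwritten
   (production: reset; debugging: HAL_ASSERT-style infinite loop). */
static int guard_check(void)
{
    return (heap_image[guard_index] == GUARD_VALUE) ? 0 : -1;
}
```

A fixed-location guard between the memory regions would work the same way, except that its address is set once at system initialisation instead of on every allocation.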
2. Memory Layout
Currently the memory layout in the CC2530 ZNP, and likely also in the other ZNPs, is the following: STACK - HEAP - VARIABLES.
I observed in one of the situations that there was a memory overrun at the end of the heap, writing over critical variables (OSAL_Timer management variables).
The risk of the effects of a breakdown could be limited by reorganising memory as follows: STACK - VARIABLES - HEAP.
Buffer overruns on the heap are IMHO not detected in the current ZNP implementations because they most often happen in the most recently allocated, highest memory chunk. On the next osal_mem_alloc, that chunk is coalesced with the next free block, which is the rest of the heap, or what the code calls "the wilderness". Therefore, in case of a buffer overrun, the computed free heap lands somewhere between the size of the last allocated block and that size incremented by whatever value was found in the len field of the corrupted wilderness block header.
The buffer overrun issue is therefore only visible when:
- The end of the heap is reached;
- Permanent mallocs (e.g., blocks ending up in a linked list) are placed on the heap after dynamic mallocs big enough to hold future mallocs that are subject to overruns (thereby corrupting the "permanent mallocs").
This may explain issues that I observed while building a network or removing devices. (Removing devices frees up "permanent mallocs").
With the proposed memory organisation, the stack grows downwards, so it is less prone to overwriting the variables; and as the heap grows upwards, a buffer overrun there is less likely to corrupt variables.
It is also likely that corruption of variables gets noticed, as an overrun of one variable will impact its upper neighbour (whereas in the heap, the most recently allocated chunk likely has no upper neighbour).
3. Safety check on osal_mem_alloc
a. Verify that 'hdr+len' (of a coalesced block) is not beyond the end of the heap (if not, HAL_ASSERT or reset).
This could also be implemented as a guard check in an ISR routine (on ff1).
b. Implement many "almost free" guard bytes: use the free-block header as a guard check by writing a guard byte just after the header (this could also be considered a dynamic guard byte). Verify this byte when coalescing blocks or when reusing free blocks.
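Check (a) could look something like the sketch below. The names mirror the allocator's concepts but are hypothetical, not the actual OSAL_Memory.c identifiers; the heap is again a plain array for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-ins for the OSAL heap and block header length. */
#define HEAP_LEN 128u
static uint8_t heap[HEAP_LEN];

/* Returns 0 if the (possibly coalesced) block starting at 'hdr' with
   length 'len' still fits inside the heap, -1 if its length field would
   run past the heap end - which indicates header corruption, so the
   caller should HAL_ASSERT or reset. */
static int block_in_bounds(const uint8_t *hdr, uint16_t len)
{
    const uint8_t *heap_end = heap + HEAP_LEN;

    if (hdr < heap || hdr >= heap_end)
        return -1;                               /* header pointer itself is invalid */
    if ((size_t)(heap_end - hdr) < len)
        return -1;                               /* overflow-safe form of hdr+len > heap_end */
    return 0;
}
```

Writing the comparison as `heap_end - hdr < len` avoids computing `hdr + len`, which could wrap around on a corrupted length value.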
4. Reset after N failing mallocs, to recover from fragmentation and other issues
As memory can get fragmented, especially during the process of adding and removing devices, a simple method to recover from that is to implement a reset after N failing memory allocations.
The reset will recompact the heap as the network configuration will be reconstructed from scratch.
The downcounter could be implemented in osal_mem_alloc such that when it reaches 0, either HAL_ASSERT or a user-definable function is called. The user could then implement some functionality to ensure a clean shutdown (e.g., there could also be a new MT message to notify the host, resetting the counter to a non-zero value again and therefore letting the host decide on the reset).
In case one wants to allow a given number of failing memory allocations over a given time period (e.g., to protect against invalid incoming packets), the counter can be incremented in an OSAL task up to the preset maximum.
Implementation
It would be nice if TI proposed the "official" implementation (linker configuration + OSAL_Memory.c updates).
If that is not going to happen, I'll implement a private one.