CC2530: ZNP stuck inside osalTimerUpdate - ZStack 3.0.2

Mario DW

Part Number: CC2530

The serial interface of the ZNP was not responding, and as IAR was open in a debug sessin, I have more information.

My interpretation of the timerHead LinkedList is that it is corrupted. A clean linked list has only one bit in the event_flags .
The screenshot should help understand that. I also saved memory from using Debug > Memory > Save . I do not know what is required exactly to be able to reload it.

I join the logs. The last communications of the ZNP are shown here:

[2020-11-22 17:11:58.996636] > [AF/AREQ] **DATA_CONFIRM** ZSuccess EP:01 Seq:6 (SYS:4/TYPE:40/CMD:80) ( 8)::fe 03 44 80 00 01 06 c0
[2020-11-22 17:11:58.987306] 19618 0x0000 → 0xfec7 ZigBee HA 51 ZCL: Discover Attributes, Seq: 6
[2020-11-22 17:11:58.993800] 19621 → IEEE 802.15.4 5 Ack
**** Not sure that this was a valid communication from the ZNP, but just in case:
[2020-11-22 17:12:04.972770] 19638 → 0xec5c IEEE 802.15.4 52 Reserved, Dst: 0xec5c, Bad FCS

For the future, I am casting the type for "next" so that I do not need to cast it in the debugger.

typedef struct osalTimerRec_s
{
struct osalTimerRec_s *next;
osalTime_t timeout;
uint16 event_flag;
uint8 task_id;
uint32 reloadTimeout;
} osalTimerRec_t;

This is a normal linked list:

I was able to save a bit of RAM as well by changing the storage class of a big variable in MT.c:

Before:
239 887 bytes of CODE memory
32 bytes of DATA memory (+ 70 absolute )
7 935 bytes of XDATA memory
192 bytes of IDATA memory
8 bits of BIT memory
426 bytes of CONST memory

After:

239 887 bytes of CODE memory
32 bytes of DATA memory (+ 70 absolute )
7 891 bytes of XDATA memory
192 bytes of IDATA memory
8 bits of BIT memory
470 bytes of CONST memory

Here is the seriallog and gateway log and sniffer log. Hopefully that heps to identify what caused this I can send a download link to the project (with logs, etc) - as it contains the source, I can do it by private mesage.

ZNP_StuckInosalTimerUpdate.zip

over 5 years ago

0 YiKai Chen over 5 years ago

Guru 735695 points

For your reference, I will reset CC2530 if my host application cannot get MT command response from serial interface contineously.

0 Mario DW over 5 years ago in reply to YiKai Chen

Genius 3145 points

You are right. In my case that should be done in the gateway.

Teh problem is also that the watchdog of the CC2530 is maximum 1 second or so. In case the ZNP is a controller or a router, it could be set though and triggered every second, and "software" watchdogs could be added in the ZNP as well.

Of course, the bug resulting in this is better solved...

0 Ryan Brown1 over 5 years ago in reply to Mario DW

TI__Guru**** 219597 points

Hey Mario,

Did you happen to capture the call stack during this time? It is apparent that you are encountering the RAM limitations of the CC2530, which is noted in SWRA635. CC2530 ZNP resets may be necessary based on MT UART/network traffic taxing the RAM availability.

Regards,
Ryan

0 Mario DW over 5 years ago in reply to Ryan Brown1

Genius 3145 points

I didn't look in the right place for the call stack. There was not much to make of the call stack.
I do not know if we could make anything form that.

There's no more than 6 devices in the network.

The logs show that there isn't a lot of traffic.

#define MAXMEMHEAP 3017 /* isn't enough? */

In General options:

Normally the compiler should check how much stack it needs, I do not know if it does that.

osal_mem_alloc checks if there is enough memory .

So you confirm that I do not have enough memory?!

0 Mario DW over 5 years ago in reply to Ryan Brown1

Genius 3145 points

Dear Ryan

Apparently the pasted images are not visible on the site, it showed that XDATA stack size is 4D0.

We can compare the screenshots and also read the hex file dump of XDATA memory.

In the hex dump we can find the Timerhead value, so this is the correct dump:

This value is invalid as it is inside the STACK which ends at 0x4D0. Unfortunately it is not initialized, so visual inspection does not allow to know if there was a lack of STACK -> I add initialisation in the code so that we can see this next time.
For the moment the stack pointer did not go below 0x03f0.

Local variables are mostly mapped to registers so we can't find them for the function in use.

I examined the variables located near Timerhead, and extracted their values from the XDATA dum:

osalMemStat 25  -> can only be 0 or 1, so is corrupted.
pgOff   01012502012504012506012509
pgLost  0801250A01250C01250E012510
pgRes   01
hotPg   251201
hotOff 250002180402
pwrmgr_attribute 1B000330010322 -> These are all 0 in my case.
TimerHead 0203  -> 0x0302
OsalSystemClock 26030318
tasksEvents 040318060318070342080342AB

So it looks like the pwrmgr_attribute was also corrupted, and we are near address 0x1800.

But the end of the heap is not too far away:

XDATA_Z
  Relative segment, address: XDATA 00000C19 - 000017E0 (0xbc8 bytes), align: 0
  Segment part 7.             Intra module refs:   osal_mem_alloc
                                                   osal_mem_free
                                                   osal_mem_init
           LOCAL                   ADDRESS         
           =====                   =======         
           theHeap                 00000C19

A copy of the last DATA_CONFIRM (or an earlier one) can be found in the heap, that memory would be available for reservation as it should be freed.

:1006A000000000FE034480000106C0C7FE0E0100EA

Starting from address 0x17Db in the XDATA dump there seems to be some kind of pattern. That might give a clue about what caused the data overwrite:

Starting from 0x17DB
000025
010025
000125 # osalMemStat is the first fixed variable and it is 0x25 on this line
010125
020125
040125
060125
080125
0a0125
0c0125
0e0125
100125
120125
000218
04021b
000330
010322
020326
030318
040318
060318
070342
080342

This seems too regular to me.

It should be possible to have an idea about what this is.

0 Mario DW over 5 years ago in reply to Mario DW

Genius 3145 points

I got it, this is from a communication after the failure:

[2020-11-22 18:51:56.768371] 20822        0xfec7 → 0x0000       ZigBee HA 124 ZCL: Discover Attributes Response, Seq: 5^M
[2020-11-22 18:51:56.768433] 20823               →              IEEE 802.15.4 5 Ack^M
[2020-11-22 18:51:56.783757] 20824        0x0000 → 0xfec7       ZigBee 45 APS: Ack, Dst Endpt: 14, Src Endpt: 1^M
[2020-11-22 18:51:56.784238] 20825               →              IEEE 802.15.4 5 Ack^M
[2020-11-22 18:51:56.790531] >  ( 48)::fe 63 44 81 00 00 02 07 c7 fe 0e 01 00 17 00 d6 ad 25 00 00 4f 18 05 0d 01 00 00 25 01 00 25 00 01 25 01 01 25 02 01 25 04 01 25 06 01 25 08 01
[2020-11-22 18:51:56.797722] >  ( 56)::25 0a 01 25 0c 01 25 0e 01 25 10 01 25 12 01 25 00 02 18 04 02 1b 00 03 30 01 03 22 02 03 22 03 03 18 04 03 18 06 03 18 07 03 42 08 03 42 00 04 2a 0e 04 2a c7 fe 1d ac

And this was received near the time the last communication occurred:

$ grep "Discover Attributes" ZNP_StuckInosalTimerUpdate.log
[2020-11-22 17:11:58.266573] 19606        0x0000 → 0xfec7       ZigBee HA 51 ZCL: Discover Attributes, Seq: 3
[2020-11-22 17:11:58.308222] [GATEWAY/HNDL] MISC1  : Processing Discover Attributes Response
[2020-11-22 17:11:58.599316] 19612        0x0000 → 0xfec7       ZigBee HA 51 ZCL: Discover Attributes, Seq: 5
[2020-11-22 17:11:58.653239] [GATEWAY/HNDL] MISC1  : Processing Discover Attributes Response
[2020-11-22 17:11:58.987306] 19618        0x0000 → 0xfec7       ZigBee HA 51 ZCL: Discover Attributes, Seq: 6
[2020-11-22 17:12:05.017988] 19640        0xfec7 → 0x0000       ZigBee HA 124 ZCL: Discover Attributes Response, Seq: 6

Next step is to analyse the memory reservations in Discover Attributes.

Ryan, I trust that you can look into that part.

0 Ryan Brown1 over 5 years ago in reply to Mario DW

TI__Guru**** 219597 points

There is nothing particularly concerning about an AF_INCOMING_MSG Discover Attributes Response unless your UART TX/RX buffers are not large enough to handle the message size. The repetition of 0x0125 between the AF_INCOMING_MSG and XDATA variables indicates an overlap with the heap. It can be assumed that MAXMEMHEAP is being surpassed, can you clarify what information is supposed to exist in the Discover Attributes Response from the sniffer log?

Regards,
Ryan

0 Mario DW over 5 years ago in reply to Ryan Brown1

Genius 3145 points

If MAXMEMHEAP would be surpassed than the memory allocations should return NULL and the system should continue to fonction.

So if MAXMEMHEAP can be actually surpassed, than there is a problem with osal_malloc not verifying the available memory correctly.

I suggest that you look at my latest R&D work on the Z-STACK3.0.2 - by instrumenting the code, memory issues are detected early. I can't do much about it as it is in the library code AFAICS.

This issue may have a different cause but with my modification of Osal_Memory.c we will know that fairly easily.

Zigbee & Thread

Zigbee & Thread forum

CC2530: ZNP stuck inside osalTimerUpdate - ZStack 3.0.2