CCS/CC1310: FH: Network stability and rejoin issue

Slev1n

Part Number: CC1310

Tool/software: Code Composer Studio

Hey guys,

We use frequency hopping (FH) with about 60 channels enabled, and we do not use long-range mode. We have one collector and about 50 sleepy sensors.
The sensors have a 20 min polling interval and a 20 min reporting interval.
The sensors are basically TIDA-00489 boards, though we have adapted the RTC capacitors. Currently the RTC frequency of the collector (CC1352R1 LaunchPad)
is 32768.1 Hz and the sensors are between 32768.7 and 32769.1 Hz. The trickle timers use the default values from the sensor and collector examples of SDK 3.1 and SDK 3.2, respectively. Note that in our code, the sensors disconnect and try to rejoin if two consecutive messages are not acknowledged, and afterwards retry the rejoin every 10 min. At the beginning, all sensors join the network as expected without major issues.
However, we noticed some interesting network behavior afterwards:

1. If the parameter CONFIG_MAX_RETRIES is set to 0, eventually (one after the other within 24 hours) all sensors disconnect and are not able to rejoin, although I can
see that the collector responds to PAS and PCS messages with PA and PC messages, respectively.

2. If the parameter CONFIG_MAX_RETRIES is set to 1 the network is always full and the transmission rate of the sensors is 100%.
According to our logfile, the sensors very rarely disconnect but are able to rejoin.

3. If the collector is restarted, all sensors are immediately able to join again.
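
For reference, the disconnect/rejoin policy described in the setup above (disconnect after two consecutive unacknowledged messages, then retry the rejoin every 10 min) can be sketched roughly as follows. This is only a minimal illustration and not the actual sensor code: all names (link_tracker_t, link_on_data_confirm(), link_rejoin_due(), link_on_joined()) are hypothetical and not part of the TI 15.4-Stack API, and it assumes the first rejoin attempt happens immediately after the disconnect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Policy values taken from the post; the names themselves are hypothetical. */
#define MAX_CONSECUTIVE_NO_ACK 2u
#define REJOIN_INTERVAL_S      (10u * 60u)

typedef enum { LINK_JOINED, LINK_REJOINING } link_state_t;

typedef struct {
    link_state_t state;
    uint8_t      no_ack_count;   /* consecutive unacknowledged messages */
    uint32_t     next_rejoin_s;  /* time of the next rejoin attempt */
} link_tracker_t;

void link_init(link_tracker_t *t)
{
    assert(t != NULL);
    t->state = LINK_JOINED;
    t->no_ack_count = 0;
    t->next_rejoin_s = 0;
}

/* Feed the result of one data confirm. Returns true when the sensor
 * should disconnect now (second consecutive missing ACK). */
bool link_on_data_confirm(link_tracker_t *t, bool acked, uint32_t now_s)
{
    assert(t != NULL);
    if (t->state != LINK_JOINED) {
        return false;
    }
    if (acked) {
        t->no_ack_count = 0;     /* any ACK resets the streak */
        return false;
    }
    if (++t->no_ack_count < MAX_CONSECUTIVE_NO_ACK) {
        return false;
    }
    t->state = LINK_REJOINING;
    t->no_ack_count = 0;
    t->next_rejoin_s = now_s;    /* assumption: first attempt is immediate */
    return true;
}

/* Returns true when a rejoin attempt is due and schedules the next one. */
bool link_rejoin_due(link_tracker_t *t, uint32_t now_s)
{
    assert(t != NULL);
    if (t->state != LINK_REJOINING || now_s < t->next_rejoin_s) {
        return false;
    }
    t->next_rejoin_s = now_s + REJOIN_INTERVAL_S;
    return true;
}

/* Call once the association has succeeded again. */
void link_on_joined(link_tracker_t *t)
{
    assert(t != NULL);
    t->state = LINK_JOINED;
    t->no_ack_count = 0;
}
```

With a policy like this, a single missed acknowledgment is tolerated, but two in a row trigger a full rejoin, which is why the retry setting matters so much for the observations above.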

We are quite at a loss as to the reasons:
1. Why are the sensors losing connection although the intervals are very short?
2. Why are they not able to rejoin?
3. Why can they join after restarting (no factory reset!) the collector?
4. Is there a way to delay the message retry of the sensor to more than the current 50 ms, e.g. 100 ms or more?

Any advice or hint what to check would be useful.

best wishes
Slev1n

over 4 years ago

0 AB over 4 years ago

TI__Mastermind 23556 points

Hey Slev1n,

We are quite at a loss as to the reasons:
1. Why are the sensors losing connection although the intervals are very short?

They could be losing sync due to clock drift. If you lower the interval, does it keep happening?

2. why are they not able to rejoin?

This I am not sure, I will look into this.
3. Why can they join after restarting (no factory reset!) the collector?

This could be because the collector might send out async messages or something else. I will look into this more.
4. Is there a way to delay the message retry of the sensor to more than the current 50ms? E.g. 100ms or more?

Since the stack uses CSMA, it will not retry instantly; it calculates a random delay using the CSMA/CA formula. If you want to increase the time, you can increase the minimum and maximum backoff exponent values. Another option is to handle the retries yourself at the application level.
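
To illustrate how the backoff exponents bound the delay, here is a minimal sketch of the standard IEEE 802.15.4 unslotted CSMA/CA rule: each channel access attempt waits a random number of unit backoff periods in the range 0 to 2^BE - 1, and BE grows by one after each busy channel assessment until it reaches the maximum. The function names are hypothetical, the example values in the note below are not claimed to be the TI stack defaults, and the duration of one unit backoff period depends on the PHY, so the sketch stays in units.

```c
#include <assert.h>
#include <stdint.h>

/* Largest possible backoff for a given exponent: 2^BE - 1 unit periods. */
uint32_t csma_max_backoff_units(uint8_t be)
{
    assert(be < 32);
    return (1u << be) - 1u;
}

/* Worst-case sum of backoffs over a number of channel access attempts,
 * with BE starting at min_be and growing by one per busy channel
 * assessment until it saturates at max_be. */
uint32_t csma_worst_case_units(uint8_t min_be, uint8_t max_be, uint8_t attempts)
{
    assert(min_be <= max_be);
    uint32_t total = 0;
    uint8_t be = min_be;
    for (uint8_t i = 0; i < attempts; i++) {
        total += csma_max_backoff_units(be);
        if (be < max_be) {
            be++;
        }
    }
    return total;
}
```

For example, with a minimum exponent of 3 the first backoff is at most 7 unit periods, and with exponents 3 to 5 over five attempts the worst case is 7 + 15 + 31 + 31 + 31 = 115 unit periods. Raising both exponents widens the random window, which increases the average delay but does not guarantee a fixed minimum delay.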

0 Slev1n over 4 years ago in reply to AB

Genius 5680 points

Hey AB,

thanks for the fast reply. First of all, one more piece of information for you: the 50 sensors are all within one box, thus within 1 square meter. Maybe this could explain the bad rejoin behavior, too?

AB said:
They could be losing sync due to clock drift. If you lower the interval, does it keep happening?

Since the maximum difference between the clocks is about 1 Hz, the clock drift is at most 36.6 ms per 20 min. For a dwell time of 250 ms, there is an 85% chance that the sensor transmission starts on the channel the collector is currently on. Correct me if I am mistaken here. Hence you are correct that there is room for optimization, but we can never bring the failure chance to zero. I also think that once the sensor clock is outside the collector's timetable, the subsequent messages will fail as well, and in our code the sensors then automatically disconnect and try to rejoin.
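
The arithmetic behind these numbers can be written down as a small sketch. It assumes the simple model used above: the relative frequency error between the two 32 kHz crystals accumulates linearly over the reporting interval, and the chance of still hitting the collector's current channel is the part of the dwell window that the offset has not consumed. The function names are hypothetical and only serve this example.

```c
#include <assert.h>

/* Relative frequency error between two RTC crystals, as a fraction. */
double rel_freq_error(double f_sensor_hz, double f_collector_hz)
{
    assert(f_sensor_hz > 0.0 && f_collector_hz > 0.0);
    double diff = f_sensor_hz - f_collector_hz;
    if (diff < 0.0) {
        diff = -diff;
    }
    return diff / f_collector_hz;
}

/* Timing offset in milliseconds accumulated over interval_s seconds. */
double drift_ms(double f_sensor_hz, double f_collector_hz, double interval_s)
{
    return rel_freq_error(f_sensor_hz, f_collector_hz) * interval_s * 1000.0;
}

/* Fraction of the dwell window that still overlaps after the given
 * offset (simplified model: the offset just eats into the window). */
double on_channel_fraction(double offset_ms, double dwell_ms)
{
    assert(offset_ms >= 0.0 && dwell_ms > 0.0);
    if (offset_ms >= dwell_ms) {
        return 0.0;
    }
    return 1.0 - offset_ms / dwell_ms;
}
```

With the worst-case sensor at 32769.1 Hz against the collector at 32768.1 Hz, the error is about 30.5 ppm, which over 1200 s gives roughly 36.6 ms; against a 250 ms dwell time that leaves about 0.85 of the window.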

AB said:
Since the stack uses CSMA, it will not retry instantly; it calculates a random delay using the CSMA/CA formula. If you want to increase the time, you can increase the minimum and maximum backoff exponent values. Another option is to handle the retries yourself at the application level.

Thank you for the hint regarding the backoff exponents; I think I was still misunderstanding their meaning. However, I implemented the retry myself at the application level. With the code set to two retries and a Task_sleep() of 100 ms, I get a delay of 150 ms before the second transmission attempt (checked with a sniffer in single-channel mode), but the delay between the second transmission attempt and the third is 1.1 s. Basically, I just call processSensorMsgEvt() within dataCnfCb() if the status is "no ack". Any idea why the second delay is so large?

Regarding the rejoin behavior, I have to revise my statement that ALL sensors can join after resetting the collector. But I noticed that within a very short period (1-3 s) 8 sensors joined the last time, so maybe it is too crowded. I have already halved the trickle timers on both sides (collector and sensor) since only 60 channels are in use, but this did not improve the rejoin behavior. Again, I could see that a lot of async messages are being transmitted from the sensors, and I can see that the collector responds. However, it looks like no association is triggered, though I don't know whether this problem is on the sensor or the collector side. If I use a second LaunchPad as a sensor, I cannot reproduce this behavior, and since the sniffer is quite useless with 60 channels enabled, I am having a hard time debugging this. Can I use the assocIndCb() function on the collector side to see whether the sensors are sending the association request properly, or could something go wrong between the association request from the sensor and the execution of this function?

EDIT: I have created a separate question regarding the join behavior, so we can focus here on the clock drift and message retry issues.

best wishes

Slev1n

0 Slev1n over 4 years ago in reply to Slev1n

Genius 5680 points

Update:

We have further optimized and adapted our RTC frequency; due to the ±20 ppm tolerance of the oscillator, we cannot optimize this part any further. Besides, we have lowered the data transmission interval to 10 min.

--> According to the stats "messages attempted" and "messages transmitted successfully", the successful transmission rate lies between 75% and 90%.

Furthermore, we have adapted the backoff exponent limits and set the number of retries to 1.

--> We are now facing a phenomenon that we do not see in non-beacon mode or if we keep the backoff exponent limits at their initial values. The sensors sometimes transmit a message and do not recognize the acknowledgment from the collector. This results in the sensor retrying and transmitting the same data again, hence the collector receives the data twice, which is not desired. The RSSI is quite good, so I don't understand why the sensor is not recognizing the acknowledgment.

Any ideas how to improve the stability of the network?

best wishes

Slev1n

0 AB over 4 years ago in reply to Slev1n

TI__Mastermind 23556 points

Is the ack problem easily reproducible?

If so, can you route the radio TX/RX signals to GPIOs and use a logic analyzer to see whether the sensor is turning on the receiver fast enough to receive the ack?

Follow the link below and head to the Debugging chapter, section "Debugging RF Output" (this is the BLE user guide, but it applies to any stack).

0 Slev1n over 4 years ago in reply to AB

Genius 5680 points

Hey AB,

unfortunately it is not reproducible, and fortunately it is not happening too often. However, it also looks like not all boards are affected by this issue: some are affected more, some less, and on some boards it never occurs.

best wishes

Slev1n
