CC1352R: Device sometimes 'loses' short address and becomes unreachable

Part Number: CC1352R

We are seeing an issue with our device nodes that is difficult to debug because it is hard to reproduce.

Sometimes a device in a network simply loses its connection to the gateway. The device is still responsive (it reacts to sensor inputs), but we don't get any data through our network.

Because it happens only once in a while (possibly only once in several weeks), it is nearly impossible to debug: you never know which device in a network (mostly 20-30 devices per network) is going to fail. Resetting the device is the only way out of this mode, after which it works fine again.

What I tried today, with a device that was stuck in this 'mode', was to look at the network traffic with a sniffer. I found that it was still sending out data, but with an 'invalid' destination PAN (0xFFFF) and an invalid short address (0xFFFF). The destination itself was still correct (see screenshot from Wireshark). Before this happened, these values were of course correct (0x6833 and 0x0001).

So it looks like either the MAC layer or the sensor application loses the values for destination PAN and short address.

Our application is based on DMM sensor OAD with SDK5.10 but this was also happening in SDK4.30.

Any suggestions as to why a device can 'lose' the association data?

The only two places I found in the code where these values are set are disassocCnfCb and disassocIndCb, but there all the data gets reset, so the destination address would also be cleared, which is not the case.

  • Hi Marijn,

    Hm, that's strange. I understand that you find this hard to debug!

    Since you say this is on an OAD-enabled device, have you seen any correlation with an OAD having been performed? (Could it be that the PAN ID and short address are overwritten as part of the OAD?)

    Can you attach a debugger to your failing sensor device and read out the sensor statistics? 

    Cheers,

    Marie

  • Hi Marie,

    Yes, one line of thinking was indeed that it gets overwritten, but I'm not even sure whether the invalid data is in the application or in the MAC layer itself. I mean, if I were to accidentally overwrite

    devInfoBlock.devShortAddr

    I don't think the change would even be propagated to the MAC layer; as far as I can see, that only happens after rebooting the device.

    So I suspect that maybe the MAC layer itself might have an issue.

    As you mention, it is hard to debug, especially because attaching a debugger reboots the device, and then the issue is gone. Leaving the debugger connected and 'hoping' that the issue appears on that specific device could take weeks or even months. Most devices never show this issue.

  • Hi Marijn,

    It's possible to connect the debugger to a running target without resetting. Please see the following.

    https://e2e.ti.com/support/wireless-connectivity/bluetooth-group/bluetooth/f/bluetooth-forum/882926/faq-ccs-cc2640r2f-cc26x2-how-to-connect-the-debugger-to-a-running-target

    I'm not sure I will be able to help you pinpoint the root cause, since I'm not able to reproduce it. Do you have a workaround that lets you reset your device when this happens?

    Cheers,

    Marie H.

  • Hi,

    Thanks, I was able to connect to a running device using this solution, but only on a non-production device. All our production devices have locked flash, so it won't help there. I forgot to mention that earlier.

    I have found a workaround for now which also points in the direction of the error.

    I now check the current PAN ID and short address in the MAC layer at a regular interval using:

        reqStatus = ApiMac_mlmeGetReqUint16(ApiMac_attribute_panId, &currentPANId);

    and:

        reqStatus = ApiMac_mlmeGetReqUint16(ApiMac_attribute_shortAddress, &currentShortAddr);

    When either one of these is 0xFFFF I log that and reboot the device.

    From the log I already saw the issue happening on one device over the weekend. So this at least points in the direction of the MAC layer because that's what I'm checking and it is not being changed anywhere in the application besides the association process.
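
A minimal sketch of the periodic check described above, written so that it can run on a host. The `pib_read_u16_fn` callback type, the `assoc_check_t` enum and the `fake_*` readers are hypothetical names introduced for illustration; on the target, the callbacks would wrap the `ApiMac_mlmeGetReqUint16()` calls quoted above. The function only classifies the readings; logging and rebooting are left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value the thread reports for a lost PAN ID / short address. */
#define ADDR_UNASSIGNED 0xFFFFu

/* Hypothetical callback: reads one 16-bit MAC attribute, returns true on
 * success. On the target this would wrap ApiMac_mlmeGetReqUint16(). */
typedef bool (*pib_read_u16_fn)(uint16_t *value);

typedef enum {
    ASSOC_OK,
    ASSOC_LOST,        /* PAN ID or short address reads back as 0xFFFF */
    ASSOC_READ_FAILED  /* the read itself failed; caller decides what to do */
} assoc_check_t;

assoc_check_t check_association(pib_read_u16_fn read_pan_id,
                                pib_read_u16_fn read_short_addr)
{
    uint16_t pan_id = 0;
    uint16_t short_addr = 0;

    if (!read_pan_id(&pan_id) || !read_short_addr(&short_addr)) {
        return ASSOC_READ_FAILED;
    }
    if (pan_id == ADDR_UNASSIGNED || short_addr == ADDR_UNASSIGNED) {
        return ASSOC_LOST;
    }
    return ASSOC_OK;
}

/* Hypothetical host-side fakes, preloaded with the good values from the
 * original post, so the sketch can be exercised off-target. */
static uint16_t fake_pan_id = 0x6833;
static uint16_t fake_short_addr = 0x0001;

static bool fake_read_pan_id(uint16_t *v) { *v = fake_pan_id; return true; }
static bool fake_read_short_addr(uint16_t *v) { *v = fake_short_addr; return true; }
```

Because 0xFFFF is also what a device reports before it has joined a network, a check like this presumably only makes sense once association has completed.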

    regards,

    Marijn

  • Hi Marijn,

    Thank you for posting.

    Do the devices show any other strange issues? Could this be a general memory corruption issue?

    Cheers,

    Marie H.

  • Hi Marie,

    The device behaves normally for all functions except this MAC issue, so general memory corruption seems unlikely. I thought the MAC RF core has its own memory, doesn't it?

    If I understand the topology correctly, then if the MAC returns 0xFFFF, any memory corruption would be located in the RF core?

    What I did notice on the failing device was that the general radio 'statistics' were not as good as those of the other devices.

    It had 3 macAckFailures, 4 channelAccess failures, 2 synclossIndications, a worstCaseE2EDelay of 1032, etc.

    All devices are in the same room (test lab), so they should have the same conditions.

  • Hi Marijn,

    The TI 15.4-Stack uses the M4 RAM, but you are correct that the Radio Core has its own RAM. If there were memory corruption in the M4 RAM used by the TI 15.4-Stack, I'm not sure how long it would take for the wrong address to be read into RAM.

    Interesting to hear about the statistics. It sounds like the device has tried to go into orphan mode but failed halfway through the process?

    I need to get some input from the software development team. Can you let me know whether you are using beacon, non-beacon or FH mode? Also, which SimpleLink CC13x2/26x2 SDK are you using?

    Cheers,

    Marie H.

  • Hi Marie,

    We are using beacon mode with SDK5.10.

    I am also seeing other strange networking behaviour.

    Sometimes a device also gets into a 'stuck' mode where it no longer responds to incoming messages.

    I checked with the sniffer and the gateway puts the address in the pending message list but the device never performs a data request. 

    It does however send out messages correctly.
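
One way to confirm from a capture that the gateway really lists the device as pending is to decode the pending address fields of the beacon. The sketch below assumes the IEEE 802.15.4 beacon layout: a Pending Address Specification octet whose low three bits give the number of pending short addresses, followed by those addresses in little-endian byte order. `short_addr_is_pending` is a hypothetical helper, not part of the TI 15.4-Stack.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* fields: points at the Pending Address Specification octet of a sniffed
 *         beacon, followed directly by the address list.
 * len:    number of bytes available starting at fields.
 * Returns true if short_addr appears among the pending short addresses. */
bool short_addr_is_pending(const uint8_t *fields, size_t len,
                           uint16_t short_addr)
{
    if (fields == NULL || len < 1) {
        return false;
    }

    /* Bits 0-2: number of short addresses pending. */
    size_t num_short = fields[0] & 0x07u;

    /* Reject a truncated capture rather than reading past the buffer. */
    if (len < 1 + 2 * num_short) {
        return false;
    }

    for (size_t i = 0; i < num_short; i++) {
        const uint8_t *p = &fields[1 + 2 * i];
        /* Short addresses are sent least significant byte first. */
        uint16_t addr = (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
        if (addr == short_addr) {
            return true;
        }
    }
    return false;
}
```

If the device's short address shows up here in consecutive beacons while no data request follows, that would support the observation above.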

    This may also point in the direction of the device failing to complete orphan mode?

    I already tried doing only sync requests instead of an orphan scan, but that brought no improvement.

    Any other suggestions?

    thanks,

    Marijn

  • Hi Marie,

    OK, I did some further debugging and found something interesting.

    I added the current network state and the current state of autoRequest to the logging in our status message.

    I did this because, as I mentioned in the previous message, the device is still able to transmit when the network issue happens.

    I found that when the network issue appears, the value of autoRequest is false while the network state is 'rejoined'. That is of course not good, as it should only be false while doing sync requests.

    So I think some better checks need to be implemented to manage the network better. I will do some further tests in this area.
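
The consistency rule described above (autoRequest should be false only while sync requests are in progress) could be checked with a small helper like the following. The `net_state_t` enum and its members are hypothetical placeholders for the application's own network states, and how the two inputs are obtained is deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical network states; the real application defines its own. */
typedef enum {
    NET_STATE_INIT,
    NET_STATE_JOINED,
    NET_STATE_REJOINED,
    NET_STATE_SYNCING,   /* sync requests in progress */
    NET_STATE_ORPHAN
} net_state_t;

/* Returns true when the autoRequest setting is consistent with the network
 * state, under the rule stated in the thread: autoRequest may be false only
 * while sync requests are being performed. */
bool auto_request_consistent(net_state_t state, bool auto_request)
{
    if (!auto_request) {
        return state == NET_STATE_SYNCING;
    }
    return true;
}
```

A periodic task could log the state pair whenever this returns false, which would capture the 'rejoined' with autoRequest false combination seen in the status messages.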

  • Hi Marijn,

    Is it possible to see from the sniffer log whether the ack failures and sync losses happen before or after the device short address and PAN ID change? As you say, it could be that the device is trying to go into orphan mode but somehow not fully succeeding.

    Cheers,

    Marie H.