CC1310: TI 15.4 Stack Linux Gateway - Gateways suddenly stop working after a while

Part Number: CC1310

Hello,

As you can see from the title, I have multiple TI 15.4 Stack Linux Gateways that suddenly stop working after a while. I have been looking for the cause of this issue for some weeks, but so far I am not able to find it, so I was hoping you could guide me in this.

Gateway Setup

The gateway is set up as follows: the CC1310 acts as the CoProcessor and runs the standard TI CoProcessor firmware. It is connected to a Raspberry Pi (RPI), which acts as the collector, and the two communicate over UART. The collector project has been modified, but only minimally (mainly the communication protocol between the RPI and the web client).

Problem

The problem I have encountered with every gateway so far is that it stops working after a while. There seems to be a pattern: after approximately 30-45 days, the gateway suddenly stops working properly. I then reboot it, which makes it work normally again, although sometimes it remains very unstable. I am only able to monitor these gateways remotely, and so far I have not been able to reproduce the problem at home. I therefore turned on every logging flag in the collector and decided to wait for the next time a gateway becomes unstable.

So far I have found only one possible cause: the UART log reports a checksum error, which indicates a corrupted byte in the communication between the CoProcessor and the collector. Looking further into one of the gateways that suddenly stopped working, I found that after the checksum error is reported, the UART log only reports 0-bytes. This makes me think that the CoProcessor may have crashed, leading to this problem. I have not been able to reproduce this yet.
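To illustrate why a single corrupted byte shows up as a checksum error, here is a minimal sketch of a frame check. It assumes MT-style framing (a 0xFE start-of-frame byte, a 2-byte little-endian length, two command bytes, the payload, and a one-byte XOR checksum over everything after the start byte). The field widths, the example command bytes, and the function names are assumptions for illustration only and should be checked against the CoProcessor interface documentation for the SDK in use.

```python
from functools import reduce


def xor_fcs(body: bytes) -> int:
    """XOR of all bytes in body (the checksum rule assumed here)."""
    return reduce(lambda a, b: a ^ b, body, 0)


def frame_is_valid(frame: bytes, sof: int = 0xFE) -> bool:
    """True if the frame starts with SOF and its last byte matches the XOR
    of everything between the SOF and that last byte."""
    if len(frame) < 2 or frame[0] != sof:
        return False
    return xor_fcs(frame[1:-1]) == frame[-1]


# Hypothetical frame: length = 1, cmd0 = 0x22, cmd1 = 0x05, payload = 0xAB
body = bytes([0x01, 0x00, 0x22, 0x05, 0xAB])
good = bytes([0xFE]) + body + bytes([xor_fcs(body)])

# Flip a single bit in the payload byte, as line noise might
bad = bytearray(good)
bad[5] ^= 0x10
bad = bytes(bad)

print(frame_is_valid(good))      # intact frame passes
print(frame_is_valid(bad))       # one flipped bit fails the check
print(frame_is_valid(bytes(7)))  # a run of 0-bytes has no SOF, so it fails too
```

Under these assumptions, a stream consisting only of 0-bytes never contains a valid frame, which is one reason the raw bytes are more informative than the parsed log.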

I am still not entirely sure whether this checksum error is the only problem, or whether other issues are also causing the gateways to stop working. What other steps should I take to investigate this problem remotely? And what is the best way to reproduce a crash on the CoProcessor (an error that is not being caught or handled), so that I can test my assumptions from home? This one does not have to be a remote approach.

I hope to hear from you soon.

Yours Sincerely,

Mohamed

 

  • Hi Mohamed,

    Remote debugging sure is a tricky thing. 

    It is hard to say what might be wrong or where to start looking, but let's consider the checksum error case. You say that after the error, the device reports UART logs of 0 bytes. I would assume that to trigger such a log statement, you actually have to receive something over the UART (it makes no sense to print "No UART activity right now" messages, right?).

    If we assume the "0 length" logs are connected to a UART communication attempt (i.e. something was seen on the RX line of the RPI), then the question is why the host is seeing 0 bytes, as that would be an empty frame. This does, however, rule out that the co-processor has crashed (at least fully), as it would not be able to send out even empty frames if that were the case.

    In that case, it would be interesting to see the actual serial data that reaches the host side. Do you have any way to insert a serial sniffer to sample the raw incoming data?

    As for your local debugging, my best advice would be, assuming you can get the device to crash, to connect the debugger to the co-processor and "Attach to a running target". That basically means connecting to the device without forcing a reset, so that you can enter debugging mode in the current device state. This would allow you to use, for example, the CCS "Runtime Object Viewer" tool to check the RTOS state. It would also give you a first impression of whether the device seems to be healthy or is stuck in an exception loop.

    As for reproducing it, the only thing I can think of is to "fake" a message from the co-processor to the host with a similar checksum error. This would, however, require some co-processor work, which I guess you want to avoid.

  • Hi M-W,

    Thank you for your response. I will see about the serial sniffer. Is it OK if we keep this topic open till I collect more information?

    Kind regards,

    Mohamed

  • Sure, good luck with the sniffer!
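The two suggestions in this thread (inspecting the raw sniffer data, and faking a frame with a bad checksum) could be sketched roughly as follows. This is not TI code. It assumes MT-style framing with a 0xFE start byte, a 2-byte little-endian length, two command bytes, and an XOR checksum over everything after the start byte. The constants, `build_frame`, `scan_capture`, and the example command bytes are all hypothetical names and values, so verify the framing against the CoProcessor interface documentation before relying on it.

```python
from typing import List, Tuple

SOF = 0xFE         # assumed start-of-frame byte
LEN_BYTES = 2      # assumed width of the length field (little-endian)
CMD_BYTES = 2      # assumed number of command bytes after the length
MIN_ZERO_RUN = 4   # arbitrary threshold for reporting a run of 0x00 bytes


def xor_fcs(data: bytes) -> int:
    """XOR of all bytes (the checksum rule assumed here)."""
    fcs = 0
    for b in data:
        fcs ^= b
    return fcs


def build_frame(cmd0: int, cmd1: int, payload: bytes = b"",
                corrupt: bool = False) -> bytes:
    """Build a frame; with corrupt=True the checksum is deliberately wrong."""
    body = len(payload).to_bytes(LEN_BYTES, "little") + bytes([cmd0, cmd1]) + payload
    fcs = xor_fcs(body)
    if corrupt:
        fcs ^= 0xFF  # guarantees a mismatch
    return bytes([SOF]) + body + bytes([fcs])


def scan_capture(raw: bytes) -> List[Tuple[int, str]]:
    """Walk a raw capture and report (offset, event) pairs."""
    events = []
    i, n = 0, len(raw)
    while i < n:
        if raw[i] == 0x00:
            j = i
            while j < n and raw[j] == 0x00:
                j += 1
            if j - i >= MIN_ZERO_RUN:
                events.append((i, f"zero-run x{j - i}"))
            i = j
        elif raw[i] != SOF:
            i += 1  # not a frame start: skip silently and keep looking
        elif i + 1 + LEN_BYTES > n:
            events.append((i, "truncated"))
            break
        else:
            length = int.from_bytes(raw[i + 1:i + 1 + LEN_BYTES], "little")
            end = i + 1 + LEN_BYTES + CMD_BYTES + length  # index of the checksum
            if end >= n:
                events.append((i, "truncated"))
                break
            if xor_fcs(raw[i + 1:end]) == raw[end]:
                events.append((i, "ok"))
                i = end + 1
            else:
                events.append((i, "bad-fcs"))
                i += 1  # resynchronise one byte further on


    return events


# One good frame, one frame with a faked checksum error, then only 0-bytes
capture = (build_frame(0x22, 0x05, b"\xAB")
           + build_frame(0x22, 0x05, b"\xAB", corrupt=True)
           + bytes(8))
print(scan_capture(capture))
# [(0, 'ok'), (7, 'bad-fcs'), (14, 'zero-run x8')]
```

Running a sniffer capture through something like `scan_capture` would show whether the 0-bytes really arrive on the wire after the bad frame, and the output of `build_frame(..., corrupt=True)` is the kind of message one could replay toward the host to see how the collector reacts.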