IWRL6432: CAN bus error passive only happens when experiencing traffic from other devices

Part Number: IWRL6432
Other Parts Discussed in Thread: TCAN334, MMWAVE-L-SDK

Hello experts, 

I posted this question to the CAN transceiver people but was rightly directed here. I have a couple of clarification questions about how MCAN operates. 

Context: 

We have the TCAN334 transceiver in use with the IWRL6432 on a custom board. We have done extensive testing on our CAN communications and, as far as we can tell, the devices work indefinitely on a network with a small number of other nodes. These controlled networks have relatively low traffic. 

We recently received a trace containing typical network traffic. This trace is replayed onto our network with a single device on the other end. The device operates as expected for what seems to be a random amount of time, usually 5-20 minutes. At some point in the trace the device enters the error passive state. 

Questions: 

Based on my investigation and reading through the TRM, I have the following clarification questions. 

  1. We are using the RX FIFO mode with 32 available configured elements in each buffer. When a full buffer receives another message, the new message is ignored, and the message lost interrupt is triggered. Does the message lost interrupt result in an error message on the bus? In other words, does a full FIFO have any ripple effects on other parts of the CAN system? 
  2. We have observed an ACK error frame on the bus when we are getting into the passive error state. To my understanding this error occurs when the rest of the bus does not ACK a message TX'd from the device. What might randomly cause a single message to not be ACK'd by the other node? If this was the problem, it would indicate that 128 messages are not ACK'd, so the problem would seem to be something persistent rather than a single bad message (The error counters go from 0 to 128 all at once, not over time). 
  3. The device does not need to be power-cycled to recover, simply disconnecting the CAN bus and reconnecting it will solve the problem. This indicates that we might have a more graceful method of recovery. Do you have any recommendations on what we might leverage in the MCAN library to monitor and recover from this and other related problems in the future? 

Thank you for your time. 

  • Hey Daniel,

    Thanks for reaching out regarding these questions on CAN. I have addressed your questions below, but I will note that I'm only familiar with the TCAN1042 platform, so there might be some differences in behavior.

    We are using the RX FIFO mode with 32 available configured elements in each buffer. When a full buffer receives another message, the new message is ignored, and the message lost interrupt is triggered. Does the message lost interrupt result in an error message on the bus? In other words, does a full FIFO have any ripple effects on other parts of the CAN system? 

    There shouldn't be an error message on the bus, and the FIFO will not have a ripple effect on other parts of the CAN system. Messages should still be acknowledged by a full CAN controller since the ACK is only sent after verifying the integrity of the message and not based on the controller state itself. The controller will discard the message, the message lost condition will be signaled by RXFnS.RFnL = ‘1’, and the interrupt flag IR.RFnL will be set for the controller. Note: n is the FIFO number - i.e., either 0 or 1. 
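
As a rough illustration of checking for the message lost condition, the sketch below decodes a raw RXFnS value. The bit positions are taken from the generic Bosch M_CAN register layout and are an assumption here, so verify them against the IWRL6432 TRM. The type and function names are hypothetical and are not part of the MMWAVE-L-SDK.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed RXF0S/RXF1S layout (generic Bosch M_CAN); check the device TRM. */
#define RXFNS_FNFL_MASK 0x0000007FU /* FnFL: FIFO fill level, bits 6:0 */
#define RXFNS_FNF_BIT   (1U << 24)  /* FnF:  FIFO full */
#define RXFNS_RFNL_BIT  (1U << 25)  /* RFnL: message lost */

/* Hypothetical helper type, not an SDK structure. */
typedef struct {
    uint32_t fillLevel;   /* number of elements currently stored */
    bool     full;        /* true when the FIFO is full */
    bool     messageLost; /* true when a frame was dropped because the FIFO was full */
} RxFifoState;

/* Decode a raw RXFnS register value into its status fields. */
static RxFifoState decodeRxFifoStatus(uint32_t rxfns)
{
    RxFifoState s;
    s.fillLevel   = rxfns & RXFNS_FNFL_MASK;
    s.full        = (rxfns & RXFNS_FNF_BIT) != 0U;
    s.messageLost = (rxfns & RXFNS_RFNL_BIT) != 0U;
    return s;
}
```

Logging `messageLost` alongside the error counters can help confirm that dropped frames and the error passive transition are unrelated events.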

    We have observed an ACK error frame on the bus when we are getting into the passive error state. To my understanding this error occurs when the rest of the bus does not ACK a message TX'd from the device. What might randomly cause a single message to not be ACK'd by the other node?

    This may be hard to pinpoint since it seems like it's a rare occurrence in your system. Typically, a failed ACK is caused by incorrect bit timing or by external factors such as a short. Checking the Data Phase Last Error Code (DLEC) or Last Error Code (LEC) bitfield of the Protocol Status Register (PSR) might give you a better idea of what is causing it - see the LEC description in the following document. You can get the status using the MCAN_getProtocolStatus function from the MMWAVE-L-SDK and can see its usage in the SDK MCAN External Read Write example.
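
To make the LEC and DLEC values easier to read in a log, here is a minimal sketch that decodes them from a raw PSR value. The field positions and code meanings follow the generic Bosch M_CAN definition and are assumptions to verify against the TRM. The helper names are hypothetical. On target, the same information would come from MCAN_getProtocolStatus rather than from a raw register read.

```c
#include <stdint.h>

/* Assumed PSR layout (generic Bosch M_CAN): LEC bits 2:0, EP bit 5, DLEC bits 10:8. */
static uint32_t psrLec(uint32_t psr)  { return psr & 0x7U; }
static uint32_t psrDlec(uint32_t psr) { return (psr >> 8) & 0x7U; }
static int psrErrorPassive(uint32_t psr) { return (int)((psr >> 5) & 0x1U); }

/* Map a 3-bit LEC or DLEC code to a short description. */
static const char *lecToString(uint32_t code)
{
    switch (code & 0x7U) {
    case 0U: return "No error";
    case 1U: return "Stuff error";
    case 2U: return "Form error";
    case 3U: return "ACK error (transmitted frame was not acknowledged)";
    case 4U: return "Bit1 error";
    case 5U: return "Bit0 error";
    case 6U: return "CRC error";
    default: return "No change since the last PSR read";
    }
}
```

In the generic M_CAN definition, reading PSR resets LEC to the "no change" code, so it is probably best to read the register once per event and log the decoded value.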

    If this was the problem, it would indicate that 128 messages are not ACK'd, so the problem would seem to be something persistent rather than a single bad message (The error counters go from 0 to 128 all at once, not over time). 

    Now this to me sounds like a short. A short can increment the error counters very rapidly up to the error passive threshold, at which point the CAN controller has to wait an additional suspend transmission time before it can transmit on the bus again. While the short is held during the error passive state, the error counters keep incrementing more slowly until the node reaches bus-off, which requires the host CPU to initiate a bus-off recovery for the controller. That leads me into your next question.

    The device does not need to be power-cycled to recover, simply disconnecting the CAN bus and reconnecting it will solve the problem. This indicates that we might have a more graceful method of recovery. Do you have any recommendations on what we might leverage in the MCAN library to monitor and recover from this and other related problems in the future? 

    Similar to bus-off recovery, you could monitor the Error Counter Register (ECR), specifically the Transmit Error Counter (TEC) and Receive Error Counter (REC) bitfields, which will increment by 8 for every error and decrement by 1 for every success. Note: There are exceptions to counter behavior, but I won't get into the details here. This can be done using the MCAN_getErrCounters function. Alternatively, you can enable the error passive interrupt (IR.EP) using the MCAN_enableIntr function, which will trigger when the error passive state has changed. Note: Since the interrupt triggers when the state changes, it will also be triggered if PSR.EP changes from 1 to 0, so it's a good idea to check the PSR register to determine whether the device is actually in error passive.
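
For a feel of how quickly the counter can reach the threshold, the sketch below applies only the simplified +8 / -1 rule described above to a transmit error counter and ignores the exceptions. The function names are hypothetical and the code models the arithmetic only, not the hardware.

```c
#include <stdint.h>

#define ERROR_PASSIVE_THRESHOLD 128U /* counter value at which the node is error passive */

/* Simplified rule: a failed transmission adds 8 to the counter. */
static uint32_t tecAfterFailure(uint32_t tec)
{
    return tec + 8U;
}

/* Simplified rule: a successful transmission subtracts 1, saturating at 0. */
static uint32_t tecAfterSuccess(uint32_t tec)
{
    return (tec > 0U) ? (tec - 1U) : 0U;
}

/* Count how many consecutive failures take the counter to the threshold. */
static uint32_t failuresUntilErrorPassive(uint32_t tec)
{
    uint32_t failures = 0U;
    while (tec < ERROR_PASSIVE_THRESHOLD) {
        tec = tecAfterFailure(tec);
        failures++;
    }
    return failures;
}
```

Starting from 0, sixteen back-to-back failures are enough to reach 128 in this model, which may be why the counter appears to jump in one step rather than climb over time.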

    When either of the error counters reaches 128 or error passive is triggered, you should be able to re-initialize the CAN using the following sequence:

    • Set CCCR.INIT to 1 using MCAN_setOpMode to enter software initialization mode
    • Wait for CCCR.INIT to read back as 1 (there is a clock sync delay) using MCAN_getOpMode
    • Re-initialize MCAN using MCAN_init and previous parameters given upon first initialization
      • Execute any other configuration functions such as MCAN_config, MCAN_setBitTime, MCAN_msgRAMConfig
    • Set CCCR.INIT to 0 using MCAN_setOpMode to enter normal operation mode
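
The sequence above can be sketched as follows. To keep the example buildable on a host without the SDK headers, the driver calls are hidden behind a hypothetical callback table. On target, each callback would wrap the SDK function named in its comment. The mode constants, the poll limit, and the fake controller are all assumptions added for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the SDK operation mode values. */
#define OP_MODE_NORMAL  0U
#define OP_MODE_SW_INIT 1U

/* Hypothetical callback table, not an SDK type. */
typedef struct {
    void     (*setOpMode)(uint32_t mode); /* would wrap MCAN_setOpMode */
    uint32_t (*getOpMode)(void);          /* would wrap MCAN_getOpMode */
    int32_t  (*reinit)(void);             /* would wrap MCAN_init, MCAN_config,
                                             MCAN_setBitTime, MCAN_msgRAMConfig */
} CanRecoveryOps;

/* Run the re-initialization sequence; returns true on success. */
static bool canRecover(const CanRecoveryOps *ops, uint32_t maxPolls)
{
    uint32_t polls = 0U;

    /* Step 1: request software initialization mode (CCCR.INIT = 1). */
    ops->setOpMode(OP_MODE_SW_INIT);

    /* Step 2: wait for the mode to read back, with a bounded number of polls. */
    while (ops->getOpMode() != OP_MODE_SW_INIT) {
        if (++polls >= maxPolls) {
            return false; /* controller never entered init mode */
        }
    }

    /* Step 3: re-apply the original configuration. */
    if (ops->reinit() != 0) {
        return false;
    }

    /* Step 4: return to normal operation (CCCR.INIT = 0). */
    ops->setOpMode(OP_MODE_NORMAL);
    return true;
}

/* Fake controller for host-side testing: a mode change becomes visible
 * only after two polls, imitating the clock sync delay. */
static uint32_t fakeRequested = OP_MODE_NORMAL;
static uint32_t fakeMode = OP_MODE_NORMAL;
static uint32_t fakeDelay = 0U;

static void fakeSetOpMode(uint32_t mode) { fakeRequested = mode; fakeDelay = 2U; }
static uint32_t fakeGetOpMode(void)
{
    if (fakeDelay > 0U) { fakeDelay--; } else { fakeMode = fakeRequested; }
    return fakeMode;
}
static int32_t fakeReinit(void) { return 0; }
```

Bounding the wait is a design choice made for this sketch so that a controller that never enters init mode cannot hang the caller. The sequence in the list does not specify a timeout.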

    Hopefully this answers all your questions, but let me know if there's anything else we can help you with.

    Regards,

    Kristien

  • Kristien, 

    Your answer is wonderful, thank you! I will take some time over the next few days to implement your recommendations and do some more investigation. Once I have the next stage of findings I'll reply here with the solution or more questions. 

    Regards, 

    ds

  • Hey Daniel,

    I'm glad you found this helpful! Feel free to post any other questions you have after investigating this more on your side.

    Regards,

    Kristien

  • Reinitializing the CAN when an error is detected solved the problem. I am not sure if that is the "intended" solution but it solved the problem we were experiencing. Thank you. 

  • Hey Daniel,

    I'm glad you were able to resolve this issue on your side. It does seem strange that you need to reinitialize when the error count is at 128. One thing you could also check is whether the lack of automatic retransmission is contributing to the issue. If you are using MCAN_initOperModeParams to populate the MCAN_InitParams structure for the CAN driver, the Disable Automatic Retransmission (DAR) field is set to TRUE, which prevents the controller from automatically resending a message upon transmit failure. If that is the case, you can initialize an MCAN_InitParams structure and then change darEnable to FALSE before passing the struct to MCAN_init. Afterwards, re-test the issue and see if anything changes, such as the counter behavior.

    Regards,

    Kristien