CAN bus problem with CRC, Bit Stuff, Form errors

Visio-CAN block diagram 20160429.pdf

My engineering group is fighting a CAN bus problem: one of the bus nodes, which uses a TI TMS320F28335PGFA
and an ON Semi AMIS42700WCGA4H dual CAN transceiver, is detecting CRC errors, Stuff errors and Form errors on the bus.
The errors seem to be detected infrequently, roughly every 15 seconds, during a constantly repeating cycle of
communication. We are using a 1 Mbit/s baud rate, but we have also tried half that speed, 500 kbit/s, with
no perceptible difference in error generation.

I have attached a block diagram of the network topology. The other devices being used on the bus are some
ELMO whistles (MCU and CAN transceiver not known to us) and other boards using a ST Micro STM32F103T6U6A
and SN65HVD231D.

Here are some details from another engineer in our group:

-----------

The goal is to run this configuration at 1MBit/sec, with a bus utilization of a little over 50%.  In my testing,
I have found that I consistently get CAN bus errors with this configuration, usually Stuff errors and Form errors.
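
To put the "a little over 50%" utilization figure in terms of frame rate, here is a minimal sketch in C. It assumes classic CAN data frames with 11-bit identifiers; the function names are hypothetical and not taken from any TI or ST library. The worst-case stuffing bound is the commonly quoted one for the stuffed region from SOF through the CRC sequence.

```c
#include <assert.h>

/* Bits in a classic CAN data frame with an 11-bit identifier, before stuffing:
 * SOF 1 + ID 11 + RTR 1 + IDE 1 + r0 1 + DLC 4 + data 8*n + CRC 15
 * + CRC delimiter 1 + ACK slot 1 + ACK delimiter 1 + EOF 7 = 44 + 8*n,
 * plus 3 bits of interframe space. */
unsigned frame_bits_nominal(unsigned data_bytes)
{
    assert(data_bytes <= 8);
    return 44u + 8u * data_bytes + 3u;
}

/* Worst case: stuffing applies to the 34 + 8*n bits from SOF through the
 * CRC sequence and adds at most floor((34 + 8*n - 1) / 4) stuff bits. */
unsigned frame_bits_worst(unsigned data_bytes)
{
    assert(data_bytes <= 8);
    return frame_bits_nominal(data_bytes) + (34u + 8u * data_bytes - 1u) / 4u;
}

/* Bus utilization in percent for a given frame rate and bit rate. */
double bus_utilization_pct(unsigned bits_per_frame, double frames_per_s,
                           double bit_rate_hz)
{
    assert(bit_rate_hz > 0.0);
    return 100.0 * bits_per_frame * frames_per_s / bit_rate_hz;
}
```

Under these assumptions an 8-byte frame occupies between 111 and 135 bit times, so 50% utilization at 1 Mbit/s corresponds to roughly 3700 to 4500 such frames per second.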

If I remove a device from the bus, the errors go away
If I decrease the bus utilization, the errors go away
If I decrease the bus speed to 500 kBit/sec, the errors do NOT go away

The wires are all hand-twisted pair, each cable is less than 6 inches long, no shielding
The transceivers are a mixture of 3V and 5V
I have scoped the CAN H and CAN L signals, and the difference is consistently either 0V or 2.1V to 2.5V, which is well over the CAN spec of 1.5V difference
I have confirmed that the 120 Ohm  termination resistors are in place as shown in the diagram

-----------

We have also tried removing the ON Semi AMIS42700WCGA4H and using just a single SN65HVD231D CAN driver on
the TI TMS320F28335PGFA, with no improvement. We also tried removing the ELMOs from the bus, leaving only the TI TMS320F28335PGFA
and ST Micro STM32F103T6U6A nodes, again with no improvement.

I am wondering whether there is a problem with how the CAN controller on the TMS320F28335PGFA is being
configured, or whether the controller is getting false detects. We have looked at the CAN controller error flags on the ST Micro
part and we are not seeing any errors there, understanding of course that the TI DSP is receiving most of the
traffic, as it acts as the initiator for all the messaging on the bus.

I have read some of the related threads in this forum but I couldn't find anything that pinpointed the issue:

e2e.ti.com/.../481812
e2e.ti.com/.../492272
e2e.ti.com/.../1736817

Can anybody help?

  • Do you by any chance route RX pin through input qualification? Could the GPIO qualification errata cause the errors?

  • What do you mean by "input qualification"? Do you mean pin configuration such as debouncing?

    Is the pin configuration not automatically configured by its association with eCAN peripheral?

  • Jup, the device has a "debouncing" functionality that can be programmed and turned on or off. But it has some problems (look it up in the errata). The functionality can be enabled even for pins that are muxed to eCAN(or any other) peripheral. Whether this is the case, you will have to check with your code.

    If you have input qualification enabled, I would disable it (sync to CPU clock) and test the system again. If faults don't happen anymore good for you.

    If it does I'd wait for a fault event, and trip a free GPIO in order to trigger the scope. And I would connect the scope to CAN+, CAN-, DSP CAN_Rx and DSP CAN_Tx pins and look for a reason.

    Do the faults happen on a specific message? Only on long messages? From a specific node only? Can you tell which message is from which node? Look at the bit rate of each node. It is possible that one node's bit rate is slightly different, and with long messages that can produce an error.
  • Mitja's suggestion to check the bit-width in every node is good. Ensure it is as close to 1 usec as possible. Have you tried moving the sampling point of the bit and also adjusting the SJW value? Are the errors happening only when the 28335 is receiving? If so, are the errors tied to a particular transmitting node?
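
    To make the bit-width, sampling-point and SJW checks concrete, here is a minimal sketch in C of the generic CAN bit-timing arithmetic. The function names are hypothetical. The 75 MHz CAN clock in the usage note is an assumption (SYSCLKOUT / 2 with a 150 MHz SYSCLKOUT); confirm it against your clock setup. The arguments are segment lengths in time quanta, not raw register field values, which on many controllers are stored as the value minus one.

```c
#include <assert.h>

/* One bit = 1 sync TQ + tseg1 TQ + tseg2 TQ, with TQ = brp / can_clk_hz.
 * brp, tseg1 and tseg2 are actual values, not raw register fields. */
double can_bit_rate_hz(double can_clk_hz, unsigned brp,
                       unsigned tseg1, unsigned tseg2)
{
    assert(brp > 0 && tseg1 > 0 && tseg2 > 0);
    return can_clk_hz / (brp * (1.0 + tseg1 + tseg2));
}

/* The sample point sits at the end of TSEG1, as a percentage of the bit time. */
double can_sample_point_pct(unsigned tseg1, unsigned tseg2)
{
    assert(tseg1 > 0 && tseg2 > 0);
    return 100.0 * (1.0 + tseg1) / (1.0 + tseg1 + tseg2);
}

/* Upper bound on SJW as quoted in this thread: min(4 TQ, TSEG2). */
unsigned can_sjw_max(unsigned tseg2)
{
    return tseg2 < 4u ? tseg2 : 4u;
}
```

    With the assumed 75 MHz clock, brp = 5, tseg1 = 10 and tseg2 = 4 give 15 TQ per bit, a 1 Mbit/s bit rate and a sample point a little above 73%. Comparing this computed bit time against the bit width measured on each node is one way to spot a node that is slightly off.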
  • I am working with Dan on this CAN issue.  Thank you for your suggestions.  

    I checked the input qualification setting and changed it to 0 (synchronous with SYSCLKOUT).  That did not solve the problem with the CAN errors.  

    I have tried different values for TSEG1, TSEG2, and SJW, but none of the valid combinations solved the problem.  Is there any reason to not set SJW to the largest possible value: min(4TQ,TSEG2)?

    All the messages received by the 28335 are 8 bytes long.  They come from several different nodes.  I have not been able to identify a specific node as the source of the problem.

    I am toggling a debug output point on the 28335 when I detect a CAN Rx error and using that edge to trigger a capture of the CAN Rx/Tx data stream with a USBee logic analyzer

    In this data capture, the previous CAN message (MsgID 412) is valid and the following CAN message (MsgID 281) is valid.  The data in between these two messages does not appear to be valid.

    The CAN TX line shows a repeating pattern of 6 usec low, 19 usec high, 6 usec low, etc.  

    The CAN RX line also has a repeating pattern: 11 usec high, 1 usec low, 0.3 usec high, 0.6 usec low, 0.3 usec high, 11.5 usec low, repeat.  This pattern continues for 280 usec, until MsgID 281 starts.

    Do these patterns mean something?  Are the nodes trying to resynchronize?  What would cause the very short pulses (0.3 usec, 0.6 usec)?

    In other error instances, I see different sequences of high and low pulses, but there is still a repeating pattern that lasts for more than 100 usec.

  • Few more things to check:

    1. The CANRX pin should not use input qualification. See this code snippet from controlSUITE:

    /* Set qualification for selected CAN pins to asynch only */
    // Inputs are synchronized to SYSCLKOUT by default.
    // This will select asynch (no qualification) for the selected pins.

        GpioCtrlRegs.GPAQSEL2.bit.GPIO30 = 3;   // Asynch qual for GPIO30 (CANRXA)
    //  GpioCtrlRegs.GPAQSEL2.bit.GPIO18 = 3;   // Asynch qual for GPIO18 (CANRXA)

    2. Please try DLC=0. i.e. "zero-byte" frames.

    3. Are 32-bit R/W operations used while accessing the eCAN registers? (See TRM for example)

    4. In the Logic-analyzer plot you had sent, CANTX & CANRX correspond to which node(s)? Shouldn't the waveforms look identical?

    5. Is the "problematic" waveform generated by a node other than the 28335? I am trying to ascertain the source of the problem. Is it the 28335 or the other nodes on the system? Based on what you said, the problem is seen only when the 28335 is receiving. Is it correct to say that the erratic pulse train is generated by some other node on the network?

  • Thank you for making these suggestions.

    1.  I have tried both input qualification settings, 0 (synchronous with SYSCLKOUT) and 3 (asynchronous); both settings seem to result in the same behavior.

    2. I have not tried this yet.  Are you suggesting sending only messages that have a 0 byte data component?

    3. I am going through my code again to check on the 32 bit R/W operations

    4. The logic analyzer plot showed CAN RX and CAN TX at the 28335 pins, before the CAN transceiver.   These are the Rx and Tx signals, not CAN-H and CAN-L.  The Rx and Tx signals should not look identical, but the CAN-H and CAN-L should be mirror images of each other, correct?

    5. I have tried different combinations of CAN devices on my CAN network to try and isolate the source of the error.  I have found that I can run a system with multiple STM32 processors and an Elmo drive, and I do not see any CAN errors.  The only type of device missing from this configuration is the 28335.  When I run a system that includes the 28335 and the other devices, I do see CAN Rx errors reported by the 28335, but only when the bus utilization is above a relatively low threshold (a little over 50%).  

  • What is the width of each bit you measured in the oscilloscope?

    For #2, yes. Please send only 0-byte messages as an experiment

    For #4, CAN_H & CAN_L should indeed be mirror images of each other. For a transmitting node, TX and RX waveform will/must be identical except during the arbitration phase and during the ACK phase. In the Logic Analyzer plot you had sent, the waveforms are not identical.  That is why I asked. Could you attach  a scope plot of the CANTX & CANRX pins of 28335 during the error phase?

    Please ensure the bus is terminated at both ends, and at both ends only.

  • I have attached some USBee logic analyzer plots that show activity on the CAN bus when a CAN error has been detected by the 28335.  My CAN network consists of the 28335, 4 STM32 processors (NODE1 - NODE4) and two motor drives.  On the motor drives, the CAN controller is integrated into the drive, and I do not have access to the CAN TX signal.  

    In the plot below, I am showing the CAN TX signal at each node, and the CAN RX signal at the 28335.  The bottom trace, CAN_ERR, is an I/O point on the 28335 that toggles when the 28335 detects a CAN RX error. I am checking the CAN_ES register in the CAN ISR on the 28335.  

    The middle of the plot shows a long period (300 usec) where the CAN signals look strange.  The message before this (msgID 504) is also not quite right: the data is shifted by 1 bit.  The msgID should be 0x282 (0010 1000 0010), not 0x504 (0101 0000 0100); the DLC should be 0x08, not 0x10; and so on.  This message came from one of the motor drives.
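
    The off-by-one-bit observation can be checked mechanically. The sketch below is a hypothetical helper, not part of any library, and reads the values above as hexadecimal, which is what the quoted binary patterns imply. It tests whether a decoded field equals the expected field shifted left by one bit position.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if `seen` equals `expected` shifted left by one bit,
 * comparing only the low `width` bits of each value. */
int is_shifted_left_by_one(uint32_t expected, uint32_t seen, unsigned width)
{
    assert(width > 0 && width < 32);
    uint32_t mask = (UINT32_C(1) << width) - 1u;
    return ((expected << 1) & mask) == (seen & mask);
}
```

    Both the identifier pair (0x282, 0x504) and the DLC pair (0x08, 0x10) pass this check, which is consistent with one extra bit having been inserted ahead of those fields.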

    I have included a screen capture of the suspect msgID 504 message and the end of the previous msgID 453 successful message below:

    The pulse widths are all whole multiples of 1 usec, up to the synchronization pulse at t = -548 usec.  At that point, 28335_CAN_RX shows the following sequence of pulses:

    high 0.938 usec

    low 1.06 usec

    high 11.18 usec

    low 1.0 usec

    high 0.25 usec

    low 0.75 usec

    Following this, everything is a whole multiple of 1 usec again.

    Questions:

    - why didn't NODE1_TX go low at t = -548 usec like all the other nodes?

    - what is the low pulse from NODE1_TX at t = -537 usec?  Did that cause the corruption in the message that had just started to transmit?

    - the short low pulse (0.75 usec) and very short high pulse (0.25 usec) at t = -537 usec can't be right, is this where the message gets off by 1 bit?

    I set up the scope to capture on a 28335 CAN RX positive pulse width less than 0.8 usec, and I got this result.  There is a high pulse of 0.3 usec, a low pulse of 0.54 usec, a high pulse of 5.4 usec, then a low pulse of 12 usec.  

    Green = 28335 CAN RX

    Yellow = 28335 CAN TX

    I have checked the termination resistors again, and everything appears correct to me.

  • It appears things go south with the appearance of MSGID 504. It is interesting that the data is off by a bit. I think this triggers error frames on the bus, which is what that strange-looking pulse train lasting 300 usec may be. The smallest error frame is 17 bits wide (6-bit error flag + 8-bit error delimiter + 3-bit interframe space), but bear in mind that the error flag can be up to 12 bits long: not all nodes may start transmitting their error flag at the same time, so the resulting flag is a superposition of the flags transmitted by every node on the bus. That being said, any pulse shorter than one bit time is suspect and needs to be investigated.
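
    As a quick sanity check on those numbers, the sketch below converts the error-frame lengths quoted above into durations. The helper names are hypothetical; the bit counts are the ones given in this reply.

```c
#include <assert.h>

#define ERROR_FLAG_MIN_BITS   6u   /* single error flag */
#define ERROR_FLAG_MAX_BITS  12u   /* superposed flags from several nodes */
#define ERROR_DELIMITER_BITS  8u
#define INTERFRAME_BITS       3u

/* Total length in bits of an error frame plus interframe space. */
unsigned error_frame_bits(unsigned flag_bits)
{
    assert(flag_bits >= ERROR_FLAG_MIN_BITS && flag_bits <= ERROR_FLAG_MAX_BITS);
    return flag_bits + ERROR_DELIMITER_BITS + INTERFRAME_BITS;
}

/* Duration of `bits` bit times in microseconds at the given bit rate. */
double bits_to_usec(unsigned bits, double bit_rate_hz)
{
    assert(bit_rate_hz > 0.0);
    return bits * 1e6 / bit_rate_hz;
}
```

    At 1 Mbit/s a single error frame therefore lasts between 17 and 23 usec, so a 300 usec disturbance would have to contain several error frames back to back, if error frames are indeed what the capture shows.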

     

    I think what we see at -548 usec is the ACK phase. It is not the synch pulse.

    • why didn't NODE1_TX go low at t = -548 usec like all the other nodes?

      Node1, being the transmitting node, puts out a high (recessive) ACK bit and expects it to be driven low by the other (receiving) nodes. This explains why you see a low pulse from all other nodes except NODE1.

    • what is the low pulse from NODE1_TX at t = -537 usec? Did that cause the corruption in the message that had just started to transmit?

      That appears to be a SOF bit, but I am puzzled that a MSGID does not follow. Since it violates the bit-stuffing rule, this can certainly trigger the error frames. What is interesting, though, is the long gap before the error frames are seen on the bus.

    • the short low pulse (0.75 usec) and very short high pulse (0.25 usec) at t = -537 usec can't be right, is this where the message gets off by 1 bit?

      Possible. I am surprised that pulses shorter than the bit-time even appear on the bus.
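
      For anyone decoding captures like these by hand, the bit-stuffing rule mentioned above can be checked with a few lines of C. This is a minimal sketch under stated assumptions: the capture has already been sampled into one value per bit time (0 = dominant, 1 = recessive), only the stuffed region of a frame (SOF through the CRC sequence) is passed in, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Scans a sequence of sampled bits (each 0 or 1) and returns the index of
 * the sixth consecutive bit of equal level, which is where a receiver
 * would flag a stuff error. Returns -1 if no such run exists. */
long find_stuff_violation(const unsigned char *bits, size_t n)
{
    size_t run = 0;
    for (size_t i = 0; i < n; i++) {
        assert(bits[i] <= 1);
        /* Extend the current run or start a new one on a level change. */
        run = (i > 0 && bits[i] == bits[i - 1]) ? run + 1 : 1;
        if (run >= 6)
            return (long)i;
    }
    return -1;
}
```

      A run of five equal bits followed by the opposite level is legal; only the sixth equal bit counts as a violation. Sub-bit-time pulses like the 0.25 usec one would have to be resolved into bit samples before a check like this applies.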

       

      Bottom line: It appears the 28335 is reacting to the anomalous pulses it sees on the bus, but it is not the source of the error itself.