TMS570LC4357: Unreliable CAN initialization due to stuck canIsTxMessagePending()

Peter Lu

Part Number: TMS570LC4357
Other Parts Discussed in Thread: HALCOGEN

I'm experiencing a lot of unreliable CAN start-up. There are times when CAN starts up nicely, and messages flow between my two Launchpads without problems. Then there are times when CAN does not start up nicely and nothing flows between the two Launchpads.

The code on each Launchpad node, 1 and 2, calls canInit() to initialize the communication. What I find is that if I kick off canInit() on the two nodes within a certain "time window" of each other, then traffic flows fine. However, if the canInit() is outside this window, then both nodes get stuck indefinitely in canIsTxMessagePending(), wherein a transmit messagebox is filled (pending), but no transmit happens. I've tried both just ignoring the pending situation and calling canTransmit() with another message or clearing the transmit pending bit, but neither way "unsticks" the messagebox. I've also tried setting Halcogen to use "Enable Auto Bus On" to no avail.

I'm pretty sure this has to do with the inane CAN protocol flaw wherein a transmitter "aborts" its transmit if no receiver is there to pick up the message (the receiver clears an "I'm listening" bit in the message on the bus), and perhaps the TI CAN implementation. If I put a Kvaser Leaf Light analyzer probe on the CAN bus, then data flows fine, since the analyzer actively clears the bit in the message on the bus. What a flawed protocol, wherein an analyzer has to be _intrusive_ for data to flow and be monitored.

Are there any ways around this major bug, or am I dead in the water?

Thanks for all help.

over 2 years ago

0 Peter Lu over 2 years ago

Prodigy 220 points

The Tech Ref Manual says there's a TxRqst bit that can be used for restarting a (failed) transmission. This is apparently only used if Automatic Retransmission is OFF. I don't know what auto retransmission does, and don't know if this could save the day.

Grateful for any help. Thanks.

=== From Tech Ref Manual...

If Automatic Retransmission mode is disabled by setting the DAR bit in the CAN Control Register, the
behavior of bits TxRqst and NewDat in the Message Control Register of the Interface Register set is as
follows:
• When a transmission starts, the TxRqst bit of the respective Interface Register set is reset, while bit
NewDat remains set.
• When the transmission has been successfully completed, the NewDat bit is reset.
When a transmission failed (lost arbitration or error) bit NewDat remains set. To restart the transmission,
the application has to set TxRqst again.
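The quoted TRM text can be read as a small truth table over the TxRqst and NewDat bits. The following is a minimal sketch of that reading, assuming DAR is set; `tx_state_t` and `classify_tx_dar()` are hypothetical names for illustration and are not part of HALCoGen.

```c
#include <stdbool.h>

/* Hypothetical classification of a transmit message object when DAR is set
 * (automatic retransmission disabled), following the quoted TRM text. */
typedef enum {
    TX_IDLE_OR_DONE,      /* TxRqst = 0, NewDat = 0: nothing queued, or sent successfully */
    TX_WAITING,           /* TxRqst = 1: requested, transmission not started yet */
    TX_STARTED_OR_FAILED  /* TxRqst = 0, NewDat = 1: started; stays like this after a failure */
} tx_state_t;

static tx_state_t classify_tx_dar(bool txRqst, bool newDat)
{
    if (txRqst) {
        return TX_WAITING;
    }
    return newDat ? TX_STARTED_OR_FAILED : TX_IDLE_OR_DONE;
}
```

In the TX_STARTED_OR_FAILED case the bits alone do not say whether the frame is still on the bus or has already failed, which is why the TRM leaves the restart (setting TxRqst again) to the application.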

0 Peter Lu over 2 years ago in reply to Peter Lu

Prodigy 220 points

Looks like there's a potential fix/hack around the clogged tx messagebox. Hopefully clearing the TxRqst bit will unclog the messagebox and allow canTransmit() to work (again). I don't see any TI CAN controller feature to ignore the "receiver listening" bit for messages on the bus.

27.17.15 Transmission Request Registers (DCAN TXRQ12 to DCAN TXRQ78)
These registers hold the TxRqst bits of the implemented message objects. By reading out these bits, the
CPU can check for pending transmission requests. The TxRqst bit in a specific message object can be
set/reset by the CPU via the IF1/IF2 Message Interface Registers, or by the Message Handler after
reception of a remote frame or after a successful transmission.

0 Peter Lu over 2 years ago in reply to Peter Lu

Prodigy 220 points

Simple enough fix, just check to see if tx pending is active when it shouldn't be, as for instance if some amount of time has elapsed since tx message notification was seen...

// should not be any tx pending at this point, so clear any
uint32 txrqx = node->TXRQX;
if (txrqx & 0x3) { // check only the 16 messageboxes used
cnc_TxMessagePendingClogs++;
canClearTxMessagePending(node, messageBoxHi);
canClearTxMessagePending(node, messageBoxLo);
}

=== Added Halcogen function...

uint32 canClearTxMessagePending(canBASE_t *node, uint32 messageBox)
{
uint32 success = 0U;
uint32 regIndex = (messageBox - 1U) >> 5U;
uint32 bitIndex = 1U << ((messageBox - 1U) & 0x1FU);

/** - Check for pending message:
* - no pending message, return 0
* - pending message, clear pending
*/
if ((node->TXRQx[regIndex] & bitIndex) == 0U)
{
success = 0U;
}
else
{
/** - Wait until IF1 is ready for use */
/*SAFETYMCUSW 28 D MR:NA <APPROVED> "Potentially infinite loop found - Hardware Status check for execution sequence" */
while ((node->IF1STAT & 0x80U) ==0x80U)
{
} /* Wait */

/** - Terminate Transmission by clearing TxRqst in message box */
node->IF1CMD = (uint8) 0x00U; /* Note 0x84U == Busy+TxRqst */

/** - Trigger Remote Frame Transmit from message box */
/*SAFETYMCUSW 93 S MR: 6.1,6.2,10.1,10.2,10.3,10.4 <APPROVED> "LDRA Tool issue" */
node->IF1NO = (uint8) messageBox;

success = 1U;
}
/** @note The function canInit has to be called before this function can be used.\n
* The user is responsible to initialize the message box.
*/
return success;
}
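The index arithmetic at the top of canClearTxMessagePending() maps a 1-based message box number to one of the 32-bit TXRQx words and a bit inside it. A minimal stand-alone sketch of just that arithmetic follows; the helper names are hypothetical, and note that the variable called bitIndex in the function above is really a bit mask.

```c
#include <stdint.h>

/* Which TXRQx[] word holds the TxRqst bit of a 1-based message box number. */
static uint32_t txrq_reg_index(uint32_t messageBox)
{
    return (messageBox - 1U) >> 5U;          /* 32 message boxes per word */
}

/* Mask of that message box's bit inside the word. */
static uint32_t txrq_bit_mask(uint32_t messageBox)
{
    return (uint32_t)1U << ((messageBox - 1U) & 0x1FU);
}
```

For example, message box 1 maps to word 0 with mask 0x00000001, and message box 33 maps to word 1 with mask 0x00000001.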

0 Peter Lu over 2 years ago in reply to Peter Lu

Prodigy 220 points

The code needed to clear TxRqst is actually as below, with CMD write of 0x90U. The node->TXRQX and node->TXRQx[] registers show the bit cleared, and then a subsequent canTransmit() actually does something.

Unfortunately, this doesn't solve the problem. The messagebox does get unclogged and tx data starts flowing, but the data rate is less than 10% of what it should be. The TxRqst (pending) problem also keeps recurring. There's either some other action that is needed to recover/fix the Tx engine, or there may be no fix at all possible.

Please help with suggestions. Thanks.

=== Code in canClearTxMessagePending() to clear TxRqst...

/** - Terminate Transmission by clearing TxRqst in message box */
node->IF1CMD = (uint8) 0x90U; /* ProtectedWrite+Control+No_TxRqst */

/** - Trigger Clear Transmit Pending from message box */
/*SAFETYMCUSW 93 S MR: 6.1,6.2,10.1,10.2,10.3,10.4 <APPROVED> "LDRA Tool issue" */
node->IF1NO = (uint8) messageBox;

0 Peter Lu over 2 years ago in reply to Peter Lu

Prodigy 220 points

It seems that the clearing of TxRqst via IF1CMD in (my) canClearTxMessagePending() completes properly, but the ensuing call to canTransmit() does all the "right" things except that the transmit on the bus just doesn't happen.

I don't know what it takes to get the 0x87 command issued in canTransmit() to actually send the message object onto the bus.
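The two command values mentioned in this thread, 0x90 and 0x87, are easier to compare when spelled out bit by bit. The sketch below assumes the usual D_CAN command-byte layout (WR/RD in bit 7 down to Data B in bit 0); the macro and function names are made up for illustration, and the layout should be checked against the IF1CMD description in the TRM.

```c
#include <stdint.h>

/* Assumed bit layout of the IF1/IF2 command byte (verify against the TRM). */
#define IFCMD_WR        0x80U  /* 1 = transfer from interface registers to message object */
#define IFCMD_MASK      0x40U  /* transfer mask bits */
#define IFCMD_ARB       0x20U  /* transfer arbitration bits */
#define IFCMD_CONTROL   0x10U  /* transfer message control bits */
#define IFCMD_CLRINTPND 0x08U  /* clear interrupt pending (reads only) */
#define IFCMD_TXRQST    0x04U  /* TxRqst/NewDat */
#define IFCMD_DATA_A    0x02U  /* transfer data bytes 0-3 */
#define IFCMD_DATA_B    0x01U  /* transfer data bytes 4-7 */

/* Command used above to clear TxRqst: write the control bits, TxRqst not set. */
static uint8_t ifcmd_clear_txrqst(void)
{
    return (uint8_t)(IFCMD_WR | IFCMD_CONTROL);
}

/* Command issued by canTransmit(): write data A/B and set TxRqst. */
static uint8_t ifcmd_transmit(void)
{
    return (uint8_t)(IFCMD_WR | IFCMD_TXRQST | IFCMD_DATA_A | IFCMD_DATA_B);
}
```

Under this layout, a write with the Control bit set copies the message control word currently held in the interface register into the message object, so it may be worth checking that the interface register holds the intended control values before issuing 0x90.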

0 QJ Wang over 2 years ago in reply to Peter Lu

TI__Guru**** 197576 points

Hi Peter,

Sorry for the late response. Did you add a CAN transceiver to each Launchpad and a termination resistor (120 ohms) at each of the two ends of the CAN bus?

0 Peter Lu over 2 years ago in reply to QJ Wang

Prodigy 220 points

Yes.

As I mentioned, if I start the two Launchpad nodes within a certain time window (about 8 seconds), then the transmit-receive works fine. If the nodes are started outside a time window (say after 15 seconds), then the transmitter is jammed. Also, if I insert a Kvaser Leaf Light v2 probe on the bus, then everything works fine; the Leaf Light v2 sets the ACK Slot bit to dominant on all CAN frames transmitted on the bus.

In looking at the Tech Ref Manual section 27.17.1, there is mention of the TI controller deactivating the controller (transmitter?) once 11 recessive bits are seen (I'm assuming this means the ACK slot bit). The rest of the section is Greek to me at the moment, as I can't tell what error(s) I should be looking for, or how to recover from this error(s) if indeed the controller (transmitter) has gone inactive.
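One way to see which error the controller is reporting is to decode a saved copy of the Error and Status register. The sketch below assumes the common D_CAN field layout (LEC in bits 2:0, then TxOK, RxOK, EPass, EWarn, BOff); `can_es_t`, `decode_es()` and `lec_name()` are hypothetical helpers, and the layout should be confirmed in the TRM.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of an Error and Status register value (layout assumed). */
typedef struct {
    unsigned lec;  /* last error code, bits 2:0 */
    bool txOk;     /* bit 3: a frame was transmitted successfully */
    bool rxOk;     /* bit 4: a frame was received successfully */
    bool ePass;    /* bit 5: controller is error passive */
    bool eWarn;    /* bit 6: an error counter reached the warning limit */
    bool bOff;     /* bit 7: controller is bus off */
} can_es_t;

static can_es_t decode_es(uint32_t es)
{
    can_es_t d;
    d.lec   = (unsigned)(es & 0x7U);
    d.txOk  = (es & 0x08U) != 0U;
    d.rxOk  = (es & 0x10U) != 0U;
    d.ePass = (es & 0x20U) != 0U;
    d.eWarn = (es & 0x40U) != 0U;
    d.bOff  = (es & 0x80U) != 0U;
    return d;
}

/* Human-readable name for a last error code value (0..7). */
static const char *lec_name(unsigned lec)
{
    static const char *const names[8] = {
        "no error", "stuff error", "form error", "ack error",
        "bit1 error", "bit0 error", "crc error", "no event since last read"
    };
    return names[lec & 0x7U];
}
```

A lone transmitter whose frames are never acknowledged would be expected to show the ack error code together with EPass set and BOff clear. Reading the status register can have side effects on some bits, so it is safer to read it once and decode the copy.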

I'm not even clear how the TI controller does re-transmission. Not sure if that's a CAN spec thing or a TI controller feature, and what specs to look at.

Thank you so much for help.

0 Peter Lu over 2 years ago in reply to Peter Lu

Prodigy 220 points

I have re-verified that if a Launchpad is in the transmitter-blocked condition (neither a powered-on receiver nor a Kvaser Leaf Light v2 probe on the bus), and then a Kvaser probe is put onto the bus, then the transmitter starts transmitting. Hence, it seems it's the TI controller receiver (if/when available) that is _not_ ACKing the (re-)transmitted frame(s).

Any insight on this would help greatly.

Thanks.

0 Peter Lu over 2 years ago in reply to Peter Lu

Prodigy 220 points

Interestingly, if I turn _off_ Automatic Retransmission on the port, putting the Kvaser probe on-bus will still get the Launchpad controller transmitting. So, somehow the TI controller will kick off transmission if the "CAN bus state" changes (irrespective of Automatic Retransmission).

Of course, I don't really know the exact details of what the Kvaser probe actually does when it goes on-bus.

Thanks.

0 QJ Wang over 2 years ago in reply to Peter Lu

TI__Guru**** 197576 points

I think that the Kvaser probe has a built-in CAN transceiver and CAN bus terminator. The CAN network has to be connected from one node to the other with a bus termination (120 ohms) at each of the two end points.

During startup, if only 1 CAN node is online, and if this node transmits a message, it will not get an ACK. It can become "error passive" but not "bus off".
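The "error passive but not bus off" behaviour follows from the CAN fault confinement rules: an error-active transmitter adds 8 to its transmit error counter (TEC) for each ACK error, but once it is error passive, an ACK error with no dominant bit seen during its passive error flag no longer increments the counter. The sketch below models only that counter for a lone, never-acknowledged transmitter; it is a simplified illustration, not a model of the DCAN hardware, and all names are hypothetical.

```c
typedef enum { ERROR_ACTIVE, ERROR_PASSIVE, BUS_OFF } can_fc_state_t;

/* Fault confinement state implied by a transmit error counter value. */
static can_fc_state_t fc_state(unsigned tec)
{
    if (tec > 255U) {
        return BUS_OFF;
    }
    return (tec > 127U) ? ERROR_PASSIVE : ERROR_ACTIVE;
}

/* TEC of a lone transmitter after a number of unacknowledged attempts. */
static unsigned tec_after_unacked_attempts(unsigned attempts)
{
    unsigned tec = 0U;
    for (unsigned i = 0U; i < attempts; i++) {
        if (fc_state(tec) == ERROR_ACTIVE) {
            tec += 8U;  /* error active: each ACK error costs 8 */
        }
        /* error passive: ACK errors on an otherwise silent bus leave TEC unchanged */
    }
    return tec;
}
```

With these rules the counter stops at 128 after 16 attempts, so the node stays error passive and keeps retrying until some other node acknowledges a frame.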

0 Peter Lu over 2 years ago in reply to QJ Wang

Prodigy 220 points

Hi, QJ,

As I mentioned, the CAN communication has this "timed window" behavior, so bus termination (in this case) is not an issue since whenever the communication handshake has been established, data will successfully flow between two TMS570-LC43s forever. We do have some termination issue that is likely related to transceiver hook-up that I'll mention below later, that is unrelated to this particular handshake issue.

Because the Kvaser acted as a "message sink" (asserting dominant ACK-slot), I decided to see if hooking up another CAN port configured only as a data sink would replace the Kvaser effectively. We run CAN1 at 1M bps on the two TMS570 boards, so I hooked up (previously unused) CAN4 ports running at 1M bps on that bus. CAN4 isn't used to transmit or receive messages at all. As somewhat expected, CAN4 was able to perform the data sink functionality as well as the Kvaser, so this became a fall-back (but ugly) solution.

However, in doing the CAN4 experiment, I noticed that the "timed window" had some relationship to the 1M bps speed we use. We have been doing a fix for the 1M bps BTR Halcogen bug using some dynamic patching that happens after the call to canInit(). You had offered some calculator to our hardware guy to fix the BTR value. Our code looked like:

canInit();

.... some time passes ...

#define FIX_CAN_1000

#ifdef FIX_CAN_1000

canBASE_t *cnc_port = canREG1;
canSetSpeed(cnc_port, 1000);
canBASE_t *cnc_sink_port = canREG4;
canSetSpeed(cnc_sink_port, 1000);
timer_udelay(750000);
#endif

So, between the time canInit() and canSetSpeed() [home brewed function using same code as Halcogen generated] is called, and the 3/4 second needed for the fixed speed to take effect, the CAN bus gets screwed up. Serial communication protocols are notoriously bad because the transmitter and receiver ends have no way of "negotiating" or "detecting" the running bit rate. CAN is even worse since the receiver can go around "clobbering" bits on the bus even when the receiver can't be sure it's running the same bit rate as the transmitter. I realized this when I initially did not configure the data sink CAN4 to the right/fixed 1M bps.

Hence, the experiment I needed to do is to post-edit the Halcogen Code Generation of HL_can.c and get rid of the FIX_CAN_1000 section above.

This experiment worked perfectly, although it involved an ugly code-edit that I was able to use the (Unix/Linux) stream editor (sed) to do.

== Original bad Halcogen generated code

canREG1->BTR = (uint32)((uint32)0U << 16U) |
(uint32)((uint32)(1U - 1U) << 12U) |
(uint32)((uint32)((8U + 1U) - 1U) << 8U) |
(uint32)((uint32)(1U - 1U) << 6U) |
(uint32)6U;

== Fixed code

canREG1->BTR = (uint32)((uint32)0U << 16U) |
(uint32)((uint32)(5U - 1U) << 12U) |
(uint32)((uint32)((8U + 1U) - 1U) << 8U) |
(uint32)((uint32)(4U - 1U) << 6U) |
(uint32)4U;
/* fix_hlcg script corrected canREG1->BTR to be set to 0x48c4 */
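To check a BTR value by hand, the register expression can be folded into a small encoder, and the resulting bit rate computed from the CAN clock. The sketch below uses hypothetical helper names; the 75 MHz CAN clock in the example is an assumption and is not stated in this thread.

```c
#include <stdint.h>

/* Pack bit-timing values the same way the generated BTR expression does.
 * tseg2, tseg1 and sjw are given in time quanta; brp_field is the raw
 * prescaler field (divider minus one). */
static uint32_t btr_encode(uint32_t brpe, uint32_t tseg2, uint32_t tseg1,
                           uint32_t sjw, uint32_t brp_field)
{
    return (brpe << 16U) | ((tseg2 - 1U) << 12U) | ((tseg1 - 1U) << 8U) |
           ((sjw - 1U) << 6U) | brp_field;
}

/* Bit rate in bit/s: one sync quantum plus TSeg1 plus TSeg2 per bit. */
static uint32_t can_bit_rate(uint32_t can_clk_hz, uint32_t brp_field,
                             uint32_t tseg1, uint32_t tseg2)
{
    uint32_t tq_per_bit = 1U + tseg1 + tseg2;
    return can_clk_hz / ((brp_field + 1U) * tq_per_bit);
}
```

The value 0x48c4 named in the comment corresponds to TSeg2=5, TSeg1=9, SJW=4 and a prescaler field of 4. With an assumed 75 MHz CAN clock that gives 15 quanta per bit and exactly 1,000,000 bit/s, whereas TSeg2=1, TSeg1=9 and a prescaler field of 6 gives 11 quanta per bit and about 974 kbit/s.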

== Sed script

F=HL_can.c

# Patch once only: TSeg2 1U->5U, SJW 1U->4U, prescaler 6U->4U (gives BTR = 0x48c4)
if [[ "`grep 'fix_hlcg' "$F"`" == "" ]]
then sed -i -e '/canREG1->BTR =/{
n
s/1U/5U/
n
n
s/1U/4U/
n
s/6U/4U/
a\
/* fix_hlcg script corrected canREG1->BTR to be set to 0x48c4 */
}' "$F"
fi

Bottom line is trying to modify CAN speed _after_ the canInit() call with some canSetSpeed() call can/will cause all kinds of nasty CAN protocol errors, including locking up the controller in a transmit request pending lock. I don't know the minute details of the CAN bus interactions since I don't have a microscopic CAN bus analyzer, but it's likely the transmitter notices some catastrophic failure due to bad receiver clobbering of bits and gives up. I'm not sure how re-transmissions (or the lack of them) and the presence of a data sink play into this.
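The ordering lesson here can be sketched as: keep the controller in initialization mode, write the final bit timing, and only then let it join the bus. The code below runs against a mock register struct so it can be compiled anywhere; `mock_can_t` and `can_start_with_btr()` are hypothetical, and the Init (bit 0) and CCE (bit 6) positions in the control register are assumed from the usual D_CAN layout.

```c
#include <stdint.h>

/* Mock of the two registers involved; on the real part they live in canBASE_t. */
typedef struct {
    volatile uint32_t CTL;
    volatile uint32_t BTR;
} mock_can_t;

#define CTL_INIT 0x01U  /* while set, the node does not take part in bus traffic */
#define CTL_CCE  0x40U  /* configuration change enable: unlocks BTR */

/* Program the final bit timing before the node is allowed onto the bus. */
static void can_start_with_btr(mock_can_t *can, uint32_t btr)
{
    can->CTL |= (CTL_INIT | CTL_CCE);             /* stay off the bus, unlock BTR */
    can->BTR = btr;                               /* correct timing goes in first */
    can->CTL &= ~(uint32_t)(CTL_INIT | CTL_CCE);  /* only now join the bus */
}
```

On real hardware the code would also need to wait until the Init bit reads back as cleared before transmitting. The point of the sketch is only that the node never acknowledges or transmits with a wrong bit rate, which is what patching the generated canInit() achieves.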

Anyway, my problem is fixed using the fix_hlcg script.

As for the termination issue, in one of our TMS570 hardware setups (not Launchpad), adding CAN4 data sink to the CAN1 bus worked fine as expected. However, in our TMS570 Launchpad setup, the CAN4 addition to the (working) CAN1 bus did not work at all. Somehow, some difference between our different use of transceivers and/or terminators on those setups must have caused big problems for the Launchpads.

Thank you very much.

0 QJ Wang over 2 years ago in reply to Peter Lu

TI__Guru**** 197576 points

Thanks Peter!

For a given bit rate, there may be several possible settings (seg1, seg2, prop, SJW). It is recommended to choose the configuration that allows the highest oscillator tolerance range. HALCoGen may not choose the best combination. The maximum tolerance df is defined by two conditions.

The TRM suggests that the SJW may not be longer than either phase_seg1 or phase_seg2. But the SJW of your working bit timing parameters is bigger than phase_seg1:

SJW=4, phase_seg2=5, phase_seg1=1, prop=8

0 Peter Lu over 2 years ago in reply to QJ Wang

Prodigy 220 points

Hi, QJ,

Thank you so much for this information. I'm conferring with our hardware guy to see what needs to (or can) be done.

Could you describe what the likely result of SJW > Phase_Seg1 would be? Does it result in more message errors, and if so, is there any characterization (such as more errors as bus test run time increases)?

My data ping-pong test does show some small percentage of dropped messages (that seems random, over a long time). I haven't identified the cause, but the CPU is not doing much other than feeding the transmitter or unloading the receiver (takes about 5 usec interrupt handling out of a 112+ usec message time) and each transmitter is only sending for less than 20% of total bus bandwidth, with no simultaneous transmissions (no collisions).

Thanks for any insights.

0 QJ Wang over 2 years ago in reply to Peter Lu

TI__Guru**** 197576 points

SJW (Synchronization Jump Width) limits how far the bit length is allowed to wiggle. If a very accurate crystal is used, SJW can be very small. If a temperature sensitive resonator is used, SJW can be very big.

The CAN spec says the maximum oscillator tolerance is 0.158% for 125kb/s baudrate. If clock difference between 2 nodes is 0.158%, the SJW > 20*0.158%* bit_time = 4*bit_time.
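The two tolerance conditions are not spelled out in this thread. A commonly cited form is df <= min(Phase_Seg1, Phase_Seg2) / (2 * (13 * bit_time - Phase_Seg2)) and df <= SJW / (20 * bit_time). The sketch below evaluates both with all times expressed in time quanta; the function names are hypothetical and the formulas should be checked against the TRM before being relied on.

```c
/* All arguments are in time quanta (tq); n_tq is the number of tq per bit. */

/* Tolerance limit set by the phase buffer segments. */
static double df_limit_phase(double ps1, double ps2, double n_tq)
{
    double min_ps = (ps1 < ps2) ? ps1 : ps2;
    return min_ps / (2.0 * (13.0 * n_tq - ps2));
}

/* Tolerance limit set by the synchronization jump width. */
static double df_limit_sjw(double sjw, double n_tq)
{
    return sjw / (20.0 * n_tq);
}

/* Smallest SJW (in tq) that still covers a given oscillator tolerance df. */
static double min_sjw_tq(double df, double n_tq)
{
    return 20.0 * df * n_tq;
}
```

As an illustration, with 15 tq per bit, Phase_Seg2=5, SJW=4 and an assumed Phase_Seg1 of 4, the limits come out at roughly 1.05% and 1.33%, so the phase segment condition is the tighter one. Taking the 0.158% figure quoted above at face value with 15 tq per bit, the required SJW is about 0.47 tq, so a single quantum would already be enough.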

SJW can't be larger than Phase_Seg2. If SJW is larger than Phase_Seg2, it could jump beyond the frame border and into the next bit during resynchronizing.

There is no bit field in the config register for Phase_Seg1 or Prop_Seg. The Prop_Seg + Phase_Seg1 is used for CAN bit timing setting. In your setting, it is 8+1. The CAN controller doesn't care if it is 8+1 or 1+8.

The Prop_Seg parameter is derived from the bus length and the delays through the transceivers. The message has to get from one node to the other, and the "ACK" bit has to be returned, within the time assigned for Prop_Seg. For example, at 1 Mbps that's 1us/bit, and if you set Prop_Seg to be 8/18*1us = 0.44us, your bus can be (0.44us * 300 / 2 ) = 66 meters long. If the cable length is only 6 meters, 1 tq is big enough for Prop_Seg, and the rest of (8+1) is then used as Phase_Seg1.

So SJW=4 in your setting should work if the cable is that long.

0 Peter Lu over 2 years ago in reply to QJ Wang

Prodigy 220 points

Hi, QJ,

I'm not quite sure what you are saying. You seem to say SJW=4 is used when the cable is about 66 meters, but we're using a 1 meter or so cable with SJW=4 and things are running well, with the Kvaser able to properly monitor and interact with the 1 Mbps CAN bus. I'm not sure what the Kvaser probe actually offers, but it has a "SJW" setting that only offers 1 or 2 for choices (we're using 1, but don't know what difference this setting makes).

Our hardware guy indicates that he calculated the BTR value simply with the utility program you provided him. We don't have much appreciation for the formulas.

Please elaborate on what your concern is with SJW=4. Our current BTR code is as below.

Thank you very much.

=== Current relevant code

/** - Setup bit timing
* - Setup baud rate prescaler extension
* - Setup TSeg2
* - Setup TSeg1
* - Setup sample jump width
* - Setup baud rate prescaler
*/
canREG1->BTR = (uint32)((uint32)0U << 16U) |
(uint32)((uint32)(5U - 1U) << 12U) |
(uint32)((uint32)((8U + 1U) - 1U) << 8U) |
(uint32)((uint32)(4U - 1U) << 6U) |
(uint32)4U;
/* fix_hlcg script corrected canREG1->BTR to be set to 0x48c4 */

0 QJ Wang over 2 years ago in reply to Peter Lu

TI__Guru**** 197576 points

Hi Peter,

The Phase Buffer Segments (Phase_Seg1 and Phase_Seg2) and the Synchronization Jump Width (SJW) are used to compensate for the oscillator tolerance. The Phase Buffer Segments may be lengthened or shortened by synchronization.

SJW should not be larger than the smaller of the Phase Buffer Segments (Phase_Seg1 and Phase_Seg2). In your setting, SJW=4, and Phase_Seg2=5, and Prop_Seg + Phase_Seg1=9.

(SJW = 4) < (Phase_Seg2=5) ---> meets the requirements

If cable is very long, and Prop_Seg is >6, the Phase_Seg1<3 --> SJW > Phase_Seg1 --> doesn't meet the requirements

If cable is short, and Prop_Seg is < 6, the Phase_Seg1>= 4 --> SJW <= Phase_Seg1 --> meets the requirements

where tPROP_SEG = 2*( tBus + tTransmitter + tReceiver), the signal speed on cable is about 5 ns/m, and assume tTransmitter + tReceiver = 100 ns

If cable length=1 meter, then tPROP_SEG = 2*(5+100)=210 ns, which is about 4tQ, where tQ=1000ns /19 = 53 ns (1000ns is the bit time at 1 Mbps)

tPROP_SEG = 4 < 6 ---> meets the requirement

If cable length=10 meters, then tPROP_SEG = 2*(50+100)=300 ns, which is about 6tQ

tPROP_SEG = 6 --> Phase_Seg1 = 9-6=3 < 4 (SJW) ---> doesn't meet the requirement
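The cable-length reasoning above can be turned into a small calculation. The sketch below uses the formula and numbers from this post (5 ns/m, 100 ns of transceiver delay, tQ of 53 ns, Prop_Seg + Phase_Seg1 = 9); the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round-trip propagation time in ns: 2 * (bus delay + transceiver delays). */
static uint32_t t_prop_ns(uint32_t cable_m, uint32_t ns_per_m, uint32_t transceiver_ns)
{
    return 2U * (cable_m * ns_per_m + transceiver_ns);
}

/* Number of time quanta needed to cover t_prop, rounded up. */
static uint32_t prop_seg_tq(uint32_t t_prop, uint32_t tq_ns)
{
    return (t_prop + tq_ns - 1U) / tq_ns;
}

/* SJW must not exceed the smaller of the two phase buffer segments. */
static bool sjw_ok(uint32_t sjw, uint32_t ps1, uint32_t ps2)
{
    uint32_t min_ps = (ps1 < ps2) ? ps1 : ps2;
    return sjw <= min_ps;
}
```

For a 1 meter cable, 2*(5+100) evaluates to 210 ns, which needs 4 tQ at 53 ns, leaving Phase_Seg1 = 9-4 = 5, so SJW=4 passes. For a 10 meter cable, 300 ns needs 6 tQ, leaving Phase_Seg1 = 3, so SJW=4 fails the check.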

Hope this helps.
