TMS570LS10106: Device Errata Question: DCAN#22

Amanda Ross95

Part Number: TMS570LS10106
Other Parts Discussed in Thread: HALCOGEN

Hello there,

I have a customer who is currently witnessing behavior similar to that which is listed in the DCAN#22 Errata for the device, but their situation is a bit different and I wanted to clarify a couple of things to make sure this could be attributed to that. Here's what they're seeing, as well as their setup:

1. Another module will send a can frame that matches one of the filters we have configured

2. Then the controller will notify that it has received a message

3. The message within the controller has the right ID and frame length but an incorrect payload

4. There might be a pattern to the data that is corrupted, but we haven't identified it yet

So this matches the errata description to a T. However, they do not place the module into INIT other than the CAN_INIT call which is generated by Halcogen.

From the note, it seems like this problem only manifests itself if you were running and then placed the module into INIT mode so (run -> INIT). Yet the customer's controller does the following: boot (assuming we are in init mode for the CAN module) -> CAN_INIT (so place into INIT) -> running and from here on out we never request to go to INIT. In other words the CAN module from boot goes from INIT->INIT -> run. So not the exact same situation.

Perhaps this is related in the sense that the controller doesn't boot into INIT on the CAN module. Is this possible?

I appreciate any feedback you can offer!

-Amanda

over 8 years ago

0 Chuck Davenport over 8 years ago

TI__Guru 59540 points

Hello Amanda,

We have never seen this behavior on a clean boot up. You mention the sequence is boot --> CAN_INIT call. We have seen instances where customers will toggle the INIT bit to try and release from the bus off state which then results in the issue. We would need to confirm that the INIT bit is not set anywhere in the code without the work around in place even in the case of a warm reset condition.

Often this takes some creative use of conditional breakpoints to indicate when the INIT bit is written/touched.
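Where a debugger cannot be attached, one alternative to breakpoints is to route every software write to the control register through a single helper that counts INIT transitions. This is only a sketch of that idea, not something taken from the advisory: the bit position, the counter struct, and the name `ctl_write_tracked` are assumptions for illustration, and on the target the register would be `node->CTL`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed position of the Init bit in the DCAN control register. */
#define CTL_INIT (1u << 0)

/* Hypothetical counters; on target they could be reported over CAN or SCI. */
typedef struct {
    uint32_t init_set_count;   /* software writes that changed Init 0 -> 1 */
    uint32_t init_clear_count; /* software writes that changed Init 1 -> 0 */
} init_trace_t;

/* Write new_value to *ctl and record whether the Init bit changed. */
void ctl_write_tracked(volatile uint32_t *ctl, uint32_t new_value,
                       init_trace_t *trace)
{
    assert(ctl != NULL && trace != NULL);
    uint32_t old_value = *ctl;

    if (!(old_value & CTL_INIT) && (new_value & CTL_INIT)) {
        trace->init_set_count++;
    }
    if ((old_value & CTL_INIT) && !(new_value & CTL_INIT)) {
        trace->init_clear_count++;
    }
    *ctl = new_value;
}
```

This only sees writes made by software. It does not see the hardware setting the bit by itself, for example on bus off.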

0 Chuck Davenport over 8 years ago in reply to Chuck Davenport

TI__Guru 59540 points

Amanda,

Another thing to check is the startup sequence. If they are somehow initializing CAN very early and then executing the LBIST function, there is a CPU reset that could cause the code to re-execute and, in turn, cause another write to the init bit without a reset of the peripheral with ongoing CAN traffic just as it is described in the advisory.

0 Nick Schaeferle over 8 years ago in reply to Chuck Davenport

Prodigy 120 points

Hello Chuck

I am the customer having this issue. Unfortunately, we cannot debug this on the bench or in a way in which we can easily attach a debug setup in order to break on this issue.

We have only seen this when installed in the full system and it happens rarely. Until recently it was a 1 in maybe a 1000 event, but we now have found a vehicle that exhibits this problem about 1 in 20 to 1 in 100 attempts.

We are not using the self test controller, if that is what you are referring to with the LBIST function, unless it came along with the standard Halcogen output?

There is one other location that manipulates the init bit other than the supplied halcogen can_init() function. We have found in practice that if there is an event that causes the controller to go into an error mode, it will stay off. We can't allow this to happen, so for that reason we periodically check if the node is in init mode and clear the bit if it is, to restore it to "normal" mode. This function only clears the init bit and does not set it. Also, the function performs no other operations. Perhaps this is part of our issue? Is there anything else we need to do if we get into an error state and attempt to get it back into a normal mode?

0 Chuck Davenport over 8 years ago in reply to Nick Schaeferle

TI__Guru 59540 points

Hello Nick,

This is very similar to an issue that I have seen with several other customers. Essentially, they were clearing the init bit out of an abundance of caution to ensure the continued operation of the CAN communication. The issue is that it is the clearing of the init bit that causes the misbehavior. I would strongly recommend that you implement the work around suggested in the advisory whenever you clear the init bit. In the past, adding this step to any code that clears the init bit also fixed the issues that were being seen.

0 Nick Schaeferle over 8 years ago in reply to Chuck Davenport

Prodigy 120 points

Hello Chuck

Thanks for the response, so does this mean the errata applies to whether you set the bit or clear it? Is our issue with clearing the bit different than the errata?

Also, we suspected the clearing of the init bit might be the issue, so we added a can frame that is sent out whenever this function does clear the init bit. We got another reproduction of this issue and the can frame was not sent out. I'm not sure what this means because two things are possible: either our issue is not caused by this clearing of the init bit, or you cannot transmit a can frame unless you wait a certain number of cycles after clearing the init bit. Since we attempt to transmit just after clearing the bit, maybe the peripheral ignores the request?

0 Chuck Davenport over 8 years ago in reply to Nick Schaeferle

TI__Guru 59540 points

Hello Nick,

In regard to the Advisory wording, I think there is some assumed process associated with the wording that makes it not so clear. It references setting the INIT bit as the root cause, but realistically the assumption is that whenever the INIT bit is set it is also cleared, since the CAN module can't operate if the INIT bit is not cleared. So, in reality it is the process of clearing the bit that causes the issue, since clearing the bit causes everything (configuration) to be reloaded and to start anew.

As for the transmission immediately after clearing the bit, there is some latency for the write to occur to the register and, again, some latency for the configuration to get propagated and/or some latency while the module synchronizes with the CAN bus, so I would expect there to be some delay before a transmission can occur. Normally, you would have a while loop checking for the CAN_INIT bit to go to 0 before proceeding, but in your case the CAN_INIT bit most likely isn't 1 to begin with, so checking this may not show the transition/completion of all the internal workings. If you stick a delay in the code (temporarily) to allow time for all the sources of latency, does the transmission happen?
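The polling loop mentioned above can be sketched with a bound so that it cannot hang if the bit never clears. This is a minimal illustration under assumptions: the bit position and the name `can_leave_init_bounded` are hypothetical, and `max_polls` would need to be tuned on the target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed position of the Init bit in the DCAN control register. */
#define CTL_INIT (1u << 0)

/*
 * Clear Init, then poll until it reads back 0.
 * Returns true once the bit reads 0, false if it is still set after
 * the initial check plus max_polls further checks.
 */
bool can_leave_init_bounded(volatile uint32_t *ctl, uint32_t max_polls)
{
    assert(ctl != NULL);
    *ctl &= ~CTL_INIT;

    do {
        if ((*ctl & CTL_INIT) == 0u) {
            return true;
        }
    } while (max_polls-- > 0u);

    return false;
}
```

As noted in the reply, Init reading 0 does not by itself show that the module has finished synchronizing with the bus, so a transmit attempted right after a `true` return may still be delayed.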

Also, just as a reminder, if there is an error condition on the bus that results in a bus OFF condition for the DCAN module (this can be a result of any node on the CAN bus), the device logic will set the INIT bit and all transmission will stop. This is a result of the HW design and there is nothing that can prevent it from happening. So, the clearing of the INIT bit as you are doing is a good practice, but you would need to follow the recommended work around in order to do this smoothly.

0 Nick Schaeferle over 8 years ago in reply to Chuck Davenport

Prodigy 120 points

Hello Chuck

We further refined our diagnostic on the idea that the clearing of the init bit when the code detects the module is in error state is the cause of the problem. Our function to do this "return from error state" is called from the main while loop and the code looks like this:

if (hal_chassis_can_reinit()) can_init_counter++;

this is the function body

bool tms570_can_restart_if_in_init_mode(canBASE_t* node)
{
    if (node->CTL & CAN_CTL_INIT)
    {
        node->CTL &= ~CAN_CTL_INIT;
        return true;
    }
    else return false;
}

The can_init_counter is sent over can at a regular period; both before and after the event was reproduced the counter stayed at 0. We also scoped the power line into the micro and it did not dip or brown out. This leads me to think there is an issue in the code or controller that is causing a partial reset and then a double call of the halcogen can_init function? The odd thing here is that we issue a can frame on boot of this controller so if the previous theory is true then this means that the controller is resetting, but only in a partial way since it does not fire the boot can frame again? Does the double call of the can_init still seem like the issue here?

My apprehension about this fix is that I would like to identify the issue before applying it, as the problem is hard to reproduce. If we applied this 'fix' without identifying the problem, we would not be able to tell whether the problem was resolved or was just made a lot harder to reproduce.

0 Chuck Davenport over 8 years ago in reply to Nick Schaeferle

TI__Guru 59540 points

Hello Nick,

I understand the desire to drive to a root cause of the issue as the fix may be masking another unexpected behavior with bigger implications such as a brown out condition. For sure, there are mechanisms in the microcontroller for some level of brownout protection. Specifically, the VMON will issue a reset to the device if voltages drop sufficiently to trigger it. The problem is that the VMON is a very gross protection mechanism and can't be relied upon by itself. Do you use any other type of external voltage monitoring or voltage supervisor? Even with this, I don't think a dip in voltage is likely to cause an issue of this type because of the digital nature of the device. It could, however, be causing an issue on the CAN bus leading to the CAN errors you are seeing.

The CAN bus is generally a 5V bus feeding into the transceivers then going out as 3V signals to the CAN modules in the network. Do you know what types of errors your device is seeing causing it to stop transmitting? Usually, if there is an accumulation of errors, the CAN bus will transition to a bus OFF state and this is when the init bit will be set by the hardware as part of the recovery process.
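The accumulation of errors described above follows the fault confinement rules of the CAN protocol: a node becomes error passive when either error counter exceeds 127 and goes bus off when the transmit error counter exceeds 255. A small sketch of that classification follows; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    CAN_STATE_ERROR_ACTIVE,
    CAN_STATE_ERROR_PASSIVE,
    CAN_STATE_BUS_OFF
} can_fault_state_t;

/* Classify the node state from the transmit (tec) and receive (rec)
 * error counters, per the CAN fault confinement thresholds. */
can_fault_state_t can_classify_fault_state(uint16_t tec, uint16_t rec)
{
    if (tec > 255u) {
        return CAN_STATE_BUS_OFF;
    }
    if (tec > 127u || rec > 127u) {
        return CAN_STATE_ERROR_PASSIVE;
    }
    return CAN_STATE_ERROR_ACTIVE;
}
```

Only the transmit counter can drive a node to bus off, which is why a node that merely listens to a noisy bus becomes error passive at worst.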

For sure, if the device were truly going through a reset of any kind, it would go through the initialization code all over again unless you had some protection to prevent it. You would also expect this to be seen through inspection of the SYSESR register within the code. There are also situations that should be accounted for where a soft reset may be triggered if an error isn't serviced. Please review the TRM for sources of reset including exception handling but, again, if this were happening, you would see your startup/boot message being transmitted as the boot code is executed. Also, if a reset is triggered, the CAN module is reset just as the SWR does for the DCAN#22 work around.
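The SYSESR inspection mentioned above could look something like the sketch below, which turns a raw register value into a reset source that can be logged or sent in the boot frame. The bit masks are assumptions about the SYSESR layout and must be checked against the TRM for this device; the enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed SYSESR flag positions; confirm against the device TRM. */
#define SYSESR_PORST  (1u << 15) /* power-on reset */
#define SYSESR_OSCRST (1u << 14) /* oscillator failure reset */
#define SYSESR_WDRST  (1u << 13) /* watchdog reset */
#define SYSESR_CPURST (1u << 5)  /* CPU reset */
#define SYSESR_SWRST  (1u << 4)  /* software reset */
#define SYSESR_EXTRST (1u << 3)  /* external reset pin */

typedef enum {
    RESET_UNKNOWN,
    RESET_POWER_ON,
    RESET_OSC_FAIL,
    RESET_WATCHDOG,
    RESET_CPU,
    RESET_SOFTWARE,
    RESET_EXTERNAL
} reset_source_t;

/* Return the first matching reset source, checking power-on first. */
reset_source_t classify_reset(uint32_t sysesr)
{
    if (sysesr & SYSESR_PORST)  return RESET_POWER_ON;
    if (sysesr & SYSESR_OSCRST) return RESET_OSC_FAIL;
    if (sysesr & SYSESR_WDRST)  return RESET_WATCHDOG;
    if (sysesr & SYSESR_CPURST) return RESET_CPU;
    if (sysesr & SYSESR_SWRST)  return RESET_SOFTWARE;
    if (sysesr & SYSESR_EXTRST) return RESET_EXTERNAL;
    return RESET_UNKNOWN;
}
```

Reporting this value in the boot CAN frame would distinguish a power cycle from a CPU or software reset without needing a debugger.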

Is there any way to capture the error data from the CAN module when you see the CAN module stop communicating? i.e., I know you stated you can't look at it via a debugger, but can you get access via another serial line such as SCI or SPI? If nothing else, can you store the data off in the Flash emulated EEPROM (FEE) area? The end goal is to understand what is happening to the CAN module causing the need to write to the init bit to get it going again.
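One way to capture that error data without a debugger is to snapshot the DCAN error and status register into a small RAM ring buffer whenever the module is found in init mode, and dump the buffer later over whatever channel is available. The sketch below assumes the bit layout of the ES register (LEC in bits 2:0, bus off in bit 7) and uses hypothetical names throughout; on the target the values would come from `node->ES` and the error counter register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed fields of the DCAN error and status register. */
#define ES_LEC_MASK 0x7u       /* last error code, bits 2:0 */
#define ES_BOFF     (1u << 7)  /* bus-off flag */

#define ERR_LOG_DEPTH 8u

typedef struct {
    uint32_t es;    /* raw error and status register value */
    uint32_t errc;  /* raw error counter register value */
    uint32_t tick;  /* caller-supplied timestamp */
} can_err_record_t;

typedef struct {
    can_err_record_t rec[ERR_LOG_DEPTH];
    uint32_t next;   /* index of the slot written next */
    uint32_t count;  /* number of valid records, saturates at depth */
} can_err_log_t;

/* Store one snapshot, overwriting the oldest entry when full. */
void can_err_log_push(can_err_log_t *log, uint32_t es, uint32_t errc,
                      uint32_t tick)
{
    assert(log != NULL);
    log->rec[log->next].es = es;
    log->rec[log->next].errc = errc;
    log->rec[log->next].tick = tick;
    log->next = (log->next + 1u) % ERR_LOG_DEPTH;
    if (log->count < ERR_LOG_DEPTH) {
        log->count++;
    }
}

bool es_is_bus_off(uint32_t es)
{
    return (es & ES_BOFF) != 0u;
}

uint32_t es_last_error_code(uint32_t es)
{
    return es & ES_LEC_MASK;
}
```

Keeping the raw register values rather than decoded fields means the log can be reinterpreted later if one of the assumed bit positions turns out to be wrong.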

0 Nick Schaeferle over 8 years ago in reply to Chuck Davenport

Prodigy 120 points

Hey Chuck,

I have attempted to get some kind of consistent reproduction of this issue in the controller. I have tried various techniques to get this to happen, including setting the init bit and delaying configuration of the module or calling the larger init function multiple times, with no luck in making this more reproducible. I think at this point we are going to try the fix and then increase the cycle attempts via other means to see if this change has a measurable impact on this problem. I do have another question: why isn't this work around from DCAN #22 included in the Halcogen code? Here is my sample code from the 1st suggestion in the errata:

static void _tms570_can_reset(canBASE_t* node)
{
    node->CTL &= ~(CAN_CTL_IE1 | CAN_CTL_IE0); // disable interrupts
    node->CTL |= CAN_CTL_INIT; // place module in init mode
    // wait 6 + 1 can clock cycles to avoid phantom interrupts, can clock = system clock
    asm(" nop"); asm(" nop"); asm(" nop"); asm(" nop"); asm(" nop"); asm(" nop"); asm(" nop");
    node->CTL |= CAN_CTL_SWR; // reset module
    while (node->CTL & CAN_CTL_SWR); // wait for module to come out of reset
}

0 Chuck Davenport over 8 years ago in reply to Nick Schaeferle

TI__Guru 59540 points

Hi Nick,

Appreciate the implementation of the work around. I think you should see the issue go away but it will definitely be difficult to measure given the low occurrence rate under normal conditions.

In regard to why the work around isn't included in the HalCoGen code, I am not entirely certain. Given the need is based on the setting of the init bit, it could be the software team simply didn't foresee an error condition arising that would lead to the init function being called again. In the end, it is a corner case that seemingly has not been covered.

For sure, at this point, I doubt that anything will get changed given the NRND status of the device and the limited software resources available for updates of this nature. For certain, on our newer devices that are not NRND, the DCAN#22 issue has been repaired and is no longer a concern.

0 Amanda Ross95 over 8 years ago in reply to Chuck Davenport

TI__Intellectual 2740 points

Hello Chuck,

Thanks so much for all your help thus far! Nick recently went about implementing the DCAN #22 fix as previously discussed and they're fairly certain it resolved the receive function's memory corruption issue during boot. See fix below...

However, they recently received back a unit which falls into an error mode after receiving too many CAN bus errors. Nick thinks this forces the TMS570 CAN driver back to INIT mode as it doesn't have a special error mode for this.

Nick says that they've been running with their current mechanism for error detection and reset for a while now, and it used to just clear the INIT bit in the node->CTL register (this worked fine before). Yet after they implemented the DCAN #22 fix, they now call the INIT function, which reinitializes the driver. Problem is when the driver attempts this while there is still traffic on the bus, the driver hangs and never comes around.

The customer provided a log as well (7484.test_out.txt). During all this, their system is listening for three IDs (0x180, 0x186 and 0x500) while it also relays information back from these IDs using 0x691. Whenever it tried to reinitialize the driver from errors, it will send 0x692 while flagging 0xAB. If you take a look at the end of the log, it seems to reinitialize the driver but only the transmit function is returned.

Please see the function they're using to try to get the CAN driver back into running mode...

also, when it calls hal_chassis_can_reinit(), it is mapping to this function below...

The customer has the following questions based on all this:

"1. Is there some issue with commanding this reset and init mechanism while the bus contains traffic from other modules?

2. What is the recommended way to get the driver out of error (init) and back to running with traffic on the can bus after booting? Is this different than the initialize method during boot?

3. Is there something about initializing the can driver with traffic that causes it to wait for a "quiet" window on the bus before going into "run"? If so, what can we do to get the module back from an error mode to running while the bus contains can traffic?"

We'd appreciate any input you may have. Thank you for taking the time to look at this with us.

Best Regards,

-Amanda

0 Chuck Davenport over 8 years ago in reply to Amanda Ross95

TI__Guru 59540 points

Hi Amanda,

A colleague and I reviewed the detailed description and code that you have posted and we don't see any specific issue in how the work around for DCAN#22 has been implemented. Certainly, it would be expected that during a CAN initialization there might be traffic on the CAN BUS. This is a normal condition and shouldn't prevent the init routine from completing.

Also, can you clarify the meaning of these two statements?

Amanda Ross95 said:
Problem is when the driver attempts this while there is still traffic on the bus, the driver hangs and never comes around.

and

Amanda Ross95 said:
Whenever it tried to reinitialize the driver from errors, it will send 0x692 while flagging 0xAB. If you take a look at the end of the log, it seems to reinitialize the driver but only the transmit function is returned.

These statements seem to contradict. i.e., It never completes CAN_Init, but it resumes transmit functions? Generally, if it isn't receiving anything, can they check to see if there are packets on the bus that it is ignoring/not receiving? Is it possible that the transmitter of the CAN message that needs to be received has gone to bus-off? If no, and there is a message on the bus that is simply not being received, has the Rx mask been setup correctly? If it hangs in the driver, where does it hang?

0 Nick Schaeferle over 8 years ago in reply to Chuck Davenport

Prodigy 120 points

Hello Chuck,

Sorry for the conflicting statements. Here is what I meant to say: the can initialization does not seem to hang and it does complete; however, after completion only the can transmit is functional. The receive function is no longer operational.

The bizarre thing here is that this exact same function is called on boot and whenever the main code loop detects the module is in init mode (which we interpret as the can driver going into an error mode). Perhaps the mailboxes that handle receiving the can messages need an extra step or process to reset them as well? This seems almost like the last issue we had, except instead of the receive mailboxes having bad data on boot, they have no data if you ever try to reset the can driver after boot.

The rx mask and filter settings are exactly the same as when we call this function at boot, so I don't think the issue is there, unless there is something different about setting up the mailboxes when there is traffic on the bus or when the mailbox memory already contains some messages?

The transmitter of the can message has not gone bus off. You can see in the log that it's still sending the messages. In this case there are two different nodes: one sending the 0x500 message and another sending the 0x180 and 0x186. The log shows that after the inverter tries to pull the can driver out of init using the init function, it no longer relays those messages because it doesn't receive any of them.

0 Chuck Davenport over 8 years ago in reply to Nick Schaeferle

TI__Guru 59540 points

Hi Nick,

First, in regard to there being traffic on the bus, this is irrelevant to the issue, since we expect that there could be traffic on the bus even at initial boot time, so I don't think this has any influence on the issue at hand. It is stated that there is some sort of catastrophic error event causing the system to go offline or initiate its error handling logic, including the DCAN#22 work around.

Have you been able to capture any log files that demonstrate the event? i.e., the series of errors that lead up to the final death of the CAN drivers? When the error recovery occurs, have you confirmed the ID Masking settings to make sure the received messages aren't being turned away due to a change in the configured mask?

0 Nick Schaeferle over 8 years ago in reply to Chuck Davenport

Prodigy 120 points

Hello Chuck

We figured out the issue with our CAN receive not coming back after a re-init of the driver. It turns out that the can mailbox setup was done after init when it should have been done within the init function. Calling the setup after init was kind of a byproduct of the way halcogen presented the functions in the can module to the user. What I mean is that the can.c module gives you an init method and some user code blocks, but those blocks are before the init function puts the node into init and after it takes the node out of init. So if you were to set up can mailboxes and use the halcogen init method, you would have to set up the mailboxes either before or after the init, and since the init now contains a reset of the can node, we placed it after the init function.

I believe this issue is alluded to in the TRM. The TRM says this in 16.9 "The whole Message RAM should be configured before the end of the initialization, however it is also possible to change the configuration of message objects during CAN communication." We found that the latter statement in that quote is not true in our case, for some reason.

Placing the can mailbox setup after the module is put into init and before it's taken out of init has resolved our issues.
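The ordering described above can be sketched as follows. This is a host-side illustration, not the HALCoGen code: `can_regs_t` is a stand-in for `canBASE_t`, the bit position is an assumption, and `demo_mailbox_setup` is a hypothetical placeholder for the real message object configuration. The DCAN#22 software reset is only marked by a comment because a plain struct cannot emulate the self-clearing SWR bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed position of the Init bit in the DCAN control register. */
#define CTL_INIT (1u << 0)

/* Stand-in for canBASE_t with only the register used here. */
typedef struct {
    volatile uint32_t CTL;
} can_regs_t;

typedef void (*mailbox_setup_fn)(can_regs_t *node, void *ctx);

/* Re-initialize the node with mailbox setup done while Init is set. */
void can_reinit_with_mailboxes(can_regs_t *node, mailbox_setup_fn setup,
                               void *ctx)
{
    assert(node != NULL && setup != NULL);

    node->CTL |= CTL_INIT;   /* 1. enter init mode */
    /* 2. on real hardware, the DCAN#22 software reset would go here */
    setup(node, ctx);        /* 3. configure mailboxes, Init still set */
    node->CTL &= ~CTL_INIT;  /* 4. leave init mode */
}

/* Hypothetical placeholder: records whether Init was set when it ran. */
void demo_mailbox_setup(can_regs_t *node, void *ctx)
{
    bool *saw_init = (bool *)ctx;
    *saw_init = (node->CTL & CTL_INIT) != 0u;
}
```

Passing the mailbox setup in as a callback keeps it inside the init window even when the init routine itself is generated code that is awkward to edit.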

Thanks for the help

0 Chuck Davenport over 8 years ago in reply to Nick Schaeferle

TI__Guru 59540 points

Hello Nick,

Glad to hear the issue is resolved. As always, let me know if any additional issues come up and if there is any way I can help.
