DCAN FIFO reading while receiving

Michael Sch,

Other Parts Discussed in Thread: TMS570LS3137, HALCOGEN

Hey guys,

we're currently experiencing some trouble reading a DCAN Rx FIFO on a TMS570LS3137 during burst transfers on the bus. It seems as if messages are getting out of order or even lost. As a result, the protocol layered on top does not work reliably.

I already had a very careful look at the TRM for the TMS570 but there are still some details I'd need more information about. From my point of view, the TRM is not detailed enough regarding the FIFO operation mode.

The main question is: what does the CAN core do when the software is reading out a FIFO while new frames targeting the same FIFO are arriving from the bus?

Does the CAN core place any new message at the starting position of the FIFO?
This would indicate that we need to read out the FIFO as fast as possible, without allowing anything to interrupt us, to be sure the reading happens faster than the writing.
Does the CAN core place any new message into the next higher free slot after the highest used one, as long as the FIFO has not been totally emptied (no active NewDat flags)?
Would a loop based reading mechanism until no "NewDat" flag is found automatically prevent this?

To make it even more difficult, we'd like to use the DCAN in two different ways (application dependent):

Having Rx interrupts enabled (for every single message object in the FIFO).
Using polling mode only

Is there any different software handling needed to guarantee a perfectly fine in-order reception?

Regardless of whether we use interrupt or polling mode: can there be any situation where we need to start reading from a position other than 1?

The only partly useful information I could find in the TRM is the flowchart on page 1187. But it does not come with any explanatory text (on polling mode, sequences, usage of multiple FIFOs, etc.).

Thank you very much in advance.

Regards,
Michael

over 9 years ago

0 Michael Sch, over 9 years ago in reply to Bob Crosby

Intellectual 800 points

Bob,

have you been able to reproduce the out-of-order reception with the updated code on your side?

Regards,
Michael

0 Bob Crosby over 9 years ago in reply to Mr user4642853

TI__Guru 72500 points

Sorry, got a bit behind last week and have not had a chance to look at this again yet. Are you using the optimizer? I was using "-o2". That may explain the difference in execution time.

0 Bob Crosby over 9 years ago in reply to Bob Crosby

TI__Guru 72500 points

With the change you made and the optimizer turned off, I can reproduce the problem. With the optimizer at -o2, I no longer reproduce the problem. Looks like we have a timing issue. Let me work on it a bit.

0 Michael Sch, over 9 years ago in reply to Bob Crosby

Intellectual 800 points

Bob,

interesting to hear. But this would indeed indicate that there might be something like a race condition. I already mentioned this possibility at e2e.ti.com/.../1718735 after trying out almost everything that came to mind.

Your original code and our slightly modified one differ by only a handful of instructions, resulting in a timing difference well below a microsecond. Our production code has more checks and function calls in between, so it will certainly read out the FIFO more slowly, which might make the glitch even more likely.

Please keep us up to date. We've been fighting this problem for weeks.

Regards,
Michael

0 Bob Crosby over 9 years ago in reply to Michael Sch,

TI__Guru 72500 points

Zhaohong and I discussed the possibility of a conflict with reading the CAN RAM and the loading of the FIFO. Since the CAN transmission is asynchronous to the reading of the FIFO we theorized that if it is a simple race condition, it would eventually occur on the optimized code example as well. My optimized code example has now run 48 hours without an error. It must somehow be also dependent on the sequence in the un-optimized example. I will keep looking into it.

0 Mr user4642853 over 9 years ago in reply to Bob Crosby

Intellectual 425 points

Hi Bob,

Thank you for your confirmation.
Yes, this race condition is what we have been trying to communicate over the past replies.

My observations so far:
a) TI project, extended ID: it is running fine
b) TI project, standard ID: issue reproducible, but only with optimization off
c) TI project, standard ID, new FIFO entry registers (arb, mctl, data0-1) are copied with memcpy to a buffer and processed later -> this works even with opt=off

Although it seems that optimization solves the issue, in our production code it actually does not.
Even with the same functionality (I copied "canGetData" line by line) the result is a misplaced message in the buffer.
The difference between the TI project and our production code is that we use DMA in the background, so the bus may be used by other peripherals as well while the CAN registers are accessed in the atomic iteration.

I even tried to implement observation c) in our case: just iterate over the FIFO and copy the registers atomically, then process them after the atomic block, hoping that this fast access would solve our issues (in production code we use optimization level 2, AFAIK). Unfortunately it only increased the robustness of the message processing, but didn't solve the problem.

From my point of view, the optimization just accidentally arranged the instructions so that the readouts and the arriving CAN messages always happened to be timed correctly. I think it is just a matter of time or setup (temperature, bus length, ECU oscillator and other varying parameters) before the same issue shows up sooner or later. Note that your setup is two identical boards, used at room temperature, probably with a short CAN bus wire.

Tamas

PS: I have no idea why extended vs. standard ID influences the outcome, but as written in a) and b) I had no problems when using the extended format (although that does not necessarily mean it is working fine, I just didn't see any problems; maybe you could have a look at this).

0 Mr user4642853 over 9 years ago in reply to Mr user4642853

Intellectual 425 points

Hi Bob,

It's been over a week now, do you have any update on this issue?

Tamas

0 Michael Sch, over 9 years ago in reply to Bob Crosby

Intellectual 800 points

Bob,

I hope you didn't forget about us.
The topic is getting more and more urgent in our project. We need a workaround for this supposed silicon glitch very soon.

Also a code piece that is guaranteed to work under all circumstances would be great.
Under all circumstances means from my point of view:

independent of the FIFO's size
independent of any compiler optimization level or toolchain version
independent of the length and type of a CAN frame
independent of the internal SoC bus load (given by any other master on the bus, like DMA)

Thank you very much in advance.

Regards,
Michael

0 Bob Crosby over 9 years ago in reply to Michael Sch,

TI__Guru 72500 points

I am not having much luck narrowing this down. Too often when I try to instrument the code, it stops failing. Since this is taking so long, would you consider using a single message box and IF3 registers with DMA?

0 Mr user4642853 over 9 years ago in reply to Bob Crosby

Intellectual 425 points

Hi Bob,

On our side it is quite the opposite: whenever we extend the code of the TI project example, we run into the error.
With our production code the error is unavoidable; it happens even with the most stripped-down example.

We have already considered using the DMA, but there is no guarantee that it won't happen again, since TI has not yet released the root cause of this issue or a statement that the DMA approach is a known workaround. From our side this would require a complete redesign of our base library, so we need either a workaround for polling the FIFO or an errata-like release that endorses the use of DMA.

If you cannot proceed further with the analysis, please escalate the CAN bug to a higher level.

Best Regards,
Tamas

0 Mr user4642853 over 9 years ago in reply to Mr user4642853

Intellectual 425 points

Any update on this would be very much appreciated.

0 QJ Wang over 9 years ago in reply to Mr user4642853

TI__Guru**** 197306 points

DCAN core places the new message to the lowest free slot until the last message object of this FIFO Buffer is reached. All further messages will be written into the last message object of the FIFO buffer (EoB=1) and therefore overwrite previous messages in this message object.
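The placement rule described above can be sketched as a small host-side model. This is only an illustration of the rule as stated in this post, not the real DCAN register layout: `FIFO_DEPTH`, the `model_*` names and the `overwritten` flag are hypothetical and exist only in this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 4  /* hypothetical FIFO size, chosen for illustration */

/* Host-side stand-in for one message object (not the real register layout). */
typedef struct {
    bool     new_dat;      /* models the NewDat flag */
    bool     overwritten;  /* set when unread content was replaced */
    uint32_t payload;      /* stand-in for the received data */
} model_msg_obj_t;

typedef struct {
    model_msg_obj_t obj[FIFO_DEPTH];  /* obj[FIFO_DEPTH - 1] plays the EoB=1 object */
} model_fifo_t;

/* Store a received frame: lowest free slot first, otherwise the last object.
 * Returns the index of the slot that was written. */
static int model_fifo_store(model_fifo_t *f, uint32_t payload)
{
    assert(f != NULL);
    int slot = FIFO_DEPTH - 1;            /* default: last object (EoB=1) */
    for (int i = 0; i < FIFO_DEPTH; i++) {
        if (!f->obj[i].new_dat) {         /* lowest free slot wins */
            slot = i;
            break;
        }
    }
    if (f->obj[slot].new_dat) {
        f->obj[slot].overwritten = true;  /* FIFO full: unread content is lost */
    }
    f->obj[slot].payload = payload;
    f->obj[slot].new_dat = true;
    return slot;
}

/* Model the CPU reading one object, which clears its NewDat flag. */
static uint32_t model_fifo_read(model_fifo_t *f, int slot)
{
    assert(f != NULL && slot >= 0 && slot < FIFO_DEPTH);
    f->obj[slot].new_dat = false;
    f->obj[slot].overwritten = false;
    return f->obj[slot].payload;
}
```

In this model, if frames A, B and C sit in slots 0 to 2 and the CPU reads slot 0, the next frame D lands in slot 0 again, ahead of the older B and C. A reader that restarts from the first object at that moment would therefore see D before B and C, which is the kind of ordering hazard being discussed in this thread.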

I tried interrupt mode and didn’t see the issues found in polling mode. The RX interrupt of the last message object is enabled. Whenever the RX interrupt is triggered, CPU reads all the message objects starting from the first message object.

My measurement shows that reading 1 message object with 8 bytes data takes about 4.5us. If DCAN baudrate is 500kbps, the shortest period between 2 adjacent DCAN messages is about 270us (8-byte data and the control bits), so if the FIFO size is set properly, there will be no problem for FIFO operation in interrupt mode.
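The timing argument above can be checked with a little arithmetic. The sketch below assumes classic CAN data frames and counts nominal bits only (no stuff bits, 3-bit intermission included), so real frames are somewhat longer; the function names are made up for this example, and the 4.5 us read time is the figure measured above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Nominal bits of a classic CAN data frame besides the data field,
 * including the 3-bit intermission and excluding stuff bits. */
#define CAN_OVERHEAD_BITS_STD 47u  /* 11-bit identifier */
#define CAN_OVERHEAD_BITS_EXT 67u  /* 29-bit identifier */

/* Nominal time one data frame occupies the bus, in microseconds. */
static uint32_t can_frame_time_us(uint32_t data_bytes, bool extended, uint32_t baud_bps)
{
    assert(data_bytes <= 8u && baud_bps > 0u);
    uint32_t bits = (extended ? CAN_OVERHEAD_BITS_EXT : CAN_OVERHEAD_BITS_STD)
                    + 8u * data_bytes;
    return (uint32_t)(((uint64_t)bits * 1000000u) / baud_bps);
}

/* True if reading `depth` message objects takes less time than one
 * frame needs to arrive. */
static bool drain_fits_in_frame_gap(uint32_t depth, uint32_t read_time_ns,
                                    uint32_t frame_time_us)
{
    return (uint64_t)depth * read_time_ns < (uint64_t)frame_time_us * 1000u;
}
```

With these assumptions, an 8-byte frame at 500 kbps takes 222 us with a standard ID and 262 us with an extended ID before stuff bits, which is in the same range as the roughly 270 us quoted above. At 4.5 us per object, draining 16 objects takes 72 us and fits within one frame time, whereas 64 objects (288 us) would not.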

Regards,
QJ

0 Mr user4642853 over 9 years ago in reply to QJ Wang

Intellectual 425 points

Thank you for your answer.

Unfortunately, we have a system running with the IRQ method, and there we see that it is not always the first message object of the FIFO that triggers the interrupt, but sometimes the 2nd or the 3rd(!). Of course, this is at least a deterministic way of handling the incoming messages and keeping them ordered.

We can confirm that the IRQ method is also not a solution, but a more or less working workaround.

We are still awaiting a solution or erratum for the polling method. If an erratum is going to be released, please include the root cause and a statement that the proposed workaround (a different read sequence, IRQ, DMA) is proven by design and not just by experiment.

Thank you,
Tamas

0 Mr user4642853 over 9 years ago in reply to Mr user4642853

Intellectual 425 points

Any update?

0 Mr user4642853 over 9 years ago in reply to Mr user4642853

Intellectual 425 points

Again a month has passed since the last update request. Is there any update or information? Is anybody working on a solution, or are the issue and the processor no longer supported?

0 Charles Tsai over 9 years ago in reply to Mr user4642853

TI__Guru**** 191886 points

Hi,
By no means is the issue you found with the DCAN being ignored. We are trying to replicate the issue in simulation, and hopefully we will be able to identify the root cause in the next few days. We will update this post as soon as possible. Again, we apologize for all the inconvenience you have incurred and thank you for your patience.

0 Mr user4642853 over 9 years ago in reply to Charles Tsai

Intellectual 425 points

Any progress here?

0 Charles Tsai over 9 years ago in reply to Mr user4642853

TI__Guru**** 191886 points

Hi,

Our design team has spent a lot of effort and time trying to reproduce the problem in both the simulation and QuickTurn environments. Simulations have been run non-stop for many days to no avail. Sorry to say that we have yet to reproduce the problem in either environment.

We will discuss what next step to take.

QJ has suggested using interrupts instead of polling with FIFO mode. If this is acceptable to you, then I think this is a workaround for now. Alternatively, you could employ an information redundancy technique in your payload, for example by using a portion of the DCAN payload to encode the sequence information of each frame. When the CPU receives the data, it decodes the sequence information to reorder your frames. You will lose some bandwidth, but perhaps this is a tradeoff you could consider.
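The sequence-information idea can be sketched as follows. This is a minimal illustration under assumptions that are not part of the thread: one payload byte carries an 8-bit rolling counter, a batch read from the FIFO holds fewer than 256 frames, and the `seq_*` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SEQ_PAYLOAD_MAX 7u  /* one of the 8 data bytes is spent on the counter */

typedef struct {
    uint8_t len;      /* data length including the sequence byte */
    uint8_t data[8];  /* data[0] = sequence counter, data[1..] = payload */
} seq_frame_t;

/* Sender side: put an 8-bit rolling counter in front of the payload. */
static void seq_pack(seq_frame_t *out, uint8_t seq, const uint8_t *payload, uint8_t n)
{
    assert(out != NULL && payload != NULL && n <= SEQ_PAYLOAD_MAX);
    out->data[0] = seq;
    memcpy(&out->data[1], payload, n);
    out->len = (uint8_t)(n + 1u);
}

/* Receiver side: sort a batch by its distance (modulo 256) from the next
 * expected counter value, so that wrap-around from 255 to 0 is handled. */
static void seq_reorder(seq_frame_t *batch, size_t count, uint8_t expected)
{
    assert(batch != NULL || count == 0u);
    for (size_t i = 1; i < count; i++) {          /* insertion sort */
        seq_frame_t tmp = batch[i];
        uint8_t key = (uint8_t)(tmp.data[0] - expected);
        size_t j = i;
        while (j > 0 && (uint8_t)(batch[j - 1].data[0] - expected) > key) {
            batch[j] = batch[j - 1];
            j--;
        }
        batch[j] = tmp;
    }
}
```

For example, a batch read out with counters 0, 254, 1, 255 and an expected next counter of 254 ends up ordered as 254, 255, 0, 1. The cost is one payload byte per frame, which is the bandwidth tradeoff mentioned above; detecting lost frames would additionally require checking for gaps in the counter.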

0 Mr user4642853 over 9 years ago in reply to Charles Tsai

Intellectual 425 points

Hi,

As mentioned before, neither the polling nor the IRQ approach works.

I tried using the DMA, which seemed to work fine, but we are developing an API for our customers and we have certified code that would need a complete redesign to use the DMA instead. This means we need an official statement (perhaps with a supporting document) that the DMA works as expected, that this is proven by tests, and that those tests reveal the problem with the polling and interrupt modes. Otherwise I cannot just tell our development team to rewrite our codebase based on a suggestion read on a forum.

If I remember right you were able to reproduce the issue using the TI Hercules board, isn't it possible to attach to this (or an opened) processor and check if it is working as in your simulation?

0 Charles Tsai over 9 years ago in reply to Mr user4642853

TI__Guru**** 191886 points

Yes, we were able to reproduce on our boards. We actually use the same testcase to run on the simulation and it took days non-stop on simulation and we were unable to reproduce the problem. Bear in mind that even on the board it took several real time seconds after many CAN frames have been received to produce the problem. To simulate real time in seconds in simulation takes a lot of time.

How did you use the DMA to read the DCAN FIFO? Do you use the RTI to generate a periodic DMA request for the DMA module to read the DCAN FIFO?

0 Mr user4642853 over 9 years ago in reply to Charles Tsai

Intellectual 425 points

DMA method setup:

- using the DCAN module's IF3 as the source (this points to a FIFO element and jumps automatically to the next unread one)
- 2 buffers[0..1] are allocated; the DMA points to one of them while we read the other
- the DMA is checked periodically to see whether it has transferred a FIFO element
- if the DMA has transferred something, it is pointed to the other buffer once it has finished the current one and is restarted, while we process the buffer it just filled

I used this implementation along with a software delay between checks of the DMA status. It seems to work correctly (I didn't get any 'lost' or misplaced messages).
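The buffer handling in this setup can be sketched as a host-side model. This is not the actual implementation: the DMA and IF3 hardware are replaced by a simulated transfer function, `ELEM_WORDS` is an assumed size for one FIFO element image, and all `pingpong_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ELEM_WORDS 4u  /* assumed element image size, e.g. arb, mctl, two data words */

typedef struct {
    uint32_t buf[2][ELEM_WORDS];  /* the two buffers[0..1] */
    unsigned dma_target;          /* buffer the (simulated) DMA writes to */
    bool     dma_done;            /* set once one element has been transferred */
} pingpong_t;

static void pingpong_init(pingpong_t *pp)
{
    assert(pp != NULL);
    memset(pp->buf, 0, sizeof pp->buf);
    pp->dma_target = 0u;
    pp->dma_done = false;
}

/* Stand-in for the hardware: on a real system the DMA would copy the IF3
 * registers into the target buffer and signal completion. */
static void pingpong_sim_dma_transfer(pingpong_t *pp, const uint32_t elem[ELEM_WORDS])
{
    assert(pp != NULL && !pp->dma_done);  /* model: DMA idles until restarted */
    memcpy(pp->buf[pp->dma_target], elem, sizeof pp->buf[0]);
    pp->dma_done = true;
}

/* Periodic check: if an element was transferred, retarget the DMA to the
 * other buffer, "restart" it, and hand the filled buffer to the caller.
 * Returns NULL when nothing new has arrived. */
static const uint32_t *pingpong_poll(pingpong_t *pp)
{
    assert(pp != NULL);
    if (!pp->dma_done) {
        return NULL;
    }
    unsigned filled = pp->dma_target;
    pp->dma_target ^= 1u;    /* switch to the other buffer */
    pp->dma_done = false;    /* models restarting the DMA */
    return pp->buf[filled];
}
```

The point of the two buffers is that the element being processed is never the one the DMA is writing next. On real hardware the completion flag would be set from a DMA status register or interrupt, and the data would need to be treated as volatile or otherwise synchronized, which this model leaves out.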

0 Mr user4642853 over 8 years ago in reply to Mr user4642853

Intellectual 425 points

Any official update regarding the issue? Errata?

0 Brian Silverman20 over 8 years ago in reply to Mr user4642853

Prodigy 160 points

I'm also concerned about receiving CAN messages using a "FIFO". Is there any update to this?

0 Chuck Davenport over 8 years ago in reply to Brian Silverman20

TI__Guru 59540 points

Hello Brian,

No, there are no further updates. We were never able to reproduce the issue in design simulations, which can give a detailed look at the internal interactions, keeping in mind that these models execute in a very controlled environment.

We did, however, recreate a similar issue during bench testing, as noted in this thread, but this can't be used to determine whether there is a silicon bug, since this method is also subject to potential software constraints that could impact the testing. If you have experienced similar problems, can you create a new post describing the issues as you have observed them, with any sample code you might have for this issue?

0 Mr user4642853 over 8 years ago in reply to Chuck Davenport

Intellectual 425 points

Dear TI Support,

We originally detected the problem and sent you an exact description of it (see the posts above with screenshots and data extracts from the Lauterbach debugger; you have plenty of data available).
We also recreated the issue using the DCAN example from HALCoGen-generated code, a Hercules prototype board and TI CCS v6.0, and we sent you the projects at least via email (not sure if this shows up here).
You have the example code and proof that it is not just our code, as we communicated several times: when we first detected the problem, we were using our own HAL layer (similar to HALCoGen).

With the support of your experts we also established that neither the configuration nor the timing of the code causes the problem.
Similar reports can be found on this forum as well. I found at least some concerning other processors that use the same DCAN module, where the exact same problem was reported; I believe it was also acknowledged a long time ago.

Question is: why do we see a similar statement from TI that you are still unsure if this is a code timing issue or silicon bug? Almost 2 years passed and the only answer we have is: we admit that the problem exists we just don't know what it is because the problem _can_not_be_recreated_with____simulations____.

0 Chuck Davenport over 8 years ago in reply to Mr user4642853

TI__Guru 59540 points

Hello user4642853,

Apologies for any confusion. I have only just joined the discussion, so I am not 100% aware of the effort or verification of the issue that had been done at the time of the last inquiry I responded to. Also, without further detail I cannot tell whether each poster is identifying the same issue or has the same code set. In short, my comments reflected my own awareness and were a response to the poster who made the last post. Regarding the overall issue of the FIFO and the ordering of received messages, I am aware of some other devices that have indicated that the ordering of the information within the FIFO cannot be guaranteed. In the case of the Hercules implementation, we will be including an erratum in our next release of the errata documentation. For the time being, you can refer to the early copy below of the specific erratum applying to this issue:

Hopefully, this will satisfy your needs for documentation of the issue and the proposed work around.

0 Mr user4642853 over 8 years ago in reply to Chuck Davenport

Intellectual 425 points

Hi,

Sorry for the late reply. I just noticed a key phrase in the "Expected Behavior" description.
Your erratum says that the DCAN might misplace messages with the "same arbitration and mask IDs" (at least, the expected behavior is that a message is not misplaced when the same arbitration and mask are received), but this is only a subset of the actual truth.
We have proven and indicated that the arbitration or mask has nothing to do with the misbehavior of the DCAN.
When used as a FIFO, the DCAN misplaces messages regardless of mask, ID or data size!

Please confirm that you have analysed the general case (the issue is valid for any configuration) and not just the case of "messages with same arbitration and mask ID".

Thank you.

0 NeilBerry_at_Parker over 8 years ago in reply to Mr user4642853

Expert 1995 points

Just a question from a random observer: did the failures happen in a way that was not in accordance with CAN 2.0b? Under CAN 2.0b, priority bits should actually allow some frames to appear before other frames.

0 Chuck Davenport over 8 years ago in reply to Mr user4642853

TI__Guru 59540 points

Hello User,

The quoted text is from the "Expected Behavior" section. I think it is quite clear in the "Issue" section that we are not stating anything about the impact of the ID or arbitration on the ordering. It simply and clearly states that the messages can be out of order and that, if order is important to the application, you should use the DMA.
