DCAN FIFO reading while receiving

Other Parts Discussed in Thread: TMS570LS3137, HALCOGEN

Hey guys,

We're currently having trouble reading a DCAN Rx FIFO on a TMS570LS3137 during burst transfers on the bus. It seems that messages are getting out of order or even lost. As a result, the protocol layered on top does not work reliably.

I already had a very careful look at the TRM for the TMS570 but there are still some details I'd need more information about. From my point of view, the TRM is not detailed enough regarding the FIFO operation mode.

The main question is: what does the CAN core do when the software is reading out a FIFO while new frames targeting the same FIFO are arriving from the bus?

  • Does the CAN core place each new message at the starting position of the FIFO?
    That would mean we need to read out the FIFO as fast as possible, without allowing anything to interrupt us, to be sure reading happens faster than writing.
  • Or does the CAN core place each new message in the next free slot above the highest used one, as long as the FIFO has not been emptied completely (no active NewDat flags)?
    Would a loop that keeps reading until no "NewDat" flag is found automatically prevent this?

To make it even more difficult, we'd like to use the DCAN in two different ways (application dependent):

  • Having Rx interrupts enabled (for every single message object in the FIFO).
  • Using polling mode only

Do the two modes need different software handling to guarantee in-order reception?

Regardless of whether we use interrupt or polling mode: can there be any situation where we need to start reading from a position other than 1?

The only partly useful information I could find in the TRM is the flowchart on page 1187. But it comes without any explanatory text (on polling mode, sequences, usage of multiple FIFOs, etc.).

Thank you very much in advance.

Regards,
Michael

  • Bob,

    have you been able to reproduce the out-of-order reception with the updated code on your side?

    Regards,
    Michael

  • Sorry, got a bit behind last week and have not had a chance to look at this again yet. Are you using the optimizer? I was using "-o2". That may explain the difference in execution time.
  • With the change you made and the optimizer turned off, I can reproduce the problem. With the optimizer at -o2, I no longer reproduce the problem. Looks like we have a timing issue. Let me work on it a bit.
  • Bob,

    Interesting to hear. This would indeed indicate that there might be something like a race condition. I already mentioned this possibility at e2e.ti.com/.../1718735 after trying out almost everything that came to mind.

    Your original code and our slightly modified version differ by only a handful of instructions, which amounts to a timing difference well below one microsecond. Our production code has more checks and function calls in between, so it will certainly read out the FIFO more slowly, which might make the glitch even more likely.

    Please keep us up to date. We have been fighting this problem for weeks.


    Regards,
    Michael
  • Zhaohong and I discussed the possibility of a conflict with reading the CAN RAM and the loading of the FIFO. Since the CAN transmission is asynchronous to the reading of the FIFO we theorized that if it is a simple race condition, it would eventually occur on the optimized code example as well. My optimized code example has now run 48 hours without an error. It must somehow be also dependent on the sequence in the un-optimized example. I will keep looking into it.
  • Hi Bob,

    Thank you for your confirmation.
    Yes, this race condition is what we have been trying to communicate over the past replies.

    My observations so far:
    a) TI project, extended ID: it is running fine
    b) TI project, standard ID: issue reproducible, but only with optimization off
    c) TI project, standard ID, registers of each new FIFO entry (arb, mctl, data0-1) are copied with memcpy to a buffer and processed later -> this works even with opt=off

    Although it seems that optimization solves the issue, in our production code it does not.
    Even with the same functionality (I copied "canGetData" line by line), the result is a misplaced message in the buffer.
    The difference between the TI project and our production code is that we use DMA in the background, so the bus may be used by other peripherals as well while the CAN registers are accessed in the atomic iteration.

    I even tried to implement observation c) in our code: iterate over the FIFO, copy the registers atomically, and process them after the atomic block, hoping that this fast access would solve our issue (in production code we use optimization level 2, AFAIK). Unfortunately, it only made message processing more robust; it didn't solve the problem.
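The "copy first, process later" approach from observation c) can be sketched as below. This is only a host-side illustration under stated assumptions: the `RawObj` layout, the `ENTER_CRITICAL`/`EXIT_CRITICAL` placeholders, and all function names are hypothetical, and the message objects are modeled as a plain array, whereas the real DCAN message RAM is normally accessed through the IF1/IF2 interface registers. The bit positions (NewDat in bit 15 of the message control word, DLC in bits 3:0, Xtd in bit 30 of the arbitration word, standard ID in bits 28:18) should be checked against the TRM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit positions; verify against the DCAN chapter of the TRM. */
#define MCTL_NEWDAT   (1u << 15)
#define MCTL_DLC_MASK 0xFu
#define ARB_XTD       (1u << 30)

/* Hypothetical raw copy of one FIFO entry (arb, mctl, data0-1). */
typedef struct {
    uint32_t arb;
    uint32_t mctl;
    uint32_t data_a;
    uint32_t data_b;
} RawObj;

/* Placeholders: on the target these would disable and restore interrupts. */
#define ENTER_CRITICAL() ((void)0)
#define EXIT_CRITICAL()  ((void)0)

/* Copy every entry that has NewDat set and release it; no decoding here. */
static size_t fifo_snapshot(volatile RawObj *objs, size_t depth, RawObj *out)
{
    size_t n = 0;

    assert(objs != NULL && out != NULL);
    ENTER_CRITICAL();
    for (size_t i = 0; i < depth; i++) {
        if ((objs[i].mctl & MCTL_NEWDAT) == 0u) {
            continue;                      /* nothing new in this object */
        }
        out[n].arb    = objs[i].arb;
        out[n].mctl   = objs[i].mctl;
        out[n].data_a = objs[i].data_a;
        out[n].data_b = objs[i].data_b;
        objs[i].mctl &= ~MCTL_NEWDAT;      /* release the object */
        n++;
    }
    EXIT_CRITICAL();
    return n;
}

/* Decoding happens later, outside the critical section. */
static uint32_t raw_id(const RawObj *o)
{
    return (o->arb & ARB_XTD) ? (o->arb & 0x1FFFFFFFu)
                              : ((o->arb >> 18) & 0x7FFu);
}

static uint32_t raw_dlc(const RawObj *o)
{
    return o->mctl & MCTL_DLC_MASK;
}
```

The point of the split is only to keep the time spent touching the FIFO as short as possible; as reported above, it made the readout more robust but did not remove the problem.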

    From my point of view, the optimization just happened to arrange the instructions so that the readouts and the arrival of new CAN messages were always timed correctly. I think it is only a matter of time or setup (temperature, bus length, ECU oscillator, and other varying parameters) before the same issue shows up. Note that your setup is two identical boards, used at room temperature, probably with a short CAN bus wire.

    Tamas

    PS: I have no idea why extended vs. standard ID influences the outcome, but as written in a) and b), I had no problems when using the extended format. That does not necessarily mean it is working fine; I just didn't see any problems. Maybe you could have a look at this.

  • Hi Bob,

    It's been over a week now, do you have any update on this issue?

    Tamas

  • Bob,

    I hope you didn't forget about us.
    The topic is getting more and more urgent in our project. We need a workaround for this supposed silicon glitch very soon.

    Also a code piece that is guaranteed to work under all circumstances would be great.
    Under all circumstances means from my point of view:

    • independent of the FIFO's size
    • independent of any compiler optimization level or toolchain version
    • independent of the length and type of a CAN frame
    • independent of the internal SoC bus load (given by any other master on the bus, like DMA)

    Thank you very much in advance.

    Regards,
    Michael

  • I am not having much luck narrowing this down. Too often when I try to instrument the code, it stops failing. Since this is taking so long, would you consider using a single message box and IF3 registers with DMA?
  • Hi Bob,

    On our side it is quite the opposite: whenever we extend the code of the TI project example, we run into the error.
    With our production code the error is unavoidable; it happens even with the most stripped-down example.

    We have already considered using the DMA, but there is no guarantee that the problem won't happen again, since TI has not yet published the root cause of this issue or a statement that the DMA approach is a known workaround. On our side this would require a complete redesign of our base library, so we need either a workaround for polling the FIFO or an errata-like release that endorses using DMA.

    If you cannot proceed further with the analysis, please escalate the CAN bug to higher levels.

    Best Regards,
    Tamas

  • Any update on this would be very much appreciated.

  • DCAN core places the new message to the lowest free slot until the last message object of this FIFO Buffer is reached. All further messages will be written into the last message object of the FIFO buffer (EoB=1) and therefore overwrite previous messages in this message object.
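The placement rule described above can be written down as a small host-side model. It is a sketch under stated assumptions, not a model of the silicon: `FifoModel`, `fifo_store`, `fifo_read`, and the FIFO depth of 4 are hypothetical names and values chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_DEPTH 4u   /* hypothetical FIFO size for this model */

typedef struct {
    bool new_dat;       /* models the NewDat flag */
    int  payload;       /* stand-in for the frame contents */
} MsgObj;

typedef struct {
    MsgObj obj[FIFO_DEPTH];   /* obj[FIFO_DEPTH - 1] plays the EoB=1 object */
} FifoModel;

static void fifo_init(FifoModel *f)
{
    for (size_t i = 0; i < FIFO_DEPTH; i++) {
        f->obj[i].new_dat = false;
        f->obj[i].payload = 0;
    }
}

/* "CAN core" side: lowest free slot, else overwrite the last object.
 * Returns the index that was written. */
static size_t fifo_store(FifoModel *f, int payload)
{
    size_t slot = FIFO_DEPTH - 1u;
    for (size_t i = 0; i < FIFO_DEPTH; i++) {
        if (!f->obj[i].new_dat) {
            slot = i;
            break;
        }
    }
    f->obj[slot].payload = payload;
    f->obj[slot].new_dat = true;
    return slot;
}

/* "CPU" side: read one object and clear NewDat; false if nothing new. */
static bool fifo_read(FifoModel *f, size_t slot, int *payload)
{
    assert(slot < FIFO_DEPTH);
    if (!f->obj[slot].new_dat) {
        return false;
    }
    *payload = f->obj[slot].payload;
    f->obj[slot].new_dat = false;
    return true;
}

/* One hand-picked interleaving of arrivals and reads over two passes.
 * Frames arrive in the order 1, 2, 3, 4, 5. Returns the number read. */
static size_t demo_interleaved_readout(int out[5])
{
    FifoModel f;
    int v;
    size_t n = 0;

    fifo_init(&f);
    fifo_store(&f, 1);                       /* -> slot 0 */
    fifo_store(&f, 2);                       /* -> slot 1 */

    /* Pass 1: the reader walks upward while frames keep arriving. */
    if (fifo_read(&f, 0, &v)) out[n++] = v;  /* reads 1 */
    fifo_store(&f, 3);                       /* refills slot 0 */
    if (fifo_read(&f, 1, &v)) out[n++] = v;  /* reads 2 */
    fifo_store(&f, 4);                       /* refills slot 1 */
    fifo_store(&f, 5);                       /* lands ahead of reader */
    for (size_t s = 2; s < FIFO_DEPTH; s++) {
        if (fifo_read(&f, s, &v)) out[n++] = v;
    }

    /* Pass 2: start again from the first object. */
    for (size_t s = 0; s < FIFO_DEPTH; s++) {
        if (fifo_read(&f, s, &v)) out[n++] = v;
    }
    return n;
}
```

In this model the demo reads the frames back as 1, 2, 5, 3, 4: frames 3 and 4 refill slots behind the reader, frame 5 lands ahead of it and is picked up first. That interleaving needs three frames to arrive during a single pass, so it only illustrates what the placement rule permits when the reader is delayed; it is not a claim about the root cause of the behavior reported in this thread.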

    I tried interrupt mode and didn’t see the issues found in polling mode. The RX interrupt of the last message object is enabled. Whenever the RX interrupt is triggered, CPU reads all the message objects starting from the first message object.

    My measurement shows that reading 1 message object with 8 bytes of data takes about 4.5 µs. If the DCAN baud rate is 500 kbps, the shortest period between 2 adjacent DCAN messages is about 270 µs (8-byte data and the control bits), so if the FIFO size is set properly, there will be no problem for FIFO operation in interrupt mode.
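The frame-time figure can be checked with a short calculation. The sketch below assumes a classic CAN data frame with an 11-bit identifier and counts the 3-bit intermission; the function names are hypothetical. It uses the usual bit count of 44 + 8n frame bits for n data bytes and the usual worst-case stuff-bit bound of floor((34 + 8n - 1) / 4).

```c
#include <assert.h>
#include <stdint.h>

/* Bits on the wire for a classic CAN data frame with an 11-bit ID,
 * including the 3-bit intermission. With worst_case_stuffing != 0 the
 * maximum possible number of stuff bits is added. */
static uint32_t can_std_frame_bits(uint32_t dlc, int worst_case_stuffing)
{
    uint32_t bits;

    assert(dlc <= 8u);
    bits = 44u + 3u + 8u * dlc;              /* frame + intermission + data */
    if (worst_case_stuffing) {
        bits += (34u + 8u * dlc - 1u) / 4u;  /* upper bound on stuff bits */
    }
    return bits;
}

/* Time on the bus in nanoseconds for the given bit count and bit rate. */
static uint32_t can_frame_time_ns(uint32_t bits, uint32_t bitrate_bps)
{
    assert(bitrate_bps > 0u);
    return (uint32_t)(((uint64_t)bits * 1000000000ull) / bitrate_bps);
}
```

For 8 data bytes at 500 kbps this gives 135 bits, or 270 µs, with worst-case stuffing, which matches the figure quoted above, and 111 bits, or 222 µs, with no stuff bits at all. With the measured 4.5 µs per message object, either value leaves room for dozens of object reads per frame time, provided the reader is not pre-empted.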

    Regards,
    QJ
  • Thank you for your answer.

    Unfortunately, we have a system running with the IRQ method, and there we see that it is not always the first message object of the FIFO that triggers the interrupt but sometimes the 2nd or the 3rd(!). Of course, this is at least a deterministic way of handling the incoming messages and keeping them ordered.

    We can confirm that the IRQ method is also not a solution but a more or less working workaround.

    We are still awaiting a solution or erratum for the polling method. If an erratum is going to be released, please include the root cause and a statement that the proposed workaround (a different read sequence, IRQ, DMA) is proven by design and not just by experiment.

    Thank you,
    Tamas

  • Another month has passed since the last update request. Is there any update or information? Is anybody working on a solution or on the issue, or is the processor no longer supported?

  • Hi,
    By no means is the issue you found with the DCAN being ignored. We are trying to replicate the issue in simulation, and hopefully we will be able to identify the root cause in the next few days. We will update this post as soon as possible. Again, we apologize for the inconvenience and thank you for your patience.
  • Hi,

    Our design team has spent a lot of effort and time trying to reproduce the problem in both the simulation and QuickTurn environments. Simulations have been run for many days non-stop to no avail. Sorry to say that we have yet to reproduce the problem in either environment.

    We will discuss what next step to take.

    As QJ suggested, you could use interrupts instead of polling with FIFO mode. If this is acceptable to you, then I think this is a workaround for now. Alternatively, you could add redundant information to your payload, for example by using a portion of the DCAN payload to encode the sequence number of each frame. When the CPU receives the data, it decodes the sequence information to reorder the frames. You will lose some bandwidth, but perhaps this is a tradeoff you could consider.
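The sequence-number idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Frame` type, the choice of the first payload byte as an 8-bit wrapping counter, and the function names. It sorts one batch of received frames by their distance from the next expected sequence number, which handles counter wrap-around as long as a batch spans fewer than 256 frames.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical frame container: data[0] carries the sequence number. */
typedef struct {
    uint8_t data[8];
    uint8_t dlc;
} Frame;

/* Sender side: one byte of sequence number, up to 7 bytes of user data. */
static void frame_pack(Frame *f, uint8_t seq, const uint8_t *payload,
                       uint8_t len)
{
    assert(f != NULL && payload != NULL && len <= 7u);
    f->data[0] = seq;
    memcpy(&f->data[1], payload, len);
    f->dlc = (uint8_t)(len + 1u);
}

/* Receiver side: insertion sort of one batch by (seq - expected_seq)
 * modulo 256, so that 254, 255, 0, 1 is treated as ascending. */
static void frames_reorder(Frame *batch, size_t count, uint8_t expected_seq)
{
    for (size_t i = 1; i < count; i++) {
        Frame key = batch[i];
        uint8_t key_dist = (uint8_t)(key.data[0] - expected_seq);
        size_t j = i;

        while (j > 0 &&
               (uint8_t)(batch[j - 1].data[0] - expected_seq) > key_dist) {
            batch[j] = batch[j - 1];
            j--;
        }
        batch[j] = key;
    }
}
```

This only restores order within one readout. If a late frame can end up in the following readout, the receiver would additionally have to hold frames back until the gap in the sequence is filled, and a gap that never closes indicates a lost frame.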
  • Hi,

    As mentioned before, neither the polling nor the IRQ approach works.

    I tried using the DMA, which seemed to work fine, but we are developing an API for our customers and we have certified code that would need a complete redesign to use the DMA instead. That means we need an official statement (perhaps with a supporting document) that the DMA works as expected, that this is proven by tests, and that those tests reveal the problem with the polling and interrupt modes. Otherwise I cannot just tell our development team to rewrite our codebase based on a suggestion read in a forum thread.

    If I remember correctly, you were able to reproduce the issue using the TI Hercules board. Isn't it possible to attach to this (or an opened) processor and check whether it behaves as in your simulation?

  • Yes, we were able to reproduce on our boards. We actually use the same testcase to run on the simulation and it took days non-stop on simulation and we were unable to reproduce the problem. Bear in mind that even on the board it took several real time seconds after many CAN frames have been received to produce the problem. To simulate real time in seconds in simulation takes a lot of time. 

     How did you use the DMA to read the DCAN FIFO? Do you use the RTI to generate a periodic DMA request for the DMA module to read the DCAN FIFO? 

  • DMA method setup:

     - using the DCAN module's IF3 as source (this will point to a FIFO element, and will jump automatically to the next non-read)
     - 2 buffers[0..1] allocated, DMA points to one of them meanwhile we read the other one
     - DMA is periodically checked if it transferred a FIFO element
     - once the DMA has finished the current transfer, it is pointed to the other buffer and restarted, while we process the buffer it just filled

    I used this implementation along with a software delay between checks of the DMA status. It seems to work correctly (I didn't get any 'lost' or misplaced messages).
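The buffer swapping in the setup above can be sketched without any hardware access. The sketch is an assumption about how such a scheme might be structured, not the poster's code: `PingPong`, `pingpong_init`, `pingpong_poll`, and the element size of 4 words are hypothetical, and the DMA itself is reduced to a "transfer finished" flag supplied by the caller. Programming the DMA channel with IF3 as source is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ELEM_WORDS 4u   /* hypothetical size of one copied FIFO element */

typedef struct {
    uint32_t buf[2][ELEM_WORDS];  /* buffers[0..1] */
    unsigned dma_target;          /* buffer the DMA currently writes to */
} PingPong;

/* Returns the first DMA destination. */
static uint32_t *pingpong_init(PingPong *pp)
{
    assert(pp != NULL);
    pp->dma_target = 0u;
    return pp->buf[0];
}

/* Call periodically with the DMA "transfer finished" status.
 * If a transfer finished, returns the buffer that is ready for processing
 * and stores in *next_dest the destination to re-arm the DMA with.
 * Otherwise returns NULL and sets *next_dest to NULL. */
static const uint32_t *pingpong_poll(PingPong *pp, bool dma_done,
                                     uint32_t **next_dest)
{
    const uint32_t *ready;

    assert(pp != NULL && next_dest != NULL);
    if (!dma_done) {
        *next_dest = NULL;
        return NULL;
    }
    ready = pp->buf[pp->dma_target];
    pp->dma_target ^= 1u;                 /* swap buffers */
    *next_dest = pp->buf[pp->dma_target];
    return ready;
}
```

On the target, the caller would re-arm the DMA with `*next_dest` before processing the returned buffer, so that the next FIFO element is not written into the buffer still being read.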

  • Any official update regarding the issue? Errata?

  • I'm also concerned about receiving CAN messages using a "FIFO". Is there any update to this?
  • Hello Brian,

    No, there are no further updates. We were never able to reproduce the issue during design simulations that can give detailed look at the internal interactions keeping in mind that these models execute in a very controlled environment.

    We did, however, recreate the similar issue during bench testing as noted in this thread, but this can't be used to determine if there is a silicon bug or not since this method is also subject to potential software constraints that could impact the testing. If you have experienced similar problems, can you create a new post describing the issues as you have observed them and any sample code which you might have for this issue?
  • Dear TI Support,

    We detected the problem originally and sent you an exact description of it (see the posts above with screenshots and data extracts from the Lauterbach debugger; you have plenty of data available).
    We also recreated the issue using the DCAN example from HALCoGen-generated code, a Hercules prototype board, and TI CCS v6.0. We sent you the projects, at least via email (not sure if they show up here).
    You have the example code and proof that it is not just our code. As we communicated several times, when we first detected the problem we were using our own HAL layer (similar to HALCoGen).

    With the support of your experts, we also narrowed it down: neither the configuration nor the timing of the code causes the problem.
    On this forum you can also find similar reports. I found at least some for other processors using the same DCAN module where the exact same problem was reported, and I believe it was admitted a long time ago.

    The question is: why do we again see a statement from TI that you are still unsure whether this is a code timing issue or a silicon bug? Almost 2 years have passed and the only answer we have is: we admit that the problem exists, we just don't know what it is, because the problem _can_not_be_recreated_with____simulations____.

  • Hello user4642853,

    Apologies for any confusion. I have only just joined the discussion, so I am not 100% aware of the effort or verification that had been done on the issue at the time of the last inquiry that I responded to. Also, without further detail I cannot tell whether each poster is describing the same issue or has the same code set. In short, my comments were relative to my awareness and in response to the last poster. As for the overall issue of the FIFO and the ordering of received messages, I am aware of some other devices for which it has been indicated that the ordering of the information within the FIFO cannot be guaranteed. In the case of the Hercules implementation, we will be including an erratum in our next release of the errata documentation. For the time being, you can reference the below early copy of the specific erratum applying to this issue:

    Hopefully, this will satisfy your needs for documentation of the issue and the proposed work around.

  • Hi,

    Sorry for the late reply. I just noticed a key phrase in the "Expected Behavior" description.
    Your erratum says that the DCAN might misplace messages with the "same arbitration and mask IDs" (at least, the expected behavior is that messages received with the same arbitration and mask are not misplaced), but this is just a subset of the actual truth.
    We have proven and indicated that the arbitration or mask has nothing to do with the misbehavior of the DCAN.
    When used in FIFO mode, the DCAN misplaces messages regardless of mask, ID, or data size!

    Please confirm that you have analysed the general case (the issue is valid for any configuration) and not just the case of "messages with same arbitration and mask ID".

    Thank you.

  • Just a question from a random observer: did the failures happen in a way that was not in accordance with CAN 2.0b? Under CAN 2.0b, priority bits should actually allow some frames to appear before other frames.
  • Hello User,

    The quoted text is from the "Expected Behavior" section. I think it is quite clear in the "Issue" section that we are not stating anything about the impact of ID or arbitration on the ordering. It simply and clearly states that the messages can be out of order and that, if order is important to the application, you should use the DMA.