TMDXRM57LHDK: EMAC (LwIP) and Serial interface cause a data fetch error

Part Number: TMDXRM57LHDK
Other Parts Discussed in Thread: HALCOGEN

Hi,

I don't know how to approach this problem. Hopefully, you can give me some tips.

I am working with lwIP 1.4.1 and the Texas Instruments RM57 HDK without an OS.

How my program works:

  • sensor data comes in through a serial interface in asynchronous mode
  • once all data is received, it is sent to a client on a PC via TCP (the server is the HDK)

Problem:

  • after some random time (normally around an hour), the program crashes with a "data fetch" error

Things I noticed:

  • The L4, L4_ABT, L4_USR registers point towards a problem with the serial interface (bad address). I know that the instruction pointed to by the L4 register is not necessarily where the problem lies. The L4 register is set at the time the debugger or board notices the problem, but the problem itself could have occurred several instructions earlier. However, I also noticed that when the problem occurs, the buffers (in the double buffer) I use in the receive interrupt routine of the serial interface point to an address outside the allowed memory region. These buffers are part of my application, not system-level buffers.

Tests I did:

  • I left the program running without the TCP server for a whole day. Sensor data was being received in asynchronous mode and processed by the main loop. The program did NOT crash.
  • I left the program running with the TCP server for a whole day with a single static buffer which was initialized once and never changed. The serial interface to the sensor was not active. The program did NOT crash.
  • I left the program running with the TCP server for 4 hours with a double static buffer which was being updated every 60ms with dummy values. The serial interface to the sensor was not active. The program did NOT crash. (I did this to test whether my copy function was somehow faulty.)
  • I tried running the TCP server and the sensor serial interface at the same time but without copying the serial buffer to the TCP buffer. The TCP server was sending, in one test, the single static buffer that is never changed, and in the other test, the double static buffer (being updated every 60ms with dummy values). In both tests, the program crashed.

The problem only occurs when both the TCP and the asynchronous serial communication are active at the same time.

(Maybe related) The TCP client on the PC is actually a GUI that displays my sensor data. When connected, the "image" of the sensor data "jumps" once every ~2 seconds. The data is wrong. I thought this could be a copy error from the serial buffer to the tcp buffer. But, before the board sends the data over TCP, it processes it and checks if the data is wrong or corrupted. If it is, it sends several error signals (LEDs and serial debug data). When the image in the GUI jumps, I also get the error signals from the board (which means, the data is actually wrong). When the server is not connected to the GUI, I do NOT get those error signals from the board.

This looks as if the TCP server somehow affects the interrupt routine of the serial interface (or the same memory area?) (and somehow corrupts its buffers?). 

How can I check who is trashing memory? How can I check the EMAC driver and the serial interface handler? How can I check if the "port" is re-entering non-reentrant functions? The "port" is what lwIP calls the interface layer between a device and the lwIP API. I got my port from a Texas Instruments tutorial.

I don't have much experience in embedded programming, and I don't know how to investigate this more deeply. What else can I check, and how? Maybe my port is not right? Does anyone have a port for the RM57 HDK that I can compare against?

If you understood my description of the problem, what do you think the cause could be? If you didn't understand it, please tell me so that I can explain it better.

I really appreciate any help.

Best regards,

Julio

  • Hello Julio,

    Thanks for the very detailed explanation of your troubles. At the moment, I am still trying to understand the issue and think of possible places to look for potential trouble. If you don't mind, I would like to think about this for a bit and get back with you on it after I have had some time to investigate and consider areas to look at.
  • Hi Chuck,

    Thanks for taking the time to help me.

    I would just like to point out something new that happened yesterday.

    As an intro:

    I mentioned that I process the sensor data after it has been completely received. As a one-time initialization step of the processing, I set some global variables of type float, float[] and float[][] (and others) to some values. These variables are used every time the data needs to be processed.

    Problem:

    I was trying to reproduce the error in order to check some registers, but the data fetch error didn't come up yesterday. The 'new' problem might be related, though.

    The server stops sending data (tcp_write returns -11, which is NO_CONNECTION). The serial interface keeps working as it should. But the global variables I set during initialization contain corrupted data. Somehow their values were changed.

    How can this happen?

    Maybe this can help you understand the issue if it is related.

    Thanks and regards,

    Julio

  • Hello Julio,

    There are a couple of potential issues that might be happening.

    First, you could be running out of space for frame/message buffers. I am not an expert on the lwIP application demo, but my understanding is that it does some limited memory allocation for storing the frame/message information. If you exceed this space, it could be corrupting your SCI buffers and also your global variables. This seems most likely given your initially reported issue, where you would periodically get erroneous data. It would seem that you were using all the allocated space, and the last buffer was then an errant pointer to random data, because the message buffer pointer had been corrupted or had overrun the allocated space.

    A less likely scenario is that you are overrunning your stack, causing SRAM data corruption. But this seems less likely because at least part of your application continues to operate.
  • Hi Chuck,

    Thank you for the help.

    Is it possible to increase the size of the STACK section in the memory map? Right now it is about 5k, and one sensor data buffer is about 2.2k. If the serial interrupt is active, and the tcp interface is sending at the same time, the stack must hold the context of both interfaces plus the normal context from the main loop, right? It could be that the stack is not large enough and everything gets corrupted?

    Thanks and regards,

    Julio

  • Hi Chuck, 

    I think I found the cause of the data fetch error. I was writing more data than I should into the serial buffer, because I wasn't checking the amount of data to be written. I ran the program for about 2 hours and the data fetch error didn't occur. I am going to run it the whole night and see if the program is still running tomorrow.

    What I think happens is the following:

    • A: The program reads from the serial buffer the "size" of the telegram that is about to come. I was using this value directly without testing it as the amount of data that the serial interface needs to wait for.
    • B: When the tcp interface sends data, it interrupts the serial interface routine. Sometimes it takes so long that when it comes back to the serial interrupt, the data in the internal serial buffer is newer (corrupted?).  

    If B happens while I'm reading the "size" of the next message, this value might be larger than the actual buffer size. Since I was doing A, I was generating the data fetch error. Now I just check whether this value is inside the buffer limits, and it seems to work.

    I now have another problem: (should I open a new thread for this?)

    The serial buffer still gets "corrupted" when the tcp server is active, because B can occur while reading any part of the serial message.

    I tried increasing the priority of the serial interrupt by changing its interrupt handler from IRQ to FIQ in HALCoGen.

    Unfortunately, that didn't work. The program crashes with another data fetch error seconds after startup.

    What am I missing?

    Best regards,

    Julio

  • Hello Julio,

    Glad to hear you have been able to move beyond the data abort issue and are now on to more implementation related issues. For more information on the interrupt priorities, please see this thread: e2e.ti.com/.../59793

    The thread discusses in detail how the priorities work. You can adjust the priorities by configuring the interrupt channel assignments as well as by changing to FIQ. Also, do you re-enable interrupts after entering your SCI interrupt? If so, wait until the end of the service routine to re-enable them, to keep the EMAC interrupt from nesting (as long as the EMAC interrupt is an IRQ as well). You could also use a flag to check whether the SCI interrupt was disrupted when entering state B above; i.e., if an SCI Rx is ongoing, finish it before completing the EMAC Tx, based on the state of the flag. Is there a reason the Tx needs to be an interrupt-based function? For example, is it a reaction to a request from the PC for a transmission? If there is no compelling reason to have this interrupt based, you could do a polling implementation, or you could make your system event driven, where you only transmit over TCP when you have complete data from your sensor.
  • Hi there Chuck,

    I hope you are still around :-)

    According to the link from your last post, my theory about the tcp tx routine interrupting the serial rx routine is wrong. It says another IRQ (even of higher priority) cannot interrupt an already ongoing interrupt routine.

    My new theory now is that the serial interrupt is triggered when the EMAC routine is being serviced, so the serial interface has to wait. While it is waiting, new data is arriving and either the buffer gets overwritten or that new data is lost. Does this make sense to you?

    From the Reference Manual, I know there are two serial receive buffers: the SCIRXSHF and the SCIRD. "The first one gets all incoming data and when a frame has been completely received, the data is transferred to the second one. As this transfer occurs, the RXRDY flag is set and a receive interrupt is generated".

    If the serial interrupt is generated while the program is in the TCP interrupt, what happens to new data arriving at SCIRXSHF? I would think the SCIRXSHF keeps receiving data to avoid losing any, right? That would mean that more than one frame is going to arrive. Does this new data get copied to SCIRD (because a complete frame was received) even though the last interrupt hasn't been serviced yet? If that's the case, what can I do to avoid losing data?

    Thanks and Regards,
    Julio
  • Julio,

    If you are in the middle of your TCP interrupt when data is received and the SCI Rx interrupt is suppressed or delayed as a result, the data will still transfer from the SCIRXSHF register to the SCIRD register (two-stage buffer). But if you are unable to get to it before another message is received into SCIRXSHF, the data in SCIRD will be overwritten and lost, leading to an overrun error (you can enable this error if it isn't enabled already; you can at least check the error flag in the status register). However, I think you are referring to the multi-buffer mode of the SCI, where there are 8 buffers into which the data is moved from the SCIRXSHF register when received. The trick there is to set the number of bytes expected, which sets the frame length (buffer depth).

    The real solution here is to spend as little time as possible in the interrupts and only use them to dump the data into received-data structs, for both the TCP data and the SCI data. That is, get into the ISR, copy the data to a RAM buffer for later processing, re-enable interrupts, and get out. All the processing can then happen in your application layer. This still won't prevent interrupt latency, but hopefully it will allow you to capture the data before another message comes in. In fact, you may be able to get away with only enabling interrupts in the TCP interrupt, so that it allows nested interrupts from the SCI, which can capture the data and copy it to RAM. Once the TCP transfer is done, you can move the SCI data to where it belongs and adjust the buffer. (Ideally you would have a RAM-based ring buffer (FIFO) for SCI receive data, where the interrupt only places data into the ring buffer and the SCI data is processed at a non-interrupt level.)
  • Hi Chuck,

    thanks again for the help. I really appreciate it.

    Chuck Davenport said:

    If you are in the middle of your TCP interrupt when data is received and the SCI Rx interrupt is suppressed or delayed as a result, the data will still transfer from the SCIRXSHF register to the SCIRD register (two-stage buffer). But if you are unable to get to it before another message is received into SCIRXSHF, the data in SCIRD will be overwritten and lost, leading to an overrun error (you can enable this error if it isn't enabled already; you can at least check the error flag in the status register). However, I think you are referring to the multi-buffer mode of the SCI, where there are 8 buffers into which the data is moved from the SCIRXSHF register when received. The trick there is to set the number of bytes expected, which sets the frame length (buffer depth).

    I was able to see the overrun error when I enabled it. So at least I now know for sure what is happening. You mentioned I am using the multi-buffer mode. I am not sure about that. In the "Technical Reference Manual" I only see the option to set SCI/LIN in multi-buffer mode. Can I also do that for SCI3? I am already setting the expected amount of data in the SCI receive function, and I get the sciNotification after that amount of data has been received. However, I think the sci3HighLevelInterrupt is still being called after each single frame is received.

    I did the following test:

    • The sensor needs 67 ms to send data, it then pauses for 10ms, and starts to send the next serial telegram over the next 67ms.
    • I process the complete data and send it (raw) over TCP. This takes around 6-10ms, and that's when new serial data is arriving. This is probably why the error occurs: I need to do some synchronization checks on the first 4 bytes of the telegram and check the size of the telegram in the next 2 bytes. This means I read the first 6 bytes one at a time (sciReceive(UART3, 1, buffer)), which increases the chances of the TCP blocking the receive stream.
    • The test was to send the data over TCP after I've done all those checks and I've done 'sciReceive(UART3, size, buffer)'

    I now get the overrun error A LOT less frequently. However, I expected to not see it anymore. Which leads me to believe that I am not using the multi-buffer mode. The overrun error occurs again a lot more frequently if I add another tcp connection.

    Chuck Davenport said:

    The real solution here is to spend as little time as possible in the interrupts and only use them to dump the data into received-data structs, for both the TCP data and the SCI data. That is, get into the ISR, copy the data to a RAM buffer for later processing, re-enable interrupts, and get out. All the processing can then happen in your application layer. This still won't prevent interrupt latency, but hopefully it will allow you to capture the data before another message comes in. In fact, you may be able to get away with only enabling interrupts in the TCP interrupt, so that it allows nested interrupts from the SCI, which can capture the data and copy it to RAM. Once the TCP transfer is done, you can move the SCI data to where it belongs and adjust the buffer. (Ideally you would have a RAM-based ring buffer (FIFO) for SCI receive data, where the interrupt only places data into the ring buffer and the SCI data is processed at a non-interrupt level.)

    I don't know exactly what the lwIP port (from the Texas Instruments tutorial) does in the tx interrupt. I would assume it tries to spend as little time as possible, which is what I also do. I tried your idea of using a nested IRQ so that the serial interface is able to interrupt the TCP interface, but the board crashes. I also tried setting the SCI to work with FIQ, but the result is the same. I guess the TCP interrupt of the port was not designed to be interrupted.
    What about using SCI3 with DMA? Would that help to send all incoming data to the DMA buffer while avoiding the blocking from the tcp?
    Best Regards,
    Julio
  • Julio Cesar Aguilar Zerpa said:

    I now get the overrun error A LOT less frequently. However, I expected to not see it anymore. Which leads me to believe that I am not using the multi-buffer mode. The overrun error occurs again a lot more frequently if I add another tcp connection.

    SCI3 is a standard SCI and does not have the buffered mode, which is only available on the SCI/LIN instances. You might consider using the DMA if the device has one. That way the DMA can transfer received data into a buffer that can be operated on independently of reception. You can then operate on the buffer as needed, when time permits, without fear of interruption from the TCP operations.

    Julio Cesar Aguilar Zerpa said:

    What about using SCI3 with DMA? Would that help to send all incoming data to the DMA buffer while avoiding the blocking from the tcp?

    Same idea I had earlier. I would suggest giving that a try. There is an SCI DMA example in the HALCoGen examples that works well. I think you have to kick off the first communication manually to get things started.

  • Hi Chuck,

    thanks a lot for the help. I also found the sci dma example, but I can't get it to work. It prints FAIL in loopback mode or gets stuck in the while loop when loopback is zero.

    I created a new project, enabled SCI3 and SCI4 in HALCoGen, didn't change the configuration (all SCI interrupts disabled, frame parameters the same for both), and generated the code.

    Am I missing some configuration steps?

    Another thing is: if DMA is to be used to receive data, a dummy DMA transfer needs to be implemented to update the CTCOUNT, which allows to trigger again the DMA receive request. So I guess, I am going to need that.

    What would be the dummy_req_line that should be used in 'dmaReqAssign(DMA_CHX, dummy_req_line)', if my dummy send and receive buffer are defined in my application (uint8 dummy_array[2], where idx 0 would be the tx addrs and idx 1 the rx addrs)?

    Best Regards,
    Julio
  • Hi Chuck,

    The original question is already answered. I am going to mark the question as answered and open a new thread for the DMA problem.

    If you or a colleague could help me out, it would be great. Here is the link to the other thread: e2e.ti.com/.../579707

    Thanks a lot.
    Julio