Ethernet receiver is stalling

I am working with an AM335x and ISDK V1.1.0.3. When the Ethernet driver is flooded with small packets, the Ethernet receiver can run out of buffer descriptors, causing it to stall. Once stalled, the receiver no longer generates any receive interrupts. My issue is directly related to the post at e2e.ti.com/.../252188, where the problem was explained by Sudhakar Ayyasamy. However, the answer to that post does not work in all situations, including my application: it disables ALL interrupts, which would negatively affect all other interrupt handlers.

I would like to ask TI, and everyone else, if there is any way to detect when the Ethernet receiver has stalled? I believe Sudhakar Ayyasamy stated that it's not possible to recover from this error. However, I believe I may have been able to restart the Ethernet receiver by calling CPSWCPDMARxHdrDescPtrWrite(). The problem is that I don't know how to detect when the Ethernet receiver is stalled. I have looked at the register values, but the register values look correct even when the receiver is stalled. I would appreciate any help in detecting this issue.

Currently, my workaround is to add more buffer descriptors to prevent the receiver from running out of descriptors. However, there is always the possibility of running out of buffers.

  • This is a driver bug, plain and simple. Running out of descriptors should never result in any adverse effect other than dropped packets until fresh descriptors are queued.

    I'm not familiar with NDK, but I'm assuming this is similar code to StarterWare's cpsw driver? I once briefly looked at it and I recall it looked seriously weird. If Sudhakar is right in that the descriptors are linked into a circular list, then that is in itself a major bug, since CPDMA expects a NULL-terminated list of descriptors.

  • fyang said:
    I would like to ask TI, and everyone else, if there is any way to detect when the Ethernet receiver has stalled? I believe Sudhakar Ayyasamy stated that it's not possible to recover from this error. However, I believe I may have been able to restart the Ethernet receiver by calling CPSWCPDMARxHdrDescPtrWrite(). The problem is that I don't know how to detect when the Ethernet receiver is stalled. I have looked at the register values, but the register values look correct even when the receiver is stalled. I would appreciate any help in detecting this issue.

    If CPDMA detects it is being mismanaged, it will signal a DMA error in the DMASTATUS register. If enabled, a "misc" irq will also be asserted. As far as I know, this state is indeed "irrecoverable" in the sense of requiring a CPDMA reset and reinitialization. Of course this is itself a usable recovery procedure, resulting only in brief packet loss, and I would recommend implementing it in response to the DMA error IRQ. Note however that any occurrence of such an error indicates a serious driver bug, and such errors should not be swept under the carpet.
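    As a rough illustration of checking DMASTATUS for a host error, the sketch below decodes a raw register value that the caller has already read. The field positions are an assumption based on the AM335x TRM description of DMASTATUS and should be verified against your TRM revision; the names `cpdma_host_error` and `cpdma_decode_dmastatus` are hypothetical and not part of the ISDK or StarterWare.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed DMASTATUS field layout (check the AM335x TRM):
 *   [23:20] TX_HOST_ERR_CODE   [18:16] TX_ERR_CH
 *   [15:12] RX_HOST_ERR_CODE   [10:8]  RX_ERR_CH
 * A non-zero error code means CPDMA has flagged a host (software) error. */
#define DMASTATUS_TX_HOST_ERR_CODE(v) (((v) >> 20) & 0xFu)
#define DMASTATUS_TX_ERR_CH(v)        (((v) >> 16) & 0x7u)
#define DMASTATUS_RX_HOST_ERR_CODE(v) (((v) >> 12) & 0xFu)
#define DMASTATUS_RX_ERR_CH(v)        (((v) >> 8) & 0x7u)

/* Hypothetical result type for this sketch. */
typedef struct {
    bool     rx_error;    /* RX host error code is non-zero */
    bool     tx_error;    /* TX host error code is non-zero */
    unsigned rx_code;
    unsigned rx_channel;
    unsigned tx_code;
    unsigned tx_channel;
} cpdma_host_error;

/* Decode a raw DMASTATUS value; reading the register is left to the caller. */
static cpdma_host_error cpdma_decode_dmastatus(uint32_t dmastatus)
{
    cpdma_host_error e;
    e.rx_code    = DMASTATUS_RX_HOST_ERR_CODE(dmastatus);
    e.rx_channel = DMASTATUS_RX_ERR_CH(dmastatus);
    e.tx_code    = DMASTATUS_TX_HOST_ERR_CODE(dmastatus);
    e.tx_channel = DMASTATUS_TX_ERR_CH(dmastatus);
    e.rx_error   = (e.rx_code != 0);
    e.tx_error   = (e.tx_code != 0);
    return e;
}
```

    A "misc" interrupt handler could call this on the value read from DMASTATUS, log the code and channel, and then schedule the CPDMA reset and reinitialization outside the ISR.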

    CPDMA is obviously not omniscient and cannot detect all software mistakes. Incorrect descriptor list management code can cause receive or transmit to stall without any DMA error being signalled. If the stall you're experiencing can be resolved by writing the HDP register, then apparently you are in such a situation.
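    One well-known way to get such a silent stall is appending a descriptor to the tail just after CPDMA has already reached the end of the list. The sketch below shows the usual end-of-queue (EOQ) check on append, modelled in plain memory. It is an assumption that this is the failure mode here, since the ISDK driver source has not been examined. The descriptor type, the queue type, and the `hdp` field standing in for the RX head descriptor pointer register are all hypothetical; real descriptors hold 32-bit physical addresses, and a memory barrier is needed where noted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag bits in descriptor word 3, as assumed from the AM335x TRM. */
#define DESC_OWNER (1u << 29)  /* set: descriptor is owned by CPDMA */
#define DESC_EOQ   (1u << 28)  /* set by CPDMA when it hit a NULL next pointer */

/* Simplified model of a receive buffer descriptor. */
typedef struct rx_desc {
    struct rx_desc   *next;          /* NULL terminates the list */
    void             *buffer;
    uint32_t          buf_off_len;
    volatile uint32_t flags_pktlen;
} rx_desc;

/* Hypothetical software queue state; hdp stands in for the HDP register. */
typedef struct {
    rx_desc          *tail;
    rx_desc *volatile hdp;
} rx_queue;

/* Append d to the NULL-terminated list.
 * Returns true if the head descriptor pointer had to be (re)written. */
static bool rx_queue_append(rx_queue *q, rx_desc *d)
{
    d->next = NULL;
    d->flags_pktlen = DESC_OWNER;     /* hand the descriptor to the hardware */

    rx_desc *prev = q->tail;
    q->tail = d;

    if (prev == NULL) {               /* empty queue: start reception */
        q->hdp = d;
        return true;
    }

    prev->next = d;
    /* On real hardware, issue a memory barrier here before reading flags. */

    uint32_t mode = prev->flags_pktlen;
    if ((mode & (DESC_EOQ | DESC_OWNER)) == DESC_EOQ) {
        /* CPDMA finished prev and saw next == NULL before the link above
         * became visible, so it has stopped. Clear EOQ and restart at d. */
        prev->flags_pktlen = mode & ~DESC_EOQ;
        q->hdp = d;
        return true;
    }
    return false;                     /* CPDMA will follow the link itself */
}
```

    If a driver omits the EOQ check, or links the descriptors into a ring, reception can stop with no DMA error raised, which would match a stall that a manual HDP write clears.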

  • Matthijs van Duin said:
    If Sudhakar is right in that the descriptors are linked into a circular list, then that is in itself a major bug

    Based on later posts in that thread it sounds like that one was fixed. The issue described on the wiki is quite different. I'll go take a peek at the NDK sources to see what the current situation is.

    Just to be clear, I hope you're not expecting to be able to perform networking calls from inside ISRs? Since that would require a tremendously carefully designed networking stack...

    Update: ok, so the NDK doesn't seem to contain the CPSW layer, and the ISDK is a Windows-only installer. I have no Windows system, so there's not much I can do without having sources to look at. (If license permits it, uploading it to github would be useful)

  • Matthijs, thanks for your responses. I'll have to double check the DMA status register and test out the DMA IRQ. I am not trying any networking calls, or any blocking calls, from inside any ISR. I have considered that this could be a bug in the driver, but I'm surprised that more people aren't running into this issue.
  • fyang said:
    I have considered that this could be a bug in the driver, but I'm surprised that more people aren't running into this issue.

    People probably have, but somehow worked around it or went for some alternative solution. Quite possibly they dumped the TI codebase altogether (I would).

    To put things in perspective: in AM335x StarterWare 02.00.01.01 (latest version as of writing) the uartEcho example, quite possibly one of the more basic examples one would expect, compiles to a deadloop if optimization is enabled. Even if the bug in the example is fixed, the result still causes random corruption of output because the driver is buggy as hell (the functions to enable/disable particular interrupts briefly put the uart in configuration mode, which is absolutely forbidden since it halts uart operation even in the middle of transmitting a character.)

    The ARMv7 MMU support code has a function to update a translation table entry at runtime... it turns off the MMU (flushing all caches in the process), updates the entry, and turns the MMU back on again. Nicely done, really...

    I like TI's processors. I may have periodic complaints about their docs but I know they are still awesome in comparison to alternatives. Their software however... not so much.

  • Hi fyang,

    Can you please give more info on the test setup?

    Regards,
    Prajith
  • Hi Prajith,

    I have a program that will send UDP packets to the AM335x. The program is an in-house tool, but there are open source programs that will do the same thing. The following two conditions must be met to cause the "receiver stall" issue:
    1) Send small packets, such as 64 bytes. The larger the packet, the longer it will take to duplicate the issue. If you try large packet sizes, such as 1400 bytes, you may not be able to reproduce the issue.
    2) Send packets at a high rate. Try sending 5000 to 10000 packets per second. A higher packet rate will cause the issue to occur faster.

    I have tried directly targeting the IP address of the AM335x and I have also tried broadcasting the packet. Both methods will cause the issue to occur.
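    For anyone without a traffic generator at hand, the two conditions above can be approximated with a small POSIX sender like the sketch below. This is not the in-house tool; `udp_flood` and `interval_ns_for_rate` are hypothetical names, and pacing with `nanosleep` is only approximate at these rates. Note that an 18-byte UDP payload over IPv4 yields a minimum-size 64-byte Ethernet frame (including FCS), in case "64 bytes" is meant as frame size rather than payload size. Broadcast targets would additionally need `SO_BROADCAST` set on the socket, which this sketch does not do.

```c
#define _POSIX_C_SOURCE 200809L
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Gap between packets, in nanoseconds, for a target packet rate.
 * Returns 0 (no pacing) if the rate is not positive. */
static long interval_ns_for_rate(long packets_per_second)
{
    return packets_per_second > 0 ? 1000000000L / packets_per_second : 0;
}

/* Send `count` UDP datagrams of `payload_len` bytes to ip:port at roughly
 * `packets_per_second`. Returns the number sent, or -1 on setup failure. */
static long udp_flood(const char *ip, uint16_t port, size_t payload_len,
                      long packets_per_second, long count)
{
    char payload[1472];               /* largest payload in a 1500-byte MTU */
    if (payload_len > sizeof payload)
        return -1;
    memset(payload, 0xA5, payload_len);

    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1)
        return -1;

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    long ns = interval_ns_for_rate(packets_per_second);
    struct timespec gap;
    gap.tv_sec = ns / 1000000000L;
    gap.tv_nsec = ns % 1000000000L;

    long sent = 0;
    for (long i = 0; i < count; i++) {
        ssize_t n = sendto(fd, payload, payload_len, 0,
                           (const struct sockaddr *)&dst, sizeof dst);
        if (n == (ssize_t)payload_len)
            sent++;
        if (ns > 0)
            nanosleep(&gap, NULL);    /* coarse pacing only */
    }
    close(fd);
    return sent;
}
```

    For example, `udp_flood("192.168.1.10", 1234, 18, 10000, 600000)` would aim for about one minute of small packets at 10000 packets per second; the address is a placeholder for the AM335x board.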
  • First thought would be some race condition triggered by receiving the next packet at some ill-timed moment near the end of handling the current packet. (An alternative would be issues with receiving multiple packets in a single DMA, but it seems more likely that if that's not handled well, it would fail consistently.)

    Since I still don't have any code to examine, I might write up a short outline of what Ethernet rx code should look like (including any tricky details / corner cases that I am aware of). It would probably be a few days before I'd have time for that though.

  • Fyang,

    I could not reproduce the issue here. I sent 64-byte UDP packets at line rate (960 ns) to the device for one hour, but the issue did not come up. I also tried multiple instances of fping, with the same result.

    Can you please give more details on the test? In your setup, does the driver receive and send frames simultaneously?

    Regards,
    Prajith
  • Prajith,

    My test application is not sending out anything. The UDP packets are sent to a UDP port 1234. Since my application is not listening to UDP port 1234, the NDK automatically sends out an ICMP packet indicating that the port is unreachable.

    Are you using the ISDK and SYS/BIOS NDK? I am using SYS/BIOS 6.35.4.50, NDK 2.22.2.16, XDC 3.25.3.72, ISDK 1.1.0.3. I have checked ISDK 1.1.0.5 and there are no relevant changes to the Ethernet driver.

  • Fyang,

    Looks like the test setup is similar here also.

    Have you made any changes to the CPSW switch driver or the example? I am assuming you are using the Ethernetip adapter example. Is it possible for you to share the driver and example code so that we can reproduce the issue?

    Regards,

    Prajith