AM3352: NDK network stack idle timeout

Part Number: AM3352


Dear TI team,

I have a problem with the TI-RTOS network stack:
When a network connection has been idle for about 300 to 400 seconds, all subsequent incoming packets (e.g. ICMP Echo or TCP) seem to be dropped for about 5 to 10 seconds.

Sending data over an open but inactive TCP connection (keep-alive disabled) leads to retransmissions.
After about 10 seconds of non-response, during which some retransmitted packets are dropped, the stack responds again and the TCP connection can be used as before.
Using a TCP connection with keep-alive enabled avoids this issue.
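
As an illustration of the keep-alive workaround, here is a minimal host-side sketch in Python. It assumes keep-alive is enabled on the host end of the connection; the helper name `enable_keepalive` and the timer values (60 s idle, 10 s interval, 5 probes) are assumptions chosen only to stay well below the roughly 300 s idle period described above. Nothing here is part of the NDK.

```python
import socket


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keep-alive and, where the platform allows it, tune the timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "SIO_KEEPALIVE_VALS"):
        # Windows: (on/off, idle time in ms, probe interval in ms)
        sock.ioctl(socket.SIO_KEEPALIVE_VALS,
                   (1, idle_s * 1000, interval_s * 1000))
    else:
        # Linux-style options; each one is set only if the platform defines it.
        for name, value in (("TCP_KEEPIDLE", idle_s),
                            ("TCP_KEEPINTVL", interval_s),
                            ("TCP_KEEPCNT", probes)):
            if hasattr(socket, name):
                sock.setsockopt(socket.IPPROTO_TCP, getattr(socket, name), value)
    return sock
```

The helper would be called on the client socket before or right after connecting, e.g. `enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))`.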

The host (Windows 10 computer) and the target (AM3352) are connected directly (without any other network components, e.g. switches) and use static IP addresses.
Computer: 192.168.1.77
AM3352: 192.168.1.20

I can confirm this behaviour on 2 different boards:
a custom PCB and the BBB (BeagleBone Black).

Toolchain:
NDK 3.60.00.13
PDK 1.0.15
EDMA3 LL 2.12.5
SYS/BIOS 6.75.2.00
UIA 2.30.1.02
XDCtools 3.60.2.34

Processor:
AM3352

The same issue occurs when I use NDK 2.26.0.08.

I use a patched NDK which fixes a network stack lockup bug.
-> see the solution in this thread: e2e.ti.com/.../3149418

Debugging via the "XDS220 ISO" probe with semihosting enabled shows no messages other than the init message:

[CortxA8] Network Added: If-1:192.168.1.20

The tcps data structure does not show any problems either.


Below you will find the debug log and a Wireshark log for the following test case ('Idle-Retransmission-Problem'):
* Boot up AM3352
* Establish a TCP connection at port 12666
* Wait for a period of time (in this case 478s)
* Try to send data over the TCP connection
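
The test steps above could be scripted on the host roughly as follows. This is only a sketch: the function name `idle_then_send`, the payload, and the timeout are assumptions, and the behaviour of the server on the target is not modelled. The retransmissions themselves remain visible only in the packet capture, not in the return value.

```python
import socket
import time


def idle_then_send(host, port, idle_s, payload=b"ping",
                   timeout_s=30.0, sleep=time.sleep):
    """Connect, leave the connection idle for idle_s seconds, then send payload."""
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sleep(idle_s)            # no traffic while the connection idles
        sock.sendall(payload)    # first packet after the idle period
        return len(payload)
```

For the case described here, the call would be `idle_then_send("192.168.1.20", 12666, idle_s=478)` with Wireshark capturing on the host.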

You will see retransmitted packets in the Wireshark log until the network stack works again.
I also added a screenshot of the "tcps" data structure after the incident.


The same kind of problem occurs if I use ICMP Echo instead of a TCP server.
I wrote a test that sends an ICMP Echo after an idle delay that starts at 300s and increases by 10s each time, until the ICMP Echo fails.
This procedure is executed 3 times.
The Wireshark log 'Interval Ping Timeout' shows the results of this test.
You can see the issue starting at packets #18, #41 and #65.
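
The structure of such an interval test can be sketched as below. This is not the original test code: `send_echo` is a hypothetical callable that sends one ICMP Echo and returns True if a reply arrived, and the upper bound `max_s` is an assumption added so the loop terminates.

```python
import time


def delay_schedule(start_s=300, step_s=10, max_s=600):
    """Yield idle delays start_s, start_s + step_s, ... up to max_s inclusive."""
    delay = start_s
    while delay <= max_s:
        yield delay
        delay += step_s


def first_failing_delay(send_echo, delays, sleep=time.sleep):
    """Return the first idle delay after which send_echo() gets no reply, else None."""
    for delay in delays:
        sleep(delay)             # stay idle before the next echo request
        if not send_echo():
            return delay
    return None
```

Calling `first_failing_delay(send_echo, delay_schedule())` three times in a row would mirror the three runs described above.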

IMHO this looks like a bug in the network stack.

Kind regards,
Markus

Idle-Retransmission-Problem.pcapng.gz

Interval Ping Timeout.pcapng.gz

  • Hi Markus,

    Please refer to the following thread:

    https://e2e.ti.com/support/processors/f/791/t/852801

    Ming

  • Hello Ming,

    Thank you for your fast response.

    I tried the "quick fix" of turning off the ALE aging timer, which you mentioned in the linked thread, and it seems to work: no timeouts so far.
    The second option (adding an "active" network component) is not possible in our scenario.

    I have some questions regarding the ALE:

    1)
    AFAIK the ALE has 1024 entries.
    If ALE aging is disabled, does this mean that 1024 different MAC addresses (devices) can access the device before the table is full?

    2)
    What are the consequences if the ALE table is full?
    Are no more connections possible?

    3)
    I checked the manual regarding the ALE and found an option for "bypassing" the ALE.
    We don't need any access control (the device should be reachable for all network devices), so that may be an additional solution.
    What's the disadvantage of bypassing the ALE?

    4)
    Is the ALE table cleared if the device is physically disconnected (e.g. disconnection of ethernet cable) and reconnected?

    5)
    I assume the ALE table is empty after initial bootup.
    Why does the first connect work without any problems (no delays, no timeouts) but after the ALE table entry is removed by the aging procedure it takes about 10 seconds until the device "recovers"?

    Thank you so far,
    Markus

    P.S.:
    For readers of this thread who have the same problem: an additional step is required, which Ming mentioned in the linked thread (look for: _RtNoTimer).

  • Hi Markus

    1) AFAIK the ALE has 1024 entries.
    If ALE aging is disabled, does this mean that 1024 different MAC addresses (devices) can access the device before the table is full?

    [MW] Yes


    2) What are the consequences if the ALE table is full?
    Are no more connections possible?

    [MW] The new packet probably gets dropped, because the ALE does not know how to forward it.
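
To make answers 1) and 2) concrete, here is a toy model of a bounded address table with optional aging. It is only an illustration of the idea and not a description of the CPSW hardware: the class `AgingTable`, its method names, and the timestamp-based aging rule are all assumptions; only the capacity of 1024 entries comes from this thread.

```python
class AgingTable:
    """Toy address table: bounded capacity, optional age-out of stale entries."""

    def __init__(self, capacity=1024, max_age_s=None):
        self.capacity = capacity
        self.max_age_s = max_age_s   # None models a disabled aging timer
        self.last_seen = {}          # MAC address -> time of last packet

    def learn(self, mac, now):
        """Record a packet from mac; return False if the table has no free slot."""
        if mac not in self.last_seen and len(self.last_seen) >= self.capacity:
            return False
        self.last_seen[mac] = now
        return True

    def age_out(self, now):
        """Drop entries idle for longer than max_age_s (no-op when aging is off)."""
        if self.max_age_s is None:
            return
        self.last_seen = {mac: seen for mac, seen in self.last_seen.items()
                          if now - seen <= self.max_age_s}
```

In this model, with aging disabled, entries are never removed, so a new address can no longer be learned once `capacity` distinct addresses have been seen.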


    3) I checked the manual regarding the ALE and found an option for "bypassing" the ALE.
    We don't need any access control (the device should be reachable for all network devices), so that may be an additional solution.
    What's the disadvantage of bypassing the ALE?

    [MW] You cannot bypass the ALE if the CPSW is used as a switch. Otherwise, it may work for you. It really depends on your network topology.


    4) Is the ALE table cleared if the device is physically disconnected (e.g. disconnection of ethernet cable) and reconnected?

    [MW] Yes.

    5) I assume the ALE table is empty after initial bootup.
    Why does the first connect work without any problems (no delays, no timeouts) but after the ALE table entry is removed by the aging procedure it takes about 10 seconds until the device "recovers"?

    [MW] I do not know. I thought it would never "recover" once the ALE table entry is removed.

    Ming