TMS320C6457: send() function in NDK Library

Part Number: TMS320C6457

This thread is related to the following e2e thread. Since I received additional inquiries, please allow me to re-open this thread.

https://e2e.ti.com/support/processors/f/791/p/895613/3313673#3313673

Hello team

Due to the COVID-19 situation, our customer has not yet been able to check the ROV information, sorry. The additional inquiries are as follows.

1. Have you ever heard of a similar send() freeze issue in the following environment?

====

Tool: CCS Ver. 4.2.4.00033

RTOS: SYS/BIOS 6.32.02.39

NDK: 2.23.02.03

====

2. Is there any limitation on the "TCP send buffer size"? That is, is there a defined maximum for this setting?

3. The customer wants to investigate this issue in other ways until the ROV information is available. Could you share your advice or thoughts on this, please? First of all, the customer wants to determine whether this issue is on the hardware side or the software side.

I understand these inquiries are tough to answer, but any comments you can share would be really appreciated.

Best regards,

Miyazaki

  • Miyazaki,

    Re: #1

    There is a general TCP deadlock problem when all input and output buffers happen to become full at the same time. Here is a link describing the problem.

    Another possibility might occur if recv() is not called frequently enough. At a low level, PBM resources are shared between send and receive. If receive is not occurring, then the PBM pool will run out and the send will block waiting for a PBM resource to become available.

    Re: #2

    The maximum TCP transmit buffer size is 65535 bytes.

    Re: #3

    It would be most useful to capture a Wireshark trace of the problem. Please attach the Wireshark log file.

    Also, try viewing the TCP statistics variables in the CCS Expression view. Enter the name 'tcps' in the view. Take a snapshot before and after the issue.

    As a recovery mechanism, you could add a 5-second timeout to the socket. If send() returns with a timeout error, you could then close the socket and reconnect.

    /* 5 second timeout on send and receive sockets */
    struct timeval timeout;   /* s is assumed to be an already connected socket */
    timeout.tv_sec = 5;
    timeout.tv_usec = 0;
    setsockopt(s, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout));
    setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout));

    ~Ramsey
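
The timeout-based recovery suggested above can be sketched as follows. This is only a minimal sketch written against POSIX-style sockets so that it can be tried on a host PC; the helper names set_io_timeout() and send_all() and the send_status type are hypothetical and are not NDK APIs. On the NDK, the error code would be read with fdError() rather than errno and the socket would be closed with fdClose(); which error code the NDK reports on a timeout is an assumption here and should be checked against the NDK documentation.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Outcome of one guarded send (hypothetical type, for illustration only). */
typedef enum { SEND_OK, SEND_TIMED_OUT, SEND_FAILED } send_status;

/* Apply the same timeout to send and receive on socket s. Returns 0 on success. */
static int set_io_timeout(int s, long sec, long usec)
{
    struct timeval timeout;
    timeout.tv_sec = sec;
    timeout.tv_usec = usec;
    if (setsockopt(s, SOL_SOCKET, SO_SNDTIMEO, &timeout, sizeof(timeout)) != 0)
        return -1;
    return setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &timeout, sizeof(timeout));
}

/* Send all len bytes. A timeout is reported separately from other errors
 * so that the caller can decide to close the socket and reconnect. */
static send_status send_all(int s, const char *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = send(s, buf + done, len - done, 0);
        if (n > 0) {
            done += (size_t)n;          /* partial send: keep going */
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;                   /* interrupted: retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return SEND_TIMED_OUT;      /* SO_SNDTIMEO expired */
        return SEND_FAILED;
    }
    return SEND_OK;
}
```

A caller would treat SEND_TIMED_OUT as the signal to close the socket and reconnect, and SEND_FAILED as an ordinary socket error.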

  • Hi Ramsey,

    Thanks for your comments and your detailed thoughts. I shared them with the customer. I'd like to wait a while for the customer's feedback.

    Best regards, Miyazaki

  • Hello Ramsey,

    Regarding the Wireshark log, the customer has not been able to capture one so far because this behavior rarely occurs. However, according to the server-side analysis (tcpdump log), the last communication is an "Ack" from the Linux server (it does not seem that the customer is able to share this log). Given this, could you consider the possibility that "all input/output buffers happen to become full"?

    I understand it is difficult to comment on this, but please allow me to ask because the customer is asking.

    It would be really appreciated if you could share your comments with us.

    Best regards, Miyazaki

  • Miyazaki,

    I think you will need to enable some debug messages in the stack. The NDK User Guide Section 3.5 has some information on debugging an application. The NDK Reference Guide Section 2.5 has some information on the DbgPrintf() API. It would be good to enable this in the customer build. I will get back to you on where to instrument the code.

    ~Ramsey

  • Miyazaki,

    Try instrumenting the following functions:

    /ti/ndk/stack/sock/sock.c: SockSend()
    /ti/ndk/stack/tcp/tcpout.c: TcpOutput()
    /ti/ndk/stack/ip/ipout.c: IPTxPacket()

    The general execution flow is

    SockSend() --> TcpOutput() --> IPTxPacket()

    Add -DTCP_DEBUG to your build options.

    ~Ramsey
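
One way to instrument the three functions listed above is to record a small event at the entry of each one. The following is only a sketch under stated assumptions: the trace ring, the trace_record() helper and the SITE_* identifiers are hypothetical names introduced for illustration and are not part of the NDK. It records into RAM instead of printing, on the assumption that printing from inside the stack may change timing; the DbgPrintf() API mentioned earlier could be called at the same places instead. The sketch is not protected against concurrent callers.

```c
#include <stdint.h>

#define TRACE_DEPTH 64u   /* power of two so the index can be masked */

/* One recorded event (hypothetical layout, for illustration only). */
typedef struct {
    uint32_t seq;     /* running event number */
    uint16_t site;    /* which function was entered */
    int32_t  value;   /* for example a byte count or a return code */
} trace_entry;

/* Hypothetical identifiers for the three functions to instrument. */
enum { SITE_SOCK_SEND = 1, SITE_TCP_OUTPUT = 2, SITE_IP_TX_PACKET = 3 };

static volatile trace_entry trace_ring[TRACE_DEPTH];
static volatile uint32_t trace_count;   /* total events ever recorded */

/* Store one event, overwriting the oldest entry once the ring is full. */
static void trace_record(uint16_t site, int32_t value)
{
    uint32_t n = trace_count;
    volatile trace_entry *e = &trace_ring[n & (TRACE_DEPTH - 1u)];
    e->seq = n;
    e->site = site;
    e->value = value;
    trace_count = n + 1u;
}

/* Site of the most recent event, or 0 if nothing has been recorded yet. */
static uint16_t trace_last_site(void)
{
    if (trace_count == 0u)
        return 0;
    return trace_ring[(trace_count - 1u) & (TRACE_DEPTH - 1u)].site;
}
```

With a call such as trace_record(SITE_SOCK_SEND, size) placed at the top of SockSend(), and matching calls in TcpOutput() and IPTxPacket(), the ring can be inspected in the debugger after a lockup. If the last recorded site is SITE_SOCK_SEND, for example, that would suggest the send stalled before reaching TcpOutput().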

  • Hi Ramsey,

    Thank you for your helpful advice on this analysis. I shared your comments with our customer and discussed them. With the debug build, the behavior was not the same, so the customer was not able to confirm that the issue occurred. We therefore suppose it is difficult to identify the root cause of this issue with debug messages.

    As for the send() / recv() timeout, the customer added the timeout and tried to reproduce the issue; however, I heard that send() did not return with a timeout error.

    From this point of view,

    1. Would you consider it possible that the PBM pool runs out?

    2. When the PBM pool runs out, is it possible that recv() also blocks waiting for a PBM resource to become available?

    Could I have your expert advice and comments on these, please?

    The customer understands that, given the lack of information, it is tough to analyze this issue. However, your advice would be appreciated. After receiving your expert comments, the customer plans to close this thread until more detailed data is available.

    Best regards,

    Miyazaki

  • Miyazaki,

    Yes, it is still a possibility that the PBM pool has run out.

    The PBM pool used by the NDK is a doubly-linked list of buffers. The size of the available PBM pool can be viewed in the CCS Expressions view. Enter 'PBMQ_free' into the Expressions view. Then expand the item. The 'Count' value indicates how many buffers are available in the pool.

    When the send() failure occurs again, halt the processor and inspect the PBMQ_free Count value. If this is zero, it is most likely the reason for the send() lockup.

    If the PBM pool is not zero, there is still another possibility. The driver also has its own PBM pool. If this runs out, this also would be a problem. Finally, the driver EMAC DMA code probably has a pool of buffers. If this runs out, it might also explain the issue. However, the network driver for the C6457 is not owned by the NDK team. You would need to contact the Processor team regarding the driver details.

    The most likely reason for the PBM pool to run out is that the application does not call recv() when data is available. The most common implementation is to create a dedicated task to receive data. This task would call recv() in a loop. This ensures the incoming data is consumed and all PBM buffers are recycled into the pool.

    You ask if recv() would block if the PBM pool is empty. Yes, it would. The PBM buffers are shared by both the receive and send data paths.

    Did the customer collect the TCP statistics? This also can be viewed by entering 'tcps' in the CCS Expressions view.

    ~Ramsey
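
The dedicated receive loop described above can be sketched as follows. This is a minimal sketch written against POSIX-style sockets so that it can be tried on a host PC; receive_loop() is a hypothetical name, not an NDK API. On the NDK, the loop would run in its own task, which is assumed here to need a file descriptor session (fdOpenSession()) before using sockets, and the error would be read with fdError() rather than errno.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Consume data from socket s until the peer closes the connection or an
 * error occurs. Returns the total number of bytes consumed, or -1 on a
 * receive error. */
static long receive_loop(int s)
{
    char buf[1500];
    long total = 0;

    for (;;) {
        ssize_t n = recv(s, buf, sizeof(buf), 0);
        if (n > 0) {
            total += n;       /* hand buf[0..n-1] to the application here */
            continue;
        }
        if (n == 0)
            return total;     /* peer closed the connection */
        if (errno == EINTR)
            continue;         /* interrupted: retry */
        return -1;            /* any other error ends the loop */
    }
}
```

Because the loop calls recv() continuously, incoming data is consumed as soon as it arrives. If a receive timeout is set on the socket, a timed-out recv() ends this loop with -1, so a real task would probably retry in that case instead of returning.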

  • Hello Ramsey,

    Thank you for your advice. I shared your comments and received an inquiry about recovery from this issue.

    When the PBM pool blocking issue has occurred, is there any recovery procedure other than a hardware reset?

    As I mentioned previously, the customer added a timeout to send() / recv(), but told me that send() did not return with a timeout error; that is, the system still appears to be frozen. The customer would like to know of another recovery method, if possible. For example, if recv() is called after the issue has occurred, do you think this deadlock could be resolved?

    I'm sorry to ask, but it would be really appreciated if you could share another recovery method or your comments on this.

    Best regards,

    Miyazaki

  • Miyazaki,

    The goal is to avoid the PBM pool lockup issue, not to implement a recovery method. For debugging, it is good to confirm the PBM is causing the lockup and then to work backwards to prevent the lockup. If the PBM is not the root cause of failure, then we need to look elsewhere. Has the customer confirmed that the PBM pool is empty when the lockup occurs?

    If you have determined that the PBM pool is the cause of the lockup, then it should be preventable by making the necessary calls to send() and recv(). This would ensure proper application behavior. Has the customer tried this?

    In the end, if a lockup has occurred, it means the application is in a failure state. The only recovery is to reboot. Attempting to call recv() would probably result in undefined behavior because the application is already in a failure state. But this is a worst case situation. Under normal operating conditions, this failure should not be encountered.

    ~Ramsey

  • Hi Ramsey,

    Thank you for sharing your expert comments. I really appreciate your kind advice. I will close this thread until the customer is able to gather more detailed data.

    Best regards,

    Miyazaki