NDK hang after 20 minutes running

Other Parts Discussed in Thread: SYSBIOS

Hi, I am using UDP to send data from a PC to a C6678 EVM board. I started from the helloworld sample program in the MCSDK package and modified the daemon callback function. The PC-side program sends 1040-byte packets continuously to the board. In my experiments, the data transmission hangs after about 20 minutes of running. Has anyone encountered the same problem before?

NDK version: 2.22.3.20

SYS/BIOS version: 6.35.1.29

XDC version: 3.24.7.73

  • Fan,

    I would like to know more details:

    - MCSDK version?

    - Does the helloworld sample program also get stuck without your code change? Can you share what you changed?

    - What is the UDP packet rate from the PC to the EVM? Can you share the program so we can try to reproduce your issue here? As far as I know, the 6678 is quite robust to UDP floods.

    - When the DSP is stuck, at the EMAC/SW register layer, does it still receive UDP packets from the PC? Look at the 0x2090b00 and 0x2090c00 regions. Those registers are explained in table 3-2 of http://www.ti.com/lit/ug/sprugv9c/sprugv9c.pdf. Do you see the Rx counter increment?

    - Then at the NIMU driver layer, do you see the EmacRxPktISR() function hit? You can set a breakpoint in the assembly window.

    - Then in the NDK stack, can you check whether the global structs "ips" and "udps" still increase in the CCS Expressions window?

    Regards, Eric 

     

  • Hi, Eric

    Thanks for your reply. Here are some details related to your questions.

    - MCSDK version?

         I use MCSDK 2.1.2.6

    - Does the helloworld sample program also get stuck without your code change? Can you share what you changed?

    - What is the UDP packet rate from the PC to the EVM? Can you share the program so we can try to reproduce your issue here? As far as I know, the 6678 is quite robust to UDP floods.

    - When the DSP is stuck, at the EMAC/SW register layer, does it still receive UDP packets from the PC? Look at the 0x2090b00 and 0x2090c00 regions. Those registers are explained in table 3-2 of http://www.ti.com/lit/ug/sprugv9c/sprugv9c.pdf. Do you see the Rx counter increment?

    - Then at the NIMU driver layer, do you see the EmacRxPktISR() function hit? You can set a breakpoint in the assembly window.

    - Then in the NDK stack, can you check whether the global structs "ips" and "udps" still increase in the CCS Expressions window?


     Based on your questions, to check the internal variables of the NDK stack, I followed the tutorial and rebuilt the NDK with the -g and -O0 options. Now I have lost track of the problem: it has been running fine for an hour and a half. It would be great if rebuilding the NDK turns out to be the easy fix. I will run an overnight test today, and if it survives until tomorrow, I will consider the problem solved and mark this thread as answered. Otherwise I will answer your questions once I catch the hang again.


    Thanks again. 

  • Hi, Eric

    Unfortunately the hang came back this afternoon. As you suggested, I tested the original helloworld sample, and it did not hang. My guess is that the helloworld UDP receiver synchronizes with the sender by sending an ack message after each successful recv(), which keeps the receive buffer from overflowing. The difference between my program and helloworld is that I do not send an ack, and I allow a certain number of packets to be dropped. The PC-side sender sends packets as fast as it can, and the receiver reads them as fast as it can.

    Here is the code:

    UDP daemon callback function  

    int DebugUDPDaemon(SOCKET s, UINT32 unused) {

      unsigned char* pBuf;
      HANDLE hBuffer;
      int sin_len;
      struct sockaddr_in sin;

      int sout_len;
      struct sockaddr_in sout;

      printf("Starting UDP receiver..\n");

      // Initialize socket address structure for Internet Protocols
      bzero(&sin, sizeof(sin));
      sin.sin_family = AF_INET;
      sin.sin_addr.s_addr = htonl(INADDR_ANY);
      sin.sin_port = htons(6789);
      sin_len = sizeof(sin);

      bzero(&sout, sizeof(sout));
      sout.sin_family = AF_INET;
      sout.sin_addr.s_addr = inet_addr("129.252.131.30");
      sout.sin_port = htons(6789);
      sout_len = sizeof(sout);

      int status = 0;
      int old_packet_id = 0;

      int total_packet = 0;
      int lost_packet = 0;

      while (1) {
        status = recvncfrom(s, (void **)&pBuf, UDP_PACKET_SIZE + 4 * sizeof(int),
                                            (PSA)&sin, &sin_len, &hBuffer);
        if (status < 0) {
          // Do not touch pBuf when the receive failed.
          printf("error\n");
          break;
        }

        // The first integer of the UDP packet is the packet id. I use this number to check the
        // loss rate of the data transmission.
        int packet_id = (pBuf[0]) | (pBuf[1] << 8) | (pBuf[2] << 16) | (pBuf[3] << 24);

        if (packet_id != old_packet_id + 1) {
          lost_packet++;
        }
        total_packet++;
        if (total_packet > 10000) {
          // I clear the lost-packet counter every 10000 packets and output the loss rate.
          printf("lost rate = %f\n", (float)(lost_packet) / total_packet);
          total_packet = 0;
          lost_packet = 0;
        }
        old_packet_id = packet_id;

        // If I add the following line to ack each successful receive back to the sender, it will not hang;
        // otherwise the reported loss rate rises above 50% after about 20 minutes.

        // status = sendto(s, pBuf, 0, 0, (struct sockaddr*)&sout, sout_len);
        recvncfree(hBuffer);
      }

      return 0;

    }
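
    The loop above adds one to lost_packet per gap, however many ids the gap skipped. A minimal sketch of a gap-aware count follows. The helper names packets_missing and count_missing are hypothetical and not part of the NDK, and reordered or duplicated packets are assumed to count as zero loss.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of ids skipped between the previously received
 * id and the current one. Unsigned subtraction keeps the result correct
 * across 32-bit wraparound. A duplicate or an out-of-order (older) id is
 * treated as zero loss. */
static uint32_t packets_missing(uint32_t prev_id, uint32_t cur_id)
{
    uint32_t step = cur_id - prev_id;   /* 1 when nothing was dropped */

    if (step == 0 || step > 0x7FFFFFFFu) {
        return 0;                       /* duplicate or out-of-order */
    }
    return step - 1;
}

/* Hypothetical helper: total ids skipped over a run of received ids,
 * starting from the last id seen before the run. */
static uint32_t count_missing(uint32_t prev_id, const uint32_t *ids, size_t n)
{
    uint32_t lost = 0;
    size_t i;

    assert(ids != NULL || n == 0);
    for (i = 0; i < n; i++) {
        lost += packets_missing(prev_id, ids[i]);
        prev_id = ids[i];
    }
    return lost;
}
```

    With received ids 1, 2, 5, 6, 10 and a starting id of 0, count_missing reports 5 dropped packets (3, 4, 7, 8, 9), where the one-per-gap counter would report 2.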

    PC-side code (compiled under Linux)

    void DebugUDP(char* addr, unsigned int port) {
      int socket_descriptor;

      struct sockaddr_in sin;  
      int sin_len;

      struct sockaddr_in sout;
      int sout_len;

      int optval = 1;

      socket_descriptor = socket(AF_INET, SOCK_DGRAM, 0);
      setsockopt(socket_descriptor, SOL_SOCKET, SO_REUSEADDR, &optval, sizeof(optval));

      bzero(&sin, sizeof(sin));
      sin.sin_family = AF_INET;
      sin.sin_addr.s_addr = htonl(INADDR_ANY);
      sin.sin_port = htons(port);
      sin_len = sizeof(sin);

      bzero(&sout, sizeof(sout));
      sout.sin_family = AF_INET;
      sout.sin_addr.s_addr = inet_addr(addr);
      sout.sin_port = htons(port);
      sout_len = sizeof(sout);

      bind(socket_descriptor, (struct sockaddr*)&sin, sin_len);

      unsigned char buffer[sizeof(int) * 4 + UDP_PACKET_SIZE];

      int packet_id = 0;
      int count = 0;

      int status;

      while (1) {

        ((int*)buffer)[0] = packet_id;
        status = sendto(socket_descriptor, buffer,
                                     4 * sizeof(int) + UDP_PACKET_SIZE, 0,
        (struct sockaddr *)&sout, sout_len);
        usleep(0);

        if (status < 0) {
          printf("sendto error\n");
        }

        cerr << "packet id = " << packet_id << endl;

        packet_id++;

        // uncomment to allow sync data sending with ack message from DSP.
        // status = recvfrom(socket_descriptor, (void*)buffer, 0, 0,
        //                                 (struct sockaddr*)&sin, (socklen_t*)&sin_len);

      }

    }
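
    The sender stores the id with ((int*)buffer)[0], that is, in the host's byte order, while the DSP code reassembles it byte by byte as little-endian. The two agree when the PC is little-endian (x86 is assumed here). A minimal sketch of byte-order-independent helpers follows; put_le32 and get_le32 are hypothetical names, not functions from the thread or from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write a 32-bit value as 4 little-endian bytes,
 * independent of the host's native byte order. */
static void put_le32(unsigned char *dst, uint32_t v)
{
    assert(dst != NULL);
    dst[0] = (unsigned char)(v & 0xFFu);
    dst[1] = (unsigned char)((v >> 8) & 0xFFu);
    dst[2] = (unsigned char)((v >> 16) & 0xFFu);
    dst[3] = (unsigned char)((v >> 24) & 0xFFu);
}

/* Hypothetical helper: read 4 little-endian bytes back into a 32-bit value.
 * Each byte is widened to uint32_t before shifting, so the top byte never
 * shifts into the sign bit of an int. */
static uint32_t get_le32(const unsigned char *src)
{
    assert(src != NULL);
    return (uint32_t)src[0]
         | ((uint32_t)src[1] << 8)
         | ((uint32_t)src[2] << 16)
         | ((uint32_t)src[3] << 24);
}
```

    On the sender, put_le32(buffer, packet_id) would take the place of the pointer cast, and get_le32(pBuf) would take the place of the shift expression on the receiver.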

    The constant UDP_PACKET_SIZE is defined in both the DSP and PC code as

    #define  UDP_PACKET_SIZE 1024

    According to my last run, the hang happens after the 10487588th packet; however, this number varies from run to run.

    - Then in the NDK stack, can you check whether the global structs "ips" and "udps" still increase in the CCS Expressions window?

    I checked the values of ips.total and udps.RcvTotal. They do not always increase after a recvncfrom() call. Once the hang has lasted for a while, both values freeze completely, regardless of incoming packets.

    - When the DSP is stuck, at the EMAC/SW register layer, does it still receive UDP packets from the PC? Look at the 0x2090b00 and 0x2090c00 regions. Those registers are explained in table 3-2 of http://www.ti.com/lit/ug/sprugv9c/sprugv9c.pdf. Do you see the Rx counter increment?

    In your last post, you asked me to check the EMAC/SW registers. I cannot open the PDF you linked, but I found this one: http://www.ti.com/lit/ug/sprugv9d/sprugv9d.pdf. I looked at table 3-2 there but did not see information about the EMAC/SW registers. Would you please let me know which table corresponds in the Rev. D version?

    After the hang starts, the loss rate gradually rises to 100%. In the end, even though the sender is still sending packets, the receiver never returns from the blocked recvncfrom() call.

    - Then at the NIMU driver layer, do you see the EmacRxPktISR() function hit? You can set a breakpoint in the assembly window.

       Would you please show me how to set this breakpoint? I have never gone into the NIMU layer, so any detailed instructions would be appreciated.

  • Hi, Eric

    May I ask whether there is a function I can call to reset the packet buffer manager or restart the network stack when a hang is detected?

    Thanks.

  • Fan,

    - It is table 3-3 in SPRUGV9D—June 2013.

    - To set the breakpoint, see the attached picture.

    - What is the purpose of your test: to send the packets as fast as possible? Depending on my PC's speed and the connection between the PC and the EVM (I have a switch), I may not be able to reproduce your issue. It would be better to describe the data rate achieved from the PC side. How fast are you sending UDP packets, in Mbit/s?

    - When it halts, you have to do a system reset of the DSP and run default_global_setup() from the GEL file.

    Regards, Eric

     

     

  • The purpose is, as you said, to find the maximum transfer rate I can use to send data to the DSP. The data transfer rate shown on my PC side is 80 Mbit/s.

    I also found that when the system hangs, it loops in this function:

    xdc_Bool ti_sysbios_hal_Hwi_HwiProxy_getStackInfo__E( ti_sysbios_interfaces_IHwi_StackInfo* stkInfo, xdc_Bool computeStackDepth) 

    and goes nowhere else. I put a breakpoint in EmacRxPktISR(), but it is not hit while the system is hanging.

  • Yesterday I tried the NDK helloworld example receiving UDP packets (1470-byte payload) and didn't see any problem at a 100 Mb/s rate (my PC send utility is iperf). For throughput, I know of some forum posts reporting around 600-700 Mbit/s. Another test you can try is the HUA demo under C:\ti\mcsdk_2_01_02_06\demos\hua.

    Regards, Eric

  • Hi, Eric

    Thanks to your help, I found out that the board I am using is EVM v1.0, which is a beta test version. I switched to a new board (Rev 3.0b), and so far it is running well. I will run an overnight test again today, and if things go well (I believe they will), I will mark this post as solved.

    Thanks to all of you for lending a hand with my questions!

  • I tried the C6678 EVM board Rev. 3b. The problem is still there.

    I took the UDP helloworld program, increased the packet size, and removed the ack sendto() back to the PC side. That is all I changed in the program, and I changed the PC-side program correspondingly. The data rate from the PC side is 10000 packets per second at 1024 bytes per packet, about 10 MByte/s in total, which should be much lower than the bandwidth of the board.
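
    As a rough cross-check of the figures in this thread, the arithmetic can be sketched as below. The function names are hypothetical. The 1040-byte size assumes a 4-byte int (1024 + 4 * sizeof(int)), and the rates count UDP payload only, without UDP/IP/Ethernet headers.

```c
#include <assert.h>
#include <stdint.h>

/* Payload bit rate in Mbit/s (10^6 bits per second) for a given packet
 * rate and payload size. Protocol headers are not included. */
static double payload_mbit_per_s(uint32_t packets_per_s, uint32_t bytes_per_packet)
{
    return (double)packets_per_s * (double)bytes_per_packet * 8.0 / 1e6;
}

/* Seconds needed to send a number of packets at a fixed packet rate. */
static double seconds_to_send(uint64_t packets, uint32_t packets_per_s)
{
    assert(packets_per_s > 0);
    return (double)packets / (double)packets_per_s;
}
```

    payload_mbit_per_s(10000, 1040) gives 83.2 Mbit/s, in line with the roughly 80 Mbit/s seen on the PC side. seconds_to_send(10487588, 10000) gives about 1049 s, or roughly 17.5 minutes, in line with a hang after about 20 minutes.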

    Eric, I don't know how long you ran your test. The board acts normally for a while (no longer than 30 minutes) and is finally choked by the incoming packets. I tracked it down to the value of the RXGOODFRAMES register. Once the hang starts, the value in this register updates more and more slowly and finally freezes. I believe the network switch subsystem was effectively "DoSed", but a DoS does not usually show up at the hardware level, does it?
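
    Watching a receive counter such as RXGOODFRAMES slow down and freeze can be automated. The following is a minimal sketch of a stall detector over periodic counter samples. The type and function names are hypothetical, and how the sample is obtained on the target (for example, a volatile 32-bit read in the statistics region mentioned earlier) is left out, since the exact register offset is not given in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical watcher state for a free-running 32-bit frame counter. */
typedef struct {
    uint32_t last_count;    /* counter value at the previous sample */
    uint32_t idle_samples;  /* consecutive samples with no new frames */
} rx_watch_t;

static void rx_watch_init(rx_watch_t *w, uint32_t initial_count)
{
    assert(w != NULL);
    w->last_count = initial_count;
    w->idle_samples = 0;
}

/* Feed one periodic sample of the counter. Returns 1 once the counter has
 * not moved for `limit` consecutive samples, otherwise 0. The unsigned
 * subtraction stays correct when the counter wraps around. */
static int rx_watch_update(rx_watch_t *w, uint32_t count, uint32_t limit)
{
    uint32_t delta;

    assert(w != NULL);
    delta = count - w->last_count;
    w->last_count = count;
    if (delta != 0) {
        w->idle_samples = 0;
    } else {
        w->idle_samples++;
    }
    return w->idle_samples >= limit;
}
```

    Sampling once per second with a limit of 3, for instance, would flag the receive path as stalled after three seconds without a new good frame while the sender is known to be active.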

    And Eric, I don't know whether you are in charge of customer service, but if this is a real bug and my research team and I need to report it to TI, who should we talk to?

    Let me know what you need if you are working on reproducing the problem. I have both the DSP and PC-side code.

    Thanks.

  • Fan,

    I will run it overnight at a 100 Mb/s UDP rate on my 6678 EVM. I will use the helloworld example with the UDP send-back removed. Let's see whether the DSP gets stuck.

    In your test, did you increase "rc" from 8192 to a bigger number, say 32768 or 65536?

    // UDP Receive limit

    rc = 8192;

    CfgAddEntry( hCfg, CFGTAG_IP, CFGITEM_IP_SOCKUDPRXLIMIT, CFG_ADDMODE_UNIQUE, sizeof(uint), (UINT8 *)&rc, 0 );

    The issue/bug tracking goes through this E2E forum.

    Regards, Eric

  • Hi, Eric

    Thanks for your suggestion.

    I have tried increasing the receive buffer, but it didn't solve the problem. I even tried a 64M buffer and moved the packet buffer section into DDR3, but that does not even delay the hang.

    Here are a couple of other things I have tried that don't work. I don't know whether they are helpful, but I will list them.

    Increase the default daemon task stack size from 4096 to 16K.

    Use non-blocking reads to clear the buffer once the hang happens.

    Close the socket and rebind it once the hang happens.

    Switch to TCP (that is painful and still doesn't work; TCP seems to have the same problem).

    Halve the data transfer rate. That doubles the time until the hang, but it still appears after a similar number of packets has been transferred. The problem seems to be related to the amount of data transferred, not the speed.

    By the way, the CCS version I use is 5.1. I am not sure whether this is relevant, but if that is the problem I can certainly update CCS.

  • Fan,

    My XDC, NDK, and BIOS versions are the same as yours. CCS is 5.4. I ran it for 10 hours, using iperf to send UDP to it at 100 Mb/s. The UDP data size is 1470 bytes, totaling about 287 million packets and 393 GB of data sent.

    ------------------------------------------------------------
    Client connecting to 158.218.109.165, UDP port 7
    Sending 1470 byte datagrams
    UDP buffer size: 64.0 KByte (default)
    ------------------------------------------------------------
    [  3] local 158.218.109.172 port 56687 connected with 158.218.109.165 port 7
    [ ID] Interval       Transfer     Bandwidth
    [  3]  0.0-36000.0 sec   393 GBytes  93.8 Mbits/sec
    [  3] Sent 287044389 datagrams
    [  3] WARNING: did not receive ack of last datagram after 10 tries.

    I didn't see any stuck issue. I saw the EMAC/SW RXGOODFRAMES register incrementing correctly.
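
    The numbers in the iperf summary can be cross-checked with a short calculation. This is a sketch with hypothetical function names. It assumes that the "GBytes" figure is in units of 2^30 bytes and the "Mbits/sec" figure in units of 10^6 bits, which is the reading under which the log is self-consistent.

```c
#include <assert.h>
#include <stdint.h>

/* Total payload bytes for a number of equally sized datagrams. */
static uint64_t total_bytes(uint64_t datagrams, uint32_t bytes_each)
{
    return datagrams * (uint64_t)bytes_each;
}

/* Bytes expressed in units of 2^30 bytes. */
static double to_gib(uint64_t bytes)
{
    return (double)bytes / (1024.0 * 1024.0 * 1024.0);
}

/* Average rate in Mbit/s (10^6 bits per second) over an interval. */
static double mbit_per_s(uint64_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes * 8.0 / seconds / 1e6;
}
```

    For 287044389 datagrams of 1470 bytes sent over 36000 s, total_bytes gives 421955251830 bytes, to_gib gives about 393, and mbit_per_s gives about 93.8, matching the reported transfer and bandwidth.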

    Regards, Eric

  • Hi, Eric

    Thanks for your test results. They are very helpful. If it works on your machine, I am sure that if I follow exactly the same steps, I can make it work too.

    Today I discovered that the CCS version I use (v5.1) doesn't recognize MCSDK v2.1.2.6. Instead of using version 2.1.2.6, it links against an earlier version. The same goes for the C6678 PDK: instead of using the newest one (1.1.2.6), it links against v1.1.0.1.

    I am rebuilding the test environment with CCS 5.2 and will follow up on this post if the problem is solved.

  • Hi, Eric

    It is confirmed that the hang came from the C6678 PDK (v1.1.0.1) that I mistakenly used. After I updated CCS to v5.4, MCSDK to 2.1.2.6, and the C6678 PDK to 1.1.2.6, they work together with NDK 2.22.3.20 perfectly. The bandwidth I get matches the one you posted.

    The problem can now be marked as solved. Thank you for following my post over the last week and giving us the keys to solving it. With this problem solved, we can soon continue our R&D on the C6678 board.

    Have a good weekend. I know I will :P