NDK 1.94: halt on IP_DROP_MEMBERSHIP or fdClose() after DHCP renew

Hello,

While working with NDK 1.94 on a C64+ platform, I've stumbled upon a problem that I think is related to this library.

My test device uses DHCP and joins a multicast group (IP_ADD_MEMBERSHIP) on startup. When the address from DHCP changes (either the router was changed, or the same router assigned a new IP after lease timeout), I want to shut down part of the application and relaunch it. Unfortunately, when I try to close the socket that previously joined the multicast group, some task at priority 9 (I believe one of the NDK tasks) hangs, consuming all CPU time indefinitely and starving all lower-priority tasks.

This happens only with a multicast-enabled socket. If I drop the multicast membership before the network removed/network added events occur, then everything works fine. The task hangs when the socket is closed or when the IP_DROP_MEMBERSHIP option is set after a DHCP renew. It happens even if the delay between the DHCP renew and closing the socket is long, and in my test even if the renewed address is the same as the previous one (although normally I wouldn't want to reopen/rebind the socket in that case).

I would be glad if someone could confirm the problem or send any suggestions.

Edit: after a few more tries, I'm not sure whether this can happen when the renewed address is the same as the previous one, but changing the router seems to be a reliable way to reproduce the effect. The task at priority 9 seems to be the NDK kernel in my configuration (default NDK priorities). The halting of this task may be just a random effect: at the same time, using TSK_stat, I'm observing strange stack usage (either 100% or some seemingly random, out-of-range number) for two other tasks that use the NDK, which may suggest some incorrect memory access.

  • Hi tomeko,

    Which version of BIOS and CCS are you using?

    This certainly could be due to a task stack overflow and this is the easiest thing to try first.  Can you try increasing the stack size of the tasks that you see having close to 100% stack usage?

    Also, as a side note, there have been several NDK releases since 1.94.  I checked the release notes but did not see any bugs related to this listed as being fixed.  But, it's possible that something causing this issue was fixed but not tracked in the release notes.

    But before going the update route, let's see what the results are with increased stack size.

    Steve

  • Thanks for the answer.

    A stack overflow seemed unlikely: the mentioned tasks have stack margins of 8 kB (or a little more) out of 16 kB. After removing the single setsockopt call with IP_ADD_MEMBERSHIP from the code and repeating the test, this margin is kept, so a stack overflow would be possible only if using multicast consumed more than 7 kB (taking the stack margin required by DSP/BIOS into account). If the router was not swapped (and the DHCP lease time expired) between IP_ADD_MEMBERSHIP and fdClose() or IP_DROP_MEMBERSHIP, or if multicast was not enabled, then no significant increase in stack usage is observed.

    I'm using DSP/BIOS 5.33.03, cgtools 6.1.7, CCS 3.3. Unfortunately, updating is not a viable option (at least not just because of this problem) due to the project size and the risk involved. If I don't find a solution, I will probably just block the functionality that requires multicast when DHCP is configured.

     

    Edit: I've tested with 256 kB stacks and the effect is the same.

    Also, I suppose that the IP_DROP_MEMBERSHIP action might be the source of the problem, since it seems to be executed when calling fdClose(). An IGMP Leave message is sent when the socket is closed under normal conditions (DHCP router not changed); I didn't check whether it is actually sent in this situation.

  • Hi tomeko,

    Very glad that you did and didn't find any stack overflow issues ... by the way:

    tomeko said:
    at the same time I'm observing using TSK_stat strange (either 100% or some seemingly random, out of range number)  stack usage of two other tasks that are using NDK - that may suggest some incorrect memory access.

    ... this is the only reason I suggested that you check the stacks.

    I looked at the code of setsockopt() for IP_DROP_MEMBERSHIP but nothing stood out immediately.

    Do you have the NDK sources with your 1.94 release? Can you try adding the file igmp.c to your project and rebuilding?  This should allow you to set a breakpoint in the function:

    int IGMPLeave (HANDLE hSock, struct ip_mreq* ptr_ipmreq)

    Hopefully you can check to see which part of the code is run for this failure case and then we can go from there.

    Steve

  • No, I don't have the NDK sources (at least not the stack.lib sources, just the usual 1.94 package). I've tried to link with my own (empty) IGMPLeave, but it gives me a linker conflict.

  • tomeko,

    Ok, yes, the NDK didn't ship sources until version 2.0.0.  I see that the igmp.c file is the same between 1.94 and 2.0.0.  I've attached it for you.  Can you try again with the attached file?

    Steve

    6786.igmp.c

  • Thanks, I didn't know that NDK 2.x comes with source code for stack.lib.

    I think I see the problem. Here is a log fragment:

    Link Status: 100Mb/s Full Duplex on PHY 15
    Network Added: If-1:10.0.0.125
    Service Status: DHCPC : Enabled : Running : 017
    DHCP Server 1 = '10.0.0.1'
    Router 1 = '10.0.0.1'
    Mask 1 = '255.255.255.0'
    IGMP: Join hSock = -2121356932, ptr_ipmreq = 80403064
    IGMP: JoinHostGroup
    IGMP: joined
    IGMP: Timer
    ...
    IGMP: Timer
    // eth detached
    Link Status: No Link on PHY 15
    // second router attached
    Link Status: 100Mb/s Half Duplex on PHY 15
    Network Removed: If-1:10.0.0.125
    Service Status: DHCPC : Enabled : Running : 018
    Network Added: If-1:192.168.0.202
    Service Status: DHCPC : Enabled : Running : 017
    DHCP Server 1 = '192.168.1.6'
    Router 1 = '192.168.1.1'
    Mask 1 = '255.255.254.0'
    // closing socket after Network Added event
    IGMP: Leave hSock = -2121356932, ptr_ipmreq = 818e3d14
    IGMP: EINVAL bind
    IGMP: Leave hSock = -2121356932, ptr_ipmreq = 818e3d14
    IGMP: EINVAL bind
    IGMP: Leave hSock = -2121356932, ptr_ipmreq = 818e3d14
    IGMP: EINVAL bind
    ... 

    In SockClose():

    /* If the socket is being closed; we need to ensure that all the multicast group
    * this socket has joined are left. */
    ptr_mcast_rec = (MCAST_SOCK_REC *)list_get_head ((LIST_NODE**)&ps->pMcastList);
    while (ptr_mcast_rec != NULL)
    {
       /* Leave the multicast group */
       IGMPLeave (h, &ptr_mcast_rec->mreq);
       /* Get the head; since the IGMPLeave will have deleted the entry from the list. */
       ptr_mcast_rec = (MCAST_SOCK_REC *)list_get_head ((LIST_NODE**)&ps->pMcastList);
    }

    And IGMPLeave() in my test returns immediately without touching the multicast list:

    hIf = BindIPHost2IF ((IPN)ptr_ipmreq->imr_interface.s_addr);
    if (!hIf)
    {
       return (EINVAL);
    }

    As a result, SockClose() gets stuck in this loop: IGMPLeave() returns EINVAL before removing the entry, so the list head never changes.
    I'm not sure my previous report about setsockopt + IP_DROP_MEMBERSHIP was correct. Looking at the source code, I don't see a problem there (it would not leave the group, but it would not get stuck in a loop either), and I ran that test only once, so I may have missed that it got stuck inside fdClose(), not inside setsockopt().
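
To make the failure mode concrete, here is a minimal, self-contained simulation of the SockClose() loop and the early return in IGMPLeave() quoted above. It is only a sketch: every name in it (Sock, McastRec, sim_join, sim_igmp_leave, sim_close_original, sim_close_guarded, g_bound_addr) is a hypothetical stand-in and not part of the NDK. The guarded variant is one possible way to break the loop, not a confirmed fix; in the real stack, discarding a record without a proper leave may leave other IGMP state behind.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the NDK structures; not the real definitions. */
typedef struct McastRec {
    struct McastRec *next;
    unsigned long    group;    /* multicast group address */
    unsigned long    if_addr;  /* interface address used at join time */
} McastRec;

typedef struct {
    McastRec *mcast_list;      /* memberships held by this socket */
} Sock;

#define SIM_EINVAL 22

/* Address currently bound to the interface; changes on a DHCP renew. */
unsigned long g_bound_addr;

/* Join: remember the group together with the address bound right now. */
int sim_join(Sock *s, unsigned long group)
{
    McastRec *rec = malloc(sizeof *rec);
    if (rec == NULL)
        return -1;
    rec->group = group;
    rec->if_addr = g_bound_addr;
    rec->next = s->mcast_list;
    s->mcast_list = rec;
    return 0;
}

/* Model of the early return: if the join-time address is no longer bound,
 * fail before the record is removed from the list. */
int sim_igmp_leave(Sock *s, McastRec *rec)
{
    McastRec **pp = &s->mcast_list;

    if (rec->if_addr != g_bound_addr)
        return SIM_EINVAL;

    while (*pp != NULL && *pp != rec)
        pp = &(*pp)->next;
    if (*pp != NULL) {
        *pp = rec->next;
        free(rec);
    }
    return 0;
}

/* Loop shaped like the quoted SockClose() code. max_iter exists only so the
 * simulation terminates. Returns the iteration count, or -1 if the cap was
 * reached with entries still on the list. */
int sim_close_original(Sock *s, int max_iter)
{
    int n = 0;
    while (s->mcast_list != NULL) {
        if (n == max_iter)
            return -1;
        sim_igmp_leave(s, s->mcast_list);
        n++;
    }
    return n;
}

/* Guarded variant (an assumption, not the NDK's code): if the leave fails,
 * discard the head record anyway so the loop always makes progress. */
int sim_close_guarded(Sock *s)
{
    int n = 0;
    while (s->mcast_list != NULL) {
        McastRec *head = s->mcast_list;
        if (sim_igmp_leave(s, head) != 0) {
            s->mcast_list = head->next;
            free(head);
        }
        n++;
    }
    return n;
}
```

Joining while one address is bound, changing g_bound_addr, and then calling sim_close_original() reproduces the endless "Leave / EINVAL bind" sequence from the log until the iteration cap is hit, while sim_close_guarded() empties the list in one pass.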

    Anyway, I'll probably have to set this problem aside for now, and in the next software release I'll just disable IGMP when DHCP is configured.

  • I vote to reopen this issue. I also experienced it today with NDK 2_24_03_35. It's heavily related to

    although the patch suggested there doesn't help with this problem here in this thread.

  • Can you open a new thread for this? You can reference this thread.

    Todd