crash after DHCP timeout handling (reconfigure from DHCP to static IP) only with Task.deleteTerminatedTasks=true

F. Brettschneider

Other Parts Discussed in Thread: SYSBIOS

Hi,

Another problem I have is a program crash that occurs when, after a DHCP timeout, I reconfigure the NDK stack from DHCP to a static address. The crash only happens with Task.deleteTerminatedTasks=true. With Task.deleteTerminatedTasks=false the same NDK stack reconfiguration works without problems.

Currently I use ndk_2_21_00_32 with bios_6_33_04_39, though the same happens with bios_6_34_02_18.

The stack is prepared with some CfgAddEntry calls applied with CfgSetDefault and started with NC_NetStart.

A DHCP timeout handler is started with Clock_start. When it expires, the handler function runs hCfg = CfgGetDefault(); CfgExecute(hCfg, 0);, removes the DHCP config with CfgGetEntry / CfgRemoveEntry, and applies a new configuration for a static IP with some calls of CfgAddEntry + CfgExecute. After that I reset the watchdog, but a while later the program crashes at a totally different code line, always the same one, as if some memory had been corrupted.

The whole thing has been stable for months with Task.deleteTerminatedTasks=false. But due to other problems I have to set Task.deleteTerminatedTasks=true now.

over 12 years ago

0 Tom Kopriva over 12 years ago

TI__Mastermind 20480 points

Hi,

On what hardware are you seeing this?

Are you using the global NDK configuration?

var ndkGlobal = useModule("ti.ndk.config.Global");
ndkGlobal.enableCodeGeneration = true;

Can you attach the Cfg* C arrays and the .cfg file?

0 Karl Wechsler over 12 years ago in reply to Tom Kopriva

TI__Mastermind 20805 points

Is this problem still open? I see a post on similar topic from your colleague that was posted a few hours after this.

Did the original issue from that post get resolved and now you are faced with this new issue?

http://e2e.ti.com/support/embedded/bios/f/355/p/224760/800709.aspx#800709

Regards,
-Karl-

0 F. Brettschneider over 12 years ago in reply to Karl Wechsler

Intellectual 480 points

Karl Wechsler said:

Is this problem still open? I see a post on similar topic from your colleague

It's still open. My test case doesn't delete any tasks, which is what triggered the problems seen there.

0 F. Brettschneider over 12 years ago in reply to Tom Kopriva

Intellectual 480 points

Tom Kopriva said:

Hi,

On what hardware are you seeing this?

DM6437. Besides the NDK library I also build in the CSL/NIMU-based ethdriver.c, csl_emac.c and csl_mdio, copied from the EVM board code.

Tom Kopriva said:

Are you using the global NDK configuration?

var ndkGlobal = useModule("ti.ndk.config.Global");
ndkGlobal.enableCodeGeneration = true;

No.

Tom Kopriva said:

Can you attach the Cfg* C arrays and the .cfg file?

I don't know what you mean by Cfg* C arrays. What do you want to see?

The program is running like this (very simplified code):

===== my.cfg =====
...snip...
Task.deleteTerminatedTasks = true;


var hooks = new Task.HookSet();
hooks.registerFxn = '&NDK_hookInit';
hooks.createFxn = '&NDK_hookCreate';
hooks.deleteFxn = '&my_Task_onDeleteOrExit';
hooks.exitFxn = '&my_Task_onExit';
Task.addHookSet(hooks);

var mainParams = new Task.Params();
mainParams.instance.name = "main";
mainParams.stackSize = 0x10000;
mainParams.stackSection = ".uninitializedDDR2";
Program.global.main = Task.create("&my_main", mainParams);

var netctrlParams = new Task.Params();
netctrlParams.instance.name = "netctrl";
netctrlParams.priority = 3;
netctrlParams.stackSize = 12288;
netctrlParams.stackSection = ".uninitializedDDR2";
Program.global.netctrl = Task.create("&my_initStack", netctrlParams);

var prdTimeoutDHCPParams = new Clock.Params();
prdTimeoutDHCPParams.instance.name = "prdTimeoutDHCP";
prdTimeoutDHCPParams.period = 0;
Program.global.prdTimeoutDHCP = Clock.create("&my_onTimeoutDHCP", 1000, prdTimeoutDHCPParams);

var prdNDKParams = new Clock.Params();
prdNDKParams.instance.name = "prdNdk";
prdNDKParams.period = 100;
prdNDKParams.startFlag = true;
Program.global.prdNdk = Clock.create("&llTimerTick", 1, prdNDKParams);

===== variables =====
Clock_Handle s_hPrdTimeoutDHCP;
bool s_bDhcpTimeout = false;
bool s_bNetworkConfigured = false;
int  s_nDHCPTimeoutSecs = 0;
...snip...

===== main.c =====
my_main()
{
    s_hPrdTimeoutDHCP = (Clock_Handle)prdTimeoutDHCP;

    stackConfig_semaphore.wake();

    while (!s_bNetworkConfigured) {
        my_watchdogReset();
        my_usecSleep(100000);

        if (s_bDhcpTimeout) {
            int oldPri = my_setTaskPri(4); // increase task priority
            my_switchFromDhcpToStaticIp();
            my_setTaskPri(oldPri);
            s_bDhcpTimeout = false;
            my_watchdogReset();
        }
    }

    ...
    >>>>somewhere later>>>CRASH<<<<<<<<<<<
}

===== my_network.c =====
my_initStack()
{
    HANDLE hCfg;
    int ret;
    NC_SystemOpen(NC_PRIORITY_LOW, NC_OPMODE_INTERRUPT);

    do {
        // create a new configuration
        hCfg = CfgNew();

        stackConfig_semaphore.wait();

        my_setupStackConfig(hCfg);

        s_nDHCPTimeoutSecs = 5;
        Clock_start(s_hPrdTimeoutDHCP);

        // blocks until NC_NetStop() is called; its argument becomes the return value
        ret = NC_NetStart( hCfg, my_NetworkOpen, my_NetworkClose, my_NetworkIPAddr );

        CfgExecute(hCfg, 0);
        CfgFree(hCfg);
    } while(ret == B_REINIT_STACK);
}

my_NetworkIPAddr()
{
    Clock_stop(s_hPrdTimeoutDHCP);
    s_bNetworkConfigured = true;
}

my_onTimeoutDHCP()
{
    if (--s_nDHCPTimeoutSecs) {
        Clock_start(s_hPrdTimeoutDHCP); return;
    }

    if (!my_socket_getLocalAddress(sIPAddr, 0)) {
        s_bDhcpTimeout = true;
    }
}

my_setupStackConfig(HANDLE hCfg)
{
    val = 180; // increased from default=2, this covers up to 256K
    CfgAddEntry( hCfg, CFGTAG_IP, CFGITEM_IP_TCPREASMMAXPKT, 0, sizeof(uint), (UINT8 *)&val, 0 );

    // add our global hostname to hCfg (to be claimed in all connected domains)
    CfgAddEntry( hCfg, CFGTAG_SYSINFO, CFGITEM_DHCP_HOSTNAME, 0,
                 strlen(s_pHostName), (UINT8 *)s_pHostName, 0 );
    {
        CI_SERVICE_DHCPC dhcpc;
        // Specify DHCP Service on IF-1
        bzero( &dhcpc, sizeof(CI_SERVICE_DHCPC) );
        dhcpc.cisargs.Mode   = CIS_FLG_IFIDXVALID;
        dhcpc.cisargs.IfIdx  = 1;
        dhcpc.cisargs.pCbSrv = &ServiceReport;

        CfgAddEntry( hCfg, CFGTAG_SERVICE, CFGITEM_SERVICE_DHCPCLIENT, 0,
                     sizeof(dhcpc), (UINT8 *)&dhcpc, 0 );
        s_curIdxManualMethod = B_MANUAL_METHOD_IDX_DHCP;
    }

    val = DBG_WARN;
    CfgAddEntry( hCfg, CFGTAG_OS, CFGITEM_OS_DBGPRINTLEVEL,
                 CFG_ADDMODE_UNIQUE, sizeof(uint), (UINT8 *)&val, 0 );

    val = 1;
    CfgAddEntry( hCfg, CFGTAG_OS, CFGITEM_OS_TASKPRILOW,
                 CFG_ADDMODE_UNIQUE, sizeof(uint), (UINT8 *)&val, 0 );

    val = 3;
    CfgAddEntry( hCfg, CFGTAG_OS, CFGITEM_OS_TASKPRINORM,
                 CFG_ADDMODE_UNIQUE, sizeof(uint), (UINT8 *)&val, 0 );

    val = 13;
    CfgAddEntry( hCfg, CFGTAG_OS, CFGITEM_OS_TASKPRIHIGH,
                 CFG_ADDMODE_UNIQUE, sizeof(uint), (UINT8 *)&val, 0 );

    val = 15;
    CfgAddEntry( hCfg, CFGTAG_OS, CFGITEM_OS_TASKPRIKERN,
                 CFG_ADDMODE_UNIQUE, sizeof(uint), (UINT8 *)&val, 0 );

    CfgSetDefault(hCfg);
}

my_switchFromDhcpToStaticIp()
{
    HANDLE hCfg;
    HANDLE hDhcpEntry, hRouteEntry, hIpEntry;
    CI_IPNET NA;
    CI_ROUTE RT;
    IPN      IPTmp;

    hCfg = CfgGetDefault();
    CfgExecute(hCfg, 0);

    // check if we are in DHCP mode
    if (CfgGetEntry(hCfg, CFGTAG_SERVICE, CFGITEM_SERVICE_DHCPCLIENT, 1, &hDhcpEntry) > 0) {
        CfgRemoveEntry(hCfg, hDhcpEntry);
    }

    // now, replace or add any IP or route settings
    if (CfgGetEntry(hCfg, CFGTAG_IPNET, 1, 1, &hIpEntry) > 0) { // Remove the address
        CfgRemoveEntry(hCfg, hIpEntry);
    }

    if (CfgGetEntry(hCfg, CFGTAG_ROUTE, 0, 1, &hRouteEntry) > 0) { // Remove the route
        CfgRemoveEntry(hCfg, hRouteEntry);
    }

    if (CfgGetEntry(hCfg, CFGTAG_SYSINFO, CFGITEM_DHCP_DOMAINNAMESERVER, 1, &hIpEntry) > 0) {
        CfgRemoveEntry(hCfg, hIpEntry);
    }

    // now, create static config
    my_createConfigStaticIp(&NA, &RT, &IPTmp, &s_staticIpConfig);

    // Add the new address to interface 1
    CfgAddEntry( hCfg, CFGTAG_IPNET, 1, 0,
                       sizeof(CI_IPNET), (UINT8 *)&NA, 0 );

    // Add the route
    CfgAddEntry( hCfg, CFGTAG_ROUTE, 0, 0,
                       sizeof(CI_ROUTE), (UINT8 *)&RT, 0 );

    // Manually add the DNS server when specified
    if( IPTmp ) {
        CfgAddEntry( hCfg, CFGTAG_SYSINFO, CFGITEM_DHCP_DOMAINNAMESERVER,
                     0, sizeof(IPTmp), (UINT8 *)&IPTmp, 0 );
    }

    CfgExecute(hCfg, 1); // apply changes in configuration
}

I've only shown you the NDK-relevant settings of my .cfg file. Is this enough for you?

My application crashes later at the marked point in my main function (see above), on a simple assignment to a member of a heap-allocated struct in a totally different part of the code. After that I lose the connection to the target.

The last thing I tried was an update to ndk_2_22_00_06. The effect I'm seeing now is that I lose the connection to the target somewhere later. It again happens in different code, always at the same place. Remarkably, it's always a system call, namely within GateMutex_enter(). If I single-step at assembly level ("assembly step into") I can avoid losing the target connection, but as soon as I step over normally, it's lost.

0 Lars Beikirch over 12 years ago in reply to F. Brettschneider

Intellectual 615 points

Hello all,

I'm a colleague of F. Brettschneider and have started looking into the issue described above.

Again a short overview of our application:

CCS 5.2.0.00069
SYS/BIOS 6.34.02.18
XDCTOOLS 3.24.03.33
NDK 2.21.00.32
No network cable connected to our device at all
Start of network stack with "DHCP enabled" by:
CI_SERVICE_DHCPC dhcpc;
HANDLE hCfg = CfgNew();
...
CfgAddEntry( hCfg, CFGTAG_SERVICE, CFGITEM_SERVICE_DHCPCLIENT, 0, sizeof(dhcpc), (UINT8 *)&dhcpc, 0 );
...
NC_NetStart( hCfg, NetworkOpen, NetworkClose, NetworkIPAddr );
After a timeout (implemented in our application) we want to switch to a static IP address if we didn't yet get an IP address by DHCP:
HANDLE hCfg = CfgGetDefault();
CfgExecute(hCfg, 0);
CfgRemoveEntry(hCfg, hDhcpEntry);
CfgAddEntry( hCfg, CFGTAG_IPNET, 1, 0, sizeof(CI_IPNET), (UINT8 *)..., 0 );
CfgAddEntry( hCfg, CFGTAG_ROUTE, 0, 0, sizeof(CI_ROUTE), (UINT8 *)..., 0 );
CfgAddEntry( hCfg, CFGTAG_SYSINFO, CFGITEM_DHCP_DOMAINNAMESERVER, 0, sizeof(IPTmp), (UINT8 *)..., 0 );
CfgExecute(hCfg, 1);

What I understood so far is the following:

At first NC_NetStart() invokes a DHCPclient task inside the NDK. This DHCP client task finally pends (with timeout) on a semaphore by the following call stack:
    dhcpState()
        StateSelecting()
            dhcpPacketReceive()
                recv()
                    SockRecv()
                        FdWaitEvent()
                            fdint_waitevent()
                                SemPend(pfdt->hSem)
In particular, Semaphore_pend() is called with a timeout. For that reason a Clock_Struct is created as a local variable (thus on the stack of the DHCP client task (!)) and is added to a queue of timer callback objects by Clock_construct().
When we try to disable DHCP, CfgExecute(hCfg, 0) finally calls DHCPClose(), which calls fdCloseSession() and TaskDestroy() (-> Task_delete()) for the DHCP client task.
fdCloseSession() does not delete the internal FDT (and thus does not post the semaphore the DHCP client is pending on) since FdWaitEvent() incremented the reference counter for the FDT.
Task_delete() now cleans up the DHCP client task and deallocates its stack. But the Clock_Struct instance which has been queued in the clock's queue is not removed from the queue properly (since the DHCP client task is still pending on the semaphore). Thus this Clock_Struct instance is now somewhere in "unused" (free) memory.
At some point the memory where this Clock_Struct instance is located will be used as part of a newly allocated memory block, and when it is overwritten the clock's queue of timer callbacks gets broken.
When the next timer tick occurs (Clock_workFunc()), traversing the clock's queue by elem = Queue_next(elem); usually gets stuck in an infinite loop, typically because at some point elem == 0x00000001 and elem->next == 0x00000001. But this is only a consequential error...

This problem does not occur if the system config entry deleteTerminatedTasks isn't true, obviously because the stack is not deleted and thus the Clock_Struct instance is not overwritten. I didn't check what happens to the task pending on the semaphore while Task_delete() is called...
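
The failure mode described above can be reproduced in miniature with any intrusive queue. The following is a minimal sketch in plain C, not SYS/BIOS code: Elem, timerQueue, queue_put(), queue_remove(), pend_with_timeout() and wait_times_out() are hypothetical names that only stand in for Clock_Struct, the clock queue and Semaphore_pend(). It shows why an element that lives on a task stack has to be unlinked before that stack memory is released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Intrusive doubly linked queue element (hypothetical stand-in for a
 * clock object that is linked into a global timer queue). */
typedef struct Elem {
    struct Elem *next;
    struct Elem *prev;
} Elem;

/* Sentinel head: points to itself while the queue is empty. */
Elem timerQueue = { &timerQueue, &timerQueue };

/* Append e at the tail of queue q. */
void queue_put(Elem *q, Elem *e)
{
    e->next = q;
    e->prev = q->prev;
    q->prev->next = e;
    q->prev = e;
}

/* Unlink e from whatever queue it is in. */
void queue_remove(Elem *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = e->prev = e;
}

bool queue_empty(const Elem *q)
{
    return q->next == q;
}

/* Example wait function: the timeout element is linked while we wait. */
bool wait_times_out(void)
{
    assert(!queue_empty(&timerQueue));
    return false; /* nothing arrived, the wait timed out */
}

/* Models a pend with timeout. The timeout element is a local variable,
 * so it must be unlinked before this function returns (or before the
 * owning stack is freed by anyone else). */
bool pend_with_timeout(bool (*wait)(void))
{
    Elem timeout; /* lives on the caller's stack */

    queue_put(&timerQueue, &timeout);
    bool got = wait();
    queue_remove(&timeout); /* skipping this leaves a dangling pointer */
    return got;
}
```

If the owner of the stack is destroyed between queue_put() and queue_remove(), the queue keeps pointing into freed memory, which is the situation suspected in this post.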

To me it looks like a conceptual problem at the moment. IMHO any attempt to configure IP address assignment by DHCP and to stop it later risks falling into this trap.

Does anybody have a hint what to do? Either some special option to make this use case work, or another approach that avoids the problem altogether?

Thanks, Lars

0 Lars Beikirch over 12 years ago in reply to Lars Beikirch

Intellectual 615 points

I just tried replacing the modification of the current configuration while the NDK is active (CfgExecute(hCfg, 0), change hCfg, CfgExecute(hCfg, 1)) with a complete restart of the NDK stack (NC_NetStop(), discard old hCfg, create new hCfg, NC_NetStart()).

This way the DHCP client task seems to be shut down properly (leaving Semaphore_pend() before Task_delete()) and my application no longer seems to crash. But it looks like the second call to NC_NetStart() doesn't work: the callback for "IP address assigned" is not called, although I'm quite sure the static configuration is valid and it should be called...

Any idea on that?

Thanks, Lars

0 Lars Beikirch over 12 years ago in reply to Lars Beikirch

Intellectual 615 points

Lars Beikirch said:

I just tried replacing the modification of the current configuration while the NDK is active (CfgExecute(hCfg, 0), change hCfg, CfgExecute(hCfg, 1)) with a complete restart of the NDK stack (NC_NetStop(), discard old hCfg, create new hCfg, NC_NetStart()).

This way the DHCP client task seems to be shut down properly (leaving Semaphore_pend() before Task_delete()) and my application no longer seems to crash. But it looks like the second call to NC_NetStart() doesn't work: the callback for "IP address assigned" is not called, although I'm quite sure the static configuration is valid and it should be called...

Some additional info on the "restart" problem:

After I called NC_NetStart() for the second time, the NDK's llTimerTick() keeps posting a semaphore (STKEVENT_signal( hEvent, STKEVENT_TIMER, 1 )), but no task consumes it. So the semaphore count keeps increasing until it finally triggers the assertion "ti.sysbios.knl.Semaphore: line 331: assertion failure: A_overflow: Count has exceeded 65535 and rolled over."
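
To illustrate why an unconsumed periodic post ends in that assertion, here is a minimal sketch of a counting semaphore whose count is modelled as a 16-bit value, matching the limit of 65535 named in the assertion text. CountingSem, sem_post(), sem_try_pend() and posts_until_overflow() are hypothetical names for this illustration, not the SYS/BIOS implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counting semaphore with a 16-bit count. */
typedef struct {
    uint16_t count;
} CountingSem;

/* Post once. Returns false when the count wraps around to 0, which is
 * the overflow condition the assertion reports. */
bool sem_post(CountingSem *s)
{
    assert(s != 0);
    s->count = (uint16_t)(s->count + 1u);
    return s->count != 0;
}

/* Non-blocking pend: consumes one count if available. */
bool sem_try_pend(CountingSem *s)
{
    assert(s != 0);
    if (s->count == 0) {
        return false;
    }
    s->count--;
    return true;
}

/* Number of posts that succeed before the count rolls over when no
 * task ever pends on the semaphore. */
uint32_t posts_until_overflow(void)
{
    CountingSem s = { 0 };
    uint32_t posts = 0;

    while (sem_post(&s)) {
        posts++;
    }
    return posts;
}
```

With no consumer, the 65536th post wraps the counter. How long that takes on the target depends on the timer tick period, which is not stated here.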

Lars

0 Lars Beikirch over 12 years ago in reply to Lars Beikirch

Intellectual 615 points

Lars Beikirch said:

I just tried replacing the modification of the current configuration while the NDK is active (CfgExecute(hCfg, 0), change hCfg, CfgExecute(hCfg, 1)) with a complete restart of the NDK stack (NC_NetStop(), discard old hCfg, create new hCfg, NC_NetStart()).

This way the DHCP client task seems to be shut down properly (leaving Semaphore_pend() before Task_delete()) and my application no longer seems to crash. But it looks like the second call to NC_NetStart() doesn't work: the callback for "IP address assigned" is not called, although I'm quite sure the static configuration is valid and it should be called...

Some additional info on the "restart" problem:

Here is the latest news on that:

We use the ethdriver.c from the ndk\hal\evmdm6437\eth_dm6437 directory
The first problem was self-made: when the Ethernet driver reads the MAC+INT config by calling DM64LCEMAC_getConfig(), our implementation of DM64LCEMAC_getConfig() blocked when called for the second time. After fixing this I ran into the next problem:
The second call of NC_NetStart() crashed in HwPktOpen() calling Interrupt_init() because the interrupt had already been registered. I saw that Interrupt_end() (called in HwPktClose() when shutting down the NDK) only disabled the interrupt, but did not unregister it. Thus it's obvious that calling "init" again will cause trouble. I wonder if this is a bug in ethdriver.c. I changed the implementation of Interrupt_end() to unregister the interrupt as well. And then...
... I ran into other unexpected behaviour: the NDK and/or my application still doesn't seem to come up. I'm still investigating this situation in detail...
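
The init/end mismatch described above can be sketched as follows. This is a self-contained model, not the driver code: vectorTable, irq_init(), irq_end_disable_only(), irq_end() and dummy_isr() are hypothetical names chosen for this illustration. The point is only that a teardown which leaves the handler registered makes the next init fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_VECTORS 16

typedef void (*IsrFxn)(void);

static IsrFxn vectorTable[NUM_VECTORS];   /* NULL means "not registered" */
static bool   vectorEnabled[NUM_VECTORS];

/* Register and enable a handler. Fails if the vector is already taken,
 * which models the trouble on the second start of the stack. */
bool irq_init(int vec, IsrFxn fxn)
{
    if (vec < 0 || vec >= NUM_VECTORS || fxn == NULL) {
        return false;
    }
    if (vectorTable[vec] != NULL) {
        return false; /* already registered */
    }
    vectorTable[vec] = fxn;
    vectorEnabled[vec] = true;
    return true;
}

/* Teardown that only disables the interrupt (the reported behaviour). */
void irq_end_disable_only(int vec)
{
    assert(vec >= 0 && vec < NUM_VECTORS);
    vectorEnabled[vec] = false;
}

/* Teardown that disables and unregisters, so init can run again. */
void irq_end(int vec)
{
    assert(vec >= 0 && vec < NUM_VECTORS);
    vectorEnabled[vec] = false;
    vectorTable[vec] = NULL;
}

/* Placeholder handler for the example. */
void dummy_isr(void)
{
}
```

After irq_end_disable_only() a second irq_init() on the same vector is rejected, while after irq_end() it succeeds.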

Does anyone have new ideas from this description?

Thanks, Lars

0 Lars Beikirch over 12 years ago in reply to Lars Beikirch

Intellectual 615 points
Lars Beikirch said:

The second call of NC_NetStart() crashed in HwPktOpen() calling Interrupt_init() because the interrupt had already been registered. I saw that Interrupt_end() (called in HwPktClose() when shutting down the NDK) only disabled the interrupt, but did not unregister it. Thus it's obvious that calling "init" again will cause trouble. I wonder if this is a bug in ethdriver.c. I changed the implementation of Interrupt_end() to unregister the interrupt as well. And then...

... I ran into other unexpected behaviour: the NDK and/or my application still doesn't seem to come up. I'm still investigating this situation in detail...

The next news:

It looks like calling NC_NetStart() a second time (after cancelling the first by NC_NetStop()) has several issues. As mentioned before it doesn't work at all with the default ethdriver.c code because it tries to register the IRQ twice with the second call. After my fix mentioned above (unregistering the IRQ on NDK shutdown) I don't get a crash/assert any more, but my network still doesn't come up properly. In particular:

To focus on the "calling NC_NetStart() twice" problem I turned off DHCP and switched to static IP configuration

I always start up with no network cable connected to my device

After calling NC_NetStart() the NIC seems to be configured well always (NetworkIPAddr() func provided to NC_NetStart() is called with args "interface added")

Since no network cable is connected my link LED is off and I see frequent "NO PHY CONNECTED" messages on the console

If I call NC_NetStart() only once everything is fine: When connecting the network cable the "NO PHY CONNECTED" messages stop, the link LED turns on and I can see my device in the network.

If I call NC_NetStop() after calling NC_NetStart() the first time and call NC_NetStart() again the network doesn't come up properly: The NetworkIPAddr() func provided to NC_NetStart() is called with args "interface added" a second time, but when connecting the network cable the "NO PHY CONNECTED" messages continue, the link LED doesn't turn on and I still can't see my device in the network. It looks like the NDK has a problem accessing the PHY. And moreover it seems to hang up the PHY (or the connection to it) completely: When I restart my application and call NC_NetStart() only once I always get "NO PHY CONNECTED" messages and I can't see my device on the network (with "CPU reset" before and even with network cable connected at startup). As mentioned above this works in general. I must power off my device to get out of this state.

Has anybody successfully used repeated NC_NetStart()/NC_NetStop() calls on an EVM6437?

Any ideas? Do we have some NDK issues here???

Thanks, Lars

0 Lars Beikirch over 12 years ago in reply to Lars Beikirch

Intellectual 615 points
Okay, since I'm still stuck with the "calling NC_NetStart() twice" problem I went back to the roots: My initial problem was, that the system crashed because calling CfgExecute(hCfg, 0) corrupted the internal clock queue by deleting a task pending (with timeout) on a semaphore. Since I saw that NC_NetStop() causes CfgExecute(hCfg, 0) to be called as well and this works fine I wondered what the difference would be. After a time of investigation I understood that:

When I try to stop DHCP by NC_NetStop() the CfgExecute(hCfg, 0) call comes from the "netctrl" task which has a lower priority than the DHCPclient task. Thus the DHCPClose() call to fdCloseSession() readies the DHCPclient task pending on the fdt semaphore and since it has higher priority it continues and leaves the Semaphore_pend() function. DHCPClose() calls Task_delete() after that.

When I try to stop DHCP by a direct call of CfgExecute(hCfg, 0) from my application, I'm calling it from a task with a priority higher than (or equal to, it doesn't matter) the priority of the DHCPclient task. Thus the DHCPClose() call to fdCloseSession() readies the DHCPclient task pending on the fdt semaphore as well, but since it doesn't have higher priority it doesn't continue and is still inside the Semaphore_pend() function. DHCPClose() calls Task_delete() immediately after fdCloseSession() and the behaviour is as described above.
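
The two cases above boil down to one scheduling rule, sketched here as a tiny model. It assumes a strict priority-preemptive scheduler in which a larger number means higher priority. readied_task_preempts(), close_session_then_delete() and the DeleteOutcome values are hypothetical names for this illustration, and the priority numbers in the usage note are only examples.

```c
#include <assert.h>
#include <stdbool.h>

/* A task that becomes ready runs at once only if its priority is
 * strictly higher than that of the currently running task. */
bool readied_task_preempts(int runningPri, int readiedPri)
{
    return readiedPri > runningPri;
}

typedef enum {
    DELETED_AFTER_PEND_RETURNED, /* victim left the pend before delete */
    DELETED_WHILE_STILL_IN_PEND  /* victim's timeout object still queued */
} DeleteOutcome;

/* Models "close the session, then delete the task" as done on shutdown
 * of the DHCP client, for a given caller and DHCP task priority. */
DeleteOutcome close_session_then_delete(int callerPri, int dhcpPri)
{
    assert(callerPri > 0 && dhcpPri > 0);

    /* Step 1: closing the session readies the pending DHCP task. */
    bool dhcpRanBeforeDelete = readied_task_preempts(callerPri, dhcpPri);

    /* Step 2: the caller deletes the task immediately afterwards. */
    return dhcpRanBeforeDelete ? DELETED_AFTER_PEND_RETURNED
                               : DELETED_WHILE_STILL_IN_PEND;
}
```

For example, with a DHCP task at priority 3, a caller at priority 1 gives DELETED_AFTER_PEND_RETURNED, while a caller at priority 3 or 4 gives DELETED_WHILE_STILL_IN_PEND.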

Well, so I thought I simply had to lower the priority of my task calling CfgExecute(hCfg, 0) and everything would be fine. But I ran into the next trouble. :-( The implementation of the DHCP client's StateSelecting() doesn't check whether the DHCP configuration has been removed in the meantime, so it ends up busy waiting!!! Here is the code snippet:

static void StateSelecting(DHCPLEASE *pLease)
{
    ...
    MaxTries = 3;
Retry:
    ...
    // Build the DHCP request packet and Send it
    pLease->SendSize = dhcpBuildDiscover(pLease);
    dhcpPacketSend( pLease, INADDR_BROADCAST );

    // Get the time
    TimeStart = llTimerGetTime(0);

    while( (TimeStart + 2) >= llTimerGetTime(0) && nAlive )
    {
        // Get reply (waits for 3 seconds)
        dhcpPacketReceive(pLease);

        if( dhcpVerifyMessage( pLease, &IPOffer, &IPServer ) == DHCPOFFER)
        {
            pLease->IPAddress = IPOffer;
            pLease->IPServer = IPServer;
            pLease->StateNext = REQUESTING;
            return;
        }
    }

    // Timeout - try again
    if( --MaxTries )
        goto Retry;
    ...
}

Since fdCloseSession() has already been called, dhcpPacketReceive() returns immediately because recv() can't call fdint_lockfd() successfully. There is no mechanism to stop this until the timeout of 3 x 3 seconds has been reached. Thus CfgExecute(hCfg, 0) may block for up to 9 seconds! And since it doesn't pend on anything, it hogs the CPU!

In my particular case it blocked for about 4 seconds (my internal timeout was 5 sec, thus there were 4 of the 9 seconds remaining). During this time I couldn't service the watchdog - so my system went into a watchdog restart. :-(
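
The numbers above follow from simple arithmetic, shown here as a small sketch. worst_case_wait_s() and remaining_wait_s() are hypothetical helper names. The inputs (3 tries, about 3 seconds per try, 5 seconds already elapsed) are the values given in this post.

```c
#include <assert.h>

/* Worst-case busy-wait time in seconds: every try runs to its timeout. */
int worst_case_wait_s(int maxTries, int perTryTimeoutS)
{
    assert(maxTries >= 0 && perTryTimeoutS >= 0);
    return maxTries * perTryTimeoutS;
}

/* Time still left in the retry sequence once elapsedS seconds of it
 * have already passed. Never negative. */
int remaining_wait_s(int worstCaseS, int elapsedS)
{
    assert(worstCaseS >= 0 && elapsedS >= 0);
    return (elapsedS >= worstCaseS) ? 0 : (worstCaseS - elapsedS);
}
```

worst_case_wait_s(3, 3) gives 9, and remaining_wait_s(9, 5) gives the 4 seconds during which the watchdog could not be serviced.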

I added some check in the loop of StateSelecting() cited above which indirectly checks if the file descriptor session has already been closed and cancels the loop in that case. After doing so CfgExecute(hCfg, 0) doesn't block for such a long time and everything is fine for my application.

So I think I finally found a workaround for my problem, but I still think I discovered 3 problems (I don't want to call them bugs right now) in the NDK:

Summary:

CfgExecute(hCfg, 0) must be called from a task with lower priority than the DHCPclient task if DHCP is enabled and deleteTerminatedTasks is true. I think this should at least be mentioned in the documentation. I didn't see it there. Probably the same applies to CfgRemoveEntry() if the DHCP entry is to be removed from an active configuration.

CfgExecute(hCfg, 0) may block and hog the CPU for up to 9 seconds if DHCP is disabled by this call. I think this should be fixed in the NDK code (StateSelecting() in dhcpsm.c). Probably the same applies to CfgRemoveEntry() if the DHCP entry is to be removed from an active configuration.

Calling NC_NetStart() again after NC_NetStop() has been called doesn't work, at least with the EVMDM6437 Ethernet driver code. Interrupts are registered twice when NC_NetStart() is called a second time, causing an assertion, and access to the PHY doesn't work properly either once the IRQ registration problem is fixed. I think this should be fixed in the NDK code. (I refer to the code provided in the ndk\src\hal\evmdm6437\eth_dm6437 directory of NDK 2.0.0.)

I would greatly appreciate it if a skilled TI employee could evaluate the problems I described and maybe file a bug report if I'm right.

Thanks, Lars

0 Steven Connell over 12 years ago in reply to Lars Beikirch

TI__Mastermind 45025 points

Hi Lars,

Lars Beikirch said:
For that reason a Clock_Struct is created as local var (thus on the stack of the DHCP client task (!)) and is added to a queue of timer callback objects by Clock_construct().

The fact that this runs in Task context has been discussed recently, as for other hardware platforms the DHCP client task stack is a waste of valuable space. In the brief discussion I had with one of my colleagues, we couldn't see any good reason for the design choice to run the dhcpState function in Task context.

Lars Beikirch said:
In particular Semaphore_pend() is called with a timeout.

Looking at the code, I see that the timeout is coming from the UDP socket that the DHCP client is using:

    /* Set the SOCK recv timeout to be 3 seconds */
    TimeWait.tv_sec = 3;
    TimeWait.tv_usec = 0;

    if (setsockopt(pLease->Sock, SOL_SOCKET, SO_RCVTIMEO, &TimeWait, sizeof(TimeWait)) < 0) {
        rc = 5;
        goto sockError;
    }

One option which may work for you is to change that timeout value. Unfortunately this is not configurable, but it could be achieved with a recompile of the DHCP client code.
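
For readers unfamiliar with this socket option, here is a minimal sketch of the same receive-timeout technique using standard POSIX sockets on a host system. It is not NDK code, and open_udp_with_rcv_timeout() and get_rcv_timeout_s() are hypothetical helper names. It sets SO_RCVTIMEO on a UDP socket and reads the value back.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Open a UDP socket and set its receive timeout in whole seconds.
 * Returns the socket descriptor, or -1 on failure. */
int open_udp_with_rcv_timeout(long seconds)
{
    assert(seconds >= 0);

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) {
        return -1;
    }

    struct timeval tv;
    tv.tv_sec = seconds;
    tv.tv_usec = 0;

    if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Read back the receive timeout in seconds, or -1 on failure. */
long get_rcv_timeout_s(int fd)
{
    struct timeval tv;
    socklen_t len = sizeof(tv);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, &len) < 0) {
        return -1;
    }
    return (long)tv.tv_sec;
}
```

With such a timeout, a blocking recv() on the socket gives up after the configured time instead of waiting forever, which is how the 3 second wait in the DHCP client arises.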

Yet another option is to change the code so that dhcpState doesn't run as a Task. Basically you would just call the dhcpState() function instead of calling TaskCreate. This would also require a re-compile.

I'll file a bug for this to ensure that it isn't lost.

In the meantime, we'll continue to work with you to help you get past this.

Steve

0 Steven Connell over 12 years ago in reply to Steven Connell

TI__Mastermind 45025 points

I see that there's a related bug already:

SDOCM00046855|Submitted|Karl Wechsler|||Add an API to automate DHCP configuration (use of default address and timeout if DHCP server doesn't respond)|Other||SA_NDK|

I've also filed this one:

SDOCM00097776 DHCP client should not run in its own task

Steve

0 Lars Beikirch over 12 years ago in reply to Steven Connell

Intellectual 615 points

Hi Steve,

first of all, thanks for your answer.

Steven Connell said:
One option which may work for you is to change that timeout value. Unfortunately this is not configurable, but it could be achieved with a recompile of the DHCP client code.

This wouldn't help much. It would only shorten the critical period during which a DHCP deconfiguration causes trouble, but it wouldn't reduce it to zero...

Steven Connell said:
Yet another option is to change the code so that dhcpState doesn't run as a Task. Basically you would just call the dhcpState() function instead of calling TaskCreate.

Just from thinking about it (I didn't try because I have a working solution now), I'm afraid this wouldn't be a good choice. The DHCPclient task is created in DHCPOpen(), and I think the call stack to there is SPService() -> ServiceSpawn() -> DHCPOpen(). As far as I remember, SPService() is called either by CfgExecute(hCfg, 1) or by the netctrl task after NC_NetStart(). So I think in that case either CfgExecute(hCfg, 1) or the netctrl task would block for up to 9 sec if there is no DHCP server response. And I'm afraid this is not the intention... In addition, I think DHCPClose() would have no chance to stop the DHCP configuration on request.

As mentioned in my previous post - I cancelled the waiting loop in StateSelecting() by replacing

while( (TimeStart + 2) >= llTimerGetTime(0) )

by

while( (TimeStart + 2) >= llTimerGetTime(0) && fdint_getfdt(0L) )

in dhcpsm.c line 69. I think this is a better choice to avoid blocking of CfgExecute(hCfg, 0).
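
The effect of the extra loop condition can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the NDK code: Sim, sim_receive() and selecting_loop() are hypothetical names, the sessionOpen flag stands in for fdint_getfdt(0L) returning non-NULL, time is simulated in milliseconds, and a failed receive is assumed to cost 1 ms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    unsigned long nowMs;     /* simulated clock */
    bool sessionOpen;        /* stand-in for fdint_getfdt(0L) != NULL */
    unsigned long recvCalls; /* number of receive attempts made */
} Sim;

/* Simulated receive: blocks for the 3 s socket timeout while the
 * session is open, otherwise fails at once (assumed cost: 1 ms). */
void sim_receive(Sim *s)
{
    assert(s != NULL);
    s->recvCalls++;
    s->nowMs += s->sessionOpen ? 3000UL : 1UL;
}

/* Models the retry structure: 3 tries, each polling for about 3 s.
 * With checkSession set, the inner loop also stops as soon as the
 * session is closed. Returns the simulated time spent in ms. */
unsigned long selecting_loop(Sim *s, bool checkSession)
{
    assert(s != NULL);
    unsigned long begin = s->nowMs;

    for (int tries = 3; tries > 0; tries--) {
        unsigned long start = s->nowMs;
        while (s->nowMs - start < 3000UL &&
               (!checkSession || s->sessionOpen)) {
            sim_receive(s);
        }
    }
    return s->nowMs - begin;
}
```

With the session closed and no check, the model spins through 9000 failed receives over 9 simulated seconds. With the check it leaves immediately without a single receive call.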

BTW:

What do you think about the problems with calling NC_NetStart() twice?

Lars

0 Steven Connell over 12 years ago in reply to Lars Beikirch

TI__Mastermind 45025 points

Hi Lars,

You make some good points. I'll have to think about that some more and I'll also make sure to note your use case in the bug report.

I noticed above that you mentioned you have a working solution. I'm wondering if you have gotten past your issue? If not please let me know.

Steve

0 Lars Beikirch over 12 years ago in reply to Steven Connell

Intellectual 615 points
Hi Steve,

Steven Connell said:
You make some good points. I'll have to think about that some more and I'll also make sure to note your use case in the bug report.

thanks.

Steven Connell said:
I noticed above that you mentioned you have a working solution. I'm wondering if you have gotten past your issue?

My "workaround" mainly consists of two changes:

I reduced the priority of my task calling CfgExecute(hCfg, 0). This avoided the original crash, which was caused by the NDK deleting a task while it is pending on a semaphore. But this raised another issue: my calling task was blocked for several seconds, which prevented it from servicing the watchdog. Thus:

I modified the NDK stack as written in my previous post by checking fdint_getfdt(0L) in the DHCP client task as well while waiting for a DHCP response. This cut the blocking of my calling task.

Well, this works for us now but it is a bit uncomfortable for us to maintain a modified NDK stack (synchronizing several colleagues and the automated build system, take care on updates, ...) - especially since we didn't plan this and the NDK lib is not part of our source code version control.

Lars

0 Steven Connell over 12 years ago in reply to Lars Beikirch

TI__Mastermind 45025 points

Hi Lars,

You should be able to call NC_NetStart() twice, however it should be done after the original call to NC_NetStart() has returned.   This is how the stack is made to be reboot-able or even shut down.

The following example code shows this (taken from the NDK example stack thread code):

    /*
     * Boot the system using this configuration
     *
     * We keep booting until the function returns 0. This allows
     * us to have a "reboot" command.
    */
    do
    {
        rc = NC_NetStart(hCfg, ti_ndk_config_Global_NetworkOpen,
                         ti_ndk_config_Global_NetworkClose,
                         ti_ndk_config_Global_NetworkIPAddr);
    } while( rc > 0 );

Once NC_NetStop() is called, its argument will be returned here and stored as rc. So if you call NC_NetStop(1), then rc will be set to 1 and the loop shown above will repeat and call NC_NetStart() again (rebooting the stack). If you call it with an argument of 0, the loop exits (shutdown case). Are you calling NC_NetStart() again like this? Or is it called before the first call has returned?
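
The control flow of this reboot loop can be sketched without the NDK as follows. FakeStack, fake_net_start() and run_boot_loop() are hypothetical names. fake_net_start() simply returns a scripted sequence of values, standing in for the argument that would be passed to NC_NetStop() during each run.

```c
#include <assert.h>
#include <stddef.h>

/* Scripted stand-in for the stack: each "start" returns the next stop
 * code, as if the stop call had been made with that argument. */
typedef struct {
    const int *stopCodes;
    size_t count;
    size_t next;
} FakeStack;

int fake_net_start(FakeStack *s)
{
    assert(s != NULL && s->next < s->count);
    return s->stopCodes[s->next++];
}

/* Same shape as the boot loop above: keep restarting while the return
 * code is greater than 0. Returns how often the stack was started. */
size_t run_boot_loop(FakeStack *s)
{
    size_t boots = 0;
    int rc;

    do {
        rc = fake_net_start(s);
        boots++;
    } while (rc > 0);

    return boots;
}
```

With the scripted codes 1, 1, 0 the loop starts the stack three times: two reboots followed by a final shutdown.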

Lars Beikirch said:
Well, this works for us now but it is a bit uncomfortable for us to maintain a modified NDK stack (synchronizing several colleagues and the automated build system, take care on updates, ...) - especially since we didn't plan this and the NDK lib is not part of our source code version control.

I understand your frustration on this. I'll see what I can do about getting this into the next NDK release in 1Q2013. Since changing the DHCP client to not be a separate task doesn't help you, I've filed a new bug for this:

SDOCM00098066 DHCP Task may perform busy waiting for up to 9 seconds

Steve

0 Lars Beikirch over 12 years ago in reply to Steven Connell

Intellectual 615 points

Hi Steve,

Steven Connell said:

You should be able to call NC_NetStart() twice, however it should be done after the original call to NC_NetStart() has returned.   This is how the stack is made to be reboot-able or even shut down.

The following example code shows this (taken from the NDK example stack thread code):

    /*
     * Boot the system using this configuration
     *
     * We keep booting until the function returns 0. This allows
     * us to have a "reboot" command.
    */
    do
    {
        rc = NC_NetStart(hCfg, ti_ndk_config_Global_NetworkOpen,
                         ti_ndk_config_Global_NetworkClose,
                         ti_ndk_config_Global_NetworkIPAddr);
    } while( rc > 0 );

Once NC_NetStop() is called, its argument is returned here and stored in rc. So if you call NC_NetStop(1), rc is set to 1 and the loop shown above repeats and calls NC_NetStart() again (rebooting the stack). If you call it with an argument of 0, the loop exits (the shutdown case). Are you calling NC_NetStart() again like this, or before the first call has returned?

I did it exactly this way, with the result described in my previous posts. I think in that case the problem isn't in the NDK stack but in the EVMDM6437 Ethernet driver code. Unfortunately, this part was originally implemented by a colleague of mine who left our company a while ago, so I can't ask him where he got the code from; it is now part of our source code under version control. If you gave me a hint where to find the original code, I could check whether we applied any modifications to it.

Steven Connell said:

I'll see what I can do about getting this into the next NDK release in 1Q2013. Since changing the DHCP client to not be a separate task doesn't help you, I've filed a new bug for this:

SDOCM00098066 DHCP Task may perform busy waiting for up to 9 seconds

Thanks a lot. I think this describes very well one of the real problems I need resolved.

Lars

0 Steven Connell over 12 years ago in reply to Lars Beikirch

TI__Mastermind 45025 points

Lars,

Lars Beikirch said:
If you gave me a hint where to find the original code I could check if we applied some modifications on it or not.

It could have come from a number of places but it was most likely from the NDK 2.0.0 product (as this is the last version that shipped the DM6437 driver). Looking back at the code posted previously in the thread, it doesn't appear to be simply a copy and paste of our examples, though. But it would still be worthwhile to look at the examples there.

Do you have that version handy? If not you can find it on the download page:

http://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/ndk/index.html

The most common example that people use is called 'client'. In fact, the client example has a console mode that you can telnet in to. Once in, you can run a 'reboot' command that will reboot the stack. You may want to give that example a try to see if you can reboot it, since it should use the same driver.

Do you know if the driver you are using came from the NDK 2.0.0?

Steve

0 Lars Beikirch over 12 years ago in reply to Steven Connell

Intellectual 615 points

Hi Steve,

Steven Connell said:
Do you know if the driver you are using came from the NDK 2.0.0?

I just compared our active code with the eth_dm6437 code shipped with the NDK: ethdriver.c and csl_mdio.c have some minor modifications to access our "data LED" and to adapt them to SYS/BIOS 6; csl_emac.c and nimu_eth.c are exactly the same. I think the first problem, registering the interrupts twice, originally comes from there. I quote from my previous post:

Lars Beikirch said:

The second call of NC_NetStart() crashed in HwPktOpen() calling Interrupt_init() because the interrupt had already been registered. I saw that Interrupt_end() (called in HwPktClose() when shutting down the NDK) only disabled the interrupt, but did not unregister it. Thus it's obvious that calling "init" again will cause trouble. I wonder if this is a bug in ethdriver.c. I changed the implementation of Interrupt_end() to unregister the interrupt as well. And then...

I had a quick look at the client.c example. At first glance I see only one general difference. The main control flow of the example is:

    NC_SystemOpen( NC_PRIORITY_LOW, NC_OPMODE_INTERRUPT );
    hCfg = CfgNew();
    CfgAddEntry( hCfg, ...);
    do
    {
        rc = NC_NetStart( hCfg, NetworkOpen, NetworkClose, NetworkIPAddr );
    } while( rc > 0 );
    CfgFree( hCfg );
    NC_SystemClose();

My main control flow was (in principle):

    do
    {
        NC_SystemOpen( NC_PRIORITY_LOW, NC_OPMODE_INTERRUPT );
        hCfg = CfgNew();
        CfgAddEntry( hCfg, ...);
        rc = NC_NetStart( hCfg, NetworkOpen, NetworkClose, NetworkIPAddr );
        CfgFree( hCfg );
        NC_SystemClose();
    } while( rc > 0 );

But I would expect this to work as well...

Steven Connell said:
The most common example that people use is called 'client'. In fact, the client example has a console mode that you can telnet in to. Once in, you can run a 'reboot' command that will reboot the stack. You may want to give that example a try to see if you can reboot it, since it should use the same driver.

Well, if I find some time I'll give it a try... (I'm afraid this won't happen this year any more.)

Lars

0 Dale Peterson1 over 10 years ago in reply to Lars Beikirch

Prodigy 60 points

Steven Connell said:

I'll see what I can do about getting this into the next NDK release in 1Q2013. Since changing the DHCP client to not be a separate task doesn't help you, I've filed a new bug for this:

SDOCM00098066 DHCP Task may perform busy waiting for up to 9 seconds

It appears this bug report has fallen off the bug reporting system, and it doesn't appear to have been fixed. Are the only workarounds for this issue the modification of the DHCP client code, or restarting the NDK?

