NDK race condition/improper locking

Denes Balatoni

We are using ndk 2_22_00_06.

When the dhcp client cannot renew the IP address before the lease time expires, it tries to close all open sockets before requesting a new IP address (in netsrv.c:291). llEnter is never called on this path, so when llExit gets called in fdint_waitevent (file.c:194) we get an error ("Illegal call to llExit()").

If we add llEnter and llExit calls around the SockCleanPcb calls in netsrv.c:291, it still does not fix every problem.

In this case, while the dhcp client is trying to close all open sockets, the application may preempt the dhcp client and close one of the sockets (due to an fdClose call in the application). When the dhcp client then gets to closing the same TCP socket, we get double free errors. The application can preempt the dhcp client, for example, after the llExit call in fdint_waitevent (file.c:194, which gets called when closing a socket); there may be other places too. In the end, what we observe is that TcpPrDetach (in tcpprot.c:94) is called twice on the same socket, first from the application, then from the dhcp client, leading to double free errors.

Here is what we see in the debugger when the error happens:
[ARM9_0] 01115.103 Illegal call to llExit()
[ARM9_0] 01135.105 mmFree: Double Free
[ARM9_0] 01135.107 mmFree: Double Free
[ARM9_0] 01135.109 mmBulkFree: Corrupted mem or bad ptr (c04e6520)
[ARM9_0] 01135.111 mmFree: Double Free
[ARM9_0] 01135.113 mmFree: Double Free

over 12 years ago

Steven Connell over 12 years ago

TI__Mastermind 45025 points

Hi Denes Balatoni,

Denes Balatoni said:
the application may preempt the dhcp client and close one of the sockets (due to an fdClose call in the application)

What priority is your application thread running at?

Steve

Denes Balatoni over 12 years ago in reply to Steven Connell

Prodigy 60 points

Hi Steven!

The application thread is running at priority 5, but the priority seems to be bumped to 9 in fdClose.

Thank you,

Best regards,

Denes

Steven Connell over 12 years ago in reply to Denes Balatoni

TI__Mastermind 45025 points

Denes Balatoni said:
The application thread is running at priority 5, but the priority seems to be bumped to 9 in fdClose.

Right. This is due to the llEnter() call in fdClose. llEnter() changes the current thread's priority to the NDK kernel level priority (usually priority 9), which looks to be what you're seeing.
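To make the priority bump and the "Illegal call to llExit()" message concrete, here is a minimal toy model of an enter/exit pair. It is only a sketch: `ToyTask`, `toy_llEnter`, `toy_llExit` and `TOY_KERNEL_PRIORITY` are hypothetical names, and the real NDK functions may well work differently (for example, they also involve a semaphore).

```c
#include <assert.h>
#include <stdio.h>

/* Toy model only. ToyTask, toy_llEnter and toy_llExit are hypothetical
 * names; this is not the NDK implementation. */
#define TOY_KERNEL_PRIORITY 9  /* "usually priority 9" per the post */

typedef struct {
    int priority;        /* current task priority */
    int saved_priority;  /* priority to restore on exit */
    int in_kernel;       /* 1 between a matched enter and exit */
} ToyTask;

/* Raise the task to kernel priority. Returns 0 on success, -1 if the
 * task is already in kernel mode (assumed to be an error here). */
int toy_llEnter(ToyTask *t)
{
    assert(t != NULL);
    if (t->in_kernel) {
        return -1;
    }
    t->saved_priority = t->priority;
    t->priority = TOY_KERNEL_PRIORITY;
    t->in_kernel = 1;
    return 0;
}

/* Restore the original priority. Returns 0 on success, -1 when there
 * was no matching toy_llEnter, which mirrors the reported message. */
int toy_llExit(ToyTask *t)
{
    assert(t != NULL);
    if (!t->in_kernel) {
        printf("Illegal call to llExit()\n");
        return -1;
    }
    t->priority = t->saved_priority;
    t->in_kernel = 0;
    return 0;
}
```

In this model, a task that starts at priority 5 sits at 9 between the two calls and drops back to 5 afterwards, while an exit without a prior enter is reported as an error.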

Denes Balatoni said:
If we add llEnter and llExit calls around the SockCleanPcb calls in netsrv.c:291, it still does not fix every problem.
In this case, while the dhcp client is trying to close all open sockets, the application may preempt the dhcp client and close one of the sockets (due to an fdClose call in the application)

Thinking out loud a bit here, I'm trying to better understand this. When you surround the socket clean up code in netsrv.c with an llEnter/llExit pair, you are seeing that code gets pre-empted by the application due to the fdClose call?

It seems like one of the following two scenarios may be happening:

Scenario A:

1. Application calls fdClose(), which calls llEnter and raises its priority to 9.

2. At this point the DHCP task has not reached its llEnter call yet, so it's still at priority 5. It therefore cannot run because the higher priority application task is running (at kernel priority).

3. Application task blocks (before call to llExit is reached).

4. This would then allow the DHCP task to run, reach its llEnter call and have its priority raised to kernel level, too.

Scenario B:

1. DHCP task calls llEnter in order to clean up sockets. This raises its priority to kernel level

2. At this point the application task has not yet reached its fdClose/llEnter call, so it's still at priority 5. It therefore cannot run because the higher priority DHCP task is running (at kernel priority).

3. DHCP task blocks (before call to llExit is reached).

4. This would then allow the Application task to run, reach its llEnter call and have its priority raised to kernel level, too.

Actually it seems like scenario B is what you're seeing. In either case, I'm still having a hard time understanding, because even if the application task is allowed to run and reaches its call to llEnter(), it should be blocked at the llEnter call due to a Semaphore within the llEnter code.

Steve

Steven Connell over 12 years ago in reply to Steven Connell

TI__Mastermind 45025 points

Denes,

I just thought of something to try. Can you try setting a couple of break points? Note that you may need to rebuild the NDK in debug mode in order to do this. If that's not a big deal for you, then maybe you can try this.

First, in netsrv.c can you try setting a break point at the first call to SockCleanPcb, followed by a break point at the llExit call:

{

            llEnter();
            SockCleanPcb (SOCKPROT_TCP, pi->IPAddr); // <------- set break point here (break point "A")
            SockCleanPcb (SOCKPROT_UDP, pi->IPAddr);
            SockCleanPcb (SOCKPROT_RAW, pi->IPAddr);
            llExit(); // <------- set break point here (break point "B")

}

Then, once you hit the first break point (break point A), can you then set another break point at the llExit function itself (this would be break point C)? You can do this by making a new breakpoint in CCS, and typing the symbol 'llExit'.

Once that's set, hit run. I want to see if you hit the llExit function (break point C) *before* hitting break point B. If so, then it means somewhere in the SockCleanPcb call chain, llExit is being called.

Steve

Denes Balatoni over 12 years ago in reply to Steven Connell

Prodigy 60 points

Hi Steven!

The call chain is the following: SockCleanPcb->SockClose->FdWaitEvent->fdint_waitevent->llExit .

This llExit call was why we were getting the "Illegal call to llExit" error message and also one of the reasons why we tried putting llEnter & llExit around the SockCleanPcb calls.

As far as I can tell this llExit call can allow the application to preempt the dhcp client's SockCleanPcb call.
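The effect of that nested llExit can be sketched with a small single-threaded model. Everything here is hypothetical (`ToyStack`, `toy_wait_event`, `toy_app_close`, `toy_dhcp_cleanup`), and it assumes the close path waits first and detaches afterwards; the "other task" is just a callback that runs at the point where the lock is dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only. Every name below is hypothetical and merely mirrors
 * the call chain described above; it is not NDK source. */
typedef struct {
    int lock_held;     /* 1 while the modeled kernel lock is held */
    int detach_calls;  /* how many times the one socket was detached */
} ToyStack;

typedef void (*ToyOtherTask)(ToyStack *);

void toy_detach(ToyStack *s)
{
    s->detach_calls++;
}

/* Models the wait inside the close path: the lock is released while
 * waiting, another task may run, then the lock is taken again. */
void toy_wait_event(ToyStack *s, ToyOtherTask other)
{
    s->lock_held = 0;          /* the nested "llExit" */
    if (other != NULL) {
        other(s);              /* window in which the app can run */
    }
    s->lock_held = 1;          /* the matching "llEnter" */
}

/* Models the application's close of the same socket. */
void toy_app_close(ToyStack *s)
{
    assert(!s->lock_held);     /* only possible once the lock is free */
    s->lock_held = 1;
    toy_detach(s);
    s->lock_held = 0;
}

/* Models the DHCP clean-up: take the lock, wait, then detach.
 * Returns the total number of detach calls on the socket. */
int toy_dhcp_cleanup(ToyStack *s, ToyOtherTask other)
{
    s->lock_held = 1;
    toy_wait_event(s, other);
    toy_detach(s);             /* second detach if the app got in */
    s->lock_held = 0;
    return s->detach_calls;
}
```

With no other task, `toy_dhcp_cleanup` reports one detach. Passing `toy_app_close` as the other task yields two detaches on the same socket, which corresponds to the double call to TcpPrDetach that we observe.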

Thanks for looking into this, and let me know if you need more information!

Steven Connell over 12 years ago in reply to Denes Balatoni

TI__Mastermind 45025 points

Ok, thanks for clarifying. I see the race condition you're talking about, and I think the scenario B that I described above is what's happening. Just to reiterate, I think the call to llExit() in fdint_waitevent causes the application task to run and then delete the same socket that the DHCP client task is trying to delete. In this case, the call to llExit is allowing the application to un-block and be raised to kernel priority and then delete the socket.

I think I may have a fix for you. Can you try adding the following code to the SockIntAbort() function? I think you should try doing this and also removing the calls to llEnter/llExit that you may have added to netsrv.c (around the SockCleanPcb calls):

/*-------------------------------------------------------------------- */
/* SockIntAbort() */
/* Close a socket - Called only internally to stack */
/*-------------------------------------------------------------------- */
void SockIntAbort( SOCK *ps )
{

    /* check that parameters are valid */
    if (!ps) {
        DbgPrintf(DBG_INFO, "SockIntAbort: socket is NULL, returning\n");
        return;
    }

    /* Detach the socket from protocol processing */
    SockPrDetach( ps );

    /* DeRef any held route */
    if( ps->hRoute )
        RtDeRef( ps->hRoute );

    /* Free the Rx and Tx Buffers */
    if( ps->hSBRx )
        SBFree( ps->hSBRx );
    if( ps->hSBTx )
        SBFree( ps->hSBTx );

    /* Free the sock memory */
    ps->fd.Type = 0;
    mmFree( ps );
}

Steve

Denes Balatoni over 12 years ago in reply to Steven Connell

Prodigy 60 points

Hello Steve!

I see the same behaviour. As you can see in the attachment, when the DHCP client is trying to close the connections, ps is not NULL; it is the second parameter of TcpPrDetach (which is derived from ps) that is 0. Also, if I remove the llEnter/llExit pair, the "Illegal call to llExit()" message is signalled again, and DbgPrintf calls NC_NetStop(-1), thereby shutting down the NDK.
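To illustrate why a NULL check alone would not catch this, here is a small sketch with a static pool, so that a released slot can still be inspected safely (real heap memory must not be read after it is freed). `ToySock`, `toy_sock_new`, `toy_sock_abort` and the `in_use` flag are hypothetical and not part of the NDK.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only. ToySock and the toy_* functions are hypothetical and
 * use a static pool so that a "freed" slot can still be inspected. */
#define TOY_POOL_SIZE 4

typedef struct {
    int in_use;  /* 1 while the slot is allocated */
    int id;      /* arbitrary label for the socket */
} ToySock;

static ToySock toy_pool[TOY_POOL_SIZE];

/* Hand out a free slot, or NULL when the pool is exhausted. */
ToySock *toy_sock_new(int id)
{
    for (size_t i = 0; i < TOY_POOL_SIZE; i++) {
        if (!toy_pool[i].in_use) {
            toy_pool[i].in_use = 1;
            toy_pool[i].id = id;
            return &toy_pool[i];
        }
    }
    return NULL;
}

/* Returns 0 on success, -1 for a NULL pointer, -2 for a stale pointer
 * whose slot was already released (the double free case). */
int toy_sock_abort(ToySock *ps)
{
    if (!ps) {
        return -1;   /* same kind of check as in the suggested fix */
    }
    if (!ps->in_use) {
        return -2;   /* hypothetical extra state check */
    }
    ps->in_use = 0;
    return 0;
}

/* Two holders share one pointer; the second abort sees a pointer that
 * is non-NULL but stale. Returns the result of that second abort. */
int toy_second_abort_result(void)
{
    ToySock *app_view = toy_sock_new(1);
    ToySock *dhcp_view = app_view;   /* same socket, second holder */

    assert(app_view != NULL);
    (void)toy_sock_abort(app_view);  /* first close succeeds */
    return toy_sock_abort(dhcp_view);
}
```

In this model the second abort passes the NULL check and is only stopped by the extra `in_use` test, which matches what I see: the pointer held by the DHCP client stays non-NULL after the application has already closed the socket.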

Steven Connell over 12 years ago in reply to Denes Balatoni

TI__Mastermind 45025 points

Hi Denes,

I think at this point I am going to need to reproduce this issue on my side. Can this issue be easily reproduced with the NDK client example? Or perhaps you could attach an example that I could build/load/run and easily reproduce.

Steve

Denes Balatoni over 12 years ago in reply to Steven Connell

Prodigy 60 points

Unfortunately I do not have a simple application that reproduces the problem, but I can describe the scenario. We are using an OMAP L138. An IP address is obtained with the NDK DHCP client, then communication starts with the PC.
Then, before the device tries to renew the IP address, the DHCP server is restarted so that it forgets the leases it has given out (I use the program called "Open DHCP server" on the PC).
Then, when the device tries to renew its address, it will send a few DHCP requests for its IP address, which will be NAKed (because of a lack of DHCP discover).
Then the DHCP client will give up trying and go on to close down the connection, and it will get to the "SockCleanPcb (SOCKPROT_TCP, pi->IPAddr);" call.
Here is a log to see what happens after this (DbgPrintfs were added to fdClose before the llEnter and after the llExits):
------------------
/* DHCP client tries to obtain address, then gives up trying */
[ARM9_0] 00151.877 Going to call "SockCleanPcb (SOCKPROT_TCP, pi->IPAddr);" now!
/* llExit() called in fdint_waitevent */
[ARM9_0] 00151.881 Illegal call to llExit()
/* Intervening fdClose call by application */
[ARM9_0] 00151.885 fdClose started
[ARM9_0] 00171.901 fdClose finished OK
/* DHCP client continues */
[ARM9_0] 00171.906 mmFree: Double Free
[ARM9_0] 00171.908 mmFree: Double Free
[ARM9_0] 00171.910 mmBulkFree: Corrupted mem or bad ptr (c04ed2d8)
[ARM9_0] 00172.000 mmFree: Double Free
[ARM9_0] 00172.003 mmFree: Double Free
------------------
Note, the clock is running at 2x speed. With the llEnter/llExit added around the SockCleanPcb calls I get a very similar log (with respect to times), without the llExit error.

Let me know if this information is not sufficient.

Thank you,
Best regards,
Denes

Denes Balatoni over 12 years ago in reply to Denes Balatoni

Prodigy 60 points

One more thing to note: I changed the System_printfs in DbgPrintf to printfs, because for some reason it was not printing anything (maybe an error in my settings).

Steven Connell over 12 years ago in reply to Denes Balatoni

TI__Mastermind 45025 points

Hi Denes,

I'm trying to reproduce the problem but I'm having some trouble. I'm wondering if you can give me some hints.

I've got Open DHCP Server set up on my PC and have a direct connection to my 6748 board.
I'm running the NDK Client example and it's configured to get the IP address dynamically.
The board successfully gets the IP address from Open DHCP Server.
I've configured the DHCP server to hand out IP addresses with a lease time of 30 seconds.

I have tried running the example and killing the DHCP server during the 30 sec window. Usually the NDK is able to just renew the lease and renew the IP address.

I'm also trying to run the client example's Windows side app "echoc.exe" (sends TCP data to NDK app, app echoes it back). So I'm also trying to reproduce by killing the server, then killing the echo app, and vice versa. But none of that seems to cause the failure case to come up.

Does it sound like I'm on the right track here? Any suggestions?

Maybe the echoc app is not a good way to reproduce this? Maybe your PC communications are something different and I need to do something more along the lines of what you have?

Steve
