NDK dchild task terminating

Other Parts Discussed in Thread: SYSBIOS

I am experiencing a strange NDK issue.  I am currently starting two receive daemons to receive message data on two separate ports.  I also have another task running that sends packets out on the network.  This is all running on core 0 while other processing occurs on core 1.

This scheme has been working for a while, but I am seeing an issue when running a longevity test.  If I run a test with packets streaming in and out of the DSP, eventually the NDK will stop processing incoming packets.  This happens anywhere between 1 and 8 hours into the test.  I have yet to narrow the timeline down any further.

I attached the ROV when this condition occurs and I can see that dchild is in the terminated state.  I understand that this is the main task that runs the NDK.  The problem is that I have not been able to determine what brought this task down.  Are there common problems that could terminate dchild without bringing down the rest of the core with it?

I am using NDK 2_20_04_26 on a C6472 DSP.

Thanks,

Kevin

  • Hi Kevin,

    Is the main NDK stack task still alive (e.g. blocked, ready or running)? Which BIOS are you using (BIOS 5.x or SYS/BIOS)?

    With your version of NDK, when an NDK task terminates, it gets cleaned up when the next one terminates. This is because the terminating task cannot delete itself. So when the next one terminates, it cleans up the previously terminated one. Can you halt the target while the system is up and running fine? Look at ROV and see if you have a terminated dchild.

    There are several reasons why a dchild can go into the terminated state. I'm assuming the stack thread is still alive, so NC_SystemClose() was not called.

    The TaskExit/TaskDestroy APIs will terminate an NDK task. Are you calling these, or a function that calls them (e.g. DHCPClose(), DNSServerClose(), etc.)?

    The socket the dchild was handling could have been closed on the remote side and the dchild is terminating itself. You can look at the dchild code in ndk_2_20_04_26\packages\ti\ndk\src\nettools\daemon\daemon.c.

    Assuming you are using DHCP, could the lease not have been renewed?

    Todd
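The deferred cleanup described in the reply above can be modeled with a small host-side sketch. This is not NDK code: `sim_task`, `sim_task_create`, `sim_task_exit`, `g_zombie` and `g_live_stacks` are hypothetical names used only to illustrate the pattern in which an exiting task cannot free its own stack, so it frees the previously exited task instead. Under that assumption, one terminated task (and its stack) is always outstanding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a task object; not an NDK or SYS/BIOS type. */
typedef struct {
    int   id;
    void *stack;   /* memory the task itself cannot free while exiting */
} sim_task;

static sim_task *g_zombie = NULL;  /* last task that exited, not yet freed */
static int g_live_stacks = 0;      /* stacks currently allocated */

/* Allocate a task and its stack; returns NULL on allocation failure. */
static sim_task *sim_task_create(int id, size_t stack_bytes)
{
    sim_task *t = malloc(sizeof *t);
    if (t == NULL) {
        return NULL;
    }
    t->stack = malloc(stack_bytes);
    if (t->stack == NULL) {
        free(t);
        return NULL;
    }
    t->id = id;
    g_live_stacks++;
    return t;
}

/* Called by a task that is exiting: reap the previous zombie, then
 * become the new zombie. The caller's own stack stays allocated. */
static void sim_task_exit(sim_task *self)
{
    assert(self != NULL);
    if (g_zombie != NULL) {
        free(g_zombie->stack);
        free(g_zombie);
        g_live_stacks--;
    }
    g_zombie = self;
}
```

In this model, a single terminated task showing up in a task view is expected. Memory only grows without bound if terminated tasks accumulate faster than later exits reap them.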

  • I forgot the name of the task, but the other NDK task is blocked on a semaphore.  I am using Sys/BIOS 6_31_04_27

    If I halt the target during normal operation, dchild appears to be running fine.  It is not terminated.

    When I call NC_NetStart I provide a callback function for NetworkOpen and NetworkClose.  In NetworkClose I am closing my UDP daemons with DaemonFree.  I guess it is possible that the NDK triggered a network close, thus closing my sockets.  I don't know why it would do that though.  There are no other calls in my code that call any sort of task exit on the NDK.

    Also, I am using a static IP (no DHCP) and UDP only.  There are no TCP sockets active.

    Thanks

  • Kevin,

    Is it one of the two receiving dchild tasks that is terminating, or the sending one? Can you attach the ROV snapshot of the Tasks? (It looks like the one in the first post did not stick.)

    Todd

  • I can post the ROV snapshot tomorrow morning.  My sending task never terminates, but it only attempts to send after receiving a packet, so it is just blocked waiting on data.  Just the receiving dchild is terminating.  

    I think the ROV output will clear things up a bit tomorrow.

    Thanks

  • That is the ROV task view when this problem occurs. I checked the ROV when everything is working and noticed that I still see dchild in the terminated state.  It seems you were correct and that is normal.

    I ran two tests last night.  One had the same software and lost network comms like usual.  The other test had a build with some slight changes to my .cfg file and networking is still up.

    My old .cfg file had this for NDK setup:

    ti_sysbios_knl_Clock.timerId = 6;
    Task.addHookSet ({ registerFxn: '&NDK_hookInit', createFxn: '&NDK_hookCreate', });
    var instti_sysbios_knl_Clock0Params0 = new ti_sysbios_knl_Clock.Params();
    instti_sysbios_knl_Clock0Params0.instance.name = "net_timer";
    instti_sysbios_knl_Clock0Params0.startFlag = true;
    Program.global.net_timer = ti_sysbios_knl_Clock.create("&llTimerTick", 10, instti_sysbios_knl_Clock0Params0);

    I removed that and now have just the following:

    var Global = xdc.useModule('ti.ndk.config.Global');
    Global.enableCodeGeneration = false;

    I need to test more to verify that the problem is gone, but would you expect this modification to solve this kind of issue?

  • Kevin,

    When you use the .cfg configuration, it does several things for you (e.g. adding the hookInit/hookCreate, creating the clock instance (which is the heartbeat of the stack), etc.). So removing those from your .cfg and just using Global is a good idea.

    I noticed that with your .cfg, you had the Clock function run every 10 ticks. What is your Clock module period configured for (the default is 1ms)? I'm assuming that your .cfg did not change it and it is the default. The NDK requires that its heartbeat (llTimerTick) be 100ms, not 10ms. I cannot say with certainty that this is the problem, but it is definitely something that is incorrect. Over time, this might trip some timeout window incorrectly and force the socket to be closed.

    Todd
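The arithmetic behind the 10ms versus 100ms remark can be written out as a small sketch. It follows the reading in the reply above (a function invoked every N Clock ticks, with the tick length given in microseconds); the helper names `clock_instance_period_ms` and `ticks_for_heartbeat` are hypothetical and are not part of SYS/BIOS or the NDK.

```c
#include <assert.h>
#include <stdint.h>

/* Heartbeat period the NDK expects for llTimerTick, per the reply above. */
#define NDK_HEARTBEAT_MS 100u

/* Effective period in milliseconds of a function that runs every
 * period_ticks Clock ticks, where one tick lasts tick_period_us. */
static uint32_t clock_instance_period_ms(uint32_t period_ticks,
                                         uint32_t tick_period_us)
{
    return (period_ticks * tick_period_us) / 1000u;
}

/* Number of Clock ticks that add up to NDK_HEARTBEAT_MS for a given
 * tick length. Assumes tick_period_us is non-zero. */
static uint32_t ticks_for_heartbeat(uint32_t tick_period_us)
{
    assert(tick_period_us > 0u);
    return (NDK_HEARTBEAT_MS * 1000u) / tick_period_us;
}
```

With a 1000 us tick, a value of 10 gives a 10 ms period, while 100 ticks are needed to reach the 100 ms heartbeat.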

  • It turns out changing to ti.ndk.config.Global did not fix my problem.  I reproduced it again last night.  I have more information though.

    Previously, I thought the DSP was not responding to pings.  I looked at a wireshark capture, and it turns out it is responding to ping.  The ping responses are just delayed by 5 - 7 seconds.

    I put some breakpoints in and it does look like I am receiving packets and sending them, but there is a huge delay somewhere.  When I check the ROV, the idle task is always running so I don't think the NDK is getting starved or anything.

    Any idea what could cause such a big delay in ping responses?

  • More information:

    It looks like the delay is on the receive side.  I put a breakpoint in the NDK where ICMPInput gets called and it is taking 6 seconds from the time the ping packet goes out on the network to when it hits the breakpoint in code.  Even then, the response packet is not always even sent.

    It's hard to tell at this point where the delay is coming from.  There are other packets coming through the switch to my device, so it's hard to know if there is a delay associated with a specific packet if I put breakpoints earlier in the stack. The following is my handler function that dequeues incoming packets and provides them to the NDK for processing.  I don't see any source of delay, and if I put a breakpoint here it always hits.  Maybe I am missing something though:

    /**
     *  @b EmacPktService
     *  @n
     *  The function is called by the NDK core stack to receive any packets
     *  from the driver.
     *
     *  @param[in]  ptr_net_device
     *      NETIF_DEVICE structure pointer.
     *
     *  @retval
     *      void
     */
    static void EmacPktService (NETIF_DEVICE* ptr_net_device)
    {
        EMAC_DATA*  ptr_pvt_data;
        PBM_Handle  hPacket;

        /* Get the pointer to the private data */
        ptr_pvt_data = (EMAC_DATA *)ptr_net_device->pvt_data;

        /* Give all queued Raw packets first to the Ether module */
        while (PBMQ_count(&ptr_pvt_data->pdi.PBMQ_rawrx))
        {
            /* Dequeue a packet from the driver's Raw receive queue. */
            hPacket = PBMQ_deq(&ptr_pvt_data->pdi.PBMQ_rawrx);

            /* Prepare the packet so that it can be passed up the networking stack.
             * If this 'step' is not done the fields in the packet are not correct
             * and the packet will eventually be dropped. */
            PBM_setIFRx (hPacket, ptr_net_device);

            /* Pass the packet to the NDK Core stack. */
            NIMUReceivePacket(hPacket);
        }

        /* Give all queued IP packets to the Ether module */
        while (PBMQ_count(&ptr_pvt_data->pdi.PBMQ_rx))
        {
            /* Dequeue a packet from the driver receive queue. */
            hPacket = PBMQ_deq(&ptr_pvt_data->pdi.PBMQ_rx);

            /* Prepare the packet so that it can be passed up the networking stack.
             * If this 'step' is not done the fields in the packet are not correct
             * and the packet will eventually be dropped. */
            PBM_setIFRx (hPacket, ptr_net_device);

            /* Pass the packet to the NDK Core stack. */
            NIMUReceivePacket(hPacket);
        }

        /* Work has been completed; the receive queue is empty... */
        return;
    }

  • Can you remove all the routers or switches from the equation?

  • Not entirely.  My DSP is on a board with two other parts connected through a local switch.  Then the PC I am using to test occasionally broadcasts packets, since Windows likes to do that.  I also don't think the switch is to blame since the other connected components can communicate just fine.

    6 seconds is a long time for a packet to get delayed.  I checked all the network traffic coming into the DSP and there isn't a lot.  Nothing that would flood a buffer causing some big packet delay.  Is it possible that the timer feeding the NDK is firing too slow for some reason?  I can't think of anything else that would cause a packet to get processed this slowly.

    I just noticed I forgot to remove this line from my config:

    ti_sysbios_knl_Clock.timerId = 6;

    I'm not sure what that even does, but I'll try removing it

  • If the llTimerTick is not 100ms, it might happen. What is your SYS/BIOS Clock period? 1ms?

    llTimerTick is 100, but if we are forcing timer 6, I'm not sure if that translates into 100ms or something else.  Where can I find what my SYS/BIOS clock period is?

  • Hi Kevin,

    Usually, default BIOS tick period is 1ms.

    It's changed in the .cfg file using:

    var Clock = xdc.useModule('ti.sysbios.knl.Clock');

    Clock.tickPeriod = <desired clock period in uS>;

    There are also APIs in the Clock module (e.g. Clock_getPeriod()), which can be found in the SYS/BIOS API docs.

    -Tom

  • Well it looks like we don't currently change tick period.  I checked and it is 1000 by default.

  • Let me know what happens with the default timer. Are you getting any NDK allocation warnings on the console?

  • I see the same sort of behavior with the default timer.  Also, I do not see any NDK warnings on the console, but I am also not able to attach the debugger until after I am in the bad state.

    This morning was a bit different though.  Now the DSP is not responding to ARP requests.  I dug in a bit and found that NIMUPacketService is not getting called because the following condition in netctrl.c line 839 is never true:

    /* Was an Ethernet event signaled? */
    if(stkEvent.EventCodes[STKEVENT_ETHERNET])
    { .......

    This seems to trace back to the RxPacket function in my Ethernet driver code, ethdriver.c. This function is never getting called, so no packets are ever getting handled. I set up the RxPacket function pointer with a call to EMAC_Open.  Beyond that point I don't know what happens because I only have the header files for the 6472 CSL.  I don't know what triggers this RxPacket function to get called, but whatever it is seems to be malfunctioning.

    EDIT -

    So I downloaded the CSL source code and traced this through further.  It appears that RxPacket is called from EMAC_RxServiceCheck.  My ethernet driver is responsible for calling EMAC_RxServiceCheck whenever there is an interrupt for an incoming packet.  I put a breakpoint in my ISR and it never hits.  So now the question is, why isn't my packet receive ISR getting called?  I'm not sure if I should blame the switch or the DSP.

    I checked the IER register and it looks like the EMAC interrupts are enabled (IE9 and IE10).  I also see these interrupts in the Hwi section of the ROV.  The IFR register never shows interrupt 9 or 10 go high, so it looks to me that something happened to the EMAC on the chip to stop interrupts from triggering.
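The register check described in the post above can be expressed as a small bit test. This is only a sketch: it takes IER and IFR values as plain integers (for example, copied from a debugger register view), and the names `int_bit_set`, `classify_interrupt` and `int_state_t` are hypothetical. The interrupt numbers 9 and 10 are the ones quoted in the post.

```c
#include <stdbool.h>
#include <stdint.h>

/* EMAC interrupt numbers as quoted in the post above. */
#define EMAC_INT_A 9u
#define EMAC_INT_B 10u

typedef enum {
    INT_DISABLED,      /* IER bit clear: the CPU will not take the interrupt */
    INT_ENABLED_IDLE,  /* IER bit set, IFR bit clear: enabled, nothing pending */
    INT_PENDING        /* IER bit set, IFR bit set: enabled and flagged */
} int_state_t;

/* True if bit n of reg is set. Assumes n < 32. */
static bool int_bit_set(uint32_t reg, unsigned n)
{
    return ((reg >> n) & 1u) != 0u;
}

/* Classify interrupt n from sampled IER and IFR values. */
static int_state_t classify_interrupt(uint32_t ier, uint32_t ifr, unsigned n)
{
    if (!int_bit_set(ier, n)) {
        return INT_DISABLED;
    }
    return int_bit_set(ifr, n) ? INT_PENDING : INT_ENABLED_IDLE;
}
```

An IER value with bits 9 and 10 set is 0x600. One caveat on interpreting the result: a flag bit is normally cleared once the interrupt is taken, so a single sample at a halt can miss an interrupt that was already serviced.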


  • I guess this thread is dead, but I faced a similar issue today. 

    The problem I faced was that my application would take in only 5-6 send operations. Each time a send is executed, I lost 2048 bytes of Heap. I later on figured out that this was the stack size. The problem was that my application was executing in a while loop, and never allowed the Task to get killed. Everything else worked fine at this point though. I had to add a sleep in this loop to get it fixed.

    It seems that the Task_sleep added to fix this issue affects the performance of our application. Can you suggest any other way to fix this issue? I tried reducing the clock period, but that also affects the performance of the application. I also tried enabling Task_deleteTerminatedTasks in the .cfg (documented in SYS/BIOS), but it doesn't work. Right now the issue I'm facing is that after every send, I can see a dchild in the terminated state in ROV, which results in a heap loss equal to the stack size of the task.

    Thanks,
    Vinesh 

  • Hi Vinesh,

    Can you start a new thread? We prefer not to add onto an old thread because it hurts the search capabilities. Please also include the device you are using and the versions of the NDK and SYS/BIOS.

    Todd

  • Todd,

    I've made a new thread - http://e2e.ti.com/support/embedded/bios/f/355/t/253203.aspx

    Regards,
    Vinesh