Recovering McFW application from M3 hang in DM8127

We are using IPNC RDK 3.5.0 on a DM8127. We would like to be able to recover from a crashed or hung M3 by some method less drastic than a reboot of the A8 or a power cycle. I've experimented with hanging the VPSS or Video M3 by putting a high-priority BIOS task into a hard loop and then trying to terminate our McFW application on the A8 by sending it a SIGTERM signal. With the M3 unresponsive, several operations that our application normally performs during graceful termination, such as System_linkStop() and System_linkDelete(), hang and prevent the termination from completing. I've tried simply commenting out these calls, which allows the application to terminate, but this apparently leaves the Linux Syslink driver in a state that prevents a successful restart of the application. I've also tried to unload and reload the Syslink driver after the application is terminated, but the unload doesn't complete, also because of the state of the driver. (This issue is an evolution of the problem I reported in this thread: http://e2e.ti.com/support/embedded/bios/f/355/p/267340/957777.aspx#957777 )

Any suggestions on how to reinitialize the Syslink driver with an unresponsive M3 without rebooting Linux?  Thanks!

  • FYI, here's what happens when I try to unload the Syslink driver from Linux after terminating the McFW application (without calling System_linkStop() and System_linkDelete()). Note that before the attempted unload, /proc/modules indicates that Syslink has zero users, which would imply that unloading should be possible.

    ~ # cat /proc/modules
    syslink 806222 0 - Live 0xbf01c000
    snapshot 9720 0 - Live 0xbf013000
    cmem 19969 1 snapshot, Live 0xbf008000
    dm81xx_edma 4216 1 snapshot, Live 0xbf000000
    ~ # rmmod syslink
    Unable to handle kernel paging request at virtual address d108a682
    pgd = c139c000
    [d108a682] *pgd=89a2a011, *pte=00000000, *ppte=00000000
    Internal error: Oops: 807 [#1]
    last sysfs file: /sys/devices/platform/omap/omap_i2c.1/i2c-1/1-004a/temp1_input
    Modules linked in: syslink(-) snapshot cmem dm81xx_edma
    CPU: 0    Not tainted  (2.6.37-maxwell+ #1)
    PC is at GatePeterson_Instance_finalize+0x44/0x88 [syslink]
    LR is at GatePeterson_Instance_finalize+0x34/0x88 [syslink]
    pc : [<bf03f6d4>]    lr : [<bf03f6c4>]    psr: 60000013
    sp : cad79df8  ip : 00000000  fp : cad79e1c
    r10: 00000000  r9 : cad78000  r8 : 00000000
    r7 : 00000000  r6 : 00000000  r5 : bf0cb23c  r4 : cf7e2000
    r3 : d108a680  r2 : ffffffff  r1 : cf7e2000  r0 : bf0842ce
    Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
    Control: 10c5387d  Table: 8139c019  DAC: 00000015
    Process rmmod (pid: 823, stack limit = 0xcad782e8)
    Stack: (0xcad79df8 to 0xcad7a000)
    9de0:                                                       00000000 cad79e08
    9e00: c00d1004 cad79e40 00000000 00000000 cad79e34 cad79e20 bf03fc20 bf03f69c
    9e20: cf7e2000 cf7df000 cad79e64 cad79e38 bf048da8 bf03fbb4 00000000 cbb90000
    9e40: cf7e2000 cad79e50 cad79e84 00000000 00000000 00000000 cad79e7c cad79e68
    9e60: bf04a35c bf048ca0 00000000 cf7dc010 cad79e9c cad79e80 bf04a4a4 bf04a2c8
    9e80: cf7dc000 cf7df000 cf7dc034 cf7dc000 cad79ec4 cad79ea0 bf043cf8 bf04a430
    9ea0: 00000294 0000ffff cf7dc000 bf09ae58 bf09ae60 c004cde8 cad79ee4 cad79ec8
    9ec0: bf043e04 bf043c50 bf0cafc0 bf0cb23c bef6cbd8 00000081 cad79efc cad79ee8
    9ee0: bf02deb4 bf043d94 00000000 bf0cae3c cad79f1c cad79f00 bf04c288 bf02ddb0
    9f00: c0042298 c046f81c bf0cae3c 00000000 cad79f34 cad79f20 bf06ecfc bf04c24c
    9f20: bf06ece8 bf0cad14 cad79fa4 cad79f38 c009fdd8 bf06ecf4 40027000 6c737973
    9f40: 006b6e69 cad79f50 c0091d24 c02062e4 cad79fa4 cad79f60 c00c8a50 c0091d20
    9f60: 08100871 c02062e4 00000000 00000001 c00c8120 00926840 bf0cad14 00000880
    9f80: cad79f84 00000000 0000007d 0001ea00 6c737973 006b6e69 00000000 cad79fa8
    9fa0: c004cc40 c009fc30 0001ea00 6c737973 bef6cbd8 00000880 bef6cbd8 00000880
    9fc0: 0001ea00 6c737973 006b6e69 00000081 000004d8 00000000 40027000 00000000
    9fe0: bef6cbd0 bef6cbc0 0001e8e8 4018c8c0 60000010 bef6cbd8 00000000 00000000
    Backtrace:
    [<bf03f690>] (GatePeterson_Instance_finalize+0x0/0x88 [syslink]) from [<bf03fc20>] (GatePeterson_delete+0x78/0xac [syslink])
     r6:00000000 r5:00000000 r4:cad79e40
    [<bf03fba8>] (GatePeterson_delete+0x0/0xac [syslink]) from [<bf048da8>] (GateMP_Instance_finalize+0x114/0x228 [syslink])
     r4:cf7df000 r3:cf7e2000
    [<bf048c94>] (GateMP_Instance_finalize+0x0/0x228 [syslink]) from [<bf04a35c>] (GateMP_delete+0xa0/0xf8 [syslink])
     r7:00000000 r6:00000000 r5:00000000 r4:cad79e84
    [<bf04a2bc>] (GateMP_delete+0x0/0xf8 [syslink]) from [<bf04a4a4>] (GateMP_close+0x80/0xb8 [syslink])
     r5:cf7dc010 r4:00000000
    [<bf04a424>] (GateMP_close+0x0/0xb8 [syslink]) from [<bf043cf8>] (ListMP_close+0xb4/0x144 [syslink])
     r5:cf7dc000 r4:cf7dc034
    [<bf043c44>] (ListMP_close+0x0/0x144 [syslink]) from [<bf043e04>] (ListMP_destroy+0x7c/0x120 [syslink])
     r8:c004cde8 r7:bf09ae60 r6:bf09ae58 r5:cf7dc000 r4:0000ffff
    r3:00000294
    [<bf043d88>] (ListMP_destroy+0x0/0x120 [syslink]) from [<bf02deb4>] (Platform_destroy+0x110/0x2e0 [syslink])
     r7:00000081 r6:bef6cbd8 r5:bf0cb23c r4:bf0cafc0
    [<bf02dda4>] (Platform_destroy+0x0/0x2e0 [syslink]) from [<bf04c288>] (Ipc_destroy+0x48/0xc0 [syslink])
     r4:bf0cae3c r3:00000000
    [<bf04c240>] (Ipc_destroy+0x0/0xc0 [syslink]) from [<bf06ecfc>] (KnlUtilsDrv_finalizeModule+0x14/0xb8 [syslink])
     r5:00000000 r4:bf0cae3c
    [<bf06ece8>] (KnlUtilsDrv_finalizeModule+0x0/0xb8 [syslink]) from [<c009fdd8>] (sys_delete_module+0x1b4/0x228)
     r4:bf0cad14 r3:bf06ece8
    [<c009fc24>] (sys_delete_module+0x0/0x228) from [<c004cc40>] (ret_fast_syscall+0x0/0x30)
     r6:006b6e69 r5:6c737973 r4:0001ea00
    Code: e3560000 1a000009 e5943010 e3e02000 (e1c320b2)
    ---[ end trace b7a09106cd357c50 ]---
    Segmentation fault
    ~ #

  • What version of SysLink are you using?  Releases are here:
            http://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/syslink/index.html 

    SysLink can't recover from every situation (e.g. if a slave goes rogue and corrupts Linux), but more recent releases do have improved teardown support.

    Chris

  • The SysLink driver is failing because not all the SysLink resources are being freed.

    To add more context to Chris' post: newer versions of SysLink have added limited resource tracking and some level of terminate support.

    The terminate support information can be found in the SysLink API docs. (SYSLINK_INSTALL_DIR/docs/index.html).  You'll find a reference to it in the IPC (host only) section.  This is a mechanism to tell the slave cores the host-side app has terminated.

    As for resource tracking, SysLink tracks only the following host-side modules: HeapBuf, MessageQ, and Notify. When the host application is terminated, SysLink will attempt to properly clean up any kernel-related driver resources for those modules.

    FYI:  Before terminating the host-side app, it is advisable to call SysLink_destroy() at a minimum. Ideally, you would want to free up all the host-side SysLink resources that were allocated by calling their delete/destroy functions (e.g. MessageQ_delete, etc) when possible.

  • Thanks for the suggestions, Chris and Arnie.  We are currently using Syslink 2.20.2.20.  I downloaded 2.21.01.05 but then noticed that it requires new versions of Sysbios and IPC, so I decided not to start peeling that onion now.  I am calling SysLink_destroy() when the application is terminated.  I added a call to set Ipc_TERMINATEPOLICY_STOP, but this did not help; I still get the same kernel oops when I try to rmmod the Syslink driver after terminating the application.  The oops occurs when the driver tries to delete a GatePeterson.  Since gates were not on Arnie's list of tracked resources, I'm sceptical of whether this problem can be solved without an enhancement to the driver, but I will try adding delete and destroy calls for things that the application does not now explicitly destroy.  If that doesn't help, I'll try updating Syslink and its dependencies.

  • An update: The application's termination code already contained destroy() calls for every type of Syslink resource, so that's not the answer. The reason Syslink isn't releasing all of its resources is that the application isn't calling System_linkDelete() on its McFW links during termination. It isn't calling System_linkDelete() because, if an M3 is unresponsive, the call will hang and prevent the application from terminating.

  • The terminate and resource tracking support is already in the 2.20.2.20 release, so there wouldn't be a gain from updating to a newer version.

    The terminate support is only helpful when the GPP-side application abruptly terminates, so that the slave cores know they need to clean up. The cleanup on the slave core doesn't happen automatically. It's just a hook to indicate to the application that it needs to clean up. The modules' resources aren't being tracked on the slave cores. It's up to the application code to determine what needs to be cleaned up.

    As for resource tracking, we only added it to the host-side modules I previously indicated, so not all cleanup may be taking place when the GPP-side application is killed. If you are curious, you can take a look at the following SysLink source files to see how resource tracking is implemented and used:

    • ti/syslink/utils/ResTrack.h - declares the API
    • ti/syslink/utils/hlos/knl/ResTrack.c - implementation
    • ti/syslink/ipc/hlos/knl/Linux/HeapBufMPDrv.c - user of resource tracking
    • ti/syslink/ipc/hlos/knl/Linux/MessageQDrv.c - user of resource tracking
    • ti/syslink/ipc/hlos/knl/Linux/NotifyDrv.c - user of resource tracking

  • We've found a resolution for this problem.  To recap, the problem was that we wanted to recover our DM8127-based system if an M3 processor became unresponsive (crashed or hung) by some means less drastic than rebooting the A8 or cycling power.  The final challenge was terminating the application on the A8 with an unresponsive M3 in a way that didn't leave the Linux Syslink driver in a state that prevented successful restart of the application.  The complete solution required several pieces:

    1. A patch from Brigesh Jadav at TI (5518.eventmngr.zip in http://e2e.ti.com/support/embedded/bios/f/355/t/267340.aspx?pi199400=1) that prevents the M3 from being flooded by interrupts after it is restarted.
    2. Removing some calls in our application that would hang its termination if an M3 was unresponsive (RemoteDebug_putChar() and RemoteDebug_getChar()).
    3. A change to System_linkControl() (in ipnc_rdk/ipnc_mcfw/mcfw/src_linux/links/system/system_linkApi.c) to use a timeout when calling System_ipcMsgQSendMsg().
    4. Removal of System_linkStop() calls from the application; the calls are apparently unnecessary if the M3s are to be restarted and will slow termination by incurring a timeout if an M3 is unresponsive.
    5. Although it wasn't necessary to allow application restart, we also fixed a problem in Platform_startCallback() that caused its timeout to be inaccurate.
    6. We also diagnosed a problem that caused Sysbios to call abort() when the M3 was restarted.  However, this problem occurred only when our application was terminated with SIGKILL rather than SIGTERM.

    It turned out that additional resource recovery in the Syslink driver was not necessary to resolve the problem.
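
    Items 3 and 4 both come down to bounding a wait on a remote core that may never answer. The following is only a minimal user-space sketch of that idea and is not the RDK code: the reply channel is modelled as a file descriptor, and wait_for_reply() is a hypothetical helper name. The actual change was made inside System_linkControl() around System_ipcMsgQSendMsg(), whose internals are not shown in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for a one-byte reply on
 * reply_fd. Returns 0 if a reply was read, -1 on timeout or error.
 * The descriptor stands in for whatever channel carries the remote
 * core's acknowledgement; it is an assumption made for illustration. */
static int wait_for_reply(int reply_fd, int timeout_ms)
{
    struct pollfd pfd;
    char reply;
    int rc;

    assert(timeout_ms >= 0);

    pfd.fd = reply_fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    /* Retry if a signal interrupts the wait. Note that each retry
     * starts a fresh timeout_ms, so the bound is approximate. */
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc <= 0)
        return -1;              /* 0: timed out, <0: poll() failed */
    if (!(pfd.revents & POLLIN))
        return -1;              /* hang-up or error on the channel */

    return (read(reply_fd, &reply, 1) == 1) ? 0 : -1;
}
```

    With a bounded wait like this, the caller can treat -1 as "remote core unresponsive" and continue with teardown instead of blocking forever.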

    My thanks to everyone who contributed to this resolution!

  • Glad you have a solution now.  Though I am curious about the inaccurate timeout in Platform_startCallback()

    Dave Beal said:
    Although it wasn't necessary to allow application restart, we also fixed a problem in Platform_startCallback() that caused its timeout to be inaccurate.

    Can you provide more info on what you found, so that it can be addressed in future SysLink releases?

    Thanks

  • Hi Arnie -

    The timeout problem was in the Platform_startCallback() function in Source/ti_tools/syslink_2_20_02_20/packages/ti/syslink/family/hlos/knl/ti81xx/Platform.c. The code calls Ipc_attach() repeatedly until it succeeds or a timeout occurs. The timeout was implemented by a loop that calls Ipc_attach() up to 20480000 times, delaying for 1 msec after every 4096th call, to implement a total timeout of 5 seconds ((20480000 / 4096) * 1 msec = 5 sec). In other words, the code was assuming that the 20 million calls to Ipc_attach() consumed no time at all. The result was that the timeout actually took about a minute. I changed it to call Ipc_attach() once every 100 msec, up to 50 times.
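
    For anyone who wants to picture the reworked loop, here is a minimal user-space sketch of the "one attempt every 100 msec, up to 50 times" pattern. It is not the actual Platform.c patch: stub_attach() is a hypothetical stand-in for Ipc_attach(), and nanosleep() is used in place of whatever delay primitive the kernel-side code uses.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for Ipc_attach(): fails (returns -1) until it
 * has been called calls_until_ready times, then succeeds (returns 0). */
static int calls_until_ready = 3;
static int attach_calls = 0;

static int stub_attach(void)
{
    attach_calls++;
    return (attach_calls >= calls_until_ready) ? 0 : -1;
}

/* Sleep for the given number of milliseconds. */
static void sleep_ms(long ms)
{
    struct timespec ts;
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    nanosleep(&ts, NULL);
}

/* Call attach() up to max_attempts times, sleeping interval_ms after
 * each failure. Returns 0 on success, -1 if every attempt failed.
 * The worst-case wait is roughly max_attempts * interval_ms plus the
 * time spent inside attach() itself, so the delay dominates as long
 * as a single attach() call is short compared with interval_ms. */
static int attach_with_timeout(int (*attach)(void), int max_attempts,
                               long interval_ms)
{
    int i;

    assert(attach != NULL && max_attempts > 0 && interval_ms >= 0);

    for (i = 0; i < max_attempts; i++) {
        if (attach() == 0)
            return 0;
        sleep_ms(interval_ms);
    }
    return -1;
}
```

    Calling attach_with_timeout(stub_attach, 50, 100) corresponds to the 50 x 100 msec = 5 sec budget described above. Because there is one sleep per attempt, the cost of the attach calls themselves no longer swamps the intended timeout.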

  • Dave,

    Thanks for the details.  I'll add this information to our bug tracking system to be reviewed prior to the next SysLink release.