Failure of Syslink Ipc_attach() after application restart

Dave Beal


We are running IPNC RDK 3.5 on a custom board containing a DM8127. On initial boot of the board, everything works fine and the board streams video. However, when we kill the Linux process on the A8 that manages the McFW chain and restart it, it doesn't come back up successfully. I've traced the problem into the Linux Syslink driver where calls to Ipc_procSyncStart() for connection to the VPSS M3 consistently fail because the A8 never sees the M3 set startedKey to Ipc_PROCSYNCSTART. However, Ipc_procSyncStart() works fine to the Video M3 on the restart. I believe that the A8 is looking in the right place in shared memory, because the address at which the A8 is expecting the VPSS M3 to set startedKey is the same on the successful initial boot and the unsuccessful restart. I've also tried hacking the code in Ipc_procSyncStart to force cacheEnabled to 1; this had no effect.
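The handshake described above boils down to the host polling a word in shared memory until the slave writes a known key. Here is a minimal sketch of that pattern. It is a simplified model, not the SysLink source: `SyncArea`, `wait_for_started_key`, and the `START_KEY` value are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder value (assumption); the real Ipc_PROCSYNCSTART constant is
 * defined in the SysLink sources. */
#define START_KEY 1u

/* Hypothetical stand-in for the reserved handshake area in shared memory. */
typedef struct {
    volatile uint32_t startedKey;
} SyncArea;

/* Poll until the slave publishes START_KEY or the poll budget runs out.
 * Returns true on success, false on timeout. */
static bool wait_for_started_key(const SyncArea *area, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if (area->startedKey == START_KEY) {
            return true;
        }
        /* Real code would invalidate the cache line and sleep here. */
    }
    return false;
}
```

If the slave never reaches the code that writes the key, the host side can only time out, which matches the symptom seen here on the A8.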

Debugging this is a chicken-and-egg problem: I can't see what's going on on the VPSS M3 because I can't communicate with it. I would use Code Composer Studio's JTAG interface to look at the VPSS M3, but my evaluation license has expired. Suggestions on how to debug this?

over 12 years ago

0 Dave Beal over 12 years ago

Intellectual 880 points

I've made some progress on this. I found that CCS would allow me to renew my license for another 90 days, so I'm using it via JTAG. The problem isn't on the A8; it's on the VPSS M3. I'm convinced that the VPSS M3 doesn't come back up successfully after I've killed and restarted the process on the A8. The A8 thinks that it has successfully reloaded and restarted both M3s, but if I then connect to the VPSS M3 in CCS and randomly suspend it, it's always in code with names that contain "ISR" or "Hwi", so it looks like the VPSS M3 is stuck constantly handling interrupts and not getting anything else done. When it's in this state, I tried setting breakpoints at routines that should be getting executed during normal operation, like Utils_quePut() and Notify_sendEvent(), but CCS complains that it can't set breakpoints when the M3 is disconnected; apparently the VPSS M3 isn't even working well enough after the restart for CCS to connect, although I can suspend and run it.

Any ideas why, after it's reloaded and restarted, the VPSS M3 is apparently stuck doing nothing but servicing interrupts?

0 Dave Beal over 12 years ago in reply to Dave Beal

Intellectual 880 points

A little more information. I've traced the restart of the VPSS M3 after it's reloaded. It gets into the function ti_sysbios_BIOS_startFunc__I() which calls several other startup routines. All goes well until ti_sysbios_hal_Hwi_startup() is called and returns, at which point the M3 does nothing but service hardware interrupt 0x20. The next startup function to be called (ti_sysbios_hal_Timer_startup()) never gets executed.

Anybody know what interrupt 0x20 is on the VPSS M3? I haven't found anything in the source code that tells me.

0 Arnie Reynoso over 12 years ago in reply to Dave Beal

TI__Expert 6860 points

Are you able to connect to the VPSS M3 core? If so, you can try taking a look to see if the HLOS core is setting the correct bit for the slave (M3) to detect a sync.

Though the following post is about a different device, it might be of some use:

http://e2e.ti.com/support/embedded/bios/f/355/t/261836.aspx

You can also try enabling trace for the SysLink kernel module to see if you get more info. Take a look at the following:

http://processors.wiki.ti.com/index.php/SysLink_Install_Guide#SysLink_Kernel_Driver

http://processors.wiki.ti.com/index.php/SysLink_UserGuide#Trace.2C_debug_and_build_configuration

0 Arnie Reynoso over 12 years ago in reply to Arnie Reynoso

TI__Expert 6860 points

It seems the VPSS-M3 core is stuck in some initialization state. The SysLink sample applications typically load and run the Video-M3 core before the VPSS-M3, since the Video-M3 is the primary core and is typically responsible for enabling the unicache and configuring the AMMU. It may be that upon reset, the VPSS-M3 is loaded and executing before the Video-M3 has done the setup and initialization that the VPSS-M3 needs to run correctly.

0 Dave Beal over 12 years ago in reply to Arnie Reynoso

Intellectual 880 points

Hi Arnie -

Thanks for your response. As I indicated in my second and third posts to this thread, the problem is a failure of the VPSS M3 to make it through initialization when it comes back up after being reloaded and restarted. It gets stuck in a state doing nothing but servicing interrupt 0x20. I haven't been able to determine what the interrupt is.

0 Dave Beal over 12 years ago in reply to Arnie Reynoso

Intellectual 880 points

Arnie, loading and starting the VPSS M3 before the Video M3 always seems to work correctly on the initial boot of the device. It's only on restart that my problem occurs. However, I will try reversing the order of the M3 startups (on both boot and restart) and see what happens.

0 Dave Beal over 12 years ago in reply to Arnie Reynoso

Intellectual 880 points

Arnie, I reversed the order of loading and starting the M3s - that is, starting the Video M3 before the VPSS M3. No change to the behavior. It still works fine on initial boot, but the VPSS M3 does nothing but service interrupts after restart.

0 Dave Beal over 12 years ago in reply to Dave Beal

Intellectual 880 points

The interrupt that pummels the VPSS M3 after restart is apparently associated with the ISS, because the interrupt dispatcher calls a function (Iem_masterISR()) in iss_evtMgr.c to service it. I hacked up Iem_masterISR() so that when it is called after the VPSS M3 is restarted, the interrupt is serviced but not re-enabled. With this change, when I randomly suspend the M3, it is usually in ti_sysbios_knl_idle_loop__E(), as it is when everything works after boot. More evidence that the interrupt is preventing the M3 from resuming normal operation after its restart.
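The experiment above (service the interrupt, but do not re-enable it) can be generalized into a small storm guard. This is only a sketch of the idea, not the change made to Iem_masterISR(): `StormGuard`, its functions, and `STORM_LIMIT` are hypothetical names, and the threshold is arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

#define STORM_LIMIT 1000u  /* hypothetical threshold, chosen arbitrarily */

typedef struct {
    uint32_t consecutive;  /* ISR entries since the idle loop last ran */
    bool     masked;       /* true once the guard leaves the line disabled */
} StormGuard;

/* Call from the idle loop: proves non-interrupt code still gets CPU time. */
static void storm_guard_idle(StormGuard *g)
{
    g->consecutive = 0;
}

/* Call at the end of the ISR.
 * Returns true if the interrupt should be re-enabled, false to leave it off. */
static bool storm_guard_isr_exit(StormGuard *g)
{
    if (g->masked) {
        return false;
    }
    if (++g->consecutive >= STORM_LIMIT) {
        g->masked = true;  /* too many back-to-back entries: stop re-arming */
        return false;
    }
    return true;
}
```

A guard like this only hides the symptom. The stale interrupt source still has to be cleared or reset for the peripheral to work again.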

0 Arnie Reynoso over 12 years ago in reply to Dave Beal

TI__Expert 6860 points

I'm not familiar with what the Iem_masterISR() function is doing but it seems you are on the right path. Is that function part of the IPNC RDK stack? Maybe others here might be more familiar with what's going on.

0 Dave Beal over 12 years ago in reply to Arnie Reynoso

Intellectual 880 points

Hi Arnie -

The Iem_masterISR() function is in the IPNC RDK under Source/ti_tools/iss_03_50_00_00/packages/ti/psp/iss/common/src/iss_evtMgr.c. My current theory is that rebooting my board by pressing the reset button completely resets the ISS hardware and allows everything to work, while just reloading and restarting the VPSS M3 doesn't reset some part of the ISS and results in the M3 being hammered by interrupts.

0 Brijesh Jadav over 12 years ago in reply to Dave Beal

TI__Guru**** 479995 points

Hi Dave,

Could you please share a dump of a few registers when you see this issue? I first want to check why the interrupt is enabled.

Four registers from 0x55040020

Four registers from 0x55050024

Regards,

Brijesh Jadav

0 Dave Beal over 12 years ago in reply to Brijesh Jadav

Intellectual 880 points

Thanks for your response, Brijesh. I am out of the office for the next week, but solving this mystery is my top priority, so I will provide this information to you as soon as I get back. In the meantime, if there is any other info that would help, please let me know.

0 Brijesh Jadav over 12 years ago in reply to Dave Beal

TI__Guru**** 479995 points

Hello Dave,

I have made a few changes in the event manager to fix this issue. Could you please try them and check whether they resolve it?

Please copy or replace the files from the eventmngr.zip file below, as follows:

5518.eventmngr.zip

- rename BIOS_m3vpss_ti8107.cfg to BIOS_m3vpss.cfg and copy it to mcfw\src_bios6\cfg\ti810x folder

- rename BIOS_m3vpss_ti814x.cfg to BIOS_m3vpss.cfg and copy it to mcfw\src_bios6\cfg\ti814x folder

- copy iss_evtMgr.h file into hdvpss\packages\ti\psp\iss\common folder

- copy iss_evtMgr.c into hdvpss\packages\ti\psp\iss\common\src folder

Regards,

Brijesh Jadav

0 Dave Beal over 12 years ago in reply to Brijesh Jadav

Intellectual 880 points

Hi Brijesh -

Thank you for the patch files. I have rebuilt using these files and, so far, it appears that I can now usually kill and restart my McFW application successfully. This is a major improvement. However, about one time in three or four that I kill and restart the application, I still get an "Ipc_attach timeout" error on the Linux console. Using CCS to suspend the VPSS M3 when we get into this state, I no longer see it continually servicing interrupts, so I believe that your patch has fixed that problem. I will proceed with diagnosing this new problem that happens occasionally.

0 Rajat Sagar over 12 years ago in reply to Dave Beal

TI__Expert 8780 points

Hi Dave,

Can you try remotely resetting the ISS when you do the tear-down sequence? For the ISS, the base address from the A8 is 0x5C000000. It's also published in the A8 L3 memory map. This can be used to reset the ISS remotely (from the A8) during the tear-down sequence.

Is the IPC timeout occurring after the 3rd or 4th restart attempt, or is it random in nature, i.e. can it occur even on the first restart?

Regards

Rajat

0 Dave Beal over 12 years ago in reply to Rajat Sagar

Intellectual 880 points

Hi Rajat -

Thanks for the reply. The current state of the problem is that if we take down our Linux application gracefully and allow it to tear down the McFW chain, there is no problem; we can always restart the application and everything comes back up successfully. The remaining problem scenario is when I kill the application with a SIGKILL, so no graceful teardown is possible. In this case, the VPSS M3 sometimes calls abort() before attaching to the A8. This happens randomly - sometimes on the first kill, sometimes the fifth - but on average about every third kill/restart. There is also a case (about one out of every 10 trials) where the A8 Linux console completely locks up when the application is killed, followed by a spontaneous Linux reboot about a minute later. I haven't yet tried to diagnose this case.

At the moment, I'm trying to diagnose the VPSS abort() by manually dissecting the stack to see where abort() is called from. I'll let you know when I find something. Thanks again for your attention to this problem.

0 Dave Beal over 12 years ago in reply to Dave Beal

Intellectual 880 points

In the case when the VPSS restart fails because it calls abort(), abort() is being called by the function Task_checkStacks(), because it thinks that the stack belonging to System_main() has overflowed. When I look at the stack, it is filled to capacity with zero bytes, whereas unused stack space is apparently supposed to contain bytes of 0xBE. I don't know yet whether the stack was overwritten with zeroes or if it was never initialized with 0xBE's. However, the zero bytes stop exactly at the beginning (high address end) of System_main's stack, so it doesn't look like the result of a random wild write.
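For reference, a fill-pattern stack check generally works like the sketch below: the stack is pre-filled with a known byte, and the checker looks at how much of the pattern survives at the low-address end. This is an illustration under stated assumptions, not the SYS/BIOS implementation; `stack_unused_bytes` is a hypothetical helper, and only the 0xBE fill value comes from the observation above.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xBEu  /* fill byte observed in unused stack space */

/* For a stack that grows downward, count the untouched bytes starting at
 * the lowest address. A result of 0 means the pattern is gone entirely,
 * which a checker would report as an overflow. */
static size_t stack_unused_bytes(const uint8_t *stack_base, size_t stack_size)
{
    size_t n = 0;
    while (n < stack_size && stack_base[n] == STACK_FILL) {
        n++;
    }
    return n;
}
```

A stack full of zero bytes yields 0 here whether it was overwritten or simply never filled, so this kind of check cannot tell those two cases apart.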

0 Dave Beal over 12 years ago in reply to Dave Beal

Intellectual 880 points

My previous post was incorrect. When our application is restarted, Task_checkStacks() is calling abort() on the VPSS M3, but it's not because a stack has overflowed. It's because Task_schedule() gets called with nothing on the readyQ. Instead of containing a pointer to a valid Task_Object, the head of the readyQ just points to itself. Task_schedule() apparently assumes that there will always be something on the readyQ, so it grabs the pointer at the head of the readyQ, interprets it as a pointer to a Task_Object, and passes it as "newtask" to Task_checkStacks(), which calls abort() because the stack of the bogus "task" doesn't look right.
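The failure mode is easy to reproduce with any circular doubly linked queue in which an empty queue is represented by a head that points to itself. The sketch below uses hypothetical names (`QueueElem`, `queue_head_or_null`) and is not the SYS/BIOS Queue code; it only shows the empty check that prevents the head from being mistaken for an element.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct QueueElem {
    struct QueueElem *next;
    struct QueueElem *prev;
} QueueElem;

/* An empty queue is a head whose links point back at itself. */
static void queue_init(QueueElem *head)
{
    head->next = head;
    head->prev = head;
}

static bool queue_empty(const QueueElem *head)
{
    return head->next == head;
}

/* Append e at the tail of the queue. */
static void queue_put(QueueElem *head, QueueElem *e)
{
    e->next = head;
    e->prev = head->prev;
    head->prev->next = e;
    head->prev = e;
}

/* Return the first element, or NULL when the queue is empty, instead of
 * handing back the head itself as a bogus element. */
static QueueElem *queue_head_or_null(QueueElem *head)
{
    return queue_empty(head) ? NULL : head->next;
}
```

Without the `queue_empty()` test, reading `head->next` on an empty queue returns the head, and a caller that casts it to a task object ends up inspecting unrelated memory.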

0 Yogesh Marathe over 12 years ago in reply to Dave Beal

TI__Expert 7765 points

I think killing the A8-side application with SIGKILL, or with any signal for that matter, will not help. The SysLink ProcMgr module, which is responsible for booting, loading and resetting remote cores, works from Linux kernel space, and if you kill the A8 Linux application, the data structures related to this module and all other SysLink modules will remain in an inconsistent state. You can see issues like abort() on the M3 as well as on the A8 if you try to restart or reload without a clean exit.

As I understand it, the M3s work with the unicache, and whenever there is a graceful shutdown of an M3 from the A8, the A8-side code ensures that Ducati's unicache is flushed. Without this operation you cannot restart any application on the M3, even if it is powered down and powered up. I think that if a signal is delivered to a process where SysLink is set up, the signal handler in that process should ensure that all slave cores are unloaded, and that syslink.ko is removed (rmmod) and inserted again (insmod) before the application is restarted.
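A minimal sketch of the signal-handling side of that advice, using only standard POSIX calls. The names `install_shutdown_handler`, `on_terminate`, and `g_shutdown_requested` are hypothetical, and the actual teardown (stopping the McFW chain, unloading the slave cores) is left to the application. Note that SIGKILL cannot be caught, so this only helps for catchable signals such as SIGTERM or SIGINT.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler, polled by the main loop. */
static volatile sig_atomic_t g_shutdown_requested = 0;

/* Only set a flag here; the real teardown runs in normal context. */
static void on_terminate(int signo)
{
    (void)signo;
    g_shutdown_requested = 1;
}

/* Returns 0 on success, -1 if the handler cannot be installed
 * (which is always the case for SIGKILL). */
static int install_shutdown_handler(int signo)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_terminate;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}
```

The main loop would check `g_shutdown_requested` and, once it is set, run the same graceful teardown path that already works when the application exits normally.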

0 Dave Beal over 12 years ago in reply to Yogesh Marathe

Intellectual 880 points

Hello Yogesh -

Thank you for your reply. Yes, I've seen evidence that when I use SIGKILL to kill and restart my A8 application, things are not entirely reset on the A8; I get "NameServer_add: duplicate entry found!" messages on the A8 console, and if I repeatedly kill and restart, I eventually get "_ProcMgr_map: All memEntries slots are in use!".

On the other hand, I've managed to add some code to the BIOS6 Task_schedule() function that allows recovery from the empty readyQ problem I described above, so I can now successfully SIGKILL and restart my A8 application until I hit the ProcMgr problem.

0 Dave Beal over 12 years ago in reply to Dave Beal

Intellectual 880 points

Since this thread has evolved into a question of how to reinitialize the Linux Syslink driver with an unresponsive M3, I've started a new thread: http://e2e.ti.com/support/embedded/linux/f/354/t/275718.aspx
