AM620-Q1: IPC crash in resume from mcu-only mode

Walter Wang

Other Parts Discussed in Thread: SK-AM62-LP

[continue conversation from https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1529704/am620-q1-failed-to-use-mcu-only-mode]

Hi Gibbs,

We have solved the problem where the MCU could not control IO in MCU-only mode.

Next, please prioritize the problem where the MCU wakes up the SoC, which causes an IPC crash and an automatic restart.

Thanks

4 months ago

0 Nick Saulnier 4 months ago

TI__Guru** 103580 points

Hello Walter,

We need more information about what you are testing. Here are the details I was able to figure out so far:

Anshu told me that you were running tests with this Linux script:

while true; do echo mem > /sys/power/state; done;

Based on that input, I assume that the terminal output you showed is broken into several wakeup / suspend cycles, like this:

We can verify from the output "fifo 1 has unexpected unread messages" that the M4F is the one who did not read the mailbox message during #2 in the above image (i.e., not Linux).

In case anyone wants to double-check the thought process on how we can verify what "fifo 1" means, I'll attach the relevant code snippets.

// from devicetree files
// FIFO 1 is used for transmitting signals from Linux to M4F
// mbox-m4-0 > ti,mbox-tx = <1 0 0>;

        mailbox0_cluster0: mailbox@29000000 {
                compatible = "ti,am64-mailbox";
                reg = <0x00 0x29000000 0x00 0x200>;
                interrupts = <GIC_SPI 76 IRQ_TYPE_LEVEL_HIGH>,
                             <GIC_SPI 77 IRQ_TYPE_LEVEL_HIGH>;
                #mbox-cells = <1>;
                ti,mbox-num-users = <4>;
                ti,mbox-num-fifos = <16>;
        };

&mailbox0_cluster0 {
        mbox_m4_0: mbox-m4-0 {
                ti,mbox-rx = <0 0 0>;
                ti,mbox-tx = <1 0 0>;
        };
        mbox_r5_0: mbox-r5-0 {
                ti,mbox-rx = <2 0 0>;
                ti,mbox-tx = <3 0 0>;
        };
};

// from Linux driver
// fifo %d refers to the actual hardware offset
// i.e., 100% of the time, fifo 1 = the 2nd hardware fifo instance

#define MAILBOX_MSGSTATUS(m)            (0x0c0 + 4 * (m))

#ifdef CONFIG_PM_SLEEP
static int omap_mbox_suspend(struct device *dev)
{
        struct omap_mbox_device *mdev = dev_get_drvdata(dev);
        u32 usr, fifo, reg;

        if (pm_runtime_status_suspended(dev))
                return 0;

        for (fifo = 0; fifo < mdev->num_fifos; fifo++) {
                if (mbox_read_reg(mdev, MAILBOX_MSGSTATUS(fifo))) {
                        dev_err(mdev->dev, "fifo %d has unexpected unread messages\n",
                                fifo);
                        return -EBUSY;
                }
        }
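
To make the "fifo 1" mapping concrete, here is a minimal sketch of the offset arithmetic behind that error message. It reuses the MAILBOX_MSGSTATUS macro from the driver and the base address and fifo count from the devicetree node above. The helper name fifo_msgstatus_addr and the two MAILBOX0_* constants are hypothetical and exist only for illustration; they are not part of the Linux driver.

```c
#include <assert.h>
#include <stdint.h>

/* Same macro as in the Linux driver snippet above. */
#define MAILBOX_MSGSTATUS(m)    (0x0c0 + 4 * (m))

/* Values copied from the mailbox0_cluster0 devicetree node above. */
#define MAILBOX0_CLUSTER0_BASE  0x29000000u
#define MAILBOX0_NUM_FIFOS      16u

/* Hypothetical helper, not part of the driver: address of the MSGSTATUS
 * register that the suspend check reads for a given hardware fifo. */
uint32_t fifo_msgstatus_addr(uint32_t fifo)
{
    assert(fifo < MAILBOX0_NUM_FIFOS);
    return MAILBOX0_CLUSTER0_BASE + (uint32_t)MAILBOX_MSGSTATUS(fifo);
}
```

For fifo 1 this gives offset 0x0c4, i.e. address 0x290000c4, which is the second MSGSTATUS register in the cluster and the fifo that the devicetree assigns to Linux-to-M4F traffic (ti,mbox-tx = <1 0 0>).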

0 Nick Saulnier 4 months ago in reply to Nick Saulnier

TI__Guru** 103580 points

Please answer the questions in bold.

Based on the terminal timestamps of 7897 seconds, I assume that you had many successful low power mode entries and exits before you hit an error.

QUESTION 1: Is it true that you had multiple successful low power mode cycles before running into an issue?

One potential issue is that you could have a race condition between Linux and the M4F core.

There are only 3 milliseconds between receiving the ECHO_REPLY mailbox message from the M4F and starting the next suspend. Perhaps Linux is sending the next low power mode signal to the M4F before the M4F has finished changing state back from "low power mode" to "system running", and that is causing an issue.

TEST 1: Add a short delay in the Linux test script

If Linux waits for a half second or a second before it starts the next low power mode, what happens? Does the crash behavior go away?

If yes, then you have a race condition.

QUESTION 2: What wakeup source are you using for these tests? How are you triggering the M4F to wake up the system?

This line tells us that Linux is getting woken up by the M4F core:
ti_sci_resume: wakeup source: 0x90

For more info, refer to

https://software-dl.ti.com/processor-sdk-linux/esd/AM62X/10_01_10_04/exports/docs/linux/Foundational_Components/Power_Management/pm_wakeup_sources.html#mcu-ipc-based-wakeup

and

https://downloads.ti.com/tisci/esd/latest/2_tisci_msgs/pm/lpm.html#tisci-msg-lpm-wake-reason

And Anshu mentioned that you have experimented with a bunch of different wakeup sources. So are you using the UART terminal to the M4F core? Are you using something else? Please clarify which wakeup sources are for testing purposes and which are for the final product.

QUESTION 3: Are you using the unmodified ipc_rpmsg_echo_linux example? If you are using custom code, please have Gibbs share that custom code with us offline

If you made a bunch of code changes, it is also possible that the M4F is crashing somewhere. If you still have IPC_RPMsg echo test code running on the M4F, you could use the IPC_Echo test to see if the M4F is still able to respond after a low power mode transition fails:
https://dev.ti.com/tirex/explore/node?node=A__AXINfJJ0T8V7CR5pTK41ww__AM62-ACADEMY__uiYMDcq__LATEST

Regards,

Nick

0 Nick Saulnier 4 months ago in reply to Nick Saulnier

TI__Guru** 103580 points

QUESTION 4: If you are using custom M4F code, are you storing any program data in DDR?

Please have Gibbs share the linker.cmd file with us offline.

DDR is not accessible during MCU-only low power mode.

Even during Linux runtime, M4F does not have any cache. So accesses to DDR can be very slow. That could also influence a race condition between Linux and M4F.

0 Gibbs Shih 4 months ago in reply to Nick Saulnier

TI__Expert 5475 points

Hi, Nick

Thanks for your suggestions.

A brief status update first.

I think the customer wants to run a "stress test" of LPM (MCU-only mode) wakeup cycling, so they designed these experiments. The goals are:

(1) Does any software crash occur when AM62 repeatedly enters and wakes up from LPM over a long test run?

(2) Does any software crash occur when AM62 is in the process of entering LPM and receives a wakeup event at the same time?

(3) Does any software crash occur when AM62 is in the process of waking up from LPM and receives a "sleep" event at the same time?

The test is based on our IPC_RPMSG_ECHO_LINUX example code.

The project is attached:

https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/ipc_5F00_rpmsg_5F00_echo_5F00_linux_5F00_am62x_2D00_sk_2D00_lp_5F00_m4fss0_2D00_0_5F00_freertos_5F00_ti_2D00_arm_2D00_clang_2D00_auto_5F00_timer.7z

Key code modification

static void lpm_mcu_wait_for_uart()
{
    UART_Transaction trans;
    uint8_t uartData;
    int32_t status;
    uint32_t cnt = 0U;
    static uint32_t cnt1 = 0U;
    static uint8_t WakeFlag = 0U;

    UART_Transaction_init(&trans);

    /* Read 1 byte */
    trans.buf   = &uartData;
    trans.count = 1U;

    DebugP_memLogWriterPause();

    gNumBytesRead = 0u;

	//DebugP_log(" mcu is running !\r\n mcu is running\r\n mcu is running\r\n mcu is running\r\n");

	vTaskDelay(1000);
	SOC_triggerMcuLpmWakeup();
	SemaphoreP_pend(&gLpmResumeSem, SystemP_WAIT_FOREVER);
	cnt1++;
	DebugP_log("wake soc success,cnt = %u\r\n",cnt1);
#if 0

    while(1)
    {
    	/*
		if(cnt < 0x7FFFF)
        {
            cnt++;
        }
		else
		{
			cnt1++;
			DebugP_log("cnt = %u\r\n",cnt1);
			cnt = 0U;
			if (cnt1 == 5)
			{
				cnt1 = 0U;
				SOC_triggerMcuLpmWakeup();
				SemaphoreP_pend(&gLpmResumeSem, SystemP_WAIT_FOREVER);
				break;
			}
		}
		*/
    	// code for the MCU to wake up the SoC through the API
		/*
        if(cnt < 0x7FFFFF)
        {
            cnt++;
        }
		else
		{
			SOC_triggerMcuLpmWakeup();
			SemaphoreP_pend(&gLpmResumeSem, SystemP_WAIT_FOREVER);
			cnt1++;
			DebugP_log("wake soc success,cnt = %u\r\n",cnt1);
			break;
		}
		*/
    }


    /* Wait for any key to be pressed */
    status = UART_read(gUartHandle[CONFIG_UART0], &trans);
    DebugP_assert(status == SystemP_SUCCESS);

    while (gNumBytesRead == 0u && gbSuspended == 1u)
    {
    }

    if (gNumBytesRead != 0)
    {
        DebugP_log("[IPC RPMSG ECHO] Key pressed. Notifying DM to wakeup main domain\r\n");
        SOC_triggerMcuLpmWakeup();
        /* Wait for resuming the main domain */
        SemaphoreP_pend(&gLpmResumeSem, SystemP_WAIT_FOREVER);

        DebugP_log("[IPC RPMSG ECHO] Main domain resumed due to MCU UART \r\n");
    }
    else if (gbSuspended == 0u)
    {
        UART_readCancel(gUartHandle[CONFIG_UART0], &trans);
        DebugP_log("[IPC RPMSG ECHO] Main domain resumed from a different wakeup source \r\n");
    }
#endif
    DebugP_memLogWriterResume();
}

I think the Linux script should be modified as below (adding a 2 sec delay):

while true; do echo mem > /sys/power/state; sleep 2; done;

As far as I know, AM62x needs ~57 ms to enter LPM.

Once it gets a wakeup event, it needs ~277 ms to finish resuming.

(test on SK-AM62-LP)

[   74.934434] PM: suspend entry (deep)
[   74.940791] Filesystems sync: 0.002 seconds
[   74.962326] Freezing user space processes
[   74.968598] Freezing user space processes completed (elapsed 0.002 seconds)
[   74.975654] OOM killer disabled.
[   74.978887] Freezing remaining freezable tasks
[   74.984862] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
[   74.992274] printk: Suspending console(s) (use no_console_suspend to debug)

--> 57ms


[   75.008295] ti-sci 44043000.system-controller: ti_sci_cmd_set_device_constraint: device: 179: state: 1: ret 0
[   75.008486] ti-sci 44043000.system-controller: ti_sci_cmd_set_device_constraint: device: 178: state: 1: ret 0
[   75.016090] omap8250 2800000.serial: PM domain pd:146 will not be powered off
[   75.016769] ti-sci 44043000.system-controller: ti_sci_cmd_set_device_constraint: device: 117: state: 1: ret 0
[   75.016961] ti-sci 44043000.system-controller: ti_sci_cmd_set_latency_constraint: latency: 100: state: 1: ret 0
[   75.040593] Disabling non-boot CPUs ...
[   75.043143] psci: CPU1 killed (polled 0 ms)
[   75.047649] psci: CPU2 killed (polled 0 ms)
[   75.051339] psci: CPU3 killed (polled 0 ms)
[   75.053039] Enabling non-boot CPUs ...
[   75.053478] Detected VIPT I-cache on CPU1
[   75.053534] GICv3: CPU1: found redistributor 1 region 0:0x00000000018a0000
[   75.053603] CPU1: Booted secondary processor 0x0000000001 [0x410fd034]
[   75.054937] CPU1 is up
[   75.055238] Detected VIPT I-cache on CPU2
[   75.055274] GICv3: CPU2: found redistributor 2 region 0:0x00000000018c0000
[   75.055323] CPU2: Booted secondary processor 0x0000000002 [0x410fd034]
[   75.056354] CPU2 is up
[   75.056650] Detected VIPT I-cache on CPU3
[   75.056686] GICv3: CPU3: found redistributor 3 region 0:0x00000000018e0000
[   75.056738] CPU3: Booted secondary processor 0x0000000003 [0x410fd034]
[   75.057784] CPU3 is up
[   75.058442] ti-sci 44043000.system-controller: ti_sci_resume: wakeup source: 0x90
[   75.075295] am65-cpsw-nuss 8000000.ethernet: set new flow-id-base 19
[   75.084626] am65-cpsw-nuss 8000000.ethernet eth0: PHY [8000f00.mdio:00] driver [TI DP83867] (irq=POLL)
[   75.084650] am65-cpsw-nuss 8000000.ethernet eth0: configuring for phy/rgmii-rxid link mode
[   75.249195] OOM killer enabled.
[   75.252343] Restarting tasks ... done.
[   75.258819] random: crng reseeded on system resumption
[   75.266115] k3-m4-rproc 5000000.m4fss: Core is on in resume
[   75.271949] k3-m4-rproc 5000000.m4fss: received echo reply from 5000000.m4fss
[   75.285786] PM: suspend exit

-->277ms
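
The ~57 ms and ~277 ms figures can be reproduced from the kernel timestamps in the log above. The sketch below assumes the interval boundaries are the ones the arrows suggest: "PM: suspend entry" to "printk: Suspending console(s)", and the first ti-sci line after that to "PM: suspend exit". The helper names parse_ktime and delta_ms are hypothetical. Note that these are deltas between printk timestamps, so they include the cost of the prints themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Parse the "[   74.934434]" prefix of a kernel log line.
 * Stores the timestamp in seconds in *out and returns 0, or returns -1
 * if the line does not start with a bracketed number. */
int parse_ktime(const char *line, double *out)
{
    assert(line != NULL && out != NULL);
    return (sscanf(line, " [%lf", out) == 1) ? 0 : -1;
}

/* Elapsed time in milliseconds between two log lines,
 * or -1.0 if either line has no parsable timestamp. */
double delta_ms(const char *start_line, const char *end_line)
{
    double t0, t1;

    if (parse_ktime(start_line, &t0) != 0 || parse_ktime(end_line, &t1) != 0)
        return -1.0;
    return (t1 - t0) * 1000.0;
}
```

Passing the first and last lines of each log section to delta_ms gives about 57.8 ms for the entry side and about 277.5 ms for the resume side, which matches the numbers quoted above.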

Thank You.

Gibbs

0 Anshu Madwesh 4 months ago in reply to Gibbs Shih

TI__Mastermind 19465 points

Hi Gibbs,

Can you clarify if this continuous execution of LPMs is an expected use case of the final product?

Please do respond to Nick's questions so we can debug further.

The latency numbers that you mentioned are for the print statements. As we discussed in a previous thread, the latency will go down when we reduce the number of print statements.

Thanks,

Anshu

0 Gibbs Shih 4 months ago in reply to Anshu Madwesh

TI__Expert 5475 points

Hi, Anshu & Nick

Replies as below.

Q1/A1 : Is it true that you had multiple successful low power mode cycles before running into an issue?

Yes. So far we have run two rounds of testing: one failure after about 4k successful cycles, and one failure after about 9k successful cycles.

To avoid a potential race condition, I asked the customer to add some delay (~1 sec) in the code before the MCU triggers the A53 wakeup.

Q2/A2: What wakeup source are you using for these tests? How are you triggering the M4F to wake up the system?

As I mentioned before, we found that there is some chance of an RPMsg crash when exiting from or re-entering LPM, so we designed a "stress test" to find the root cause.

(1) Stress test:

No GPIO trigger; the MCU adds a simple delay and then triggers the wakeup. I already shared the code in my previous post.

(2) Real use case:

In the real use case, the MCU (in MCU-only mode or deep sleep mode) will read the status of multiple GPIOs (MCU_GPIO0_19 / MCU_GPIO0_20 / MCU_GPIO0_3 / MCU_GPIO0_14) to trigger the A53 Linux wakeup. These GPIO signals come from vehicle IGN / WIFI / 5G module / CAN bus / USB wakeup, etc., so it is quite possible that multiple wakeup events trigger at almost the same time. Because we suspect that overly frequent GPIO wakeup triggers during LPM entry/exit may cause RPMsg communication to fail, we designed the "stress test" for deeper debugging.

I think AM62 will never power off in the field vehicle application; it only switches between LPM and normal operation. So we need to make sure there is no serious issue with LPM wakeup.

Q3/A3 : Are you using the unmodified ipc_rpmsg_echo_linux example? If you are using custom code, please have Gibbs share that custom code with us offline.

As I mentioned before, I already shared the code here. We use the default image build (tisdk-default-image-am62xx-lp-evm-10.01.10.04.rootfs.wic.xz) with slightly modified ipc_rpmsg_echo_linux sample code. All tests run on the SK-AM62-LP EVM; you can download my project code directly and run it on the EVM.

Q4/A4 : If you are using custom M4F code, are you storing any program data in DDR?

This question follows from Q3/A3. I think we do not store program data in DDR, because we use the default sample project (ipc_rpmsg_echo_linux) with small modifications for this test.

Q5/A5 : Is there any way to analyze a watchdog reset?

This is a new question. During the "stress test" we found that the system sometimes reboots, but it takes a long test run to reproduce. We suspect that a watchdog reset is triggered when AM62 hits some error exception, and the system then reboots after 3 minutes. We have had a very similar discussion before.

Once the system reboots, we lose all logs. Is there any way to log some information when AM62 has a WDT error, so we can find out why AM62 rebooted?

By the way, I also question whether "continuous execution of LPMs" is reasonable, so the discussion with the customer is still ongoing.

Thank You Very Much

Gibbs

0 Anshu Madwesh 4 months ago in reply to Gibbs Shih

TI__Mastermind 19465 points

Hi Gibbs,

For question 5, we will continue this discussion on this thread: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1534178/am620-q1-tifs-timeout

Thanks,

Anshu

0 Nick Saulnier 4 months ago in reply to Anshu Madwesh

TI__Guru** 103580 points

Hello Gibbs,

1) Status update when adding 1 sec before Linux low power transitions?

How many test runs has the customer done? With a 1 second delay before Linux starts the next low power transition, have we seen any failures?

2) thank you for sharing test code

Comments based on the code posted on June 27:
https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1532839/am620-q1-ipc-crash-in-resume-from-mcu-only-mode/5896055#5896055

a) the system might not work as expected if something else wakes up the processor other than the M4F core (since the code to handle that usecase has been commented out)

b) The code is still calling UART_Transaction_init. I think this should be ok, since it seems like the function is just initializing the UART transaction structure and not actually starting anything running like UART_read https://software-dl.ti.com/mcu-plus-sdk/esd/AM62X/10_01_00_33/exports/docs/api_guide_am62x/group__DRV__UART__MODULE.html#ga07908a73787a9ef3e8356c51524b8c44

3) Capturing log output

You could capture logs on the EVM, but if you have a computer connected to the EVM, you can also capture logs on the computer (e.g., if you have a Linux PC that is connected to the EVM for the entire time that the test is running). There should be multiple different ways to do this, depending on what terminal application you are using to connect to the EVM.

Regards,

Nick

0 Nick Saulnier 4 months ago in reply to Nick Saulnier

TI__Guru** 103580 points

Walter Wang

1) Do you have any status updates on whether you are still seeing failures when Linux adds a 1 second delay?

2) Tell me more about your MCU GPIO wakeup sources.

Does the M4F have to do any logic to decide whether a GPIO signal should cause a wakeup? Or do you just want any change in GPIO status to trigger a wakeup?

It sounds like Linux is controlling the MCU GPIO module during runtime. So Linux is the owner for the MCU GPIO, not the M4F.

Instead of using M4F to watch the MCU GPIO signals, I would suggest configuring each of the GPIO inputs to be wakeup sources for Linux. That will also simplify your M4F software development.

For more information, refer to https://software-dl.ti.com/processor-sdk-linux/esd/AM62X/10_01_10_04/exports/docs/linux/Foundational_Components/Power_Management/pm_wakeup_sources.html

Regards,

Nick

0 Nick Saulnier 4 months ago in reply to Nick Saulnier

TI__Guru** 103580 points

To clarify what I mean by "logic", this is an example use case that does not require logic:

4 different GPIO signals are defined as wakeup sources.
If any of them has a rising edge, wake the processor

and this is an example usecase that WOULD require the M4F to do some logic:

4 different GPIO signals are wakeup sources.
if GPIO1 has rising edge, wake processor
else, if GPIO2 has 3 pulses, wake processor
else, if GPIO3 has a rising edge, wake processor, but only if GPIO4 is already high
else, if GPIO4 has a rising edge, wake processor, but only if GPIO3 is already high
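
The second example could be sketched as a small decision function on the M4F. This is only an illustration of the rules listed above, not code from the SDK: the wake_inputs_t structure, its fields, and should_wake are hypothetical names, and the sketch assumes that edge detection and pulse counting are done elsewhere. It also treats "3 pulses" as "at least 3", which is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of what the M4F has observed on the four inputs.
 * Index 0..3 corresponds to GPIO1..GPIO4 in the example above. */
typedef struct {
    bool     rising[4];     /* a rising edge was seen on this input */
    bool     level[4];      /* level of this input when the edge was seen */
    uint32_t gpio2_pulses;  /* number of pulses counted on GPIO2 */
} wake_inputs_t;

/* Returns true if the example rules say the processor should be woken. */
bool should_wake(const wake_inputs_t *in)
{
    assert(in != NULL);

    if (in->rising[0])                  /* GPIO1: any rising edge */
        return true;
    if (in->gpio2_pulses >= 3u)         /* GPIO2: 3 pulses (assumed: at least 3) */
        return true;
    if (in->rising[2] && in->level[3])  /* GPIO3 edge, only if GPIO4 is high */
        return true;
    if (in->rising[3] && in->level[2])  /* GPIO4 edge, only if GPIO3 is high */
        return true;
    return false;
}
```

In the first example, where any rising edge wakes the processor, no function like this is needed, which is why those inputs can be configured directly as wakeup sources.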

0 Walter Wang 4 months ago in reply to Nick Saulnier

Prodigy 200 points

Hi Nick,

At present, the only remaining problem is the automatic restart after 3 minutes.

Gibbs will sync with you on the details.

Thanks
