SK-AM62A-LP: SDK 10.1: Attempting deep sleep seems to cause DM assertion in FreeRTOS kernel

Part Number: SK-AM62A-LP


Hi,

With SDK 10.1, we've observed that when we attempt to put the EVM (revision PROC135E3) into deep sleep via "rtcwake -s 30 -m mem" or "echo mem > /sys/power/state", the DM R5 often (always?) attempts to print an assertion failure and the system appears to hang. Power readings across the sense resistors in this state largely match the values when awake. Some further debug details for SDK 10.1 follow below, after the divider.

With console_suspend disabled, the other symptoms we see are the same as PROCESSOR-SDK-AM62A: Deep Sleep Error in SDK 10.00: mailbox timeout first in ti_sci_suspend(), then followed by an error in the e5010_jpeg_enc kernel module during resume, or further ti-sci errors and MMC timeouts if e5010_jpeg_enc is unloaded before attempting to sleep. Perhaps the e5010_jpeg_enc and MMC errors are simply a result of the system getting into a bad state from the failed suspend and attempted resume?

Aside from similar symptoms in Linux, the DM behavior in SDKs 10.0 and 10.1 does appear to be different. In 10.0, we never observe the assertion, and the DM does appear to make it to WFI (whether or not the e5010_jpeg_enc driver is unloaded first): when attaching to the R5 via JTAG (or subsequently unpausing and pausing its execution), its PC is always one instruction after a WFI; presumably the debugger wakes it from the WFI? However, even though the DM appears to reach WFI in SDK 10.0, power readings across the sense resistors remain much closer to the values when awake.

In SDK 9.2, after unloading the DSP remoteproc module (since that version of the DSP firmware doesn't support graceful shutdown), sleep does appear to succeed (and combined, I read ~20-30 mW across the sense resistors), so this seems like a regression in SDKs 10.0/10.1. In 9.2, it's not necessary to unload the e5010_jpeg_enc driver and no mailbox or other errors are printed. As in SDK 10.0, attaching to the R5 via JTAG shows it does appear to make it to WFI.
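
For reference, the combined figure is just the sum of per-rail power derived from each sense resistor. The following is a minimal sketch of that arithmetic; the rail names, voltages, and resistor values are invented placeholders for illustration, not values from the EVM schematic or from these measurements.

```python
from dataclasses import dataclass


@dataclass
class RailReading:
    name: str           # hypothetical rail label, not an EVM net name
    rail_volts: float   # rail (bus) voltage
    shunt_volts: float  # voltage drop measured across the sense resistor
    shunt_ohms: float   # sense resistor value

    def power_mw(self) -> float:
        # I = V_shunt / R_shunt, then P = I * V_rail, reported in milliwatts
        current_a = self.shunt_volts / self.shunt_ohms
        return current_a * self.rail_volts * 1000.0


def total_power_mw(readings) -> float:
    """Sum the power of all measured rails."""
    return sum(r.power_mw() for r in readings)


# Placeholder numbers only: 20 mA on a 0.75 V rail plus 5 mA on a 1.8 V rail.
readings = [
    RailReading("rail_a", rail_volts=0.75, shunt_volts=0.0002, shunt_ohms=0.01),
    RailReading("rail_b", rail_volts=1.8, shunt_volts=0.00005, shunt_ohms=0.01),
]
print(f"{total_power_mw(readings):.1f} mW")
```

Comparing this total before and after the suspend attempt is what distinguishes a real deep sleep from a hang at near-awake power.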

All three SDK versions we tested on the EVM for this were the edgeai images from here with no modifications.

Note also that the DM appears to have a placeholder version string in at least SDKs 9.2, 10.0, and 10.1.

9.2:

##DM Built On: Apr  2 2024 18:17:23
##Sciserver Version: v2023.11.0.0REL.MCUSDK.MM.NN.PP.bb
##RM_PM_HAL Version: vMM.NN.PP

10.0:

##DM Built On: Aug 13 2024 21:19:51
##Sciserver Version: v2023.11.0.0REL.MCUSDK.MM.NN.PP.bb
##RM_PM_HAL Version: vMM.NN.PP

10.1:

##DM Built On: Dec 10 2024 20:25:19
##Sciserver Version: v2023.11.0.0REL.MCUSDK.MM.NN.PP.bb
##RM_PM_HAL Version: vMM.NN.PP
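
A quick way to flag this across several boot logs is to parse the banner and look for the unexpanded `MM.NN.PP` token. This is a minimal sketch using the 10.1 banner quoted above as input; the helper names `parse_banner` and `placeholder_fields` are hypothetical, and the `##Key: value` layout is assumed from these three samples only.

```python
BANNER = """\
##DM Built On: Dec 10 2024 20:25:19
##Sciserver Version: v2023.11.0.0REL.MCUSDK.MM.NN.PP.bb
##RM_PM_HAL Version: vMM.NN.PP
"""


def parse_banner(text):
    """Split '##Key: value' lines into a dict (format assumed from the samples)."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("##") and ":" in line:
            # Split on the first colon only, since the build time contains colons.
            key, value = line[2:].split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def placeholder_fields(fields, token="MM.NN.PP"):
    """Return the names of fields whose value still contains the template token."""
    return sorted(key for key, value in fields.items() if token in value)


fields = parse_banner(BANNER)
print(placeholder_fields(fields))
```

Only the build timestamp differs between the three banners, so it is the one field that actually identifies which DM binary is running.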


Unfortunately the assertion message in SDK 10.1 gets truncated to ~36 or 37 characters pretty consistently, which is only enough to print the "FreeRTOS-Kernel/t" part of the file where it occurred, but once, it did manage to print "FreeRTOS-Kernel/ta", so it seems the assertion may be in tasks.c.
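
The narrowing from the extra character can be checked mechanically. This is a minimal sketch, assuming the candidate set is the top-level source files of the upstream FreeRTOS kernel; the DM build may compile a different subset, and the `FreeRTOS-Kernel/` prefix is taken from the truncated message itself.

```python
# Top-level sources of the upstream FreeRTOS kernel (assumed candidate set).
KERNEL_SOURCES = [
    "croutine.c",
    "event_groups.c",
    "list.c",
    "queue.c",
    "stream_buffer.c",
    "tasks.c",
    "timers.c",
]


def candidates(fragment, prefix="FreeRTOS-Kernel/"):
    """Return every candidate path that starts with the truncated fragment."""
    paths = [prefix + name for name in KERNEL_SOURCES]
    return [path for path in paths if path.startswith(fragment)]


print(candidates("FreeRTOS-Kernel/t"))   # usual truncation
print(candidates("FreeRTOS-Kernel/ta"))  # the one longer print
```

With the usual truncation, both tasks.c and timers.c remain possible; the single longer print rules out timers.c.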

Inspecting the R5 via JTAG, a few of the core registers contain addresses of relevant-sounding strings that could be the rest of the assertion message:

  • R1: 0x9D051194 -> "FreeRTOS-Kernel/queue.c"
  • R2: 0x9D050E79 -> "xQueueGiveMutexRecursive"
  • R5: 0x9D0524E6 -> "vTaskSwitchContext"
  • R6: 0x9D0511AC -> "FreeRTOS-Kernel/tasks.c"
  • R9: 0x9D04A559 -> "(uint32_t)(( ( &( pxReadyTasksLists[ uxTopPriority ] ) )->uxNumberOfItems ) > 0)"

Based on the sources in MCU+ SDK 10.01.00.33, the assertion condition pointed to by R9 does appear to occur in vTaskSwitchContext() via a call to the macro taskSELECT_HIGHEST_PRIORITY_TASK(), so it seems plausible that this was indeed what the DM was trying to print.
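
To make the asserted condition concrete, here is a simplified Python model of it, not the kernel code. It assumes the DM build uses a bitmap-based (port-optimised) task selection, which the stringified condition suggests but which is not confirmed here; the function names and the list-of-lists representation are hypothetical.

```python
def highest_ready_priority(ready_bitmap: int) -> int:
    """Index of the highest set bit, standing in for a CLZ-style lookup."""
    if ready_bitmap == 0:
        raise ValueError("no priority is marked ready")
    return ready_bitmap.bit_length() - 1


def select_highest_priority_task(ready_bitmap, ready_lists):
    """Pick a task from the ready list that the bitmap says is highest.

    Models the asserted condition: the ready list at the top priority
    must contain at least one item.
    """
    top = highest_ready_priority(ready_bitmap)
    if not ready_lists[top]:
        # Corresponds to the condition string found via R9 evaluating false.
        raise AssertionError(f"ready list for priority {top} is empty")
    # Simplification: the real kernel walks a list index; this takes the first.
    return ready_lists[top][0]


# Consistent state: bits 0 and 2 set, and both lists populated.
print(select_highest_priority_task(0b0101, [["idle"], [], ["worker"], []]))

# Inconsistent state: bit 3 set but the priority-3 list is empty.
try:
    select_highest_priority_task(0b1001, [["idle"], [], [], []])
except AssertionError as exc:
    print("assertion:", exc)
```

In this model the check only fails when the ready bitmap and the ready lists disagree; whether that is what actually happens in the DM during suspend is not established here.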

I see that the logging function uses a semaphore and that releasing it could have called xQueueGiveMutexRecursive(); perhaps a second assertion occurred there, which could be why the first didn't finish printing? The DM ends up in a tight loop polling a memory location on the stack (with value 1, comparing to 0); maybe one of the forever loops in _DebugP_assertNoLog() or _DebugP_assert()?

  • Hi Zach,

    Apologies for the delay. I'm not an RTOS expert so I'll need to consult the development team of the DM firmware on this item. Please be aware that Monday is a holiday at TI.

    Best Regards,

    Anshu

  • Thanks, and enjoy the holiday!

  • Hi Zach,

    There is a potential solution to the JPEG codec suspend/resume. There is a patch being developed right now and I can share it as soon as possible.

    Let's try the patch when it's ready and see if anything gets fixed. It's possible that, since the JPEG codec isn't being suspended, the Linux kernel crashes prior to putting the last A53 into WFI (or it catches the error during the suspend sequence but isn't able to resume). I think that would result in the SoC reporting that it's suspended even though the power consumption doesn't reflect that.

    Regarding the output from the RTOS/DM Firmware, I didn't get an update on this yet.


    Best Regards,

    Anshu

  • Hi Anshu,

    Is there an expected timeline for the driver patch?

    Have you heard anything from the DM firmware team? Since the JPEG driver error occurs during resume, after there's already been a mailbox timeout, it seems like the FreeRTOS assertion could be the root cause here.

    Thanks,
    Zach

  • Hi Zach,

    The patch was put on hold for now because the Edge AI Stack on AM62A doesn't support suspend/resume at the moment. There isn't an expected timeline to fix this issue.

    Best Regards,

    Anshu

  • Hi Anshu,

    "the Edge AI Stack on AM62A doesn't support suspend/resume at the moment."

    Could you please elaborate on this?

    Our understanding was that as of SDK 9.2, the only thing preventing out-of-the-box suspend/resume was that the DSP didn't respond to the graceful shutdown request from the remoteproc driver; as I noted above, with the DSP remoteproc driver unloaded (which doesn't even really stop the DSP if I understand correctly, since the shutdown request still times out), we see that the 9.2 edgeai image does successfully suspend and resume (based on both power measurements and SYSFW tracing via JTAG).

    Was it just coincidental that that happened to work?

    "There isn't an expected timeline to fix this issue."

    Is this specifically for the JPEG driver patch, or the suspend/resume feature as a whole (including the FreeRTOS assertion)?

    Thanks,
    Zach

  • Hi Zach,


    Not all of the Edge AI components support low power modes, meaning that the save/restore of context or the graceful shutdown mechanism doesn't exist for all of them. As the Edge AI components have continued to gain features and complexity, this lack of suspend/resume support has become more apparent, and a simple workaround like removing the DSP Remoteproc driver may not fix the overarching issue.

    The original discussion with the development team led to 'fixing' the JPEG codec driver as a possible solution. But after further evaluation, the JPEG codec may involve more Edge AI components than initially realized so 'fixing the driver' wasn't going to solve the issue.

    This may not be directly related to the FreeRTOS assertion that you've discussed in this thread, but it is difficult to offer debugging steps/solutions when the overall LPM+EdgeAI isn't working as expected.

    Best Regards,

    Anshu

  • Hi Anshu,

    Thanks for the clarifications regarding the Edge AI image, though we're somewhat disappointed to hear that support for sleep there is uncertain at the moment.

    Is suspend/resume intended to be supported on the default (non-edgeai) image? I've been able to test the 10.1 default image on our custom hardware, and it also fails to suspend with a mailbox timeout, although I don't see any assertion print out from the DM in this case.

    If the default images for the EVM are posted somewhere, I could see if the EVM fails the same way with the default image.

    Thanks,
    Zach

  • "I've been able to test the 10.1 default image on our custom hardware, and it also fails to suspend with a mailbox timeout, although I don't see any assertion print out from the DM in this case."

    Apologies, this may not be fully accurate. We've noticed that our 10.1 default image still has the 10.0 firmware.

    I was able to test on our hardware with a 10.1 base image that has the 10.1 firmware though, and it still behaved the same: fails to suspend with a mailbox timeout, and no assertion printed by the DM.

  • Hi, the expert on this topic is out of office. Expect a response next week.

    Regards,

  • Hi Zach,

    The development team builds a 'default' non-Edge AI image for internal testing, but it's not intended to be shared externally.

    Thanks,

    Anshu