PROCESSOR-SDK-AM64X: R5f core gets stuck after stopping and starting via Linux remoteproc interfaces a few times

Dominic Rath

Dear TI team,

We're seeing issues when starting and stopping an R5f application via the Linux command line interface.

Depending on the memory layout of the R5f application, the issue manifests as a complete freeze of the Linux system or as an R5f that is unresponsive / not actually running.
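For readers who want to reproduce the start/stop sequence, here is a minimal sketch that drives the remoteproc sysfs `state` attribute, the scripted equivalent of `echo start > /sys/class/remoteproc/remoteprocN/state`. The instance name `remoteproc0` and the helper names are assumptions for illustration; the index of the R5f differs between boards and kernels.

```python
from pathlib import Path

# Assumption: the R5f under test is remoteproc0; check the "name" attribute
# of each /sys/class/remoteproc/remoteprocN entry on your board.
DEFAULT_RPROC = Path("/sys/class/remoteproc/remoteproc0")


def rproc_state(rproc=DEFAULT_RPROC):
    """Return the state string the driver reports, e.g. 'offline' or 'running'."""
    return (Path(rproc) / "state").read_text().strip()


def rproc_set(rproc, command):
    """Write 'start' or 'stop' to the state attribute of one remote processor."""
    if command not in ("start", "stop"):
        raise ValueError("command must be 'start' or 'stop'")
    (Path(rproc) / "state").write_text(command)


def cycle(rproc=DEFAULT_RPROC, count=5):
    """Start and stop the core `count` times and record the state after each step."""
    seen = []
    for _ in range(count):
        rproc_set(rproc, "start")
        seen.append(rproc_state(rproc))
        rproc_set(rproc, "stop")
        seen.append(rproc_state(rproc))
    return seen
```

Note that the reported state only reflects the driver's view. As described further down in this post, the driver can report success while the core is not actually running.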

The default linker scripts from the IPC examples in MCU+ SDK 08.00 put the exception vectors in ATCM at address 0x0. In that case the A53 core freezes when re-starting the R5f application. The Linux kernel is stuck in memcpy_fromio which got called from rproc_elf_load_segments, and the CCS IDE is only able to retrieve a few registers. I believe this is what LCPD-20006 "AM64x: remoteproc may be stuck in the start phase after a few times of stop/start" from the Processor SDK Linux 08.00 release notes refers to.

Unfortunately the TRM for the AM64x is still very incomplete, but I believe the power sleep controller (PSC) is quite similar to the AM65x. I'm therefore using the AM65x TRM as a reference for the PSC0 registers.

The actual problem is that the R5f isn't properly shut down beforehand. I've monitored the PSC0 MDSTAT24 register (0x400860) across the sequence of start-stop-start-stop... until it eventually fails after 2-5 cycles.
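One possible way to poll this register from Linux userspace is to map `/dev/mem`. The sketch below is not necessarily how the readings in this post were taken (a debugger works just as well). It assumes a PSC0 base of 0x00400000 and an AM65x-style layout with MDSTATn at offset 0x800 + 4*n, which is consistent with the 0x400860 address quoted above; the function names are made up for illustration. It needs root, a kernel that permits `/dev/mem` access to that range, and it does not guarantee a single 32-bit bus access.

```python
import mmap
import os
import struct

# Assumptions, inferred from the 0x400860 address above and the AM65x layout:
PSC0_BASE = 0x00400000
MDSTAT_OFFSET = 0x800  # MDSTATn is assumed to sit at base + 0x800 + 4 * n


def mdstat_addr(module, base=PSC0_BASE):
    """Physical address of MDSTAT<module> under the assumed layout."""
    return base + MDSTAT_OFFSET + 4 * module


def read_reg32(addr, path="/dev/mem"):
    """Read one little-endian 32-bit word at physical address `addr`."""
    page = mmap.ALLOCATIONGRANULARITY
    map_base = addr & ~(page - 1)  # mmap offsets must be page aligned
    fd = os.open(path, os.O_RDONLY | os.O_SYNC)
    try:
        with mmap.mmap(fd, page, flags=mmap.MAP_SHARED,
                       prot=mmap.PROT_READ, offset=map_base) as mem:
            return struct.unpack_from("<I", mem, addr - map_base)[0]
    finally:
        os.close(fd)
```

On a target, `hex(read_reg32(mdstat_addr(24)))` after each stop would show whether the module reached the expected state before the next start.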

If everything is still well after stopping the R5f core via the remoteproc interface, the MDSTAT24 register reads 0x10a00. Starting the R5f the next time via the remoteproc interface works flawlessly.

After a few start-stop cycles, the MDSTAT24 register reads 0x11e0a - apparently the module got stuck somewhere on the way to turning it off. Starting the R5f the next time causes the A53 to freeze when it tries to write to the TCM. If I halt Linux before it actually loads the TCMs the MDSTAT register reads 0x11f0a.
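To make the three readings easier to compare, the sketch below splits a raw MDSTAT value into its STATE field and flag bits, and lists which bits differ between two readings. The bit names follow the KeyStone-style PSC layout (STATE in bits 5:0, LRST, LRSTDONE, MRST, MRSTDONE, MCKOUT, EMURST, EMUIHB). Treat them as an assumption: this layout has not been confirmed for the AM64x, and the meaning of state 0xa remains undocumented.

```python
# Assumption: KeyStone-style MDSTAT bit assignment, not confirmed for AM64x.
MDSTAT_FLAGS = {
    8: "LRST",
    9: "LRSTDONE",
    10: "MRST",
    11: "MRSTDONE",
    12: "MCKOUT",
    16: "EMURST",
    17: "EMUIHB",
}
STATE_MASK = 0x3F  # STATE is assumed to occupy bits 5:0


def decode_mdstat(value):
    """Split a raw MDSTAT value into the STATE field and the set flag names."""
    flags = [name for bit, name in sorted(MDSTAT_FLAGS.items())
             if value & (1 << bit)]
    return {"state": value & STATE_MASK, "flags": flags}


def changed_bits(before, after):
    """Return the bit positions that differ between two register readings."""
    diff = before ^ after
    return [bit for bit in range(32) if diff & (1 << bit)]


if __name__ == "__main__":
    for raw in (0x10A00, 0x11E0A, 0x11F0A):
        print(hex(raw), decode_mdstat(raw))
    print(changed_bits(0x10A00, 0x11E0A))  # [1, 3, 10, 12]
    print(changed_bits(0x11E0A, 0x11F0A))  # [8]
```

Independent of the field names, the arithmetic shows that the stuck reading differs from the healthy one in the STATE field (0x0 vs. 0xa) and in bits 10 and 12, while the later 0x11f0a reading differs from 0x11e0a only in bit 8.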

If I link my application completely into MSRAM and avoid any use of a TCM, Linux happily starts-stops the R5f again and again, but the R5f doesn't really start. The application isn't actually running, I can't connect to the R5f core via CCS, and the MDSTAT24 register reads 0x11f0a. Linux (the remoteproc driver) tells me everything's alright, but that is certainly not the case.

I came across a module with MDSTATn[STATE] = 0xa more than a year ago on the AM65x, see this thread: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/904954/processor-sdk-am65x-enabling-a53-core-fails-with-sr2-0-device-sysfw. At that time Dave Bell explained that the "TRM is not up to date with some updates that were done to the PSC module and not all states are reflected", but that 0x1f0a is a valid state. Unfortunately he never explained what state 0xa actually meant, and the TRM was never updated (even though that update was promised numerous times...). The problem on the AM65x was apparently triggered because our bootloader was using some older PDK code that directly wrote to the PSC registers. Using the Sciclient instead solved this issue, but I never got an explanation of what went wrong in the PSC.

The last thing I noticed is that the R5f apparently only locks up when it's being halted AFTER the MCU+ SDK application executed the System_init() and Board_init() functions. If I stall the application BEFORE the call to System_init() I can start-stop the application again and again, and the R5f remains accessible. If I stall the application AFTER the call to Board_init(), the R5f locks up after start-stopping it a few times. Right now I'm suspecting that the problem only shows when the R5f modifies some clocks via the Sciclient, but since that code is auto-generated I haven't yet managed to stall the application somewhere inside System_init() and Board_init().

For my tests I've been using R5f subsystem #0 in Single-Core mode, but we've seen failures in normal Split-Core mode, too.

I'm running my tests on a mainline 5.14 kernel on custom hardware, but we've seen this issue with TI kernel 5.4 (SDK 07.03) on an EVM, too. We haven't seen this issue with TI kernel 5.10 (SDK 08.00) so far - this is one of the things I'm going to look into next, but since the LCPD-20006 issue is documented for SDK 08.00 I guess that this version should be affected, too.

I've been using SYSFW release 2021.01a before, but switching to release 2021.05 (the latest available in k3-image-gen) didn't solve this issue.

Are there any plans to fix LCPD-20006? When can we expect this to be fixed?
What do MDSTAT states 0x11e0a and 0x11f0a mean?

Regards,

Dominic

over 4 years ago

0 Dominic Rath over 4 years ago

Mastermind 7450 points

Ping?

It would be great to have some feedback on this issue, since this is a huge problem for development turn-around times. Right now I'm rebooting the target Linux every time I make a change to the R5f application.

0 Nick Saulnier over 4 years ago in reply to Dominic Rath

TI Guru 101460 points

Hello Dominic,

Yes, in AM64x Linux Processor SDK 8.0 the RemoteProc driver does not support a graceful shutdown of R5 cores. For now, it is recommended to reboot the board before loading new binaries into an R5F core (as per the note in https://software-dl.ti.com/processor-sdk-linux/esd/docs/08_00_00_21/linux/Foundational_Components_IPC64x.html).

We have requirements to add graceful shutdown of remote cores to the RemoteProc driver, but I do not see a timeframe for implementation. I just reached out to the developers. Please ping the thread if I have not come back with their response by the end of the week.

There is a new revision of the AM64x TRM coming out soon, not sure whether it will contain more details about the PSC registers. Checking with the team on the timeframe for the new TRM revision as well.

Regards,

Nick

0 Nick Saulnier over 4 years ago in reply to Nick Saulnier

TI Guru 101460 points

Update, the next revision of the AM64x TRM should be online by tomorrow.

-Nick

0 Dominic Rath over 4 years ago in reply to Nick Saulnier

Mastermind 7450 points

Hello Nick,

I was referring to the AM65x TRM. The issue there is that 1.5 years ago, someone from TI told me that the TRM wasn't updated with regard to the PSC registers, and today there's still no updated TRM.

But it's good to know that an AM64x TRM update is about to come, thanks for that information.

Regards,

Dominic

0 Dominic Rath over 4 years ago in reply to Nick Saulnier

Mastermind 7450 points

Hello Nick,

I'm not sure what "graceful shutdown" refers to. I understand that the R5f might have configured stuff on the SoC level that needs to be halted/reset, too, but right now my issue is that the core is stuck somewhere in power management. I'm looking forward to hearing back from you regarding the timeline for this feature.

Is "not supporting graceful shutdown" the same as LCPD-20006?

I understand that software can't support every use case from the start, but with the lack of documentation it's impossible to diagnose or fix these things ourselves, too.

Regards,

Dominic

0 Dominic Rath over 4 years ago in reply to Nick Saulnier

Mastermind 7450 points

Hello Nick,

Have you heard back from the developers regarding a timeframe for the R5f RemoteProc "graceful stop" feature?

Also, you mentioned that a new AM64x TRM was supposed to be available on Tuesday. I haven't been able to find that new version. Even worse, the AM64x product page doesn't list a TRM anymore at all, and the link that I believe should yield the "latest" version of the AM64x TRM doesn't work anymore either (Error 404):

https://www.ti.com/lit/pdf/spruim2

Is there some way for you to check if this is a bug somewhere on the TI site, or just delayed?

Regards,

Dominic

0 Dominic Rath over 3 years ago in reply to Dominic Rath

Mastermind 7450 points

TRM Rev. C has been online since Saturday. It contains the PSC0 registers, but unfortunately doesn't explain the meaning of the MDSTAT.STATE / MDCTL.NEXT bits.

Is there any hope you could get that information?

Regards,

Dominic

0 Nick Saulnier over 3 years ago in reply to Dominic Rath

TI Guru 101460 points

Hello Dominic,

Yes, "Graceful shutdown" means that everything in the remote core is properly de-initialized so that the remote core can be re-started later if needed. The PSC error you observe is in line with what I expect when the R5 is not shut down properly.

I do not have a timeframe for adding graceful shutdown yet. It looks like the feature will NOT be added in time for Processor Linux SDK 8.1. At the moment, we are still in the "planning" phase as the R5 RTOS/bare metal developers work with the Linux RemoteProc developers to coordinate the changes needed to add a graceful shutdown.

In the meantime, let me send you over to a hardware engineer to comment on the PSC register values. They can send the thread back to me once y'all are done on the register side.

Regards,

Nick

0 Colin Callaghan over 3 years ago in reply to Nick Saulnier

TI Expert 8275 points

Dominic, Nick,

I have escalated this inquiry to managers and they are looking into this. I expect to reply to this thread shortly.

Regards,

Colin
