Dear TI team,
we're seeing issues when starting and stopping an R5f application via the Linux command line interface.
Depending on the memory layout of the R5f application, the issue manifests as a complete freeze of the Linux system or as an R5f that is unresponsive / not actually running.
The default linker scripts from the IPC examples in MCU+ SDK 08.00 put the exception vectors in ATCM at address 0x0. In that case the A53 core freezes when re-starting the R5f application. The Linux kernel is stuck in memcpy_fromio, which was called from rproc_elf_load_segments, and the CCS IDE is only able to retrieve a few registers. I believe this is what LCPD-20006 "AM64x: remoteproc may be stuck in the start phase after a few times of stop/start" from the Processor SDK Linux 08.00 release notes refers to.
Unfortunately the TRM for the AM64x is still very incomplete, but I believe its Power Sleep Controller (PSC) is quite similar to the one in the AM65x. I'm therefore using the AM65x TRM as a reference for the PSC0 registers.
The actual problem is that the R5f isn't properly shut down before it is started again. I've monitored the PSC0 MDSTAT24 register (0x400860) across a sequence of start-stop-start-stop... cycles until it eventually fails after 2-5 cycles.
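For anyone who wants to reproduce this, the start-stop sequence can be scripted roughly as follows. This is only a sketch under several assumptions: the R5f is `remoteproc0` (the index differs between systems), the firmware name has already been set, the settle delay is arbitrary, and MDSTAT24 sits at PSC0 base 0x400000 + 0x800 + 4*24 (which matches the 0x400860 above). `read_reg32` goes through /dev/mem and needs root; Python does not guarantee a single 32-bit bus access, so a tool like devmem2 may be the safer way to read the register. The loop stops at the first bad stop so that it never attempts the start that freezes the A53.

```python
import mmap
import os
import struct
import time
from pathlib import Path

# ASSUMPTION: the R5f under test is remoteproc0; check the "name" attribute.
RPROC_STATE = Path("/sys/class/remoteproc/remoteproc0/state")
# ASSUMPTION: PSC0 base 0x400000, MDSTATn at 0x800 + 4*n (gives 0x400860).
MDSTAT24_ADDR = 0x400000 + 0x800 + 4 * 24
STOPPED_OK = 0x10A00  # value observed after a clean stop


def set_rproc_state(command):
    """Write 'start' or 'stop' to the remoteproc sysfs state file."""
    RPROC_STATE.write_text(command)


def read_reg32(phys_addr):
    """Read one little-endian 32-bit word from physical memory via /dev/mem."""
    page = mmap.PAGESIZE
    base = phys_addr & ~(page - 1)
    fd = os.open("/dev/mem", os.O_RDONLY | os.O_SYNC)
    try:
        with mmap.mmap(fd, page, mmap.MAP_SHARED, mmap.PROT_READ, offset=base) as mem:
            return struct.unpack_from("<I", mem, phys_addr - base)[0]
    finally:
        os.close(fd)


def cycle_until_stuck(set_state, read_mdstat, max_cycles=20, settle_s=1.0):
    """Start/stop the core repeatedly.

    Returns (cycle, mdstat) for the first stop that does not read back
    STOPPED_OK, or None if all cycles looked clean.
    """
    for cycle in range(1, max_cycles + 1):
        set_state("start")
        time.sleep(settle_s)
        set_state("stop")
        time.sleep(settle_s)
        value = read_mdstat()
        print(f"cycle {cycle}: MDSTAT24 = {value:#07x}")
        if value != STOPPED_OK:
            return cycle, value  # do not start again, that is what freezes
    return None
```

On the target this would be called as `cycle_until_stuck(set_rproc_state, lambda: read_reg32(MDSTAT24_ADDR))`.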
If everything is still well after stopping the R5f core via the remoteproc interface, the MDSTAT24 register reads 0x10a00. Starting the R5f the next time via the remoteproc interface works flawlessly.
After a few start-stop cycles, the MDSTAT24 register reads 0x11e0a - apparently the module got stuck somewhere on the way to being turned off. Starting the R5f the next time causes the A53 to freeze when it tries to write to the TCM. If I halt Linux before it actually loads the TCMs, the MDSTAT24 register reads 0x11f0a.
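To make the three values easier to compare, here is a small decoding sketch. The field names and bit positions are an assumption on my side, taken from the Keystone-style PSC descriptions (STATE in bits 5:0, LRST/LRSTDONE/MRST/MRSTDONE in bits 8-11, MCKOUT in bit 12, EMURST/EMUIHB in bits 16/17); they are not confirmed for the AM64x, and the meaning of STATE = 0xa remains unknown.

```python
# ASSUMPTION: Keystone-style MDSTAT layout, not confirmed for the AM64x PSC.
# Each entry maps a field name to (bit shift, width in bits).
MDSTAT_FIELDS = {
    "STATE": (0, 6),
    "LRST": (8, 1),
    "LRSTDONE": (9, 1),
    "MRST": (10, 1),
    "MRSTDONE": (11, 1),
    "MCKOUT": (12, 1),
    "EMURST": (16, 1),
    "EMUIHB": (17, 1),
}


def decode_mdstat(value):
    """Split a raw MDSTAT value into the assumed bit fields."""
    return {
        name: (value >> shift) & ((1 << width) - 1)
        for name, (shift, width) in MDSTAT_FIELDS.items()
    }


# The three values observed on MDSTAT24 in this thread.
for raw in (0x10A00, 0x11E0A, 0x11F0A):
    fields = decode_mdstat(raw)
    print(f"{raw:#07x}: " + ", ".join(f"{k}={v:#x}" for k, v in fields.items()))
```

Independent of the naming, 0x11e0a and 0x11f0a differ only in bit 8, and both differ from the clean 0x10a00 in bits 12 and 10 and in the low STATE bits.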
If I link my application completely into MSRAM and avoid any use of a TCM, Linux happily starts and stops the R5f again and again, but the R5f doesn't really start. The application isn't actually running, I can't connect to the R5f core via CCS, and the MDSTAT24 register reads 0x11f0a. Linux (the remoteproc driver) tells me everything's alright, but that is certainly not the case.
I came across a module with MDSTATn[STATE] = 0xa more than a year ago on the AM65x, see this thread: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/904954/processor-sdk-am65x-enabling-a53-core-fails-with-sr2-0-device-sysfw. At that time Dave Bell explained that the "TRM is not up to date with some updates that were done to the PSC module and not all states are reflected", but that 0x1f0a is a valid state. Unfortunately he never explained what state 0xa actually meant, and the TRM was never updated (even though that update was promised numerous times...). The problem on the AM65x was apparently triggered because our bootloader was using some older PDK code that directly wrote to the PSC registers. Using the Sciclient instead solved this issue, but I never got an explanation of what went wrong in the PSC.
The last thing I noticed is that the R5f apparently only locks up when it's being halted AFTER the MCU+ SDK application has executed the System_init() and Board_init() functions. If I stall the application BEFORE the call to System_init(), I can start and stop the application again and again, and the R5f remains accessible. If I stall the application AFTER the call to Board_init(), the R5f locks up after a few start-stop cycles. Right now I suspect that the problem only shows up when the R5f modifies some clocks via the Sciclient, but since that code is auto-generated I haven't yet managed to stall the application somewhere inside System_init() and Board_init().
For my tests I've been using R5f subsystem #0 in Single-Core mode, but we've seen failures in normal Split-Core mode, too.
I'm running my tests on a mainline 5.14 kernel on custom hardware, but we've seen this issue with TI kernel 5.4 (SDK 07.03) on an EVM, too. We haven't seen this issue with TI kernel 5.10 (SDK 08.00) so far - this is one of the things I'm going to look into next, but since the LCPD-20006 issue is documented for SDK 08.00 I guess that this version should be affected, too.
I've been using SYSFW release 2021.01a before, but switching to release 2021.05 (the latest available in k3-image-gen) didn't solve this issue.
- Are there any plans to fix LCPD-20006? When can we expect this to be fixed?
- What do MDSTAT states 0x11e0a and 0x11f0a mean?
Regards,
Dominic