AM5728: DSP Firmware Swapping

Gerard

Part Number: AM5728

I'm encountering problems starting the DSP after it has previously been started with a different firmware file. My initial clue that something was wrong was getting into an endless MessageQ_open() loop, trying to open the DSP slave messageq.

System Background:

Dynamic DSP Loading
- On our system we dynamically start and stop DSP1 and DSP2
- We utilize the /sys/bus/platform/drivers/omap-rproc/[bind,unbind] nodes to do this
- We always stop the DSP (unbind) before starting the DSP the next time (bind)
Multiple DSP Firmware Files
- When we start the DSPs we specify different firmware files to load by utilizing symlinks. For example, /lib/firmware/dra7-dsp1-fw.xe66 points to /tmp/dsp1_firmware which points to whatever DSP firmware we want to load at that instant.
- Thus, we regularly change the DSP firmware running on a given DSP
- The flow would be stop DSP (unbind), update symlink to new firmware file, start DSP (bind)
Versions
- Processor SDK 03.02.00.05

This works (my example focuses on DSP1 for brevity):

Fresh power up
Set DSP1 firmware symlink to firmware 'A'
Load rproc/rpmsg kernel modules, automatically loading DSP1 with firmware 'A'
Stop DSP1 (via unbind node)
Start DSP1 (via bind node)

This does not work:

Fresh power up
Set DSP1 firmware symlink to firmware 'A'
Load rproc/rpmsg kerneo modules, automatically loading DSP1 with firmware 'A'
Stop DSP1 (via unbind node)
Set DSP1 firmware symlink to firmware 'B'
Start DSP1 (via bind node) with firmware 'B'

If I look at the LAD log I see the following for the case that does not work:

[18.539938] NameServer_attach: connect failed: procId=4, errno=22 (Invalid argument)
[18.539961] closing send socket: 5
[18.539989] NameServer_attach: <-- refCount=0, status=-1

I've found that adding a huge delay of 15 seconds in between starting the DSP (via bind node) and doing the Ipc_start()/MessageQ calls the problem goes away. Unfortunately, the 15 second delay is unacceptable from a system performance perspective. I don't understand why this problem doesn't pop up if I stick to a single firmware file. The resource table is identical between firmware 'A' and 'B'.

Any suggestions?

Thanks

over 7 years ago

0 Biser Gatchev-XID over 7 years ago

TI__Guru**** 393215 points

The software team have been notified. They will respond here.

0 Gerard over 7 years ago in reply to Biser Gatchev-XID

Expert 1290 points

Any updates?

To add, when I switch firmware images, everything from the perspective of remoteproc kernel logging looks the same regardless of whether I'm going from firmware 'A' to firmware 'B' or firmware 'A' to firmware 'A' again. The issue is when I start to use IPC/MessageQ - it's as if internally the remoteproc core is in a bad state that fixes itself after ~15 seconds.

0 Jason Reeder over 7 years ago in reply to Gerard

TI__Genius 10415 points

Gerard,

This is strange behavior. I have reached out to the kernel developers and will let you know what they come back with.

Could you provide any deltas in your code between the 'A' and 'B' firmware leading up to the first use of IPC/MessageQ?

Jason Reeder

0 Jason Reeder over 7 years ago in reply to Jason Reeder

TI__Genius 10415 points

Gerard,

One of the developers thinks the root-cause is repeated attempts from userspace in re-establishing sockets used for querying NameServer requests, especially if there are open applications. Similar behavior has been seen before with error recovery.

Do you have any open applications/processes in userspace that are attempting to use the host side of the communication channel during the unbind and bind operations?

Jason Reeder

0 Gerard over 7 years ago in reply to Jason Reeder

Expert 1290 points

Regarding differences in firmware 'A' and 'B':

Firmware 'A' and 'B' software are essentially functionally equivalent for the period in time we're discussing in this thread - both go through some HW setup procedures and then idle, waiting for IPC/MessageQ communication. Both firmware images are exercising common code from that regard. It's not until later in time (once IPC/MessageQ is up and running) that the firmware images diverge and start doing their custom functionality.

Regarding any other processes that may be attempting to use the host side of the communication channel during the unbind/bind:

No, right now for this controlled test we are only running a single application that interacts with IPC/MessageQ. Longer-term we'll have 2 instances of this application - 1 for each DSP - but right now it's just the single instance.

Additional information:

Here's a little more detail added to the procedure:

Fresh power up
Shell script executes at boot-up
1. Set DSP1 firmware symlink to firmware 'A'
2. Load rproc/rpmsg kernel modules, automatically loading DSP1 with firmware 'A' (note: DSP1, IPU1, and IPU2 firmware doesn't exist so they time out)
3. sleep for 10 seconds
4. Power off DSP1 via unbind node (to save power - we may not need it for many minutes later)
5. Start IPC daemon
Custom application starts (only thing that interacts with remote processors/IPC/MessageQ/etc)
1. Set DSP1 firmware symlink to firmware 'B'
2. Power up DSP1 via bind node
3. Wait for /sys/bus/platform/drivers/omap-rproc/40800000.dsp to show up on file system (my attempt to know when DSP1 was ready to go)
4. Sleep for 1500 ms (I found I couldn't trust the node above to be an indication of when DSP1 was ready to go - changing this to 15000 ms "fixes" the firmware switch problem)
5. Start IPC/MessageQ (like example apps)
  1. Ipc_start()
  2. Create host queue (MessageQ_create())
  3. Open slave queue (MessageQ_open() - this is what ultimately fails - SW endlessly loops with MessageQ_open() attempts for the slave queue of DSP1)
  4. Send initial MessageQ message to DSP1
  5. ...

Hopefully that will shed a little more light.

Thanks

0 Gerard over 7 years ago in reply to Gerard

Expert 1290 points

The kernel logs provide a visual of the extra delay I see when I switch firmware images:

Case 1 - DSP1 loaded with firmware 'A', stopped, and reloaded with firmware 'A':

[ 222.564962] remoteproc2: stopped remote processor 40800000.dsp

[ 222.571247] remoteproc2: releasing 40800000.dsp

[ 223.593908] remoteproc2: 40800000.dsp is available

[ 223.598903] remoteproc2: Note: remoteproc is still under development and considered experimental.

[ 223.607932] remoteproc2: THE BINARY FORMAT IS NOT YET FINALIZED, and backward compatibility isn't yet guaranteed.

[ 223.653173] remoteproc2: powering up 40800000.dsp

[ 223.658121] remoteproc2: Booting fw image dra7-dsp1-fw.xe66, size 15366040

[ 223.672422] omap_hwmod: mmu0_dsp1: _wait_target_disable failed

[ 223.678325] omap-iommu 40d01000.mmu: 40d01000.mmu: version 3.0

[ 223.684267] omap-iommu 40d02000.mmu: 40d02000.mmu: version 3.0

[ 223.804556] remoteproc2: remote processor 40800000.dsp is now up

[ 223.811750] virtio_rpmsg_bus virtio0: rpmsg host is online

[ 223.817296] remoteproc2: registered virtio0 (type 7)

[ 223.869634] virtio_rpmsg_bus virtio0: creating channel rpmsg-proto addr 0x3d

Case 2 - DSP1 loaded with firmware 'A', stopped, and reloaded with firmware 'B':

[ 116.642082] remoteproc2: stopped remote processor 40800000.dsp

[ 116.648215] remoteproc2: releasing 40800000.dsp

[ 117.667211] remoteproc2: 40800000.dsp is available

[ 117.672227] remoteproc2: Note: remoteproc is still under development and considered experimental.

[ 117.681253] remoteproc2: THE BINARY FORMAT IS NOT YET FINALIZED, and backward compatibility isn't yet guaranteed.

[ 123.636295] remoteproc2: powering up 40800000.dsp

[ 123.641146] remoteproc2: Booting fw image dra7-dsp1-fw.xe66, size 17317496

[ 123.655324] omap_hwmod: mmu0_dsp1: _wait_target_disable failed

[ 123.661227] omap-iommu 40d01000.mmu: 40d01000.mmu: version 3.0

[ 123.667164] omap-iommu 40d02000.mmu: 40d02000.mmu: version 3.0

[ 123.782077] remoteproc2: remote processor 40800000.dsp is now up

[ 123.789302] virtio_rpmsg_bus virtio0: rpmsg host is online

[ 123.794868] remoteproc2: registered virtio0 (type 7)

[ 123.847911] virtio_rpmsg_bus virtio0: creating channel rpmsg-proto addr 0x3d

You can see that in the 2nd case, the time between the DSP being available and the remote processor framework starting the power up sequence is nearly 6 seconds.

Again, any input as to what the problem is would be greatly appreciated.

Thanks

0 Jason Reeder over 7 years ago in reply to Gerard

TI__Genius 10415 points

Gerard,

If you reload firmware 'A' multiple times you do not see this 6 second delay, right? What if you unbind and then bind firmware 'B' multiple times, do you see the same delay each time?

Are there any deltas between the resource tables of the two firmwares, or are they identical? What about code size differences between the two?

My assumption is that your original issues (NameServer_attach connection failure and MessageQ_open() loop) are symptoms of firmware 'B' taking more than 6 seconds to load and come online. Would you agree?

I'm asking the developers if they know what could cause a delay like this and I will forward them your responses when you post them.

Jason

0 Gerard over 7 years ago in reply to Jason Reeder

Expert 1290 points

Jason Reeder said:

If you reload firmware 'A' multiple times you do not see this 6 second delay, right? What if you unbind and then bind firmware 'B' multiple times, do you see the same delay each time?

Correct, I do not see the 6 second delay when reloading the same firmware over and over.

Jason Reeder said:

Are there any deltas between the resource tables of the two firmwares, or are they identical? What about code size differences between the two?

The resource tables are identical. The xe66 file sizes are 14.7 MB vs. 16.5 MB.

Jason Reeder said:

My assumption is that your original issues (NameServer_attach connection failure and MessageQ_open() loop) are symptoms of firmware 'B' taking more than 6 seconds to load and come online. Would you agree?

Yes, exactly!

I think I see what's happening now: the big delays seem to happen the first time a firmware file is loaded. In other words, after the initial transition from firmware 'A' to firmware 'B', the subsequent transition is fast - I would guess this is because of caching at some level.

So, this leaves me wondering if there is a programmatic way for application SW to know when it's safe to start IPC. I'd hate to have to keep in a 6 second delay in to account for what hopefully is an infrequent case. Is there a way that application SW can minimize the time between powering up/loading a given DSP and commencement of IPC/MessageQ communication?

Thanks

0 Jason Reeder over 7 years ago in reply to Gerard

TI__Genius 10415 points

Gerard,

The developer is in agreement that the reading of the file is where time is being spent here. Where is your filesystem being accessed from in your system? NFS? SD Card?

I am still awaiting a response on a programmatic method to ensure the firmware is loaded and running and will let you know when I get one.

Jason Reeder

0 Gerard over 7 years ago in reply to Jason Reeder

Expert 1290 points

Jason Reeder said:
Gerard,

The developer is in agreement that the reading of the file is where time is being spent here. Where is your filesystem being accessed from in your system? NFS? SD Card?

The DSP firmware is being accessed from an EXT4 file system on top of an eMMC.

Thank you for your support!

0 Jason Reeder over 7 years ago in reply to Gerard

TI__Genius 10415 points

Gerard,

One step to decrease the size of your firmwares being loaded (and hopefully the long load time) is to strip down the symbols and debug information, since the firmware class will copy the entire firmware. Can you try running the strip6x command with the -p option to reduce the built firmware image size?

On the programmatic-checking-for-firmware-load-complete side, the remoteproc core uses asynchronous firmware loading as the first step so you really cannot use the bound device as a means of saying the remoteproc has finished booting (as you have seen). There is a debugfs ‘state’ variable (at /sys/kernel/debug/remoteproc/remoteprocN/state) that gives the status of the remoteproc, but even this is marked up once the remote processor is released from reset (doesn’t accurately represent that the remoteproc boot itself is complete but should be closer than the beginning of the bind you currently use).

In userspace, any reattempts on MessageQ_open should use a delay, otherwise you are continuously triggering a whole bunch of NameServer requests (which I think is what you are already seeing).

The rpmsg devices themselves are published from the remote processor side during it’s boot and in turn gets bound to the rpmsg-proto driver. MessageQ stack with a particular processor is up only after the corresponding rpmsg-proto device is bound to the rpmsg-proto driver. One can indirectly use this bound device status, but otherwise, there is no easier way to accomplish this.

Jason Reeder

0 Gerard over 6 years ago in reply to Jason Reeder

Expert 1290 points

Jason Reeder said:

One step to decrease the size of your firmwares being loaded (and hopefully the long load time) is to strip down the symbols and debug information, since the firmware class will copy the entire firmware. Can you try running the strip6x command with the -p option to reduce the built firmware image size?

Great suggestion; we're going to experiment with this. Also, I'm going to transfer all of the firmware images into RAM (tmpfs) first so they are going from RAM to RAM instead of ext4 eMMC to RAM. I'll report back results.

Jason Reeder said:

In userspace, any reattempts on MessageQ_open should use a delay, otherwise you are continuously triggering a whole bunch of NameServer requests (which I think is what you are already seeing).

I do have delay in between requests - I essentially carried over the loop/delay logic from IPC/MessageQ demos. It seems that IPC/MessageQ never recovers if IPC/MessageQ functions are called before the DSP is fully powered on.

Thanks for the suggestions, as noted I will report back.

Gerard

Processors

Processors forum

AM5728: DSP Firmware Swapping