
Ipc_Start() and IPC configuration

Other Parts Discussed in Thread: SYSBIOS

We have been using MessageQ components for some time to share messages between the ARM and DSP cores on the Keystone II. Recently we discovered that under heavy usage we occasionally see both corrupted messages and queue dropouts (MessageQ_put is called from the DSP, but MessageQ_get on the ARM stops seeing messages). We are reviewing our configuration and initialization.

First, we have not been calling Ipc_start() from the DSP, but only from the ARM. According to the documentation, we would expect that MessageQ would not function this way, yet it has, to a large degree.

Second, including a call to Ipc_start() in the DSPs' main.cpp gives us the following linker error:

Undefined Symbol:

ti_ipc_transports_TransportRpmsgSetup_sharedMemReq__E

Reference in file:

C:\Users\mcobb04\Documents\SVN_Checkouts\working\sw\DSP\projects\Standard\Debug\configPkg\package\cfg\app_pe66.oe66

The following lines from our .cfg seemed most relevant:

var MessageQ  = xdc.useModule('ti.sdo.ipc.MessageQ');
var VirtioSetup = xdc.useModule('ti.ipc.transports.TransportRpmsgSetup');
var NameServer = xdc.useModule("ti.sdo.utils.NameServer");
var Ipc = xdc.useModule('ti.sdo.ipc.Ipc');
var SharedRegion = xdc.useModule('ti.sdo.ipc.SharedRegion');

Cache.setMarMeta(0xA0000000, 0x0FFFFFFF, 0);

xdc.loadPackage('ti.ipc.ipcmgr');
BIOS.addUserStartupFunction('&IpcMgr_ipcStartup');

var params = new HeapBuf.Params;
params.align = 8;
params.blockSize = 512;
params.numBlocks = 256;
var msgHeap = HeapBuf.create(params);

MessageQ.registerHeapMeta(msgHeap, 0);

xdc.loadPackage('ti.ipc.transports').profile = 'release';

MessageQ.SetupTransportProxy = VirtioSetup;

Does it make sense that we are getting some message functionality without Ipc_start, perhaps through IpcMgr_ipcStartup()? Is Rpmsg the appropriate transport for ARM->DSP?

We are using the TCI6638K2K EVM (I realize that TCI parts are supported through FAEs, but we are only using the broad market functionality of the chip -- it is what was available).

IPC version 3_22_00_05

  • Matt,

    When running on Keystone II with Linux on the host, the IPC startup configuration is different. You should have the following two lines of code in your DSP configuration.

    xdc.loadPackage('ti.ipc.ipcmgr');
    BIOS.addUserStartupFunction('&IpcMgr_ipcStartup');

    This replaces the need for calling Ipc_start(). The second line above will call IpcMgr_ipcStartup() which is responsible for initializing the IPC layer.
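
    For reference, a minimal DSP main() under this setup might look like the sketch below (task creation and other application setup omitted). There is no Ipc_start() call; IpcMgr_ipcStartup() runs automatically as a BIOS startup function:

    #include <xdc/std.h>
    #include <ti/sysbios/BIOS.h>

    Int main(Void)
    {
        /* no Ipc_start() here -- IpcMgr_ipcStartup() is invoked at BIOS
         * startup via BIOS.addUserStartupFunction in the .cfg */
        BIOS_start();    /* does not return */
        return (0);
    }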

    ~Ramsey

  • Thanks Ramsey,

    That explains why what we have been doing is at least somewhat functional -- we do have those lines. Is this well documented, though? My concern is that something else isn't set up correctly ... maybe our cache configuration isn't correct (this appeared to be necessary when we first got it running), or maybe we haven't set up some other module correctly, such as TransportRpmsg or TransportRpmsgSetup.

    I have not seen the information you just posted in any IPC or Keystone II documentation; I only managed to cobble it together from an example belonging to a different platform. So is IPC properly documented for Keystone II?

  • Matt,

    I just filed a request to add IPC documentation for Keystone II. I also suggested we add an example.

    I don't actually work on Keystone (I do SYS/BIOS only). I'll ask around for suggestions. But your fundamental problem is message loss and data corruption under heavy traffic between the ARM and DSP. Can you give us some idea of the message rate in both directions? Message size? How large is the message poll? What memory are you using?

    A typical source of problems is cache management. Can you disable the cache and see if the problem is still reproducible?

    ~Ramsey

  • Hi Ramsey,


    Thank you very much for looking into this.

    We started seeing the problem with a queue from the ARM to each of the DSPs, and two queues on the ARM that all DSPs can write to. The messaging to the DSPs is ~125 msgs/sec to each DSP (aggregate 1000/sec). Each DSP replies to each msg, but all replies go to the same queue on the ARM. In addition, there is a status msg that DSPs send to the ARM every 10 secs. The high-rate messages are only 8 bytes larger than the message header. The status messages are 260 bytes plus the header.

    We have tried eliminating the larger status message, but still get message drops.

    The cache configuration is very confusing for me. Our .cfg borrows this line from an example:

    Cache.setMarMeta(0xA0000000, 0x0FFFFFFF, 0);

    Without this in the .cfg, our example fails to even get past the Ipc_start() on the ARM. But there's no indication of that memory being used by the DSP. It is part of DDR3, and the DSP .map shows it as entirely unused. (The ARM code certainly uses the DDR3, but we don't have anything like the DSP's .map file for the ARM side that we know how to use.)

    You asked how large the message poll is -- I don't understand that question. In terms of time, we use MessageQ_FOREVER; in terms of memory, that's handled internally by IPC, as far as I'm aware.

    That leaves the question "What memory are you using?" I'm trying to figure that out myself. MessageQ_alloc on the DSP pulls from L2SRAM, I believe, but I'm trying to figure out how this is transferred between the ARM and DSP. Our transport is ti.ipc.transports.TransportRpmsgSetup, but again here I can't find documentation to tell me much more.

    As a test, we eliminated all references in the DSP code to MSMCSRAM (we prefer to use it). This did appear to make our message dropouts much less frequent, but did not eliminate them entirely. The DSP is not accessing DDR3 as far as we can tell.

    I've tried disabling cache as follows:

    Cache.setMarMeta(0x0c000000, 0x00600000, 0);
    Cache.setMarMeta(0x80000000, 0x80000000, 0);

    but this did not appear to help.

  • Matt,

    Re: 0xA0000000 address. This points to a block of external memory used by the vring transport. This block is partitioned such that each DSP has its own piece. This is the memory shared between the host and the DSP. The application does not use this memory, and it is probably not reported in the memory map.

    When the host sends a message to the dsp, it first acquires a message buffer. This buffer comes from a local message pool. When calling MessageQ_put(), the transport layer copies the message payload into a vring buffer. The message buffer is then returned to the local pool. An interrupt is raised to the dsp.

    On the dsp, the interrupt is taken which invokes the vring transport. The transport acquires a message buffer from its local pool (L2SRAM as indicated by you). The data is copied out of the vring buffer into the message buffer. The message is then placed on the receiving message queue and the semaphore is posted.

    The reverse path is very similar.
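
    From the application's point of view, this path is driven by the usual MessageQ calls. Here is a rough sketch of a DSP-side receive/reply task (queue name illustrative, error handling omitted), just to show where the steps above occur:

    #include <xdc/std.h>
    #include <ti/ipc/MessageQ.h>

    Void dspMsgTask(UArg arg0, UArg arg1)
    {
        MessageQ_Handle  localQ;
        MessageQ_QueueId replyQ;
        MessageQ_Msg     msg;

        localQ = MessageQ_create("DSP_QUEUE", NULL);    /* receiving queue */

        while (TRUE) {
            /* blocks until the transport delivers a message and posts the semaphore */
            MessageQ_get(localQ, &msg, MessageQ_FOREVER);

            /* ... process the payload ... */

            /* payload is copied into a vring buffer and the host is interrupted */
            replyQ = MessageQ_getReplyQueue(msg);
            MessageQ_put(replyQ, msg);
        }
    }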

    Re: message pool size. I'm sorry, I had a typo in my question. I was curious how many messages are available in your local message pool (not poll). Looking at your earlier post, I see that the dsp has 256 messages in the pool, each of size 512.
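
    For reference, that pool corresponds to the HeapBuf in your .cfg (blockSize 512, numBlocks 256, registered as heap id 0). A minimal sketch of pulling a message from it at run time (the payload struct is illustrative; the total size must fit within one 512-byte block):

    #include <xdc/std.h>
    #include <ti/ipc/MessageQ.h>

    typedef struct {
        MessageQ_MsgHeader hdr;      /* required first field */
        UInt32             data[2];  /* ~8-byte payload, as you described */
    } App_Msg;

    static App_Msg *allocAppMsg(Void)
    {
        /* heap id 0 matches MessageQ.registerHeapMeta(msgHeap, 0) in the .cfg */
        return ((App_Msg *)MessageQ_alloc(0, sizeof(App_Msg)));
    }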

    Your message rates don't seem extreme. I would expect IPC to handle this. How many DSPs are in your application?

    You indicate that you have two issues: 1) dropped messages, and 2) message data corruption.

    I think the dropped messages might be related to dropped interrupts. However, from what little I know, I expect that a subsequent interrupt would cause all pending messages to be delivered. When you experience a dropped message, does your message flow stop? Or does your message flow continue, but some messages are never delivered?

    How are you detecting message corruption? Do you see obvious bad data in the message payload or do you have some sort of checksum value? Does it look like the corruption is limited to just the message payload, or do you think the message header might also be corrupted? On which core do you see message corruption? All cores, just HOST? Does the message corruption happen only on the message queue written to by all DSPs, or does it happen on any random message queue?

    Thanks
    ~Ramsey

  • Hi Ramsey,

    Again, thank you for this valuable information.

    Regarding the 0xA000 0000 address, some notes I have indicate that using

    Cache.setMarMeta(0xA0000000, 0x01FFFFFF, 0);

    did not work (IPC failed immediately)

    but Cache.setMarMeta(0xA0000000, 0x0FFFFFFF, 0);

    did work (IPC startup message OK). The first case, with the smaller address range, is from an example for a different board. Is the 2nd case adequate? It's hard to understand why IPC would use a larger vring buffer than in the 1st case -- how is this size calculated?

    It does not appear that this memory is protected from use by the DSP. Do we need to add a reservation for that buffer on the DSP? On Linux? Presently we are not using it anywhere on the DSP, but we would like to add a guard, if appropriate. On Linux, I don't know whether we could be using it. We are writing low-level drivers that do allocate memory.

    Answering other questions, when we have a dropped message, message flow stops for that DSP. The queue that it is writing to continues to receive messages from other DSPs, and the DSP that is not able to write messages appears to come out of MessageQ_put() without error. (We will recheck some of these things.)

    We detect corruption in messages that carry text to the ARM for printing. Occasionally a message gets shortened or garbled with non-ASCII text. We have not checked the header values, but will begin doing that. The bad message could come from any core. Sometimes one core will produce a bad message and another will drop out later ... so the problem does not appear isolated to one core. The longer we run the code, the more cores drop out, but the rate of dropout seems to decrease ... 1 or 2 cores may drop out in the first minute or two. If we keep testing for 10 minutes, we stop seeing cores drop out, but we haven't run it much further than 10 minutes. No hard data here.

    I don't believe we have ever seen ARM->DSP messages fail or be corrupted. These are 8-byte + header messages. Again, we'll work towards confirming that.

    EDIT:

    Regarding hardware interrupts, we have these *.cfg lines:

    var Hwi = xdc.useModule('ti.sysbios.family.c64p.Hwi');

    Hwi.enableException = true;

    Again, this is just cobbled together from examples. The c64p package looks suspect, but the hal.Hwi package did not offer enableException. What is correct here?

  • Matt,

    Re: cache MAR attributes. We are talking about the vring buffers here. Those sizes look odd. I would modify and interpret as follows:

    Cache.setMarMeta(0xA0000000, 0x02000000, 0) --> 0xA0000000,  32 MB non-cached
    Cache.setMarMeta(0xA0000000, 0x10000000, 0) --> 0xA0000000, 256 MB non-cached

    How many DSPs are you using? Do you have the above configuration in each DSP config file? My guess is that the ARM accesses the entire 256 MB range, but each DSP accesses only its own partition (32 MB block). It would be more correct to configure each DSP's MAR registers to reflect its own 32 MB block. I'm guessing that using one single large configuration was simply to make it easier.

    You should not be using any part of this memory block. It is reserved for and used by the vring transport layer. This is the memory used for passing messages between the HOST and each respective DSP. This memory is not used for DSP to DSP messaging.

    Each 32 MB block is partitioned into two parts: 1) HOST to DSP, and 2) DSP to HOST. The design is such that each side writes to its own partition. This eliminates the need for processor level protection (i.e. a processor gate between HOST and DSP).

    On the DSP side, I would expect local protection to be provided by MessageQ. In other words, if you have two tasks on the local DSP using MessageQ, the shared data objects should be protected. Do you have multiple tasks on each DSP using MessageQ? Do you suspect corruption due to local concurrency? Or do you have just one task on each DSP?

    On the HOST side, I would expect the Linux driver to provide local protection. However, I'm not sure how much this has been tested. Do you have multiple processes (with multiple threads) talking to a single DSP? Or do you have a single process (with multiple threads) talking to multiple DSPs? I think it would help us understand the possible issues if you could provide a picture detailing the thread partitioning and the IPC topology for your application.

    Re: dropped messages. From your reply, I understand that once a message has been dropped, the message flow stops. However, I did not understand the rest of your reply. Are the DSPs sending messages between each other as well as to the HOST? If so, then you must be using a dual-transport setup. So, each DSP has at least two message queues: 1) for receiving messages from the HOST, and 2) one for receiving messages from all other DSPs. Have I got this correct?

    I don't understand your comment about the DSP that is unable to write messages. Do you mean that when a DSP fails to receive a message it is stuck in MessageQ_get(), yet the sending DSP continues to successfully call MessageQ_put() for the stuck DSP? If so, then the messages should be piling up in the recipient's message queue. You should be able to observe this in ROV.

    Re: corrupted messages. So, only messages from DSP to HOST suffer this problem? HOST to DSP and DSP to DSP messages are okay? I think the next task is to determine if the corruption happens on the sending side (DSP) or on the receiving side (HOST). You will need to instrument IPC on the DSP for this task.

    In the transport layer, compute the message checksum before sending the message to vring. After returning from vring, but before releasing the message, compute the checksum again. This should be identical because the message buffer should have only been read (i.e. no writes). Then repeat this test at the vring layer. If this checks out, then it must be corruption on the HOST side.
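
    A minimal sketch of such a checksum helper (assuming the full message size, header plus payload, is available via MessageQ_getMsgSize(); where exactly you call it inside the transport is up to you):

    #include <xdc/std.h>
    #include <ti/ipc/MessageQ.h>

    static UInt32 msgChecksum(MessageQ_Msg msg)
    {
        UInt32 size = MessageQ_getMsgSize(msg);   /* header + payload */
        UInt8 *buf  = (UInt8 *)msg;
        UInt32 sum  = 0;
        UInt32 i;

        for (i = 0; i < size; i++) {
            sum += buf[i];
        }
        return (sum);
    }

    /* usage sketch:
     *   before = msgChecksum(msg);
     *   ... hand the message to the vring layer ...
     *   after  = msgChecksum(msg);
     *   if (before != after) --> corruption happened on the sending side
     */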

    Re: Hwi. The ti.sysbios.hal.Hwi module is a generic front-end module. This provides simple to use APIs which should be the same on all processors. However, to leverage the more advanced features of a particular processor, you will need to access the processor specific Hwi module. We attempt to re-use the family specific modules when possible. The ti.sysbios.family.c64p.Hwi module is reused for the C66. So, you are doing it correctly above.

    ~Ramsey

  • Hi Ramsey,

    Thanks for helping narrow this down.

    We have changed our cache setup to this:

    Cache.setMarMeta(0xA0000000, 0x10000000, 0);

    so we just added 1 to the length. All DSPs use the same image, so we're having them all mark the whole region as non-cacheable.

    Let me describe our architecture more clearly.

    We have 8 DSPs. They do not message each other at all. They all use the same image.

    The ARM creates 10 message queues. 8 are used (1 per DSP) in some initial setup, and are not used again (they are left open; we intend to use them further in the future). Another ARM MessageQ, which we name "ARM_Reply", is used by all DSPs for very short messages every 10 msec per DSP. The 10th ARM MessageQ is named "ARM_Server" and is used for a longer message every 10 seconds per DSP.

    Each DSP creates 1 message queue. The ARM writes a short message to each DSP every 10 msec per DSP.

    We have narrowed down the failure that causes an ARM queue to stop operating. Every so often a message sent from the DSP to the ARM does not arrive. We have checked the header values from the sent message (saved to the DSP's local memory so that we can debug failures), and the header values look fine. Perhaps the best clues we have for you are that when a message to "ARM_Reply" does not arrive, a) other messages sent by the same DSP to "ARM_Server" also stop arriving, and b) messages to "ARM_Reply" and "ARM_Server" from other DSPs continue to arrive correctly.

    I believe we are testing all status return codes, and see no errors from any of the MessageQ functions.

    Regarding the corrupted messages, it sounds like you need us to get into the IPC library and add some debug code there, or perhaps break points. So far we have not done that, but it sounds like that's where we need to go from here.

    Regards,
    Matt

  • Hi Ramsey,

    [EDIT: Deleting some bogus debug info. This part is accurate:]

    The 10 msec messaging and the 10 sec messaging occur on separate threads. Both call MessageQ_alloc with a heap ID of 0. Should the separate threads use separate heaps?

    - Matt

  • Matt,

    Thanks for the description. This helps me to understand what is going on. I'm not sure I have much to add at this moment, but here are some additional thoughts.

    When the ARM_Reply queue stops receiving messages from a particular DSP, but continues to receive messages from the other DSPs, this tells me there is a problem at the vring layer. Recall that there is a separate vring transport for each HOST to DSP pair.

    On the failed DSP, I assume that it continues to send messages to the ARM (which are never delivered to ARM_Reply queue). If so, the DSP calls MessageQ_put() which goes down to the vring layer. It acquires a new vring buffer, copies the message payload into the vring buffer, and sends an interrupt to the HOST. It then frees the local message buffer. This will continue until the vring buffer pool is empty. At this point, the DSP task will block until an empty vring buffer is returned to the pool (at which point the DSP task wakes and finally delivers the message). However, since the HOST is never returning any empty vring buffers, the DSP task will never wake again.

    On the HOST, the interrupt received from the DSP should wake a thread which will allocate a new message buffer, copy the data from the vring buffer into the message buffer, return the empty vring buffer to the pool (which should kick the DSP), and finally deliver the message to the ARM_Reply queue (potentially waking a thread blocked on this queue). Since this is not happening, my guess is that the interrupt is never received.

    I think the next step is to figure out where exactly the DSP is blocked. Does it continue to send messages as I guessed above, or does it wait for a reply before sending the next message? If you see the DSP task blocked in RPMessage_send(), then I think the vring pool is empty and it's waiting for a free vring buffer.

    ~Ramsey

  • Our 10 sec messaging does not wait for a reply, but continues to send messages. These do not make it to the ARM. Since it's slow messaging, we've probably never waited long enough to run out of vring buffers.

    The 10 msec messaging is really a reply to a command from the ARM. So the ARM tells the DSP to take an action, the DSP takes the action (right now that action is just a sleep to simplify things) and the DSP replies. If the ARM does not get a reply, it assumes the DSP is busy, and does not immediately re-command it. So here we see the DSP waiting in a MessageQ_get().

    Eventually the ARM decides a message was lost and re-commands the DSP. This command is received, but the new reply from the DSP does not make it back to the ARM. This too (the ARM deciding a message was dropped) is a slow process that we have not run long enough to exhaust the vring buffers.

    Tomorrow I'll try digging into the IPC code to gather more debug information. I tried simply saving the entire vring area (256 MB) to disk and looking for messages I expected to see, but found nothing. Any suggestions are welcome.

    Thanks,
    Matt

  • Digging into the IPC code and finding where DSP messages are written, I see the following rough partitioning of the vring space.

    Messages from DSP core 0 to ARM begin around 0xAE881800

    Messages from DSP core 1 to ARM begin around 0xAE8C1600

    Messages from DSP core 2 to ARM begin around 0xAE901E00

    Messages from DSP core 3 to ARM begin around 0xAE941200

    and so on...

    These may be offset by a bit, but the spacing appears to be less than 256k. I was under the impression from your earlier post that these buffers would be 32 MB apart. Is there a simple way to check whether vring is configured correctly?

    In saving the entire vring region to disk and looking for stray messages, we also see a handful of messages outside this region, for instance at 0xACBF067E. We believe at least one of these corresponds to a stalled queue.

  • Hi,

    Thanks for the informative post. I have been wondering about the same "mystery" region in our DDR3. Where is this stuff documented??? 

    Anyway, it seems that we also have 256 MB (32 MB for each core) reserved for ARM-DSP communication (vring thingies) in the DDR3. However, we would get by with a much smaller reservation. Where is the size of those buffers defined? Is it somewhere in the ARM Linux configuration?

    regards,

    Marko

  • Hi Matt/Marko,

    I'm trying to find someone from the MCSDK team to chime in on this thread for information about the VRING buffers on Keystone.

    Unlike other remoteproc-using SoCs, Keystone has a user-space loader (mpmcl) that specifies the VRING parameters (start address and size).  I'm not able to locate the source code for this at the moment, so hopefully an MCSDK developer can offer some helpful info to further this debug effort.

    Regards,

    - Rob

     

  • I came across the following pages that seem to be somehow related to the topic:

    http://processors.wiki.ti.com/index.php/MCSDK_UG_Chapter_Developing_System_Mgmt#Multiple_Processor_Manager

    http://processors.wiki.ti.com/index.php/MCSDK_UG_Chapter_Developing_Transports#MPM_Transport

    The first one states the following:

    • The default allowed memory ranges for DSP segments are as follows:

      Segment      Start Address     Length
      L2 Local     0x00800000        1 MB
      L2 Global    0x[1-4]0800000    1 MB
      MSMC         0x0C000000        6 MB
      DDR3         0xA0000000        512 MB

    The DDR3 address seems to match the area in question. It seems that the DDR3 area can be used for loading SW with mpm. The above pages don't mention anything about the vring buffers that might reside in the same area. I understood from the mpm documentation that the DDR3 area could be left out completely (by removing it from the .json script) if not needed.

    I am still very puzzled about the DDR3 region starting from 0xA000 0000. What is it really used for? Loading DSP images from the ARM? ARM-DSP communication (vring)? Who (and where) defines the memory region split between the 8 DSP cores? Can the entire region be left out? Can it be resized?

    regards,

    Marko

  • Marko Moberg said:
    I am still very puzzled about the DDR3 region starting from 0xa000 0000. What is it really used for? Loading DSP images from ARM? ARM-DSP communication (vring)?

    All I can say at this point is that the 0xA0000000 region is used for the VRINGS.  In our Ipc tests for Keystone, all code/data is placed in L2 SRAM.  remoteproc loads the ELF sections directly into the L2 memory (the ARM has access to each DSP's L2).

    As for the VRINGS, there are some data structures and the buffers themselves, all coming from 0xA0000000 and onwards.  Each DSP gets its own separate VRING area.

    I can't really provide any good answers at this point, so I will press the MCSDK folks for a response on this thread.

    Regards,

    - Rob

     

  • Hi, Marko,

    I think vring is defined and used by the kernel. The DDR3 region at 0xA000 0000 is used between the ARM and DSP: when mpm loads a binary to the DSP, it copies the image to this area, which the DSP then picks it up from. It is defined in 2 places: one in the Linux dts file (dspmem in k2hk.dtsi) and the other in the JSON file used by mpm (mpm_config.json). This DDR3 area can be moved higher in memory.

    Rex

  • Hi Rex and others,

    Do you have any contacts among the MCSDK/Arago folks? They might be able to provide some answers here. I am still a bit puzzled with this one.

    1) To me, it seems like a major configuration flaw to use the same memory region for vring buffers and for mpm-loaded binaries. This is the default configuration in, e.g., the Yocto/Arago release of MCSDK 03.00.04.18. Or is the logic such that once mpm has done its job, the area is no longer touched by mpm and the vrings can live normally & happily ever after?

    2) What is the reason for reserving 256 MB for vring buffers (32 MB per core)? This seems like overkill. If someone could pinpoint the exact files where the sizes are defined, I would be very happy. Searching through the entire Yocto build environment is quite a tedious task.

    Regards,

    Marko

  • Hi, Marko,

    I am not a DSP person, but there should not be any memory conflict. I am not sure if the vring on the DSP starts at the address 0xA000 0000; I think it does, as Robert mentioned, and it is in DDR3B memory for the DSP. On the ARM core, 0xA000 0000 refers to DDR3A memory. So, MPM running on the ARM core uses 0xA000 0000 on DDR3A to download images to the DSP, but the DSP allocates vring buffers at 0xA000 0000 on DDR3B. Please see the memory map in the datasheet for the ARM and DSP views of the memory.

    Rex

     

  • Hi Rex,

    Ok, that might explain a few things. I must have overlooked the DDR3A/B views, since in our case all the DSP accesses to the region starting from 0x8000 0000 are taken to 0x08 0000 0000 through MPAX settings.

    Anyway, do you have any info on where the vring buffer sizes are defined and why they are so large?

    Marko

  • Hi, Marko,

    As was said, Linux on the ARM core also maps 0x8000 0000 to 0x08 0000 0000 if LPAE is enabled (which it is by default). If the DSP does not map 0x8000 0000 to 0x08 0000 0000, then 0x8000 0000 stays in DDR3B; but if it maps to 0x08 0000 0000, then that will be DDR3A memory. So it can still cause a conflict. The DSP may want to map its 2 GB space to the next 2 GB block, away from where Linux uses memory.

    I am not familiar with vring buffer size on DSP side. Robert may have info, but let me check around.

    Rex