Sys/Link MessageQ crashing

I have a TI81xx EVM board and am running into issues with MessageQ crashing on the DSP side.  I am getting an assert in ListMP_getHead around line 411.  It appears to me that the queue isn't locking out the other processor properly when accessing the ListMP object.  It happens in the SWI, apparently when the queue first tries to grab the message from the ListMP object.  I have verified that the next and prev pointers put into the queue by the ARM processor are correct, so somewhere between the ARM calling TransportShm_put and the SWI on the DSP side the next/prev pointers get corrupted.  Next gets set to INVALID and prev is 10 MB away.  I don't believe this is a memory overrun issue in my software, but I am not certain of that.

I should note that I do run for a few iterations before this happens.

I am sending 4 messages from the DSP to the ARM every 500 usec.  I also am sending messages from the ARM to the DSP at a similar rate.  They are not ping pong messages.  They are sent independently.

When debugging this issue I found that slowing the messaging rate down (by adding debug statements) avoids the assert, but my software then runs too slowly to function properly.

I am currently using Syslink 2.0.0.56 and IPC ...20 (I am at home now so I don't have the exact numbers).  They are the versions that shipped with the EVM.  I have seen there are newer versions.

 

Any advice?  Thanks

Dan

  • A related question:  how are you loading the DSP code?  I am having problems getting ProcMgr_load to work.  Whenever I attempt to have it load any DSP code that uses IPC functions, say just an IPC_start() in main, the ioctl call to load the DSP image hangs.

    Lee Holeva

     

  • You do have the syslink module loaded before you do ProcMgr_load?

  • I think so (apologies for hijacking the thread):

    lsmod
    Module                  Size  Used by
    bufferclass_ti          4742  0
    omaplfb                 7994  0
    pvrsrvkm              126601  2 bufferclass_ti,omaplfb
    TI81xx_hdmi            10392  0
    ti81xxfb               21223  2
    vpss                   38214  2 omaplfb,ti81xxfb
    syslink              1083435  1 vpss
    ipv6                  205968  12

    I've been using procId = 0 to talk to the DSP.  Is this correct?

    Lee Holeva

     

     

  • I took the ProcMgrApp sample and utilized it to load the DSP software from my ARM application. 

    Back to the original topic....

    I am using ipc_1_22_00_10_eng and syslink_02_00_00_56.  I tried moving to the latest versions of both, but I had a heck of a time getting the DSP to start properly, and there appeared to be problems with the IPC all around, so I am going to roll back to the tried and true setup and work from there.

  • If ProcMgrAppDrv.c is anything to go by it's 0 for the DSP, 2 for the media controller.

    Ralph

  • I have been attempting to use IPC 1_22_03_23 that came with CCS4.2.3 with Syslink 02_00_00_56 from the EZSDK.  Perhaps this is the source of my difficulties.  I am also using the ProcMgrApp code example.  I also built the syslink library from 02_00_00_56 source.  My strategy has been to build DSP code in CCS and ARM code from a Makefile on the Linux host.

    Lee Holeva

     

  • After a bit of experimentation, I have concluded that it is unlikely that IPC is the issue here.  I was able to build the rtos-side notifyApp in CCS using two versions of IPC:

    1.22.00.19

    and

    1.22.03.23

    using CCS4.2.3.  I have not been able to build the rtos samples in the SDK.  1.22.00.10 is not available for Windows.

    For both IPC versions, this is what I get running procmgrapp:

    ./procmgrapp.exe 0 notifyapp.out
    ProcMgrApp sample application
    Entered ProcMgrApp_startup
    ProcMgr_attach status: [0x97d2000]
    After attach: ProcMgr_getState
        state [0x1]
    ProcMgr_load status: [0x3046000]
    After load: Error in Ipc_control Ipc_CONTROLCMD_LOADCALLBACK: -1
    Leaving ProcMgrApp_startup
    Press enter to continue and perform shutdown ...

    Entered ProcMgrApp_shutdown
    Ipc_control Ipc_CONTROLCMD_STOPCALLBACK status: [0xffffffff]
    ProcMgr_stop status: [0x6a85000]
    After stop: ProcMgr_getState
        state [0x2]
    ProcMgr_unload status: [0x0]
    After unload: ProcMgr_getState
        state [0x2]
    ProcMgr_detach status: [0x6a85000]
    After detach: ProcMgr_getState
        state [0x0]
    ProcMgr_close status: [0x0]
    Leaving ProcMgrApp_shutdown

    Sometimes it hangs on load, but I always get the error message.  I cannot say why the C674 app that I am working on always results in procmgrapp hanging.  I think that we're all just spinning our wheels till the new version of the EZSDK comes out on May 11.

    Lee Holeva

     

  • Dan,

    The first thing I would suggest is that you make sure the buffers shared between ARM and DSP and used by ListMP are aligned to a minimum of 128 bytes, as this is the requirement for the DSP when the buffers are in external memory and cacheable in the L1 and L2 caches.  I've seen a case where someone was hitting the same assertion, and it happened to be a buffer that was not cache aligned and was corrupting the ListMP elem pointer.

    The second suggestion would be to try to catch the bad pointer with a hardware watch point.  If you know what the bad value is...you can try to setup a hardware watch point to catch it.

    When it's about to assert, do you know what the ListMP pointer looks like?  It would be good to determine whether the pointer looks corrupted, or whether it looks like a local pointer rather than an SRPtr (SharedRegion pointer).

    Another suggestion if you are able to rebuild the IPC packages, would be to add some code to catch the assert before it asserts.

    Judah
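One way to "catch the assert before it asserts" is to validate the list element's links before they are used and return an error code instead. The sketch below is only an illustration: `list_elem`, `check_elem`, and `INVALID_SRPTR` are hypothetical stand-ins written for this example, not the IPC definitions, and the invalid value is taken to be ~0 as reported later in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed invalid shared-region pointer value (~0), per this thread. */
#define INVALID_SRPTR ((uint32_t)~0u)

/* Hypothetical stand-in for a shared list element whose links are
 * shared-region pointers rather than local addresses. */
typedef struct {
    uint32_t next;
    uint32_t prev;
} list_elem;

typedef enum { ELEM_OK = 0, ELEM_BAD_NEXT, ELEM_BAD_PREV } elem_status;

/* Report which link is invalid so the caller can log it or halt at a
 * known place with the stack intact, rather than hitting the assert. */
elem_status check_elem(const list_elem *e)
{
    assert(e != NULL);
    if (e->next == INVALID_SRPTR) {
        return ELEM_BAD_NEXT;
    }
    if (e->prev == INVALID_SRPTR) {
        return ELEM_BAD_PREV;
    }
    return ELEM_OK;
}
```

A debug build could call such a check right after the element is read from shared memory and set a breakpoint on the failure path.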

  • Thanks Judah,

    I will take a look at my memory alignment, I believe I remember seeing that I wasn't aligning the memory but I am not sure.  I have been rebuilding the IPC and SYSLINK source to get debug statements and to catch the assert.  

    Next definitely is getting set to the Invalid SR Pointer (~0); I am not sure if that is just coincidence though.  I haven't done anything with hardware breakpoints at this point, but I will give the alignment a shot and then look into them.

    Some new tests have found the following as well... if I never call MessageQ_free on the DSP side it runs until it blows the heap up.  This runs much longer than I was getting before.  I also notice that it seems to die on the same SharedMemoryRegion pointer.  My assumptions are that there is something going on between freeing the message and it getting allocated again.  I have also separated the GPP and DSP message Qs into different Heaps within the shared region.  This verified that I wasn't crashing each side into each other on allocate.

     

    Dan

     

  • Alright it appears that the memory allocation for this is happening in the MessageQ and it is not using memory alignment so this would have to be an issue for everyone I would think. 

    I am using 512-byte messages, including the header, so I would think I would be OK not forcing memory alignment.  I would assume it would happen right away if it were a caching problem too.  I am by no means a caching expert so I can't rule it out.

    I will see about modifying the MessageQ if you think that sounds like the direction to go.

    Dan

  • Dan,

    MessageQ itself does not do alignment, but the heap from which MessageQ allocates should be aligned.  Typically the user creates and registers the heap with MessageQ.  If you're creating that heap, then you should definitely make sure its alignment parameter is at minimum 128 bytes.  Now, if you're using one of the multicore heaps from Ipc, we will align the heap at the very minimum to the alignment parameter of the SharedRegion from which it is created.

    Judah
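To make the alignment requirement concrete, here is a minimal sketch of rounding a block size up to the 128-byte minimum mentioned above, so that no two messages share a cache line. `align_up` and `CACHE_LINE` are names made up for this example; they are not part of IPC.

```c
#include <assert.h>
#include <stddef.h>

/* Minimum alignment cited in this thread for cacheable external memory. */
#define CACHE_LINE 128u

/* Round n up to the next multiple of align (align must be a power of two). */
size_t align_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}
```

With these definitions, `align_up(500, CACHE_LINE)` gives 512, and a 512-byte block is left unchanged.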

  • I see what you are talking about now.  I am using a Multicore Heap and I was setting it to 0, now 512.  Same thing unfortunately.

    I have been running some more tests and think I found a clue.  When running this, the head->next pointer that locks this up is always the same shared region pointer value (99% sure it's always the same).  I am printing out the values of the shared region pointers that MessageQ is putting into the transport, and that value is never printed.  It appears that either that message is being lost between my ARM code and the user-level MessageQ, or somehow the DSP side is getting a MessageQ item put in that doesn't exist.  I am printing in the user-level MessageQ just before it calls the ioctl to the MessageQ driver.

    I am also maintaining an array of Messages sent from the ARM core so I can view their memory in code composer.  I see a few messages in a row that have the same next pointer in them.  Not sure how valid that is but interesting.

  • Now sometimes I am getting another assert on the DSP side.  This one happens in MessageQ_put, which is being called by TransportShm_swi.  Looking at all the fields on the GPP side, the MessageQ_Msg data all looks correct; on the DSP side the MsgId and the DstId are invalid.  So it looks like the message isn't being recovered properly on the DSP side.

    The Next and Prev SR pointers for this message do not match what is pointed to on the GPP side.

    If I look at the memory where the DSP Side of the message should be looking at, it looks like a good ListMP list item.

    Dan

  • Dan,

    This looks like a cache coherence related problem. Can you post your SharedRegion configuration from your DSP-side cfg file? And the cache configuration of the DSP as well?

    Regards,
    Mugdha

  • Mugdha,

    Here is the info you requested.  This was taken mostly from sample apps.  Where would I find the cache settings on the DSP?  I am not manually setting a caching configuration for anything. 

    Dan

    var SHAREDREG_0_MEM     = 0x8E000000;
    var SHAREDREG_0_MEMSIZE = 0x01000000;
    var SHAREDREG_0_ENTRYID = 0;
    var SHAREDREG_0_OWNERPROCID = MultiProc.getIdMeta ("HOST");

    SharedRegion.setEntryMeta(SHAREDREG_0_ENTRYID,
        { base: SHAREDREG_0_MEM,
          len: SHAREDREG_0_MEMSIZE,
          ownerProcId: SHAREDREG_0_OWNERPROCID,
          isValid: true,
          name: "internal_shared_mem",
        });

  • Dan,

    If you are not changing the cache configuration, it should be ok.

    The SharedRegion configuration looks fine. Have you changed the DSP memory map (for DSP code/data or SharedRegion 0) from the default one in the SDK? If there's some conflict/mismatch, this kind of issue could occur.

    Regards,
    Mugdha

  • The memory map is the same. I do remember that one of the example applications used the wrong memory mapping for the shared region, and that caused issues right off the bat for me.  Once I updated the sample app to use 0x8E000000 and 0x01000000 it worked fine.

    Dan

  • Dan,

    Also, if you can post your exact MessageQ API flow in the application, that would help find out if there is any issue in API usage. One thing you need to be aware of, is that once you send a message using MessageQ_put to the remote core, the sending core no longer 'owns' that message, and has transferred ownership to the remote core. So the sending core must not do anything with the message after that (including MessageQ_free, or even peeking into the message contents for debug purposes). For example, if you peek into the message contents on DSP-side, and that pulled the contents into the DSP cache, it could result in issues.

    Since you mentioned that you were peeking into the contents from A8 side, I wanted to check if you are doing the same on DSP-side too. Since, on A8, the data is currently not in cacheable memory, peeking into the message for debug (while not advised), would not cause any issues. But it might cause issues if done from the DSP-side due to a known issue in ListMP that causes a problem if the message was somehow pulled into DSP cache after it was sent using MessageQ_put.

    Regards,
    Mugdha
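The ownership rule described above (after a successful put, the sender must not touch the message again) can be enforced in application code by clearing the local pointer at the moment of the handoff. This is a sketch only: `send_and_release`, `put_fn`, and `fake_put` are hypothetical names for illustration, and `fake_put` merely stands in for the real send call so the example builds without the IPC headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical send hook; returns 0 on success, like a status code. */
typedef int (*put_fn)(void *msg);

/* Stand-in transport for this sketch: always reports success. */
int fake_put(void *msg)
{
    (void)msg;
    return 0;
}

/* Send the message held in *slot. On success the receiver owns it, so the
 * local pointer is cleared and cannot be resent, freed, or inspected by
 * accident. Returns -1 if the slot holds nothing to send. */
int send_and_release(void **slot, put_fn put)
{
    assert(slot != NULL && put != NULL);
    if (*slot == NULL) {
        return -1; /* already handed off, or never allocated */
    }
    int rc = put(*slot);
    if (rc == 0) {
        *slot = NULL; /* ownership has moved to the remote core */
    }
    return rc;
}
```

A second call on the same slot then fails with -1 instead of silently resending a message that the other core may already have freed.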

  • Mugdha,

    I am not looking at anything in the debugger until after my application has crashed.  I added some debug code to ListMP.c.  The code I added is just a check to see if localNext==NULL then do an infinite loop.  I also added similar code to MessageQ.c on the MessageQ_put to catch when the MsgId was invalid.  I had to do this since the assert just exits and you get no stack history to see what happened.

    Here is the conceptual flow from ARM to DSP, this is the only direction that has issues.

    On startup the ARM will allocate 300 messages and store them in an array. Each time I need to send a message to the DSP I grab one and then increment the index.  Fill out the fields then send it to the DSP, the DSP utilizes it and then frees the message. 

    After the ARM sends the message it will reallocate a message for currentIndex - 100.  It basically sends a message and then 100 messages later it will reallocate that index.  I have been doing this because it has made things much more reliable. (The issue I thought I was solving with this was, that it was always reallocating the same SR memory addresses on the heap and that was conflicting because of the frequency of the messages. Maybe they were allocated on top of each other). 

    The larger my message queue (currently 300) the longer it will run. It will currently go about 700-900 messages before it crashes.  I had tried reallocating the currentIndex message right after use and also allocating a message, populating it, then sending it.  But both scenarios seemed to have issues where it will be very unreliable and therefore a pain to debug.
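For readers following the scheme, the index arithmetic for "send at the current index, reallocate the slot 100 sends behind" can be sketched as below. It assumes the 300-entry array is used as a ring that wraps around, which is my reading of the description rather than something stated outright; `next_index` and `lagged_index` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>

/* Advance to the next slot, wrapping at the end of the pool. */
size_t next_index(size_t current, size_t pool_size)
{
    assert(pool_size > 0 && current < pool_size);
    return (current + 1) % pool_size;
}

/* Slot that was used `lag` sends before `current`, with wraparound.
 * Adding pool_size first keeps the unsigned subtraction from underflowing. */
size_t lagged_index(size_t current, size_t pool_size, size_t lag)
{
    assert(pool_size > 0 && current < pool_size && lag < pool_size);
    return (current + pool_size - lag) % pool_size;
}
```

For a pool of 300 and a lag of 100, sending from slot 150 refills slot 50, and sending from slot 0 refills slot 200.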

    Something I have also recently noticed: say the ARM sends message 912 with all the fields populated, and it uses SR memory address 0x10.  SR memory address 0x10 was previously used for message 711.  It crashes, and when I look at the DSP side of things after it has died, the DSP says it is trying to get SR memory address 0x10 off the queue but its information is pointing to message 711 with some invalid fields.  This definitely seems like a cache coherency issue.

    Dan

  • Dan,

    What is your message size? Just a reminder that the message size returned by API MessageQ_getMsgSize (and the size used when calling MessageQ_alloc) includes the message header. So the data size that you get for your message would be (size - sizeof (MessageQ_MsgHeader)).
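As a small illustration of that size arithmetic, the helper below subtracts a header size from the total allocation size. `payload_size` is a made-up name, and the header size is passed in as a parameter on purpose: real code would pass sizeof (MessageQ_MsgHeader) rather than a hard-coded number.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes available for application data in a message of total_size bytes
 * whose first header_size bytes are the message header. */
size_t payload_size(size_t total_size, size_t header_size)
{
    assert(total_size >= header_size);
    return total_size - header_size;
}
```

For example, with a total size of 512 and an assumed 32-byte header (an illustrative figure, not one confirmed in this thread), the payload would be 480 bytes.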

    Dan Chizek said:

    Something I have also recently noticed: say the ARM sends message 912 with all the fields populated, and it uses SR memory address 0x10.  SR memory address 0x10 was previously used for message 711.  It crashes, and when I look at the DSP side of things after it has died, the DSP says it is trying to get SR memory address 0x10 off the queue but its information is pointing to message 711 with some invalid fields.  This definitely seems like a cache coherency issue.

    This may also occur if somehow you have a 'freed' message also remaining in use. For example, if DSP called MessageQ_free on a message, but somehow ARM had not cleared the message pointer in its array and later on happened to send the same message again using MessageQ_put. Can you recheck your code to make sure this is not happening in any scenario?

    Judah,

    Is there any known issue in IPC 1.22.00.10 related to Heap alloc / free & cache coherence?

    Regards,
    Mugdha

  • Mugdha,

    My messages are a total of 512 bytes (478 for my stuff, the rest for the header).  I am certain that the message is getting reallocated and not just resent.  The memory address is used a few times; it just so happens that after the third or fourth use it isn't updating.  (By the way memory address 0x10 was just an example, 254416 I believe is a real address that I have seen this, it does change though).

    Dan

  • Dan,

    Dan Chizek said:

    (By the way memory address 0x10 was just an example, 254416 I believe is a real address that I have seen this, it does change though).

     

    I guessed that, and hence didn't complain about the fact that 0x10 is not a cache-aligned size  (to 0x80) :-)

    One quick way to see if this is a cache coherence related issue is to disable the DSP cache and check if you still see the problem. Can you try that?

    You can do this by updating your DSP CFG file to configure the Cache module accordingly (it's enabled by default). I've not tried this, but please check if this works.

        var Cache = xdc.useModule('ti.sysbios.family.c64p.Cache');
        Cache.initSize.l1pSize = Cache.L1Size_32K;
        Cache.initSize.l1dSize = Cache.L1Size_0K;
        Cache.initSize.l2Size = Cache.L2Size_0K;
        Cache.MAR128_159 = 0x0;
        Cache.MAR160_191 = 0x0;
        Cache.MAR192_223 = 0x0;
        Cache.MAR224_255 = 0x0;

    Regards,
    Mugdha

  • Just noticed, 254416 is also not aligned to 0x80, which seems to be a problem. All messages must be aligned to 0x80. If your message size is 512, you should never get a non-aligned message from MessageQ_alloc. This is something you can look into further. All your message addresses must end with 0x80 or 0x00.

    Regards,
    Mugdha
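The alignment check described above is easy to automate on every address returned by the allocator. A minimal sketch, with `is_aligned` as a hypothetical helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if addr is a multiple of align (align must be a power of two). */
bool is_aligned(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}
```

With an alignment of 0x80, `is_aligned(254416, 0x80)` is false because 254416 leaves a remainder of 80 when divided by 128; the example address 0x10 fails the check as well.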

  • I apologize again, I meant to put 524416. Which is aligned on 128.  I have tried turning off caching on the DSP.  It unfortunately breaks right away when I start my application.  Apparently the IPC is unable to communicate with the GPP at that point.  It starts throwing a bunch of errors about being unable to communicate.

     

     

  • Dan Chizek said:

    I apologize again, I meant to put 524416. Which is aligned on 128.  I have tried turning off caching on the DSP.  It unfortunately breaks right away when I start my application.  Apparently the IPC is unable to communicate with the GPP at that point.  It starts throwing a bunch of errors about being unable to communicate.

    Dan,

    524416 is better ... Exactly what problems did you get when you turned off caching on the DSP? What kinds of errors? Ideally, there shouldn't be any problems with caching disabled ... it should still continue to work. But one thing you can additionally do is to specify 'cacheEnable:false' for the SharedRegion configuration and check again.

    Regards,
    Mugdha