MessageQ_get function is crashing

hi,

My application crashes in the MessageQ_get call. This happens only after it has been running for a considerable time: the issue does not occur on the first call to MessageQ_get, and the MessageQ_get function returns properly many times before the crash.

Setup: TI 6678 EVM, mcsdk_2_00_02_14, bios_6_32_04_49, ipc_1_23_01_26 (I also tried the latest version of MCSDK, which shows the same behaviour).

Any help in debugging this issue is appreciated.

Thanks and regards,

Lijo

 

  • Lijo,

    Find out from observation whether the problem is time-related (always happens at the same time after start or always after a certain amount of time) or external event-related or internal event-related.

    Use CCS to discover the state of your application when the crash occurs.

    Insert debug code to catch the situation before the crash occurs. This requires some method to detect the crash is about to occur, and then set a breakpoint when that condition has been detected.

    Just to guess at something to look for, I would suggest looking for a memory leak. Some stack or heap memory may not be recovered correctly due to a missing piece of code.

    Regards,
    RandyP
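
One way to "catch the situation before the crash" when a leak is suspected is to count buffers that have been allocated but not yet freed. The sketch below is only an illustration: `dbg_alloc`, `dbg_free`, and `DBG_MAX_OUTSTANDING` are hypothetical names, and `malloc`/`free` stand in for whatever allocator the application really uses (for example the message allocation and free calls). The counters are per core and are not shared between cores.

```c
#include <stdlib.h>

/* Hypothetical limit: more outstanding buffers than this suggests a leak. */
#define DBG_MAX_OUTSTANDING 4

static long g_outstanding = 0;    /* buffers allocated minus buffers freed */
static long g_high_water = 0;     /* largest value g_outstanding has reached */
static int  g_leak_suspected = 0; /* set once the limit is exceeded */

/* Wrapper around the allocator; malloc is only a stand-in here. */
static void *dbg_alloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL) {
        g_outstanding++;
        if (g_outstanding > g_high_water) {
            g_high_water = g_outstanding;
        }
        if (g_outstanding > DBG_MAX_OUTSTANDING) {
            g_leak_suspected = 1; /* a convenient place for a breakpoint */
        }
    }
    return p;
}

/* Wrapper around the matching free call; free is only a stand-in here. */
static void dbg_free(void *p)
{
    if (p != NULL) {
        g_outstanding--;
        free(p);
    }
}
```

A breakpoint on the line that sets `g_leak_suspected` halts the core while the call stack still shows which path allocated the extra buffer.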

     

    If you need more help, please reply back. If this answers the question, please click  Verify Answer  , below.

  • RandyP,

    Thank you for your suggestion. The issue is not consistent with time.

    What we can see here is that the crash happens in the previous call to MessageQ_put. It looks like an assertion failure. Below are my questions.

    How can I debug an assertion failure in one of the IPC modules (here, inside MessageQ)?

    Is it possible to use the ROV tool to isolate the issue?

    Can we modify the MessageQ module to add prints to debug this?

    Regards,

    Lijo

     

  • hi,

    Related to the above issue, we are seeing a problem with the address returned by the MessageQ_get call: the call returned 0x2c08. We are struggling to find what causes this. Please help in this regard.

    regards,

    Lijo

     

  • Lijo,

    Based on the following definition of the MessageQ_get function:

    Int MessageQ_get(MessageQ_Handle handle, MessageQ_Msg *msg, UInt timeout)

    I assume that the "address" that you are referring to is the value returned in the "msg" variable. Is this correct?

    Also, could you please let me know what value is being returned by the function? Based on the function definition in C:\Program Files\Texas Instruments\ipc_1_23_01_26\packages\ti\sdo\ipc\MessageQ.c, it should return one of the values below. 

    #define MessageQ_S_SUCCESS               0

    #define MessageQ_E_FAIL                 -1

    #define MessageQ_E_TIMEOUT              -6

    #define MessageQ_E_UNBLOCKED            -19

    If it returns any value except MessageQ_S_SUCCESS, then the return error code may help point you in the right direction.

    You may also want to check the value of the MessageQ_Handle that you are passing into the function for the "pass" case and the "fail" case to make sure that the value of your handle is not getting corrupted.

    Lastly, in an earlier post you said that your program was crashing. Does this mean that the failure was due to the incorrect address being returned? Are you getting any interrupts or exceptions that cause your code to jump to an unexpected location?

    Regards,

    Derek
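
The checks described above (return status, handle value, returned `msg` pointer) can be logged in one place. The following is a minimal sketch, not part of the IPC package: `msgq_status_name` and `msgq_log_get` are hypothetical helpers, the status values are the four listed above and are defined locally only so the sketch builds without the IPC headers, and `printf` stands in for whatever print routine the target uses.

```c
#include <stdio.h>

/* Status values as listed above; defined here only when the IPC
 * headers have not already provided them. */
#ifndef MessageQ_S_SUCCESS
#define MessageQ_S_SUCCESS      0
#define MessageQ_E_FAIL        -1
#define MessageQ_E_TIMEOUT     -6
#define MessageQ_E_UNBLOCKED   -19
#endif

/* Hypothetical helper: maps a status code to its name. */
static const char *msgq_status_name(int status)
{
    switch (status) {
    case MessageQ_S_SUCCESS:   return "MessageQ_S_SUCCESS";
    case MessageQ_E_FAIL:      return "MessageQ_E_FAIL";
    case MessageQ_E_TIMEOUT:   return "MessageQ_E_TIMEOUT";
    case MessageQ_E_UNBLOCKED: return "MessageQ_E_UNBLOCKED";
    default:                   return "unknown status";
    }
}

/* Hypothetical helper: logs the handle, status and returned message
 * pointer after each call, and returns 1 only on success. */
static int msgq_log_get(const void *handle, int status, const void *msg)
{
    printf("MessageQ_get: handle=%p status=%d (%s) msg=%p\n",
           (void *)handle, status, msgq_status_name(status), (void *)msg);
    return status == MessageQ_S_SUCCESS;
}
```

Comparing the logged handle value between a passing call and the failing call shows directly whether the handle itself has changed.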

  • Derek,

    Thank you for the reply.

    The address that I referred to is the value returned in the 'msg' variable.

    MessageQ_get returns success.

    The MessageQ_Handle passed to the function remains the same.

    The crash is as follows:

    * While the application is running on 8 cores, I see the crash that I mentioned in the first post. A call to MessageQ_get is made, and the application crashes somewhere inside that call. This crash can happen fairly quickly (after about 1 minute of execution). It does not always happen at the same time, and it can occur on any core.

    * While the application is running on 2 cores, I see the crash mentioned in my last reply: the MessageQ_get call returns a wrong address. This always occurs on core 0 only, and only after running for a long time (about 10 minutes).

    I suspect that the memory for the HeapBufMP or List used in MessageQ is getting corrupted, but I am not sure. Is there any way to debug MessageQ or to find the memory leak?

    Thanks and regards,

    Lijo

     

  • Lijo,

    Can you please provide more information about what you are seeing when your program crashes? Does this mean that the function is simply returning the wrong address? Are you getting any interrupts or exceptions that cause your code to jump to an unexpected location?

    To debug MessageQ, if you are not doing it already, you may find it helpful to include the MessageQ.c source file in your project, so that the source code is included in the project build. Then you can step into the function to figure out what could be happening.

    Identifying memory leaks or corruption and tracking down where they are happening can be very difficult. One debugging option is to include the source file and step through the code to try to determine what is happening.

    Regards,

    Derek
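
Since one symptom reported earlier in this thread is a returned message address of 0x2c08, a cheap check is to verify that each returned pointer lies inside the memory that holds the message heap before it is used or freed. The sketch below assumes a single contiguous region; `MSG_REGION_BASE` and `MSG_REGION_SIZE` are placeholder values that must be replaced with the base and length actually configured for the shared region, and `check_msg_ptr` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder bounds of the region that holds the message heap.
 * These values are assumptions; use the ones from your configuration. */
#define MSG_REGION_BASE  ((uintptr_t)0x0C000000u)
#define MSG_REGION_SIZE  ((uintptr_t)0x00100000u)

/* Number of out-of-range pointers seen; inspect from the debugger. */
static volatile unsigned g_bad_ptr_count = 0;

/* Returns 1 if the len bytes starting at p lie inside the region. */
static int ptr_in_msg_region(const void *p, size_t len)
{
    uintptr_t a = (uintptr_t)p;

    if (a < MSG_REGION_BASE || len > MSG_REGION_SIZE) {
        return 0;
    }
    return (a - MSG_REGION_BASE) <= (MSG_REGION_SIZE - len);
}

/* Call right after the message is received and before it is used. */
static int check_msg_ptr(const void *p, size_t len)
{
    if (!ptr_in_msg_region(p, len)) {
        g_bad_ptr_count++; /* a convenient place for a breakpoint */
        return 0;
    }
    return 1;
}
```

Halting at the first bad pointer, rather than at the later crash in the free call, keeps the queue and heap state closer to the moment the corruption became visible.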

  • Derek,

    As I mentioned in the previous post, if I run the application on 8 cores, a call to MessageQ_get never returns; it ends up in abort().

    If I run the same application on only 2 cores (the application RTSC cfg file is configured for 2 cores), then one of the calls to MessageQ_get returns a wrong address, and MessageQ_free then crashes.

    I am not sure whether any interrupts or exceptions are causing the code to jump to an unexpected location.

    I will try the option that you have suggested.

    thanks and regards,

    Lijo


  • Lijo,

    For at least debug purposes, make sure your applications all run from different program locations (not identically shared memory locations) and the same for data, except where you are required to physically share a memory buffer. Since you are using MessageQ, you probably do not require physically identical memory.

    The examples in the TI Wiki Pages topic Using DSP/BIOS on Multi-Core DSP Devices are for the C6472 and DSP/BIOS and CCSv4, but the concepts show what I am talking about in terms of keeping things separated. There may be features of MCSDK that have not been incorporated into this topic.

    Your description of failures implies to me that the multiple cores are changing the other cores' memory. This is why I think it could help to make everything separate, in case you do not do that right now.

    Regards,
    RandyP

  • Lijo,

    How is your program ending up at abort()? Is it terminating as expected through failed function calls, or are you not sure how it got to abort()? If you are not sure, you may want to set up an exception service routine. Many times, I have seen issues where an exception occurs, which causes the program to jump to the address of the exception handler automatically set up by BIOS. However, BIOS does not automatically provide a default exception service routine (just the "hooks" for you to create one), so unless you have written that service routine, the code will incorrectly start executing whatever code is after the exception service routine. This results in undefined behavior, and often eventually leads to an abort.

    It would be helpful to know if you are getting an exception, and implementing an exception service routine may help you track the error. It is good practice to implement an exception service routine, even if it is only a while(1) loop, which lets you know when an exception has occurred and prevents the code from executing at undefined locations. However, what I am suggesting will only help you debug the issue, not resolve it. 

    Based on your description of the issue, I believe that RandyP's assessment is correct and that the multiple cores are changing the other cores' memory. I expect that executing the steps that RandyP has recommended will help you solve your issue.

    Regards,

    Derek
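
A service routine of the "only a while(1) loop" kind described above might look like the sketch below. The names `my_exception_hook`, `g_exception_count`, and `g_hold_on_exception` are hypothetical, and registering the function with the BIOS exception hooks is not shown because it depends on the BIOS version and configuration. The loop is controlled by a flag only so that a debugger can release the core.

```c
/* Number of exceptions seen so far; inspect from the debugger. */
static volatile unsigned g_exception_count = 0;

/* While nonzero, the hook spins. Clear it from the debugger (or from
 * test code) to let execution continue. */
static volatile int g_hold_on_exception = 1;

/* Hypothetical exception service routine. It records that an
 * exception happened and then parks the core at a known location
 * instead of letting it run into whatever code follows. */
void my_exception_hook(void)
{
    g_exception_count++;
    while (g_hold_on_exception) {
        /* Spin here; halt the core in CCS to find it at this line. */
    }
}
```

If the core is later found spinning in this function, the crash was preceded by an exception; if it never gets here, the abort has some other cause.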

  • Thank you Derek and RandyP.

    I will try the suggestions.

    In the meantime, we are trying to merge the code from the previous working version.

    regards,

    Lijo

     

  • Here is one observation regarding the crash.
    There is one function that returns 0 and does no processing (it was added for future extension). We pass 5 parameters to the function. One of the parameters is read directly from the shared region (the data obtained through MessageQ_get).
    * If we pass the parameter directly from the shared region, we observe the crash.
    * If we assign the value from the shared region to a local variable and pass that to the function, the application does not crash at all.

    regards,

    Lijo
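
The second case above, copying the shared value into a local before the call, can be written as in the sketch below. Everything here is illustrative: the `Payload` layout, the field names, and `future_extension` are hypothetical stand-ins for the real message and the real five-parameter function. The pointer is marked `volatile` so that each field is read from the shared region exactly once, at a known point.

```c
#include <stdint.h>

/* Hypothetical layout of the data carried in the message. */
typedef struct {
    uint32_t frame_id;
    uint32_t length;
} Payload;

/* Stand-in for the function described above: five parameters,
 * no processing, returns 0. */
static int future_extension(uint32_t a, uint32_t b, uint32_t c,
                            uint32_t d, uint32_t e)
{
    (void)a; (void)b; (void)c; (void)d; (void)e;
    return 0;
}

/* Reads the shared fields once into locals, then passes only the
 * local copies on. */
static int process_message(const volatile Payload *shared)
{
    uint32_t frame_id = shared->frame_id; /* single read of shared data */
    uint32_t length   = shared->length;   /* single read of shared data */

    return future_extension(frame_id, length, 0u, 0u, 0u);
}
```

This pattern only narrows the window in which the shared data is touched; it does not explain why passing the shared value directly leads to the crash.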


     

  • Lijo,

    What levels of optimization are you using when you compile the file with this function and the file with the function that calls it?

    If you set a breakpoint at the call to this function and observe the value of this parameter in shared memory, does it vary between the two cases? What about the other parameters?

    Does the value in shared memory vary after returning from this function with no processing?

    Regards,
    RandyP

  • RandyP,

    Sorry for the late reply.

    1. We tried both 'no optimization' and 'full optimization' (o3). The issue occurs in both cases.

    2. The parameter from shared memory is used after this function call. The value at that location is correct.

    3. The value in shared memory does not change after returning from this function.

    Thanks and regards,

    Lijo

     

  • Lijo,

    I noticed that there has not been any update on this thread in a while, so I wanted to follow up with you. 

    1. Were you able to partition the memory as Randy suggested so that each core has its own memory? If so, what were your results?
    2. Were you able to implement code to turn on exceptions? If so, what were your results?
    One additional thought that I had was whether or not you have cache enabled. If you have cache turned on, then data could be cached locally on the core that wrote to memory, instead of being written to shared memory. Then when the other core reads shared memory, it could read the stale data in shared memory and not get the expected data (which is still cached locally on the writing core).
    To test whether the issue is cache related, you can try running your test with cache disabled. 

    To disable the cache, include these CSL header files:

    #include <ti/csl/csl_cache.h>

    #include <ti/csl/csl_cacheAux.h>

    Then, during the initialization of your program, call these CSL APIs to disable L1 and L2 cache:

    CACHE_setL1DSize(CACHE_L1_0KCACHE); 

    CACHE_setL2Size(CACHE_0KCACHE);

    Regards,

    Derek
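
The stale-data scenario described above can be illustrated with a small host-side model. This is not device code and does not use any TI API: `g_shared` models shared memory, each `CacheLine` models one core's private cached copy, and all function names are hypothetical. On the device, the write-back and invalidate steps would correspond to the cache maintenance calls provided by BIOS or CSL.

```c
#include <stdint.h>
#include <string.h>

#define LINE_WORDS 4

static uint32_t g_shared[LINE_WORDS]; /* models shared memory */

typedef struct {
    uint32_t data[LINE_WORDS];
    int valid; /* line holds a copy of shared memory */
    int dirty; /* line has writes not yet written back */
} CacheLine;

/* Loads the line from shared memory if the core has no copy yet. */
static void line_fill(CacheLine *c)
{
    if (!c->valid) {
        memcpy(c->data, g_shared, sizeof g_shared);
        c->valid = 1;
        c->dirty = 0;
    }
}

/* A write lands in the core's own cache only. */
static void core_write(CacheLine *c, int i, uint32_t v)
{
    line_fill(c);
    c->data[i] = v;
    c->dirty = 1;
}

/* A read is served from the core's own cache when a copy exists. */
static uint32_t core_read(CacheLine *c, int i)
{
    line_fill(c);
    return c->data[i];
}

/* Writer side: push dirty data out to shared memory. */
static void cache_writeback(CacheLine *c)
{
    if (c->valid && c->dirty) {
        memcpy(g_shared, c->data, sizeof g_shared);
        c->dirty = 0;
    }
}

/* Reader side: drop the cached copy so the next read refetches. */
static void cache_invalidate(CacheLine *c)
{
    c->valid = 0;
    c->dirty = 0;
}
```

In this model the reader only sees a new value after the writer has written back and the reader has invalidated; skipping either step leaves the reader with the old value, which is the behaviour the cache-disabled test is meant to rule in or out.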

  • Derek,

    Thank you for the reply.

    We tried adding an exception hook. In the failing case, execution does not go through the exception hook.

    With the cache disabled, our application does not run in real time, so we cannot reproduce the failing case.

    In our current setup, we use the same image with .text placed in MSMCRAM (not in shared region 0), all variable data sections in L2SRAM, and all constant sections in DDR.

    Let us know if you have any other suggestion.

    Thanks and regards,

    Lijo
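
One thing that may be worth checking with this layout, offered as a possibility rather than a diagnosis: on the C6678, each core sees its own L2SRAM at the same local address, so a pointer to an L2SRAM variable that is handed to another core refers to that other core's memory unless it is first converted to a global address. The sketch below shows the conversion using the C6678 memory map (local L2 at 0x00800000, 512 KB, with core n's global alias at 0x10800000 + n * 0x01000000); `l2_local_to_global` is a hypothetical helper, and whether any such pointers are actually exchanged in this application is an assumption.

```c
#include <stdint.h>

#define L2_LOCAL_BASE   0x00800000u /* local view of L2SRAM */
#define L2_SIZE         0x00080000u /* 512 KB per core */
#define L2_GLOBAL_BASE  0x10800000u /* global alias for core 0 */
#define CORE_STRIDE     0x01000000u /* distance between core aliases */

/* Converts a local L2SRAM address on the given core (0..7) to the
 * global address other cores must use. Addresses outside the local
 * L2 range (MSMC, DDR, already-global addresses) are returned as-is. */
static uint32_t l2_local_to_global(uint32_t addr, uint32_t core)
{
    if (addr >= L2_LOCAL_BASE && (addr - L2_LOCAL_BASE) < L2_SIZE) {
        return L2_GLOBAL_BASE + core * CORE_STRIDE
               + (addr - L2_LOCAL_BASE);
    }
    return addr;
}
```

For example, local address 0x00800100 on core 3 maps to global address 0x13800100.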

     

  • Lijo,

    In reference to the comment you made in a previous post: 

    We are passing 5 parameters to the function. One of the parameter is accessed directly from shared region. (the data got through messageq_get). 
    * If we pass the parameter from shared region, then we are observing the crash.
    * If we assign the value from shared region to a local variable and passed to the function, the application is not at all crashing.

    Since you cannot disable the cache, would it be possible to add another parameter to the function, so that you pass both the variable from the shared region and the value assigned to the local variable? Then, inside the function, you could compare them to see when they are not equal. If the values are not equal, then add a software breakpoint to halt execution.

    Something like this should work:

    if (shared_val != assigned_val)
        asm(" SWBP 0 ");

    If the code halts at the software breakpoint, you can open a memory window and view the location of the variable in shared memory to see if it is cached.

    Please let me know if you are able to do this test and what your results are.

    Regards,

    Derek
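
A slightly fuller version of the comparison above, written so that it also builds on a host compiler, might look like this sketch. `future_extension_checked`, `g_mismatch_count`, and `DEBUG_TRAP` are hypothetical names. The software breakpoint is only emitted when a TI compiler is detected through its predefined `__TI_COMPILER_VERSION__` macro; elsewhere the mismatch is just counted.

```c
#include <stdint.h>

/* Number of mismatches seen; inspect from the debugger. */
static volatile unsigned g_mismatch_count = 0;

/* Halts the core under a TI compiler; does nothing elsewhere. */
#if defined(__TI_COMPILER_VERSION__)
#define DEBUG_TRAP() asm(" SWBP 0 ")
#else
#define DEBUG_TRAP() ((void)0)
#endif

/* Receives both the value read directly from the shared region and
 * the copy taken into a local variable, and traps if they differ. */
static int future_extension_checked(uint32_t shared_val,
                                    uint32_t assigned_val)
{
    if (shared_val != assigned_val) {
        g_mismatch_count++;
        DEBUG_TRAP();
    }
    return 0;
}
```

Counting the mismatches in addition to trapping makes it possible to tell, after the fact, whether the values ever differed even when no debugger was attached.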

  • Derek,

    We added a check comparing the variables, with a print. We do not see any change in the value of the variable even though the crash is happening :( .

    regards,

    Lijo

     

  • hi,

    After applying the temporary fix (assigning the value to a local variable and passing that to the function), the issue no longer appears.

    We are not sure when the issue will arise again. If we come across it, we will update this thread. Until then, we will put this on hold.

    Thank you, RandyP and Derek, for the help.

    Thanks and regards,

    Lijo