This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

AM5718: MessageQ latency

Part Number: AM5718

Hello.

My previous thread (https://e2e.ti.com/support/processors/f/791/t/849851#pi320966=1 ) got locked, so I have to start a new one with the same question:

Are there any message delivery latency fixes scheduled for the IPC framework?

Here is one of the related discussions with some details and tests.

e2e.ti.com/.../2904426

  • Hi Sergey,

    I read through your previous thread and saw you mentioned the goal was to offload mathematical functions to the DSP.  What types of data/functions are you looking to offload?

    With the most recent Processor Linux SDK (6.3), I ran a quick test using the "MessageQBench" demo application (/usr/bin/MessageQBench) on the AM572x EVM at various payload sizes (data below).  Standard IPC has a max payload of 512 bytes, and 128 bytes of that is used by the IPC itself, so the max usable payload is 448 bytes.  We have a "big data IPC" example that handles larger payload sizes.

    If you wish to run the same test on your system, you will need to set up a symbolic link to the right DSP firmware (I believe the default firmware is handling OpenCL).  The syntax is:

    MessageQBench <num iterations> <payload bytes> <remote core ID>, where DSP1 has remote core ID 4.

    I believe you mentioned in an earlier post that you saw numbers in the 1ms range.  Based on the numbers below, it looks like the IPC overhead is on the order of 130us with the latest SDK, and the data payload itself adds a small additional overhead (~20us for 448 bytes).

    ln -s /lib/firmware/ipc/ti_platforms_evmDRA7XX_dsp1/messageq_single.xe66 /lib/firmware/dra7-dsp1-fw.xe66
    
    <reboot>
    
    root@am57xx-evm:~# /usr/bin/MessageQBench 1000 8 4
    Using numLoops: 1000; payloadSize: 8, procId : 4
    Entered MessageQApp_execute
    Local MessageQId: 0x80
    Remote queueId  [0x40080]
    Exchanging 1000 messages with remote processor DSP1...
    DSP1: Avg round trip time: 136 usecs
    Leaving MessageQApp_execute
    
    root@am57xx-evm:~# /usr/bin/MessageQBench 1000 16 4
    Using numLoops: 1000; payloadSize: 16, procId : 4
    Entered MessageQApp_execute
    Local MessageQId: 0x80
    Remote queueId  [0x40080]
    Exchanging 1000 messages with remote processor DSP1...
    DSP1: Avg round trip time: 137 usecs
    Leaving MessageQApp_execute
    
    root@am57xx-evm:~# /usr/bin/MessageQBench 1000 32 4
    Using numLoops: 1000; payloadSize: 32, procId : 4
    Entered MessageQApp_execute
    Local MessageQId: 0x80
    Remote queueId  [0x40080]
    Exchanging 1000 messages with remote processor DSP1...
    DSP1: Avg round trip time: 134 usecs
    Leaving MessageQApp_execute
    
    root@am57xx-evm:~# /usr/bin/MessageQBench 1000 64 4
    Using numLoops: 1000; payloadSize: 64, procId : 4
    Entered MessageQApp_execute
    Local MessageQId: 0x80
    Remote queueId  [0x40080]
    Exchanging 1000 messages with remote processor DSP1...
    DSP1: Avg round trip time: 143 usecs
    Leaving MessageQApp_execute
    
    root@am57xx-evm:~# /usr/bin/MessageQBench 1000 128 4
    Using numLoops: 1000; payloadSize: 128, procId : 4
    Entered MessageQApp_execute
    Local MessageQId: 0x80
    Remote queueId  [0x40080]
    Exchanging 1000 messages with remote processor DSP1...
    DSP1: Avg round trip time: 145 usecs
    Leaving MessageQApp_execute
    
    root@am57xx-evm:~# /usr/bin/MessageQBench 1000 256 4
    Using numLoops: 1000; payloadSize: 256, procId : 4
    Entered MessageQApp_execute
    Local MessageQId: 0x80
    Remote queueId  [0x40080]
    Exchanging 1000 messages with remote processor DSP1...
    DSP1: Avg round trip time: 152 usecs
    Leaving MessageQApp_execute
    
    root@am57xx-evm:~# /usr/bin/MessageQBench 1000 384 4
    Using numLoops: 1000; payloadSize: 384, procId : 4
    Entered MessageQApp_execute
    Local MessageQId: 0x80
    Remote queueId  [0x40080]
    Exchanging 1000 messages with remote processor DSP1...
    DSP1: Avg round trip time: 158 usecs
    Leaving MessageQApp_execute
    
    root@am57xx-evm:~# /usr/bin/MessageQBench 1000 448 4
    Using numLoops: 1000; payloadSize: 448, procId : 4
    Entered MessageQApp_execute
    Local MessageQId: 0x80
    Remote queueId  [0x40080]
    Exchanging 1000 messages with remote processor DSP1...
    DSP1: Avg round trip time: 160 usecs
    Leaving MessageQApp_execute

    Regards,
    Mike

  • Hello Michael,

    Unfortunately, you missed the main issue that I found with the TI IPC framework.

    Your tests show the average turnaround time and are therefore misleading.

    Modify the test to show the maximum turnaround time and you'll find that the system is highly unreliable and is not suitable for any streamed data processing.

    For details, please review the initial thread.

  • Hi Sergey,

    Apologies, I had not read through the entire thread, and now understand the issue more fully.

    Looking at the example applications, I believe one issue is that there are no settings for the scheduler policy or priority.

    Have you tried profiling the system with cyclictest?

    With something like 'cyclictest -p99 --policy fifo -n', I see max values in the 30us range (average 11us).
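    For reference, a minimal sketch of how an application could request the same policy and priority from its own code is below. It assumes root (or CAP_SYS_NICE) and is not taken from the IPC example sources; it simply mirrors what cyclictest does with -p99 --policy fifo.

    /* Sketch: switch the calling process to SCHED_FIFO priority 99 and lock
     * its memory, similar to running cyclictest with -p99 --policy fifo.
     * Requires root or CAP_SYS_NICE; error handling kept minimal. */
    #include <sched.h>
    #include <sys/mman.h>
    #include <stdio.h>

    static int make_realtime(void)
    {
        struct sched_param sp = { .sched_priority = 99 };

        /* Move the whole process into the FIFO realtime scheduling class. */
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");
            return -1;
        }

        /* Lock current and future pages to avoid page-fault latency. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
            perror("mlockall");
            return -1;
        }

        return 0;
    }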

    What are your requirements in terms of scheduler latency and jitter, and at what sample rate?

    Regards,
    Mike

  • Hello Mike,

    We have a real-time audio processing application running on the ARM A15 core, and even on non-RT Linux we have zero problems with it (it runs for days), so I'm confident this is not a local issue but a problem inside the IPC framework.

    Note: the IPC message turnaround test was executed on a 100% idle system, so there were no other applications running concurrently with the test.

    As for your question: the requirement is to have a mechanism with consistent latency for passing messages between cores.

    The thing is: we have been discussing this particular issue since March 2019 (so it has been under discussion for more than a year).

    Is it possible for you or someone at TI to add a few lines of code to the MessageQ example and confirm the issue, please?

    Capturing the max latency is just a few lines of code, really.
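    For illustration, something along these lines around the existing send/receive pair in the benchmark loop would be enough (just a sketch; the variable names are mine and the send/receive calls are placeholders, not the actual MessageQBench code):

    /* Sketch: track min/avg/max round-trip time instead of only the average. */
    #include <time.h>
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t now_usec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
    }

    void run_loop(unsigned num_loops)
    {
        uint64_t min = UINT64_MAX, max = 0, total = 0;

        for (unsigned i = 0; i < num_loops; i++) {
            uint64_t t0 = now_usec();
            /* send_to_dsp();       placeholder for the MessageQ send call    */
            /* wait_for_reply();    placeholder for the MessageQ receive call */
            uint64_t dt = now_usec() - t0;

            if (dt < min) min = dt;
            if (dt > max) max = dt;
            total += dt;
        }

        printf("round trip usec: min %llu  avg %llu  max %llu\n",
               (unsigned long long)min,
               (unsigned long long)(total / num_loops),
               (unsigned long long)max);
    }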

  • Hi Sergey,

    I believe the issue stems from the fact that the IPC example was not designed to showcase hard realtime performance.  To achieve this, we would need to introduce a scheduler policy and task priority.  Setting the nice value from the command line will only go so far.

    Have you experimented with cyclictest to see what level of consistency is achievable as a baseline?  The scheduler policy and priority values can be applied to an application to get better results.

    You can also run cyclictest in silent mode as a background task and run another program like htop to see which other kernel tasks are running; those can have a significant influence on a process that is not given the highest priority.

    Going back to the discussion you had with Rex, I saw he provided data that showed there were still a few outliers in the 30-50us range when he adjusted the nice value.  Do you need zero outliers?  If yes, you are essentially requiring hard realtime, and there are techniques to approach that with Linux, but it is not something that we are providing out of the box.

    Regards,
    Mike

  • Hello Mike,

    We are definitely not looking for hard realtime performance here, but we are looking for consistency.

    I'll try to explain.

    For example, let's assume we have some data captured in real time from an ADC that we need to process every 1000 usec on the ARM core.

    So, if the ARM core tries to offload some computations to the DSP core, it needs to send this data to the DSP (through shared memory), notify the DSP about the new task, and then wait for a notification that the DSP task has completed before reading the processed data back from shared memory.

    Thus, if we have a guaranteed notification turnaround time under 100 usec, we have 1000 usec - 100 usec = 900 usec of guaranteed DSP processing time.

    If the notification turnaround time is under 50 usec on average but with random spikes of up to 1200 usec, the DSP core is unusable for us (unless we implement some data buffering and increase the overall device processing latency).
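    To make the budget concrete, the ARM-side cycle we have in mind looks roughly like this (a sketch only; notify_dsp() and wait_for_dsp_done() are hypothetical placeholders for the notification mechanism, not real TI IPC calls):

    /* Sketch of the intended 1000 usec processing cycle.  shared_buf is a
     * shared-memory region visible to both the A15 and the DSP; the notify
     * functions are hypothetical placeholders. */
    #include <stddef.h>
    #include <string.h>

    extern void notify_dsp(void);        /* "new task" notification           */
    extern void wait_for_dsp_done(void); /* blocks until the DSP signals back */

    enum {
        CYCLE_USEC         = 1000,                            /* ADC block period      */
        NOTIFY_BUDGET_USEC = 100,                             /* allowed IPC overhead  */
        DSP_BUDGET_USEC    = CYCLE_USEC - NOTIFY_BUDGET_USEC  /* 900 usec left for DSP */
    };

    void process_one_block(void *shared_buf, void *adc_block, size_t len)
    {
        memcpy(shared_buf, adc_block, len);  /* raw samples go via shared memory */
        notify_dsp();
        wait_for_dsp_done();                 /* both notifications together must */
                                             /* fit inside NOTIFY_BUDGET_USEC    */
        memcpy(adc_block, shared_buf, len);  /* read the processed block back    */
    }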

    Actually, I'm not sure IPC is really suitable even for video processing on the DSP, as I remember seeing latency spikes higher than 40 milliseconds when I added a moderate load on the A15 core.

    Taking into account that 30-frames-per-second video leaves about 30 milliseconds for processing one frame, it would not work 100% reliably even for video processing.

    In my opinion the whole IPC framework is extremely oversized. I'd be much happier to have an easy/lightweight/fast core-to-core notification mechanism and just use shared memory for passing data between cores.

    Personally, I suspect there is some kind of message queue, and probably some kind of notification scheduler, hidden deep in the IPC sources, and that this is the cause of the random latency spikes.

    I'd say that for an average software engineer it's much easier to build their own message FIFO, tailored to their particular task, on top of a simple and fast notification framework than to dive deep into the details of the AM57xx mailboxes and implement these on the Linux and RTOS sides.
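    As an illustration, the data side of such a FIFO is almost trivial once a reliable core-to-core notification exists. A single-producer/single-consumer ring over shared memory is roughly the following (just a sketch; cache maintenance, memory barriers and the actual mailbox kick are deliberately left out):

    /* Sketch of a single-producer / single-consumer FIFO in shared memory.
     * One core only writes 'head', the other only writes 'tail', so no locks
     * are needed.  Barriers and cache maintenance are omitted here. */
    #include <stdint.h>
    #include <string.h>

    #define SLOT_SIZE 448
    #define NUM_SLOTS 16                  /* keep it a power of two */

    typedef struct {
        volatile uint32_t head;           /* written by the producer core only */
        volatile uint32_t tail;           /* written by the consumer core only */
        uint8_t slots[NUM_SLOTS][SLOT_SIZE];
    } shm_fifo_t;

    /* Producer side: returns 0 on success, -1 if the FIFO is full. */
    int fifo_push(shm_fifo_t *f, const void *msg, uint32_t len)
    {
        uint32_t head = f->head;
        if (head - f->tail >= NUM_SLOTS || len > SLOT_SIZE)
            return -1;
        memcpy(f->slots[head % NUM_SLOTS], msg, len);
        f->head = head + 1;               /* publish, then kick the mailbox */
        return 0;
    }

    /* Consumer side: returns 0 on success, -1 if the FIFO is empty. */
    int fifo_pop(shm_fifo_t *f, void *msg, uint32_t len)
    {
        uint32_t tail = f->tail;
        if (tail == f->head || len > SLOT_SIZE)
            return -1;
        memcpy(msg, f->slots[tail % NUM_SLOTS], len);
        f->tail = tail + 1;
        return 0;
    }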

    PS: as for your questions about profiling, sorry, we simply don't have the resources or the expertise for that. We chose the TI AM57xx as a platform because we thought the software support would be good and worth the higher cost compared to Chinese competitors.

    We'd like to just build our software on top of a reliable and bug-free framework.

    We used the MessageQ test as a basis for our decision to choose this platform, and unfortunately the test results are not reliable.

    I even proposed to Rex that the MessageQ test be changed to show average and max/min notification turnaround time measurements, to save other developers from these problems.

    It was a year ago. Hope you'll understand my frustration.

    Some work was done on providing mailbox usage examples/tests (here: https://e2e.ti.com/support/processors/f/791/t/849851#pi320966=1), but unfortunately that task seems to be frozen now.

  • May I ask if you have any response, please?

  • Hi Sergey,

    Apologies for the delay.

    My feeling on this issue is that the application must utilize the Linux scheduler policy and priority hooks.  The test application we provide is there to demonstrate how to set up IPC, and it is up to the end user to tailor the implementation to meet the timing requirements of the end product.

    The profiling tests I mentioned use basic Linux tools that can establish the baseline task execution jitter/latency under various scheduler policies and priorities.  If you run the test with no scheduler policy or priority specified, you will see very large numbers for max latency, which correlates with what you're seeing with the MessageQ application.  Once you add the scheduler policy and task priority, the latency will drop dramatically and the large spikes will go away.  The same policy and priority values can be used in your application to achieve reliable timing.
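    For example, a thread-level variant of the earlier process-wide sketch could be applied to the thread that blocks on the IPC receive call (again a sketch, not code from the IPC package, and it needs root or CAP_SYS_NICE):

    /* Sketch: give one specific thread SCHED_FIFO priority 99, matching the
     * cyclictest settings mentioned earlier in this thread. */
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static int make_thread_realtime(pthread_t tid)
    {
        struct sched_param sp = { .sched_priority = 99 };
        int err = pthread_setschedparam(tid, SCHED_FIFO, &sp);

        if (err != 0)
            fprintf(stderr, "pthread_setschedparam failed: %d\n", err);
        return err;
    }

    /* Typically called as make_thread_realtime(pthread_self()) at the start
     * of the receive thread, before entering the message loop. */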

    Regards,
    Mike

  • Hello Mike,

    In the initial thread that was locked, you can find that I already tried to increase the priorities.

    Here is the quote: "I tried to increase the priority of the test application and [irq/48-mbox_dsp] process but I can't get the issue fixed."

    http://e2e.ti.com/support/processors/f/791/p/783901/2904426#pi320966=1

    I even tried it under RT Linux (described in the same thread).

    Regards.

  • May I ask if you have any response, please?

  • Hi Sergey,

    Unfortunately, there is nothing more I can offer on this issue.  There are mechanisms provided by the Linux kernel to address your latency issues, and you will need to determine what will work for your system and application.

    Regards,
    Mike