OpenMAX: parallel scheme causes video frame rate drop

Hi OpenMAX experts,

I am using a parallel OMX chain scheme and get a video frame rate drop that I can't explain. Can anyone please help?

I am working with an OMX chain that looks as follows:

The connection between the capture and DEI is made as usual by the function IL_ClientConnectComponents() and the thread IL_ClientConnInConnOutTask().
The connection between the capture and scalar2 works as follows: the input and output buffers of scalar2 are not connected to any OMX component, and scalar2 allocates its own input and output buffers. During IL_ClientCbCaptureFillBufferDone(), the pointer to the filled buffer is stored in a mailbox, and the scalar2 input thread copies the buffer from the mailbox into the scalar2 input buffer using EDMA. The outputs of scalar1 and scalar2 are copied into buffers allocated with CMEM in the Linux memory area (which is cached by the A8) and then averaged by the Neon; the result is copied to the display input buffer using EDMA. The frame rate for this OMX chain is unstable: it fluctuates constantly between 19 and 30 fps, which also causes visible bouncing in the video.
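For reference, the averaging step described above can be sketched in portable C as follows. This is only an illustration under assumptions, not the code running on the board: the names avg_u8 and average_frames are hypothetical, both frames are assumed to have the same size and pixel format, and the loop is written in plain C rather than with Neon instructions. The rounding used here is the same one a Neon rounding halving add (VRHADD.U8) produces.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rounded average of two 8-bit samples. The 16-bit intermediate
 * prevents overflow when a + b exceeds 255. */
static inline uint8_t avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((uint16_t)a + (uint16_t)b + 1u) >> 1);
}

/* Average two equally sized frames byte by byte into dst.
 * Assumes both frames share the same layout, so luma and chroma
 * samples sit at the same offsets in src_a and src_b. */
void average_frames(uint8_t *dst, const uint8_t *src_a,
                    const uint8_t *src_b, size_t num_bytes)
{
    assert(dst != NULL && src_a != NULL && src_b != NULL);
    for (size_t i = 0; i < num_bytes; ++i) {
        dst[i] = avg_u8(src_a[i], src_b[i]);
    }
}
```

Each output byte costs two reads and one write, so for full-size frames this step is dominated by memory traffic rather than arithmetic.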
Reducing the buffer size doesn't seem to change anything (so it is not the EDMA copying or the Neon that causes the problem).
When I tested the OMX chains separately, the upper chain (capture->dei->dsp->scalar1->copy in and out of Linux area buffer using EDMA->display) ran at 30 fps (because of the DSP processing).
The lower chain (capture->copy in and out of Linux area buffer using EDMA->scalar2->copy in and out of Linux area buffer using EDMA->display) ran at 59 fps.
Can anyone please advise?

Thanks,
Gabi 

  • Gabi,

    If you don't do the DSP and Neon processing, are you able to get the required performance? Also, why don't you use the dual o/p of DEI instead of a second scalar? Is there any color format limitation? And are Scalar 1 and Scalar 2 the same component type, SCWB or DEI?

    Is DEI doing de-interlacing, i.e. is algenabled set? If so, you can try increasing the number of o/p buffers of DEI.

    Regards

    Vimal

  • Hi Vimal,

    Thanks for your reply, and sorry it took me so long to respond.

    Vimal Jain said:

    If you don't do the DSP and Neon processing, are you able to get the required performance?

     

    I can put the DSP in loopback mode, and the frame rate does not change. I must use the Neon for summing the two video streams, because I want to check whether a parallel scheme such as this can improve performance by offloading the DSP and using the Neon for further processing.

    Vimal Jain said:

    Also, why don't you use the dual o/p of DEI instead of a second scalar? Is there any color format limitation? And are Scalar 1 and Scalar 2 the same component type, SCWB or DEI?

    I am using a second scalar for testing purposes, even though I could have used the second output of the DEI. I want to test whether it is possible to work this way with parallel unconnected paths, and whether doing so makes better use of the SoC's capabilities.
    Scalar1 and Scalar2 are the same component type, SCWB.

    Vimal Jain said:

    Is DEI doing de-interlacing, i.e. is algenabled set? If so, you can try increasing the number of o/p buffers of DEI.

    DEI is not doing de-interlacing, and increasing the number of o/p buffers of DEI doesn't seem to help. It looks like some kind of SoC (TI8168) limitation.
    I have tried several other parallel schemes which, in theory, should have given better frame rates than performing all the processing in the DSP, but I have never obtained a better frame rate with a parallel scheme, even though every OMX component on its own should sustain a higher frame rate. This makes me think that there may be an overall data transfer bandwidth limitation or something similar.
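
    One way to sanity-check the bandwidth hypothesis is a back-of-envelope estimate of the DDR traffic generated by the copies. The sketch below is only an assumption-based calculation: the function name ddr_bytes_per_sec is hypothetical, and the frame size and pixel format in the usage note (1920x1080, 2 bytes per pixel) are example values, not figures taken from this chain.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Bytes per second moved through DDR for one video stream.
     * passes counts every full-frame read or write, so one EDMA copy
     * (read the source, write the destination) contributes 2 passes. */
    uint64_t ddr_bytes_per_sec(uint32_t width, uint32_t height,
                               uint32_t bytes_per_pixel,
                               uint32_t fps, uint32_t passes)
    {
        assert(bytes_per_pixel > 0 && fps > 0);
        return (uint64_t)width * height * bytes_per_pixel * fps * passes;
    }
    ```

    With the assumed example values, ddr_bytes_per_sec(1920, 1080, 2, 30, 2) gives 248,832,000 bytes/s, so a single full-frame EDMA copy at 30 fps costs roughly 249 MB/s. Summing the passes for every copy, scalar, DSP and Neon access in the chain gives a total that can be compared against the DDR throughput actually measured on the board.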

    Vimal, I am very interested to hear your opinion.

    Thanks,
    Gabi 

  • Hi Gabi.

    A few suggestions (though I am not sure how much of this really applies to your use case):

    1. If you use DEI dual o/p mode, the two outputs are separate parallel paths in HW, so you could save some DDR b/w by using dual o/p instead of EDMA.

    2. If you are doing averaging (some read/write) in Neon and some read/write in DSP, it would be better to do it all in one place, either DSP or Neon, so that the data is read only once. That saves a good amount of CPU/DDR b/w.

    3. Why use EDMA to move the averaged o/p to the display buffer? Is it not possible to write the o/p directly into the display buffer, rather than copying it with EDMA?

    4. Instead of using two instances of SCWB, maybe DEIM can be used, since it has a different scalar in its path. Each IP is supposed to give at most 1080p60 i/p or o/p, so sharing the same SC would give half the rate. (Depending on actual usage, you might see slightly better performance.)

    Regards

    Vimal

  • Hi Vimal,

    Thank you very much for your answer.

    Vimal Jain said:

    1. If you use DEI dual o/p mode, the two outputs are separate parallel paths in HW, so you could save some DDR b/w by using dual o/p instead of EDMA.

    Done; that didn't make any difference.

    Vimal Jain said:

    2. If you are doing averaging (some read/write) in Neon and some read/write in DSP, it would be better to do it all in one place, either DSP or Neon, so that the data is read only once. That saves a good amount of CPU/DDR b/w.

    In the DSP I am processing decimated video: the decimation (down-scale) is performed by the DEI, and after processing in the DSP I use a scalar for interpolation (up-scale) back to the original size. In the Neon I am averaging two original-size videos. With your suggestion I would need to perform the up-scaling of the reduced video in the DSP instead of in external HW (the scalar), and also perform the averaging in the DSP. My goal is to reduce the load on the DSP and spread it over the other processors (HDVPSS and A8).

    Vimal Jain said:

    3. Why use EDMA to move the averaged o/p to the display buffer? Is it not possible to write the o/p directly into the display buffer, rather than copying it with EDMA?

     

    Because making SR2 cacheable for the A8 gave poor Neon performance, I had to allocate the Neon working buffers in the CMEM memory area, copy data from SR2 to the working buffers in CMEM, and copy the averaged data from CMEM back to SR2 for display, all using EDMA. I would be happy if you could suggest a better way.

    Vimal Jain said:

    4. Instead of using two instances of SCWB, maybe DEIM can be used, since it has a different scalar in its path. Each IP is supposed to give at most 1080p60 i/p or o/p, so sharing the same SC would give half the rate. (Depending on actual usage, you might see slightly better performance.)

    It didn't seem to make any difference.

    Thanks,
    Gabi 

  • Hi Gabi,

    I feel it must be the Neon processing that is reducing the frame rate. Based on the above post, I assume you have not tried both chains without Neon processing. As per the posts, the upper chain worked at 30 fps (so this is the maximum you would get when both chains are up), but in that test you were not doing any Neon processing, is that right? Could you please try the upper chain with some dummy Neon processing on one stream?

    Also, if it helps, I have come across JPEG 720x1280 decoding with YUV to RGB conversion running at 30+ fps; EDMA was used to bring the YUV data into DSP internal memory before conversion.

    Regards

    Vimal

  • Hi Vimal, 

    Thank you for your response. I was hoping to stay at 30 fps, which is fine for me. I have tried both OMX chains without Neon processing, as I wrote in the first post: the upper one ran at 30 fps and the lower one at 59 fps. When I combined both schemes but used only one of them for the display, with a loop-back through the Neon, the results were still 30 fps for the upper scheme and 59 fps for the lower scheme. The frame rate deteriorated only when I was doing the averaging.

    Vimal Jain said:

    Also, if it helps, I have come across JPEG 720x1280 decoding with YUV to RGB conversion running at 30+ fps; EDMA was used to bring the YUV data into DSP internal memory before conversion.

    I would like to look at this code, if possible. Is it DSP code or Neon code?

    Thanks,
    Gabi 

  • Hi,

    I do not have the code; it was DSP code. Maybe you can do the averaging on the DSP. You could DMA a few lines of both buffers into the internal memory of the DSP, process them, and DMA the output for those lines back out. Neon accesses to DDR would be very slow compared to DMA. And yes, if the averaging can be done at a smaller resolution, you could combine it with the other algorithm as well.
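
    A minimal sketch of that line-block approach is shown below. It is an illustration only: memcpy stands in for the DMA transfers (no real EDMA API is used), the static arrays stand in for DSP internal memory, and the names average_by_strips, STRIP_LINES and MAX_LINE_BYTES are hypothetical. It also assumes the lines are stored contiguously, i.e. the pitch equals line_bytes.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define STRIP_LINES    8     /* lines per block; assumed, tune to on-chip memory */
    #define MAX_LINE_BYTES 4096  /* assumed upper bound on bytes per line */

    /* Stand-ins for buffers placed in on-chip (internal) memory. */
    static uint8_t strip_a[STRIP_LINES * MAX_LINE_BYTES];
    static uint8_t strip_b[STRIP_LINES * MAX_LINE_BYTES];
    static uint8_t strip_out[STRIP_LINES * MAX_LINE_BYTES];

    /* Average two frames a few lines at a time. Each memcpy marks the
     * place where a DMA transfer would be issued on real hardware. */
    void average_by_strips(uint8_t *dst, const uint8_t *src_a,
                           const uint8_t *src_b,
                           size_t line_bytes, size_t num_lines)
    {
        assert(line_bytes > 0 && line_bytes <= MAX_LINE_BYTES);
        for (size_t line = 0; line < num_lines; line += STRIP_LINES) {
            size_t lines = num_lines - line;
            if (lines > STRIP_LINES) {
                lines = STRIP_LINES;   /* last block may be shorter */
            }
            size_t bytes = lines * line_bytes;
            size_t offset = line * line_bytes;

            memcpy(strip_a, src_a + offset, bytes);    /* "DMA in", stream A */
            memcpy(strip_b, src_b + offset, bytes);    /* "DMA in", stream B */
            for (size_t i = 0; i < bytes; ++i) {
                strip_out[i] = (uint8_t)((strip_a[i] + strip_b[i] + 1) >> 1);
            }
            memcpy(dst + offset, strip_out, bytes);    /* "DMA out" */
        }
    }
    ```

    On real hardware the blocks would typically be double-buffered, so that the transfer of the next block overlaps the processing of the current one; the sketch leaves that out to keep the data flow visible.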

    Regards

    Vimal

  • Hi Vimal,

    Thank you very much for your answer.

    Vimal Jain said:

    Maybe you can do the averaging on the DSP. You could DMA a few lines of both buffers into the internal memory of the DSP, process them, and DMA the output for those lines back out.

    The DSP OMX component is connected after the DEI, which down-scales the video, and its output is up-scaled by scalar 1. I need to average the video after scalar 1 with the video after scalar 2. What would be the best way to average those two outputs using the DSP? Should I connect another DSP OMX component with two inputs? Do you have any other suggestions?

    Vimal Jain said:

    Neon accesses to DDR would be very slow compared to DMA. 

     

    Can TI please recommend a way of using the Neon efficiently for video processing? Where and how should the Neon working buffers be allocated? Is it possible to access the Neon L2 cache in a way similar to accessing the DSP L2 cache? Please answer these questions, because I believe many TI816X users will need this kind of information.

    Thanks,
    Gabi