TDA2HG: 【OpenGL】QT Weston Rendering bad performance

user5312037

Part Number: TDA2HG

Hello:

We are working on Vision SDK 3.05 with Qt 5.6, and we found that the AVM output reaches only about 17 fps. See the following diagram:

One Qt application has two windows (summ and surround).
Each window has its own OpenGL instance.
Each OpenGL instance shares the same camera image data (same memory address).

On the camera side the frame rate is 25 fps. Each OpenGL instance has a different frame rate, and the final output to Weston is only about 17 fps.

Does that mean the two OpenGL windows are not rendering in parallel, but are only interleaved (concurrent)? Is that right?

How can we make these two OpenGL windows render faster, so that each reaches 25 fps?

thanks

over 3 years ago

0 Hemant Hariyani over 3 years ago

TI__Expert 8385 points

Hello,

Can you please run PVRTune, save the pvrtune to a file and share it? More details here:

https://www.imgtec.com/developers/powervr-sdk-tools/pvrtune/

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi,

Please check the attachment.

thanks....

gpu-perf.rar

0 Hemant Hariyani over 3 years ago in reply to user5312037

TI__Expert 8385 points

Hello,

Thanks a lot for PVRTune.

Although it is not very clear from PVRTune, it seems that there are three renders from Qt for each Weston render. Render #1 and render #3 seem to be from the same Qt application, and render #2 seems to be different.

I also see that Weston and the Qt app run in parallel, but the Qt app's own tasks do not run in parallel with each other. It is difficult to guess the exact reason for this; it could be the way the Qt windows and tasks interact with each other. Do you have more details on this? It might be worth checking on the Qt side.

Regards

Hemant

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi Hemant:

Is render #2 Weston?

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi Hemant:

To keep it simple, we disabled one of the Qt OpenGL windows and found that the paint code below takes about 60~80 ms:

#include <chrono>
#include <cstdio>
#include <iostream>

void SummWidget::paintGL()
{
    // Time the call with a monotonic clock; system_clock can jump if the
    // wall-clock time is adjusted. This measures CPU time spent in paintGL().
    auto start = std::chrono::steady_clock::now();
    printf("SummWidget::paintGL begin \n");
    ShowCurrentTime();

    // Clear color and depth buffers
    glClear( GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);

    // Surround-view (AVM) rendering
    m_avmGpu->avm_gpu_process(1);

    // Overlay layers: bottom, assist lines, radar
    m_overlay->overlay_render_start();
    m_overlay->bottom_render_process();
    m_overlay->assist_line_process(0);
    m_overlay->radar_render_process();
    m_overlay->overlay_render_end();

    // Car model
    m_carModel->render_process(1);

    // Check for GL errors raised by the calls above
    GLenum gl_err = glGetError();
    if (gl_err != GL_NO_ERROR)
    {
        //SUMMWGT_LOG("SummWidget::paintGL. gl_error=%d\n", gl_err);
    }

    // Report the elapsed time in microseconds and in seconds
    auto end = std::chrono::steady_clock::now();
    PaintOneFrameTIme = std::chrono::duration_cast<std::chrono::microseconds>(end-start).count();
    std::chrono::duration<double> diff = end-start;
    std::cout<<"paint  time consume us "<<PaintOneFrameTIme<<std::endl;
    std::cout<<"paint diff time consume s "<<diff.count()<<std::endl;
    printf("SummWidget::paintGL end \n");
    ShowCurrentTime();
}

Checking with the PVRTune tool, the Qt instance always seems to run at 12.5 fps. Is there any configuration that blocks Qt OpenGL?

Weston is also at 12.5 fps.

Please check the attachment: single qt window.rar

0 Hemant Hariyani over 3 years ago in reply to user5312037

TI__Expert 8385 points

Hello,

Thanks for the experiment and the PVRTune capture. We see that while the Weston render task is running, the Qt tiler task can run in parallel. But there is no parallel execution while the Qt tasks are running. We are not sure whether some serialization is happening (e.g. glFinish, glFlush or glReadPixels).

Can we run some non-Qt tasks to check? Maybe something that runs slower than 60 fps.

Regards

Hemant

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi Hemant

We ran weston-simple-egl to check. PVRTune shows about 60 fps, but the log shows less than 40 fps:

root@dra7xx-evm:~# weston-simple-egl 
wlpvr: PVR Services Initialised
179 frames in 5 seconds: 35.799999 fps
156 frames in 5 seconds: 31.200001 fps
171 frames in 5 seconds: 34.200001 fps
159 frames in 5 seconds: 31.799999 fps
179 frames in 5 seconds: 35.799999 fps
172 frames in 5 seconds: 34.400002 fps
163 frames in 5 seconds: 32.599998 fps
172 frames in 5 seconds: 34.400002 fps
163 frames in 5 seconds: 32.599998 fps
^Csimple-egl exiting
wlpvr: PVR Services DeInitialised
root@dra7xx-evm:~#

weston-simple-egl.rar

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi Hemant:

We forced Qt to run a single OpenGL instance with a period of 33 ms.

In PVRTune we see that the total output is about 33 fps, of which AVM is about 15 fps and Weston is about 16 fps.

If we kill AVM, both drop to 0 fps. Does that mean Weston is blocking the AVM GPU work?

Also, to correct my earlier mistake: the OpenGL painting is very fast and needs only 1~2 ms.

avm-weston.rar

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi Hemant:

Any update?

Is it blocked by Weston? Can we remove Weston and use eglfs instead?

0 Yong Zhang over 3 years ago in reply to user5312037

TI__Genius 9401 points

Dear Customer,

We tried running weston-simple-egl on VSDK 03.05 on the TI J6 EVM. The kernel log and the PVRTune log both show that the frame rate reaches 60 fps.

Thanks for your previous help. Your log shows only about 35 fps, so there is a discrepancy between the two setups.

We suggest first running the command below to check the GPU frequency. Could you also please check why there is such a difference?

"omapconf show opp"

Thanks a lot!

yong

0 Yong Zhang over 3 years ago in reply to Yong Zhang

TI__Genius 9401 points

Dear Hemant,

One question: we tested the weston-simple-egl demo on the TI J6 EVM, and the log shows 60 fps. Would you please tell us how this fps figure is calculated?

root@dra7xx-evm:/opt/vision_sdk# weston-simple-egl &\
[2] 888

301 frames in 5 seconds: 60.200001 fps
301 frames in 5 seconds: 60.200001 fps

301 frames in 5 seconds: 60.200001 fps
301 frames in 5 seconds: 60.200001 fps
301 frames in 5 seconds: 60.200001 fps
301 frames in 5 seconds: 60.200001 fps
301 frames in 5 seconds: 60.200001 fps

Thanks a lot!

yong

0 user5312037 over 3 years ago in reply to Yong Zhang

Genius 4320 points

Hello All:

We have migrated to Vision SDK 3.08. Here are the latest results:

root@dra7xx-evm:~# weston-simple-egl 
wlpvr: PVR Services Initialised
wlpvr: Creating Wayland Client surface 2 buffers for process pid=1216!
293 frames in 5 seconds: 58.599998 fps
283 frames in 5 seconds: 56.599998 fps
190 frames in 5 seconds: 38.000000 fps
254 frames in 5 seconds: 50.799999 fps
292 frames in 5 seconds: 58.400002 fps
296 frames in 5 seconds: 59.200001 fps
282 frames in 5 seconds: 56.400002 fps
211 frames in 5 seconds: 42.200001 fps
215 frames in 5 seconds: 43.000000 fps
286 frames in 5 seconds: 57.200001 fps
294 frames in 5 seconds: 58.799999 fps
286 frames in 5 seconds: 57.200001 fps
193 frames in 5 seconds: 38.599998 fps
222 frames in 5 seconds: 44.400002 fps
291 frames in 5 seconds: 58.200001 fps
292 frames in 5 seconds: 58.400002 fps
283 frames in 5 seconds: 56.599998 fps
218 frames in 5 seconds: 43.599998 fps
234 frames in 5 seconds: 46.799999 fps
288 frames in 5 seconds: 57.599998 fps
294 frames in 5 seconds: 58.799999 fps
285 frames in 5 seconds: 57.000000 fps
210 frames in 5 seconds: 42.000000 fps
219 frames in 5 seconds: 43.799999 fps

0 user5312037 over 3 years ago in reply to Yong Zhang

Genius 4320 points

Hi:

See the PVRTrace in the attachment: ecarx.log.rar

0 Yong Zhang over 3 years ago in reply to user5312037

TI__Genius 9401 points

Dear Hemant,

The customer uploaded the PVRTrace log for their avm_qt_app application. The use case is the one posted at the beginning of this ticket.

Please help comment on it.

Thanks a lot!

yong

0 Hemant Hariyani over 3 years ago in reply to user5312037

TI__Expert 8385 points

Hello,

Thank you for the trace. I can see the UI rendered along with some lines, but no textures. Is it possible for you to provide a trace with TEXTURE_2D instead of TEXTURE_EXTERNAL_OES? This is only for debugging.

Can you also open this trace in PVRTrace GUI and confirm if the output looks okay?

Regards

Hemant

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi Hemant:

Hemant Hariyani said:
Can you also open this trace in PVRTrace GUI and confirm if the output looks okay?

Yes, we opened it in PVRTrace GUI. It can show only part of the log file; see the log.

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hemant Hariyani said:
I can see the UI rendered along with some lines but no textures.

The GUI shows only part of the log, and in that part of the scene the camera data has not been loaded yet.

Hemant Hariyani said:
Is it possible for you to provide a trace with TEXTURE_2D instead of TEXTURE_EXTERNAL_OES. This is only for debug.

We will update later.

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi Hemant:

The attachment is the PVRTrace that uses GL_TEXTURE_2D instead of GL_TEXTURE_EXTERNAL_OES.

As for the unexpected EOF, we found that it is a bug in PVRRecorder. We tried the latest version, 2019R2, and it does not work. Only the revision in PSDKLA 6.03 works.

0 Hemant Hariyani over 3 years ago in reply to user5312037

TI__Expert 8385 points

Hello,

Seems like the attachment is missing. Can you please check and re-upload?

Regards

Hemant

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi Hemant:

uploaded

ecarx.log-GL_TEXTURE_2D.rar

0 Yong Zhang over 3 years ago in reply to user5312037

TI__Genius 9401 points

Dear Hemant.

I tried PVRTraceGUI 3.10 and PVRTraceGUI 3.13 to open the PVRTrace log, ecarx.log-GL_TEXTURE_2D.rar. Both still show the prompt dialog "unexpected end of file, open partial trace?". So please check whether the log is good enough for your analysis.

As aligned with the customer, they used the built-in img-powervr-sdk from PSDKLA 6_00_00_03 to capture that PVRTrace log.

The details of the img-powervr-sdk folder are shown below. I can only print the version of PVRPerfServerDeveloper.

If the log is not sufficient, please guide us on how to capture a good PVRTrace log. For example, which version of the PVRTrace recording tool should be used?

Thanks a lot!

yong

0 Hemant Hariyani over 3 years ago in reply to Yong Zhang

TI__Expert 8385 points

Hello,

Thanks a lot for the trace. We are working with Imagination to analyze it and to see if unexpected end of file error is a problem.

Can you please confirm the screen resolution? I believe we are targeting 1280x720.

Can you also try running PVRTracePlayback on J6 to see that it looks okay visually and that we are not missing anything?

Also, can you please capture PVRTune of this application running on J6 (not of the PVRTracePlayback, we need to run the application natively).

Regards

Hemant

0 Hemant Hariyani over 3 years ago in reply to Hemant Hariyani

TI__Expert 8385 points

Hello,

Here is the feedback from Imagination:

There are two hard synchronisation points in each frame that cause a lack of task overlap. These are due to the two glFinish() calls in every frame. Without them the tasks would be packed much tighter, reducing frame times.
There is a glTexSubImage2D() call in most frames, which creates a series of GPU transfer tasks and causes one CPU core to hit 100% usage. There could be an application-side CPU bottleneck around this image upload.

Can you please get rid of the two glFinish() calls? Also, are the glTexSubImage2D() calls really needed every frame? It is best to avoid updating textures every frame.

Regards

Hemant
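One way to avoid re-uploading an unchanged image is to upload only when a new camera frame has actually arrived. The following is a minimal sketch under stated assumptions: each camera frame carries a monotonically increasing sequence number, and the `upload` callback wraps whatever call performs the upload (for example glTexSubImage2D()). `FrameUploader` and the sequence number are hypothetical and do not come from the application in this thread.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical wrapper that skips redundant texture uploads.
class FrameUploader {
public:
    explicit FrameUploader(std::function<void()> upload)
        : upload_(std::move(upload)) {}

    // Runs the upload only if `sequence` differs from the last uploaded
    // frame. Returns true when an upload was performed.
    bool update(std::uint64_t sequence) {
        if (has_frame_ && sequence == last_sequence_) {
            return false;  // same camera frame as last time: nothing to do
        }
        upload_();
        last_sequence_ = sequence;
        has_frame_ = true;
        return true;
    }

private:
    std::function<void()> upload_;
    std::uint64_t last_sequence_ = 0;
    bool has_frame_ = false;
};
```

With a 25 fps camera, a widget that repaints more often than that would then upload at most 25 times per second. Whether this applies here depends on where the upload is issued, which the thread does not establish.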

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi Hemant:

Thanks very much, we are checking now.

Our screen is 1920x720 and has two render areas:

summ:670x720
surround:1280x720

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi Hemant:

As you know, AVM is based on the Qt framework. We checked and found that we have two QOpenGLWidget windows, and each QOpenGLWidget calls glFinish().

glTexSubImage2D() is also called inside the Qt framework.

0 Hemant Hariyani over 3 years ago in reply to user5312037

TI__Expert 8385 points

Hello,

Understandable. Is it possible for you to check with qt for their recommendation? We will also look to see if we can find more.

Regards

Hemant

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi Hemant:

Hemant Hariyani said:
Also, can you please capture PVRTune of this application running on J6 (not of the PVRTracePlayback, we need to run the application natively).

How about we share our AVM software with you, so that you can debug it on a TI EVM board?

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi:

Please see the PVRTune capture in the attachment: AVM-PVRTUNE-ECAR.rar

0 Hemant Hariyani over 3 years ago in reply to user5312037

TI__Expert 8385 points

There are some inefficiencies that are leading to the performance drop. Some of these seem to be in QT framework:

The app is leaving around 12ms on the table due to the lack of task overlap resulting from the glFinish() calls. Without this it can reduce the average 44ms frame time to below 33ms, hitting the 30FPS target.

The tune indicates small clusters of frames separated by large 0.2 second gaps filled with 22 transfer kicks which is probably a result of the glTexSubImage() call every 2-3 frames. During this period a single CPU core always goes up to 100% usage which could be expected driver overhead since this tune recording is from a trace playback and so should not show any application CPU bottlenecks.

The shaders are mostly quite simple, but a few fragment shaders have branching on a varying input to select the appropriate texture to sample from, which could lead to dependent texture reads. It would be more ideal to instead split up the draws into multiple smaller draws for each material/texture instead of branching.

Blending is enabled for a majority of the frame, even for opaque draws which has an impact on performance.

Alpha blended draws with zero alpha are also submitted which have a significant Tiler workload. This can be minimised by culling any invisible draws. It may also be beneficial to draw opaque scene elements first, followed by alpha blended scene elements and finally alpha blended UI elements.

The application is generally renderer limited. We cannot pinpoint whether it is texture limited because of the lack of texturing-related counters, but looking at the shaders we can assume that it is. There is one 1920x720 texture (attached below) that is drawn on a fullscreen quad at the end of the frame, which is quite bandwidth heavy, so it might help to compress this and any other pre-authored textures with PVRTC. This is also the most expensive draw according to the trace, as the application is simply rendering a fullscreen quad, sampling a texture for each fragment and alpha blending it with the contents of the framebuffer. This is quite wasteful: as you can see from the attached image, it is mostly transparent, but these texels are still sampled. A better approach would be to split the image into smaller sub-images and draw them separately, with alpha blending enabled only for the parts where it makes sense. This, combined with texture compression, would greatly reduce the overhead of the final draw.
Note that this may come from Weston/Qt and may not be under application control.
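The draw-ordering and culling advice above can be sketched on the CPU side as follows. This is a minimal illustration that assumes each draw has a single constant alpha value and a flag marking UI elements; `Draw`, `plan_draws` and `needs_blending` are hypothetical names, not part of the application or of any Qt or GL API.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical description of one draw call.
struct Draw {
    std::string name;  // label for illustration
    float alpha;       // constant alpha: 0 = invisible, 1 = opaque
    bool is_ui;        // true for UI overlay elements
};

// Blending only needs to be enabled for draws that are not fully opaque.
inline bool needs_blending(const Draw& d) { return d.alpha < 1.0f; }

// Culls fully transparent draws, then orders the rest as:
// opaque scene, alpha-blended scene, UI elements.
std::vector<Draw> plan_draws(std::vector<Draw> draws) {
    draws.erase(std::remove_if(draws.begin(), draws.end(),
                               [](const Draw& d) { return d.alpha <= 0.0f; }),
                draws.end());
    auto rank = [](const Draw& d) {
        if (d.is_ui) return 2;
        if (d.alpha >= 1.0f) return 0;
        return 1;
    };
    // stable_sort keeps the submission order within each group.
    std::stable_sort(draws.begin(), draws.end(),
                     [&rank](const Draw& a, const Draw& b) {
                         return rank(a) < rank(b);
                     });
    return draws;
}
```

The renderer would then toggle blending per draw based on `needs_blending` instead of leaving it enabled for the whole frame.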

We need to work on getting eglfs working in this scenario and that should get us the performance boost. QT also seems to have bottlenecks that need to be addressed.

Regards

Hemant
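For the mostly transparent 1920x720 overlay texture described in the feedback above, one possible way to find the sub-images worth drawing is to scan the alpha channel in fixed-size tiles and keep only the tiles that contain visible texels. This is a sketch under assumptions: the alpha mask is available on the CPU as one byte per texel in row-major order, and the tile size is chosen by the caller. `Tile` and `visible_tiles` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Rectangle of the source image, in texels.
struct Tile { int x, y, w, h; };

// Returns the tiles of a row-major alpha mask that contain at least one
// texel with non-zero alpha. Fully transparent tiles are skipped.
std::vector<Tile> visible_tiles(const std::vector<std::uint8_t>& alpha,
                                int width, int height, int tile) {
    std::vector<Tile> out;
    for (int ty = 0; ty < height; ty += tile) {
        for (int tx = 0; tx < width; tx += tile) {
            const int w = std::min(tile, width - tx);   // clamp at right edge
            const int h = std::min(tile, height - ty);  // clamp at bottom edge
            bool visible = false;
            for (int y = ty; y < ty + h && !visible; ++y) {
                for (int x = tx; x < tx + w && !visible; ++x) {
                    visible = alpha[static_cast<std::size_t>(y) * width + x] != 0;
                }
            }
            if (visible) out.push_back({tx, ty, w, h});
        }
    }
    return out;
}
```

Each returned rectangle could then be drawn as its own small quad, so that fully transparent regions are never sampled or blended. This scan would normally run once, when the overlay image is authored or loaded.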

0 Hemant Hariyani over 3 years ago in reply to Hemant Hariyani

TI__Expert 8385 points

Posted response from IMG in the previous post.
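The frame-time estimate in the IMG feedback can be checked with simple arithmetic: removing roughly 12 ms from an average 44 ms frame leaves about 32 ms, which is just under the 33.3 ms budget for 30 fps. The constants below are the figures quoted in the feedback; the names are only for illustration.

```cpp
// Frame rate in frames per second for a given frame time in milliseconds.
constexpr double fps_from_frame_ms(double frame_ms) { return 1000.0 / frame_ms; }

constexpr double kMeasuredFrameMs = 44.0;  // average frame time from the feedback
constexpr double kLostToSyncMs = 12.0;     // time attributed to the glFinish() stalls
constexpr double kEstimatedFrameMs = kMeasuredFrameMs - kLostToSyncMs;  // 32 ms
```

Here `fps_from_frame_ms(kMeasuredFrameMs)` is roughly 22.7 fps and `fps_from_frame_ms(kEstimatedFrameMs)` is 31.25 fps, which matches the "below 33ms, hitting the 30FPS target" statement.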

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi:

Thanks for your update.

Hemant Hariyani said:
The tune indicates small clusters of frames separated by large 0.2 second gaps filled with 22 transfer kicks which is probably a result of the glTexSubImage() call every 2-3 frames. During this period a single CPU core always goes up to 100% usage which could be expected driver overhead since this tune recording is from a trace playback and so should not show any application CPU bottlenecks.

We have fixed it.

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hi Hemant:

We have optimised the glFinish() calls for the two OpenGL widgets as well as the glTexSubImage2D() calls, but there is no performance improvement.

Please help analyse the trace with IMG (it will be forwarded by our local FAE).

thanks very much!!

0 Yong Zhang over 3 years ago in reply to user5312037

TI__Genius 9401 points

Dear Hemant.

PVRTUNE log:

1300.avm_qt_app-pvrtune.rar

PVRTrace log: because the log is large and exceeds the E2E upload size limit, I compressed it again into four parts. You can extract all four files together.

avm_qt_app-pvr-1.zip.001

avm_qt_app-pvr-1.zip.002

avm_qt_app-pvr-1.zip.003

avm_qt_app-pvr-1.zip.004

You can also access TI Drive to download them directly.

https://tidrive.itg.ti.com/a/WCbtNPmm7TFKJu0_/698c24bc-bb2e-436e-9a57-446ee44ff6aa?l

https://tidrive.itg.ti.com/a/NcwOvpRMDronzA61/8aef069b-7151-4e71-be97-458ceea242c5?l

Thanks

yong

0 Hemant Hariyani over 3 years ago in reply to Yong Zhang

TI__Expert 8385 points

Hello,

Some of those improvements can be seen, but the app is still heavy on the render side. Are there any updates on the other suggestions that were made?

Also, weston does cause additional overhead. Did EGLFS work?

Regards

Hemant

0 user5312037 over 3 years ago in reply to Hemant Hariyani

Genius 4320 points

Hemant Hariyani said:
Did EGLFS work

It does not work for us. We have two apps, and avm_qt_app has to run in the background for fast reboot. eglfs cannot support multiple instances, right?

0 Hemant Hariyani over 3 years ago in reply to user5312037

TI__Expert 8385 points

Hello,

As discussed earlier, the recommendation is to optimize the application rendering before the framework. We hope you were able to do that. If you face any more issues with the SDK, please let us know in a new thread.

Regards

Hemant
