TDA4VM: Coredump at glBindFramebuffer

li daming

Prodigy 200 points

Part Number: TDA4VM

Hi TI,

gpu version is 1.13.5776728

SDK 7.2

Linux

The glBindFramebuffer core dump problem has reappeared. The symptoms are the same as in the original thread; see the link below for the relevant information:

https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1161957/tda4vm-coredump-appears-when-glbindframebuffer-is-executed/4385393?tisearch=e2e-sitesearch&keymatch=%252525252520user%25252525253A486294#4385393

As agreed in our last exchange, we collected the PVR-related logs.

Run:

pvrdebug -loggroups main,mts,hwr

Then execute the app until it core dumps:

pvrlogdump_7002010852.gz

over 2 years ago

0 li daming over 2 years ago

Prodigy 200 points

One more issue: we also hit a kernel dump today, and it also points into the PVR code.

Please see the following information.

0 Xu(SH) Liu over 2 years ago in reply to li daming

TI__Expert 3875 points

Since this GPU issue shows up with two signatures, the suggestion is to increase vdd_core and vdd_cpu by 50 mV.

0 li daming over 2 years ago in reply to Xu(SH) Liu

Prodigy 200 points

Hello, we have two problems:

1. glBindFramebuffer core dump

2. Kernel dump

Which of these problems is the suggestion to increase vdd_core and vdd_cpu by 50 mV meant to address?

0 Xu(SH) Liu over 2 years ago in reply to li daming

TI__Expert 3875 points

For both. I suspect a GPU power issue rather than a software one. Let's try it out.

0 li daming over 2 years ago in reply to Xu(SH) Liu

Prodigy 200 points

Hello

This method doesn't work. We found that the core dump is most likely to occur when starting the app, so we ran a startup test.

Test method: start the app, wait 200 seconds, then restart.
Result: 7 core dumps in 20 hours.

0 Erick Narvaez over 2 years ago in reply to li daming

TI__Mastermind 36307 points

Hello,

I've looked through the previous thread, but I am not sure what the latest state of your SDK version + GPU driver version is and what patches you have currently applied. Can you please summarize this? Below are my current assumptions:

1) SDK version - 7.2

2) GPU Driver version - 1.13.5776728 from SDK 7.3

3) Patches on GPU Driver - None

I will take the current logs that you have provided and check with our team. In the meanwhile, can you please make sure that you have this patch applied to your GPU Kernel driver:

3000.pp132915-caching-fw-ro-structures-v2_1_13a.diff

diff --git a/services/server/devices/rogue/rgxccb.c b/services/server/devices/rogue/rgxccb.c
index 3a8db74..369226e 100644
--- a/services/server/devices/rogue/rgxccb.c
+++ b/services/server/devices/rogue/rgxccb.c
@@ -598,7 +598,9 @@ PVRSRV_ERROR RGXCreateCCB(PVRSRV_RGXDEV_INFO	*psDevInfo,
 								PVRSRV_MEMALLOCFLAG_DEVICE_FLAG(FIRMWARE_CACHED) |
 								PVRSRV_MEMALLOCFLAG_GPU_READABLE |
 								PVRSRV_MEMALLOCFLAG_GPU_WRITEABLE |
-								PVRSRV_MEMALLOCFLAG_CPU_WRITE_COMBINE |
+								((RGX_IS_FEATURE_VALUE_SUPPORTED(psDevInfo, NUM_OSIDS) &&    \
+								(RGX_GET_FEATURE_VALUE(psDevInfo, NUM_OSIDS) == 8)) ? \
+								PVRSRV_MEMALLOCFLAG_CPU_CACHE_INCOHERENT : PVRSRV_MEMALLOCFLAG_CPU_WRITE_COMBINE) | \
 								PVRSRV_MEMALLOCFLAG_ZERO_ON_ALLOC |
 								PVRSRV_MEMALLOCFLAG_KERNEL_CPU_MAPPABLE;
 
diff --git a/services/server/devices/rogue/rgxfwutils.h b/services/server/devices/rogue/rgxfwutils.h
index 7906647..4f7018a 100644
--- a/services/server/devices/rogue/rgxfwutils.h
+++ b/services/server/devices/rogue/rgxfwutils.h
@@ -475,7 +475,9 @@ static INLINE IMG_UINT64 RGXReadHWTimerReg(PVRSRV_RGXDEV_INFO *psDevInfo)
                                       PVRSRV_MEMALLOCFLAG_GPU_CACHE_INCOHERENT | \
                                       PVRSRV_MEMALLOCFLAG_CPU_READABLE | \
                                       PVRSRV_MEMALLOCFLAG_CPU_WRITEABLE | \
-                                      PVRSRV_MEMALLOCFLAG_CPU_WRITE_COMBINE | \
+                                      ((RGX_IS_FEATURE_VALUE_SUPPORTED(psDevInfo, NUM_OSIDS) &&    \
+                                      (RGX_GET_FEATURE_VALUE(psDevInfo, NUM_OSIDS) == 8)) ? \
+                                      PVRSRV_MEMALLOCFLAG_CPU_CACHE_INCOHERENT : PVRSRV_MEMALLOCFLAG_CPU_WRITE_COMBINE) | \
                                       PVRSRV_MEMALLOCFLAG_KERNEL_CPU_MAPPABLE | \
                                       PVRSRV_MEMALLOCFLAG_ZERO_ON_ALLOC)

Regards,

Erick

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Hello

I confirm that the patch "3000.pp132915-caching-fw-ro-structures-v2_1_13a"

is NOT applied on our TDA4VM SDK 7.2 (in use).

I'll test again after applying it.

1) SDK version - 7.2: yes

2) GPU Driver version - 1.13.5776728 from SDK 7.3: from this document

3) Patches on GPU Driver - None: yes, none applied

0 Xu(SH) Liu over 2 years ago in reply to li daming

TI__Expert 3875 points

Please post your test results after applying the GPU kernel driver patch. Thanks.

0 li daming over 2 years ago in reply to Xu(SH) Liu

Prodigy 200 points

Hello Erick, Xu

The patch "3000.pp132915-caching-fw-ro-structures-v2_1_13a" does not work.

The core dump still occurs about once every two hours.

0 Xu(SH) Liu over 2 years ago in reply to li daming

TI__Expert 3875 points

Daming, thanks.

As I suggested earlier, please make sure to invoke "appEglWindowOpen" and its earlier dependencies before the native program under test. I'm asking this to rule out a potential dependency between the dma-fd surfaces and the kernel's page allocation.

Erick, please go ahead and provide more analysis from the PVR dump.

0 Erick Narvaez over 2 years ago in reply to Xu(SH) Liu

TI__Mastermind 36307 points

Thank you for trying this patch, will get back to you shortly with the results of the analysis.

Regards,

Erick

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Thank you
We will also check all version changes that may have introduced the problem.

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Hello

Can you provide the .so files for all GPU-related libraries with symbol tables, built with the following option:

-lasan

Here are some additional logs to help locate the problem:

1. stdout:

(1132) PVR:(Error): PVRSRVFenceWaitI: sync_wait failed on fence 370 (22 Invalid argument) [ :222 ]

(1132) PVR:(Error): RGXSubmitTA: Failed to submit kick to kernel (205) [ :1741 ]

(1132) PVR:(Error): RGXKickTA: RGXSubmitTA failed (0xcd) [ :1834 ]

(1132) PVR:(Fatal): DoKickTA: RGXKickTA() failed with error 205 [ :5472 ]

sig_int_handler caught sig 6

sig_int_handler exit

sig_int_handler releasing pavaro...

2. syslog:

476 Sep 12 20:16:34 buildroot kernel: [ 1.639298] virtio_rpmsg_bus virtio3: creating channel ti.ipc4.ping-pong addr 0xe

477: Sep 12 20:16:34 buildroot kernel: [ 1.681084] pvrsrvkm: loading out-of-tree module taints kernel.

478: Sep 12 20:16:34 buildroot kernel: [ 1.701871] PVR_K: 144: Read BVNC 22.104.208.318 from HW device registers

479: Sep 12 20:16:34 buildroot kernel: [ 1.701878] PVR_K: 144: RGX Device registered with BVNC 22.104.208.318

480: Sep 12 20:16:34 buildroot kernel: [ 1.702144] [drm] Initialized pvr 1.13.5776728 20170530 for 4e20000000.gpu on minor 0

481 Sep 12 20:16:34 buildroot kernel: [ 1.704619] remoteproc remoteproc6: powering up 64800000.dsp

615 Jan 2 00:00:02 buildroot kernel: [ 5.475700] random: 7 urandom warning(s) missed due to ratelimiting

616: Jan 2 00:00:02 buildroot kernel: [ 5.769046] PVR_K: 444: RGX Firmware image 'rgx.fw.22.104.208.318' loaded

0 Xu(SH) Liu over 2 years ago in reply to li daming

TI__Expert 3875 points

Erick, do you have any update on this thread please?

0 Xu(SH) Liu over 2 years ago in reply to Xu(SH) Liu

TI__Expert 3875 points

Erick will update here regarding the log capture.

Daming, do you have any update on stepping back through your code base? What is the result now?

0 li daming over 2 years ago in reply to Xu(SH) Liu

Prodigy 200 points

Hello

We have now narrowed the scope down to two commits. Since the issue is intermittent, could you come on site to review the code with us and align on the test method again?

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Hello Erick

Can you provide the .so files for all GPU-related libraries with debug information, built with the following option:

-lasan

We have been working together to identify the issue.

0 Xu(SH) Liu over 2 years ago in reply to li daming

TI__Expert 3875 points

Daming, please post your atype patch and the GPU thread priority changes here so that Erick can review them and suggest next steps.

Erick, Baidu reported to us that raising the CPU priority does not fix this issue, and that the atype change causes an initialization failure around shared memory. We need your advice on how to move on; please also provide IMG's analysis of the logs shared previously.

0 Erick Narvaez over 2 years ago in reply to Xu(SH) Liu

TI__Mastermind 36307 points

Hello,

So the a-type patch causes a failure? This is one of the downsides of the patch: it does not work with some shared-memory use cases.

IMG has reviewed the logs but does not see any failure in the driver. They are requesting that you collect a fresh set of logs with more options enabled to determine if there are any other issues in the firmware:

pvrdebug -loggroups main,mts,csw,bif,pm,rtd,spm,pow,hwr

PVR_SRVKM_PARAMS="EnablePageFaultDebug=1" /etc/init.d/rc.pvr start

li daming said:
Can you provide .SO for all gpu related debug information and add the following compilation options：

Unfortunately, this is not something we can share due to the licensing of the driver. Usually, we have two options:

1) Sharing the application, so that we can take care of collecting the logs and doing the analysis

2) Sharing logs, so that we can do the analysis

Regards,

Erick

0 Xu(SH) Liu over 2 years ago in reply to Erick Narvaez

TI__Expert 3875 points

Daming, please update the ticket with more test results and logs; otherwise we will close this ticket.

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Hello Erick

pvrdebug -loggroups main,mts,csw,bif,pm,rtd,spm,pow,hwr OK

PVR_SRVKM_PARAMS="EnablePageFaultDebug=1" /etc/init.d/rc.pvr start ??

Should I execute this command before running the app, or edit the file?

Do I still run pvrlogdump after the core dump?

0 Erick Narvaez over 2 years ago in reply to li daming

TI__Mastermind 36307 points

Hello Li,

li daming said:
PVR_SRVKM_PARAMS="EnablePageFaultDebug=1" /etc/init.d/rc.pvr start ??

We need to start the GPU driver with this command. In order to do this, we need to blacklist the GPU driver so that it does not auto-load. Can you please follow the instructions on this FAQ on how to blacklist the GPU driver from auto-loading:

https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1218307/faq-how-can-i-analyze-a-gpu-graphics-process-unit-driver-or-hardware-issue

Look for "Disabling auto-loading of the GPU driver".

This way, you can manually load the GPU driver with the debug arguments PVR_SRVKM_PARAMS="EnablePageFaultDebug=1".

This is executed before you run the app, because otherwise the GPU driver would not be loaded and you couldn't run your app.

Then you run your app.

Then you collect the logs using pvrlogdump as before.

Thank you,

Erick

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

OK, I'll give it a try.

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Hello Erick

I have added the run-time parameters to rc.pvr, as shown in the following figure.

Because our underlying software automatically loads and starts the driver, I will manually execute "./rc.pvr stop" and "./rc.pvr start" after the system starts, and then run the app.

Is this method effective?

0 Xu(SH) Liu over 2 years ago in reply to li daming

TI__Expert 3875 points

Erick, this issue is occurring frequently in the recent fan-out programs, so we need your support on priority now. Thanks.

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Hello Erick

We used

PVR_SRVKM_PARAMS="EnablePageFaultDebug=1" and

pvrdebug -loggroups main,mts,csw,bif,pm,rtd,spm,pow,hwr

to collect a new log.

Also attached: stdout and syslog.

Please forward these to IMG for analysis as soon as possible. Thank you.

pvrlogdump_7002010802.txt.gz

other.zip

0 Erick Narvaez over 2 years ago in reply to li daming

TI__Mastermind 36307 points

Hello,

li daming said:
because our underlying software will automatically load driver and start, and I will manually execute "./rc.pvr stop" and "./rc.pvr start" after the system starts, then run app

This method is not effective at this point; we need to ensure that the system does not load the GPU driver automatically. Is there any way to stop your system from loading the driver automatically? The blacklist steps I shared take care of default system initialization, but if you have something else in your system that does this it will be a problem.

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Hello Erick

1. We will disable the automatic driver loading in our underlying software and run the test again.

2. Because the time to reproduce the problem is unpredictable, PLEASE ANALYZE THE LOGS that we provided yesterday, collected using the "./rc.pvr stop, ./rc.pvr start" method.

After calling rc.pvr stop, we observed that the driver had been unloaded.

3. On the problematic version, we will test using the default loading method (./rc.pvr) instead of loading from the system.

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Hello Erick

"but if you have something else in your system that does this it will be a problem."

Even with our software's automatic driver loading disabled, the issue still occurs.

The new log, collected following the "blacklist" steps, is below:

505_2.zip

0 Erick Narvaez over 2 years ago in reply to li daming

TI__Mastermind 36307 points

Hello,

Thanks for collecting these. I've delivered both sets of logs for analysis.

Question, do you run OpenVX framework? Or are you rendering to a display that is controlled by Linux?

Thanks,

Erick

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Hello Erick

do you run OpenVX framework?

Yes, other functions in our app call tiovx.

are you rendering to a display that is controlled by Linux?

No, the rendered output goes to shared memory, and this shared memory is bound to an EGLImage.

We also observed that when the glBindFramebuffer problem occurs, the bound FBO is the shared-memory one, not a display.

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Hello Erick

In our tests over the past two days, we found that the problems are related to calling "tda4_performance" while our app is running. Currently it is called once every 5 seconds.

Is calling "tda4_performance" unsafe?

--------------------------------------------------------------------------------------------------------------------------------------------------

Hello Erick

We found that the problem of glFinish() blocking for more than 5 seconds (up to a maximum of 500 seconds) appears almost together with the current issue in the latest version. The original discussion link for this issue is as follows:

https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1163802/tda4vm-calling-glfinish-blocks-for-more-than-a-minute/4387011?tisearch=e2e-sitesearch&keymatch=%252525252520user%25252525253A486294#4387011

We collected pvrlogdump, stdout, syslog, and "PVRTune data". The glFinish blocking most commonly appears in the last 10 seconds of the "PVRTune data".

8664.zip

We suspect that the blocking of the glFinish for a few seconds is closely related to the current discussion of the coredump situation, as they almost appear together in the same version. Please also check this log.

This issue has a very serious impact. Do I need to open a new ticket?

0 Erick Narvaez over 2 years ago in reply to li daming

TI__Mastermind 36307 points

Hello Li,

After the analysis of the logs, I have the following comments:

So, the second close() is actually closing a handle that GLES created after the first close. It is the application that could be corrupting the userspace GLES.

Could you review the application please.

Are you aware of this in your application? Please let me know.

Thanks,

Erick

0 Erick Narvaez over 2 years ago in reply to li daming

TI__Mastermind 36307 points

li daming said:
We suspect that the blocking of the glFinish for a few seconds is closely related to the current discussion of the coredump situation, as they almost appear together in the same version. Please also check this log.

Perhaps these analysis findings can help with this issue as well. The "close()" function seems to be doing something strange in your application and it needs to be reviewed.

It could be unrelated as well, but we won't know until the concern from the analysis is addressed.

Thanks,

Erick

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Hello Erick

Does "second close" refer to the fd operations performed inside the FBO bind, or to the GLES FBO? The fds are not visible to the upper layer.

1. We conducted the following tests first and found that no core dump occurs.

2. From the console output and the result above, a "Bad file descriptor" error has occurred in PVR. We suspect the handle may have been corrupted.

3. We have found that a module in our app used for resource monitoring (CPU, I/O, etc.) may be related to this issue. It is currently being analyzed and tested.

If it is an fd issue, how can we effectively monitor fds?
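One possible way to monitor descriptors from inside the process is to scan /proc/self/fd periodically and log what each entry points to. The sketch below is only an illustration under that assumption (Linux with procfs mounted); it is not something provided by the PVR driver or the SDK, and the helper name `dump_open_fds` is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: list every descriptor currently open in this process.
 * Writes "fd N -> target" lines to out (if out is not NULL) and returns the
 * number of descriptors found, or -1 if /proc/self/fd cannot be opened. */
int dump_open_fds(FILE *out)
{
    DIR *dir = opendir("/proc/self/fd");
    if (dir == NULL)
        return -1;

    int scan_fd = dirfd(dir);   /* descriptor used by this scan itself */
    int count = 0;
    struct dirent *entry;

    while ((entry = readdir(dir)) != NULL) {
        if (entry->d_name[0] == '.')
            continue;           /* skip "." and ".." */

        int fd = atoi(entry->d_name);
        if (fd == scan_fd)
            continue;           /* do not report the scan's own descriptor */

        char link_path[64];
        char target[PATH_MAX];
        snprintf(link_path, sizeof link_path, "/proc/self/fd/%d", fd);

        /* Each entry is a symlink to the file, socket, pipe or device. */
        ssize_t len = readlink(link_path, target, sizeof target - 1);
        if (len < 0)
            continue;           /* descriptor vanished while scanning */
        target[len] = '\0';

        if (out != NULL)
            fprintf(out, "fd %d -> %s\n", fd, target);
        count++;
    }

    closedir(dir);
    return count;
}
```

Calling `dump_open_fds(stderr)` before and after each suspect operation (for example each monitoring cycle) and comparing the output shows which descriptors appear or disappear unexpectedly. Running the app under `strace -f -e trace=close` is another option, although it changes timing and may hide an intermittent issue.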

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Hello Erick

Could you review the application please.

Are you aware of this in your application? Please let me know.

Based on our current testing, the "resource monitoring module" in the app appears to have a decisive impact. Its sub-functions, which seem highly relevant, are as follows:

1. CPU monitoring

Reads /proc/stat through the proc file system

2. I/O monitoring

Netlink implementation, once a second

3. Memory monitoring

Obtained through the interface provided by jemalloc, once a second

4. DDR bandwidth monitoring

popen call to tda4_performance, every 5 seconds

This resource monitoring module runs inside the app and currently affects only the GPU. Could you please check whether this information is relevant?

0 Erick Narvaez over 2 years ago in reply to li daming

TI__Mastermind 36307 points

Li,

From your logs, it looks like a GL error is reported with code 0x506 (GL_INVALID_FRAMEBUFFER_OPERATION):

There is a framebuffer that is not complete, and this is what the initial analysis of the logs indicates as well. You can try checking the following function:

https://registry.khronos.org/OpenGL-Refpages/es3.0/html/glCheckFramebufferStatus.xhtml

This can give more insight into the framebuffer attachments. We can then see when GL_FRAMEBUFFER_COMPLETE is being returned, and when it suddenly changes to something else.
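As a rough illustration of that check, the sketch below turns a framebuffer status value into a readable name and flags the moment it changes. It is a minimal sketch, not code from the driver or the SDK: the helper names `fbo_status_name` and `fbo_status_changed` are hypothetical, and the status constants are repeated from the OpenGL ES 3.0 headers only so that the block builds without them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Values as defined in the OpenGL ES 3.0 headers; repeated here only so this
 * sketch compiles standalone. In the application, include <GLES3/gl3.h>. */
#ifndef GL_FRAMEBUFFER_COMPLETE
#define GL_FRAMEBUFFER_UNDEFINED                     0x8219
#define GL_FRAMEBUFFER_COMPLETE                      0x8CD5
#define GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT         0x8CD6
#define GL_FRAMEBUFFER_INCOMPLETE_MISSING_ATTACHMENT 0x8CD7
#define GL_FRAMEBUFFER_UNSUPPORTED                   0x8CDD
#define GL_FRAMEBUFFER_INCOMPLETE_MULTISAMPLE        0x8D56
#endif

/* Hypothetical helper: map a glCheckFramebufferStatus() result to its name. */
const char *fbo_status_name(unsigned int status)
{
    switch (status) {
    case GL_FRAMEBUFFER_COMPLETE:
        return "GL_FRAMEBUFFER_COMPLETE";
    case GL_FRAMEBUFFER_UNDEFINED:
        return "GL_FRAMEBUFFER_UNDEFINED";
    case GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT:
        return "GL_FRAMEBUFFER_INCOMPLETE_ATTACHMENT";
    case GL_FRAMEBUFFER_INCOMPLETE_MISSING_ATTACHMENT:
        return "GL_FRAMEBUFFER_INCOMPLETE_MISSING_ATTACHMENT";
    case GL_FRAMEBUFFER_UNSUPPORTED:
        return "GL_FRAMEBUFFER_UNSUPPORTED";
    case GL_FRAMEBUFFER_INCOMPLETE_MULTISAMPLE:
        return "GL_FRAMEBUFFER_INCOMPLETE_MULTISAMPLE";
    default:
        return "unknown framebuffer status";
    }
}

/* Hypothetical helper: returns 1 and records the new value when the status
 * differs from the one stored in *last_status, otherwise returns 0. */
int fbo_status_changed(unsigned int *last_status, unsigned int current)
{
    assert(last_status != NULL);
    if (*last_status == current)
        return 0;
    *last_status = current;
    return 1;
}
```

In the render loop, the value passed in would come from `glCheckFramebufferStatus(GL_FRAMEBUFFER)` called right after `glBindFramebuffer`, with a log line containing `fbo_status_name(status)` and a timestamp whenever `fbo_status_changed` returns 1.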

Regards,

Erick

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Hello Erick

An FBO completeness check has been added to the code; this function needs to be called after glBindFramebuffer. We haven't found any issues yet, but we will keep watching this.

Are any of the following functions suspect with regard to the core dump?

1. CPU monitoring

Reads /proc/stat through the proc file system

2. I/O monitoring

Netlink implementation, once a second

3. Memory monitoring

Obtained through the interface provided by jemalloc, once a second

4. DDR bandwidth monitoring

popen call to tda4_performance, every 5 seconds

0 Xu(SH) Liu over 2 years ago in reply to li daming

TI__Expert 3875 points

Erick, according to our communication with the customer, the application was running significant system-level performance monitoring, using both the tda4 performance profiler and reads of Linux fs entries. After disabling those monitors, the core dump in glBindFramebuffer no longer appears.

The missing piece is how system-level activity can cause a core dump at the same GL API every time. Any clue from the source code implementation perspective?

0 Erick Narvaez over 2 years ago in reply to Xu(SH) Liu

TI__Mastermind 36307 points

Hello,

Can you please share what the tda4_performance program is? Is it part of our SDK offering?

I'll check these points.

Thanks,

Erick

0 li daming over 2 years ago in reply to Erick Narvaez

Prodigy 200 points

Hello Erick

Is it part of our SDK offering

Yes

0 Xu(SH) Liu over 2 years ago in reply to li daming

TI__Expert 3875 points

Daming, please upload the source code of the tda4-performance part here if you need Erick to review it and look for clues. Thanks.

0 Erick Narvaez over 2 years ago in reply to Xu(SH) Liu

TI__Mastermind 36307 points

Daming,

I spoke with the team, and I don't see how the tool could impact the performance of the test. Would it be possible to only remove the GPU loading from the tda4-performance tool and see if this helps with the symptoms?

I believe that part of the code will read the GPU loading from the debug filesystem, which will try to read this file:

/sys/kernel/debug/pvr/status

Please disable reading this file to see if anything changes.

Thanks,

Erick

0 li daming over 2 years ago in reply to Xu(SH) Liu

Prodigy 200 points

Hello Xu

We used the binary from the SDK directly, without recompiling it.

0 Xu(SH) Liu over 2 years ago in reply to li daming

TI__Expert 3875 points

We suspect a caching issue between the A72 and the GPU. We need to revisit the atype patch, make sure it does not conflict with the customer's DDK revision build, and watch the test results.

0 Xu(SH) Liu over 2 years ago in reply to Xu(SH) Liu

TI__Expert 3875 points

After a week of stress testing, the customer confirmed that the issue disappeared once the double close in the source code of the heavy applications was fixed (closing the folder first, then closing the file within it...).

It sounds like abnormal operations at the application level can corrupt the kernel space within the scope of the DDK for some reason.

Thanks to everyone for debugging across teams; we appreciate Erick's support on the IMG trace analysis and communication.

+1 li daming over 2 years ago in reply to Xu(SH) Liu

Prodigy 200 points

Hello Xu ，Erick

We found that a thread in the application layer performs a double close on an fd, which can cause glFinish to not return and occasionally causes the glBindFramebuffer core dump.

The double-close issue can now be reproduced directly with the rendering test program.

The problem has been fixed, and we have verified it with long-term observation.
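The failure mode can be shown in isolation. The sketch below is only an illustration, assuming a Linux/POSIX system; it is not the application's code or anything from the PVR driver. The second `open()` stands in for a descriptor that the GLES driver might create between the two closes, and both function names (`stale_close_hits_new_fd`, `close_once`) are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical guard: close the descriptor stored in *fd_slot at most once.
 * The slot is set to -1 afterwards, so a repeated call becomes a no-op. */
int close_once(int *fd_slot)
{
    assert(fd_slot != NULL);
    if (*fd_slot < 0)
        return 0;               /* already closed through this slot */
    int rc = close(*fd_slot);
    *fd_slot = -1;              /* never reuse the stale number */
    return rc;
}

/* Demonstrates the hazard. Returns 1 if the stale second close() closed a
 * descriptor opened later by other code, 0 if it did not, -1 on setup error. */
int stale_close_hits_new_fd(void)
{
    int app_fd = open("/dev/null", O_RDONLY);
    if (app_fd < 0)
        return -1;
    int stale = app_fd;         /* copy kept by the buggy code path */
    close(app_fd);              /* first close: legitimate */

    /* Stand-in for another component opening a descriptor in between.
     * The kernel hands out the lowest free number, often the same one. */
    int other_fd = open("/dev/null", O_RDONLY);
    if (other_fd < 0)
        return -1;

    close(stale);               /* second close: the bug */

    /* If the number was reused, other_fd is now invalid (EBADF). */
    int hit = (other_fd == stale) &&
              (fcntl(other_fd, F_GETFD) == -1) && (errno == EBADF);
    if (!hit)
        close(other_fd);        /* still open, so release it normally */
    return hit;
}
```

In a multi-threaded application the reuse depends on timing, which would be consistent with the intermittent behavior and the "Bad file descriptor" errors reported earlier in this thread. Routing every close through a wrapper like `close_once` is one common way to make a repeated close harmless.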

Thank you, Erick and Liu Xu, for your help.
