TDA4VM: GPU hang when surround view app is running

Part Number: TDA4VM

Hi TI experts,

We are seeing a GPU hang. While the surround view app is running, rendering occasionally stops. The issue occurs rarely. When it does, the following PVR-related dmesg output appears:

0326-2136.txt|281 col 16| [ 2.951460] pvrsrvkm: loading out-of-tree module taints kernel.
0326-2136.txt|309 col 16| [ 3.108464] PVR_K: 141: Read BVNC 22.104.208.318 from HW device registers
0326-2136.txt|310 col 16| [ 3.108474] PVR_K: 141: RGX Device registered with BVNC 22.104.208.318
0326-2136.txt|311 col 34| [ 3.108857] [drm] Initialized pvr 1.13.5776728 20170530 for 4e20000000.gpu on minor 0
0326-2136.txt|793 col 16| [ 6.956717] PVR_K: 907: RGX Firmware image 'rgx.fw.22.104.208.318' loaded
0326-2136.txt|796 col 16| [ 832.506974] PVR_K:(Error): 216: CheckForStalledCCB (force): CCCB has not progressed (ROFF=49256 DOFF=49256 WOFF=50272) for "3D-P778-T907-Rendering Wakee" [2298]
0326-2136.txt|797 col 16| [ 832.523496] PVR_K: 216: Possible stalled client RGX contexts detected: 3D
0326-2136.txt|798 col 16| [ 832.523504] PVR_K: 216: Trying to identify stalled context...(force) [0]
0326-2136.txt|799 col 16| [ 832.523512] PVR_K: 216: Fence found on context 0xc00c00e0 '3D-P778-T907-Rendering Wakee' @ 49256 has 1 UFOs
0326-2136.txt|800 col 16| [ 832.523516] PVR_K: 216: 1/1 FWAddr 0xc0120008 requires 0x3afd
0326-2136.txt|801 col 16| [ 842.746972] PVR_K:(Error): 216: CheckForStalledCCB (force): CCCB has not progressed (ROFF=49256 DOFF=49256 WOFF=50272) for "3D-P778-T907-Rendering Wakee" [2298]
0326-2136.txt|802 col 16| [ 852.986968] PVR_K:(Error): 216: CheckForStalledCCB (force): CCCB has not progressed (ROFF=49256 DOFF=49256 WOFF=50272) for "3D-P778-T907-Rendering Wakee" [2298]
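
A small sketch for summarizing these stall messages in a saved log: it pulls the ROFF/DOFF/WOFF values out of each CheckForStalledCCB line so that repeated identical values are easy to spot. The function name stalled_offsets is hypothetical, and the only assumption is the line format shown above.

```shell
#!/bin/bash
# Sketch: print "ROFF DOFF WOFF" for every CheckForStalledCCB line on stdin.
# stalled_offsets is a hypothetical helper name, not part of any PVR tool.
stalled_offsets() {
    # -n plus the trailing p prints only lines where the substitution matched.
    sed -n '/CheckForStalledCCB/s/.*ROFF=\([0-9]*\) DOFF=\([0-9]*\) WOFF=\([0-9]*\).*/\1 \2 \3/p'
}

# Example use on a live system or a saved log file:
#   dmesg | stalled_offsets | sort | uniq -c
#   stalled_offsets < 0326-2136.txt | sort | uniq -c
```

In the excerpt above, all three error lines report ROFF=49256, DOFF=49256, WOFF=50272, so the read offset did not advance between 832 s and 852 s.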

Our SDK version is 7.3. Please take a look and help us identify the cause.

Thanks a lot.

  • 1563.log.txt

    A complete log is attached.

  • Hello,

    First, SDK 7.3 has a known issue. Can you please apply the following patch to your kernel module, then rebuild and install it:

    diff --git a/services/server/devices/rogue/rgxccb.c b/services/server/devices/rogue/rgxccb.c
    index 3a8db74..369226e 100644
    --- a/services/server/devices/rogue/rgxccb.c
    +++ b/services/server/devices/rogue/rgxccb.c
    @@ -598,7 +598,9 @@ PVRSRV_ERROR RGXCreateCCB(PVRSRV_RGXDEV_INFO	*psDevInfo,
     								PVRSRV_MEMALLOCFLAG_DEVICE_FLAG(FIRMWARE_CACHED) |
     								PVRSRV_MEMALLOCFLAG_GPU_READABLE |
     								PVRSRV_MEMALLOCFLAG_GPU_WRITEABLE |
    -								PVRSRV_MEMALLOCFLAG_CPU_WRITE_COMBINE |
    +								((RGX_IS_FEATURE_VALUE_SUPPORTED(psDevInfo, NUM_OSIDS) &&    \
    +								(RGX_GET_FEATURE_VALUE(psDevInfo, NUM_OSIDS) == 8)) ? \
    +								PVRSRV_MEMALLOCFLAG_CPU_CACHE_INCOHERENT : PVRSRV_MEMALLOCFLAG_CPU_WRITE_COMBINE) | \
     								PVRSRV_MEMALLOCFLAG_ZERO_ON_ALLOC |
     								PVRSRV_MEMALLOCFLAG_KERNEL_CPU_MAPPABLE;
     
    diff --git a/services/server/devices/rogue/rgxfwutils.h b/services/server/devices/rogue/rgxfwutils.h
    index 7906647..4f7018a 100644
    --- a/services/server/devices/rogue/rgxfwutils.h
    +++ b/services/server/devices/rogue/rgxfwutils.h
    @@ -475,7 +475,9 @@ static INLINE IMG_UINT64 RGXReadHWTimerReg(PVRSRV_RGXDEV_INFO *psDevInfo)
                                           PVRSRV_MEMALLOCFLAG_GPU_CACHE_INCOHERENT | \
                                           PVRSRV_MEMALLOCFLAG_CPU_READABLE | \
                                           PVRSRV_MEMALLOCFLAG_CPU_WRITEABLE | \
    -                                      PVRSRV_MEMALLOCFLAG_CPU_WRITE_COMBINE | \
    +                                      ((RGX_IS_FEATURE_VALUE_SUPPORTED(psDevInfo, NUM_OSIDS) &&    \
    +                                      (RGX_GET_FEATURE_VALUE(psDevInfo, NUM_OSIDS) == 8)) ? \
    +                                      PVRSRV_MEMALLOCFLAG_CPU_CACHE_INCOHERENT : PVRSRV_MEMALLOCFLAG_CPU_WRITE_COMBINE) | \
                                           PVRSRV_MEMALLOCFLAG_KERNEL_CPU_MAPPABLE | \
                                           PVRSRV_MEMALLOCFLAG_ZERO_ON_ALLOC)
     
    

    Please try this and let me know if you continue seeing these issues.

    Thanks,

    Erick

  • Dear Erick,

    Thanks for your patch. After applying it and rebuilding, we ran continuous tests, and today we reproduced a similar issue. The application-level behavior is the same, but the dmesg output is different.

    The dmesg log is attached. Could you please check whether this is still a GPU issue, and how to fix it?

    Best regards

    Attachment: 新建文本文档 (2).txt

  • Hello,

    The console log is useful, but we will need more detailed logs. Could you please capture them with the following steps:

    1) Before running your application, please run the following command:

    pvrdebug -loggroups main,mts,hwr

    2) After experiencing the issue, please run the following:

    pvrlogdump
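
A minimal sketch of how the two steps above could be wrapped for a hard-to-reproduce failure, so that kernel and firmware logs are saved together. The function name capture_gpu_logs and the DMESG_CMD / PVRLOGDUMP_CMD overrides are hypothetical, and the sketch assumes pvrlogdump prints the firmware log to standard output; please check that on your build.

```shell
#!/bin/bash
# Sketch only. DMESG_CMD and PVRLOGDUMP_CMD are hypothetical overrides
# (useful for a dry test); by default the real tools are called.
# Assumption: pvrlogdump writes the firmware log to standard output.

# Step 1, before starting the application (command from the post above):
#   pvrdebug -loggroups main,mts,hwr

# Step 2, after the issue occurs: save kernel and firmware logs side by side.
capture_gpu_logs() {
    local outdir="$1"
    mkdir -p "$outdir" || return 1
    ${DMESG_CMD:-dmesg} > "$outdir/dmesg.txt"
    ${PVRLOGDUMP_CMD:-pvrlogdump} > "$outdir/pvrlogdump.txt"
}

# Example use:
#   capture_gpu_logs "/tmp/gpu-hang-$(date +%Y%m%d-%H%M%S)"
```

Keeping both files in one timestamped directory makes it easier to match the firmware log against the dmesg timestamps.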

    Please also let me know: when the GPU error messages appear on the console, does your application keep running normally, or does it freeze and never recover?

    Thanks,

    Erick

    Hello Erick,

    1. When the above issue happens, the application behaves abnormally: the rendering thread stops and cannot resume.

    2. Because the issue is hard to reproduce, and most occurrences come from production-line units running the formal release, we have not captured any pvrdebug information. If we add "pvrdebug -loggroups main,mts,hwr" to the formal release, will it affect FPS, system load, stability, or anything else?

    Thanks.

  • Hello,

    1. when the above issue happens, application is abnormal, rendering thread stops and can't resume.

    Thanks for confirming. Unfortunately this means the GPU was not able to recover successfully.

    2. due to the issue is hard to be replicated, and most of these issues are from custom production line with formal release, we didn't get pvrdebug info. if we add "pvrdebug -loggroups main,mts,hwr" into formal release, will there be any influence on the fps/system loading/stability, or any other aspects?

    Finding a way to reproduce the issue is the highest priority, so that we can gather logs to analyze it. Enabling pvrdebug -loggroups can slightly affect performance because the firmware will be saving logs, but the impact is minimal. Once the issue reappears, you will be able to gather the necessary logs with pvrlogdump.

    Regards,

    Erick

  • Hello Erick,

    We received a suggestion to change the ATYPE values at SBL startup to address the issue. However, we cannot update the SBL binary via OTA, so we would like to ask whether the same effect can be achieved from a user-space driver or from the Linux kernel.

    The suggested modification is:

    QOS_GPU0_M0_RD_ATYPE 0 -> 3
    QOS_GPU0_M0_WR_ATYPE 0 -> 3
    QOS_GPU0_M1_RD_ATYPE 0 -> 3
    QOS_GPU0_M1_WR_ATYPE 0 -> 3

    Best regards,

    xianchao

  • Hello,

    This modification is usually done in U-Boot. The patch is attached here:

    https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/1680.0001_2D00_HACK_2D00_j721s2_2D00_QoS_2D00_workaround_2D00_for_2D00_GPU_2D00_cache_2D00_incoherency.patch

    However, you are using SBL, so this patch is not directly applicable, although the same register operation would be needed. I recently tried doing this from the command line, but it does not seem to work: the GPU driver panics and is not usable. Please see the script below for your reference.

    https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/791/a_2D00_type_2D00_loop.sh

    I believe this configuration needs to be applied before the kernel loads, which would require you to update the SBL binary. Is there any possibility of doing this?

    Regards,

    Erick

  • Xianchao,

    One other thing to note: if you apply this A-type patch, it cannot be applied alongside the kernel module patch I posted earlier in this thread for the SDK 7.3 known issue (the change to rgxccb.c and rgxfwutils.h). You will need to revert that patch first and then apply the A-type patch.

    Regards,

    Erick

  • Hello Xianchao,

    I've managed to get this working. Please follow these steps:

    1) Blacklist the GPU driver so that you can load it manually, following the steps in the FAQ section "Disabling auto-loading of the GPU driver":

    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1218307/faq-how-can-i-analyze-a-gpu-graphics-process-unit-driver-or-hardware-issue

    2) Run the following script in your console. It applies the A-type change to the correct registers:

    #!/bin/bash
    
    devmem2 0x45dc5100 w 0x30000000
    devmem2 0x45dc5104 w 0x30000000
    devmem2 0x45dc5108 w 0x30000000
    devmem2 0x45dc510C w 0x30000000
    devmem2 0x45dc5110 w 0x30000000
    devmem2 0x45dc5114 w 0x30000000
    devmem2 0x45dc5118 w 0x30000000
    devmem2 0x45dc511C w 0x30000000

    3) Load the GPU driver by running: ./rc.pvr start

    4) Run your application
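
The eight writes in step 2 go to consecutive 32-bit registers from 0x45dc5100 to 0x45dc511C, so the same sequence can be generated in a loop. The sketch below prints the devmem2 commands by default and only runs them when APPLY=1 is set. APPLY and qos_write_cmds are hypothetical names. The value 0x30000000 equals 3 shifted left by 28 bits, which would match the suggested ATYPE value of 3 if that field sits in bits 29:28; that bit position is an assumption, not something stated in this thread.

```shell
#!/bin/bash
# Sketch: generate the same eight register writes as the script in step 2.
BASE=0x45dc5100     # first register written in step 2
COUNT=8             # eight consecutive 32-bit registers
VALUE=0x30000000    # same value as step 2 (3 << 28)

# qos_write_cmds (hypothetical name) prints one devmem2 command per register.
qos_write_cmds() {
    local i=0 addr
    while [ "$i" -lt "$COUNT" ]; do
        addr=$(printf '0x%x' $((BASE + 4 * i)))
        echo "devmem2 $addr w $VALUE"
        i=$((i + 1))
    done
}

# Dry run by default; APPLY=1 (hypothetical switch) executes the commands.
if [ "${APPLY:-0}" = "1" ]; then
    qos_write_cmds | sh
else
    qos_write_cmds
fi
```

As in the steps above, this would need to run while the GPU driver is still unloaded, before ./rc.pvr start.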

    Please let me know if you still see your GPU issue.

    Thanks,

    Erick

  • Hello Erick,

    We applied your method, and the app is working well so far. We will continue testing, and I will update you with further information.

    thanks and best regards,

    xianchao