TDA4VM: After updating the GPU driver to version 1.15, 275 pressure tests were conducted, the frame rate was 0, and it was found to be stuck in the OpenGL node

Part Number: TDA4VM

Hi TI Experts,

After updating the GPU driver to version 1.15, 275 pressure tests were conducted, the frame rate was 0, and it was found to be stuck in the OpenGL node.

We have done some GPU debugging. Could you please use the attached gpu_error_log.zip to help analyze the possible cause? Is it stuck in a particular OpenGL rendering command?

As background for the testing: GPU driver version 1.13 does not block, while the GPU driver upgraded to version 1.15 blocks within the opengl_node.


Upgraded version reference link:

https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1316731/faq-tda4vl-q1-what-are-the-gpu-driver-bug-fixes-for-sdk-8-6-or-earlier

gpu_erro_log.zip

  • Hello Yang,

    It seems there is an error in the GPU driver installation. Can you please run this command and see if the GPU driver is responsive at all? If it runs correctly, please provide the logs:

    rgx_kicksync_test -ver -nc 16 -loop 100 -n 10000 -r -seed 81576

    This will check whether the GPU driver was applied correctly. If it was not, we will debug the installation of the driver.

    Regards,

    Erick

  • Hello, expert
    The attachment is the log output from running the above command. Is it helpful?
  • Hello,

    In rgx_kicksync_test_2.log, there is a concerning error:

    ----------------------- Loop 52 / 100 -----------------------
    Initialising contexts
    Submitting 10000 commands, each with 3 dev var updates
    PVRSRVFenceWait timed out. Start Thu Jun 23 10:41:27 2022
    , End Thu Jun 23 10:41:27 2022
    ( 975) PVR:(Error): HW operation timeout occurred. [ :37 ]
    ( 975) PVR:(Error): HW operation timeout occurred. [ :37 ]
    ( 975) PVR:(Error): HW operation timeout occurred. [ :37 ]
    ( 975) PVR:(Error): HW operation timeout occurred. [ :37 ]
    PVRSRVFenceWait retry timed out. FAIL - PVRSRV_ERROR_TIMEOUT(9)
    sutu_fail_if_error_quietI: unittests/services/rogue/common/srv_unittest_utils.c:171 ERROR EXIT

    But it seems that the test completed in rgx_kicksync_test_1.log. Is there any information

    ,frame rate of 0, and it was found that it was stuck in the opengl node.

    So in all 275 pressure tests, the frame rate was always 0?

    Regards,

    Erick
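
A minimal sketch for locating the failing loop in a log like the one quoted above. The function name summarize_kicksync_log is hypothetical, and the match patterns are copied from the quoted excerpt, on the assumption that other runs of the test word these messages the same way.

```shell
#!/bin/sh
# Hypothetical helper: summarize one rgx_kicksync_test log file.
# Prints the loop in which the first fence-wait timeout appears and
# the total number of "HW operation timeout" lines.
# Assumption: loop headers look like "--- Loop 52 / 100 ---".
summarize_kicksync_log() {
    awk '
        # Remember the most recent loop number seen in a loop header.
        /Loop [0-9]+ \/ [0-9]+/ {
            for (i = 1; i < NF; i++) if ($i == "Loop") loop = $(i + 1)
        }
        # Record only the first fence-wait timeout.
        /PVRSRVFenceWait timed out/ && first == "" { first = loop }
        # Count every hardware timeout line.
        /HW operation timeout occurred/ { hw++ }
        END {
            if (first == "")
                print "no fence timeout found"
            else
                printf "first fence timeout in loop %s, %d HW timeout lines\n", first, hw
        }
    ' "$1"
}
```

For example, `summarize_kicksync_log rgx_kicksync_test_2.log` would report the loop to look at first, which makes it easier to compare a failing log against a passing one such as rgx_kicksync_test_1.log.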

  • Hi, Erick

    So in all 275 pressure tests, the frame rate was always 0?

    No, the frame rate was not 0 in all 275 pressure tests. It was correct in the first 274 tests and 0 in the 275th.

    How can we debug the issue in rgx_kicksync_test_2.log?

    Regards

  • No, the frame rate was not 0 in all 275 pressure tests. It was correct in the first 274 tests and 0 in the 275th.

    Thanks for the clarification, this changes our understanding of the issue.

    How can we debug the issue in rgx_kicksync_test_2.log?

    Firstly, I will need to try and reproduce on my end. Let me see if I can replicate your setup.

    Regards,

    Erick

  • Hi Erick

    I had a discussion with the customer today and have some updates.

    They updated the GPU driver to 1.15 following the FAQ you provided. There are two modifications in the KM: one is a cache fix and the other is a git revert. After the update they ran the pressure test, but it got stuck soon afterwards.

    What's interesting is that when they remove the two KM modifications, the pressure test passes so far, and it is still running. So are these two modifications required?

    I also advised the customer to enable PHR on 1.15; it seems it is not enabled so far. Can you give them a PHR-enabled UM?

    Regards

    Zekun

  • Zekun,

    They updated the GPU driver to 1.15 following the FAQ you provided. There are two modifications in the KM: one is a cache fix and the other is a git revert. After the update they ran the pressure test, but it got stuck soon afterwards.

    What's interesting is that when they remove the two KM modifications, the pressure test passes so far, and it is still running. So are these two modifications required?

    Yes, we are moving away from the QoS workaround to the proper fix. The proper fix involves the patch in the KM (and reverting an old hack that is still present). We need to check the installation steps the customer took again, because the proper fix should fix all issues related to cache coherency.

    Regards,

    Erick

  • Hi Erick

    My installation steps are as follows:

    1. KM installation steps:

    1) git clone the KM repository from https://git.ti.com/git/graphics/ti-img-rogue-driver.git

    2) check out the branch 1.15.6133109_unified_fw_pagesize

    3) apply the two patches from the FAQ you provided

    4) build the driver using the following command:

          make ARCH=arm64 CROSS_COMPILE=aarch64-none-linux-gnu- KERNELDIR=/home/heyb/ti-processor-sdk-linux-j7-evm-08_01_00_07/board-support/kernel-source RGX_BVNC="22.104.208.318" BUILD=release PVR_BUILD_DIR=j721e_linux WINDOW_SYSTEM=wayland

    5) copy pvrsrvkm.ko to /lib/modules/5.10.99-yocto-standard/extra on my system

    2. UM installation steps:

    1) download the UM tar package from the FAQ you provided

    2) unpack the UM package and copy all of the files located in latest-1.15-umlibs/j721e to the root directory of my system

    After installation, I ran rgx_compute_test, and the log looks right.

    root@ti-j72xx:~# rgx_compute_test
    ------------------ RGX compute test -----------------
    ----------------------- Start -----------------------
    Call PVRSRVConnectionCreateDevice with a valid argument:
    Connecting to first (0) default pvr device
    OK
    Create dev var context:
    OK
    Looking up General heap handle
    OK
    Getting event object
    OK
    Creating robustness buffer
    OK
    Mapping robustness buffer
    OK
    Creating Compute Context
    OK
    Creating Buffer
    Creating DWord for CDM Event Object
    OK
    OK
    Create PDS Heap
    OK
    Create USC Heap
    OK
    Reset event object value
    Creating NOP instruction
    Creating Data Segment
    Creating Code Segment
    Write Kernel 0
    Creating Fence Data Segment
    Creating Code Segment
    Write Fence Kernel
    Write Terminate
    Call services to kick CDM
    OK
    Poll for CDM event object data
    Event object value: 0xa1b2c3d4
    OK
    OK
    Destroy Compute Context
    OK

    Total time: 0ms
    Disconnect from services:
    OK
    ------------------------ End ------------------------

    Additionally, when I remove the two KM patches, the pressure test passes. It has been running for 2 days and is still fine.

    When I run the pressure test with the two patches applied, it gets stuck soon. When I run rgx_kicksync_test at the same time, I get this failure log:

    rgx_kicksync_test_failed.log

    Finally, can you give me a UM with PHR enabled to test? The UM from the FAQ you provided does not seem to have PHR enabled.

    Regards
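
One thing that can be ruled out for the copy in step 5 is a stale or partially copied module. The following is a minimal sketch, assuming the built pvrsrvkm.ko and the installed one are both reachable from the same shell; the function name check_module_copy is hypothetical, and it only compares bytes, so it says nothing about whether the patches themselves were applied correctly.

```shell
#!/bin/sh
# Hypothetical helper: report whether the installed kernel module is
# byte-identical to the freshly built one.
#   $1 = path to the built pvrsrvkm.ko (location depends on the build)
#   $2 = path to the installed pvrsrvkm.ko
check_module_copy() {
    built=$1
    installed=$2
    # cmp -s is silent and returns 0 only when the files are identical.
    if cmp -s "$built" "$installed"; then
        echo "match"
    else
        echo "MISMATCH"
        return 1
    fi
}
```

With the install path from the steps above, a call would look like `check_module_copy <path_to_built>/pvrsrvkm.ko /lib/modules/5.10.99-yocto-standard/extra/pvrsrvkm.ko`. On the target, `uname -r` should also print 5.10.99-yocto-standard for that directory to be the one the running kernel loads modules from.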

  • Hello,

    The procedure looks fine, but I have some clarification questions:

    1)git clone km repository from git.ti.com/.../ti-img-rogue-driver.git

    2)checkout branch to 1.15.6133109_unified_fw_pagesize

    Can you also share if you removed the QoS workaround (or you never had it in place to begin with)?

    Finally, can you give me a UM with PHR enabled to test? The UM from the FAQ you provided does not seem to have PHR enabled.

    This could be explored, but you currently seem to be facing a different issue: the PHR-enabled configuration helps with some temporary tearing, not with getting stuck.

    Regards,

    Erick

  • Hi Erick

    Can you also share if you removed the QoS workaround (or you never had it in place to begin with)?

    I never used the QoS workaround.

    the PHR enabled configuration helps with some temporary tearing that happens

    Because I encountered a tearing issue with the 1.13 GPU drivers, I tested with the 1.13 UM with PHR enabled, and the tearing issue did not reproduce.

    So, after updating to the 1.15 GPU drivers, I also want to test with the 1.15 UM with PHR enabled. The prerequisite for this test is to remove the two KM patches (because it does not get stuck under these conditions).

    As for the stuck issue, we can continue to analyze it.

    Regards

  • Hi Erick

    Firstly, I will need to try and reproduce on my end. Let me see if I can replicate your setup.

    Did you try to reproduce this stuck issue on your end?

    Let me clarify the environment in which I reproduce the issue. I think this will help you reproduce it quickly.

    My SDK is 8.1 and the chip is TDA4VM.

    After updating to the latest 1.15 GPU drivers, run the following command to test:

    rgx_kicksync_test -ver -nc 16 -loop 100 -n 10000 -r -seed 81576

    I expect you will be able to reproduce the stuck issue.

    Regards
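
Since the failure only shows up after many passes, repeating the command above until it first fails can be scripted. This is a minimal sketch: the function name run_until_fail and the run_N.log file names are hypothetical, and it assumes that rgx_kicksync_test exits with a non-zero status when it fails, which the thread does not confirm.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to MAX times, keep each run's
# output in its own log file, and stop at the first non-zero exit status.
#   $1   = maximum number of runs
#   $2.. = command and its arguments
# Logs are written to $LOG_DIR (default: current directory).
run_until_fail() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@" > "${LOG_DIR:-.}/run_$i.log" 2>&1; then
            echo "run $i failed"
            return 1
        fi
        i=$((i + 1))
    done
    echo "all $max runs passed"
}
```

A call would look like `run_until_fail 300 rgx_kicksync_test -ver -nc 16 -loop 100 -n 10000 -r -seed 81576`. If the test hangs instead of exiting, prefixing the command with coreutils `timeout` and a duration (assuming `timeout` is available on the target) turns the hang into a non-zero exit, so the loop still stops at the failing run.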

  • Did you try to reproduce this stuck issue on your end?

    Let me clarify the environment in which I reproduce the issue. I think this will help you reproduce it quickly.

    Thanks for the details, I can give this a try with your SDK version 8.1 and update the driver to 1.15.

    Thanks,

    Erick

  • Hello,

    I have tried this with SDK 8.1 on my J721E EVM, and I see it running OK. The steps I followed were as follows:

    1) Updated the UM libraries with the ones provided in this FAQ: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1316731/faq-tda4vl-q1-what-are-the-gpu-driver-bug-fixes-for-sdk-8-6-or-earlier

    - They can be copied into the filesystem as follows: sudo cp -av j721e/* /media/<username>/rootfs/

    2) Updated the KM libraries, as follows:

    cd ti-processor-sdk-linux-j7-evm-08_01_00_07/
    cd board-support/extra-drivers/ti-img-rogue-driver-1.13.5776728/
    git fetch
    git stash
    git checkout linuxws/dunfell/k5.10/1.15.6133109_unified_fw_pagesize
    git revert c901804e8221d477983a6f7224a9cdc6e832f050
    git stash pop
    patch -p1 < ~/Documents/<path_to_KM_patch_file_from_faq>/CL6529585_Enable_cached_mappings_in_KM_on_ARM64_for_DDK_1.15_with_snooping_mode_update.patch
    cd ../../../
    make linux -j4
    make ti-img-rogue-driver -j4
    cd board-support/extra-drivers/ti-img-rogue-driver-1.13.5776728/binary_j721e_linux_wayland_release
    sudo ./install.sh --root /media/<username>/rootfs

    After that, my SD card was ready to boot, and it worked for me. Can you please give this a try? I've made updates to the patches and libraries provided in the FAQ, so make sure to download the new ones.

    Regards,

    Erick

  • Hi, Erick

    Thank you for your reply.

    I understand the steps you provided, but I have a question.

    According to the FAQ: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1316731/faq-tda4vl-q1-what-are-the-gpu-driver-bug-fixes-for-sdk-8-6-or-earlier

    There are two patches for the KM driver in the FAQ, but you only used one patch in your steps. Is the second patch not needed?

    Regards

  • Hello,

    Thanks for double-checking the steps. Yes, I had already reverted the commit during my testing. I've updated the steps in that post.

    Thanks,

    Erick