PROCESSOR-SDK-J721S2: Runtime error occurs from OpenACC based TIDL

Seunghun Hong

Part Number: PROCESSOR-SDK-J721S2

This is Seunghun at Stradvision.
I am getting runtime errors from the executable built with BUILD_WITH_OPENACC enabled.

Environment Details

HPC SDK 23.7

PSDK 9.2 for J721S2

Docker image: nvidia/cuda:11.8.0-devel-ubuntu22.04 provided by NVIDIA

NVIDIA-related environment (tested on two different PCs):

- RTX 4080

  - NVIDIA Graphics Driver: 535.183
  - CUDA Driver: Pre-installed in HPC SDK (12.2)

- RTX 3070, TITAN X (two GPUs in the same machine)

  - NVIDIA Graphics Driver: 530.41
  - CUDA Driver: Pre-installed in HPC SDK (12.2)

Modified build settings: In our concerto build configuration for the executable that uses the TIDL library, we added the following link information:

LDIRS += /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/lib
LDIRS += /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/lib64
SHARED_LIBS += acccuda acchost accdevice accdevaux cudart

Runtime Errors Observed

- RTX 4080

Accelerator Fatal Error: No CUDA device code available
 File: /home/seunghun/strad/svnet3/src_tda4x/platforms/92_j721s2/c7x-mma-tidl/ti_dl/algo/src/tidl_conv2d_base.c
 Function: _Z24TIDL_refConv2dKernelFastILi3EaaiiEvPT0_PT1_PT2_PT3_S7_S7_iiiiiiiiiiiiiiiiiiiiiiiiii:463
 Line: 473

- RTX 3070

(Changed nvc++ build option from -gpu=ccall to -gpu=cc86)

Accelerator Fatal Error: No CUDA device code available
 File: /home/seunghun/strad/svnet3/src_tda4x/platforms/92_j721s2/c7x-mma-tidl/ti_dl/algo/src/tidl_conv2d_base.c
 Function: _Z24TIDL_refConv2dKernelFastILi3EaaiiEvPT0_PT1_PT2_PT3_S7_S7_iiiiiiiiiiiiiiiiiiiiiiiiii:463
 Line: 473

- TITAN X (same executable as for the RTX 3070)

(Changed nvc++ build option from -gpu=ccall to -gpu=cc86)

Accelerator Fatal Error: This file was compiled: -acc=gpu -gpu=cc80 -gpu=cc86 -acc=host or -acc=multicore
Rebuild this file with -gpu=cc61 to use NVIDIA Tesla GPU 0
 File: /home/seunghun/strad/svnet3/src_tda4x/platforms/92_j721s2/c7x-mma-tidl/ti_dl/algo/src/tidl_conv2d_base.c
 Function: _Z24TIDL_refConv2dKernelFastILi3EaaiiEvPT0_PT1_PT2_PT3_S7_S7_iiiiiiiiiiiiiiiiiiiiiiiiii:463
 Line: 473
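
To make the TITAN X message above easier to compare against the build flags, here is a minimal Python sketch that pulls the compute capabilities out of the error text. The function name `parse_target_mismatch` is hypothetical, and the parsing assumes the message keeps the wording shown in this post; it is a reading aid, not part of the HPC SDK.

```python
import re


def parse_target_mismatch(message):
    """Return (built_targets, requested_target) from an OpenACC runtime error.

    built_targets: capabilities listed before "Rebuild this file with".
    requested_target: the capability the runtime asks for, or None.
    """
    compiled_part, _, rebuild_part = message.partition("Rebuild this file with")
    built = sorted(set(re.findall(r"-gpu=cc(\d+)", compiled_part)))
    match = re.search(r"-gpu=cc(\d+)", rebuild_part)
    requested = match.group(1) if match else None
    return built, requested


# Message text copied from the TITAN X run in this post.
message = (
    "Accelerator Fatal Error: This file was compiled: "
    "-acc=gpu -gpu=cc80 -gpu=cc86 -acc=host or -acc=multicore\n"
    "Rebuild this file with -gpu=cc61 to use NVIDIA Tesla GPU 0"
)

built, requested = parse_target_mismatch(message)
print("built for:", built)
print("device asks for:", requested)
print("target present in build:", requested in built)
```

For the TITAN X message this reports that the file carries cc80 and cc86 code while the device asks for cc61. The "No CUDA device code available" messages carry no target information, so the function returns an empty list and `None` for them. The message lists cc80 in addition to cc86, so it may be worth checking that every object was rebuilt with the same `-gpu` option.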

Additional Information

The following confirms that we are using the CUDA 12.2 version that comes pre-installed with the HPC SDK.

- When running ldd on the executable, it shows OpenACC and CUDA-related libraries linked as follows:

libacccuda.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libacccuda.so (0x00007fe543400000)
libacchost.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libacchost.so (0x00007fe543000000)
libaccdevice.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libaccdevice.so (0x00007fe542800000)
libaccdevaux.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libaccdevaux.so (0x00007fe542400000)
libcudart.so.12 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/lib64/libcudart.so.12 (0x00007f19bac00000)
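
The ldd check above can also be scripted. The following is a minimal Python sketch, assuming the usual `name => path (address)` layout of ldd output; the helper names `parse_ldd` and `libs_outside_root` are hypothetical. It flags any OpenACC or CUDA runtime library that resolves outside the HPC SDK 23.7 directory.

```python
HPC_SDK_ROOT = "/opt/nvidia/hpc_sdk/Linux_x86_64/23.7"

# ldd lines copied from this post.
ldd_output = """\
libacccuda.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libacccuda.so (0x00007fe543400000)
libacchost.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libacchost.so (0x00007fe543000000)
libaccdevice.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libaccdevice.so (0x00007fe542800000)
libaccdevaux.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libaccdevaux.so (0x00007fe542400000)
libcudart.so.12 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/lib64/libcudart.so.12 (0x00007f19bac00000)
"""


def parse_ldd(text):
    """Map each library name to the path that ldd resolved it to."""
    resolved = {}
    for line in text.splitlines():
        if "=>" not in line:
            continue
        name, _, target = line.partition("=>")
        # Drop the trailing load address, e.g. " (0x00007fe543400000)".
        path = target.strip().rsplit(" (", 1)[0]
        resolved[name.strip()] = path
    return resolved


def libs_outside_root(resolved, root, prefixes=("libacc", "libcudart")):
    """Return the OpenACC/CUDA libraries that do not resolve under root."""
    return {
        name: path
        for name, path in resolved.items()
        if name.startswith(prefixes) and not path.startswith(root)
    }


resolved = parse_ldd(ldd_output)
print(libs_outside_root(resolved, HPC_SDK_ROOT))
```

With the output shown in this post, the result is an empty dictionary: all five libraries come from the same HPC SDK installation. A library that ldd reports as `not found` would show up in the result as well.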

- Build output (using nvc++ -v) confirms:

Export PGI_CURR_CUDA_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2
Export NVHPC_CURRENT_CUDA_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2
Export NVHPC_CURRENT_CUDA_VERSION=12.2.53
Export NVCOMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/23.7
Export PGI=/opt/nvidia/hpc_sdk

- The runtime error occurs in the function TIDL_refConv2dKernelFast even though the build log shows “Generating NVIDIA GPU code” and the generation of .ptx and fat binary files.

void TIDL_refConv2dKernelFast<1, unsigned short, signed char, int, int>(unsigned short*, signed char*, int*, int*, int*, int*, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int):
    473, Generating present(pCoeffs[:((numInChannels-1)*(coeffsWidth*coeffsHeight))+((coeffsWidth*(coeffsHeight*(numInChannels*(numOutChannels-1))))+(numOutChannels*(coeffsWidth*((numInChannels*(numGroups-1))*coeffsHeight))))+1],pInChannel[:((width%strideWidth)+(width-strideWidth))+((inImPitch*((height%strideHeight)+(height-strideHeight)))+((inChPitch*(numInChannels-1))+((inBatchPitch*(numBatches-1))+(inChPitch*(numInChannels*(numGroups-1))))))+1],pBias[:numOutChannels+((numGroups-1)*numOutChannels)],accPtr[:(((width%strideWidth)+(width-strideWidth))/strideWidth)+((((height%strideHeight)+(height-strideHeight))*outImPitch)+(((numOutChannels-1)*outChPitch)+(((numBatches-1)*outBatchPitch)+(((numGroups-1)*numOutChannels)*outChPitch))))+1])
         Generating implicit firstprivate(numGroups,strideHeight,topPad,width,pInChannel,numInChannels,numBatches,leftPad,inWidth,isOTFpad,inHeight,strideWidth,inImPitch,height,numOutChannels)
         Generating NVIDIA GPU code
        496, #pragma acc loop gang, vector(128) collapse(5) /* blockIdx.x threadIdx.x */
        498,   /* blockIdx.x threadIdx.x collapsed */
        500,   /* blockIdx.x threadIdx.x collapsed */
        502,   /* blockIdx.x threadIdx.x collapsed */
        504,   /* blockIdx.x threadIdx.x collapsed */
             Generating reduction(min:_min)
             Generating reduction(max:_max)
        519, #pragma acc loop seq
        524, #pragma acc loop seq
        527, #pragma acc loop seq
    504, Generating implicit firstprivate(enableBias,inBatchPitch,inChPitch,outBatchPitch,outImPitch,outChPitch)
    519, Generating implicit firstprivate(coeffsHeight,coeffsWidth)
    527, Generating implicit firstprivate(dilationHeight,startRowNumberInTensor,padVal,dilationWidth)
    
    
    
    ...................
    
 /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/bin/tools/nvdd -dcuda /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2 -usenvvm -nvvm70 -reloc /tmp/nvacceWgemkX4NEtn.gpu -computecap 86 -ptx /tmp/nvacceWgem5n6NRPb.ptx -o /tmp/nvaccKWgeSc99ei3K.bin -ftz -cuda12020
 /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/bin/tools/nvdd -dcuda /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2 -reloc -cuda12020 -fat src/tidl_conv2d_base.c -sm 86 /tmp/nvaccKWgeSc99ei3K.bin -compute 86 /tmp/nvacceWgem5n6NRPb.ptx -o /tmp/nvacceWgemyLVfjWr.fat
NVC++/x86-64 Linux 23.7-0: compilation successful
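
To see which architectures were requested for a given file, the nvdd lines in the verbose build log can be scanned. Below is a minimal Python sketch; `nvdd_targets` is a hypothetical helper, and reading `-computecap`, `-sm`, and `-compute` as target architectures is an assumption based on the flag names, since nvdd is an internal compiler tool.

```python
import re

# nvdd invocations copied from the build log in this post.
build_log = """\
/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/bin/tools/nvdd -dcuda /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2 -usenvvm -nvvm70 -reloc /tmp/nvacceWgemkX4NEtn.gpu -computecap 86 -ptx /tmp/nvacceWgem5n6NRPb.ptx -o /tmp/nvaccKWgeSc99ei3K.bin -ftz -cuda12020
/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/bin/tools/nvdd -dcuda /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2 -reloc -cuda12020 -fat src/tidl_conv2d_base.c -sm 86 /tmp/nvaccKWgeSc99ei3K.bin -compute 86 /tmp/nvacceWgem5n6NRPb.ptx -o /tmp/nvacceWgemyLVfjWr.fat
"""


def nvdd_targets(log_text):
    """Collect the values passed to -computecap, -sm and -compute on nvdd lines."""
    targets = {"computecap": set(), "sm": set(), "compute": set()}
    for line in log_text.splitlines():
        if "/nvdd " not in line:
            continue
        for flag, value in re.findall(r"-(computecap|sm|compute) (\d+)", line):
            targets[flag].add(value)
    return targets


targets = nvdd_targets(build_log)
print(targets)
```

For the two lines above, every flag carries the value 86, which matches the -gpu=cc86 option used for the RTX 3070 build.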

I have attached the full build log and would appreciate your help in diagnosing why the runtime error (“No CUDA device code available”) is triggered under these conditions.
Any help or suggestions you can provide would be greatly appreciated.

Thank you in advance for your support.

TIDL_build_log.txt

5 months ago

Varun Tripathi 5 months ago

Hi Seunghun,
Can you confirm whether the off-the-shelf tools work for you? https://github.com/TexasInstruments/edgeai-tidl-tools/blob/master/docs/advanced_setup.md#docker-based-setup-for-x86_pc

Regards,
Varun

Seunghun Hong 4 months ago in reply to Varun Tripathi

Hi Varun,

Thank you for pointing me to the off-the-shelf TI tools, but my situation is different. I’ve already applied the OpenACC patch provided to the TIDL source, and the libraries build and link successfully (the build output even shows “Generating NVIDIA GPU code”), yet at runtime I still get:

Accelerator Fatal Error: No CUDA device code available  
File: …/src/tidl/algo/src/tidl_conv2d_base.c  
Function: _Z24TIDL_refConv2dKernelFastILi3EaaiiEvPT0_PT1_PT2_PT3_S7_S7_iiiiiiiiiiiiiiiiiiiiiiiiii  
Line: 473

This indicates that the patched CUDA kernels aren’t being loaded into the NVIDIA context. Because I’m already using the patched sources, the off-the-shelf Docker setup doesn’t resolve this. Could you advise what additional steps or configuration might be required to ensure the patched OpenACC kernels are correctly generated and loaded at runtime?

Thanks again for your help,

Seunghun Hong

John H. Gardner 4 months ago in reply to Seunghun Hong

Per Varun:

1. Try building without -cuda=12.2 as an argument and see if it resolves the issue.
2. Try running the off-the-shelf TI examples with the TI-provided GPU tools mentioned here: https://github.com/TexasInstruments/edgeai-tidl-tools

   - https://github.com/TexasInstruments/edgeai-tidl-tools/blob/master/docs/advanced_setup.md#docker-based-setup-for-x86_pc

3. Set up the environment without Docker and try the same (Step 1).
