PROCESSOR-SDK-J721S2: Runtime error occurs from OpenACC based TIDL

Part Number: PROCESSOR-SDK-J721S2

Tool/software:

This is Seunghun at Stradvision.
I am getting runtime errors from an executable built with the BUILD_WITH_OPENACC option.

Environment Details
  1. HPC SDK 23.7
  2. PSDK 9.2 for J721S2
  3. Docker image: nvidia/cuda:11.8.0-devel-ubuntu22.04 provided by NVIDIA
  4. NVIDIA-related environment (tested on two different PCs):
    • (4-1) RTX 4080
      • NVIDIA Graphics Driver: 535.183
      • CUDA Driver: Pre-installed in HPC SDK (12.2)
    • (4-2) RTX 3070, TITAN X (two GPUs in the same machine)
      • NVIDIA Graphics Driver: 530.41
      • CUDA Driver: Pre-installed in HPC SDK (12.2)
  5. Modified build settings: In our concerto build configuration for the executable that uses the TIDL library, we added the following link information:

LDIRS += /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/lib
LDIRS += /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/lib64
SHARED_LIBS += acccuda acchost accdevice accdevaux cudart

Runtime Errors Observed
- RTX 4080
Accelerator Fatal Error: No CUDA device code available
 File: /home/seunghun/strad/svnet3/src_tda4x/platforms/92_j721s2/c7x-mma-tidl/ti_dl/algo/src/tidl_conv2d_base.c
 Function: _Z24TIDL_refConv2dKernelFastILi3EaaiiEvPT0_PT1_PT2_PT3_S7_S7_iiiiiiiiiiiiiiiiiiiiiiiiii:463
 Line: 473
- RTX 3070
(Changed nvc++ build option from -gpu=ccall to -gpu=cc86)
Accelerator Fatal Error: No CUDA device code available
 File: /home/seunghun/strad/svnet3/src_tda4x/platforms/92_j721s2/c7x-mma-tidl/ti_dl/algo/src/tidl_conv2d_base.c
 Function: _Z24TIDL_refConv2dKernelFastILi3EaaiiEvPT0_PT1_PT2_PT3_S7_S7_iiiiiiiiiiiiiiiiiiiiiiiiii:463
 Line: 473
 
- TITAN X (same executable as for the RTX 3070)
(Changed nvc++ build option from -gpu=ccall to -gpu=cc86)
Accelerator Fatal Error: This file was compiled: -acc=gpu -gpu=cc80 -gpu=cc86 -acc=host or -acc=multicore
Rebuild this file with -gpu=cc61 to use NVIDIA Tesla GPU 0
 File: /home/seunghun/strad/svnet3/src_tda4x/platforms/92_j721s2/c7x-mma-tidl/ti_dl/algo/src/tidl_conv2d_base.c
 Function: _Z24TIDL_refConv2dKernelFastILi3EaaiiEvPT0_PT1_PT2_PT3_S7_S7_iiiiiiiiiiiiiiiiiiiiiiiiii:463
 Line: 473
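
For reference, the TITAN X message above asks for a cc61 target in addition to the cc86 one used for the RTX 3070. The following is only a minimal sketch of composing a single comma-separated -gpu= option that lists several compute capabilities. The helper name gpuTargetFlag is hypothetical. The value 61 comes from the error message and 86 from the build option quoted above. The value 89 for the RTX 4080 is an assumption taken from NVIDIA's published compute capability list, not from this thread. Whether HPC SDK 23.7 accepts every listed target should be checked against its documentation, and this does not by itself explain the "No CUDA device code available" error.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: build an nvc++ "-gpu=" option with one ccXY entry per
// compute capability. Inputs are major*10 + minor (e.g. 86 for capability 8.6).
std::string gpuTargetFlag(std::vector<int> caps) {
    // Sort and drop duplicates so each target appears once, in ascending order.
    std::sort(caps.begin(), caps.end());
    caps.erase(std::unique(caps.begin(), caps.end()), caps.end());
    if (caps.empty()) {
        return "";  // nothing requested, so emit no option at all
    }
    std::string flag = "-gpu=";
    for (std::size_t i = 0; i < caps.size(); ++i) {
        if (i != 0) {
            flag += ",";
        }
        flag += "cc" + std::to_string(caps[i]);
    }
    return flag;
}
```

Calling gpuTargetFlag({86, 61, 89}) yields the string -gpu=cc61,cc86,cc89, which could replace the single -gpu=cc86 setting when one executable has to run on all three GPUs.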
             
Additional Information
The ldd output and build log below indicate that we are using the CUDA 12.2 version that comes pre-installed with HPC SDK.
When running ldd on the executable, it shows OpenACC and CUDA-related libraries linked as follows:
libacccuda.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libacccuda.so (0x00007fe543400000)
libacchost.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libacchost.so (0x00007fe543000000)
libaccdevice.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libaccdevice.so (0x00007fe542800000)
libaccdevaux.so => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/lib/libaccdevaux.so (0x00007fe542400000)
libcudart.so.12 => /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2/lib64/libcudart.so.12 (0x00007f19bac00000)
Build output (using nvc++ -v) confirms:
Export PGI_CURR_CUDA_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2
Export NVHPC_CURRENT_CUDA_HOME=/opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2
Export NVHPC_CURRENT_CUDA_VERSION=12.2.53
Export NVCOMPILER=/opt/nvidia/hpc_sdk/Linux_x86_64/23.7
Export PGI=/opt/nvidia/hpc_sdk
The runtime error occurs in the function TIDL_refConv2dKernelFast even though the build log shows “Generating NVIDIA GPU code” and the generation of the .ptx and fat binary files.
void TIDL_refConv2dKernelFast<1, unsigned short, signed char, int, int>(unsigned short*, signed char*, int*, int*, int*, int*, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int):
    473, Generating present(pCoeffs[:((numInChannels-1)*(coeffsWidth*coeffsHeight))+((coeffsWidth*(coeffsHeight*(numInChannels*(numOutChannels-1))))+(numOutChannels*(coeffsWidth*((numInChannels*(numGroups-1))*coeffsHeight))))+1],pInChannel[:((width%strideWidth)+(width-strideWidth))+((inImPitch*((height%strideHeight)+(height-strideHeight)))+((inChPitch*(numInChannels-1))+((inBatchPitch*(numBatches-1))+(inChPitch*(numInChannels*(numGroups-1))))))+1],pBias[:numOutChannels+((numGroups-1)*numOutChannels)],accPtr[:(((width%strideWidth)+(width-strideWidth))/strideWidth)+((((height%strideHeight)+(height-strideHeight))*outImPitch)+(((numOutChannels-1)*outChPitch)+(((numBatches-1)*outBatchPitch)+(((numGroups-1)*numOutChannels)*outChPitch))))+1])
         Generating implicit firstprivate(numGroups,strideHeight,topPad,width,pInChannel,numInChannels,numBatches,leftPad,inWidth,isOTFpad,inHeight,strideWidth,inImPitch,height,numOutChannels)
         Generating NVIDIA GPU code
        496, #pragma acc loop gang, vector(128) collapse(5) /* blockIdx.x threadIdx.x */
        498,   /* blockIdx.x threadIdx.x collapsed */
        500,   /* blockIdx.x threadIdx.x collapsed */
        502,   /* blockIdx.x threadIdx.x collapsed */
        504,   /* blockIdx.x threadIdx.x collapsed */
             Generating reduction(min:_min)
             Generating reduction(max:_max)
        519, #pragma acc loop seq
        524, #pragma acc loop seq
        527, #pragma acc loop seq
    504, Generating implicit firstprivate(enableBias,inBatchPitch,inChPitch,outBatchPitch,outImPitch,outChPitch)
    519, Generating implicit firstprivate(coeffsHeight,coeffsWidth)
    527, Generating implicit firstprivate(dilationHeight,startRowNumberInTensor,padVal,dilationWidth)
    
    
    
    ...................
    
 /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/bin/tools/nvdd -dcuda /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2 -usenvvm -nvvm70 -reloc /tmp/nvacceWgemkX4NEtn.gpu -computecap 86 -ptx /tmp/nvacceWgem5n6NRPb.ptx -o /tmp/nvaccKWgeSc99ei3K.bin -ftz -cuda12020
 /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/compilers/bin/tools/nvdd -dcuda /opt/nvidia/hpc_sdk/Linux_x86_64/23.7/cuda/12.2 -reloc -cuda12020 -fat src/tidl_conv2d_base.c -sm 86 /tmp/nvaccKWgeSc99ei3K.bin -compute 86 /tmp/nvacceWgem5n6NRPb.ptx -o /tmp/nvacceWgemyLVfjWr.fat
NVC++/x86-64 Linux 23.7-0: compilation successful
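
One way to narrow this down might be a small standalone OpenACC check that is built and linked with the same nvc++ options and libraries but contains no TIDL code. The sketch below is an assumption about how such a check could look, not part of the TIDL sources. The function name accVectorAddMismatches is hypothetical. The openacc.h calls are guarded by _OPENACC so that the same file also builds with a compiler that ignores the pragmas.

```cpp
#include <cstdio>
#include <vector>
#ifdef _OPENACC
#include <openacc.h>  // only available/needed when OpenACC is enabled
#endif

// Hypothetical standalone check: add two integer arrays inside an OpenACC
// parallel loop and return how many elements differ from the expected sum.
int accVectorAddMismatches(int n) {
    if (n <= 0) {
        return 0;
    }
    std::vector<int> va(n), vb(n), vc(n, 0);
    for (int i = 0; i < n; ++i) {
        va[i] = i;
        vb[i] = 2 * i;
    }
    // Raw pointers so the data clauses can describe the array sections.
    int* a = va.data();
    int* b = vb.data();
    int* c = vc.data();

#ifdef _OPENACC
    // Report how many NVIDIA devices the OpenACC runtime can see.
    std::printf("NVIDIA devices visible to OpenACC: %d\n",
                acc_get_num_devices(acc_device_nvidia));
#endif

    // Offloaded when built with OpenACC; a plain host loop otherwise.
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (c[i] != 3 * i) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Calling this from a short main and building it with the same -acc=gpu -gpu=cc86 options would show whether the same "No CUDA device code available" error appears outside the TIDL build. If it does, the toolchain, driver, or container setup is the more likely place to look. If it does not, the concerto compile and link steps are.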
    
    

I have attached the full build log and would appreciate your help in diagnosing why the runtime error (“No CUDA device code available”) is triggered under these conditions.
Any help or suggestions you can provide would be greatly appreciated.
Thank you in advance for your support.
TIDL_build_log.txt