TDA4VE-Q1: Unable to use more than 2 DRU channels for my C7X kernels when TIDL is running

Part Number: TDA4VE-Q1

Hi,

We have a lot of custom C7X kernels, and most of them use between 2 and 6 DRU channels for block processing. The issue is that when we try to run kernels that require more than 2 channels on C7X_1 after TIDL has been initialized, the DMA initialization fails. We have found that we cannot use more than 2 DRU channels while TIDL is running, apparently because TIDL uses 14 channels and does not share its handles.

I do want to note that those kernels work perfectly together when TIDL is off or when running on C7X_2.

Here's our flow:

  1. Initialize TIDL through ONNX Runtime
    class Tidl {
    public:
        bool initialize();
        bool runInference() const;
        
        /* ... */
        
    private:
    	bool initTidl(Ort::SessionOptions& sessionOptions);
        
        bool m_verbose;
    	std::string m_artifactsPath;
    	std::string m_weights;
    	Ort::SessionOptions m_sessionOptions;
    	Ort::Env m_env;
    	std::unique_ptr<Ort::Session> m_session;
    	std::unique_ptr<Ort::IoBinding> m_bindings;
        	
    	/* ... */
    };
    
    bool Tidl::initialize()
    {
    	m_env = Ort::Env(m_verbose ? ORT_LOGGING_LEVEL_VERBOSE : ORT_LOGGING_LEVEL_WARNING, "tidl");
    
    	m_sessionOptions.SetIntraOpNumThreads(1);
    
    	if (!initTidl(m_sessionOptions))
    	{
    		return false;
    	}
    
    	m_sessionOptions.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
    
    	// Initialize session
    	try
    	{
    		m_session = std::make_unique<Ort::Session>(m_env, m_weights.c_str(), m_sessionOptions);
    	}
    	catch (const Ort::Exception& e)
    	{
    		LOG_ERROR(e.what());
    		return false;
    	}
    
    	m_bindings = std::make_unique<Ort::IoBinding>(*m_session);
    
    	return true;
    }
    
    bool Tidl::initTidl(Ort::SessionOptions& sessionOptions)
    {
    	// Setup session using TIDL
    
    	c_api_tidl_options options;
    	OrtStatus* status = OrtSessionsOptionsSetDefault_Tidl(&options);
    	if (status != nullptr)
    	{
    		LOG_ERROR("Failed to set default TIDL options.");
    		return false;
    	}
    
    	fs::path weightsPath(m_weights);
    	if (!fs::exists(weightsPath))
    	{
    		LOG_ERROR("The following weight file does not exist : %s\n", m_weights.c_str());
    		return false;
    	}
    
    	m_artifactsPath = weightsPath.parent_path().string();
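    	// Use >= so that there is always room for the terminating null character.
    	if (m_artifactsPath.size() >= sizeof(options.artifacts_folder))
    	{
    		LOG_ERROR("Artifacts path too long.");
    		return false;
    	}
    
    	// strncpy() does not null-terminate when the source fills the count, so terminate explicitly.
    	std::strncpy(options.artifacts_folder, m_artifactsPath.c_str(), sizeof(options.artifacts_folder) - 1);
    	options.artifacts_folder[sizeof(options.artifacts_folder) - 1] = '\0';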
    
    	// Debug level undocumented. Assumed to be 0 to 3, where 0 is error only, 3 is the most verbose.
    	// Debug level 3 introduces very high latencies.
    	options.debug_level = m_verbose ? 2 : 0;
    
    	status = OrtSessionOptionsAppendExecutionProvider_Tidl(sessionOptions, &options);
    	if (status != nullptr)
    	{
    		LOG_ERROR("Failed to set TIDL execution provider.");
    		return false;
    	}
    
    	return true;
    }
    


  2. Run TIDL inference
    bool Tidl::runInference() const
    {
    	auto run_options = Ort::RunOptions();
    
    	try
    	{
    		m_session->Run(run_options, *m_bindings);
    	}
    	catch (const Ort::Exception& e)
    	{
    		LOG_ERROR(e.what());
    		return false;
    	}
    
    	return true;
    }


  3. C7X kernel creation (the create callback is invoked through vxVerifyGraph())
    static vx_status MyCustomKernel_Create(tivx_target_kernel_instance kernel, tivx_obj_desc_t *objDesc[], uint16_t numParams, void* privArg) {
        /* ... */
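        // ch is unsigned to match dma_acquire() and the %u format below;
        // each channel gets its own slot in handles.
        for (uint32_t ch = 8; ch <= 10; ch++) {
            if (VX_SUCCESS != dma_acquire(handles[ch - 8], ch))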
            {
                VX_PRINT(VX_ZONE_ERROR, "Failed to acquire channel %u!\n", ch);
                success = false;
            }
        }
        /* ... */
    }
    
    static vx_status dma_acquire(app_udma_ch_handle_t& handle, const uint32_t ch) {
        vx_status status = VX_SUCCESS;
        handle = appUdmaCopyNDGetHandle(ch);
        if (nullptr == handle) {
            VX_PRINT(VX_ZONE_ERROR, "appUdmaCopyNDGetHandle() returned nullptr!\n");
            status = VX_FAILURE;
        }
        return status;
    }


  4. Error
    [C7x_1 ] 1026752.990518 s: 
    [C7x_1 ] 1026752.990529 s: 
    [C7x_1 ] 1026752.990541 s: 
    [C7x_1 ] 1026752.990547 s: 
    [C7x_1 ] 1026752.990554 s: UDMA : ERROR: UDMA channel open failed!!
    [C7x_1 ] 1026752.990584 s:  VX_ZONE_ERROR:[dma_acquire:1188] appUdmaCopyNDGetHandle() returned nullptr!
    [C7x_1 ] 1026752.990606 s:  VX_ZONE_ERROR:[MyCustomKernel_Create:109] Failed to acquire channel 10!
    [C7x_1 ] 1026752.990622 s: UDMA : ERROR: NULL Pointer!!!
    [C7x_1 ] 1026752.990633 s: UDMA : ERROR: Unable to delete channel 10 handle!!!
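
The last two log lines ("NULL Pointer" and "Unable to delete channel 10 handle") suggest that cleanup is also attempted on the handle that was never opened. A minimal sketch of an acquire-all-or-roll-back pattern that avoids this is shown below. The `Handle`, `OpenFn`, `CloseFn` and `acquireAll` names are hypothetical stand-ins for illustration, not the appUdma API, and this is not our actual kernel code.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-ins for the real driver calls (assumption, not the appUdma API).
using Handle = void*;
using OpenFn = std::function<Handle(uint32_t)>;  // returns nullptr on failure
using CloseFn = std::function<void(Handle)>;

// Acquire every channel in [first, last]. On the first failure, release only
// the handles that were actually opened, then report failure.
bool acquireAll(uint32_t first, uint32_t last,
                const OpenFn& open, const CloseFn& close,
                std::vector<Handle>& out)
{
    out.clear();
    for (uint32_t ch = first; ch <= last; ++ch) {
        Handle h = open(ch);
        if (h == nullptr) {
            // Roll back in reverse order; the failed channel has no handle to close.
            for (auto it = out.rbegin(); it != out.rend(); ++it) {
                close(*it);
            }
            out.clear();
            return false;
        }
        out.push_back(h);
    }
    return true;
}
```

With an `open` callback that succeeds only twice, a request for channels 8 to 10 fails on the third channel, closes the two handles that were opened, and never passes a null handle to `close`.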

As you can see from the log, only the third channel initialization (channel 10) failed. I have tried using other channel numbers, but it is always the third acquisition that fails, regardless of which channel is requested. I think this is due to how the driver handles the requests.
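
To make the pattern concrete, here is a toy model of what we think is happening. It is only a sketch under assumptions: the pool size of 16 is inferred from "TIDL uses 14, we get 2", the channel numbers given to TIDL are made up, and `ChannelPool` and `openRange` are hypothetical names that do not reflect the real UDMA driver.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Toy model only: a fixed-size pool of channels shared by all users on one core.
class ChannelPool {
public:
    explicit ChannelPool(std::size_t capacity) : m_capacity(capacity) {}

    // Succeeds while fewer than `capacity` channels are open, regardless of
    // which channel number is requested. Re-opening an open channel fails.
    bool open(uint32_t ch) {
        if (m_open.size() >= m_capacity || m_open.count(ch) != 0) {
            return false;
        }
        m_open.insert(ch);
        return true;
    }

    std::size_t available() const { return m_capacity - m_open.size(); }

private:
    std::size_t m_capacity;
    std::set<uint32_t> m_open;
};

// Tries to open `count` consecutive channels starting at `first` and
// returns how many of them succeeded.
std::size_t openRange(ChannelPool& pool, uint32_t first, uint32_t count) {
    std::size_t opened = 0;
    for (uint32_t i = 0; i < count; ++i) {
        if (pool.open(first + i)) {
            ++opened;
        }
    }
    return opened;
}
```

In this model, once 14 of 16 channels are held, a request for three channels gets exactly two no matter which channel numbers are used, which matches what we observe on C7X_1.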

I was recently told by someone from TI that this is not expected behavior. Do you see anything wrong with our flow?

Thank you,

Fred