This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

SK-AM62A-LP: Can the CPU and NPU run at the same time based on edgeai-gst-apps/app_cpp?

Part Number: SK-AM62A-LP


Tool/software:

Could the CPU and NPU run at the same time based on edgeai-gst-apps/app_cpp? We tried deleting the contents of allowedNode.txt and adding the following code to the post-processing part:

auto allocator_info = Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeCPU);
Ort::Value input_tensor_ = Ort::Value::CreateTensor<float>(allocator_info, input_image_.data(), input_image_.size(), input_shape.data(), input_shape.size());
auto cpu_output = ort_session->Run(Ort::RunOptions{ nullptr }, &input_names[0], &input_tensor_, 1, output_names.data(), 1);
const float* output_cpu = cpu_output[0].GetTensorMutableData<float>();

But the FPS drops to 5. Is that normal?

  • Hi

    It is suggested to measure the latency of each part to know which statement slows things down.
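
    For example, a minimal sketch of per-stage timing with std::chrono (the stage calls below are placeholders for your own pre-processing, inference, and post-processing code):

    #include <chrono>
    #include <cstdio>

    // Returns the wall-clock duration of one pipeline stage in milliseconds.
    template <typename Stage>
    double time_stage_ms(Stage &&stage)
    {
        auto t0 = std::chrono::steady_clock::now();
        stage();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    // Example use inside the processing loop:
    //   double pre_ms   = time_stage_ms([&]{ /* resize + normalize */ });
    //   double infer_ms = time_stage_ms([&]{ /* session.Run(...) */ });
    //   printf("pre %.2f ms, infer %.2f ms\n", pre_ms, infer_ms);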

    Regards,

    Adam

  • Hello,

    Are you trying to run a model only on the CPU in this instance? FPS will be much slower than on the NPU/C7x, of course. The C7x is often 20-50x faster than the 4x A53s on AM62A.

    I am not sure how you have created the ort_session here, but I assume it uses the CPU execution provider instead of TIDL.
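
    For reference, a minimal sketch (omitting includes; the model and artifacts paths are placeholders) of the difference between a CPU-only session and one that appends the TIDL execution provider, following the TIDL OSRT C API used in the snippets later in this thread:

    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "example");

    // (a) Default CPU execution provider: the whole model runs on the A53 cores.
    Ort::SessionOptions cpu_opts;
    Ort::Session cpu_session(env, "/path/to/model.onnx", cpu_opts);

    // (b) TIDL execution provider: supported subgraphs are offloaded to the C7x/MMA,
    //     using the artifacts produced at model-compilation time.
    Ort::SessionOptions tidl_opts;
    c_api_tidl_options *options = (c_api_tidl_options *)malloc(sizeof(c_api_tidl_options));
    OrtSessionsOptionsSetDefault_Tidl(options);
    strcpy(options->artifacts_folder, "/path/to/artifacts");
    OrtSessionOptionsAppendExecutionProvider_Tidl(tidl_opts, options);
    Ort::Session tidl_session(env, "/path/to/model.onnx", tidl_opts);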

    BR,
    Reese

  • It is easy to use ONNX Runtime on the CPU. I have tried a lot but failed on the NPU.

    void * allocTensorMem(int size, int accel)
    {
        void * ptr = NULL;
        if (accel)
        {
            #ifdef DEVICE_AM62
            LOG_ERROR("TIDL Delgate mode is not allowed on AM62 devices...\n");
            printf("Could not allocate memory for a Tensor of size %d \n ", size);
            exit(0);
            #else
            ptr = TIDLRT_allocSharedMem(64, size);
            #endif
        }
        else
        {
            ptr = malloc(size);
        }
        if (ptr == NULL)
        {
            printf("Could not allocate memory for a Tensor of size %d \n ", size);
            exit(0);
        }
        return ptr;
    }  
    /**************************************************************************************************************************/
    
        Ort::SessionOptions session_cpuop;
        session_cpuop.SetLogSeverityLevel(3);
        Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "ONNXModel");
        string artifacts_path="/opt/model_zoo/c777/artifacts";
        std::string model_path_cpu = "/opt/model_zoo/c777/model/modified_mpiifacegaze-60.onnx";
        c_api_tidl_options *options = (c_api_tidl_options *)malloc(sizeof(c_api_tidl_options));
        OrtStatus *def_status = OrtSessionsOptionsSetDefault_Tidl(options);
        strcpy(options->artifacts_folder, artifacts_path.c_str());
        OrtStatus *status = OrtSessionOptionsAppendExecutionProvider_Tidl(session_cpuop, options);
        session_cpuop.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
        session_cpuop.SetLogSeverityLevel(3);
        Ort::AllocatorWithDefaultOptions allocator;
        Ort::Session session(env, model_path_cpu.c_str(), session_cpuop);
        std::vector<int64_t> input_shape = {1, 3, 60, 60};
        cv::resize(gaze_img, resized_cpu_image, cv::Size(60, 60));  
        normalize_(resized_cpu_image);
        std::vector<char*> input_names,output_names;
        std::vector<std::string> strings = {"156"};
        std::vector<std::string> strings1 = {"input.1"};
        std::vector<Ort::Value> output_tensors;
        //input_names.push_back("input.1");
        for (int i = 0; i < 1; i++){
            const char* exampleString = strings[i].c_str();
            char* newCharPtr = new char[strlen(exampleString) + 1];
            strcpy(newCharPtr, exampleString);
            output_names.push_back(newCharPtr);
        }
        for (int i = 0; i < 1; i++){
            const char* exampleString1 = strings1[i].c_str();
            char* newCharPtr1 = new char[strlen(exampleString1) + 1];
            strcpy(newCharPtr1, exampleString1);
            input_names.push_back(newCharPtr1);
        }
        auto allocator_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);//Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeCPU);
        std::vector<Ort::Value> input_tensors;
        Ort::Value input_tensor = Ort::Value::CreateTensor<float>(allocator_info, input_image_.data(), input_image_.size(), input_shape.data(), input_shape.size());
        input_tensors.push_back(std::move(input_tensor));
        //Ort::Value input_tensor_ = Ort::Value::CreateTensor<float>(allocator_info, input_image_.data(), input_image_.size(), input_shape.data(), input_shape.size());
        auto run_options = Ort::RunOptions();
        run_options.SetRunLogVerbosityLevel(2);
        run_options.SetRunLogSeverityLevel(3);
        auto cpu_output = session.Run(run_options, input_names.data(), input_tensors.data(), 1, output_names.data(), 1);
        Ort::IoBinding binding(session);
        binding.BindInput(input_names[0], input_tensors[0]);
        for(int idx =0; idx < 1; idx++){
                auto node_dims = cpu_output[idx].GetTypeInfo().GetTensorTypeAndShapeInfo().GetShape();
                size_t tensor_size = 1;
                for(int j = node_dims.size()-1; j >= 0; j--)
                    tensor_size *= node_dims[j];
                ONNXTensorElementDataType tensor_type  = cpu_output[idx].GetTypeInfo().GetTensorTypeAndShapeInfo().GetElementType();              
                if(tensor_type == ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT)
                    tensor_size *= sizeof(float);
                else if(tensor_type == ONNX_TENSOR_ELEMENT_DATA_TYPE_UINT8)
                    tensor_size *= sizeof(uint8_t);
                else if(tensor_type == ONNX_TENSOR_ELEMENT_DATA_TYPE_INT64)
                    tensor_size *= sizeof(int64_t);              
                else{
                std::cout << "Un Supported output tensor_type\n";
                    exit(0);}  
                void * outData = allocTensorMem(tensor_size, 1);
                auto output_tensor = Ort::Value::CreateTensor(allocator_info, (void *)outData, tensor_size, node_dims.data(), node_dims.size(),tensor_type);
                output_tensors.push_back(std::move(output_tensor));
                binding.BindOutput(output_names[idx], output_tensors[idx]);
            }
            session.Run(run_options, binding);
            float *inDdata = output_tensors.at(0).GetTensorMutableData<float>();
            std::cout<<inDdata<<std::endl;
    /**************************************************************************************************************************/
  • Hello,

    I re-formatted your code above to make it more readable in the e2e interface -- code blocks (using Insert -> Code) help greatly here.

    But the FPS drops to 5. Is that normal?

    Is this for the portion of the model running on the CPU? I cannot say if this is expected or not -- it depends on the complexity of your network. As I stated above, the CPU is usually 20-50x slower than the NPU.

    We tried deleting the contents of allowedNode.txt and adding the following code to the post-processing part.

    I'm not entirely sure I understand here. It sounds like you tried to remove post-processing layers from allowedNode.txt so that TIDL would instead delegate them to the Arm core. I suppose that might work, but it would be better to handle this during compilation of the model:

    • Use deny_list and max_num_subgraphs to control the accelerated vs. unaccelerated set of layers.
      • For instance, set max_num_subgraphs to 1, and deny layers where you'd like to stop using TIDL
        • --> everything after these layers should also be denied because TIDL will not create an additional subgraph. 

    This is so that you use NPU for part of the model and CPU for the rest?

    It is easy to use ONNX Runtime on the CPU. I have tried a lot but failed on the NPU.

    Can you be more specific? What has failed? Can you provide error messages?

    BR,
    Reese

  • It is easy to use ONNX Runtime on the CPU. I have tried a lot but failed on the NPU.

    Can you be more specific? What has failed? Can you provide error messages?

    It is still running on the CPU.

  • There are no errors at all. The inference part of a single model takes 20 ms, but the CPU usage is a strange 340%, so there is no way to tell whether it is actually running on the NPU.

    Whether void * outData = allocTensorMem(tensor_size, 1); is called with accel = 1 (so ptr = TIDLRT_allocSharedMem(64, size);) or with 0 (ptr = malloc(size);), the CPU usage is 340% in both cases.

  • Also, each run can only loop 128 times. A for loop with 128 or fewer iterations finishes normally, but with more than 128 iterations, or with while(1), the program simply exits once it reaches 128.

  • If, after the warmup, I read images from the camera in a loop and then process them, the infer time is inconsistent.

  • Hello, 

    I can only offer limited help with most of the text in Chinese.

    I do clearly see the infer time is highly variable, and seems to have common modes (15ms, 27ms). I've seen such behavior in the past when frequency scaling is enabled in Linux for the Arm cores. 

    $cat /sys/devices/system/cpu/cpufreq/policy0/scaling_available_governors
    ondemand userspace performance
    
    $cat /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
    performance #should be either userspace or performance
    
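    If the reported governor is ondemand, one quick check (assuming root access; this is a generic Linux cpufreq command, not specific to this SDK) is to switch policy0 to performance and re-test:

    $ echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
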

    I think you see similar behavior in multiple parts of your code. Is this correct?


  • Ort::SessionOptions session_cpuop;
    session_cpuop.SetLogSeverityLevel(3);
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "ONNXModel");
    string artifacts_path="/home/c666-phase1/artifacts";
    std::string model_path_cpu = "/home/c666-phase1/model/modified_modified_test_phase1.onnx";
    c_api_tidl_options *options = (c_api_tidl_options *)malloc(sizeof(c_api_tidl_options));
    OrtStatus *def_status = OrtSessionsOptionsSetDefault_Tidl(options);
    strcpy(options->artifacts_folder, artifacts_path.c_str());
    OrtStatus *status = OrtSessionOptionsAppendExecutionProvider_Tidl(session_cpuop, options);
    session_cpuop.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
    Ort::Session session(env, model_path_cpu.c_str(), session_cpuop);
    auto run_options = Ort::RunOptions();
    run_options.SetRunLogVerbosityLevel(2);
    run_options.SetRunLogSeverityLevel(3);
    
    
    
    std::vector<int64_t> input_shape = {1, 3, 128, 128};
    cv::Mat resized_cpu_image;
    cv::Mat gaze_img = cv::imread("/home/an.jpg");
    std::vector<char*> input_names, output_names;
    std::vector<std::string> strings = {"outputall"}; // output
    std::vector<std::string> strings1 = {"images"};   // input
    std::vector<Ort::Value> output_tensors;
    Ort::AllocatorWithDefaultOptions allocator;
    for (int i = 0; i < 1; i++){
        const char* exampleString = strings[i].c_str();
        char* newCharPtr = new char[strlen(exampleString) + 1];
        strcpy(newCharPtr, exampleString);
        output_names.push_back(newCharPtr);
    }
    for (int i = 0; i < 1; i++){
        const char* exampleString1 = strings1[i].c_str();
        char* newCharPtr1 = new char[strlen(exampleString1) + 1];
        strcpy(newCharPtr1, exampleString1);
        input_names.push_back(newCharPtr1);
    }
    auto allocator_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    std::vector<Ort::Value> input_tensors;
    
    gaze_img = cv::imread("/home/3.jpg");
    cv::resize(gaze_img, resized_cpu_image, cv::Size(128, 128));
    normalize_(resized_cpu_image);
    
    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(allocator_info, input_image_.data(), input_image_.size(), input_shape.data(), input_shape.size());
    input_tensors.push_back(std::move(input_tensor));
    
    // warmup
    auto cpu_output = session.Run(run_options, input_names.data(), input_tensors.data(), 1, output_names.data(), 1);
    Ort::IoBinding binding(session);
    binding.BindInput(input_names[0], input_tensors[0]);
    for(int idx = 0; idx < 1; idx++){
        auto node_dims = cpu_output[idx].GetTypeInfo().GetTensorTypeAndShapeInfo().GetShape();
        size_t tensor_size = 1;
        for(int j = node_dims.size()-1; j >= 0; j--)
            tensor_size *= node_dims[j];
        ONNXTensorElementDataType tensor_type = cpu_output[idx].GetTypeInfo().GetTensorTypeAndShapeInfo().GetElementType();
        if(tensor_type == ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT)
            tensor_size *= sizeof(float);
        else if(tensor_type == ONNX_TENSOR_ELEMENT_DATA_TYPE_UINT8)
            tensor_size *= sizeof(uint8_t);
        else if(tensor_type == ONNX_TENSOR_ELEMENT_DATA_TYPE_INT64)
            tensor_size *= sizeof(int64_t);
        else{
            std::cout << "Unsupported output tensor_type\n";
            exit(0);
        }
        //cout<<tensor_size<<endl; 940
        void * outData = allocTensorMem(tensor_size, 1);
        auto output_tensor = Ort::Value::CreateTensor(allocator_info, (void *)outData, tensor_size, node_dims.data(), node_dims.size(), tensor_type);
        output_tensors.push_back(std::move(output_tensor));
        binding.BindOutput(output_names[idx], output_tensors[idx]);
    }
    
    cv::VideoCapture cap(3);
    for (int i = 0; i < 1000; i++){
        cap >> gaze_img;
        cv::resize(gaze_img, resized_cpu_image, cv::Size(128, 128));
        normalize_(resized_cpu_image);
        
        clock_t stavg = clock();
        Ort::Value input_tensor = Ort::Value::CreateTensor<float>(allocator_info, input_image_.data(), input_image_.size(), input_shape.data(), input_shape.size());
        input_tensors.push_back(std::move(input_tensor));
        binding.BindInput(input_names[0], input_tensors[0]);
        session.Run(run_options, binding);
        clock_t endavg = clock();
        printf("infer time:%f ms\n", (double)(endavg - stavg)*1000 / CLOCKS_PER_SEC);
        
        float *inDdata = output_tensors.at(0).GetTensorMutableData<float>();
    }
    
    }

  • Hi

    As we discussed on the phone, the original problem (one of the two models not running) has turned into these two issues:

    1. Inference time is not stable

    2. CPU loading is high

    So please use top -H to see the per-thread CPU loading. And please file this in a new e2e thread, as it has become a separate issue.

    Regards,

    Adam

  • We will address this topic of CPU utilization and inconsistent inference latency in the new thread.

    As a quick summary for future visitors: 5 FPS is realistic on the CPU. The code snippets posted are functionally correct, but the problem at this stage is that inference latency is inconsistent. This is probably a result of CPU starvation, or otherwise needing too much compute in a short period of time --> inconsistent performance from context switching.