This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

SK-AM62A-LP: Can the CPU and NPU run at the same time based on edgeai-gst-apps/app_cpp?

Part Number: SK-AM62A-LP


Tool/software:

Could the CPU and NPU run at the same time based on edgeai-gst-apps/app_cpp? We tried deleting the contents of allowedNode.txt and adding the following code to the post-processing part:

auto allocator_info = Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeCPU);
Ort::Value input_tensor_ = Ort::Value::CreateTensor<float>(allocator_info, input_image_.data(), input_image_.size(), input_shape.data(), input_shape.size());
auto cpu_output = ort_session->Run(Ort::RunOptions{ nullptr }, &input_names[0], &input_tensor_, 1, output_names.data(), 1);
const float* output_cpu = cpu_output[0].GetTensorMutableData<float>();

But the FPS drops to 5. Is that normal?

  • Hi

    It is suggested to measure the latency of each part to know which statement slows things down.
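
    For example, a minimal sketch of per-stage timing with std::chrono (the stage calls below are placeholders for your own pre-processing, inference, and post-processing code):

    #include <chrono>
    #include <cstdio>

    // Returns the wall-clock duration of one pipeline stage in milliseconds.
    template <typename Stage>
    double time_stage_ms(Stage &&stage)
    {
        auto t0 = std::chrono::steady_clock::now();
        stage();
        auto t1 = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(t1 - t0).count();
    }

    // Example use inside the processing loop:
    //   double pre_ms   = time_stage_ms([&]{ /* resize + normalize */ });
    //   double infer_ms = time_stage_ms([&]{ /* session.Run(...) */ });
    //   printf("pre %.2f ms, infer %.2f ms\n", pre_ms, infer_ms);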

    Regards,

    Adam

  • Hello,

    Are you trying to run a model only on the CPU in this instance? FPS will be much slower than on the NPU/C7x, of course. The C7x is often 20-50x faster than the 4x A53s on AM62A.

    I am not sure how you have created the ort_session here, but I assume it uses the CPU execution provider instead of TIDL.
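
    For reference, a minimal sketch (omitting includes; the model and artifacts paths are placeholders) of the difference between a CPU-only session and one that appends the TIDL execution provider, following the TIDL OSRT C API used in the snippets later in this thread:

    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "example");

    // (a) Default CPU execution provider: the whole model runs on the A53 cores.
    Ort::SessionOptions cpu_opts;
    Ort::Session cpu_session(env, "/path/to/model.onnx", cpu_opts);

    // (b) TIDL execution provider: supported subgraphs are offloaded to the C7x/MMA,
    //     using the artifacts produced at model-compilation time.
    Ort::SessionOptions tidl_opts;
    c_api_tidl_options *options = (c_api_tidl_options *)malloc(sizeof(c_api_tidl_options));
    OrtSessionsOptionsSetDefault_Tidl(options);
    strcpy(options->artifacts_folder, "/path/to/artifacts");
    OrtSessionOptionsAppendExecutionProvider_Tidl(tidl_opts, options);
    Ort::Session tidl_session(env, "/path/to/model.onnx", tidl_opts);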

    BR,
    Reese

  • It is easy to use ONNX Runtime on the CPU. I have tried a lot but failed on the NPU.

    void * allocTensorMem(int size, int accel)
    {
        void * ptr = NULL;
        if (accel)
        {
            #ifdef DEVICE_AM62
            LOG_ERROR("TIDL Delgate mode is not allowed on AM62 devices...\n");
            printf("Could not allocate memory for a Tensor of size %d \n ", size);
            exit(0);
            #else
            ptr = TIDLRT_allocSharedMem(64, size);
            #endif
        }
        else
        {
            ptr = malloc(size);
        }
        if (ptr == NULL)
        {
            printf("Could not allocate memory for a Tensor of size %d \n ", size);
            exit(0);
        }
        return ptr;
    }  
    /**************************************************************************************************************************/
    
        Ort::SessionOptions session_cpuop;
        session_cpuop.SetLogSeverityLevel(3);
        Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "ONNXModel");
        string artifacts_path="/opt/model_zoo/c777/artifacts";
        std::string model_path_cpu = "/opt/model_zoo/c777/model/modified_mpiifacegaze-60.onnx";
        c_api_tidl_options *options = (c_api_tidl_options *)malloc(sizeof(c_api_tidl_options));
        OrtStatus *def_status = OrtSessionsOptionsSetDefault_Tidl(options);
        strcpy(options->artifacts_folder, artifacts_path.c_str());
        OrtStatus *status = OrtSessionOptionsAppendExecutionProvider_Tidl(session_cpuop, options);
        session_cpuop.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
        session_cpuop.SetLogSeverityLevel(3);
        Ort::AllocatorWithDefaultOptions allocator;
        Ort::Session session(env, model_path_cpu.c_str(), session_cpuop);
        std::vector<int64_t> input_shape = {1, 3, 60, 60};
        cv::resize(gaze_img, resized_cpu_image, cv::Size(60, 60));  
        normalize_(resized_cpu_image);
        std::vector<char*> input_names,output_names;
        std::vector<std::string> strings = {"156"};
        std::vector<std::string> strings1 = {"input.1"};
        std::vector<Ort::Value> output_tensors;
        //input_names.push_back("input.1");
        for (int i = 0; i < 1; i++){
            const char* exampleString = strings[i].c_str();
            char* newCharPtr = new char[strlen(exampleString) + 1];
            strcpy(newCharPtr, exampleString);
            output_names.push_back(newCharPtr);
        }
        for (int i = 0; i < 1; i++){
            const char* exampleString1 = strings1[i].c_str();
            char* newCharPtr1 = new char[strlen(exampleString1) + 1];
            strcpy(newCharPtr1, exampleString1);
            input_names.push_back(newCharPtr1);
        }
        auto allocator_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);//Ort::MemoryInfo::CreateCpu(OrtDeviceAllocator, OrtMemTypeCPU);
        std::vector<Ort::Value> input_tensors;
        Ort::Value input_tensor = Ort::Value::CreateTensor<float>(allocator_info, input_image_.data(), input_image_.size(), input_shape.data(), input_shape.size());
        input_tensors.push_back(std::move(input_tensor));
        //Ort::Value input_tensor_ = Ort::Value::CreateTensor<float>(allocator_info, input_image_.data(), input_image_.size(), input_shape.data(), input_shape.size());
        auto run_options = Ort::RunOptions();
        run_options.SetRunLogVerbosityLevel(2);
        run_options.SetRunLogSeverityLevel(3);
        auto cpu_output = session.Run(run_options, input_names.data(), input_tensors.data(), 1, output_names.data(), 1);
        Ort::IoBinding binding(session);
        binding.BindInput(input_names[0], input_tensors[0]);
        for(int idx =0; idx < 1; idx++){
                auto node_dims = cpu_output[idx].GetTypeInfo().GetTensorTypeAndShapeInfo().GetShape();
                size_t tensor_size = 1;
                for(int j = node_dims.size()-1; j >= 0; j--)
                    tensor_size *= node_dims[j];
                ONNXTensorElementDataType tensor_type  = cpu_output[idx].GetTypeInfo().GetTensorTypeAndShapeInfo().GetElementType();              
                if(tensor_type == ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT)
                    tensor_size *= sizeof(float);
                else if(tensor_type == ONNX_TENSOR_ELEMENT_DATA_TYPE_UINT8)
                    tensor_size *= sizeof(uint8_t);
                else if(tensor_type == ONNX_TENSOR_ELEMENT_DATA_TYPE_INT64)
                    tensor_size *= sizeof(int64_t);              
                else{
                std::cout << "Un Supported output tensor_type\n";
                    exit(0);}  
                void * outData = allocTensorMem(tensor_size, 1);
                auto output_tensor = Ort::Value::CreateTensor(allocator_info, (void *)outData, tensor_size, node_dims.data(), node_dims.size(),tensor_type);
                output_tensors.push_back(std::move(output_tensor));
                binding.BindOutput(output_names[idx], output_tensors[idx]);
            }
            session.Run(run_options, binding);
            float *inDdata = output_tensors.at(0).GetTensorMutableData<float>();
            std::cout<<inDdata<<std::endl;
    /**************************************************************************************************************************/
  • Hello,

    I re-formatted your code above to make it more readable in the e2e interface -- code blocks (using Insert -> Code) help greatly here.

    But the FPS drops to 5. Is that normal?

    Is this for the portion of the model running on the CPU? I cannot say if this is expected or not -- it depends on the complexity of your network. As I stated above, the CPU is usually 20-50x slower than the NPU.

    We tried deleting the contents of allowedNode.txt and adding the following code to the post-processing part.

    I'm not entirely sure I understand here. It sounds like you tried to remove post-processing layers from allowedNode.txt so that TIDL would instead delegate them to the Arm core. I suppose that might work, but it would be better to handle this during compilation of the model:

    • Use deny_list and max_num_subgraphs to control the accelerated vs. unaccelerated set of layers.
      • For instance, set max_num_subgraphs to 1, and deny layers where you'd like to stop using TIDL
        • --> everything after these layers should also be denied because TIDL will not create an additional subgraph. 

    This is so that you use NPU for part of the model and CPU for the rest?

    It is easy to use ONNX Runtime on the CPU. I have tried a lot but failed on the NPU.

    Can you be more specific? What has failed? Can you provide error messages?

    BR,
    Reese

  • It is easy to use ONNX Runtime on the CPU. I have tried a lot but failed on the NPU.

    Can you be more specific? What has failed? Can you provide error messages?

    It is still running on the CPU.

  • There are no errors at all. The inference part of a single model takes 20 ms, but the CPU usage is a strange 340%, so there is no way to tell whether it is actually running on the NPU.

    Whether void * outData = allocTensorMem(tensor_size, 1); is called with accel = 1 (so ptr = TIDLRT_allocSharedMem(64, size);) or with 0 (ptr = malloc(size);), the CPU usage is 340% in both cases.

  • Also, each run can only loop 128 times. A for loop with 128 or fewer iterations finishes normally, but with more than 128 iterations, or with while(1), the program simply exits once it reaches 128.

  • If, after the warmup, I read images from the camera in a loop and then process them, the infer time is inconsistent.

  • Hello, 

    I can only offer limited help with most of the text in Chinese.

    I do clearly see the infer time is highly variable, and seems to have common modes (15ms, 27ms). I've seen such behavior in the past when frequency scaling is enabled in Linux for the Arm cores. 

    $cat /sys/devices/system/cpu/cpufreq/policy0/scaling_available_governors
    ondemand userspace performance
    
    $cat /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
    performance #should be either userspace or performance
    
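    If the reported governor is ondemand, one quick check (assuming root access; this is a generic Linux cpufreq command, not specific to this SDK) is to switch policy0 to performance and re-test:

    $ echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
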

    I think you see similar behavior in multiple parts of your code. Is this correct?


  • Ort::SessionOptions session_cpuop;
    session_cpuop.SetLogSeverityLevel(3);
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "ONNXModel");
    string artifacts_path="/home/c666-phase1/artifacts";
    std::string model_path_cpu = "/home/c666-phase1/model/modified_modified_test_phase1.onnx";
    c_api_tidl_options *options = (c_api_tidl_options *)malloc(sizeof(c_api_tidl_options));
    OrtStatus *def_status = OrtSessionsOptionsSetDefault_Tidl(options);
    strcpy(options->artifacts_folder, artifacts_path.c_str());
    OrtStatus *status = OrtSessionOptionsAppendExecutionProvider_Tidl(session_cpuop, options);
    session_cpuop.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_EXTENDED);
    Ort::Session session(env, model_path_cpu.c_str(), session_cpuop);
    auto run_options = Ort::RunOptions();
    run_options.SetRunLogVerbosityLevel(2);
    run_options.SetRunLogSeverityLevel(3);
    
    
    
    std::vector<int64_t> input_shape = {1, 3, 128, 128};
    cv::Mat resized_cpu_image;
    cv::Mat gaze_img = cv::imread("/home/an.jpg");
    std::vector<char*> input_names, output_names;
    std::vector<std::string> strings = {"outputall"}; // output
    std::vector<std::string> strings1 = {"images"};   // input
    std::vector<Ort::Value> output_tensors;
    Ort::AllocatorWithDefaultOptions allocator;
    for (int i = 0; i < 1; i++){
        const char* exampleString = strings[i].c_str();
        char* newCharPtr = new char[strlen(exampleString) + 1];
        strcpy(newCharPtr, exampleString);
        output_names.push_back(newCharPtr);
    }
    for (int i = 0; i < 1; i++){
        const char* exampleString1 = strings1[i].c_str();
        char* newCharPtr1 = new char[strlen(exampleString1) + 1];
        strcpy(newCharPtr1, exampleString1);
        input_names.push_back(newCharPtr1);
    }
    auto allocator_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    std::vector<Ort::Value> input_tensors;
    
    gaze_img = cv::imread("/home/3.jpg");
    cv::resize(gaze_img, resized_cpu_image, cv::Size(128, 128));
    normalize_(resized_cpu_image);
    
    Ort::Value input_tensor = Ort::Value::CreateTensor<float>(allocator_info, input_image_.data(), input_image_.size(), input_shape.data(), input_shape.size());
    input_tensors.push_back(std::move(input_tensor));
    
    // warmup
    auto cpu_output = session.Run(run_options, input_names.data(), input_tensors.data(), 1, output_names.data(), 1);
    Ort::IoBinding binding(session);
    binding.BindInput(input_names[0], input_tensors[0]);
    for(int idx = 0; idx < 1; idx++){
        auto node_dims = cpu_output[idx].GetTypeInfo().GetTensorTypeAndShapeInfo().GetShape();
        size_t tensor_size = 1;
        for(int j = node_dims.size()-1; j >= 0; j--)
            tensor_size *= node_dims[j];
        ONNXTensorElementDataType tensor_type = cpu_output[idx].GetTypeInfo().GetTensorTypeAndShapeInfo().GetElementType();
        if(tensor_type == ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT)
            tensor_size *= sizeof(float);
        else if(tensor_type == ONNX_TENSOR_ELEMENT_DATA_TYPE_UINT8)
            tensor_size *= sizeof(uint8_t);
        else if(tensor_type == ONNX_TENSOR_ELEMENT_DATA_TYPE_INT64)
            tensor_size *= sizeof(int64_t);
        else{
            std::cout << "Unsupported output tensor_type\n";
            exit(0);
        }
        //cout<<tensor_size<<endl; 940
        void * outData = allocTensorMem(tensor_size, 1);
        auto output_tensor = Ort::Value::CreateTensor(allocator_info, (void *)outData, tensor_size, node_dims.data(), node_dims.size(), tensor_type);
        output_tensors.push_back(std::move(output_tensor));
        binding.BindOutput(output_names[idx], output_tensors[idx]);
    }
    
    cv::VideoCapture cap(3);
    for (int i = 0; i < 1000; i++){
        cap >> gaze_img;
        cv::resize(gaze_img, resized_cpu_image, cv::Size(128, 128));
        normalize_(resized_cpu_image);
        
        clock_t stavg = clock();
        Ort::Value input_tensor = Ort::Value::CreateTensor<float>(allocator_info, input_image_.data(), input_image_.size(), input_shape.data(), input_shape.size());
        input_tensors.push_back(std::move(input_tensor));
        binding.BindInput(input_names[0], input_tensors[0]);
        session.Run(run_options, binding);
        clock_t endavg = clock();
        printf("infer time:%f ms\n", (double)(endavg - stavg)*1000 / CLOCKS_PER_SEC);
        
        float *inDdata = output_tensors.at(0).GetTensorMutableData<float>();
    }
    
    }

  • Hi

    As we discussed on the phone, the original problem (one of the two models not running) has turned into these two issues:

    1. Inference time is not stable

    2. CPU loading is high

    So please use top -H to see the per-thread CPU loading. And please file this in a new e2e thread, as it has become a separate issue.

    Regards,

    Adam

  • We will address this topic of CPU utilization and inconsistent inference latency in the new thread.

    As a quick summary for future visitors: 5 FPS is realistic on the CPU. The code snippets posted are functionally correct, but the problem at this stage is that inference latency is inconsistent. This is probably a result of CPU starvation, or otherwise needing too much compute in a short period of time --> inconsistent performance from context switching.