
AM5729: Big gap between PC and TIDL result

Part Number: AM5729

Dear Champs,

My customer modified the imagenet example for CNN binary classification using PyTorch, converted it to ONNX (v1.4), and imported it into TIDL.

However, the TIDL result was different from the PC result.

So I would like to check with you whether this workflow is OK, or whether they should use the Neo-AI Deep Learning Runtime (DLR) instead.

Could you please help on this?

Thanks and Best Regards,

SI.

  • Hi SI, the flow you described should be OK. Do you know how large the accuracy gap is? Moving from a floating-point model to fixed point requires quantization, which results in some accuracy loss, though it should be marginal.

    Do you know which xNN model they are testing? During model import, TIDL runs a calibration process that uses one image as input. Do you know if they are using their own image? We package a default one (airshow), but it would be good for them to use their own image for better calibration.

    Finally, once the model is converted to ONNX, can they run it on a PC? Just to check that the PyTorch-to-ONNX conversion is correct.

    thank you,

    Paula

  • Hi Paula,

    They think the gap between PC and TIDL is too large, so they don't believe it is a quantization error.

    When they checked it on the PC, they achieved 99% accuracy, but the TIDL accuracy was much lower, as shown below.

    They modified the LeNet-5 network for their input image and added more layers, as shown below.

    They used their own image, and their calibration result is shown below.

    They checked the ONNX version and its PC result using np.testing.assert_allclose(to_numpy(torch_out), ort_outs[0], rtol=1e-03, atol=1e-05), and no assertion was raised.

    Specifically, they confirmed the ONNX model on their PC with the following procedure.

    1) They generated the ONNX model as follows.

    torch.onnx.export(model,                             # model being run
                      dummy_data,                        # model input (a dummy tensor)
                      './' + filename + '_model.onnx',   # where to save the model
                      opset_version=9)                   # ONNX opset 9 (= ONNX v1.4)

    2) They validated the ONNX model using the API below.

    import onnx

    onnx_model = onnx.load('./' + filename + '_model.onnx')
    onnx.checker.check_model(onnx_model)

    3) The result was calculated using the ONNX Runtime Python API in the same process, and an inference session was created.

    import onnxruntime

    ort_session = onnxruntime.InferenceSession('./' + filename + '_model.onnx')

    def to_numpy(tensor):
        # Detach from the autograd graph if needed, then move to a CPU numpy array
        return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()

    4) They ran the ONNX model.

    # Run inference with the same dummy input used for export
    ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(dummy_data)}
    ort_outs = ort_session.run(None, ort_inputs)

    5) They compared the ONNX Runtime output against the PyTorch output with tolerances rtol=1e-03 and atol=1e-05, as below. It succeeded without any assertion.

    np.testing.assert_allclose(to_numpy(torch_out), ort_outs[0], rtol=1e-03, atol=1e-05)

    I attached their source, including the input image, below. Please check it.

    imagenet.zip

    Thanks and Best Regards,

    SI.

  • Hi SI, sorry for the delay; I have been tied up with some tasks for the J7 PSDKLA 7.0 release. A couple of questions/comments: we could try to build their model in AWS SageMaker (neo-ai-dlr) or in TVM. Do you think they can share the model with us? Or, if they prefer, they can try a SageMaker compilation job themselves (example snapshot attached).

    Also, to be honest, I am not completely sure whether porting from PyTorch to ONNX to TIDL requires an input format conversion, but it may be worth checking with your customer. TIDL expects NCHW (plane interleaved). If it is fed NHWC (pixel interleaved) instead, that could be a source of accuracy degradation (a small conversion sketch follows below).
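    As an illustrative sketch (mine, not from the original posts), converting a pixel-interleaved NHWC array to the plane-interleaved NCHW layout that TIDL expects is a single transpose; the 4x1x36x208 shape is taken from later in this thread:

    import numpy as np

    # Hypothetical NHWC batch: 4 grayscale frames of 36x208 (pixel interleaved)
    nhwc = np.zeros((4, 36, 208, 1), dtype=np.float32)

    # Reorder axes to NCHW (plane interleaved), as TIDL expects
    nchw = nhwc.transpose(0, 3, 1, 2)
    print(nchw.shape)  # (4, 1, 36, 208)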

    thank you,

    Paula

  • Hi Paula,

    Thanks for your response.

    I uploaded their model below. Could you please check it?

    MyWork_PyTorch_20200604.zip

    Thanks and Best Regards,

    SI.

  • Hi Paula,

    I already suggested using AWS SageMaker, but my customer has still failed to compile their model, as shown below.

    Could you please point my customer to a more detailed guide for compiling their model with AWS SageMaker? Is it possible to guide them through using it?

    They generated the model using PyTorch, packaged it as a tarball, and used the tarball as the input artifact.

    They tried the data input configurations below, but the compilation failed in each case.

    {"input0":[1,1,36,208]}

     

    {"input0":[4,1,36,208]}

     

    [[4,1,36,208]]

    It would be very helpful if you could guide them in compiling it with AWS SageMaker.

    They also checked the input image format you suggested, but could not find anything unusual.

    When they checked, the input format for PyTorch was also NCHW, the same as TIDL.

    Thanks and Best Regards,

    SI.

  • Hi SI, yes, I hit the same generic error when trying to compile the shared PyTorch model with my AWS account. In the past, I saw a similar failure in AWS with another customer's PyTorch model. At that time, the suggestion from AWS was to create the model using torch.jit.save() or traced_model.save(), as shown in this Python notebook (and the snippet below), and also to use PyTorch version 1.2.0 (the same version used in SageMaker).


    import torch

    # Trace the model with a zero-filled input and save it in TorchScript format
    trace = torch.jit.trace(resnet18.float().eval(),
                            torch.zeros(input_shape).float())
    trace.save('model.pth')

    Not sure how your customer saved the model, but if they used a different method, could you please ask them to try one of the two options above? Also, if they used a different version of PyTorch, could they give 1.2.0 a try?

    Another option is to try compiling directly with TVM. (For the SageMaker path, a minimal job sketch follows below.)
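    For reference, a minimal sketch of launching the Neo compilation job from Python with boto3; the job name, S3 paths, and role ARN are placeholder assumptions, while the data input config and AM57x target follow this thread:

    import boto3

    sm = boto3.client('sagemaker')

    # All names and paths below are hypothetical placeholders
    sm.create_compilation_job(
        CompilationJobName='lenet5-am57x-demo',
        RoleArn='arn:aws:iam::123456789012:role/SageMakerNeoRole',
        InputConfig={
            'S3Uri': 's3://my-bucket/model.tar.gz',        # tarball containing model.pth
            'DataInputConfig': '{"input0":[1,1,36,208]}',  # shape from this thread
            'Framework': 'PYTORCH',
        },
        OutputConfig={
            'S3OutputLocation': 's3://my-bucket/compiled/',
            'TargetDevice': 'sitara_am57x',                # AM57x target in Neo
        },
        StoppingCondition={'MaxRuntimeInSeconds': 900},
    )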

    thank you,

    Paula

  • Hi Paula,

    My customer succeeded in compiling the shared PyTorch model using PyTorch version 1.2.0.

    However, they could not find out how to deploy the SageMaker outputs on the AM5729. They got the outputs below, but could not find the next steps to port them to the AM5729.

    When they compiled the PyTorch (.pth) file using SageMaker, they got the following three outputs:

    compiled.params
    compiled.so
    compiled_model.json

    When they compiled the ONNX (.onnx) file using SageMaker, they got the following six outputs:

    compiled.params
    compiled.so
    compiled_model.json
    subgraph0.cfg
    tidl_subgraph_net.bin
    tidl_subgraph_params.bin

    When I checked the TIDL User's Guide below, the URL appeared to be broken.

    http://software-dl.ti.com/processor-sdk-linux/esd/docs/latest/linux/Foundational_Components/Machine_Learning/neo.html#compiling-network-models

     https://github.com/TexasInstruments/tvm/tree/dev/apps/tidl_deploy 

    Please advise how they can deploy their SageMaker outputs on the AM5729.

    Thanks and Best Regards,

    SI.

  • Hi SI, our initial AM57x compilation in SageMaker works as an all-or-none offload: if there is any unsupported layer, then no layer is delegated to TIDL and everything defaults to running on the ARM. Interestingly, your customer's PyTorch model seems to fall into that category, but not the ONNX one.

    In the past, for a simple test of a compiled MobileNet_v2 TensorFlow model from AWS, I used the "do_tidl4.sh" script inside /usr/share/dlr/demos. My method was to copy all the generated files into a new subfolder, /usr/share/dlr/demos/aws_compiled, and run "./do_tidl4.sh aws_compiled". Just FYI, one issue is that this simple demo script expects batch size 16, so I compiled in AWS with that batch size (16), but you can also change this easily inside "do_tidl4.sh".

    I have only tested TensorFlow models; I will confirm whether we currently support any other frameworks.

    For your information, we are working with AWS to make this implementation more robust and to add multiple-subgraph support (instead of the all-or-nothing offload approach). I need to come back to you on dates, but we are probably talking end of July for multiple subgraphs and broader test coverage to be deployed in the AWS services for AM57x.

    thank you,

    Paula

  • Hi Paula,

    Thanks for your response.

    I have 3 questions as below.

    1. Is there any C/C++ application example for Neo-AI DLR? Where can the customer find the Neo-AI DLR API to develop application software?

    2. Have you checked the PyTorch model I shared before? Could you please advise how my customer can convert it to use TIDL?

    3. My customer is currently evaluating the DL performance of the AM5729, and they want to check whether the performance will be sufficient for their application.

       Based on their PyTorch model, could you please provide guidance on how they can get the best TIDL performance out of the AM5729?

       I'm afraid the fps with Neo-AI DLR will be slower than with a direct conversion to TIDL.

    Thanks and Best Regards,

    SI.

  • Hi Paula,

    They tried to run DLR, but it failed with the errors below.

    ~~~

    Testing inference on aws_compiled/
    [06:22:29] ../3rdparty/tvm/src/runtime/graph/graph_runtime.cc:98: Warning: cannot find "input" among input
    Traceback (most recent call last):
      File "./tidl_dlr4.py", line 62, in <module>
        probabilities = model.run(input_data) #need to be a list of input arrays matching input names
      File "/usr/lib/python3.5/site-packages/dlr/api.py", line 77, in run
        return self._impl.run(input_values)
      File "/usr/lib/python3.5/site-packages/dlr/dlr_model.py", line 325, in run
        self._set_input(key, value)
      File "/usr/lib/python3.5/site-packages/dlr/dlr_model.py", line 208, in _set_input
        c_int(in_data.ndim)))
      File "/usr/lib/python3.5/site-packages/dlr/dlr_model.py", line 26, in _check_call
        raise DLRError(_LIB.DLRGetLastError().decode('ascii'))
    dlr.dlr_model.DLRError: TVMError: Check failed: static_cast<size_t>(index) < input_nodes_.size() (4294967295 vs. 1) :
    Stack trace:
      File "../3rdparty/tvm/src/runtime/graph/graph_runtime.cc", line 178
      [bt] (0) /usr/lib/python3.5/site-packages/dlr/libdlr.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x21) [0xb12713ee]
      [bt] (1) /usr/lib/python3.5/site-packages/dlr/libdlr.so(tvm::runtime::GraphRuntime::GetInput(int) const+0x69) [0xb12d02ea]
      [bt] (2) /usr/lib/python3.5/site-packages/dlr/libdlr.so(dlr::TVMModel::SetInput(char const*, long long const*, float*, int)+0x5d) [0xb129d712]
      [bt] (3) /usr/lib/python3.5/site-packages/dlr/libdlr.so(SetDLRInput+0x19) [0xb12933e6]

    ~~~

    They found that the input image format for Neo-AI DLR is NHWC, and they ran Neo-AI DLR as below.

    1. Run do_tidl4.sh:

    cp $1/*.bin .
    python3 ./tidl_dlr4.py $1 4 input

    2. They converted the grayscale input data to float32 using np.array():

    import cv2
    import numpy as np

    img_bmp = cv2.imread('test_Input.bmp', cv2.IMREAD_GRAYSCALE)
    imageIN = np.array(img_bmp, 'float32')

    # (36, 208)        grayscale image; batched via np.concatenate into (N)CHW
    # (4, 1, 36, 208)  NCHW, then np.transpose(0, 2, 3, 1) -> NHWC
    # (4, 36, 208, 1)  NHWC
    Please let me know whether this input format conversion procedure is right and whether they are running their model properly.

    Thanks and Best Regards,

    SI.

  • SI, let me respond to your latest post first, and then come back to your earlier questions.

    1. I have seen the "matching input names" error when I used "data" instead of "input", or vice versa; this changes depending on the model framework. Because you had a successful SageMaker Neo compilation job, you should use the same data input configuration. From a screenshot above, I think you used input0?

    2. I am asking a colleague who has some scripts for resizing/converting images to numpy and will come back to you. In the meantime, maybe you can check ImagePreprocessing() in the *.cc file below to see how the image is normalized from the video input; this might help. Also, in your previous post you asked us for a C/C++ example; could this one help?

    From yesterday's post:

    • I haven't checked how to import your shared PyTorch model yet. I will work on this and come back to you if questions come up.
    • Also, with respect to performance: yes, the TVM runtime adds some overhead, but we believe it is minimal. Performance numbers and a better understanding of the overhead will come later. The beauty of this approach is the variety of frameworks and models it supports, so we expect less manual work (and a possible reduction of errors/frustration) on the user's side.

    thank you,

    Paula

  • SI, I confirmed that the numpy image format depends on the framework: TF and TFLite are NHWC; all other frameworks are NCHW. With respect to the "matching input names" error, per the link below it should be "input0", so correcting the do_tidl4.sh script should fix the issue (a minimal invocation sketch follows below).
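    For reference, a minimal sketch (not from the demo scripts) of feeding the compiled model through the DLR Python API using the "input0" name; the folder path and shape are assumptions taken from earlier in this thread:

    import numpy as np
    from dlr import DLRModel

    # Folder holding the SageMaker Neo artifacts (path assumed from this thread)
    model = DLRModel('/usr/share/dlr/demos/aws_compiled')

    # NCHW input matching the compile-time config {"input0":[4,1,36,208]}
    input_data = np.zeros((4, 1, 36, 208), dtype=np.float32)

    # Key the input dict with the same name used in the compilation job
    outputs = model.run({'input0': input_data})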

    thank you,

    Paula

  • Hi Paula,

    Thanks for your guidance. My customer succeeded in running their trained model based on it, and they realized their input format had been wrong.

    However, the result is the same as the previous one that used ONNX on TIDL: it is wrong in the same way as before and falls far short of the PC result.

    So, I have suggested using Jacinto Caffe again, and they are trying it. In the meantime, could you please check why their model does not work on the AM5729 using ONNX and SageMaker Neo?

    Thanks and Best Regards,

    SI.

  • Hi SI, can you send me the steps to test it? Can your customer share their test application? If so, a readme or list of steps along with it would be helpful, so I can reproduce it on my side.

    thank you,

    Paula

  • Hi Paula,

    Sorry for the late response. Please check the file below for their test.

    They succeeded in running it using the same input data from this file, as you recommended, but they found the result was very poor, around 50% accuracy, although the accuracy was almost 100% on the PC, as I mentioned at the start of this thread.

    Please let me know your result and your opinion on this.

    demos.tar.gz

    Thanks and Best Regards,

    SI.