PROCESSOR-SDK-AM57X: Example to train CNN network on tensorflow and run using TIDL

Part Number: PROCESSOR-SDK-AM57X

Hi,

I'm trying to convert a TensorFlow model to TIDL and run it on an AM57x processor. To do this, I followed the instructions shared in this thread, and the model was converted.

For simplicity, I used the network from the TensorflowExample (cifar10_train.py) as a reference. The only difference is that my model is trained and tested on the MNIST dataset instead of CIFAR-10. I have validated the results of my modified network in TensorFlow, so I expect it to work with TIDL as well (given that the network in TensorflowExample is validated on TIDL). However, when I run the application with the converted model, I get the following issues:

1. The application terminates with a segfault (core dump analysis below) if the target device type is EVE.

(gdb) bt
#0 0xb663b522 in std::local_Rb_tree_decrement (__x=0x5714e0)
at /home/tcwg-buildslave/workspace/tcwg-make-release/label/docker-trusty-amd64-tcwg-build/target/arm-linux-gnueabihf/snapshots/gcc-linaro-6.2-2016.11/libstdc++-v3/src/c++98/tree.cc:98
#1 0xb6d75c5e in Coal::Object::Object(Coal::Object::Type, Coal::Object*) () from /usr/lib/libOpenCL.so.1
#2 0xb6d6f91a in Coal::Event::Event(Coal::CommandQueue*, Coal::Event::Status, unsigned int, _cl_event* const*, int*) () from /usr/lib/libOpenCL.so.1
#3 0xb6d72820 in Coal::KernelEvent::KernelEvent(Coal::CommandQueue*, Coal::Kernel*, unsigned int, unsigned int const*, unsigned int const*, unsigned int const*, unsigned int, _cl_event* const*, int*) () from /usr/lib/libOpenCL.so.1
#4 0xb6d72cca in Coal::TaskEvent::TaskEvent(Coal::CommandQueue*, Coal::Kernel*, unsigned int, _cl_event* const*, int*) () from /usr/lib/libOpenCL.so.1
#5 0xb6d6d51a in clEnqueueTask () from /usr/lib/libOpenCL.so.1
#6 0x0001d8ea in tidl::Kernel::RunAsync() ()
#7 0x0001b0f2 in tidl::ExecutionObject::ProcessFrameStartAsync() ()
#8 0x00014b38 in RunConfiguration (config_file=..., num_devices=num_devices@entry=1, device_type=device_type@entry=tidl::DeviceType::DLA, format=format@entry=0, input_file=...) at main.cpp:229
#9 0x000137a2 in main (argc=3, argv=0xbeb1dc84) at main.cpp:116

2. The application terminates with a segmentation fault if the target device type is DSP and the number of devices configured is 2 instead of 1.

ERROR: [ Line: 312] CL_INVALID_PROGRAM_EXECUTABLE

core dump analysis:

(gdb) bt
#0 0xb6d3fd02 in std::_Rb_tree<Coal::Object*, Coal::Object*, std::_Identity<Coal::Object*>, std::less<Coal::Object*>, std::allocator<Coal::Object*> >::_M_erase(std::_Rb_tree_node<Coal::Object*>*) () from /usr/lib/libOpenCL.so.1
#1 0xbee35614 in ?? ()

Is there some step that I'm missing?

Regards,
Manu

  • Manu, our expert on this is traveling and the response to this issue may be delayed as a result. I'm sorry for any inconvenience.
  • Manu,

    Can you share the following information? This will help us reproduce the problem and analyze the issues you are seeing.
    * Version of Processor Linux SDK used to run the example
    * Configuration file used for inference
    * Sample input
    * Imported bin files (two files: *net*.bin, *param*.bin)

    Thanks,
    Ajay
  • Hi Ajay,

    Thanks for getting back to me on my query. I have made some progress on this since then.

    After some investigation, I found that the crash was caused by the type of images used for training and evaluating the network: my network uses grayscale images, unlike the other TIDL examples, which use RGB images.
    The application did not handle this correctly, which resulted in the crash. I have corrected this now, and the application no longer crashes (tested only with DSP, not with EVE).
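A minimal sketch of why the channel count matters, assuming 28x28 MNIST images with 8-bit values. The helper names `frame_bytes` and `check_frame` are hypothetical and are not part of the TIDL API; the thread does not say exactly how the application was corrected, so this only illustrates the size mismatch between a grayscale frame and what RGB-oriented code would expect.

```python
# Hypothetical helpers for illustration only; not part of the TIDL API.

def frame_bytes(width, height, channels, bytes_per_value=1):
    """Size in bytes of one input frame with the given layout."""
    return width * height * channels * bytes_per_value


def check_frame(buffer, width, height, channels):
    """Raise ValueError unless the buffer holds exactly one frame."""
    expected = frame_bytes(width, height, channels)
    if len(buffer) != expected:
        raise ValueError(
            f"expected {expected} bytes for {width}x{height}x{channels}, "
            f"got {len(buffer)}"
        )
    return buffer


# Assumption: a 28x28 grayscale MNIST digit, one channel, 8 bits per value.
gray_frame = bytes(frame_bytes(28, 28, 1))
check_frame(gray_frame, 28, 28, 1)  # passes: the sizes agree

# Code written for RGB input would expect three times as much data
# for the same image dimensions.
rgb_size = frame_bytes(28, 28, 3)
print(len(gray_frame), rgb_size)
```

A check like this before handing a frame to the inference code turns a silent out-of-bounds read into an explicit error.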

    I'm now facing another problem: the results I'm getting do not match the expected result (snapshot of the output below).

    Input: digit7.png => this is input image containing the digit '7'
    frame[ 0]: Time on DSP0: 43.01 ms, host: 44.12 ms API overhead: 2.53 %
    1: 2
    2: 4
    3: 3
    4: 7
    5: 1

    As you can see, although the expected result is '7', it is not the best match (it is not even in the top 3). However, the result is correct when tested using TensorFlow.
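One way to make this comparison concrete is to rank the class scores from both runs with the same code. The sketch below assumes the listing above is "rank: class index" and that the network outputs one score per digit class. The helper names are hypothetical, and the score values are made up purely to exercise the helpers; they are not the actual TIDL output.

```python
# Hypothetical helpers for illustration only.

def top_k(scores, k=5):
    """Return the k class indices with the highest scores, best first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def rank_of(scores, label):
    """1-based position of `label` when all classes are sorted by score."""
    return top_k(scores, len(scores)).index(label) + 1


# Made-up scores for the ten digit classes (index = digit).
example_scores = [3, 60, 200, 120, 150, 10, 5, 90, 8, 2]

for position, label in enumerate(top_k(example_scores), start=1):
    print(f"{position}: {label}")

print("rank of digit 7:", rank_of(example_scores, 7))
```

Running the same ranking on the TensorFlow output for the identical input image shows at a glance where the two pipelines start to disagree.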

    Please find the attached file (below) containing imported files, sample input and config file for your reference.
    Processor Linux SDK version: v5.01.00.11. Do let me know if you need any other details.

    It would be helpful if you could shed some light on what might be going wrong.

    /cfs-file/__key/communityserver-discussions-components-files/791/4505.sample.zip

    Regards,
    Manu

  • Manu,

    Thanks for the update and the artifacts to reproduce the issue. We've also run into incorrect results with the MNIST dataset on our end and are investigating. We will post an update as soon as we discover what is causing the failure.

    Ajay

     

  • Hi Ajay,

    Thanks for the information. I look forward to your update on this.

    Regards,
    Manu

  • Hi Manu,

    I am closing this thread for now. Will update here once we root cause the failure.
  • Hi Manu,

    The reported issues have been fixed, and the fix will be available in the next Processor SDK release (version 5.2), scheduled for the end of this month. Please watch for the release. It also provides an MNIST example in the same directory as the other TIDL examples.

    Regards,
    Manisha