
AM5728: caffe-jacinto failed to train

Part Number: AM5728
Other Parts Discussed in Thread: TMDSEVM572X

Hello Champs,

HW: TMDSEVM572X

SW: Processor SDK Linux 06_03_00_106,
        Machine Learning: TIDL, caffe-jacinto.

Customer used caffe-jacinto to train a network, defining the last Convolution layer with a 1*1 kernel and num_output of 1. But when training, it prompted an error message.

There is a similar structure in the TI example, except that the number of output channels is different.

What's wrong?

In the object detection MobileNet example, both conv3_1/sep and conv3_2/sep use a 1*1 kernel with group 1, but the output channel count does not match the group.


Customer's network configuration:
layer {
  name: "fu1_1/dw"
  type: "Convolution"
  bottom: "conv7_3"
  top: "fu1_1/dw"
  convolution_param {
    num_output: 64
    bias_term: false
    pad: 1
    kernel_size: 3
    group: 64
    stride: 1
    weight_filler {
      type: "msra"
    }
    dilation: 1
  }
}
layer {
  name: "fu1_1/dw/bn"
  type: "BatchNorm"
  bottom: "fu1_1/dw"
  top: "fu1_1/dw"
  batch_norm_param {
    scale_bias: true
  }
}
layer {
  name: "relu1_1/dw"
  type: "ReLU"
  bottom: "fu1_1/dw"
  top: "fu1_1/dw"
}
layer {
  name: "fu1_1/sep"
  type: "Convolution"
  bottom: "fu1_1/dw"
  top: "fu1_1/sep"
  convolution_param {
    num_output: 64
    bias_term: false
    pad: 0
    kernel_size: 1
    group: 1
    stride: 1
    weight_filler {
      type: "msra"
    }
    dilation: 1
  }
}
layer {
  name: "fu1_1/sep/bn"
  type: "BatchNorm"
  bottom: "fu1_1/sep"
  top: "fu1_1/sep"
  batch_norm_param {
    scale_bias: true
  }
}
layer {
  name: "relu1_1/sep"
  type: "ReLU"
  bottom: "fu1_1/sep"
  top: "fu1_1/sep"
}
layer {
  name: "fu1_2/dw"
  type: "Convolution"
  bottom: "fu1_1/sep"
  top: "fu1_2/dw"
  convolution_param {
    num_output: 64
    bias_term: false
    pad: 1
    kernel_size: 3
    group: 64
    stride: 1
    weight_filler {
      type: "msra"
    }
    dilation: 1
  }
}
layer {
  name: "fu1_2/dw/bn"
  type: "BatchNorm"
  bottom: "fu1_2/dw"
  top: "fu1_2/dw"
  batch_norm_param {
    scale_bias: true
  }
}
layer {
  name: "relu1_2/dw"
  type: "ReLU"
  bottom: "fu1_2/dw"
  top: "fu1_2/dw"
}
layer {
  name: "fu1_2/sep"
  type: "Convolution"
  bottom: "fu1_2/dw"
  top: "estdmap"
  convolution_param {
    num_output: 1
    bias_term: false
    pad: 0
    kernel_size: 1
    group: 1
    stride: 1
    weight_filler {
      type: "msra"
    }
    dilation: 1
  }
}

Customer's network structure (image attached).

Thanks.
Rgds
Shine

  • Hello, TI!

    Actually, the last several layers of my original network are as follows:

    layer {
      bottom: "conv7_2"
      top: "conv7_3"
      name: "conv7_3"
      type: "Convolution"
      param {
        lr_mult: 1
        decay_mult: 1
      }
      param {
        lr_mult: 2
        decay_mult: 0
      }
      convolution_param {
        num_output: 64
        pad: 2
        dilation: 2
        kernel_size: 3
        weight_filler {
          type: "gaussian"
          std: 0.01
        }
        bias_filler {
          type: "constant"
          value: 0
        }
      }
    }
    layer {
      bottom: "conv7_3"
      top: "estdmap"
      name: "fu1"
      type: "Convolution"
      param {
        lr_mult: 1
        decay_mult: 1
      }
      param {
        lr_mult: 2
        decay_mult: 0
      }
      convolution_param {
        num_output: 1
        kernel_size: 1
        weight_filler {
          type: "gaussian"
          std: 0.01
        }
        bias_filler {
          type: "constant"
          value: 0
        }
      }
    }
    
    layer {
      name: "loss"
      type: "EuclideanLoss"
      bottom: "estdmap"
      bottom: "densitymap"
      top: "loss"
    }

    When training, caffe-jacinto prints the error mentioned above. However, the original caffe framework trains the network successfully, so I guess it could be related to something specific to caffe-jacinto?

    Since the kernel size of the final layer is 1*1, I modified my network the way it is done in MobileNet. It turns out to fail as well...

    Here is a new finding: based on my original architecture, if I change the "num_output" of the layer "fu1" from 1 to other numbers (2, 3, 12, etc.), it also trains successfully.

    Is there a solution that allows the "num_output" of the layer "fu1" to be 1?

    Thanks.

  • Hello, TI!

    Just a reminder.

    If the "num_output" cannot be set to 1, it could be a bug in caffe-jacinto.

  • Hi, I couldn't read the text in the image pasted after "But when training, it prompted an error message." Can you please post the error message at a higher resolution?

  • Sorry about the image.

    I20210914 13:15:29.829465   514 net.cpp:267] TRAIN Top shape for layer 35 'conv7_3' 1 64 41 128 (335872)
    I20210914 13:15:29.829478   514 layer_factory.hpp:172] Creating layer 'fu1' of type 'Convolution'
    I20210914 13:15:29.829481   514 layer_factory.hpp:184] Layer's types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT Bmath:FLOAT
    I20210914 13:15:29.829568   514 net.cpp:200] Created Layer fu1 (36)
    I20210914 13:15:29.829579   514 net.cpp:572] fu1 <- conv7_3
    I20210914 13:15:29.829584   514 net.cpp:542] fu1 -> estdmap
    F20210914 13:15:29.829599   514 conv_dw_layer.cpp:17] Check failed: bottom[0]->channels() == conv_param.num_output() && conv_param.num_output() == conv_param.group() For Depthwise Seperable Convolution, input channels, output channels and groups must have same value. 64 1 1

    When the "num_output" of "fu1" is 1, it fails to start training in caffe-jacinto. It works well with other numbers, as I mentioned above.

    When the "num_output" of "fu1" is 1, it trains successfully in caffe.

    Is it because of some limitations related to group convolution in caffe-jacinto?

  • Please comment out three lines in layer_factory.cpp as shown below, recompile caffe-jacinto and it should work:

    https://git.ti.com/cgit/jacinto-ai/caffe-jacinto/tree/src/caffe/layer_factory.cpp#n62

      //if(conv_param.num_output() == conv_param.group()) {
      //  return CreateLayerBase<ConvolutionDepthwiseLayer>(param, ftype, btype);
      //}


    Details:
    ConvolutionDepthwiseLayer is just a faster implementation specifically for Depthwise layers - it is not mandatory.

    The check shown above should have ensured that the input channels, output channels, and groups are the same, as is done in conv_dw_layer.cpp (https://git.ti.com/cgit/jacinto-ai/caffe-jacinto/tree/src/caffe/layers/conv_dw_layer.cpp#n17).
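
    For reference, the failing check inside ConvolutionDepthwiseLayer looks roughly like the sketch below (reconstructed from the log output above; see the conv_dw_layer.cpp link for the exact source):

      // Sketch of the setup-time check in conv_dw_layer.cpp, reconstructed
      // from the log message above; the actual source may differ slightly.
      // bottom[0] is the input blob, which is only known while the network
      // is being built. That is exactly the information layer_factory.cpp
      // does not have.
      CHECK(bottom[0]->channels() == conv_param.num_output()
            && conv_param.num_output() == conv_param.group())
          << "For Depthwise Seperable Convolution, input channels, output "
             "channels and groups must have same value. "
          << bottom[0]->channels() << " " << conv_param.num_output() << " "
          << conv_param.group();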

    But the input channel count is not available inside layer_factory.cpp, so the condition used there to instantiate ConvolutionDepthwiseLayer is not fully correct.
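
    To illustrate with the customer's "fu1" layer: num_output is 1 and group defaults to 1, so the factory's test num_output() == group() passes and ConvolutionDepthwiseLayer is created even though conv7_3 has 64 channels; the layer's own check then aborts, as seen in the log. As a hypothetical alternative to commenting the lines out (a sketch under my own assumptions, not the actual caffe-jacinto code), a guard that additionally requires group > 1 would skip this case, although it still cannot verify the input channel count at factory time:

      // Hypothetical sketch only, not the actual layer_factory.cpp code.
      // Requiring group > 1 keeps num_output = 1, group = 1 layers such as
      // "fu1" on the generic convolution path, while genuine depthwise
      // layers such as "fu1_1/dw" (num_output = 64, group = 64) still take
      // the fast ConvolutionDepthwiseLayer path.
      if (conv_param.group() > 1 &&
          conv_param.num_output() == conv_param.group()) {
        return CreateLayerBase<ConvolutionDepthwiseLayer>(param, ftype, btype);
      }
      // Otherwise fall through to the generic convolution implementation.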
    
    

    Hope this helps.

  • Thank you for your quick response!

    I'll give some feedback after I try it.

    Thanks! 

  • It trains successfully.

    Thanks!