
AM5728: caffe-jacinto failed to train

Part Number: AM5728
Other Parts Discussed in Thread: TMDSEVM572X

Hello Champs,

HW: TMDSEVM572X

SW: Processor SDK Linux 06_03_00_106,
        Machine Learning: TIDL, caffe-jacinto.

Customer used caffe-jacinto to train a network, defining the last Convolution layer with a 1*1 kernel and num_output of 1. But when training, it prompted an error message.

There is a similar structure in the TI example, except that the number of output channels is different.

What's wrong?

In the object detection MobileNet example, both conv3_1/sep and conv3_2/sep use a 1*1 kernel with group 1, but the output channel count does not match the group.


Customer's network configuration:
layer {
  name: "fu1_1/dw"
  type: "Convolution"
  bottom: "conv7_3"
  top: "fu1_1/dw"
  convolution_param {
    num_output: 64
    bias_term: false
    pad: 1
    kernel_size: 3
    group: 64
    stride: 1
    weight_filler {
      type: "msra"
    }
    dilation: 1
  }
}
layer {
  name: "fu1_1/dw/bn"
  type: "BatchNorm"
  bottom: "fu1_1/dw"
  top: "fu1_1/dw"
  batch_norm_param {
    scale_bias: true
  }
}
layer {
  name: "relu1_1/dw"
  type: "ReLU"
  bottom: "fu1_1/dw"
  top: "fu1_1/dw"
}
layer {
  name: "fu1_1/sep"
  type: "Convolution"
  bottom: "fu1_1/dw"
  top: "fu1_1/sep"
  convolution_param {
    num_output: 64
    bias_term: false
    pad: 0
    kernel_size: 1
    group: 1
    stride: 1
    weight_filler {
      type: "msra"
    }
    dilation: 1
  }
}
layer {
  name: "fu1_1/sep/bn"
  type: "BatchNorm"
  bottom: "fu1_1/sep"
  top: "fu1_1/sep"
  batch_norm_param {
    scale_bias: true
  }
}
layer {
  name: "relu1_1/sep"
  type: "ReLU"
  bottom: "fu1_1/sep"
  top: "fu1_1/sep"
}
layer {
  name: "fu1_2/dw"
  type: "Convolution"
  bottom: "fu1_1/sep"
  top: "fu1_2/dw"
  convolution_param {
    num_output: 64
    bias_term: false
    pad: 1
    kernel_size: 3
    group: 64
    stride: 1
    weight_filler {
      type: "msra"
    }
    dilation: 1
  }
}
layer {
  name: "fu1_2/dw/bn"
  type: "BatchNorm"
  bottom: "fu1_2/dw"
  top: "fu1_2/dw"
  batch_norm_param {
    scale_bias: true
  }
}
layer {
  name: "relu1_2/dw"
  type: "ReLU"
  bottom: "fu1_2/dw"
  top: "fu1_2/dw"
}
layer {
  name: "fu1_2/sep"
  type: "Convolution"
  bottom: "fu1_2/dw"
  top: "estdmap"
  convolution_param {
    num_output: 1
    bias_term: false
    pad: 0
    kernel_size: 1
    group: 1
    stride: 1
    weight_filler {
      type: "msra"
    }
    dilation: 1
  }
}

Customer's network structure (image attached).

Thanks.
Rgds
Shine

  • Hello, TI!

    Actually, the last several layers of my original network are as follows:

    layer {
      bottom: "conv7_2"
      top: "conv7_3"
      name: "conv7_3"
      type: "Convolution"
      param {
        lr_mult: 1
        decay_mult: 1
      }
      param {
        lr_mult: 2
        decay_mult: 0
      }
      convolution_param {
        num_output: 64
        pad: 2
        dilation: 2
        kernel_size: 3
        weight_filler {
          type: "gaussian"
          std: 0.01
        }
        bias_filler {
          type: "constant"
          value: 0
        }
      }
    }
    layer {
      bottom: "conv7_3"
      top: "estdmap"
      name: "fu1"
      type: "Convolution"
      param {
        lr_mult: 1
        decay_mult: 1
      }
      param {
        lr_mult: 2
        decay_mult: 0
      }
      convolution_param {
        num_output: 1
        kernel_size: 1
        weight_filler {
          type: "gaussian"
          std: 0.01
        }
        bias_filler {
          type: "constant"
          value: 0
        }
      }
    }
    
    layer {
      name: "loss"
      type: "EuclideanLoss"
      bottom: "estdmap"
      bottom: "densitymap"
      top: "loss"
    }

    When training, caffe-jacinto prints the error mentioned above. However, the original caffe framework trains the network successfully, so I guess it could be related to something specific to caffe-jacinto?

    Since the kernel size of the final layer is 1*1, I modified my network the way it is done in MobileNet. It turns out to fail as well...

    Here is a new finding: based on my original architecture, if I change the "num_output" of the layer "fu1" from 1 to other numbers (2, 3, 12, etc.), it also trains successfully.

    Is there a solution that allows the "num_output" of the layer "fu1" to be 1?

    Thanks.

  • Hello, TI!

    Just a reminder.

    If the "num_output" cannot be set to 1, it could be a bug in caffe-jacinto.

  • Hi, I couldn't read the text in the image pasted after "But when training, it prompted an error message." Can you please post the error message at a higher resolution?

  • Sorry about the image.

    I20210914 13:15:29.829465   514 net.cpp:267] TRAIN Top shape for layer 35 'conv7_3' 1 64 41 128 (335872)
    I20210914 13:15:29.829478   514 layer_factory.hpp:172] Creating layer 'fu1' of type 'Convolution'
    I20210914 13:15:29.829481   514 layer_factory.hpp:184] Layer's types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT Bmath:FLOAT
    I20210914 13:15:29.829568   514 net.cpp:200] Created Layer fu1 (36)
    I20210914 13:15:29.829579   514 net.cpp:572] fu1 <- conv7_3
    I20210914 13:15:29.829584   514 net.cpp:542] fu1 -> estdmap
    F20210914 13:15:29.829599   514 conv_dw_layer.cpp:17] Check failed: bottom[0]->channels() == conv_param.num_output() && conv_param.num_output() == conv_param.group() For Depthwise Seperable Convolution, input channels, output channels and groups must have same value. 64 1 1

    When the "num_output" of "fu1" is 1, it fails to start training in caffe-jacinto. It works well with other numbers, as I mentioned above.

    When the "num_output" of "fu1" is 1, it trains successfully in caffe.

    Is it because of some limitations related to group convolution in caffe-jacinto?

  • Please comment out three lines in layer_factory.cpp as shown below, recompile caffe-jacinto and it should work:

    https://git.ti.com/cgit/jacinto-ai/caffe-jacinto/tree/src/caffe/layer_factory.cpp#n62

      //if(conv_param.num_output() == conv_param.group()) {
      //  return CreateLayerBase<ConvolutionDepthwiseLayer>(param, ftype, btype);
      //}


    Details:
    ConvolutionDepthwiseLayer is just a faster implementation specifically for Depthwise layers - it is not mandatory.

    The check shown above should have ensured that the input channels, output channels, and groups are the same, as is done in conv_dw_layer.cpp (https://git.ti.com/cgit/jacinto-ai/caffe-jacinto/tree/src/caffe/layers/conv_dw_layer.cpp#n17).
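
    For reference, the failing check inside ConvolutionDepthwiseLayer looks roughly like the sketch below (reconstructed from the log output above; see the conv_dw_layer.cpp link for the exact source):

      // Sketch of the setup-time check in conv_dw_layer.cpp, reconstructed
      // from the log message above; the actual source may differ slightly.
      // bottom[0] is the input blob, which is only known while the network
      // is being built. That is exactly the information layer_factory.cpp
      // does not have.
      CHECK(bottom[0]->channels() == conv_param.num_output()
            && conv_param.num_output() == conv_param.group())
          << "For Depthwise Seperable Convolution, input channels, output "
             "channels and groups must have same value. "
          << bottom[0]->channels() << " " << conv_param.num_output() << " "
          << conv_param.group();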

    But the input channel count is not available inside layer_factory.cpp, so the condition used there to instantiate ConvolutionDepthwiseLayer is not fully correct.
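
    To illustrate with the customer's "fu1" layer: num_output is 1 and group defaults to 1, so the factory's test num_output() == group() passes and ConvolutionDepthwiseLayer is created even though conv7_3 has 64 channels; the layer's own check then aborts, as seen in the log. As a hypothetical alternative to commenting the lines out (a sketch under my own assumptions, not the actual caffe-jacinto code), a guard that additionally requires group > 1 would skip this case, although it still cannot verify the input channel count at factory time:

      // Hypothetical sketch only, not the actual layer_factory.cpp code.
      // Requiring group > 1 keeps num_output = 1, group = 1 layers such as
      // "fu1" on the generic convolution path, while genuine depthwise
      // layers such as "fu1_1/dw" (num_output = 64, group = 64) still take
      // the fast ConvolutionDepthwiseLayer path.
      if (conv_param.group() > 1 &&
          conv_param.num_output() == conv_param.group()) {
        return CreateLayerBase<ConvolutionDepthwiseLayer>(param, ftype, btype);
      }
      // Otherwise fall through to the generic convolution implementation.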
    
    

    Hope this helps.

  • Thank you for your quick response!

    I'll give some feedback after I try it.

    Thanks! 

  • It trains successfully.

    Thanks!