
TDA4VM: A precision problem of TIDL multi-batch model

Part Number: TDA4VM

Abstract: Multi-batch model inference in PC emulation mode has a precision problem caused by a padding fill error: during the fill performed by TIDL_layerPadding at a certain layer, the index calculation is incorrect, so valid output data is overwritten with padValue(0). We have a proposed solution; please review it.

Description: Inference of a multi-batch model (numBatches = 10) in PC emulation mode runs into a precision problem. We use the same input data in every batch and expect identical outputs, but the remaining batches produce results that differ from the first batch.

1. Model Importer Operation

ONNX model: resnet18v2.onnx (https://s3.amazonaws.com/onnx-model-zoo/resnet/resnet18v2/resnet18v2.onnx)

Shell command: ./tidl_model_import.out resnet18v2_importer.txt

Please refer to the attached resnet18v2.zip for details about the configuration and model.

2. Model Inference Operation

Environment: Ubuntu 20, x86

First, we create the params with the TIDL model .bin files, following the example in vision_apps. Next, we create the tivxTIDLNode and the vx_graph. Then we call vxVerifyGraph and vxProcessGraph.

In our program, the TIOVX-related symbols are accessed by linking against libvx_tidl_rt.so.

The input data of the vx_tensor consists of 10 copies of the same image; refer to ILSVRC2012_val_00008685.bin.
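
For reference, the following is a minimal, self-contained sketch of how the input tensor can be populated with 10 copies of the same image using the standard OpenVX API. The 224x224x3 input size, the batch-outermost dimension order, and the file handling are assumptions for illustration only; the actual sizes and layout come from the imported TIDL network, and the graph itself is built with tivxTIDLNode as in the vision_apps example.

/* Illustrative sketch only -- not our actual application code. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <VX/vx.h>

#define IN_W        224
#define IN_H        224
#define IN_C        3
#define NUM_BATCHES 10

int main(void)
{
    vx_context context = vxCreateContext();

    /* Assumption: batch is the outermost (slowest varying) dimension. */
    vx_size dims[4] = { IN_W, IN_H, IN_C, NUM_BATCHES };
    vx_tensor input = vxCreateTensor(context, 4, dims, VX_TYPE_UINT8, 0);

    /* Read the preprocessed image once (ILSVRC2012_val_00008685.bin). */
    size_t imgSize = (size_t)IN_W * IN_H * IN_C;
    uint8_t *img = malloc(imgSize);
    FILE *fp = fopen("ILSVRC2012_val_00008685.bin", "rb");
    if ((fp == NULL) || (fread(img, 1, imgSize, fp) != imgSize))
    {
        printf("failed to read input image\n");
        return -1;
    }
    fclose(fp);

    /* Copy the same image into every batch slice of the tensor. */
    vx_size stride[4] = { sizeof(uint8_t), IN_W, (vx_size)IN_W * IN_H,
                          (vx_size)IN_W * IN_H * IN_C };
    for (vx_size b = 0; b < NUM_BATCHES; b++)
    {
        vx_size start[4] = { 0, 0, 0, b };
        vx_size end[4]   = { IN_W, IN_H, IN_C, b + 1 };
        vxCopyTensorPatch(input, 4, start, end, stride, img,
                          VX_WRITE_ONLY, VX_MEMORY_TYPE_HOST);
    }

    /* The tensor is then passed to tivxTIDLNode() as in the vision_apps
       example, followed by vxVerifyGraph() and vxProcessGraph(). */
    free(img);
    vxReleaseTensor(&input);
    vxReleaseContext(&context);
    return 0;
}
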

3. Error Information

As shown in the following picture, the output of the other batches differs from that of the first batch.

4. Analysis and Solution

We dumped the output data of every layer during inference with "traceWriteLevel = 3", compared the per-batch data in each dump file, and found that the batches start to differ at the 43rd layer, which is a Pooling layer.
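
For reference, below is a hypothetical standalone helper (not part of TIDL) sketching how the per-batch slices inside one layer trace dump can be compared. It assumes the dump is a raw binary whose size is an exact multiple of numBatches and that the batch slices are stored back to back.

/* Illustrative comparison helper, assumptions as described above. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[])
{
    if (argc < 2) { printf("usage: %s <trace_dump.bin>\n", argv[0]); return -1; }

    const long numBatches = 10;
    FILE *fp = fopen(argv[1], "rb");
    if (fp == NULL) { return -1; }

    fseek(fp, 0, SEEK_END);
    long total = ftell(fp);
    long batchSize = total / numBatches;
    rewind(fp);

    unsigned char *data = malloc(total);
    if (fread(data, 1, total, fp) != (size_t)total) { return -1; }
    fclose(fp);

    /* Every batch used the same input, so every slice should match batch 0. */
    for (long b = 1; b < numBatches; b++)
    {
        if (memcmp(data, data + (b * batchSize), batchSize) != 0)
        {
            printf("batch %ld differs from batch 0\n", b);
        }
    }
    free(data);
    return 0;
}
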

Further analysis shows that the problem is caused by the function TIDL_layerPadding, which is called after the 42nd layer (BatchReshape) is processed. In this function, output values that were originally correct are overwritten with padValue(0).

/* tidl_alg_utils.c TIDL_layerPadding */
if (((padRFillZeros > 0) || (TIDL_PADDING_TYPE_TOP_LEFT == paddingType)) && (TIDL_PADDING_TYPE_PAD_LAYER != paddingType))
{
    status = TIDL_FillPaddedRows((uint8_t *)outPtrs[j], ...); // not called for the 42nd layer
}
if((padC > 0) && (TIDL_PADDING_TYPE_PAD_LAYER != paddingType) && (status == IALG_EOK))
{
    status = TIDL_FillPaddedCols((uint8_t *)outPtrs[j], ...); // does not update bufInfo->bufHeight
}

The root cause is that the expression "bufInfo->bufHeight = bufInfo->bufHeight / numBatches" exists in TIDL_FillPaddedRows but not in TIDL_FillPaddedCols. Because TIDL_FillPaddedRows is not called for the 42nd layer, bufInfo->bufHeight is never updated, so the output indices computed for filling are wrong. For the 42nd layer, bufInfo->bufHeight inside TIDL_FillPaddedCols is 5120, whereas it should normally be 512.
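
The following small standalone program only illustrates the arithmetic: any index derived from the un-divided height of 5120 is off by a factor of numBatches compared to the expected per-batch height of 512. The per-batch offset formula and the example line pitch are assumptions for illustration, not the actual TIDL code.

/* Illustration of the off-by-numBatches indexing, not the TIDL source. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const int32_t numBatches     = 10;
    const int32_t dumpedHeight   = 5120;                      /* bufHeight observed in TIDL_FillPaddedCols */
    const int32_t expectedHeight = dumpedHeight / numBatches; /* 512, the per-batch height */
    const int32_t linePitch      = 16;                        /* arbitrary example pitch */

    /* Whether the real code derives the per-batch offset exactly this way is
       an assumption; the point is only that indices based on the un-divided
       height land far past the intended batch and overwrite valid data. */
    for (int32_t b = 0; b < 3; b++)
    {
        printf("batch %d: offset with divided height = %d, with un-divided height = %d\n",
               b, b * expectedHeight * linePitch, b * dumpedHeight * linePitch);
    }
    return 0;
}
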

Solution: Move the update of bufInfo->bufHeight from TIDL_FillPaddedRows up to its caller TIDL_layerPadding, as follows:

/* tidl_alg_utils.c TIDL_layerPadding */
sBufferInfo_t *bufInfo = &intAlgHandle->perfSimOutput->sdataFlowInfo[i].bufInfo[OUT_FEAT_MAP][WRITE];
bufInfo->bufHeight = bufInfo->bufHeight / TIDLLayer->outData.dimValues[TIDL_DIM_BATCH]; // update bufHeight before filling
if (((padRFillZeros > 0) || (TIDL_PADDING_TYPE_TOP_LEFT == paddingType)) && (TIDL_PADDING_TYPE_PAD_LAYER != paddingType))
{
    status = TIDL_FillPaddedRows((uint8_t *)outPtrs[j], ...); // still skipped for the 42nd layer
}
if((padC > 0) && (TIDL_PADDING_TYPE_PAD_LAYER != paddingType) && (status == IALG_EOK))
{
    status = TIDL_FillPaddedCols((uint8_t *)outPtrs[j], ...); // now sees the updated bufInfo->bufHeight
}

The output data is correct after this modification:

5. Questions and Requirements

a. Is the proposed solution feasible, and are there any points we have not considered?

b. Will this issue be fixed in later versions?