AM69A: Normalization on Deep Learning Accelerators

Part Number: AM69A

In our image classification pipeline, the TI model-zoo models require fp32 input.

We are currently converting UINT8 image data to fp32 on the ARM processor, normalizing it, and then passing it to the Deep Learning Accelerator.

If the Deep Learning Accelerator were capable of performing the normalization, we could bypass the fp32 conversion and directly input the UINT8 images, which should reduce latency.

Is this feasible?
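
For reference, this is roughly what we do on the ARM side today (the mean/scale values below are placeholders for illustration, not our actual ones):

    import numpy as np

    # Placeholder normalization constants, for illustration only
    MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)
    SCALE = np.array([0.017125, 0.017507, 0.017429], dtype=np.float32)

    def preprocess_on_arm(img_hwc_uint8):
        # uint8 -> fp32, per-channel normalization, then HWC -> NCHW with a batch dim
        x = img_hwc_uint8.astype(np.float32)
        x = (x - MEAN) * SCALE
        return np.transpose(x, (2, 0, 1))[np.newaxis, ...]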

  • Hi Mitani-san,

    I may not be understanding your question, but the compiled models run natively with uint8/int8, and image data is usually three channels of 8-bit data: R 0-255, G 0-255, and B 0-255. So the data should stay in 8-bit format. If you need to normalize it, say because one channel is significantly off from the others (which would be unusual), you can add a batch normalization layer explicitly or adjust the import configuration file (TIDLRT syntax, see below).

    #inDataNorm = 1
    #inScale = 0.00392156862745098 0.00392156862745098 0.00392156862745098
    #inMean = 0 0 0
    #inQuantFactor = 255 255 255

    You may only need inDataNorm = 1. If you also need the scale and mean values, the numbers would of course change based on your input data.
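
    For reference, what inDataNorm = 1 with those scale/mean values works out to per channel is roughly the following (a NumPy sketch of the math only, not the exact TIDL kernel; as I understand it, TIDL folds this into the imported network so it runs on the accelerator):

    import numpy as np

    # Conceptual equivalent of inDataNorm = 1 with inScale = 1/255 and inMean = 0
    in_mean = np.array([0.0, 0.0, 0.0], dtype=np.float32)
    in_scale = np.array([1.0 / 255.0] * 3, dtype=np.float32)   # 0.00392156862745098

    def normalize_uint8(img_hwc_uint8):
        # per-channel: (pixel - mean) * scale
        return (img_hwc_uint8.astype(np.float32) - in_mean) * in_scale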

    Regards,

    Chris

  • Thank you for always responding so quickly.

    I apologize for not providing enough explanation. For example, when you display ResNet18 from the TI model zoo (software-dl.ti.com/.../resnet18.onnx) in Netron, it looks like this.

    The code using onnxruntime is as follows; if the input 'x' is not an fp32 tensor, it results in an error.

        import numpy as np
        import onnxruntime as ort

        # model_path, provider_options, input_name, and x are set elsewhere in our pipeline
        so = ort.SessionOptions()
        ep_list = ['TIDLExecutionProvider', 'CPUExecutionProvider']
        session = ort.InferenceSession(model_path, providers=ep_list, provider_options=[provider_options, {}], sess_options=so)

        out = session.run(None, {input_name: np.array(x)})

    That is the reason I inferred that the input format of the model is fp32, but please let me know if I have misunderstood anything.

    Thank you.

  • Mitani-san,

    I see you're running this on the host right now, so you're correct: before you compile it, the ONNX model natively takes float32 input. You first need to read in your image file (in this case a PNG with three channels, R, G, B):

    from PIL import Image
    import numpy as np
    
    def get_rgb_channels(image_path):
        """Reads a PNG image and returns its RGB channels as NumPy arrays.
        
        Args:
            image_path (str): The path to the PNG image file.
        
        Returns:
            tuple: A tuple containing three NumPy arrays (red, green, blue).
        """
        img = Image.open(image_path).convert('RGB')
        img_array = np.array(img)
        red_channel = img_array[:, :, 0]
        green_channel = img_array[:, :, 1]
        blue_channel = img_array[:, :, 2]
        return red_channel, green_channel, blue_channel
    
    # Example usage:
    image_file = 'your_image.png'  # Replace with your image path
    red, green, blue = get_rgb_channels(image_file)
    
    

    After calling get_rgb_channels(image_file), you will have the red, green, and blue 8-bit channels, each as a NumPy array. To convert one to float32, simply cast the 8-bit array:

    float32_red = red.astype(np.float32)

    Then you can use it as input into the model.
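
    Putting it together, something like this builds the full NCHW float32 tensor for the model (assuming the usual 1x3x224x224 ResNet18 input, with red, green, blue from get_rgb_channels above and session created as in your snippet; resize the image beforehand if it is not already 224x224):

    import numpy as np

    # Stack the three 8-bit channels and convert to the NCHW float32 layout the model expects
    input_tensor = np.stack([red, green, blue]).astype(np.float32)   # 3 x H x W
    input_tensor = np.expand_dims(input_tensor, axis=0)              # 1 x 3 x H x W

    input_name = session.get_inputs()[0].name
    out = session.run(None, {input_name: input_tensor})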

    Additionally, here is some code to test this with random data (no image needed):

    import time
    import numpy as np
    import onnxruntime as ort

    # model_path and provider_options are defined as in the earlier snippet
    so = ort.SessionOptions()
    ep_list = ['TIDLExecutionProvider', 'CPUExecutionProvider']
    session = ort.InferenceSession(model_path, providers=ep_list, provider_options=[provider_options, {}], sess_options=so)

    input_details = session.get_inputs()
    output_details = session.get_outputs()
    input_dict = {}
    output_dict = {}

    np.random.seed(0)
    for i in range(len(input_details)):
        # Generate random input data matching each input's declared type
        if input_details[i].type == 'tensor(float)':
            input_data = np.random.randn(*input_details[i].shape).astype(np.float32)
        elif input_details[i].type == 'tensor(int64)':
            input_data = np.random.randn(*input_details[i].shape).astype(np.int64)
        elif input_details[i].type == 'tensor(uint8)':
            input_data = np.random.randn(*input_details[i].shape).astype(np.uint8)
        elif input_details[i].type == 'tensor(int32)':
            input_data = np.random.randn(*input_details[i].shape).astype(np.int32)
        else:
            input_data = np.random.randn(*input_details[i].shape).astype(np.float32)
        input_dict[input_details[i].name] = input_data

    start_time = time.time()
    output = list(session.run(None, input_dict))
    print("Inference time: %.3f s" % (time.time() - start_time))

    for i in range(len(output_details)):
        output_dict[output_details[i].name] = output[i]

  • Thank you for providing the detailed code.
    The original intent of my question was to ask whether the fp32 conversion and normalization in this code can be performed on the AM69A's Deep Learning Accelerator.

    Specifically, is it possible to add a block like the following (sketched below) to the input of a model from the model zoo, so that the normalization is handled by the Deep Learning Accelerator on the AM69A instead of by the ARM cores?
    Additionally, can we expect an improvement in latency as a result?
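
    For concreteness, here is a rough sketch of the kind of block I mean, prepended with the onnx Python helper API so that the model itself takes uint8 and performs the normalization (the mean/scale values, tensor names, and file names below are placeholders, and I assume the model has fixed input dimensions):

        import onnx
        from onnx import helper, TensorProto

        model = onnx.load("resnet18.onnx")
        graph = model.graph

        orig_input = graph.input[0]
        orig_name = orig_input.name   # the float32 input the model currently expects
        dims = [d.dim_value for d in orig_input.type.tensor_type.shape.dim]

        # New uint8 graph input with the same shape
        uint8_input = helper.make_tensor_value_info("x_uint8", TensorProto.UINT8, dims)

        # Per-channel mean/scale constants (placeholder values), broadcastable over NCHW
        mean = helper.make_tensor("norm_mean", TensorProto.FLOAT, [1, 3, 1, 1], [0.0, 0.0, 0.0])
        scale = helper.make_tensor("norm_scale", TensorProto.FLOAT, [1, 3, 1, 1], [1.0 / 255.0] * 3)
        graph.initializer.extend([mean, scale])

        # uint8 -> float32, then (x - mean) * scale, producing the tensor name the network consumes
        cast = helper.make_node("Cast", ["x_uint8"], ["x_f32"], to=TensorProto.FLOAT)
        sub = helper.make_node("Sub", ["x_f32", "norm_mean"], ["x_centered"])
        mul = helper.make_node("Mul", ["x_centered", "norm_scale"], [orig_name])
        for node in reversed([cast, sub, mul]):
            graph.node.insert(0, node)

        # Swap the graph input to the new uint8 one
        del graph.input[0]
        graph.input.insert(0, uint8_input)

        onnx.checker.check_model(model)
        onnx.save(model, "resnet18_uint8_norm.onnx")
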
    Thank you.

  • Hi Mitani-san,

    You can set numParamBits in TIDLRT, or tensor_bits in OSRT, to 16 or 32. For OSRT, in examples/osrt_python/common_utils.py change tensor_bits = 8 to 16 (or 32). For TIDLRT, in your import file add a line like:

    numParamBits            = 16

    or 

    numParamBits            = 32

    Please note that the performance will degrade substantially once you do this. A better way is to identify the layers that need more resolution and set them to 16-bit mode with:

    params16bitNamesList = "layer1, layer2, layerN"
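
    If you are using the OSRT flow rather than a TIDLRT import file, the same knobs are passed through the compile-time options dict. A sketch (the key names follow the edgeai-tidl-tools osrt_python examples; please double-check them against your SDK version):

    import os

    # Compile-time options passed to TIDLCompilationProvider -- key names as in the
    # edgeai-tidl-tools osrt_python examples; verify against your SDK version.
    compile_options = {
        'tidl_tools_path': os.environ['TIDL_TOOLS_PATH'],
        'artifacts_folder': './model-artifacts/resnet18',
        'tensor_bits': 8,   # whole-network precision: 8 (default), 16, or 32
        # Mixed precision: keep the network at 8 bit and promote only selected layers
        'advanced_options:params_16bit_names_list': 'layer1, layer2, layerN',
    }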

    Regards,

    Chris

  • Thank you for your response.
    If we set the word length appropriately for the calculations, it seems that both the CNN preprocessing and postprocessing can be performed on the Deep Learning Accelerator. I will investigate whether it is faster to execute them on ARM or on the Deep Learning Accelerator.
    It seems that I am not fully utilizing the Deep Learning Accelerator on the AM69A in my application, so I will look into ways to make better use of it.
    Thank you very much.