
TDA4VH-Q1: Queries related to MMALIB Convolve Row

Part Number: TDA4VH-Q1

Hello,

SDK version: 9.2.0.5. 

We are currently trying to build a custom kernel that utilizes the convolve_row API from MMALIB. To support our development, we started looking at the sample example code provided at

mmalib_09_02_00_08/ti/mmalib/src/cnn_c7xmma/MMALIB_CNN_convolve_row_ixX_ixX_oxX

As part of this, we have a few queries listed below for which we need your support:

1. Could you please give us a brief explanation of what parameters are considered in this test? I ask because when I looked at the test_data I found very large kernel and test feature-map arrays, but in MMALIB_CNN_convolve_row_ixX_ixX_oxX_idat.c the kernel width and kernel height are both 3 (this was confusing because the regIn and refKernel matrices, when inspected, are very large). Furthermore, the refIn matrices for all the test cases seem to contain only 0s and no other numbers. Please help us understand what exactly is being tested in this sample code.

2. In the sample test code MMALIB_CNN_convolve_row_ixX_ixX_oxX_d.c, predicate registers are created only if the stridewidth and strideheight are 1. In our case, however, we plan to use a stride of 4 in both directions. Should we implement predicate registers as well?

3. Within MMALIB_CNN_convolve_row_ixX_ixX_oxX_d.c, the values of strideshift are less than stridewidth and strideheight. Please explain the significance of this, and what the value should be in our case.

4. What exactly should the values of ValidColsIn, ValidRowsIn, inChOffset, subN, ValidColsPerRowIn, outputPitchPerRow, and InputPitchPerRow be? Please elaborate; the explanation in the documentation is not clear enough and is confusing.

5. Please provide some more explanation on how the following equations were derived:

Kindly provide your responses for each question. 

  • Hi Srikar,

    Could we get more information regarding your use case and custom kernel so we can better support you? 

    Generally the CNN MMALIB functions are intertwined with our TI Deep Learning (TIDL) framework and used within our TIDL offerings rather than in standalone applications. If you are using this in a deep learning context, I would point you to see what convolution layers we support out of the box. 

    Best,

    Asha

  • Hi Asha, 

We are trying to develop a custom kernel that converts a 4x4 image sensor CFA pattern to a 2x2 CFA pattern. This conversion is done using 4 levels of convolution on a raw image, with pre-defined kernels: one 5x5 and two 3x3 kernels will be used to perform the convolutions on each frame. Hence, we wanted to use the convolve_row API. However, if you have any other suggestion for us, we'd be glad to implement it.

Furthermore, when you mention that MMALIB functions are linked to TIDL, does that mean we cannot use these APIs for a generic application such as the one described above? What are the sample tests provided for each API used for?

Please advise.

  • Hi Srikar,

MMALIB convolution calls should not be used outside of the TIDL framework. If you are looking for an image pipeline, OpenVX provides this functionality.

    For example, OpenVX calls to convolve an image.

    https://registry.khronos.org/OpenVX/specs/1.2/html/de/d78/group__group__convolution.html 

There is also a VXLIB call (based on OpenVX) for the platform.

    https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-j784s4/08_06_01_03/exports/docs/vxlib/docs/doxygen/html/vxlib_html/convolution.html 

  • Hi Srikar,

I want to clarify Chris' responses based on information gained from your FAE. I apologize for the delay in getting to your question.

    Can you expand on what your problem geometry is? As in image sizes, data types, etc. 

Are you looking to do 2D convolutions, or are you specifically looking at doing a CNN dense convolution? Convolve_row is an implementation of a CNN-style dense convolution, designed and optimized to work on tensors (hence the link to our deep learning framework), which might not be aligned with your use case based on the information you have provided.

If you are looking for an optimal C7x solution for 2D convolution, VXLIB_convolve, which is optimized for the C7x architecture (documented here; source code is available in the 9.2 SDK under vxlib_09_02_00_04/ for the specific implementation details), could be more appropriate depending on your problem geometry and the style of convolution you are trying to achieve. Note this is similar to MMALIB in the sense that it is a standalone baremetal library that you would need to integrate into your application.

    Best,

    Asha

  • Hey Asha,

Our image size is 3840x2160 with raw 16-bit data.

    Are you looking to do 2D convolutions or are you specifically looking at a doing a CNN dense convolution?

We are looking to perform 2D convolution on the image with 4 different kernels separately.

    Note this is similar to MMALIB in the sense that it is a standalone baremetal library that you would need to integrate into your application. 

We observed that the vxlib_convolve API doesn't allow for strides within the convolution. Could you give us more clarity on whether we can implement the convolution operation using strides (i.e., skipping a few pixels along the x and y axes while performing the convolution)? Also, in the vxlib_convolve documentation we read that the convolution for 3x3, 5x5, and 7x7 kernels is optimized; however, looking into the source code, we see that the API is implemented with nested for loops, as shown in the image below. Could you please elaborate on what exactly the optimization performed for these kernel sizes is?

    I also have one more question:

If I need to configure a custom kernel that takes a raw image input, how do I add this as an input argument using the vxAddParameterToKernel API? I don't see a VX_TYPE defined for raw images. I did try developing a kernel using TIVX_TYPE_RAW_IMAGE, declared in tivx_ext_raw_image.h, and I am currently able to use this type within vxAddParameterToKernel. However, I am concerned that it is not part of the vx_type_e enum mentioned in the declaration of the API. Would this bring any implications/complications in later stages, given that the type is not part of the vx_type_e enum?

  • Hi Srikar,

I do apologize for the delay in getting back to you regarding your issue, as I was on extended business travel and needed to align internally with our teams as well.

We are looking to perform 2D convolution on the image with 4 different kernels separately.

In this case, we would not suggest going with the MMALIB kernel, as a stride of 4 is not supported for the convolution sizes you are looking at. Also, at least based on your stride requirements, I think you would need to modify source code to achieve such functionality (which you can only do with VXLIB).

We observed that the vxlib_convolve API doesn't allow for strides within the convolution.

If I am understanding correctly, what you want is some form of strided convolution that skips pixels (something similar is described here). The stride parameter stride_y is the only stride parameter available, but it is used to process the image in separate blocks (not to skip between pixels). So this feature would be something you would need to add yourself, but you can build upon the existing function.

Could you please elaborate on what exactly the optimization performed for these kernel sizes is?

VXLIB_convolve_cn.cpp is what we would consider the "natural C" implementation, meaning it is the reference for how the algorithm works in C and is not optimized.

    For the optimized code see the VXLIB_convolve_ci.cpp file and let me know if you have further questions regarding the implementation that we can discuss. 

For the TIOVX questions, I would recommend you create a new ticket - this will allow it to be addressed more easily by our OpenVX expert, and we can use this thread to discuss convolution.

    Best,

    Asha

  • Hi Asha, 

Per my conversation with Srikar and his team, we had a few follow-ups:

    • Stoneridge – Look at the optimized implementation of VXLIB_convolve (contained in VXLIB_convolve_ci.cpp), in particular the 3x3 and 5x5 implementations, which are closest to what they are looking for. The idea is that they would build on/modify this base implementation and leverage what we already offer. In particular, it would be good to look at the following:
        • Where are the source codes for the algos? The __vfir4hw_vww() API.
        • VXLIB_convolve_3x3_init_ci() and VXLIB_convolve_5x5_init_ci() – these functions are where the Streaming Engine (SE) and Streaming Address (SA) parameters are set, which determine the “data access pattern” (setting up the pattern for how the data will be streamed in from memory when read in the execution loop).
        • What other method is used to optimize?
        • You mentioned that the SE can access DDR as well; what is the performance delta between accessing data from DDR vs. L2 memory vs. MSMC?
        • For the MSMC, is it a design constraint to read and write at the same time?
  • Hello,

    Just to add a bit more information to the queries above:

    • What other method is used to optimize?

We are more interested in understanding whether the convolution process itself is optimized, or whether the optimization comes purely from the use of the Streaming Engine.

    For the MSMC, is it a design constraint to read and write at the same time?

In one of the previous tickets, Asha mentioned that we can configure the SE for either reads or writes, but can't have both operations. We are curious to know whether that is a design constraint or a software constraint. Link: TDA4VH-Q1: MSMC 1 Extended Memory and Streaming Engine - Processors forum - Processors - TI E2E support forums

  • Hi Daviel, Srikar,

    Where are the source codes for the algos?  __vfir4hw_vww() API

This is an intrinsic coming from the C7x compiler. If you look at the compiler header files, you can find this definition in include/c7x_direct.h.

This intrinsic corresponds to the VFIR4HW assembly instruction - more information regarding this can be found in the C7x ISA document. Essentially, it performs a sliding MAC operation with a filter size of 4 - we use this for a 3x3 filter.

    What other method is used to optimize?
We are more interested in understanding whether the convolution process itself is optimized, or whether the optimization comes purely from the use of the Streaming Engine.

    When we refer to something as "optimized" we usually mean we are trying to get full performance entitlement out of C7x. So this means using the parts of the architecture that are available to us. Streaming engine is utilized to increase read throughput from L2 memory across the functions we have implemented - and in this case, it is needed by the VFIR instruction. We use the VFIR and other C7x intrinsics to perform convolution over a full vector width of data (as opposed to a scalar implementation). 

    The best way to put it would be that we have mapped 3x3, 5x5 etc. convolution operations to the C7x processor in a way that utilizes the hardware components available to us and the compiler's features to achieve the best performance possible. 

You mentioned that the SE can access DDR as well; what is the performance delta between accessing data from DDR vs. L2 memory vs. MSMC?

Yes, with the memory architecture the SE can access MSMC and DDR, but you will see a performance hit with this. We don't have specific numbers on this; however, since you are curious about the performance specifically with convolution, you can go into the test driver file VXLIB_convolve_d.c and change where the input vectors (pIn and pFilter) are stored - they are allocated in L2 memory by default with the TI_memalign() function.

We can configure the SE for either reads or writes, but can't have both operations. We are curious to know whether that is a design constraint or a software constraint.

    Streaming Engine can only be used to read from memory, yes. This is a hardware architecture constraint.

    Best,

    Asha