TDA4VH-Q1: Model Inference Time Difference

Part Number: TDA4VH-Q1

My model takes 30 ms per inference when run from Python.

However, when I load the model through the C++ TIDL-RT API, it takes only 6 ms, which is suspicious.

My model has 35 outputs, so I have rebuilt some .so files on the TDA4, and there is no error when running.

(This is from my previous issue: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1303572/tda4vh-q1-config-file-size-does-not-match-size-of-stidl_iobufdesc_t/4960330#4960330)

So I believe my use of TIDL-RT is essentially correct.

This is my screenshot: [image not preserved]

  • Hi,

    First, I am trying to understand a few things.

    As mentioned here : https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1303572/tda4vh-q1-config-file-size-does-not-match-size-of-stidl_iobufdesc_t/4960330#4960330

    (change some header, rebuild, and replace some .so or .a files?)

    Have you modified a few macros related to the number of buffers and rebuilt TIDL-RT, since a model with 35 outputs is not supported by the standard runtime? Am I right?

    And you are then running model inference with the modified runtime and sharing those results, is this correct?

  • Hi Pratik,

    Yes!


    In the file c7x-mma-tidl/arm-tidl/rt/inc/itidl_ti.h, I changed the following.

    This from 16 to 32:
    // #define TIDL_NUM_OUT_BUFS ((int32_t) 16)
    #define TIDL_NUM_OUT_BUFS ((int32_t) 32)

    This from 32 to 64:
    // #define TIDL_MAX_ALG_OUT_BUFS ((int32_t) 32)
    #define TIDL_MAX_ALG_OUT_BUFS ((int32_t) 64)

    The rebuild then produces this new file:

    ti-processor-sdk-rtos-j784s4-evm-09_00_00_02/c7x-mma-tidl/arm-tidl/rt/out/J784S4/A72/LINUX/release/libvx_tidl_rt.so.1.0

    which I copied to /usr/bin on my TDA4.

  • As I guessed!

    With the changes mentioned above, the runtime will build, but please note that these changes will not take effect end to end. For that, one would need a clean build of all the libraries, which is not possible because we do not ship all the source files.

    I recommend adding concat layers to your model to bring the number of output buffers down to the supported limit, TIDL_NUM_OUT_BUFS, instead of modifying the source code and rebuilding it.

  • Hi Pratik,

    So TIDL_NUM_OUT_BUFS is fixed at 16 and cannot be configured?

    If I concatenate some layers whose values are distributed differently, I guess the quantization result will be bad?

  • So TIDL_NUM_OUT_BUFS is fixed at 16 and cannot be configured?

    At least from the user's point of view, it cannot be.

    If I concatenate some layers whose values are distributed differently, I guess the quantization result will be bad?

    I am not fully sure, but I suggest running the experiment and checking the deviation. Still, I see adding concat layers as the way forward.
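To get a feel for why heads with different value ranges can lose precision after a concat, here is a minimal Python sketch. It assumes plain symmetric per-tensor 8-bit quantization with one scale per tensor, which may not match what TIDL actually does. The head values and function names are made up for illustration.

```python
def quantize_roundtrip(values, max_abs, bits=8):
    """Symmetric per-tensor quantization: floats -> integers -> floats."""
    levels = 2 ** (bits - 1) - 1          # 127 for 8 bits
    scale = max_abs / levels              # one step size shared by the whole tensor
    restored = []
    for v in values:
        q = max(-levels, min(levels, round(v / scale)))
        restored.append(q * scale)
    return restored


def max_error(values, max_abs):
    """Largest absolute error introduced by the quantization round trip."""
    restored = quantize_roundtrip(values, max_abs)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical heads: one with a small value range, one with a large range.
small_head = [0.011, -0.037, 0.052, 0.089, -0.094]
large_head = [12.5, -48.0, 77.3, -99.0, 100.0]

# Quantized on its own, the small head gets a fine step size.
err_separate = max_error(small_head, max(abs(v) for v in small_head))

# After a concat, both heads share the range of the large head.
shared_max = max(abs(v) for v in small_head + large_head)
err_concat = max_error(small_head, shared_max)

print(f"small head alone:       max error = {err_separate:.5f}")
print(f"small head after concat: max error = {err_concat:.5f}")
```

Under these assumptions the small head is quantized much more coarsely once it shares a range with the large head, so grouping heads with similar ranges into the same concat is worth trying first.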

    Let me know if you have more issues specific to the topic of this thread; otherwise we can close it.

    Thank you

  • Hi Pratik,

    I was using TIDL-RT with a model that has 21 outputs, and my settings were the same as the original:

    #define TIDL_NUM_OUT_BUFS ((int32_t) 16)
    #define TIDL_MAX_ALG_OUT_BUFS ((int32_t) 32)

    However, it runs successfully.

    Can you explain whether this is expected?

    Is that because 21 < TIDL_MAX_ALG_OUT_BUFS, so a single model can have at most 32 output tensors?

    Then what is the difference between TIDL_NUM_OUT_BUFS and TIDL_MAX_ALG_OUT_BUFS?

    Thanks~

  • I will have to check with the dev team on the implementation-level details for this.

    Please expect a delay in my response, as the team is on year-end break. I will try to get back to you on this in the first week of January.

    Thank You.

  • I was using TIDL-RT with a model that has 21 outputs, and my settings were the same as the original:

    #define TIDL_NUM_OUT_BUFS ((int32_t) 16)
    #define TIDL_MAX_ALG_OUT_BUFS ((int32_t) 32)

    However, it runs successfully.

    Understood. This should work seamlessly, since 21 <= TIDL_MAX_ALG_OUT_BUFS (32).

    Can you explain the motivation for the change below, since the current TIDL-RT settings already support a 21-output model?

    for this file c7x-mma-tidl/arm-tidl/rt/inc/itidl_ti.h

    this from 16 to 32
    // #define TIDL_NUM_OUT_BUFS ((int32_t) 16)
    #define TIDL_NUM_OUT_BUFS ((int32_t) 32)

    this from 32 to 64
    // #define TIDL_MAX_ALG_OUT_BUFS ((int32_t) 32)
    #define TIDL_MAX_ALG_OUT_BUFS ((int32_t) 64)

    and then i will get this new build file:

    Then what is the difference between TIDL_NUM_OUT_BUFS and TIDL_MAX_ALG_OUT_BUFS?

    From the user's perspective, the one to check is TIDL_MAX_ALG_OUT_BUFS, which is specific to the number of output buffers for the network. TIDL_NUM_OUT_BUFS is specific to layer-level buffer info.
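As a quick sanity check before converting a model, the rule described above can be sketched in Python. The two constants are the stock values quoted in this thread from itidl_ti.h; the function name and the output names are hypothetical, and treating TIDL_MAX_ALG_OUT_BUFS as the network-level limit is taken from the explanation above rather than verified against the runtime source.

```python
# Stock values quoted in this thread from itidl_ti.h.
TIDL_NUM_OUT_BUFS = 16        # layer-level buffer info, per the explanation above
TIDL_MAX_ALG_OUT_BUFS = 32    # number of output buffers for the whole network


def check_output_count(output_names, limit=TIDL_MAX_ALG_OUT_BUFS):
    """Return (ok, message) for a list of network output tensor names."""
    count = len(output_names)
    if count <= limit:
        return True, f"{count} outputs fit within the limit of {limit}"
    excess = count - limit
    return False, f"{count} outputs exceed the limit of {limit} by {excess}"


# Hypothetical output names for the two models discussed in this thread.
ok_21, msg_21 = check_output_count([f"head_{i}" for i in range(21)])
ok_35, msg_35 = check_output_count([f"head_{i}" for i in range(35)])

print(msg_21)
print(msg_35)
```

Note that passing this check only says the output count is within the limit; it says nothing about whether the outputs will be numerically correct.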

    I hope this adds clarity. If there are no more questions on this topic, we can consider closing this thread.

    Best,

    Pratik

  • Hi Pratik,

    Thanks for your help!

    My current status: I concatenated outputs so that my model went from 35 outputs to 29.

    But the runtime is much faster than it should be.

    May I ask whether there is a size limit for a single tensor, such as a maximum number of floats?

    Thanks~

  • As I understand from your response,

    you have concatenated the model outputs down to 29 and you are able to run inference on the model (correct me if I have misunderstood).

    What do you mean by

    but the runtime is much faster than it should be.

    Do you mean the model inference is slow, or fast? In either case, how are you comparing, and against what? (The 35-output model won't compile/infer.)

    Secondly,

    May I ask whether there is a size limit for a single tensor, such as a maximum number of floats?

    Can you elaborate here? I did not understand the question.

  • Sorry for the unclear question.

    Our model originally has 29 outputs.

    It needs only 4 ms for inference, but the output tensors are wrong.

    For the exact same model, if we select 21 of the outputs in the ONNX model and then convert ONNX -> TIDL,

    the model output looks correct and takes 9 ms to run.

    For the exact same model, if we select the other 8 outputs in the ONNX model and then convert ONNX -> TIDL,

    the model output also looks correct.

    So my point is that the 29-output model should not be that fast.

    ---

    By the way, is there any limit on output tensor memory size? Based on our previous discussion, the model can have up to 32 outputs.

    But if I have 32 large output tensors rather than 32 small ones, can the model still run on the TDA4?

    Should I do any additional resource configuration for the TDA4?

    For example, should I increase MSMCSIZE_KB in device_config.cfg?

    The default value from the SDK is 2944, but when I increase it to 4896, graph verification fails when I run the model on the TDA4.

    Thanks again for your help!

  • Output tensors live in external memory (e.g., DDR), so your output tensors can be as large as fits in the available memory there.

    For example, should I increase MSMCSIZE_KB in device_config.cfg?

    The default value from the SDK is 2944, but when I increase it to 4896, graph verification fails when I run the model on the TDA4.

    Thanks again for your help!

    I have checked this with our dev team, and it seems that on the AHP device you are using, there is no option other than the default memory configuration provided in the device_config file.

  • Hi Pratik,

    How can I find out the size of the DDR?

    htop on my TDA4 shows 28 GB of RAM. Does that mean I have 20+ GB for output tensors, which is enough for my model?
    (My model output will definitely not be that large.)
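For a rough idea of how much memory the outputs need, the raw payload can be estimated in a few lines of Python. This is only a sketch: the shapes are hypothetical, it assumes 4-byte float outputs, and the buffers the runtime actually allocates may be larger because of padding or alignment.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Raw payload size of one tensor, assuming 4-byte float elements."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def total_output_bytes(shapes, bytes_per_element=4):
    """Sum of the raw payload sizes of all output tensors."""
    return sum(tensor_bytes(s, bytes_per_element) for s in shapes)


# Hypothetical model: 29 output heads, each 1 x 8 x 128 x 256.
output_shapes = [(1, 8, 128, 256)] * 29

total = total_output_bytes(output_shapes)
print(f"total output payload: {total / (1024 * 1024):.1f} MiB")
```

Comparing a figure like this with the free memory reported on the board gives a first-order answer to whether output size is likely to be the problem.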

    Back to my original question: the exact same model produces correct output when 21 of the 29 outputs are selected at the last network layer,

    but not when all 29 of 29 are selected.

    (The intermediate layers are identical in both cases; only the selected outputs differ.)


    Does that mean I don't need to change MSMCSIZE_KB, since the intermediate layers are the same in both settings and therefore use the same amount of resources?

    thanks!

  • You can refer to the TRM for DDR-related information.

    Secondly, I would like to understand the phrase "if x outputs are chosen from y". Could you elaborate on what "chosen" means here?

    The question boils down to this: you do not see correct output when the model has 29 output heads. Is my understanding right?

    If yes, here are a few things I would like to know.

    First, how are you comparing the accuracy results? What is the reference?

    Is this observed on the latest SDK 9.1? (I highly recommend testing on the ToT SDK, as it has recent fixes.)

    Please also share debug level 2 trace logs for the working and non-working cases.

  • For the exact same model, we originally have 29 outputs.

    We can turn off some of them in PyTorch, which gives x/29 outputs in ONNX and therefore x/29 in TIDL.

    We compare the accuracy against the Python ONNX result.
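A comparison like the one described above can be sketched as follows. This is not our actual script: it assumes each output has already been flattened to a plain list of floats, and the function name, sample values, and choice of metrics (maximum absolute difference and relative L2 error) are only illustrative.

```python
import math


def compare_outputs(reference, candidate):
    """Return (max_abs_diff, relative_l2_error) for two flat lists of floats."""
    if len(reference) != len(candidate):
        raise ValueError("output sizes differ")
    if not reference:
        raise ValueError("outputs are empty")
    diffs = [abs(r - c) for r, c in zip(reference, candidate)]
    max_abs = max(diffs)
    ref_norm = math.sqrt(sum(r * r for r in reference))
    err_norm = math.sqrt(sum(d * d for d in diffs))
    if ref_norm == 0.0:
        rel_l2 = 0.0 if err_norm == 0.0 else float("inf")
    else:
        rel_l2 = err_norm / ref_norm
    return max_abs, rel_l2


# Hypothetical values for one output head.
onnx_out = [0.5, -1.0, 2.0, 0.0]      # reference from ONNX in Python
tidl_out = [0.49, -1.02, 1.98, 0.01]  # same head read back from the device

max_abs, rel_l2 = compare_outputs(onnx_out, tidl_out)
print(f"max abs diff = {max_abs:.4f}, relative L2 error = {rel_l2:.4f}")
```

Running this per output head makes it easy to see whether all heads are wrong or only some of them, which is useful when the failure depends on how many outputs are enabled.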

    I think we can now close this thread.

    Based on our experiments, this is our conclusion.

    Keeping the total number of output floats fixed, we change only the number of outputs by concatenating,

    i.e.

    exp1:

    h x w x c1

    h x w x c2

    h x w x c3

    vs

    exp2:

    h x w x cc1

    h x w x cc2

    ...

    h x w x cc29

    where (c1+c2+c3) == (cc1+cc2 ... cc29)

    exp1 works but exp2 fails, even though exp2 has fewer than 32 output tensors, as we discussed.

    So we now concatenate some layers to work around the issue.
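The bookkeeping for this workaround can be sketched in plain Python: concatenate heads along the channel axis, remember each head's channel span, and split the merged output back after inference. Here each head is modeled as a list of channels and every name is hypothetical. In the real model the concat would be done in the network itself before export (for example with torch.cat along the channel dimension).

```python
def concat_heads(heads):
    """Concatenate heads along the channel axis and record each head's span."""
    merged, spans, start = [], [], 0
    for head in heads:
        merged.extend(head)                       # append this head's channels
        spans.append((start, start + len(head)))  # [start, end) in merged
        start += len(head)
    return merged, spans


def split_heads(merged, spans):
    """Recover the original heads from a merged output using the saved spans."""
    return [merged[a:b] for a, b in spans]


# Hypothetical heads with 2, 1 and 3 channels; each channel holds h*w = 2 values.
heads = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[1.0, 1.1]],
    [[5.0, 5.1], [6.0, 6.1], [7.0, 7.1]],
]

merged, spans = concat_heads(heads)
recovered = split_heads(merged, spans)

print("channel spans:", spans)
print("round trip ok:", recovered == heads)
```

Saving the spans at export time means the post-processing code does not need to hard-code channel offsets when the grouping of heads changes.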

    Thanks for your help, Pratik!

  • Sure,

    Great discussion. I hope the first approach we discussed resolves this issue.

    As suggested, I am closing this thread.