TDA4VH-Q1: Model Inference Time Difference

ST T

Part Number: TDA4VH-Q1

My model takes 30 ms per inference when run from Python.

However, when I run the same model through the C++ TIDL-RT API, it takes only 6 ms, which is really weird.

My model has 35 outputs, so I have rebuilt some .so files on the TDA4, and there are no errors when running.

(This is from my previous issue: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1303572/tda4vh-q1-config-file-size-does-not-match-size-of-stidl_iobufdesc_t/4960330#4960330)

So I would think the way I use TIDL-RT is basically correct.

This is my screenshot:

over 2 years ago

0 Pratik Kedar over 2 years ago

TI__Mastermind 24041 points

Hi,

First, I am trying to understand a few things.

As mentioned here: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1303572/tda4vh-q1-config-file-size-does-not-match-size-of-stidl_iobufdesc_t/4960330#4960330

(change some header, rebuild, and replace some .so or .a files?)

Have you modified a few macros related to the number of buffers and rebuilt TIDL-RT, since a model with 35 outputs is not supported by the standard runtime? Am I right?

And you are then running model inference with the modified runtime and sharing those results. Is this correct?

0 ST T over 2 years ago in reply to Pratik Kedar

Prodigy 190 points

Hi Pratik,

Yes!

In this file: c7x-mma-tidl/arm-tidl/rt/inc/itidl_ti.h

I changed this from 16 to 32:
// #define TIDL_NUM_OUT_BUFS ((int32_t) 16)
#define TIDL_NUM_OUT_BUFS ((int32_t) 32)

and this from 32 to 64:
// #define TIDL_MAX_ALG_OUT_BUFS ((int32_t) 32)
#define TIDL_MAX_ALG_OUT_BUFS ((int32_t) 64)

Then I get this newly built file:

ti-processor-sdk-rtos-j784s4-evm-09_00_00_02/c7x-mma-tidl/arm-tidl/rt/out/J784S4/A72/LINUX/release/libvx_tidl_rt.so.1.0

and I copy it to /usr/bin on my TDA4.

0 Pratik Kedar over 2 years ago in reply to ST T

As I guessed!

With the above changes the runtime will build, but please note that the changes will not take effect end to end. For that, one would need a clean build of all the libraries, which is not possible because we don't ship all the source files.

I would recommend adding concat layers to your model to reduce the number of output buffers to within the supported limit (TIDL_NUM_OUT_BUFS), instead of modifying the source code and rebuilding it.

0 ST T over 2 years ago in reply to Pratik Kedar

Hi Pratik,

So TIDL_NUM_OUT_BUFS is fixed at 16 and cannot be configured?

If I concatenate some layers whose values are distributed differently, I guess the quantization result will be bad?

0 Pratik Kedar over 2 years ago in reply to ST T

ST T said:
So TIDL_NUM_OUT_BUFS is fixed at 16 and cannot be configured?

At least from the user's point of view, it can't be.

ST T said:
If I concatenate some layers whose values are distributed differently, I guess the quantization result will be bad?

I am not fully sure, but I would suggest running the experiment and checking the deviation. Still, I see adding concat layers as the way forward.
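To make the quantization concern concrete, here is a minimal sketch. It assumes symmetric 8-bit quantization with a single scale per tensor, which is a common scheme but is not confirmed in this thread to be what TIDL applies to a concat output. The two heads, their values, and all helper names are made up for illustration.

```python
def scale_for(values):
    """One scale per tensor, chosen from the largest absolute value (assumption)."""
    return max(abs(v) for v in values) / 127.0


def quantize_dequantize(values, scale):
    """Symmetric 8-bit quantization followed by dequantization."""
    out = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))
        out.append(q * scale)
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Two hypothetical output heads with very different value ranges.
small_head = [0.01 * i for i in range(-10, 11)]   # roughly [-0.1, 0.1]
large_head = [10.0 * i for i in range(-10, 11)]   # roughly [-100, 100]

# Case 1: the small head is quantized with its own scale.
own_scale = scale_for(small_head)
err_separate = mean_abs_error(small_head, quantize_dequantize(small_head, own_scale))

# Case 2: both heads are concatenated and share one scale.
shared_scale = scale_for(small_head + large_head)
err_shared = mean_abs_error(small_head, quantize_dequantize(small_head, shared_scale))

print("error with own scale:   ", err_separate)
print("error with shared scale:", err_shared)
```

Under these assumptions the small-range head loses most of its resolution once it shares a scale with the large-range head, which is why comparing the deviation on the real model is worth doing before settling on which outputs to concatenate.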

Let me know if you have more issues specific to the topic of this thread; otherwise we can close it.

Thank you

0 ST T over 2 years ago in reply to Pratik Kedar

Hi Pratik,

I was using TIDL-RT with a model with 21 outputs, and my settings were the same as the original:

#define TIDL_NUM_OUT_BUFS ((int32_t) 16)
#define TIDL_MAX_ALG_OUT_BUFS ((int32_t) 32)

However, it runs successfully.

Can you explain whether this is expected or not?

Is that because 21 < TIDL_MAX_ALG_OUT_BUFS, so I can have at most 32 output tensors for a single model?

Then what is the difference between TIDL_NUM_OUT_BUFS and TIDL_MAX_ALG_OUT_BUFS?

Thanks~

0 Pratik Kedar over 2 years ago in reply to ST T

I will have to check with the dev team on the implementation-level details for this.

Please expect a delay in my response, as the team is on year-end break. I will try to get back to you on this in the first week of January.

Thank You.

0 Pratik Kedar over 2 years ago in reply to Pratik Kedar

ST T said:
I was using TIDL-RT with a model with 21 outputs, and my settings were the same as the original:

#define TIDL_NUM_OUT_BUFS ((int32_t) 16)
#define TIDL_MAX_ALG_OUT_BUFS ((int32_t) 32)

However, it runs successfully.

Understood. This should work seamlessly, since 21 <= TIDL_MAX_ALG_OUT_BUFS (32).

Can you explain what the motivation was for the change here, since the current TIDL-RT setting already supports a 21-output model?

ST T said:
In this file: c7x-mma-tidl/arm-tidl/rt/inc/itidl_ti.h

I changed this from 16 to 32:
// #define TIDL_NUM_OUT_BUFS ((int32_t) 16)
#define TIDL_NUM_OUT_BUFS ((int32_t) 32)

and this from 32 to 64:
// #define TIDL_MAX_ALG_OUT_BUFS ((int32_t) 32)
#define TIDL_MAX_ALG_OUT_BUFS ((int32_t) 64)

Then I get this newly built file:

ST T said:
Then what is the difference between TIDL_NUM_OUT_BUFS and TIDL_MAX_ALG_OUT_BUFS?

From the user's perspective, you should check TIDL_MAX_ALG_OUT_BUFS (this term is specific to the number of buffers for the network); TIDL_NUM_OUT_BUFS is specific to layer-level buffer info.
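The rule described here can be written down as a quick pre-check to run before converting a model. This is only a sketch: the constants mirror the default values quoted from itidl_ti.h in this thread, and `fits_network_output_limit` is a hypothetical helper, not part of TIDL-RT.

```python
# Default limits as quoted from itidl_ti.h in this thread.
TIDL_NUM_OUT_BUFS = 16       # described in the thread as layer-level buffer info
TIDL_MAX_ALG_OUT_BUFS = 32   # described in the thread as the network-level limit


def fits_network_output_limit(num_model_outputs, limit=TIDL_MAX_ALG_OUT_BUFS):
    """Hypothetical helper: True if the output count is within the network-level limit."""
    if num_model_outputs < 1:
        raise ValueError("a model needs at least one output")
    return num_model_outputs <= limit


for n in (21, 29, 35):
    verdict = "within limit" if fits_network_output_limit(n) else "too many outputs"
    print(n, "outputs ->", verdict)
```

Staying within the limit is what this rule requires, but the check says nothing about whether the outputs are numerically correct, so it is best treated as a first filter only.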

Hope this adds clarity. If there are no more questions on the topic, we can consider closing this thread.

Best,

Pratik

0 ST T over 2 years ago in reply to Pratik Kedar

Hi Pratik,

Thanks for your help!

My current status is that I have concatenated outputs so that my model goes from 35 outputs to 29.

But the runtime is much faster than it should be.

May I ask whether there is a size limit for a single tensor, such as a maximum number of floats?

Thanks~

0 Pratik Kedar over 2 years ago in reply to ST T

As I understand from your response:

You have concatenated the model outputs down to 29 and are able to run inference on the model (correct me if I have misunderstood your response).

What do you mean by

ST T said:
But the runtime is much faster than it should be.

Do you mean the model inference is slow, or fast? In either case, how are you comparing, and what are you comparing against (since the 35-output model won't compile or infer)?

Secondly,

ST T said:
May I ask whether there is a size limit for a single tensor, such as a maximum number of floats?

Can you elaborate here? I didn't understand what you are asking.

0 ST T over 2 years ago in reply to Pratik Kedar

Sorry for the unclear question.

Our model originally has 29 outputs.

The model needs only 4 ms for inference, but the output tensors are wrong.

For the exact same model, if we select 21 of the outputs in the ONNX model and then convert ONNX -> TIDL,

the model output seems correct and inference takes 9 ms.

For the exact same model, if we select the other 8 outputs in the ONNX model and then convert ONNX -> TIDL,

the model output seems correct.

So my point is that the 29-output model should not be that fast.

---

By the way, is there any limit on output tensor memory size? Based on our previous discussion, the model can have up to 32 outputs.

But if I have 32 large output tensors rather than 32 small ones, can the model still run on the TDA4?

Should I do any additional resource configuration for the TDA4?

For example, should I increase MSMCSIZE_KB in device_config.cfg?

The default value from the SDK is 2944, but when I increase it to 4896, graph verification fails when I run the model on the TDA4.

Thanks again for your help!

0 Pratik Kedar over 2 years ago in reply to ST T

Output tensor sizes relate to the external memory hierarchy (e.g. DDR), so your output tensors need to fit in the memory available there.

ST T said:
For example, should I increase MSMCSIZE_KB in device_config.cfg?

The default value from the SDK is 2944, but when I increase it to 4896, graph verification fails when I run the model on the TDA4.

Thanks again for your help!

I have checked this with our dev team, and it seems that on the AHP device you are using, there is no option other than the default memory configuration provided in the device_config file.

0 ST T over 2 years ago in reply to Pratik Kedar

Hi Pratik,

How can I find out the size of the DDR?

I ran htop on my TDA4 and it shows 28 GB of RAM. Does that mean I have 20+ GB for output tensors, which would be enough for my model?
(My model output will definitely not be that large.)

Back to my original question: my exact same model produces correct output when 21 of the 29 outputs are chosen in the last network layer,

but does not produce correct output when all 29 of 29 are chosen

(the intermediate layers are the same in each case; only different outputs were chosen).

Does that mean I don't need to change MSMCSIZE_KB, because in the two settings above the intermediate layers are the same and so use the same amount of resources?

thanks!

0 Pratik Kedar over 2 years ago in reply to ST T

You can refer to the TRM for DDR-related info.

Secondly, I would like to understand the phrase "if x outputs are chosen from y". Could you elaborate on what "chosen" means here?

Does the question boil down to you not seeing correct output when you have a model with 29 output heads? (Is my understanding right?)

If yes, here are a few things that I would like to know.

First, how are you comparing the accuracy results? What is the reference?

Is this observed on the latest SDK 9.1? (I highly recommend testing on the ToT SDK, as it has recent fixes.)

Please share debug level 2 trace logs for the working and non-working cases.

+1 ST T over 2 years ago in reply to Pratik Kedar

For the exact same model, we originally have 29 outputs.

We can turn off some of them in PyTorch, so we get x of the 29 in ONNX and therefore x of the 29 in TIDL.

We compare the accuracy against the Python ONNX result.
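A comparison of this kind could look roughly like the sketch below. It assumes the reference (ONNX) and device (TIDL) outputs have already been collected as flat lists of floats keyed by output name; the names, values, and tolerance are made up, and no ONNX Runtime or TIDL-RT calls are shown.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat outputs."""
    if len(reference) != len(candidate):
        raise ValueError("output sizes differ")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def compare_outputs(reference_outputs, candidate_outputs, tolerance):
    """Return {output_name: (max_abs_diff, within_tolerance)} for every reference output."""
    report = {}
    for name, ref in reference_outputs.items():
        diff = max_abs_diff(ref, candidate_outputs[name])
        report[name] = (diff, diff <= tolerance)
    return report


# Made-up values: head_0 matches closely, head_1 came back as all zeros.
onnx_out = {"head_0": [0.10, 0.20, 0.30], "head_1": [1.0, 2.0, 3.0]}
tidl_out = {"head_0": [0.11, 0.19, 0.30], "head_1": [0.0, 0.0, 0.0]}

report = compare_outputs(onnx_out, tidl_out, tolerance=0.05)
for name, (diff, ok) in report.items():
    print(name, "max abs diff:", diff, "OK" if ok else "MISMATCH")
```

Reporting per output rather than one global number makes it easier to see whether only some heads are wrong when the number of outputs changes.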

I think we can now close this thread.

Based on our experiments, this is our conclusion:

If the total number of output floats is fixed and we only change the number of outputs by concatenating,

i.e.

exp1:

h x w x c1

h x w x c2

h x w x c3

exp2:

h x w x cc1

h x w x cc2

...

h x w x cc29

where (c1 + c2 + c3) == (cc1 + cc2 + ... + cc29)

exp1 works but exp2 fails, even though exp2 has fewer than 32 output tensors, as we discussed.

So now we concatenate some layers to solve the issue.
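Once heads are concatenated, the application has to slice the combined tensor back into the original heads. Here is a minimal sketch, assuming a flat buffer in h x w x c (channel-last) order as written above; the actual TIDL output layout may differ, and `split_channels` is a hypothetical helper.

```python
def split_channels(flat, h, w, channel_counts):
    """Split a flat h x w x sum(channel_counts) buffer into one flat buffer per head."""
    total_c = sum(channel_counts)
    if len(flat) != h * w * total_c:
        raise ValueError("buffer size does not match h * w * sum(channel_counts)")
    heads = [[] for _ in channel_counts]
    for pixel in range(h * w):
        base = pixel * total_c          # start of this pixel's channel vector
        offset = 0
        for head, c in zip(heads, channel_counts):
            head.extend(flat[base + offset: base + offset + c])
            offset += c
    return heads


# Tiny example: 1 x 2 pixels, two heads with 1 and 2 channels.
combined = [0, 1, 2, 3, 4, 5]
head_a, head_b = split_channels(combined, h=1, w=2, channel_counts=[1, 2])
print(head_a)  # [0, 3]
print(head_b)  # [1, 2, 4, 5]
```

The channel counts passed in must be in the same order used for the concat in the model, otherwise the heads come back swapped.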

Thanks for your help, Pratik!

+1 Pratik Kedar over 2 years ago in reply to ST T

Sure,

Great discussion, and I hope the first approach we discussed resolves this issue.

As suggested, closing this thread.
