TDA4VM: Slow execution of convolutional node on SDK 7.0

PR7

Part Number: TDA4VM

Hello,

I am using SDK 7.0 and am again seeing an issue that I encountered in older SDK releases and that was solved in SDK 6.2. Please see the related question.

The problem is a 2D convolution operation that runs very slowly at 8 bits (120-130 ms).

The operation has the following parameters:

- Input shape NCHW: 1x128x8x24

- 1x1 kernel shape with stride (2,2)
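As a rough sanity check on the parameters above, the output size and multiply-accumulate count of this layer can be estimated as follows. This is only a sketch: the number of output channels is not stated in the post, so `c_out = 128` is a placeholder assumption, and zero padding with no dilation is assumed as well.

```python
def conv2d_out_dim(size, kernel, stride, pad=0):
    """Standard output-size formula for a convolution without dilation."""
    return (size + 2 * pad - kernel) // stride + 1


def conv2d_macs(c_in, c_out, h_out, w_out, kh, kw):
    """Multiply-accumulates for a dense (groups=1) convolution."""
    return c_in * c_out * h_out * w_out * kh * kw


# Values from the post: input 1x128x8x24 (NCHW), 1x1 kernel, stride (2, 2).
n, c_in, h, w = 1, 128, 8, 24
kh = kw = 1
stride = 2

# ASSUMPTION: the output channel count is not given in the post;
# 128 is a placeholder chosen only for illustration.
c_out = 128

h_out = conv2d_out_dim(h, kh, stride)
w_out = conv2d_out_dim(w, kw, stride)
macs = conv2d_macs(c_in, c_out, h_out, w_out, kh, kw)

print(f"output shape: {n}x{c_out}x{h_out}x{w_out}")
print(f"MACs: {macs}")
```

Under these assumptions the output is 4x12 spatially and the layer needs well under a million MACs, which is why a latency above 100 ms stands out.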

I attached a model to reproduce the issue, the same one posted in the older question: conv2D.zip.

Could you check why this is happening again and suggest a way to overcome this problem without downgrading the SDK version?

Thanks,

Pierre

over 5 years ago

0 Anshu Jain over 5 years ago

TI__Guru 56820 points

Hi Pierre,

Can you try to run this model in debug mode and let us know if you get any warnings in the console?

Regards,

Anshu

0 PR7 over 5 years ago in reply to Anshu Jain

Prodigy 160 points

Hi Anshu,

I already tried with this configuration:

- debugTraceLevel = 3 in the conversion script

- writeTraceLevel = 3 in the conversion script

- #define APP_DEBUG in the execution code for the board

I don't see any warnings, either during conversion (TIDL runs in emulation mode to collect activations, and all model checks passed correctly) or during execution on target using OpenVX APIs.

Thanks,

Pierre

0 Anshu Jain over 5 years ago in reply to PR7

TI__Guru 56820 points

Hi Pierre,

You should build the TI DL target library in debug mode and then run it on the EVM. There is no need to set debugTraceLevel or writeTraceLevel.

Regards,

Anshu

0 PR7 over 5 years ago in reply to Anshu Jain

Prodigy 160 points

Hi Anshu,

If by "TI DL target library" you mean rebuilding and running TI_DEVICE_dsp_test_dl_algo.out on the EVM using CCS, I don't think that is feasible for us, since we have no physical access to the board for now. The actual board configuration is Linux+TI-RTOS, so we could rebuild everything in debug mode (TIDL, tiovx, vision_apps) and redo the test, but I don't know whether we would be able to see the warnings that way.

Would it be feasible for you to test the model I shared in the way you suggested?

Thanks,

Pierre

0 Anshu Jain over 5 years ago in reply to PR7

TI__Guru 56820 points

Pierre,
I understand. Can you just confirm that the model you shared (with just a single layer) is the one where you are observing the slow behavior? This is not related to the issue you reported, but as a general note, the performance of a given layer should be measured within the complete network. With a single layer, both input and output will be in DDR, so the measured time will not represent the layer's real performance.

I will try to run this network and share my observation.

Regards,
Anshu

0 PR7 over 5 years ago in reply to Anshu Jain

Prodigy 160 points

Hi Anshu,

OK, I see. Basically, this is what I did:

- I tested the complete model (around 10 GMACs, I guess) and it ran at 7 fps, while in the previous SDK it was much faster (around 100 fps).

- Analyzing the model, I extracted the problematic layer, created a minimal ONNX model, and ran it on the EVM (8 fps). This is the minimal model I shared with you.

So I understand that reading the input from DDR and writing the output to DDR will affect performance since we are using just one layer, but in any case I think the final execution time for a single layer should be on the order of hundreds of microseconds. At 16 bits, this single layer runs in 200-300 us on the EVM even with I/O in DDR, so the bug is probably related to the 8-bit version.
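To put the figures quoted in this thread side by side, the small sketch below converts the 8-bit latency into frames per second and compares it with the 16-bit timing. All numbers come from the posts above; the variable and function names are just illustrative.

```python
def fps_from_latency_ms(latency_ms):
    """Frames per second for a given per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


# Measured ranges quoted in the thread.
slow_ms = (120.0, 130.0)  # 8-bit single layer, in milliseconds
fast_us = (200.0, 300.0)  # 16-bit single layer, in microseconds

# Frame rate implied by the 8-bit latency range.
slow_fps = tuple(fps_from_latency_ms(ms) for ms in slow_ms)

# Smallest and largest slowdown factor between the two ranges.
min_ratio = slow_ms[0] * 1000.0 / fast_us[1]
max_ratio = slow_ms[1] * 1000.0 / fast_us[0]

print(f"8-bit frame rate: {slow_fps[1]:.1f} to {slow_fps[0]:.1f} fps")
print(f"8-bit vs 16-bit slowdown: {min_ratio:.0f}x to {max_ratio:.0f}x")
```

The implied frame rate is consistent with the roughly 8 fps observed for the minimal model, and the 8-bit path comes out several hundred times slower than the 16-bit one.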

Thanks,

Pierre

0 Anshu Jain over 5 years ago in reply to PR7

TI__Guru 56820 points

Pierre,

Thanks for the clarification. As mentioned earlier, the single layer is not the cause of your issue; that was just a note. I will try to run the model you shared to see if I can reproduce the same behavior. Most likely this configuration is taking an unoptimized flow, which results in the slow performance, but I will confirm after trying it out at our end.

Regards,

Anshu

0 PR7 over 5 years ago in reply to Anshu Jain

Prodigy 160 points

Hi Anshu,

thanks, I'll wait for your feedback.

Regards,

Pierre

0 Anshu Jain over 5 years ago in reply to PR7

TI__Guru 56820 points

Hi Pierre,

I am able to reproduce this issue and, as suspected earlier, this configuration is hitting a code path that is not optimized in our current release. I am gathering more details about why that is and will update you once I have that information.

Regards,

Anshu

0 PR7 over 5 years ago in reply to Anshu Jain

Prodigy 160 points

Hi Anshu,

Thanks for your feedback. Let me know if this issue can be solved in a simple way. Unfortunately, having to modify and retrain the model would be quite a limitation for us, considering that this was working fine in SDK 6.2. I'll wait for your update.

Thanks,

Pierre

0 Anshu Jain over 5 years ago in reply to PR7

TI__Guru 56820 points

Hi Pierre,

The optimized implementation for this will be available in the SDK 7.1 release, which is expected by the end of October.

Regards,

Anshu
