TDA4VM: [UDMA TDA4] UDMA stuck when transferring data in Transpose Mode

lei fu1

Part Number: TDA4VM

Hi all,

I want to use UDMA to transfer and transpose data as shown below. Only the data in yellow needs to be transferred.

stride = width * N

And here is my configuration of UDMA TR:

But the UDMA gets stuck and never finishes when I use this TR to transfer data.

Did I misconfigure something? Could you please provide me with a correct configuration to implement my idea?

Br,

Lance

over 5 years ago

Karan Saxena over 5 years ago

Hi Lance,

Do you mean a transfer like this?

Source:

1 2 3 4

……..

5 6 7 8

Destination:

1 5

2 6

3 7

4 8

If yes, then the TR is wrong. I'm working on a reference configuration for you. Please confirm my understanding.

Regards,

Karan
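
For reference, the intended result can be written as a plain CPU loop. This is only a sketch of the data movement under discussion, not a UDMA configuration: it assumes one-byte elements, and the function name `transpose_strided_u8` and its parameters are made up for illustration. The source rows are `src_stride` bytes apart (stride = width * N in the original post) and the destination is packed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Reference (CPU) transpose of a width x height block of bytes.
 * src_stride is the distance between source rows in bytes; it may be
 * larger than width, in which case only the first `width` bytes of
 * each row are read. The destination is written densely, so the
 * element at (row, col) lands at dst[col * height + row].
 */
static void transpose_strided_u8(const uint8_t *src, size_t src_stride,
                                 uint8_t *dst, size_t width, size_t height)
{
    assert(src != NULL && dst != NULL);
    assert(src_stride >= width);

    for (size_t row = 0; row < height; row++) {
        for (size_t col = 0; col < width; col++) {
            dst[col * height + row] = src[row * src_stride + col];
        }
    }
}
```

With the 2 x 4 example above and a source stride of 6, the rows 1 2 3 4 and 5 6 7 8 end up in the destination as 1 5 2 6 3 7 4 8. A loop like this is also a convenient golden reference when checking the output of a DMA transfer.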

Karan Saxena over 5 years ago in reply to Karan Saxena

Hi Lance,

Can you try this configuration? Let me know if this works.

pTr->icnt0  = width;      // source: one row
pTr->icnt1  = height;     // source: number of rows
pTr->icnt2  = 1U;
pTr->icnt3  = 1U;
pTr->dim1   = width * N;  // source row stride
pTr->dim2   = 0;          // don't care
pTr->dim3   = 0;          // don't care
pTr->dicnt0 = 1;
pTr->dicnt1 = width;
pTr->dicnt2 = height;
pTr->dicnt3 = 1U;
pTr->ddim1  = height;
pTr->ddim2  = 1;
pTr->ddim3  = 0;          // don't care

Regards,

Karan

lei fu1 over 5 years ago in reply to Karan Saxena

Yes. Your understanding is right.

I will try your configuration.

What about the fmtflags? fmtflag = 0x00000200U?

lei fu1 over 5 years ago in reply to Karan Saxena

Hi Karan,

It doesn't work.

The UDMA gets stuck with this configuration, too.

Is there something wrong with the flags or fmtflags configuration?

Br,

Lance

Karan Saxena over 5 years ago in reply to lei fu1

Hi Lance,

Have you modified an existing example for this experiment? If yes, can you provide me with a patch?

Also, which SDK is this on?

I want to replicate the setup locally to give you pointed answers.

Regards,

Karan

Karan Saxena over 5 years ago in reply to Karan Saxena

Hi Lance,

Can you try one more thing:

Use a Type 9 TR with UDMA, as it allows the source and destination counts to differ:

pTr->icnt0  = width;      // source: one row
pTr->icnt1  = height;     // source: number of rows
pTr->icnt2  = 1U;
pTr->icnt3  = 1U;
pTr->dim1   = width * N;  // source row stride
pTr->dim2   = 0;          // don't care
pTr->dim3   = 0;          // don't care
pTr->dicnt0 = height;
pTr->dicnt1 = width;
pTr->dicnt2 = 1;
pTr->dicnt3 = 1U;
pTr->ddim1  = height;
pTr->ddim2  = 1;
pTr->ddim3  = 0;          // don't care

Regards,

Karan
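
To reason about what address pattern a count/stride set produces, the two configurations posted above can be run through a small software model. The sketch below rests on assumptions that should be checked against the UDMA specification: elements are one byte, strides are non-negative, the source side emits bytes in nested-loop order (icnt0 innermost), and the destination side consumes that same byte stream in its own nested-loop order. `TrModel`, `tr_offset` and `tr_model_run` are hypothetical names, not part of the UDMA driver, and the model says nothing about why the hardware hangs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one TR; index 0 is the innermost dimension. */
typedef struct {
    size_t icnt[4];   /* source counts icnt0..icnt3                */
    size_t dim[4];    /* source strides; dim[0] is unused          */
    size_t dicnt[4];  /* destination counts dicnt0..dicnt3         */
    size_t ddim[4];   /* destination strides; ddim[0] is unused    */
} TrModel;

/* Byte offset touched by the n-th byte of the stream on one side. */
static size_t tr_offset(size_t n, const size_t cnt[4], const size_t dim[4])
{
    size_t i0 = n % cnt[0]; n /= cnt[0];
    size_t i1 = n % cnt[1]; n /= cnt[1];
    size_t i2 = n % cnt[2]; n /= cnt[2];
    size_t i3 = n;
    return i0 + i1 * dim[1] + i2 * dim[2] + i3 * dim[3];
}

/* Returns 0 on success, -1 if the two sides move different byte counts. */
static int tr_model_run(const TrModel *tr, const uint8_t *src, uint8_t *dst)
{
    assert(tr != NULL && src != NULL && dst != NULL);

    size_t src_total = 1, dst_total = 1;
    for (int d = 0; d < 4; d++) {
        src_total *= tr->icnt[d];
        dst_total *= tr->dicnt[d];
    }
    if (src_total != dst_total) {
        return -1;  /* assumed requirement: both sides agree on the total */
    }
    for (size_t n = 0; n < src_total; n++) {
        dst[tr_offset(n, tr->dicnt, tr->ddim)] =
            src[tr_offset(n, tr->icnt, tr->dim)];
    }
    return 0;
}
```

Under this simplified model, with width = 4, height = 2 and a source stride of 6, the first configuration (dicnt0 = 1, dicnt1 = width, dicnt2 = height) writes 1 5 2 6 3 7 4 8, while the second (dicnt0 = height, dicnt1 = width, dicnt2 = 1) writes 1 2 3 4 5 6 7 8, i.e. a packed copy without transposition. Whether the hardware follows the same ordering is exactly what would need to be confirmed.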

lei fu1 over 5 years ago in reply to Karan Saxena

Hi Karan,

Does this configuration work on your side?

I used a Type 9 TR, but it still gets stuck.

I will try to provide an example that replicates this problem.

My SDK version is psdk_rtos_6.2.

BTW, I see there is a transpose mode in DFMT. Could you provide a configuration for transpose mode?

Br,

Lance

Karan Saxena over 5 years ago in reply to lei fu1

Hi Lance,

lei fu1 said:
I used a Type 9 TR, but it still gets stuck.

When you say stuck, what do you mean? Can you provide more details? Which function is it stuck in, and what does the trace show?

lei fu1 said:
BTW, I see there is a transpose mode in DFMT. Could you provide a configuration for transpose mode?

I need to look at this.

lei fu1 said:
Does this configuration work on your side?

I actually consulted the driver owner. I will test all these scenarios once I can replicate your setup.

Regards,

Karan

lei fu1 over 5 years ago in reply to Karan Saxena

"Stuck" means that I wait for the transfer-completion signal, but it never arrives.

The program spins in an endless loop.
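
When debugging a hang like this, it can help to bound the wait so that the program regains control and can dump status instead of spinning forever. The sketch below is generic C, not the UDMA driver API: `TransferDoneFn`, `wait_for_completion` and the `countdown_done` stub are hypothetical names, and the stub only stands in for whatever register or flag read signals completion in the real program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: returns true once the transfer has completed. */
typedef bool (*TransferDoneFn)(void *ctx);

/*
 * Polls is_done up to max_polls times.
 * Returns the number of polls used on success, or -1 on timeout.
 */
static int32_t wait_for_completion(TransferDoneFn is_done, void *ctx,
                                   int32_t max_polls)
{
    assert(is_done != NULL && max_polls > 0);

    for (int32_t poll = 1; poll <= max_polls; poll++) {
        if (is_done(ctx)) {
            return poll;
        }
    }
    return -1;  /* timed out: the caller can now inspect the channel state */
}

/*
 * Demo stub standing in for a real status read: counts *remaining down
 * by one per call and reports completion once it reaches zero.
 */
static bool countdown_done(void *ctx)
{
    int32_t *remaining = (int32_t *)ctx;
    if (*remaining > 0) {
        (*remaining)--;
    }
    return *remaining == 0;
}
```

On a timeout, the caller is free to read back whatever status the DMA engine exposes rather than leaving the core stuck in the wait loop.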

Karan Saxena over 5 years ago in reply to lei fu1

Thanks, got it.

Help me replicate the issue and I can help you with more pointed information.

Regards,

Karan

lei fu1 over 5 years ago in reply to Karan Saxena

Hi Karan,

Do you know how to build this example separately? I am having difficulty building a CCS project, so I want to modify this example to help you replicate the issue.

It's best to run it under host_emulation.

Regards,

Lance

Karan Saxena over 5 years ago in reply to lei fu1

Hi Lance,

So you are using the udma_dru_direct_tr_test example? I can run that on the EVM, that should not be an issue.

Have you made any modifications to the example?

Regards,

Karan

lei fu1 over 5 years ago in reply to Karan Saxena

Hi Karan,

Could you please change the configuration of pTr in this function in udma_dru_direct_tr_test.c to the one you provided and try it? I tried your configuration in our own program and could not get the transposed transfer to work.

Thanks,

Regards,

Lance

lei fu1 over 5 years ago in reply to Karan Saxena

Hi Karan,

Thanks,

Regards,

Lance

Brijesh Jadav over 5 years ago in reply to lei fu1

Lance,

Could you please check the TR response and see whether the DMA engine reports anything back?

Rgds,

Brijesh

lei fu1 over 5 years ago in reply to Brijesh Jadav

Brijesh,

We removed event registration in our program according to your response in this thread: https://e2e.ti.com/support/processors/f/791/t/897849

We now synchronize the DMA by polling the registers in a while loop. In this case, can we still get the TR response?

Brijesh Jadav over 5 years ago in reply to lei fu1

Hello Lance,

Yes. In this case the event will simply not arrive, but the TR response should still be available.

Please make sure to clear the TR response before submitting the frame, and check it at the end of the frame.

Rgds,

Brijesh

lei fu1 over 5 years ago in reply to Brijesh Jadav

Hi Brijesh,

I have two questions about the TR response:

1. Where is the TR response configured, and how do I check and clear it for a frame?

2. If Transpose mode is available, how much slower is the DMA transfer compared to a direct copy?

Lance

lei fu1 over 5 years ago in reply to Brijesh Jadav

Hi Brijesh,

I have two ways to optimize my code. Could you please help analyze which one is better?

1. Use DMA transpose mode to copy data from DDR to SRAM, and then use SE to read the data from contiguous memory.
2. Use DMA to copy data from DDR to SRAM, and then use SE transpose mode to read the data from non-contiguous memory.

Rgds,

Lance

Brijesh Jadav over 5 years ago in reply to lei fu1

Hi Lance,

The TR response is stored at the end of the TR. It is a 32-bit value written back by the DMA engine that provides the status of the DMA transfer. It is helpful for determining the status when the transfer hangs.

Regards,

Brijesh

lei fu1 over 5 years ago in reply to Brijesh Jadav

Hi Brijesh,

Sorry to bother you, but I don't see the response field in the TR struct.

Regards,

Lance

Brijesh Jadav over 5 years ago in reply to lei fu1

Hi Lance,

It is not part of this structure, because it is not part of the TR. We typically reserve one additional 32-bit word per TR, placed after the end of all the TRs. This is where the DMA engine writes back the TR response.

Regards,

Brijesh
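
The idea of reserving one response word per TR after the TRs, clearing it before submission and checking it afterwards, can be sketched as follows. Everything here is an illustrative assumption rather than the driver's actual layout: the `TrBlock` struct, the 64-byte TR size, the two-TR count, the 0xFFFFFFFF "cleared" sentinel and the helper names are all hypothetical. The real descriptor layout and the meaning of the response bits are defined in the UDMA specification, and on real hardware the memory also needs appropriate cache maintenance before the CPU reads it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TR          2U           /* number of TRs in this sketch      */
#define TR_SIZE_BYTES   64U          /* assumed size of one TR record     */
#define TR_RESP_CLEARED 0xFFFFFFFFU  /* sentinel chosen for this sketch   */

/* Hypothetical layout: all TR records first, then one response word each. */
typedef struct {
    uint8_t  tr[NUM_TR][TR_SIZE_BYTES];
    uint32_t response[NUM_TR];
} TrBlock;

/* Call before submitting, so a stale response cannot be mistaken for new. */
static void tr_block_clear_responses(TrBlock *blk)
{
    assert(blk != NULL);
    for (uint32_t i = 0; i < NUM_TR; i++) {
        blk->response[i] = TR_RESP_CLEARED;
    }
}

/*
 * Returns true and stores the raw 32-bit response in *value if the word
 * for TR idx no longer holds the sentinel; otherwise returns false and
 * leaves *value untouched.
 */
static bool tr_response_ready(const TrBlock *blk, uint32_t idx, uint32_t *value)
{
    assert(blk != NULL && value != NULL && idx < NUM_TR);

    /* volatile read: the word is written by the DMA engine, not the CPU */
    uint32_t raw = *(const volatile uint32_t *)&blk->response[idx];
    if (raw == TR_RESP_CLEARED) {
        return false;
    }
    *value = raw;
    return true;
}
```

If the response word still holds the sentinel after the wait has timed out, the engine never wrote a response for that TR, which is itself useful information when reporting a hang.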

lei fu1 over 5 years ago in reply to Brijesh Jadav

Hi Brijesh,

I don't understand what reserving one more 32-bit word at the end of the TR means. Could you please show me some sample code for reading the TR response?

Or do you have a working example of copying data in UDMA DRU Direct Transpose mode?

Lance

Brijesh Jadav over 5 years ago in reply to lei fu1

Hi Lance,

We (Karan Saxena and I) are trying to make an example for transpose mode. We are trying it with TR indirect mode first and will keep you updated.

For the TR response, please refer to the UDMA specs.

Rgds,

Brijesh

lei fu1 over 5 years ago in reply to Brijesh Jadav

OK. Thanks a lot!

Looking forward to your examples.

Regards,

Lance

Karan Saxena over 5 years ago in reply to lei fu1

Hi Lance,

Update: I tried a transposed copy using UDMA and also DRU, but the transpose does not happen during the transfer. I'm debugging this. In the meantime, I wanted to check whether programming the Tx and Rx separately would also work for you, i.e. using separate channels for Tx and Rx instead of a block copy.

Regards,

Karan

lei fu1 over 5 years ago in reply to Karan Saxena

Hi Karan,

I'm not sure what you mean by "separately programming the Tx and Rx". What do "Tx" and "Rx" mean? My goal is to achieve an efficient reshape operation, such as the conversion of a CNN feature from NCHW to NHWC. Do you have any suggestions for this operation?

Regards,

Lance
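
For clarity, the NCHW to NHWC conversion mentioned here is, per batch item, a transpose of a C x (H*W) matrix into an (H*W) x C matrix. The loop below is a plain CPU reference under the assumption of one-byte elements and densely packed tensors; the function name `nchw_to_nhwc_u8` is made up for illustration, and the code is meant as a correctness baseline, not as an efficient implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Reference layout conversion from NCHW to NHWC for 8-bit data.
 * src holds n * c * h * w bytes in NCHW order;
 * dst receives the same elements in NHWC order.
 */
static void nchw_to_nhwc_u8(const uint8_t *src, uint8_t *dst,
                            size_t n, size_t c, size_t h, size_t w)
{
    assert(src != NULL && dst != NULL);

    for (size_t ni = 0; ni < n; ni++) {
        for (size_t hi = 0; hi < h; hi++) {
            for (size_t wi = 0; wi < w; wi++) {
                for (size_t ci = 0; ci < c; ci++) {
                    size_t src_idx = ((ni * c + ci) * h + hi) * w + wi;
                    size_t dst_idx = ((ni * h + hi) * w + wi) * c + ci;
                    dst[dst_idx] = src[src_idx];
                }
            }
        }
    }
}
```

For a tensor with N = 1, C = 2, H = 1, W = 3 holding 1 2 3 in the first channel and 4 5 6 in the second, the NHWC result is 1 4 2 5 3 6.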

Karan Saxena over 5 years ago in reply to lei fu1

Hi Lance,

I have attached the modified udma_dru_testapp (a patch on top of SDK6.02), which achieves the transpose.

However, this will be inefficient if you need to do it for large chunks. There is an email discussion in progress with Zhong, Ming about your specific use case.

Regards.

Karan

Patch - /cfs-file/__key/communityserver-discussions-components-files/791/0001_2D00_udma_5F00_dru_5F00_testapp_2D00_Transfer_2D00_and_2D00_transpose_2D00_using_2D00_DRU.patch

lei fu1 over 5 years ago in reply to Karan Saxena

Hi Karan,

I have looked at your patch, and I noticed that you are still using the No Change mode of UDMA to transfer data.

I've tried this method before and it is really inefficient.

Have you tried the Transpose Mode mentioned in the datasheet? Is this mode optimized in hardware so that the transpose is more efficient?

Regards.

Lance

Karan Saxena over 5 years ago in reply to lei fu1

Hi Lance,

With UDMA and DRU, there are some limitations on which features are supported.

There has been an email exchange with Zhong Ming, and based on that:

MMALIB supports a matrix transpose API. You can refer to the following document; there is an existing example in SDK 7.0.

https://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/mmalib_01_02_00_03/docs/user_guide/group__MMALIB__LINALG__matrixTranspose__ixX__oxX.html

We think this would be the best solution for your use case.

You can build the example by following this link to set up the environment - http://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/mmalib_01_02_00_03/docs/user_guide/BUILD_INSTRUCTIONS.html#build_instructions and then building with the command:

make scrub; make -j12 TARGET_CPU=C7100 LINK_LIBS=1 linalg_c7xmma/MMALIB_LINALG_matrixTranspose_ixX_oxX TEST_CASE=1000 TEST_ENV=EVM

Find the binary at psdk_rtos_auto_j7_07_00_00_11/mmalib_01_02_00_03/out/C7100/release/linalg_c7xmma/MMALIB_LINALG_matrixTranspose_ixX_oxX_C7100.out

Regards,

Karan

lei fu1 over 5 years ago in reply to Karan Saxena

Hi Karan,

Thanks a lot for your support!

I will try to use the MMALIB transpose to optimize my code. When the DRU supports transposed transfers, please ask Zhong Ming to let me know.

Thanks again for your help.

Best Regards,

Lance

Keerthy J over 5 years ago in reply to lei fu1

Hi Lance,

The MMALIB transpose should be more efficient than our previous approach. Can you clarify why you are requesting DRU transpose?

- Keerthy

lei fu1 over 5 years ago in reply to Keerthy J

Hi Keerthy,

On the one hand, TIDL does not support a separate reshape layer, but for certain post-processing algorithms the NHWC format is much more efficient.

On the other hand, we hope that the transposed data transfer can run in parallel with other computation.

Best Regards,

Lance

Keerthy J over 5 years ago in reply to lei fu1

Hi Lance,

The suggestion from our experts is that, if the intent is to hide the processing, the conversion should be done either in the layer before the reshape layer or in the layer after it.

A couple of questions:

1) How many instances of reshape layer are being used?
2) If there is only one reshape layer what is the layer before and after reshape layer?

- Keerthy

lei fu1 over 5 years ago in reply to Keerthy J

Hi Keerthy,

1. We use permute and reshape layers in the networks for detection.

2. Sometimes we use only one permute layer to change the format from NCHW to NHWC for the convenience of post-processing. Sometimes we use permute and reshape layers in the network, followed by a softmax layer.

Lance
