Hi all,
I want to use UDMA to transfer and transpose data as shown below. Only the data in yellow needs to be transferred,
stride = width * N
And here is my configuration of UDMA TR:
But the UDMA gets stuck and never completes when I use this TR to transfer data.
Did I misconfigure something? Could you please provide me with a correct configuration to implement my idea?
Br,
Lance
Hi Lance,
Do you mean a transfer like the following, from
1 2 3 4
……..
5 6 7 8
To
1 5
2 6
3 7
4 8
If yes, then the TR is wrong. I'm working on a reference configuration for you. Please confirm my understanding.
Regards,
Karan
Hi Lance,
Can you try this configuration? Let me know if this works.
pTr->icnt0 = width;
pTr->icnt1 = height;
pTr->icnt2 = 1U;
pTr->icnt3 = 1U;
pTr->dim1 = width * N;
pTr->dim2 = 0; //don't care
pTr->dim3 = 0; //don't care
pTr->dicnt0 = 1;
pTr->dicnt1 = width;
pTr->dicnt2 = height;
pTr->dicnt3 = 1U;
pTr->ddim1 = height;
pTr->ddim2 = 1;
pTr->ddim3 = 0; //don't care
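To sanity-check a 4-D count/stride pattern like the one above before running it on hardware, it can be modeled in plain C. The sketch below is a software model only: `TrModel`, `tr_offset`, `tr_model_copy`, and `make_transpose_tr` are hypothetical names, not driver types. It assumes 1-byte elements, strides expressed in elements, and that element k of the source stream lands at position k of the destination stream. It says nothing about whether the hardware accepts this TR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of a 4-D transfer request (not a driver type). */
typedef struct {
    uint32_t icnt[4];   /* source loop counts, icnt[0] is the innermost loop */
    size_t   dim[4];    /* source strides in elements; dim[0] is unused      */
    uint32_t dicnt[4];  /* destination loop counts                           */
    size_t   ddim[4];   /* destination strides in elements; ddim[0] unused   */
} TrModel;

/* Offset of the k-th element of the linear stream inside a 4-D pattern. */
static size_t tr_offset(const uint32_t cnt[4], const size_t stride[4], size_t k)
{
    size_t off = k % cnt[0];          /* innermost dimension is contiguous */
    k /= cnt[0];
    for (int d = 1; d < 4; d++) {
        off += (k % cnt[d]) * stride[d];
        k /= cnt[d];
    }
    return off;
}

/* Copy every element from its source offset to its destination offset. */
static void tr_model_copy(const TrModel *tr, const uint8_t *src, uint8_t *dst)
{
    size_t total  = (size_t)tr->icnt[0] * tr->icnt[1] * tr->icnt[2] * tr->icnt[3];
    size_t dtotal = (size_t)tr->dicnt[0] * tr->dicnt[1] * tr->dicnt[2] * tr->dicnt[3];

    assert(total == dtotal);          /* both sides must move the same amount */
    for (size_t k = 0; k < total; k++) {
        dst[tr_offset(tr->dicnt, tr->ddim, k)] = src[tr_offset(tr->icnt, tr->dim, k)];
    }
}

/* Same counts and strides as the configuration above (n is the stride factor N). */
static TrModel make_transpose_tr(uint32_t width, uint32_t height, uint32_t n)
{
    TrModel tr = {
        .icnt  = { width, height, 1U, 1U },
        .dim   = { 1U, (size_t)width * n, 0U, 0U },
        .dicnt = { 1U, width, height, 1U },
        .ddim  = { 1U, height, 1U, 0U },
    };
    return tr;
}
```

With width = 4, height = 2 and N = 1, a source of 1..8 comes out as 1 5 2 6 3 7 4 8 in this model, which matches the transposed layout described earlier in the thread.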
Regards,
Karan
Yes. Your understanding is right.
I will try your configuration.
What about the fmtflags? fmtflag = 0x00000200U?
Hi Karan,
It doesn't work.
The UDMA got stuck with this configuration, too.
Is there something wrong with flag or fmtflag configuration?
Br,
Lance
Hi Lance,
Have you modified an existing example for this experiment? If yes, can you provide me with a patch?
Also, which SDK is this on?
I want to replicate the setup locally to give you pointed answers.
Regards,
Karan
Hi Lance,
Can you try one more thing:
Use a Type 9 TR with UDMA, as it allows the source and destination counts to be different:
pTr->icnt0 = width;
pTr->icnt1 = height;
pTr->icnt2 = 1U;
pTr->icnt3 = 1U;
pTr->dim1 = width * N;
pTr->dim2 = 0; //don't care
pTr->dim3 = 0; //don't care
pTr->dicnt0 = height;
pTr->dicnt1 = width;
pTr->dicnt2 = 1;
pTr->dicnt3 = 1U;
pTr->ddim1 = height;
pTr->ddim2 = 1;
pTr->ddim3 = 0; //don't care
Regards,
Karan
Hi Lance,
lei fu1 said: I use Type 9 TR, but it still gets stuck.
When you say stuck, what do you mean? Can you provide more details? Which function is it stuck in, and what is the trace?
lei fu1 said: BTW, I see there is a transpose mode in DFMT. Could you offer a configuration for transpose mode?
I need to look at this.
lei fu1 said: Does this configuration work on your side?
I have consulted the driver owner. I will test all these scenarios once I can replicate your setup.
Regards,
Karan
Thanks, got it.
Help me replicate the issue and I can help you with more pointed information.
Regards,
Karan
Hi Lance,
So you are using the udma_dru_direct_tr_test example? I can run that on the EVM, that should not be an issue.
Have you made any modifications to the example?
Regards,
Karan
Lance,
Could you please check the TR response and see if there is anything reported back by DMA engine?
Rgds,
Brijesh
Brijesh,
We removed event registration in our program according to your response in this thread: https://e2e.ti.com/support/processors/f/791/t/897849
Now we synchronize the DMA by polling the registers in a while loop. In this case, can we still get the TR response?
Hello Lance,
Yes. In this case the event will not come, but the TR response should still be available.
Please make sure to clear the TR response before submitting the frame and check it at the end of the frame.
Rgds,
Brijesh
Hi Brijesh,
I have two questions about the TR response:
1. Where is the TR response configured, and how do I check and clear it for a frame?
2. If Transpose mode is available, how much slower is the DMA transfer compared to a direct copy?
Lance
Hi Brijesh,
I have two ways to optimize my code. Could you please help analyze which one is better:
Rgds,
Lance
Hi Lance,
The TR response is stored at the end of the TR. It is a 32-bit value written back by the DMA engine that provides the status of the DMA transfer. This is helpful in a hang situation for determining the status.
Regards,
Brijesh
Hi Lance,
It is not part of this structure, because it is not part of the TR. We typically reserve one more 32-bit word for each TR, placed after the end of all TRs. This is where the DMA engine writes back the TR response.
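As a rough illustration of that layout, the sketch below models a block of TR memory with the response words reserved after the last TR, plus the clear-before-submit and check-after-completion steps. Everything here is an assumption for illustration: `TrMemModel` and the helper names are hypothetical, the 64-byte TR size and the absence of a descriptor header are simplifications, the `0xFFFFFFFF` sentinel is just one possible "cleared" value, and treating the low 4 bits as a status type should be checked against the UDMA specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TR          2U
#define TR_SIZE_BYTES   64U          /* assumption: 64-byte TR records        */
#define TR_RESP_CLEARED 0xFFFFFFFFU  /* sentinel chosen for this sketch only  */

/* Hypothetical model of TR memory: all TRs first, then one response per TR. */
typedef struct {
    uint8_t           tr[NUM_TR][TR_SIZE_BYTES]; /* TR records, opaque here   */
    volatile uint32_t resp[NUM_TR];              /* written back by the DMA   */
} TrMemModel;

/* Byte offset of the i-th response word from the start of the TR memory. */
static size_t tr_resp_offset(uint32_t i)
{
    return offsetof(TrMemModel, resp) + (size_t)i * sizeof(uint32_t);
}

/* Clear every response word before the frame is submitted. */
static void tr_resp_clear(TrMemModel *m)
{
    for (uint32_t i = 0; i < NUM_TR; i++) {
        m->resp[i] = TR_RESP_CLEARED;
    }
}

/* Nonzero once something has overwritten the sentinel for TR i. */
static int tr_resp_written(const TrMemModel *m, uint32_t i)
{
    return m->resp[i] != TR_RESP_CLEARED;
}

/* Assumption: the status type sits in the low 4 bits of the response word. */
static uint32_t tr_resp_status_type(uint32_t resp)
{
    return resp & 0xFU;
}
```

In a hang, reading the response words after the fact shows which TRs still hold the sentinel, that is, which TRs the engine never reported back on.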
Regards,
Brijesh
Hi Brijesh,
I don't understand what reserving one more 32-bit word at the end of the TR means. Could you please show me some sample code for reading the TR response?
Or do you have examples for copying data in UDMA DRU Direct Transpose mode that works properly?
Lance
Hi Lance,
We (Karan Saxena and I) are trying to make an example for transpose mode. We are trying it with TR Indirect mode first. I will keep you updated.
For TR response, please refer to UDMA specs.
Rgds,
Brijesh
Hi Lance,
Update: I tried a transposed copy using both UDMA and DRU, but the transpose does not happen during the transfer. I'm debugging this. In the meantime, I wanted to check whether programming the Tx and Rx separately could also work for you, i.e. using separate channels for Tx and Rx instead of block copy.
Regards,
Karan
Hi Karan,
I'm not sure what you mean by "separately programming the Tx and Rx". What do "Tx" and "Rx" mean? My goal is to achieve an efficient reshape operation, such as the conversion of a CNN feature from NCHW to NHWC. Do you have any suggestions for this operation?
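For reference, the NCHW to NHWC conversion can be written as a plain index remap. The function below is a minimal CPU sketch with a hypothetical name (`nchw_to_nhwc`) and 1-byte elements assumed. It is not a DMA configuration and makes no claim about efficiency. It only pins down which element has to land where: per image, a C x (H*W) matrix is transposed into (H*W) x C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference NCHW -> NHWC remap; src and dst must not overlap. */
static void nchw_to_nhwc(const uint8_t *src, uint8_t *dst,
                         size_t n, size_t c, size_t h, size_t w)
{
    assert(src != NULL && dst != NULL);
    for (size_t in = 0; in < n; in++) {
        for (size_t ic = 0; ic < c; ic++) {
            for (size_t ih = 0; ih < h; ih++) {
                for (size_t iw = 0; iw < w; iw++) {
                    /* NCHW: channel planes are contiguous. */
                    size_t s = ((in * c + ic) * h + ih) * w + iw;
                    /* NHWC: channels of one pixel are contiguous. */
                    size_t d = ((in * h + ih) * w + iw) * c + ic;
                    dst[d] = src[s];
                }
            }
        }
    }
}
```

For one image with C = 2, H = 1, W = 3 and source planes {1, 2, 3} and {4, 5, 6}, the output is 1 4 2 5 3 6.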
Regards,
Lance
Hi Lance,
I have attached the modified udma_dru_testapp (a patch on top of SDK 6.02), which achieves the transpose.
But this will be inefficient if you need to do it for large chunks. There is an email discussion in progress with Zhong, Ming about your specific use case.
Regards.
Karan
Hi Karan,
I have seen your patch, and I noticed that you are still using the No Change mode of UDMA to transfer data.
I have tried this method before, and it is really inefficient.
I was wondering whether you have tried the Transpose mode mentioned in the datasheet. Is this mode optimized in hardware, so that it achieves the transpose more efficiently?
Regards.
Lance
Hi Lance,
With UDMA and DRU, there are some limitations on which features are supported.
There has been an email exchange with Zhong Ming, and based on that:
MMALIB supports a matrix transpose API. You can refer to the following document: there is an existing example in SDK 7.0.
We think using this would be the optimal solution for your use case.
You can set up the build environment by following http://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/mmalib_01_02_00_03/docs/user_guide/BUILD_INSTRUCTIONS.html#build_instructions and then build using the command:
make scrub; make -j12 TARGET_CPU=C7100 LINK_LIBS=1 linalg_c7xmma/MMALIB_LINALG_matrixTranspose_ixX_oxX TEST_CASE=1000 TEST_ENV=EVM
Find the binary at psdk_rtos_auto_j7_07_00_00_11/mmalib_01_02_00_03/out/C7100/release/linalg_c7xmma/MMALIB_LINALG_matrixTranspose_ixX_oxX_C7100.out
Regards,
Karan
Hi Karan,
Thanks a lot for your support!
I will try to use the MMALIB transpose to optimize my code. When the DRU supports transpose transfers, please ask Zhong Ming to let me know.
Thanks again for your help.
Best Regards,
Lance
Hi Lance,
The MMALIB transpose should be more efficient than our previous approach. Can you clarify why you are requesting DRU transpose?
- Keerthy
Hi Keerthy,
On the one hand, it is because TIDL does not support a separate reshape layer, while for certain post-processing algorithms the NHWC format is much more efficient.
On the other hand, we hope that the transposed data transfer can run in parallel with other computation.
Best Regards,
Lance
Hi Lance,
The suggestion from our experts is that, if the intent is to hide the processing, the conversion should be done in either the layer before the reshape layer or the layer after it.
Couple of questions:
1) How many instances of reshape layer are being used?
2) If there is only one reshape layer, what are the layers before and after it?
- Keerthy
Hi Keerthy,
1. We use permute and reshape layers in the networks for detection.
2. Sometimes we use only one permute layer to change the format from NCHW to NHWC for the convenience of post-processing. Sometimes we use permute and reshape layers in the network, followed by a softmax layer.
Lance