Why does a call to QDMA take so long?

Zhijun Wang

Prodigy 90 points

The QDMA results on my 6678 EVM board confuse me.

Why does a call to QDMA_copy_fast take so long?

Thank you!

Setup: 6678 at 1 GHz, EVM board.

My QDMA 1D-1D copy code:

----------------------------------------------------

void QDMA_copy_fast(Uint8 *src, Uint8 *dst, Uint32 len, Uint32 Tccnum)
{
    Uint32 * restrict p = (Uint32 *)(0x2704000 + Tccnum*32); // global region

#if 0 /* 32-bit writes */
    *p     = 0x80100808 | (Tccnum << 12); // OPT: DMA early completion, 1D -> 1D
    *(p+1) = (Uint32)src;                 // SRC
    *(p+2) = (1<<16) | len;               // BCNT, ACNT
    *(p+3) = (Uint32)dst;                 // DST
    *(p+4) = (1<<16) | 1;                 // DSTBIDX, SRCBIDX
    *(p+5) = 0xFFFF;                      // BCNTRLD, LINK
    *(p+6) = 1<<16;                       // DSTCIDX, SRCCIDX
    *(p+7) = 1;                           // CCNT
#else /* 64-bit writes */
    _amemd8(p)   = _itod((Uint32)src, (0x80100808 | (Tccnum << 12)));
    _amemd8(p+2) = _itod((Uint32)dst, (1<<16) | len);
    _amemd8(p+4) = _itod(0xFFFF, (1<<16) | 1);
    _amemd8(p+6) = _itod(1, (1<<16));
#endif
}

The generated asm file (-O3):

-------------------------------------------------------

QDMA_copy_fast:
           MVKL    .S2     0x2704000,B5
           SHL     .S2     B6,5,B4           ; |207|
||         MV      .L1X    B4,A7             ; |206|
           MVKH    .S2     0x2704000,B5
||         MVKL    .S1     0x10001,A8
           RET     .S2     B3                ; |233|
||         SET     .S1     A6,16,16,A6       ; |222|
           ADD     .L2     B5,B4,B4          ; |207|
||         MVKH    .S1     0x10001,A8
||         MVKL    .S2     0x80100808,B7
           STDW    .D2T1   A7:A6,*+B4(8)     ; |222|
||         ADD     .L1     -2,A8,A9
||         MVKH    .S2     0x80100808,B7
           STDW    .D2T1   A9:A8,*+B4(16)    ; |223|
||         SHL     .S2     B6,12,B5          ; |221|
||         ADD     .L1     -1,A8,A4
||         MVK     .S1     1,A5              ; |224|
||         MV      .L2X    A4,B9             ; |206|
           STDW    .D2T1   A5:A4,*+B4(24)    ; |224|
||         OR      .L2     B7,B5,B8          ; |221|
           STDW    .D2T2   B9:B8,*B4         ; |221|
           ; BRANCH OCCURS {B3}              ; |233|

The calling code:

----------------------------------------------------------

CACHE_setL1DSize(CACHE_32KCACHE);
CACHE_setL1PSize(CACHE_32KCACHE);
CACHE_setL2Size (CACHE_256KCACHE);

for(i=0;i<256;i++)
    CACHE_enableCaching(i);

...

InitQDMA(); // my code

startClock = GET_Clock(); // reads TSCL and TSCH, returns a 64-bit clock

for(i=0;i<1*1024*1024;i++)
{
    QDMA_copy_fast(srcBuff1, dstBuff1, 3*1024, 0);
    //QDMA_copy_wait(0);
}

endClock = GET_Clock();

ftime = (float)((endClock - startClock) / 1000)/1000;
printf("ftime = %f ms\n",ftime);

...

Run output

--------------------------------------------------

With QDMA_copy_wait(0) disabled:

ftime = 281.364990 ms // 268 clocks per loop, too long!

With QDMA_copy_wait(0) enabled:

ftime = 374.341003 ms // 356 clocks per loop, too long!
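For reference, the per-loop figures above come from dividing the elapsed time by the 1M iterations at a 1 GHz core clock. A minimal sketch of that arithmetic follows; the helper name `cycles_per_iteration` is made up for illustration and is not part of the original project.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: convert an elapsed time in milliseconds into whole
 * CPU cycles per loop iteration, for a core clock given in MHz.
 * The result is truncated, matching the figures quoted in the post. */
uint32_t cycles_per_iteration(double elapsed_ms, double cpu_mhz,
                              uint32_t iterations)
{
    assert(iterations != 0u);
    /* ms -> us (x1000), then us -> cycles (x MHz) */
    double total_cycles = elapsed_ms * 1000.0 * cpu_mhz;
    return (uint32_t)(total_cycles / (double)iterations);
}
```

With `cpu_mhz = 1000.0` and `iterations = 1024 * 1024`, the two measured times give 268 and 356 cycles per loop.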

over 13 years ago

0 RandyP over 13 years ago

TI__Guru* 84110 points

What results did you expect?
268 clocks to copy 3072 bytes is 11 bytes per clock cycle; why is that so bad?
Where are src & dst addresses?

Do not use 64-bit reads and writes with Config Space accesses. All reads and writes to Config Space including PaRAM, unless you find it specified otherwise, should be 32-bit accesses.

The best feature of QDMA is that you do not have to write to all 8 registers when the same or similar transfers are done. In your case, the exact same transfer is being done each time. For this case, you can write to all 8 registers the first time, then only write to the Trigger Word for the remaining cases. The Trigger Word is probably 7, meaning the CCNT register, so you can just write to it to trigger all the remaining transfers. If you want to keep your loop simple, have one function called QDMA_copy_fast_setup() that writes to words 0-6, then have QDMA_copy_fast() just write to word 7 inside the loop.
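A minimal sketch of that split, not code from the thread. It assumes the trigger word is 7 (CCNT), that the OPT value from the original post (which has the STATIC bit set) is kept so the PaRAM set is not modified after each transfer, and that the caller passes a pointer to the PaRAM set mapped to the QDMA channel. The `param` parameter and the enum names are illustrative; all writes are 32-bit.

```c
#include <assert.h>
#include <stdint.h>

/* Word index of each field inside one 8-word (32-byte) PaRAM set. */
enum {
    PARAM_OPT = 0, PARAM_SRC, PARAM_A_B_CNT, PARAM_DST,
    PARAM_SRC_DST_BIDX, PARAM_LINK_BCNTRLD, PARAM_SRC_DST_CIDX, PARAM_CCNT
};

/* One-time setup: write words 0-6 only. `param` would point at
 * 0x02704000 + 32*n on the device; here it is just a parameter. */
void QDMA_copy_fast_setup(volatile uint32_t *param, uint32_t src,
                          uint32_t dst, uint32_t len, uint32_t tcc)
{
    assert(len <= 0xFFFFu);  /* ACNT is a 16-bit field */
    assert(tcc < 64u);       /* TCC is a 6-bit field   */

    param[PARAM_OPT]          = 0x80100808u | (tcc << 12);
    param[PARAM_SRC]          = src;
    param[PARAM_A_B_CNT]      = (1u << 16) | len;  /* BCNT = 1, ACNT = len */
    param[PARAM_DST]          = dst;
    param[PARAM_SRC_DST_BIDX] = (1u << 16) | 1u;
    param[PARAM_LINK_BCNTRLD] = 0xFFFFu;           /* null link */
    param[PARAM_SRC_DST_CIDX] = 1u << 16;
}

/* Per-transfer call: a single write to the assumed trigger word (CCNT). */
void QDMA_copy_fast(volatile uint32_t *param)
{
    param[PARAM_CCNT] = 1u;
}
```

The benchmark loop would then call QDMA_copy_fast_setup() once before the loop and only QDMA_copy_fast() inside it, so each iteration costs one Config Space write instead of eight.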

Regards,
RandyP


0 Zhijun Wang over 13 years ago in reply to RandyP

Prodigy 90 points

Thank you for the good advice about using QDMA.

The 268 clocks are purely the overhead of calling QDMA_copy_fast (configuring the QDMA); they have nothing to do with the data transfer itself.

The loop only configures the QDMA and does not wait for the transfer to finish.

The overhead of calling QDMA_copy_fast is too high.

By the way, can IDMA access the EDMA PaRAM memory? I have tried it, but it seems to have no effect.

0 RandyP over 13 years ago in reply to Zhijun Wang

TI__Guru* 84110 points

Zhijun,

I see now that you are correct to say that no data transfers are involved in the 268 cycle count.

It would appear that you have a benchmark for writing to Config Space 1M times, in particular PaRAM. Do you find a practical use for this number, since you would never write to PaRAM 1M times?

Try the benchmark with only one call to QDMA_copy_fast, and subtract out the benchmarking overhead of the call to GET_Clock:

Single call benchmark said:
calibrateClock = GET_Clock();//read TSCL&TSCH, return 64bit clock
startClock = GET_Clock();//read TSCL&TSCH, return 64bit clock
QDMA_copy_fast(srcBuff1, dstBuff1, 3*1024, 0);
endClock = GET_Clock();
printf("ctime = %d cyc\n",(int)(endClock-startClock-(startClock-calibrateClock)));

What result does that show?

Config Space accesses are not fast. Memory Space accesses are fast.
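The calibration idea above can be sketched as a small helper. This is only an illustration: `measure_cycles`, `fake_clock` and `fake_work` are made-up names, and the fake clock is a host-side stand-in so the sketch runs without the DSP. On the target, the clock function would be something like GET_Clock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*clock_fn)(void);

/* Time one call to work(), subtracting the cost of reading the clock,
 * which is estimated from two back-to-back clock reads. */
uint64_t measure_cycles(clock_fn now, void (*work)(void))
{
    assert(now != NULL && work != NULL);
    uint64_t calibrate = now();
    uint64_t start     = now();
    work();
    uint64_t end       = now();
    uint64_t overhead  = start - calibrate;  /* cost of one clock read */
    uint64_t elapsed   = end - start;
    return (elapsed > overhead) ? (elapsed - overhead) : 0u;
}

/* Host-side stand-ins: each clock read costs 17 "cycles",
 * and the work under test costs 100. */
uint64_t fake_time = 0;
uint64_t fake_clock(void) { fake_time += 17u; return fake_time; }
void     fake_work(void)  { fake_time += 100u; }
```

With the stand-ins, `measure_cycles(fake_clock, fake_work)` returns 100. When printing a 64-bit result, cast it to match the format specifier, for example `%u` with an `(unsigned)` cast for small counts.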

Yes, IDMA0 can write to Config Space. I doubt it will help much for writing 8M Words to Config Space, although it may appear to be faster if you trigger it 1M times without waiting for the previous IDMA0s to complete.

Regards,
RandyP

0 Zhijun Wang over 13 years ago in reply to RandyP

Prodigy 90 points

hi, RandyP,

Test code: (build configuration: release)
---------------------------------------------------------------------------------------------
startClock = GET_Clock(); //read TSCL&TSCH, return 64bit clock
QDMA_copy_fast(srcBuff1, dstBuff1, 3*1024, 0);
endClock = GET_Clock();
printf("ctime = %d cyc\n",endClock-startClock-(startClock-calibrateClock));
printf("Get_clock overhead = %d", startClock-calibrateClock);

This is the result of test:
-----------------------------------------------------------------------------------------------
[C66xx_0] start test...
[C66xx_0] ctime = 231 cyc
[C66xx_0] Get_clock overhead= 17
[C66xx_0] end test...

You said: "Config Space accesses are not fast. Memory Space accesses are fast." I think this is the root of the problem, but I can't find a specific description of it in the documentation.

About IDMA
---------------------------------------------------------------------------------------------------
I copied L2 data to PaRAM entry 0 with IDMA0, but the QDMA copy failed. So I think configuring the QDMA through IDMA may not work.

TEST code:

*(Uint32 *)0x800000 = 0x80100008 | (Tccnum << 12);
*(Uint32 *)0x800004 = (Uint32)src;
*(Uint32 *)0x800008 = (1<<16) | len;
*(Uint32 *)0x80000C = (Uint32)dst;
*(Uint32 *)0x800010 = (1<<16) | 1;
*(Uint32 *)0x800014 = 0xFFFF;
*(Uint32 *)0x800018 = 1<<16;
*(Uint32 *)0x80001C = 1;

*(Uint32 volatile*)0x1820004 = 0xFFFFFF00; // IDMA0_MASK, copy 8 words
*(Uint32 volatile*)0x1820008 = 0xf00000;   // IDMA0_SOURCE, L1D start address
*(Uint32 volatile*)0x182000C = 0x2704000;  // IDMA0_DEST, PaRAM entry 0
*(Uint32 volatile*)0x1820010 = 0;          // IDMA0_COUNT
while( *(Uint32 volatile*)0x1820000 );     // check STAT

0 RandyP over 13 years ago in reply to Zhijun Wang

TI__Guru* 84110 points

I am surprised that the 231 cyc count would occur here. Could you do another test and put a read of the QDMA's OPT register from PaRAM (even if it is uninitialized, or write it then read it if you prefer) and put that just before the startClock assignment? This will flush any write buffers, and I thought there would be a write buffer in that path that would make QDMA_copy_fast return more quickly.

Zhijun Wang said:
I think this is the point of problem. But I can’t find the specific description.

I do not know where to point you.

Zhijun Wang said:
I copy L2 data to PaRAM entry 0 by IDMA0
IDMA0_SOURCE, L1D start addres

This might be your problem. I am very confident that IDMA0 to QDMA will work okay. What failing results do you see?
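To illustrate the mismatch quoted above: the test code stages the eight words at 0x800000 (L2) but points IDMA0_SOURCE at 0xf00000 (L1D). A rough sketch that takes the staged address as a parameter, so the two cannot drift apart, is shown below. The struct and function names are made up; the register order follows the addresses used in the test code (STAT at 0x1820000 through COUNT at 0x1820010), and the assumption that the write to COUNT starts the transfer should be checked against the CorePac documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IDMA channel 0 registers, in the order used by the test code above. */
typedef struct {
    volatile uint32_t stat;    /* 0x01820000 */
    volatile uint32_t mask;    /* 0x01820004 */
    volatile uint32_t source;  /* 0x01820008 */
    volatile uint32_t dest;    /* 0x0182000C */
    volatile uint32_t count;   /* 0x01820010 */
} Idma0Regs;

/* Copy 8 staged words to a PaRAM set through IDMA0.
 * staged_addr must be the address the words were actually written to. */
void idma0_write_param(Idma0Regs *idma, uint32_t staged_addr,
                       uint32_t param_addr)
{
    assert(idma != NULL);
    idma->mask   = 0xFFFFFF00u;  /* same mask as the test code: 8 words */
    idma->source = staged_addr;
    idma->dest   = param_addr;
    idma->count  = 0u;           /* assumed: this write starts the transfer */
    while (idma->stat != 0u) {
        /* wait until IDMA0 reports idle */
    }
}
```

On the device this would be called as `idma0_write_param((Idma0Regs *)0x01820000, 0x00800000, 0x02704000)` for the buffer and PaRAM entry used in the test code.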

Regards,
RandyP

0 Zhijun Wang over 13 years ago in reply to RandyP

Prodigy 90 points

I rewrote the QDMA project; you can find it in QDMA.zip. (You need to change the include path and the lib path.)

The newest test result is about 113 cycles per call.

0880.QDMA.zip

About IDMA: I mean that IDMA0 copied the configuration data to the QDMA PaRAM entry successfully, but it did not trigger the QDMA transfer.

0 RandyP over 13 years ago in reply to Zhijun Wang

TI__Guru* 84110 points

Your speed is 2x faster now. Are you happy with that number?

The IDMA0 code you show above would not copy cfg data to QDMA PaRAM successfully. If it did, it would trigger the QDMA transfer. I told you the errors in the earlier post, so the result could not have been a successful copy with that code.

Regards,
RandyP

0 Zhijun Wang over 13 years ago in reply to RandyP

Prodigy 90 points

RandyP,

Final results:

Call overhead is 113 clocks.

Configuring the QDMA through IDMA0 works. In my program, the problem was that the destination area was cached by L2; after calling CACHE_wbInvL2 before the QDMA call, I get the right result.

My program makes a large number of QDMA calls, which is why I want to minimize the call overhead.

Thank you for your help!

0 RandyP over 13 years ago in reply to Zhijun Wang

TI__Guru* 84110 points

There should not be a cache coherency issue between L1D cache and L2 for IDMA or QDMA transfers. This is handled by the cache controllers in hardware.

To further minimize your QDMA call overhead, consider taking advantage of the trigger word placement that I mentioned in my first reply. Depending on how you will use QDMA in your system, you could allocate different QDMA channels for different types of transfers. For example, QDMA Channel 0 could be for generic single transfers so it would be set to trigger on DST and you would write only SRC, ACNT/BCNT, and DST to start each transfer; QDMA Channel 1 could be for same-length repeating transfers from different buffers to the same DST so it would be set to trigger on SRC and you would write only SRC to start each transfer.
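The per-channel trigger word is selected in the QDMA channel mapping register (QCHMAPn). A small sketch of building that register value follows. It assumes the usual EDMA3 layout (PaRAM entry in bits 13:5, trigger word in bits 4:2); verify this against the EDMA3 user guide for your device. The function and enum names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Trigger word numbers: the word index inside the PaRAM set. */
enum { TRWORD_OPT = 0, TRWORD_SRC = 1, TRWORD_A_B_CNT = 2, TRWORD_DST = 3,
       TRWORD_CCNT = 7 };

/* Build a QCHMAPn value: which PaRAM set the QDMA channel uses and
 * which word of that set triggers the transfer when written.
 * Assumed layout: PAENTRY in bits 13:5, TRWORD in bits 4:2. */
uint32_t qchmap_value(uint32_t param_index, uint32_t trigger_word)
{
    assert(param_index < 512u);
    assert(trigger_word < 8u);
    return (param_index << 5) | (trigger_word << 2);
}
```

For the example in the paragraph above, QDMA Channel 0 would use `qchmap_value(0, TRWORD_DST)` and QDMA Channel 1 would use `qchmap_value(1, TRWORD_SRC)`, assuming PaRAM sets 0 and 1 are the ones assigned to those channels.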

Another way to minimize QDMA overhead would be to use several sequential PaRAM sets for "QDMA chaining" (for example, PaRAM X to PaRAM X+9), with the last set being the PaRAM assigned to the QDMA channel (PaRAM X+9). This way you can write several PaRAM sets from L2 setup arrays using one IDMA0 call. When the last of the several PaRAM sets is written, it will trigger the QDMA to execute. That QDMA PaRAM X+9 would be set with STATIC=0 and LINK=PaRAM X+8; PaRAM X+8 would have LINK=PaRAM X+7; and so on.
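A rough sketch of filling in the LINK fields for such a chain, built in an L2 setup array before it is copied to PaRAM. It assumes that LINK holds the 16-bit offset of the target PaRAM set (0x4000 + 32 * index), that STATIC is OPT bit 3, and that the last transfer to execute (PaRAM X) ends the chain with a null link and STATIC=1. The function names and the array layout are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PARAM_OFFSET     0x4000u   /* PaRAM offset inside the channel controller */
#define OPT_STATIC       (1u << 3)
#define LINK_NULL        0xFFFFu
#define WORD_OPT         0
#define WORD_LINK_BCNTRLD 5

/* LINK field value that points at PaRAM set `index` (assumed encoding). */
uint32_t param_link_value(uint32_t index)
{
    assert(index < 512u);
    return PARAM_OFFSET + 32u * index;
}

/* sets[0] is the image for PaRAM X, sets[count-1] for the QDMA-mapped
 * PaRAM X+count-1. Each set links to the one below it; sets[0] is last
 * to execute, so it gets a null link and STATIC=1. */
void fill_qdma_chain_links(uint32_t sets[][8], uint32_t count,
                           uint32_t first_index)
{
    assert(count > 0u && first_index + count <= 512u);
    for (uint32_t i = 0; i < count; i++) {
        uint32_t link = (i == 0u) ? LINK_NULL
                                  : param_link_value(first_index + i - 1u);
        /* keep BCNTRLD (upper half), replace LINK (lower half) */
        sets[i][WORD_LINK_BCNTRLD] =
            (sets[i][WORD_LINK_BCNTRLD] & 0xFFFF0000u) | link;
        if (i == 0u)
            sets[i][WORD_OPT] |= OPT_STATIC;   /* end of chain */
        else
            sets[i][WORD_OPT] &= ~OPT_STATIC;  /* allow the link update */
    }
}
```

The transfers execute from the highest set down: the QDMA-mapped set runs first, then each linked set is loaded into it in turn.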

You can also use DMA channels for transfers. They require one extra write, to ESR, but there could be value depending on your actual system use of QDMA.

Regards,
RandyP

0 Zhijun Wang over 13 years ago in reply to RandyP

Prodigy 90 points

RandyP,

Thank you for your good advice.

So far we have discussed the single-core case. In the multi-core case, how should the limited DMA resources be allocated? Consider a simple scheme: one EDMA3 Channel Controller has only 8 QDMA channels and the 6678 has 8 cores, so each core can be assigned only 1 QDMA channel; for EDMA, each core can be assigned 8 DMA channels. Is this a reasonable way to use it?

0 RandyP over 13 years ago in reply to Zhijun Wang

TI__Guru* 84110 points

EDMA3 channel allocation is completely application dependent. Your method is reasonable, yes.

Some DMA channels are "allocated" to particular peripherals in hardware as indicated in the TPCCn Event tables in the datasheet. So in some applications, you may have a few of those pre-assigned to a peripheral, and that might affect which core would program it.

Regards,
RandyP
