Is Canny Edge Detection MEMCPY() blocking?

I have a working custom codec, but it processes whole frames in external (slower) memory.  I already converted it to slices for unrelated reasons, and am now considering upgrading it to use faster L1D memory.

I'm looking at the Canny edge detection example, at the MEMCPY() function (defined in terms of CANNY_TI_do1DDma).  This function appears to me to be "blocking".  That is, you call it and it does not return until the copy is complete.  I understand this may be just a simplified example, but in my 30 years' experience, on and off, of doing DMA, you always overlap the DMA transfer with other activities.  You don't "block" like this.

Does it seem that I'm understanding the situation correctly?  Or am I missing something?  More specifically, I might get my codec working using this MEMCPY() function, but then later restructure it so that it is NOT blocking.  For example, I might handle two slices at a time, stepping one slice forward each iteration.  (My algorithm logic would allow this.)  In doing so, I begin each by starting DMA on slice N+1, and then process slice N (such processing being prefixed by a wait on an earlier DMA start on slice N).  Thus, I am at least overlapping the processing of slice N with the DMA transfer of the next slice, N+1.  This scenario is for a simple input-process-only situation.  A more complicated input-process-output situation could similarly overlay processing with DMA transfer.

Oh, speaking of which, the doc is a little too voluminous to easily find the answer to this next question.  Can I run an input and output DMA simultaneously (DM6467T)?  I'm thinking that, if I can, I'll need "two channels".  I believe the existing Canny example receives a single channel from UNIVERSAL_create().  I would need to somehow create a second channel, which seems it would violate XDAIS standards?  Any advice here as well is greatly appreciated.  (My codec source superstructure is derived from the Canny example.)

Thanks very much,

Helmut
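The input-process-output overlap described in the question can be sketched on a host machine. This is only an illustration under stated assumptions: `Chan`, `dma_start`, `dma_wait`, `process`, and `run_pipeline` are hypothetical names, the "DMA" is simulated with `memcpy` that runs lazily at wait time, and whether the DM6467T and the UNIVERSAL/XDAIS resource model actually grant a second channel is not answered here. The sketch simply assumes one input channel and one output channel are available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLICE 32u   /* hypothetical slice size in bytes */

/* Hypothetical channel: one outstanding copy, completed lazily on wait.
 * Lazy completion means a buffer touched too early yields wrong data,
 * which is the same hazard a real asynchronous DMA has. */
typedef struct { void *dst; const void *src; size_t n; int busy; } Chan;

static void dma_start(Chan *c, void *dst, const void *src, size_t n)
{
    assert(!c->busy);               /* one transfer per channel at a time */
    c->dst = dst; c->src = src; c->n = n; c->busy = 1;
}

static void dma_wait(Chan *c)
{
    if (c->busy) { memcpy(c->dst, c->src, c->n); c->busy = 0; }
}

/* Hypothetical per-slice work: invert every byte. */
static void process(uint8_t *out, const uint8_t *in, size_t n)
{
    for (size_t i = 0; i < n; i++) out[i] = (uint8_t)(255u - in[i]);
}

/* Input of slice n+1 and output of slice n-1 are both in flight
 * while slice n is being processed. */
static void run_pipeline(uint8_t *dst, const uint8_t *src, size_t nslices)
{
    uint8_t inbuf[2][SLICE], outbuf[2][SLICE];   /* stand in for fast memory */
    Chan in = {0}, out = {0};

    if (nslices == 0) return;
    dma_start(&in, inbuf[0], src, SLICE);
    for (size_t n = 0; n < nslices; n++) {
        size_t cur = n & 1, nxt = cur ^ 1;
        dma_wait(&in);                           /* slice n has arrived */
        if (n + 1 < nslices)                     /* prefetch slice n+1 */
            dma_start(&in, inbuf[nxt], src + (n + 1) * SLICE, SLICE);
        process(outbuf[cur], inbuf[cur], SLICE);
        dma_wait(&out);                          /* output of slice n-1 done */
        dma_start(&out, dst + n * SLICE, outbuf[cur], SLICE);
    }
    dma_wait(&out);                              /* drain the last output */
}
```

Calling `run_pipeline(dst, src, 4)` on a 128-byte source fills `dst` with the inverted bytes. The ordering of starts and waits is the point of the sketch; the transfer mechanism itself would be replaced by whatever the real DMA layer provides.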

  • On occasion, I have seen the library function memcpy() implemented as a DMA copy. It blocks the current thread and allows other threads to be swapped in while the copy is in progress. A non-blocking memcpy() would require some sort of callback or signal to tell you that the copy has completed and you can proceed to use the source or destination buffers. Multi-threading might be your only way to work around blocking.

  • Thanks, Norman.  You've confirmed my assessment that this MEMCPY was blocking.  I'm working in a remote office of one, so such a double-check on my brain is very helpful.

    Also, when I was thinking of "blocking", I was thinking within the current thread and not amongst multiple threads!  Within the DSP, where I want to use it, there's only one thread anyway (that I know of and for sure that I'd be using).

    Multi-threading isn't necessary to get around the single-threaded same-thread blocking in the DSP.  I've done this many times on other processors.  All that's necessary is to check a completion flag before using the result of the DMA transfer.  For example, there's already a single-line while loop at the end of MEMCPY() to check for completion.  Just move it to a new function MEMCPYwait(), then rename the original function MEMCPYstart().  This, of course, assumes that design-time knowledge can be used to reduce the function to have no loop at all, that is, to operate with a maximum byte count small enough to complete in one iteration.  Then, I could call MEMCPYstart() for buffer N+1, call MEMCPYwait() for buffer N, process buffer N, increment N and repeat.  Obviously, my notes here suggest at least two independent completion flags.  If sizes require the loop in MEMCPY, then things get more complicated, but are generally handled in the same manner.  Similarly, if the DMA transfer takes longer than processing a buffer, then adjust the relative workload to compensate.  If that is not possible, at least some parallelism has been achieved.

    NOTE, my codec is substantially a DETERMINISTIC PROCESS, so with design-time knowledge, it can be broken up and pieced back together in an operational-speed optimized manner.  From this point of view, multi-threading is useful for non-deterministic processes.  This is a deterministic process, so multi-threading is not so much a necessity as a programmer convenience.  Without multi-threading, insightful programming design can achieve the same and sometimes superior performance.

    Thanks again,

    -Helmut
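A minimal host-side sketch of the MEMCPYstart()/MEMCPYwait() split described above, with two independent completion flags. Everything except the two function names and the 0xfffc chunk limit is an assumption made for illustration: `SimChan`, `sim_engine_service`, `process_slice`, and `process_frame` are hypothetical, and the DMA engine is simulated by a `memcpy` that runs inside the polling loop. On the DSP, the hardware would complete the transfer on its own and the wait would poll the completion bit instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_CHUNK   0xfffcu  /* single-transfer limit from the quoted example */
#define SLICE_BYTES 64u      /* hypothetical slice size */

/* Hypothetical stand-in for one channel: a pending request plus a flag. */
typedef struct {
    void *dst;
    const void *src;
    size_t bytes;
    volatile int done;       /* 1 when no transfer is outstanding */
} SimChan;

/* Host-only stand-in for the DMA engine: completes the pending copy. */
static void sim_engine_service(SimChan *ch)
{
    if (!ch->done) {
        memcpy(ch->dst, ch->src, ch->bytes);
        ch->done = 1;
    }
}

/* Start a transfer and return immediately. Design-time knowledge is
 * assumed to guarantee that the request fits in one chunk. */
static void MEMCPYstart(SimChan *ch, void *dst, const void *src, size_t bytes)
{
    assert(bytes <= MAX_CHUNK);
    ch->dst = dst; ch->src = src; ch->bytes = bytes;
    ch->done = 0;
}

/* The single-line completion loop, moved into its own function. */
static void MEMCPYwait(SimChan *ch)
{
    while (!ch->done) {
        sim_engine_service(ch);   /* simulation only; hardware runs by itself */
    }
}

/* Hypothetical per-slice work: sum the bytes in the fast buffer. */
static uint32_t process_slice(const uint8_t *buf, size_t bytes)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < bytes; i++) sum += buf[i];
    return sum;
}

/* Start slice n+1, wait for slice n, process slice n, repeat. */
static uint32_t process_frame(const uint8_t *frame, size_t nslices)
{
    static uint8_t fast[2][SLICE_BYTES];          /* stands in for two L1D buffers */
    SimChan flag[2] = { {0, 0, 0, 1}, {0, 0, 0, 1} };
    uint32_t total = 0;

    if (nslices == 0) return 0;
    MEMCPYstart(&flag[0], fast[0], frame, SLICE_BYTES);
    for (size_t n = 0; n < nslices; n++) {
        size_t cur = n & 1, nxt = cur ^ 1;
        if (n + 1 < nslices)
            MEMCPYstart(&flag[nxt], fast[nxt],
                        frame + (n + 1) * SLICE_BYTES, SLICE_BYTES);
        MEMCPYwait(&flag[cur]);
        total += process_slice(fast[cur], SLICE_BYTES);
    }
    return total;
}
```

The buffer that receives slice n+1 is the one that held slice n-1, which has already been processed, so the two flags never guard the same buffer at the same time.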

     

  • Framework Components provides some DMA libraries that have the kind of _wait APIs that you are looking for.

    Depending on the version of Codec Engine/Framework Components you are using, I could point you to the right DMA library (and the right interface to request DMA resources for use with the library).

    What versions are you using currently ?

  • Thanks.

    === on UBUNTU/VMWARE where I use "make" to compile application and codec server .x64P ===

    Codec Engine - hmm, I have two folders but assume I'm using the latest:  2.25.01.06 vs 2.26.02.11

    FC: 2.25.01.05 (well perhaps I'm using CE 2.25 as well, I'm not sure how to confirm, but note WinXP below)

    DMAI (you don't need this info, do you?) 2.10.00.05

    === on WINDOWS XP where I use "CCS" to compile codec itself .a64P ===

    Codec Engine - 2.26.02.11

    dsplink_linux 1.64

    xdais 6.25.01.08

    xdctools 3.16.02.32

    bios 5.41.02.14

    C6000 Code Gen Tools 7.0.1

     

    -Helmut

  • Okay, since you are on the 2.x branch of Codec Engine/Framework Components, you can use the ACPY3 library to perform DMA transfers. 

    ACPY3 has an API called ACPY3_wait, that lets you "poll" on a transfer and wait for its completion. 

    To request resources, your codec needs to implement the IDMA3 interface that requests DMA resources in a way that ACPY3 can then be used.

    In the Codec Engine installation, you can look at the videnc_copy or videnc1_copy example, which demonstrates how to request DMA resources and then use ACPY3 to perform transfers.

    There are also non-Codec Engine, Framework Components based examples in the Framework Components installation (ti/sdo/fc/dman3/examples/fastcopy/ ), built specifically for DM6467, that you can use to study how to use ACPY3 and IDMA3. The installation also has documentation on these interfaces.

    Let us know if you need more pointers. 

     

  • Gunjan,

    Thanks, I'll keep this in mind.

    HOWEVER, it would be much easier to simply use CANNY_TI_do1DDma() from canny.c, from the Canny Edge Detection example.  I derived my codec from this example and actually already have CANNY_TI_do1DDma() sitting unused in my code! 

    ref: http://processors.wiki.ti.com/index.php/C64x%2B_iUniversal_Codec_Creation_-_from_memcpy_to_Canny_Edge_Detector 

    Looking inside CANNY_TI_do1DDma I see it referencing edmaChan, which ultimately comes from the UNIVERSAL codec engine stuff.  So there's lots of admin already being done before hand, by someone OTHER than me!

    I'd like to hear from you, saying I *can* just use this function.  Otherwise, I intend to try it and find out anyway.  If that works, as I expect, great.  If not, I'll dig deeper into the meaning of your post just above.

    -Helmut

  • I briefly looked at the wiki topic you have linked to above, and it looks like it uses the same DMAN3/ACPY3 technology to actually perform the DMA transfers, so this approach is probably legit. (You can probably peek inside the do1DDma code and confirm that it makes some "ACPY3" calls.)

    In fact, further down in the post it actually refers to the DMAN3/ACPY3 guide to help with programming the DMA to perform the kinds of transfer the codec needs. 
    Looks like you are on the right track !

     

     

  • Gunjan, thanks very much.  I had seen the ACPY3 on that wiki page.

    Note the source for the function is really NASTY, with lots of hard-coded values and pointer arithmetic.  This hard-coding is something one most often attempts to avoid, but since it's already here and presumably working...

    I'm quoting the code below.  You can see the first while that I'll remove, to do just a single sub-maximum chunk.  You can see the single-line second while near the bottom, which is commented upon as the [wait].

    CANNY_TI_do1DDma said:

    void CANNY_TI_do1DDma(Uns dst, Uns src, Uns bytes, Uns edmaChan)
    {
        const Uint32 maxTransferChunkSize       = 0xfffc;
        Uint32       thisTransferChunkSize      = 0x0;
        Uint32       remainingTransferChunkSize;

        unsigned int * pEdma;
        unsigned int * pParam;
        volatile unsigned int * ipr;

        unsigned char *src_addr = (unsigned char *)src;
        unsigned char *dest_addr = (unsigned char *)dst;
        unsigned int tccNum = (unsigned int)-1;

        remainingTransferChunkSize = bytes;

        // printf("CANNY_TI_do1DDma(0x%x, 0x%x, 0x%x)\n",src,dst,bytes);

        pEdma = (unsigned int* )EDMAADDR;

        while (remainingTransferChunkSize > 0) {

            if (remainingTransferChunkSize > maxTransferChunkSize) {
                thisTransferChunkSize = maxTransferChunkSize;
            }
            else {
                thisTransferChunkSize = remainingTransferChunkSize;
            }

            pParam = (unsigned int *)((unsigned int) pEdma + 0x4000 +
                                      (0x20 * edmaChan)); //param # edmaChan

            tccNum = (edmaChan << 12);

            pParam[0] = (0x00100008 | tccNum); //OPT TCC == edmachan + STATIC
            pParam[1] = (unsigned int)src_addr;
            pParam[2] = 0x00010000 + thisTransferChunkSize; //Bcnt, Acnt
            pParam[3] = (unsigned int)dest_addr;
            pParam[4] = 0x0;
            pParam[5] = 0xFFFF;
            pParam[6] = 0x0;
            pParam[7] = 0x1;

            remainingTransferChunkSize -= thisTransferChunkSize;
            src_addr += thisTransferChunkSize;
            dest_addr += thisTransferChunkSize;

            /* SECR = 0xFFFFFFFF */
            *((unsigned int *)((unsigned int) pEdma + 0x1040)) = 0xFFFFFFF;
            *((unsigned int *)((unsigned int) pEdma + 0x1070)) = (0x01 << edmaChan);

            /* DCHMAP */
            *((unsigned int *)((unsigned int) pEdma + 0x100 + (0x4 * edmaChan))) =
                (edmaChan << 4) ;

            ipr = (unsigned int *)((unsigned int)pEdma+ 0x1068);

            /* ESR = 0x02 */
            *((unsigned int *)((unsigned int) pEdma + 0x1010)) = (0x01 << edmaChan);

            /* Check for completion */
            while (((*ipr) & (0x1 << edmaChan))  != (0x01 << edmaChan));
        }
    }
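The hard-coded values in the quoted function can be given names, which makes the pointer arithmetic easier to follow. The sketch below is host-runnable arithmetic only and touches no hardware. The helper names are hypothetical, and the register names attached to the offsets (in particular IPR for 0x1068 and ICR for 0x1070, and the TCINTEN and STATIC bits of OPT) are my reading of the EDMA3 register map rather than something stated in the thread, so they should be checked against the device's EDMA3 guide.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets from the EDMA base address, as they appear in the quoted code.
 * Names are assumptions to be verified against the EDMA3 documentation. */
enum {
    EDMA_DCHMAP = 0x0100,   /* + 4 * channel */
    EDMA_ESR    = 0x1010,   /* event set: writing a 1 triggers the channel */
    EDMA_SECR   = 0x1040,
    EDMA_IPR    = 0x1068,   /* completion bits polled by the wait loop */
    EDMA_ICR    = 0x1070,   /* written before the start to clear a stale bit */
    EDMA_PARAM  = 0x4000,   /* parameter sets start here */
    PARAM_SIZE  = 0x20      /* bytes per parameter set (8 words) */
};

#define OPT_STATIC     0x00000008u
#define OPT_TCINTEN    0x00100000u
#define OPT_TCC_SHIFT  12
#define MAX_CHUNK      0xfffcu

/* Byte offset of the parameter set used for a channel. */
static uint32_t param_offset(uint32_t chan)
{
    return (uint32_t)EDMA_PARAM + (uint32_t)PARAM_SIZE * chan;
}

/* Word 0 of the parameter set: 0x00100008 | (chan << 12) in the original. */
static uint32_t opt_word(uint32_t chan)
{
    return OPT_TCINTEN | OPT_STATIC | (chan << OPT_TCC_SHIFT);
}

/* Word 2: one array of `bytes` bytes (upper half = 1, lower half = bytes). */
static uint32_t a_b_count(uint32_t bytes)
{
    assert(bytes <= MAX_CHUNK);   /* must fit in the lower 16 bits */
    return 0x00010000u + bytes;
}

/* Bit used for this channel in the trigger, clear, and pending registers. */
static uint32_t chan_bit(uint32_t chan)
{
    return 1u << chan;
}

/* Number of loop iterations the original while loop performs. */
static uint32_t chunks_needed(uint32_t bytes)
{
    return (bytes + MAX_CHUNK - 1u) / MAX_CHUNK;
}
```

For a slice that fits in one chunk, `chunks_needed` returns 1, which is the case where the outer loop can be dropped and the final polling loop moved into a separate wait function.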

     

     

     

     

     

     

  • Looking at this code snippet, it doesn't seem like ACPY3 is being used to perform the DMA transfer. The registers are being programmed directly. (The wiki article, however, mentions that ACPY3 is used; see the section "Applying Slicing to the Pre and Post Processing Algorithms".)

    Using ACPY3 would let you abstract all your DMA calls into clean APIs. And all of the internals of the DMA resources would be encapsulated in an IDMA3_Handle. 

    Maybe once you get things working, you can look at how to convert this codec into one that uses DMAN3/ACPY3. The wiki topic seems to have all the details. (See the section "How can algorithm request hardware resources from the system".)