Neon memcpy starterware

Hi,

Is there a way to use memcpy_neon in StarterWare? I am using StarterWare on a BBB and would like to use the NEON features. The most important one for me is memcpy, and I cannot find any example.
I compile the code with the required flags.
If someone could give me some help, I would appreciate it.

Thanks in advance.

  • Matheus,

    Very interesting question. Unfortunately, I'm not aware of memcpy_neon in Starterware. Hopefully someone else in the forum might have come across this previously.

    Lali
  • What is the standard memcpy() using? Perhaps it already accesses NEON when related compile switches are turned on and when it is faster than normal CPU-operations?
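
Whether the toolchain's memcpy() uses NEON depends on how the C library itself was built; compiler flags such as GCC's -mfpu=neon mainly affect the code generated from your own sources. To compare against an explicit NEON copy, a minimal sketch using the standard arm_neon.h intrinsics vld1q_u8 and vst1q_u8 could look like the following. The function name copy_neon_sketch is hypothetical, and the code falls back to plain memcpy() when NEON is not available at compile time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper, not part of StarterWare. */
void *copy_neon_sketch(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Move 16 bytes per iteration through one 128-bit NEON register. */
    while (len >= 16) {
        vst1q_u8(d, vld1q_u8(s));
        d += 16;
        s += 16;
        len -= 16;
    }
#endif
    /* Remaining bytes (or everything, if NEON is not compiled in). */
    memcpy(d, s, len);
    return dst;
}
```

This is only a baseline for measurement; a tuned version would typically unroll the loop and add preload hints, and it may still be limited by memory bandwidth rather than by the instructions used.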

  • Hi qxc.
    Now I'm using this function as my memcpy.
    The code:
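    long * plDst = (long *) dst;
    long const * plSrc = (long const *) src;
    size_t counter;
    /* Copy one long (4 bytes on this target) per iteration.
       Assumes len is a multiple of 4 and both pointers are word-aligned. */
    for (counter = 0; counter < (len / 4); counter++) {
        *plDst++ = *plSrc++;
    }
    return (dst);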

    This improved the performance a bit, but I want more. I am thinking of writing the function in assembly and comparing it with neon_memcpy.

    Thanks in advance,
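
A word-wise copy like the one above only works when the pointers are word-aligned and the length is a multiple of the word size. A slightly more defensive sketch, assuming a 32-bit target and using the hypothetical name copy_words, checks alignment first and finishes the tail byte by byte:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copies 32-bit words when both pointers are
   4-byte aligned, then copies any leftover bytes one at a time. */
void *copy_words(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    if ((((uintptr_t)d | (uintptr_t)s) & 3u) == 0) {
        uint32_t *dw = (uint32_t *)d;
        const uint32_t *sw = (const uint32_t *)s;
        size_t words = len / 4;

        for (i = 0; i < words; i++) {
            dw[i] = sw[i];
        }
        d += words * 4;
        s += words * 4;
        len -= words * 4;
    }

    /* Tail bytes, or the whole buffer if it was not aligned. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }
    return dst;
}
```

For a frame buffer, which is normally aligned and a multiple of four bytes long, the byte loop never runs, so the extra checks cost almost nothing.
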
  • Matheus, I have escalated this to our apps team. Sorry for the delay.
  • Matheus,

    Could you please provide more details on your test?

    In your initial post it was mentioned that you compiled with the relevant neon flags. What was the outcome of that? Did you do some sort of benchmark to see performance?

    In general, memcpy with neon compilation flags sounds right.

    http://processors.wiki.ti.com/index.php/StarterWare_NeonVFP 

    Lali

  • Hi.
    Actually, I want to improve my memcpy. I thought I could do this using NEON.
    I ran some tests using a GPIO to measure the performance: with a plain memcpy implementation the copy takes about 13 ms, and with neon_memcpy it takes about 17 ms.

    The function I am working on is toggleFrameBuffer() (StarterWare + BeagleBone Black), which copies the display buffer. Maybe someone knows another way to improve this operation.

    Thanks in advance.
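
When comparing copy routines like this, it helps to run each candidate several times over the same buffers and average the result. The sketch below shows one possible harness; the names copy_fn and time_copy_ms are hypothetical, and clock() is only a host-side stand-in for the GPIO toggle (or a hardware timer) used on the board.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Any routine with the same signature as memcpy() can be measured. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t len);

/* Average time of one copy in milliseconds over `reps` runs.
   On the target, replace the clock() calls with a pin toggle or a
   timer read; clock() resolution is an assumption of this sketch. */
double time_copy_ms(copy_fn fn, void *dst, const void *src,
                    size_t len, unsigned reps)
{
    clock_t start, stop;
    unsigned i;

    if (reps == 0) {
        return 0.0;
    }
    start = clock();
    for (i = 0; i < reps; i++) {
        fn(dst, src, len);
    }
    stop = clock();
    return 1000.0 * (double)(stop - start) / CLOCKS_PER_SEC / reps;
}
```

Calling time_copy_ms(memcpy, dst, src, len, 100) and then the same call with the NEON routine gives two numbers measured under the same conditions.
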
  • Hi Matheus,

    You could use the DMA engine to do the block copy; it's much faster. I am working on a similar project with the BeagleBone Black, exploring the LCD controller. With the DMA engine, I was able to copy the framebuffer of a 1280x720 16 BPP frame (around 1.75 Mbytes) in under 3 ms, even when the board is running at 300 MHz.

     You could also take a look at this page (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html) for different approaches to do memcpy. It also explains how to use NEON to do memory copy.

    Regards,

    Khalid
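
The figures in this post can be checked with a little arithmetic: a 1280x720 frame at 16 bits per pixel is 1,843,200 bytes (about 1.76 MiB), and moving that in 3 ms corresponds to roughly 614 MB/s, so "under 3 ms" means at least that rate. A small sketch, with hypothetical function names, that reproduces the numbers:

```c
#include <stddef.h>

/* Size of one frame in bytes; bpp is bits per pixel. */
unsigned long frame_bytes(unsigned width, unsigned height, unsigned bpp)
{
    return (unsigned long)width * height * (bpp / 8);
}

/* Throughput in megabytes (10^6 bytes) per second for a transfer
   of `bytes` bytes that takes `ms` milliseconds. */
double throughput_mb_s(unsigned long bytes, double ms)
{
    return (double)bytes / (ms / 1000.0) / 1e6;
}

/* Example: frame_bytes(1280, 720, 16) is 1843200, and
   throughput_mb_s(1843200, 3.0) is about 614.4. */
```

The same two functions applied to the 13 ms and 17 ms measurements earlier in the thread would show how far the CPU copies are from this rate, once the actual frame size used there is known.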

  • Hello Khalid,

    DMA is an interesting approach to do fast memory copying. Is there a StarterWare example available somewhere showing how to use it?

    Thanks!

  • Hi qxc,


    I don't recall seeing example code for this in the StarterWare pack, but there are quite a few examples that use the DMA engine for other purposes. However, I have written my own code that uses the DMA, mainly to copy the large framebuffer to the RAM. I have attached the DMA code section from my project, so you can take a look at it and see if you could use it. The code is rudimentary and can still be improved.

    If you have the D-cache enabled, you will need to do a D-cache clean (there is StarterWare example code that explains that) before initiating the memory transfer. This adds overhead to the whole copy process, but it wasn't that significant according to my benchmarks. The D-cache clean took around 1.5 ms for the 1.75 Mbyte framebuffer at 600 MHz, while the DMA engine took under 3 ms to transfer the memory block.


    One other good thing about using DMA, besides being faster, is that the transfer speed is independent of the CPU frequency; you will get the same throughput at 300 MHz and at 1000 MHz.

    The only issue that I have come across so far is that, for some reason, the DMA engine cannot copy arrays/blocks that are located in the internal RAMs; it will only copy memory blocks that are in the DDR memory, and the destination array must be in the DDR memory as well.

    The code is commented, but if you need help in using it let me know.


    Regards,

    Khalid

    0083.dma.zip
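
The attached code is not reproduced here, so the following is only a sketch of two checks that might be done before programming such a transfer, not Khalid's implementation. It assumes the AM335x EDMA3 parameter set, where ACNT, BCNT and CCNT are 16-bit counts and the B index is a signed 16-bit value, and it assumes the BeagleBone Black memory map with 512 MB of DDR starting at 0x80000000. All struct and function names are hypothetical; the cache clean and the actual EDMA register programming are left to the StarterWare drivers and are not shown.

```c
#include <stddef.h>
#include <stdint.h>

#define DDR_BASE 0x80000000u  /* assumed DDR start on AM335x */
#define DDR_SIZE 0x20000000u  /* assumed 512 MB on BeagleBone Black */

/* Hypothetical container for the counts of an AB-synchronized copy. */
struct dma_copy_plan {
    uint16_t acnt;      /* bytes per array (for example one display line) */
    uint16_t bcnt;      /* number of arrays in the frame */
    uint16_t ccnt;      /* number of frames, 1 for a single block */
    int16_t  src_bidx;  /* byte step between arrays at the source */
    int16_t  dst_bidx;  /* byte step between arrays at the destination */
};

/* Returns 1 if [addr, addr + len) lies entirely inside DDR. */
int in_ddr(uint32_t addr, uint32_t len)
{
    return addr >= DDR_BASE && len <= DDR_SIZE &&
           (addr - DDR_BASE) <= DDR_SIZE - len;
}

/* Splits `len` bytes into bcnt arrays of `chunk` bytes each.
   Returns 0 on success, -1 if the counts do not fit the 16-bit fields. */
int plan_ab_sync_copy(size_t len, size_t chunk, struct dma_copy_plan *plan)
{
    if (chunk == 0 || chunk > 32767 || len == 0 || len % chunk != 0) {
        return -1;
    }
    if (len / chunk > 65535) {
        return -1;
    }
    plan->acnt = (uint16_t)chunk;
    plan->bcnt = (uint16_t)(len / chunk);
    plan->ccnt = 1;
    plan->src_bidx = (int16_t)chunk;  /* contiguous source */
    plan->dst_bidx = (int16_t)chunk;  /* contiguous destination */
    return 0;
}
```

For the 1280x720 16 BPP frame discussed above, using one display line (2560 bytes) as the chunk gives ACNT = 2560 and BCNT = 720, and the in_ddr() check reflects the observation that both source and destination had to be in DDR.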

  • Hi, thanks for the help.
    I will try this approach and make some tests.