
Fast access to/from GPMC Enabled FPGA Memory

Hello. I'm wondering if anyone has discovered a method to implement a fast memcpy using load multiple/store multiple on an FPGA memory mapped as a GPMC device. I've implemented a fast memcpy on the ARMv7 using load/store multiple like this:

loop:

    ldmia r1!, { r5-r8 } @load four registers
    stmia r0!, { r5-r8 } @store four registers

loop_test:
    cmp r4, #0
    subne r4, r4, #1
    bne loop

but for this to work, the source/target accesses must be 32 bits wide.

Does anyone know of a way to implement multiple load/store on ARM using 16-bit accesses?

  • Hmm... maybe do this with the NEON SIMD unit :o

  • Eric,

    When you run the above code, what do you observe?

    A 32-bit data bus is not needed to use the ldm/stm instructions. What throughput are you currently getting? What throughput do you expect?



  • Thinking that I needed to specifically implement a 16-bit copy to/from the GPMC CS addressable memory at 0x2b000000, I implemented a fast memcpy using NEON:

    NEONCopyPLD:
          pld [r1, #0xC0]
          vldm r1!,{d0-d7}
          vstm r0!,{d0-d7}
          subs r2,r2,#0x40
          bge NEONCopyPLD

    as discussed here: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

    I benchmarked a normal 16-bit memcpy, ldm/stm, and the NEON memcpy, and as expected the NEON version won. For grins, I tried copying to the GPMC device memory using all three. To my surprise, I didn't have to make any specific accommodation for accessing the GPMC device! The controller took care of it for me! As expected, the NEON version was the winner.

    How is it that I'm able to transfer the contents of a 32-bit register to two 16-bit locations in my GPMC device? Is the controller taking care of that for me? I am impressed.
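
    To make the question concrete, here is a minimal software model of what "one 32-bit store landing in two 16-bit locations" would look like if the controller splits wide accesses. Whether it actually does so is not confirmed in this thread. The function name `write32_as_two_16`, the `dev` array standing in for device memory, and the little-endian ordering are all assumptions for illustration.

```c
#include <stdint.h>

/* Illustrative model only, not the controller's actual logic.
 * One 32-bit write to a 16-bit-wide device is modeled as two
 * consecutive 16-bit writes.
 * Assumptions (not from the thread): little-endian ordering, so the
 * low halfword goes to the lower address; `dev` stands in for the
 * memory-mapped device. */
static void write32_as_two_16(uint16_t *dev, uint32_t halfword_index,
                              uint32_t value)
{
    dev[halfword_index]     = (uint16_t)(value & 0xFFFFu); /* low half  */
    dev[halfword_index + 1] = (uint16_t)(value >> 16);     /* high half */
}
```

    Under these assumptions, writing 0x12345678 at halfword index 0 leaves 0x5678 in the first 16-bit location and 0x1234 in the second.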

  • As requested, here are the numbers for the throughput I'm getting:

    "Regular memcpy" is a naive copy from short * to short *.
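
    For reference, a naive halfword copy of this kind might look roughly like the sketch below. This is not the exact code that was benchmarked; the name `naive_copy16` and the requirement that the byte count be a multiple of 2 are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy `bytes` bytes one 16-bit halfword at a time.
 * `bytes` is assumed to be a multiple of 2. `dst` is volatile so the
 * compiler keeps each store as a separate halfword access, which is
 * what one usually wants for a memory-mapped device. */
static void naive_copy16(volatile uint16_t *dst, const uint16_t *src,
                         size_t bytes)
{
    size_t n = bytes / sizeof(uint16_t); /* number of halfwords */
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
    }
}
```

    Each iteration moves only 2 bytes, versus 16 bytes per ldm/stm pair and 64 bytes per vldm/vstm pair in the loops above.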

    Copy 2048 bytes from a word-aligned DDR buffer to a memory-mapped GPMC device:

    Regular memcpy takes       1033 Usec
    Compare result is 0
    memcpy with LDM/STM takes  207 Usec
    Compare result is 0
    neon takes                 185 Usec

    Copy 2048 bytes from a word-aligned DDR buffer to another word-aligned DDR buffer:

    Regular memcpy takes       936 Usec
    memcpy with ldm/stm takes  89 Usec
    neon takes                 52 Usec
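
    To turn these timings into throughput, a small helper along these lines can be used. It assumes "Usec" means microseconds; the function name `throughput_mb_per_s` is made up for this sketch.

```c
/* Hypothetical helper: convert a byte count and an elapsed time in
 * microseconds into throughput in MB/s (1 MB = 10^6 bytes).
 * Bytes per microsecond is numerically equal to MB per second. */
static double throughput_mb_per_s(unsigned bytes, unsigned usec)
{
    return (double)bytes / (double)usec;
}
```

    With the numbers above, this gives roughly 2.0, 9.9, and 11.1 MB/s for the GPMC target (regular, ldm/stm, NEON) and roughly 2.2, 23.0, and 39.4 MB/s for DDR to DDR.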

  • Eric,

    Have you achieved the expected performance? If so, can you mark this query as resolved?