Fast access to/from GPMC Enabled FPGA Memory

Eric Texley

Hello. I'm wondering if anyone has discovered a method to implement a fast memcpy using load multiple/store multiple on an FPGA memory mapped as a GPMC device. I've implemented a fast memcpy on the ARMv7 using load/store multiple like this:

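loop:
    ldmia r1!, { r5-r8 }    @ load four registers (16 bytes) from the source, advance r1
    stmia r0!, { r5-r8 }    @ store four registers (16 bytes) to the destination, advance r0

loop_test:
    cmp r4, #0              @ is the remaining count zero?
    subne r4, r4, #1        @ if not, decrement it
    bne loop                @ and copy the next 16 bytes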

But in order to achieve this, the source/target accesses must be 32 bits wide.

Does anyone know of a way to implement multiple load/store on ARM using 16-bit accesses?

over 12 years ago

0 Eric Texley over 12 years ago

Hmm... maybe this could be done with the NEON SIMD unit :o

0 Renjith Thomas over 12 years ago

Eric,

When you run the above code, what do you observe?

A 32-bit data bus is not needed to use the ldm/stm instructions. What throughput are you currently getting, and what throughput do you expect?

0 Eric Texley over 12 years ago in reply to Renjith Thomas

Thinking that I needed a dedicated 16-bit copy to/from the GPMC CS-addressable memory at 0x2b000000, I implemented a fast memcpy using NEON:

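NEONCopyPLD:
      pld [r1, #0xC0]       @ prefetch source data 192 bytes ahead of r1
      vldm r1!,{d0-d7}      @ load 64 bytes from the source, advance r1
      vstm r0!,{d0-d7}      @ store 64 bytes to the destination, advance r0
      subs r2,r2,#0x40      @ subtract 64 from the remaining byte count
      bge NEONCopyPLD       @ loop while the remaining count is >= 0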

as discussed here: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka13544.html

I benchmarked a normal 16-bit memcpy, ldm/stm, and the NEON memcpy, and as expected NEON won. For grins, I tried copying to the GPMC device memory using all three. To my surprise, I didn't have to make any specific accommodation for accessing the GPMC device! The controller took care of it for me! As expected, the NEON copy was the fastest.

How is it that I'm able to transfer the contents of a 32-bit register to two 16-bit locations in my GPMC device? Is the controller taking care of that for me? I am impressed.
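One plausible reading of this, sketched in C: if the chip select is configured as a 16-bit device, the controller can carry out a wider access as consecutive 16-bit bus cycles. That splitting behavior is an assumption here and is worth confirming against the GPMC chapter of the device TRM. The function name `split_word32` and the array standing in for the device are hypothetical; this only models where the two halves of a little-endian 32-bit value would land.

```c
#include <stdint.h>

/* Hypothetical model: 'dev' stands in for a 16-bit-wide memory-mapped device.
 * A 32-bit little-endian write at an even byte offset is carried out as two
 * 16-bit writes to consecutive halfword locations. */
void split_word32(uint16_t *dev, uint32_t byte_offset, uint32_t value)
{
    uint32_t idx = byte_offset / 2;               /* halfword index on the device */
    dev[idx]     = (uint16_t)(value & 0xFFFFu);   /* low halfword, lower address */
    dev[idx + 1] = (uint16_t)(value >> 16);       /* high halfword, next address */
}
```

Under that assumption, an stm of four registers or a vstm of eight d registers would simply turn into a longer run of 16-bit cycles, with no change needed in the copy loop.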

0 Eric Texley over 12 years ago in reply to Eric Texley

As requested, here are the numbers for the throughput I'm getting:

"Regular memcpy" is a naive copy from short * to short *.
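For reference, a naive halfword copy of this kind might look roughly like the sketch below. This is not the code that was benchmarked; `copy_halfwords` is a hypothetical name, and the `volatile` destination is an assumption made so the compiler keeps each store as a separate 16-bit access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of a naive copy: one 16-bit load and one 16-bit store
 * per loop iteration. */
void copy_halfwords(volatile uint16_t *dst, const uint16_t *src, size_t nbytes)
{
    assert(nbytes % sizeof(uint16_t) == 0);   /* whole halfwords only */
    size_t count = nbytes / sizeof(uint16_t);
    for (size_t i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}
```

Each iteration moves only 2 bytes and pays the loop overhead every time, whereas the ldm/stm loop moves 16 bytes per iteration and the NEON loop 64.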

Copy 2048 bytes from a word-aligned DDR buffer to a memory-mapped GPMC device:

Regular memcpy takes 1033 Usec
Compare result is 0
memcpy with LDM/STM takes 207 Usec
Compare result is 0
neon takes 185 Usec

Copy 2048 bytes from a word-aligned DDR buffer to another word-aligned DDR buffer:

Regular memcpy takes 936 Usec
memcpy with ldm/stm takes 89 Usec
neon takes 52 Usec
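To put these timings in throughput terms, bytes per microsecond is numerically the same as MB/s (taking 1 MB as 10^6 bytes). A small sketch of that arithmetic follows; the helper names `mb_per_s` and `print_throughput` are hypothetical, and the timings are the ones posted above.

```c
#include <stdio.h>

/* Bytes per microsecond equals MB/s when 1 MB = 10^6 bytes. */
double mb_per_s(double bytes, double usec)
{
    return bytes / usec;
}

/* Hypothetical helper: prints the throughput for each posted timing. */
void print_throughput(void)
{
    const double bytes = 2048.0;   /* size of each copy in the benchmark */

    printf("DDR -> GPMC, 16-bit copy: %.2f MB/s\n", mb_per_s(bytes, 1033.0));
    printf("DDR -> GPMC, ldm/stm:     %.2f MB/s\n", mb_per_s(bytes, 207.0));
    printf("DDR -> GPMC, NEON:        %.2f MB/s\n", mb_per_s(bytes, 185.0));
    printf("DDR -> DDR,  16-bit copy: %.2f MB/s\n", mb_per_s(bytes, 936.0));
    printf("DDR -> DDR,  ldm/stm:     %.2f MB/s\n", mb_per_s(bytes, 89.0));
    printf("DDR -> DDR,  NEON:        %.2f MB/s\n", mb_per_s(bytes, 52.0));
}
```

By this measure the NEON copy to the GPMC device comes out at roughly 11 MB/s, against roughly 39 MB/s for the DDR-to-DDR NEON copy.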

0 Renjith Thomas over 12 years ago in reply to Eric Texley

Eric,

Have you achieved the expected performance? If so, can you mark this query as resolved?
