OMAP DM3730/3530 GPMC Access timing issues

We are seeing behavior in the DM3730 GPMC access timing that does not make sense to us.

We have configured CS3 of the OMAP to access our FPGA, with the timing set so that CS is low for roughly 100 ns for a single 16-bit read access from the FPGA. CS3 is configured as a multiplexed 16-bit NOR flash interface in our design.

When the processor issues a 32-bit read from CS3 we see two back-to-back transactions with a 12 ns delay between them, which is correct since that is the delay we set up in our GPMC timing. But when we issue the next 32-bit read from CS3 we see a delay of around 150-160 ns before the two back-to-back reads appear as expected. We are trying to determine why we see the 150-160 ns delay.

The following is the simple sequence of instructions that our code executes:

1  LDR r_n, [FPGA]
2  STR r_n, [r_s]
3  LDR r_n, [FPGA + 4]
4  STR r_n, [r_s + 4]
5  LDR r_n, [FPGA + 8]
6  STR r_n, [r_s + 8]

In the above sequence, instruction 1 issues two back-to-back 16-bit reads as expected.

But between instruction 1 and instruction 3 we see around 150-160 ns of delay. As one can see above, there is only one store (STR) instruction between the two reads, and it should not be taking 150-160 ns. My guess is that this store also goes to the internal data cache, which should be very fast.

Is there some other setting outside the GPMC that can influence this?

  • Angelo,

    This is expected behavior. Typically an FPGA connected to the GPMC would be configured as non-cacheable on the ARM side.  From the sound of your transactions that appears to be the case here, i.e. your 32-bit read results in two 16-bit accesses.  If cache were enabled for that memory range you would see an entire cache line fetch instead.

    This ultimately comes down to the way the CPU behaves and to the memory architecture as a whole.  Specifically, when you issue a "load" instruction the CPU stalls until that data has actually landed in the CPU register.  When you consider that this data request passes through the ARM cache controllers, the ARM AXI controller, the L3 interconnect, and the GPMC, and then the data itself has to traverse back through that entire path, that ends up being a significant amount of time.  Only after the data has landed in the CPU register does the CPU stop stalling and move on to the next instruction.  This is why you see a big gap between each of these accesses.

    Here are my recommendations:

    • Create some "write-only registers" in your FPGA.
    • Instead of having a single register such as DATA you should have three registers: DATA, DATA_SET, DATA_CLEAR.
    • Currently, if you wanted to set bit 3 for example you would do something like DATA |= 0x0008;  this in turn produces a read-modify-write sequence of instructions, which as you have noticed is "expensive" over the GPMC.
    • Instead of that read-modify-write, you would simply do DATA_SET = 0x0008, which has the effect of setting that one bit (see the sketch after this list).
    • Similarly, you would currently clear a bit by doing something such as DATA &= ~0x0008.  In the new paradigm you would simply do DATA_CLEAR = 0x0008.
    • Finally, make sure that the mapping of this address range from the ARM side is non-cacheable, but BUFFERABLE.  This will be key in making the writes "fire and forget".  In other words, when you do an operation like DATA_CLEAR = 0x0008 it results in a single store that completes in effectively one cycle, and the CPU can continue executing while the data propagates out to the FPGA.  If you configure the region as strongly ordered you will stall waiting even for the write to complete, because strongly ordered memory guarantees that everything in the system occurs in very precise order.
    • For larger transfers, use DMA instead of the CPU to read/write the data from the FPGA.  The DMA would not stall like the CPU between reads.
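
    As a rough C sketch of the write-only register idea (the base address, register offsets, and names below are made up for illustration, not your actual FPGA map):

    ```c
    #include <stdint.h>

    /* Illustrative only: FPGA_CS3_BASE and the register offsets are placeholders. */
    #define FPGA_CS3_BASE   0x3C000000u
    #define FPGA_REG16(off) (*(volatile uint16_t *)(FPGA_CS3_BASE + (off)))

    #define DATA        FPGA_REG16(0x00)  /* read/write register                      */
    #define DATA_SET    FPGA_REG16(0x02)  /* write-only: writing 1s sets those bits   */
    #define DATA_CLEAR  FPGA_REG16(0x04)  /* write-only: writing 1s clears those bits */

    void set_and_clear_bit3(void)
    {
        /* Old style: read-modify-write, i.e. a slow GPMC read plus a write each time. */
        DATA |= 0x0008;                   /* set bit 3   */
        DATA &= (uint16_t)~0x0008;        /* clear bit 3 */

        /* New style: a single posted write, no GPMC read required. */
        DATA_SET   = 0x0008;              /* set bit 3   */
        DATA_CLEAR = 0x0008;              /* clear bit 3 */
    }
    ```

    With the bufferable mapping described above, each of the "new style" writes retires immediately from the CPU's point of view and drains out to the FPGA in the background.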

    Best regards,
    Brad

  • Brad,

    Thanks for the response.

    In my application I don't have too many read-modify-writes.

    I have a whole bunch of periodic reads, and my writes are important but not that time-critical.  In my example above you see reads from contiguous FPGA addresses.

    I did a couple of things. I changed my reads to 64-bit reads (128-bit reads were converted by the GPMC into two 64-bit reads of four 16-bit accesses each; I'm guessing this is due to the 64-bit L3 bus interface width), and this helped a little. If I issue 128-bit reads there is a gap of 150-160 ns between the two 64-bit transfers.
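
    (Roughly, the 64-bit reads look like the C sketch below; FPGA_BASE and the buffer handling are just placeholders for illustration.)

    ```c
    #include <stdint.h>

    /* Placeholder base address for our CS3 window. */
    #define FPGA_BASE 0x3C000000u

    /* Each 64-bit load (LDRD) is split by the GPMC into four back-to-back
     * 16-bit accesses, but the CPU still stalls between successive loads. */
    void read_fpga_block(uint64_t *dst, unsigned n)
    {
        volatile const uint64_t *src = (volatile const uint64_t *)FPGA_BASE;

        for (unsigned i = 0; i < n; i++)
            dst[i] = src[i];
    }
    ```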

    I do set up my FPGA space as strongly ordered. I also did a simple trick of setting up two chip selects for the FPGA: since writes followed by reads do not happen in my app on those heavily read registers, I put my read registers on a separate CS and optimized the GPMC timing for that CS, because my FPGA needs slower write timing. But the huge gap between the 64-bit transfers still bothers me.

    In my situation, do you think changing it to non-cacheable but bufferable will make a difference?

    Also will DMA help for reads?

    Let me know

    Thanks

    Angelo

  • Angelo Joseph said:
    I did a couple of things. I changed my reads to 64-bit reads (128-bit reads were converted by the GPMC into two 64-bit reads of four 16-bit accesses each; I'm guessing this is due to the 64-bit L3 bus interface width), and this helped a little. If I issue 128-bit reads there is a gap of 150-160 ns between the two 64-bit transfers.

    The larger data type reduces the total number of accesses.  As you have noticed, it still cannot prevent the large gap that you see between reads of non-cacheable data.  The only way to avoid that gap is to use DMA.

    Angelo Joseph said:
    In my situation, do you think changing it to non-cacheable but bufferable will make a difference?

    To be clear, marking the space as "bufferable" will vastly improve your write performance, but it will have no effect on read performance.  From the GPMC perspective you should see writes occurring back-to-back.  From the CPU perspective, the CPU will not stall while waiting for writes to complete.  You would also not need to use large data types to get the best performance, as contiguous writes would be merged in the buffer.
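
    If you happen to be running Linux, a minimal sketch of how that mapping might be requested is below (the physical base and window size are placeholders); on a bare-metal system the equivalent is marking that region non-cacheable but bufferable in your own MMU page tables.

    ```c
    #include <linux/io.h>
    #include <linux/errno.h>

    /* Placeholders: physical base of the CS3 window and the size we map. */
    #define FPGA_CS3_PHYS  0x18000000UL
    #define FPGA_WIN_SIZE  0x1000UL

    static void __iomem *fpga_base;

    static int fpga_map(void)
    {
        /* ioremap_wc() gives a non-cacheable but write-combining (bufferable)
         * mapping, so stores are posted and the CPU keeps executing while the
         * write drains out through the GPMC.  A strongly ordered mapping, by
         * contrast, stalls the CPU on every write.                            */
        fpga_base = ioremap_wc(FPGA_CS3_PHYS, FPGA_WIN_SIZE);
        if (!fpga_base)
            return -ENOMEM;

        return 0;
    }
    ```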

    Angelo Joseph said:
    Also will DMA help for reads?

    The only way to "eliminate the gap" is to use DMA.  Of course there will be overhead for setting up a DMA transfer, so if you're only reading one or two elements then it's probably not worthwhile.  If, however, you're reading a substantial block of data then I expect a night and day difference between non-cacheable CPU reads and a DMA transfer.  You'll want to copy a block of FPGA data to a cacheable location in your DDR memory, i.e. you would make a copy of your FPGA registers.  You'll need to watch out for cache coherence issues.  Specifically, after you've DMA'd a copy of your registers to DDR you will first need to invalidate (i.e. throw away) whatever is in the cache and THEN perform your access so that the CPU is getting "fresh" data from the DDR.
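
    A very rough sketch of that flow under Linux is below; fpga_start_sdma() and fpga_wait_sdma() are placeholders for however you drive the system DMA in your own driver, and the important part is the dma_map/dma_unmap pair, which performs the cache invalidation on the DDR buffer for you.

    ```c
    #include <linux/dma-mapping.h>
    #include <linux/errno.h>

    /* Placeholders for the SDMA programming in your own driver. */
    extern void fpga_start_sdma(dma_addr_t src_phys, dma_addr_t dst_phys, size_t len);
    extern void fpga_wait_sdma(void);

    /* Copy a block of FPGA registers into a cacheable DDR buffer with DMA,
     * then hand the CPU fresh data.                                        */
    int fpga_snapshot(struct device *dev, dma_addr_t fpga_phys, void *buf, size_t len)
    {
        dma_addr_t dst = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, dst))
            return -EIO;

        fpga_start_sdma(fpga_phys, dst, len);
        fpga_wait_sdma();

        /* dma_unmap_single() with DMA_FROM_DEVICE invalidates the cache lines
         * covering buf, so reads below come from DDR, not stale cache lines.  */
        dma_unmap_single(dev, dst, len, DMA_FROM_DEVICE);

        /* buf now holds a coherent snapshot of the FPGA registers. */
        return 0;
    }
    ```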