I am seeing something in the DM3730 GPMC access timing that does not make sense to us.
We have set up CS3 of the OMAP to access our FPGA, and the timing has been configured so that CS is low for roughly 100ns for a single 16-bit read access from the FPGA. CS3 is set up as a multiplexed 16-bit NOR flash interface in our design.
When the processor issues a 32-bit read from CS3, we see two back-to-back transactions with a 12ns delay between them, which is correct since that is the delay we configured in our GPMC timing. But when we issue another 32-bit read from CS3, we see a delay of around 150-160ns before the two back-to-back reads occur as expected. We are trying to determine why we see the 150-160ns delay.
Following is the simple sequence of instructions that our code executes:
1 LDR r_n, [FPGA]
2 STR r_n, [r_s]
3 LDR r_n, [FPGA + 4]
4 STR r_n, [r_s + 4]
5 LDR r_n, [FPGA + 8]
6 STR r_n, [r_s + 8]
In the above sequence, Instruction 1 issues two back-to-back 16-bit reads as expected.
But between Instruction 1 and Instruction 3 we see around 150-160ns of delay. Obviously, as one can see above, there is only one store (STR) instruction between the two reads, and it should not be taking 150-160ns. My guess is that this store goes to the internal data cache, which should be very fast.
Is there some other setting outside the GPMC that can influence this?
This is expected behavior. Typically an FPGA connected to the GPMC would be configured as non-cacheable on the ARM side. From the sound of your transactions that appears to be the case here, i.e. your 32-bit read results in two 16-bit accesses. If caching were enabled for that memory range, you would instead see an entire cache line fetch.

This ultimately comes down to the way a CPU behaves and to the memory architecture as a whole. Specifically, when you issue a "load" instruction the CPU will stall until that data has actually landed in the CPU register. When you consider that this data request passes through the ARM cache controllers, the ARM AXI controller, the L3 interconnect, and the GPMC, and that the data itself then has to traverse back through that entire path, it ends up being a significant amount of time. Only after the data has landed in the CPU register does the CPU stop stalling and move on to the next instruction. This is why you see a big gap between each of these accesses.
Here are my recommendations:
a. Instead of having a single register such as DATA, you should have three registers: DATA, DATA_SET, and DATA_CLEAR.
b. Currently, if you wanted to set bit 3 for example, you would do something like DATA |= 0x0008. This in turn produces a read-modify-write set of instructions, which as you have noticed is "expensive" over the GPMC.
c. Instead of doing the operation in step b, you would simply do DATA_SET = 0x0008, which would have the effect of setting that one bit.
d. Similarly, you would currently clear a bit by doing something such as DATA &= ~0x0008. In the new paradigm you would simply do DATA_CLEAR = 0x0008.
e. Finally, make sure that the mapping of this address range from the ARM side is non-cacheable but BUFFERABLE. This will be key in terms of making the writes "fire and forget". In other words, when you do an operation like DATA_CLEAR = 8, it results in a single-cycle instruction and the CPU can continue executing while the data propagates out to the FPGA. If you configure the region as strongly ordered, you will stall waiting even for the write to complete, which is to make sure that everything in the system occurs in very precise order.
---------------------------------------------------------------------
Thanks for the response.
In my application I don't have too many read-modify-writes.
I have a whole bunch of periodic reads and my writes are important but not that time critical. In my example above you see reads from contiguous FPGA addresses.
I did do a couple of things. I changed my reads to 64-bit reads (128-bit reads were converted by the GPMC into two 64-bit reads of four 16-bit accesses each; I'm guessing this is due to the 64-bit L3 bus interface width), and this helped a little bit. If I issue 128-bit reads, there is a gap of 150-160ns between the two 64-bit transfers.
I do set up my FPGA space as strongly ordered. I did use a simple trick of setting up two chip selects on the FPGA: since writes followed by reads do not happen in my app on those heavily read registers, I put my read registers on a separate CS and optimized the GPMC timing for that CS, since my FPGA needs a slower write. But the huge gap between the 64-bit transfers still bothers me.
In my situation, do you think changing it to non-cacheable but bufferable will make a difference?
Also will DMA help for reads?
Let me know
Angelo Joseph wrote: I did do a couple of things. I changed my reads to 64-bit reads (128-bit reads were converted by the GPMC into two 64-bit reads of four 16-bit accesses each; I'm guessing this is due to the 64-bit L3 bus interface width), and this helped a little bit. If I issue 128-bit reads, there is a gap of 150-160ns between the two 64-bit transfers.
The larger data type will reduce the total number of accesses, but as you have noticed it still cannot prevent the large gap that you see between reads of non-cacheable data. The only way to avoid such a gap is to use DMA.
Angelo Joseph wrote: In my situation, do you think changing it to non-cacheable but bufferable will make a difference?
To be clear, marking the space as "bufferable" will vastly improve your write performance, but it will have no effect on read performance. From the GPMC perspective you should see writes occurring back-to-back. From a CPU perspective you will not stall the CPU while waiting for writes to complete. You would also not need to use large data types to get the best performance, as contiguous writes would be merged in the buffer.
Angelo Joseph wrote: Also, will DMA help for reads?
The only way to "eliminate the gap" is to use DMA. Of course there will be overhead for setting up a DMA transfer, so if you're only reading one or two elements then it's probably not worthwhile. If, however, you're reading a substantial block of data then I expect a night and day difference between non-cacheable CPU reads and a DMA transfer. You'll want to copy a block of FPGA data to a cacheable location in your DDR memory, i.e. you would make a copy of your FPGA registers. You'll need to watch out for cache coherence issues. Specifically, after you've DMA'd a copy of your registers to DDR you will first need to invalidate (i.e. throw away) whatever is in the cache and THEN perform your access so that the CPU is getting "fresh" data from the DDR.