
AM3517: xloader/uboot performance; MPU Core, GPMC, and SDRC performance

Other Parts Discussed in Thread: AM3517

I am working to resolve xloader/uboot performance issues.  The hardware (Logic PD AM3517 SOM-M2) is detailed in the attached PDF.  With the GPMC and SDRC configurations as-is, I am observing that a GPMC NAND flash data read takes 200 ns and an SDRC DDR2 SDRAM data write takes 215 ns.  These can easily be tuned down to 50 ns and 35 ns, respectively, and are not the problem for me.  My problem, which I did not expect from a superscalar core with several independent memory interfaces, is that the GPMC data read and the SDRC data write seem to occur serially rather than in parallel.  See the attached PDF for details.
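
To make the experiment concrete for readers without the attachment, the loop under test is essentially of the following form.  This is a simplified sketch, not the actual code from the PDF; the GPMC NAND data register address is the one on my board, while the DDR2 destination address is only an illustrative placeholder.

/* Simplified sketch of the timing loop: half-word reads from the GPMC NAND
 * data register interleaved with half-word writes to DDR2 behind the SDRC.
 * The read and the write deliberately touch unrelated data, so there is no
 * register dependency between them. */
#include <stdint.h>

#define NAND_DATA_REG  ((volatile uint16_t *)0x6E000084u)  /* GPMC NAND data, CS0 */
#define DDR2_SCRATCH   ((volatile uint16_t *)0x80100000u)  /* placeholder DDR2 buffer */

void timing_experiment(int iterations)
{
    volatile uint16_t *dst = DDR2_SCRATCH;
    uint16_t pattern = 0xA5A5;

    for (int i = 0; i < iterations; i++) {
        uint16_t from_nand = *NAND_DATA_REG;  /* ldrh via the GPMC */
        *dst++ = pattern;                     /* strh via the SDRC, independent of the ldrh */
        (void)from_nand;                      /* volatile read is kept regardless */
    }
}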

Man Nguyen

Senior Software Engineer

Gambro UF Solutions

7601 Northland Drive

Suite 170

Brooklyn Park, MN 55428

  • Man,

    Here is what I understand as your key challenge: GPMC data reads and SDRC data writes seem to occur serially rather than in parallel.

    Possible options to enable parallel operation:

    1. Use DMA (best solution)

       - As noted in your document, this would be the ideal solution.
       - Using the DMA would allow memory to be accessed independently of the CPU.
       - Expected outcome: increased performance due to offloading the memory transfers from the core.

    2. Use multiple threads (not recommended)

       - Problem: there is risk in doing this, and it must be done in a way that memory management is maintained.
       - Expected outcome: increased processor utilization that may reduce overall system performance.

    Additional Reference Information:

    http://processors.wiki.ti.com/index.php/Memory_Vendor_Selection_Guide

    http://processors.wiki.ti.com/index.php/AM3517/05_Memory_Subsystem

    http://processors.wiki.ti.com/index.php/AM3517/05_GPMC_Subsystem

    http://processors.wiki.ti.com/index.php/Tips_for_configuring_OMAP35x,_AM35x,_and_AM-DM37x_GPMC_registers

    http://processors.wiki.ti.com/index.php/AM35x_Overview#Memory_Management_Units_.28MMU.29

    http://processors.wiki.ti.com/index.php/AM3517/05_SDRC_Subsystem

    http://processors.wiki.ti.com/index.php/AM35x-OMAP35x-PSP_03.00.01.06_Feature_Performance_Guide#Driver_DMA_usage

    http://en.wikipedia.org/wiki/Direct_memory_access

    Jason

  • Man,

    I don't think the AM3517 core can handle a read and a write in a single cycle. There may be some limitations from ARM that prevent you from doing this. You may need some way to start multiple reads and writes simultaneously, but I believe each instruction still needs to complete before the next one starts.

    What I drew up in my posting does not assume that the AM3517 can handle a read and a write in a single cycle.  The essence of what I posted and drew up is this: if you have the core alternating between a half-word load from NAND flash interfaced to the GPMC and a store to SDRAM interfaced to the SDRC, with the load and store intentionally set up in my performance analysis experiment to be decoupled in order to eliminate potential confusion, then what prevents the core from keeping both memory controllers continuously busy, since they are independent, dedicated controllers?  Instead, what appears to happen is:

    1.  The core issues the load to the GPMC.  It cannot issue the store to the SDRC, even though the store was intentionally designed in my experiment to be decoupled from the load (no register dependency); instead, it appears to spin until the load from the GPMC is complete.

    2.  After the load from the GPMC is complete, the core finally issues the store to the SDRC.  It then appears to spin until the store to the SDRC is complete before telling the GPMC to do the next load.  Again, no dependency exists to cause this situation.

    Given the built-in hardware parallelism (two independent memory controllers, the GPMC and the SDRC), what's described above is hard for me to believe.  Is it a consequence of a limitation in the core?  The core-to-L3 interconnect bridge?  The L3 interconnect to the GPMC and SDRC?

    Man

  • Man Nguyen said:

    1.  The core issues the load to the GPMC.  It cannot issue the store to the SDRC, even though the store was intentionally designed in my experiment to be decoupled from the load (no register dependency); instead, it appears to spin until the load from the GPMC is complete.

    2.  After the load from the GPMC is complete, the core finally issues the store to the SDRC.  It then appears to spin until the store to the SDRC is complete before telling the GPMC to do the next load.  Again, no dependency exists to cause this situation.

    Given the built-in hardware parallelism (two independent memory controllers, the GPMC and the SDRC), what's described above is hard for me to believe.  Is it a consequence of a limitation in the core?  The core-to-L3 interconnect bridge?  The L3 interconnect to the GPMC and SDRC?

    What you are seeing is what I would expect to see.  Just because the external memory interfaces are independent (and they are) does not mean that the ARM instructions inherently change operation to account for this; the ARM still needs to wait for accesses to complete for loads and stores, otherwise you would end up with timing problems.  For the ARM to be able to do what you suggest, it would have to be aware that the instructions around the loads and stores have no dependency on them, and my impression is that the ARM architecture is not advanced enough to take this into account at run time; it has to assume a read/write needs to complete before the next instruction is executed, so the pipeline stalls until that completion.

    The interconnects and memory interfaces are capable of servicing multiple simultaneous transfers, but you need a bus master that will issue the transfers such that they end up being simultaneous, which typically means using DMA.

  • I don't quite see the timing problems that you've mentioned based on the following visualization of events in the processor:

    1.  ldrh goes into one of two load/store pipelines; let's say it's LS pipe 0.  In one of the Execute stages, it hands off the destination core register info (r0, for example) to an L3 initiator agent with the GPMC as the target.  The target agent in the GPMC accepts within a few L3 clock cycles.  The core scoreboards r0 to prevent reads from r0 until the initiator agent receives the result from the GPMC.

    2.  Within the next core clock cycle or few, in the example posed, strh goes into the other load/store pipeline, LS pipe 1.  In one of the Execute stages, it hands off the data to be stored and the destination address to another L3 initiator agent with the SDRC as the target.  The target agent in the SDRC accepts within a few L3 clock cycles, such that from the core's perspective the strh instruction in LS pipe 1 has completed execution and LS pipe 1 becomes free -- because all the information needed has been cleanly transferred to the SDRC.

    What I described seems to be in line with the "fire and forget" philosophy mentioned in ARM technical documentation.  If accurate, then where's the timing problem?  Register scoreboarding prevents read-after-write hazards on the destination register of the ldrh instruction.

    Am I missing a nuance?

     

    What you describe does sound feasible if there is no dependency between the read and the write (i.e. what you are writing is something other than what you are reading), and if the Cortex-A8 can have multiple external memory fetches going on simultaneously, though I am not certain the Cortex-A8 architecture or the TI implementation will allow for this. The AXI2OCP (the interface between the ARM core and the rest of the device) does show support for multiple outstanding requests per the TRM, so my suspicion is that this would be due to some limitation within the ARM preventing two external memory fetches from going on simultaneously. You may want to ask ARM directly about this; I am afraid I am not that intimately familiar with the load and store pipelines used on the Cortex-A8.

    The performance analysis test described in my PDF was intentionally set up such that the load via the GPMC and the store via the SDRC were completely independent.  I did this because I did not want the discussion to be muddled by tangential discussion about register scoreboarding.  In addition, the diagram I drew up shows two Load/Store pipes.  This need not be the case.  Even with a single Load/Store pipe, given the delegation to initiator agents, target agents, and response agents, it should be feasible for the load via the GPMC and the store via the SDRC to be essentially churning away in parallel at steady state (i.e. after the first iteration), with only a couple of core latencies between them.

    It should be noted that even if we have 1) an ldrh from 0x6e000084 (GPMC NAND data, CS0) to r0 followed by 2) an strh from r0 to an SDRC memory location, so that there is a dependency between the ldrh and the strh, and 3) only one Load/Store pipe, the steady-state behavior should be: 1) the strh from r0 to the SDRC for iteration i gets fired off to the SDRC (thus un-scoreboarding r0), followed within a couple of core cycles (~3 ns) by 2) the ldrh from 0x6e000084 to r0 for iteration i+1.  The performance analysis test shown in the original example was deliberately done without the dependency to keep the discussion simple; the dependency only matters at the beginning of the sequence of iterations, which is irrelevant because we are only concerned with steady state.  There are 1024 iterations of ldrh-strh pairs in xloader/uboot for the NAND flash on Logic PD's AM3517 SOM-M2, so the initial lag contributes only a 0.1% deviation from the no-dependency case.  A simplified sketch of this dependent loop is below.
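
    In loop form, the dependent case just described is essentially the following.  This is a simplified sketch only; the NAND data register address is the real one from my setup, while the destination pointer is whatever DDR2 address the image is being copied to.

    /* Simplified sketch of the dependent case: each half-word read from the
     * GPMC NAND data register is immediately stored to DDR2 through the SDRC,
     * so the strh of iteration i depends on the ldrh of iteration i. */
    #include <stdint.h>

    #define NAND_DATA_REG  ((volatile uint16_t *)0x6E000084u)  /* GPMC NAND data, CS0 */

    void copy_page(volatile uint16_t *dst)  /* dst points into DDR2 behind the SDRC */
    {
        for (int i = 0; i < 1024; i++) {    /* 1024 ldrh-strh pairs per NAND page */
            *dst++ = *NAND_DATA_REG;        /* ldrh from the GPMC, strh to the SDRC */
        }
    }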

  • Man,

    In the case of a load miss, the Cortex A8 does a "replay" of the instruction until eventually the data has arrived.  In other words, PC is no longer advancing.

    This is discussed in bits and pieces in the Cortex-A8 TRM (DDI 0344).  For example, in the "About instruction timing" section 16.1, the following is stated:

    "In addition to the time taken for the scheduling and issuing of
    instructions, there are other sources of latencies that effect the time of a
    program sequence. The two most common examples are a branch
    mispredict and a memory system stall such as a data cache miss of a load
    instruction."

    Further in section 16.3 "Dual instruction Issue restrictions" ARM states the following:

    "There is only one LS pipeline.
    Only one LS instruction can be
    issued per cycle. It can be in
    pipeline 0 or pipeline 1"

    (Side note, LS=Load/Store above.)

    So putting those two things together: you can only issue a single load or store in a given cycle, and for loads in particular, an L1 data miss will cause a stall/replay to occur.

    As far as the L3 interconnect, its benefit is to allow multiple L3 initiators to transfer data to/from multiple targets simultaneously.  In this case you have only a single initiator (the AXI controller of the Cortex-A8 subsystem), so you are not seeing the full benefit of the architecture.  However, if you were using multiple masters, e.g. if the DMA were reading from the GPMC, then you would greatly improve your performance.  For the case of accessing non-cacheable memory (as is usually the case for GPMC devices) you will see a tremendous performance improvement using DMA for the reasons discussed above.  If you use the CPU to do multiple reads in a row from non-cacheable memory, you will see huge gaps between reads while the CPU is stalled waiting for the data to return.  The DMA, on the other hand, will do the entire transfer in one big chunk.
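
    For reference, a CPU-programmed SDMA transfer looks roughly like the sketch below.  It is only a sketch: the register names come from the SDMA chapter of the AM35x TRM, but the module base address, per-channel offsets, and every bit-field encoding are placeholders/assumptions that must be checked against the TRM before use.

    /* Rough sketch of a software-triggered SDMA transfer: constant source
     * address (GPMC NAND data register), post-incremented destination in DDR2.
     * All numeric offsets and bit positions are assumptions - verify against
     * the DMA4_* register descriptions in the TRM. */
    #include <stdint.h>

    #define SDMA_BASE        0x48056000u                        /* assumed module base  */
    #define CH               0                                  /* logical channel used */
    #define CH_REG(off)      (*(volatile uint32_t *)(SDMA_BASE + (off) + 0x60u * CH))

    #define DMA4_CCR         CH_REG(0x80u)   /* channel control                    */
    #define DMA4_CSR         CH_REG(0x8Cu)   /* channel status                     */
    #define DMA4_CSDP        CH_REG(0x90u)   /* element size / burst / packing     */
    #define DMA4_CEN         CH_REG(0x94u)   /* elements per frame                 */
    #define DMA4_CFN         CH_REG(0x98u)   /* number of frames                   */
    #define DMA4_CSSA        CH_REG(0x9Cu)   /* source start address               */
    #define DMA4_CDSA        CH_REG(0xA0u)   /* destination start address          */

    /* Bit-field encodings: assumptions, to be taken from the TRM. */
    #define CSDP_DATA_TYPE_S16   (1u << 0)   /* 16-bit elements                    */
    #define CCR_SRC_CONSTANT     (0u << 12)  /* source address does not increment  */
    #define CCR_DST_POST_INC     (1u << 14)  /* destination post-increments        */
    #define CCR_ENABLE           (1u << 7)   /* start the channel                  */
    #define CSR_BLOCK_DONE       (1u << 5)   /* end-of-block status                */

    void sdma_copy_page(uint32_t dst_ddr2)
    {
        DMA4_CSDP = CSDP_DATA_TYPE_S16;
        DMA4_CEN  = 1024;                    /* 1024 half-words = one NAND page      */
        DMA4_CFN  = 1;                       /* a single frame                       */
        DMA4_CSSA = 0x6E000084u;             /* GPMC NAND data register (constant)   */
        DMA4_CDSA = dst_ddr2;                /* destination in DDR2                  */
        DMA4_CCR  = CCR_SRC_CONSTANT | CCR_DST_POST_INC | CCR_ENABLE;

        while ((DMA4_CSR & CSR_BLOCK_DONE) == 0)
            ;                                /* poll for completion                  */
    }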

    I hope this helps clarify what you are seeing and the intent of the architecture.

    Best regards,
    Brad

     

  • 8446.cortex_a8_instruction_decode_pipeline_understanding_2011_08_24.pdf

    I apologize that the drawings from the post that started this thread suggested a scenario with ldrh and strh being issued in parallel.  That was not my intent.  My background with ARM architectures is about two generations out of date, with ARMv4 (StrongARM) and ARMv5 (XScale) from more than a decade ago, so I had to update my understanding by reading David Williamson's paper "ARM Cortex A8: A High Performance Processor for Low Power Applications".  The attached PDF outlines my understanding of Section 3.4.  If my understanding is reasonably accurate, then a load data miss incurs a 9-core-cycle penalty which, with the Logic PD SOM-M2 core at ~600 MHz, amounts to 15 ns of latency between when the load from the GPMC completes and the data arrives in the core, and when the replayed ldrh catches the response.  By the time the load completes, the following strh, even if dependent on the result of the ldrh in the worst case, will be right behind the ldrh in the replay portion of the pending-replay queue.  With no outstanding transactions to block it, and with the AXI bridge capable of handling multiple outstanding requests, the strh will be fired off, forgotten, and retired -- in line with the "fire and forget" philosophy often mentioned in the ARM architecture documentation.  Carried over repetitions beyond the first iteration, the net effect will be "nearly parallel" load and store pairs separated by about ~15 ns.

    Man

  • Man,

    If all of this discussion is centered around xloader/uboot then I assume you are currently operating with the MMU disabled.  In that case all data accesses are treated as "strongly ordered" (noncacheable, nonbufferable, serialized).

    Man Nguyen said:
    If my understanding is reasonably accurate, then a load data miss incurs a 9-core-cycle penalty which, with the Logic PD SOM-M2 core at ~600 MHz, amounts to 15 ns of latency between when the load from the GPMC completes and the data arrives in the core, and when the replayed ldrh catches the response.

    You are missing a fundamental part of this whole discussion: specifically, that reads from "somewhere in the system" (in this case the GPMC) will incur LARGE overhead -- hundreds of cycles.  In my last post I highlighted "memory system stall" in red.  The numbers you are quoting are idealized/impossible numbers for the case where we could somehow return data to the AXI in a single cycle.  That's definitely not the case.  The AXI has to issue a command into the L3 interconnect, which then forwards that command to the GPMC, which then does the read, which then sends the data back to the L3 interconnect, which sends the data back to the AXI.  These cycles are all in addition to the cycles you quote, and the CPU will "replay" the instruction while it's twiddling its thumbs waiting for the data.
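
    To put rough numbers on that using the figures from your very first post (back-of-the-envelope only):

        ~200 ns GPMC data read  x 0.6 cycles/ns  =  ~120 core cycles of replay per ldrh
        ~215 ns SDRC data write x 0.6 cycles/ns  =  ~130 core cycles per strh

    versus the single-digit cycle counts in the Cortex-A8 instruction timing tables.  That is the "memory system stall" dominating everything else.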

    Brad

  • Brad,

    Yes, all of this discussion is centered around xloader/uboot.  The information that you provided about all data accesses being treated as "strongly ordered" (noncacheable, nonbufferable, serialized) was exactly what I needed to fill in gaps in my understanding.  With all the performance analysis experiments I have done so far, I came to strongly suspect that to be the case, and I have been digging through the AM35x TRM and the ARM Cortex-A8 specification to find a conclusive statement of it - with no firm conclusion until you said so.

    See the following PDF for results of experiments that I've done to get a sense of the AXI/OCP/L3 overhead as it pertains to communication between the MPU SS and the GPMC, and between the MPU SS and the SDRC.

    4073.SKMBT_28311082509200.pdf

    I should clarify that the "15 ns latency" is the "jitter" between when a load data access between the MPU SS and the GPMC completes its round trip back to the MPU SS, and when the load instruction goes from being replayed to E3 and catches the retrieved data.  It was never my expectation that this be the transaction time.  The element that was very unexpected for me was the large overhead -- hundreds of cycles, as you mentioned -- in the AXI/OCP/L3 communication, which I also roughly discovered in my experiments documented in the above PDF, after nulling out the cost of bus transactions specific to the GPMC and SDRC.

    Bottom line is that we're on the same page, and I am thankful for your help.

    One more related question:

    1.  Why does it appear that the core does not "fire and forget" the store data transactions (see the measurements for writing GPMC ECC Control and writing EMIF4_PERF_CNT_SEL in the above PDF)?  That is, it appears to spin and wait for the previously issued transaction to fully complete before issuing the next in the pipeline.  Is it because the store buffers along the communication path are zero-deep, causing a "store buffer full" stall and causing the core to replay the strh (sending it and subsequent instructions back into the replay portion of the replay-pending queue)?

    Man

  • Man Nguyen said:
    Why does it appear that the core does not "fire and forget" the store data transactions (see the measurements for writing GPMC ECC Control and writing EMIF4_PERF_CNT_SEL in the above PDF)?

    Strongly ordered is the most stringent (aka SLOW) of the various memory types.  It waits for the access to complete before executing the next instruction.

    DMA will avoid all these stalls and allow you to massively improve your access speed (assuming that's your goal).

    Yes, it's my goal to copy the object code image from NAND flash to SDRAM as fast as possible.

    The starting-point observation with uboot was a copy of 4 megabytes in 5.2 seconds.  With my own implementation, without making any changes to the operating environment established by uboot, I was able to copy 4 megabytes in 0.9 seconds.  My goal is 4 megabytes in 0.2 seconds.
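
    As a rough consistency check (assuming the read and write remain strictly serialized at the 200 ns / 215 ns access times from my first post):

        4 MB / 2 bytes per half-word      =  ~2.1 million ldrh-strh pairs
        2.1 million x (200 ns + 215 ns)   =  ~0.87 seconds

    which is in the same ballpark as the 0.9 seconds I am currently getting, so the per-access serialization really does appear to dominate.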

    Of the ~150 ns to ~200 ns AXI/OCP/L3 overhead I've observed, depending on which L3 module is the target (GPMC, or SDRC/SMS, or SDRC/EMIF), what contributes the most delay?  I'm thinking the L3, because of all the loading from the modules hanging off it and the overhead from arbitration.  If the L3 is the limiting factor, how does DMA overcome that compared to the MPU SS/core?  I'm assuming here that I could bring the MMU into the picture and configure the memory types to let the MPU SS/core issue the load from the GPMC and the store to the SDRC virtually in parallel (fire and forget, so the subsequent load can go after a couple of core cycles), separated by just a few core cycles (on the order of ~1.7 ns).  If there is a limitation inherent in the AXI/OCP/L3 pathway between the MPU SS/core and the GPMC and between the MPU SS/core and the SDRC that I can't overcome, then that would be enough for me to justify DMA.

    Man

    7711.visualization_of_dma_2011_07_28.pdf

    By the way, the above PDF shows that I did consider an implementation using DMA back in late July.  Given what I know now about the communication overhead with the AXI/OCP/L3, I have this question:

    1.  If the L3 is the most significant factor in the ~150 ns and ~200 ns MPU-to-GPMC and MPU-to-SDRC communication, AND if the same limitation also exists with DMA, then what packetization does the DMA do to squash the overhead?

     

  • I don't know the exact timing breakdown...  However, in your current setup the AXI is only issuing a single read/write request to the L3 interconnect. The DMA will be able to submit much larger accesses (I forget the max size, maybe 64 bytes?) that will amortize the overhead over multiple accesses.  I recommend NOT enabling the MMU as having cache enabled to the GPMC space sounds like a recipe for disaster.  I think DMA will be much cleaner/easier and even faster.
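
    Very rough illustration of the amortization, if the per-request interconnect overhead stays in the ~200 ns range you measured:

        CPU, one 2-byte access per request:   ~200 ns / 2 bytes   =  ~100 ns of overhead per byte
        DMA, one 64-byte burst per request:   ~200 ns / 64 bytes  =    ~3 ns of overhead per byte

    And that is before counting the fact that the CPU is free to do other work while the DMA runs.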

  • Brad,

    Funny that you replied to my post while I was editing it with a diagram drawn up in late July and with an additional question about how the packetization (64 bytes?) works between the DMA and GPMC and between the DMA and SDRC.  Can you confirm that the packetization (and size of packetization) works with the diagram in the PDF?

    By all means, I want to avoid enabling the L1 data cache and the L2 unified cache, and if enabling the MMU means also enabling the data cache, then I would prefer to stay away from such an implementation.

    (I am a bit of a concrete constructionist; I want to avoid introducing anything new as part of a solution unless I fully understand why it's absolutely needed.)

    Man

     

  • Found the following in TRM:

    4466.Aug 25, 2011 12_20_31 PM.pdf

    I think that answers my question.

     

  • Man Nguyen said:
    Can you confirm that the packetization (and size of packetization) works with the diagram in the PDF?

    I just double-checked the TRM.  I should be using the term "burst" rather than "packet" to be consistent with the TRM.  For example, see section 7.4.5 "Burst Transactions" in the AM35xx TRM.  I confirm that 64 bytes is the maximum burst size.

    Man Nguyen said:

    (I am a bit of a concrete constructionist; I want to avoid introducing anything new as part of a solution unless I fully understand why it's absolutely needed.)

    In my opinion it will be easier to configure the DMA than the MMU/cache.  There are multiple levels of tables/descriptors for the MMU.  Yuck...  Also, the biggest burst you can get from the Cortex-A8 is 8 bytes, because the AXI will break down a cache line into multiple 8-byte transfers.  The SDMA, on the other hand, can perform up to 64-byte bursts.
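
    Just to put the 8-byte vs. 64-byte difference in perspective for your 4 MB image (rough numbers):

        4 MB in 8-byte bursts  (best case through the Cortex-A8/AXI):  ~524,000 requests
        4 MB in 64-byte bursts (SDMA):                                  ~65,500 requests

    Eight times fewer trips across the interconnect, with each trip's overhead amortized over eight times more data.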

  • Please verify the answer(s) so we can close this thread.

    If you have trouble implementing the DMA configuration please start a new thread so that the forum stays organized.