Real-world DDR performance

Other Parts Discussed in Thread: AM3356, AM3359, OMAP3503

Hi,

We've been getting poorer than expected DDR memory performance on three ARM platforms, and were wondering if someone could help explain the results we're seeing, and/or possibly share some of their own experience.

To be clear, we're talking about uncached DDR reads and writes on OMAP3 and AM335x, performed with regular ARM instructions, not DMA.

Overall performance, including instruction rates and cached memory bandwidth, is satisfactory and in line with expectations.

Just for background, we run Windows Embedded Compact 7.0 on all our platforms, and have been using the 'simple' benchmark application developed by BSquare and included in a number of OMAP3 BSPs provided by TI and, later, Adeneo.

We run the 'simple' application unchanged on the three platforms. It performs a number of tests, some of which involve allocating a 4KB page of uncached memory, and then using a series of ARM load/store-single or load/store-multiple instructions to access the memory.
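The read-multiple test, for instance, boils down to something like the sketch below (simplified from memory, not the actual 'simple' source; the real code unrolls a long run of ldm/stm instructions inside the timed loop):

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of the uncached read test: 'buf' points at the 4KB page that was
     * mapped non-cacheable, 'bytes' must be a multiple of 16. */
    static void uncached_ldm_test(uint32_t *buf, size_t bytes, unsigned iterations)
    {
        while (iterations--) {
            uint32_t *p = buf;
            uint32_t *end = buf + bytes / sizeof(uint32_t);
            while (p < end) {
                /* one LDM reads four words (16 bytes) */
                asm volatile("ldmia %0!, {r4, r5, r6, r8}"
                             : "+r"(p)
                             :
                             : "r4", "r5", "r6", "r8", "memory");
            }
        }
    }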

Results:

Test 1: uncached single read (ldr)      Test 2: uncached single write (str)
Test 3: uncached multiple read (ldm)    Test 4: uncached multiple write (stm)

Platform  CPU                      Core clock  DDR type              Theoretical  Test 1   Test 2   Test 3   Test 4
1         AM3356 (custom board)    800 MHz     400 MHz 16-bit DDR3   1600 MB/s    20 MB/s  25 MB/s  40 MB/s  51 MB/s
2         AM3359 (TI Starter Kit)  720 MHz     333 MHz 16-bit DDR3   1333 MB/s    18 MB/s  23 MB/s  37 MB/s  47 MB/s
3         OMAP3503 (custom board)  600 MHz     166 MHz 32-bit LPDDR  1333 MB/s    20 MB/s  28 MB/s  39 MB/s  56 MB/s
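
(For reference, the theoretical figures are simply data rate times bus width; for platform 1 that is 400 MHz × 2 transfers per clock × 2 bytes = 1600 MB/s.)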


Suffice it to say that these numbers are well below what we expected: for instance, the 40 MB/s read rate (Test 3) on platform 1 is only 2.5% of the theoretical 1600 MB/s.

While we understand that the full 1600 MB/s is not realistic, we were hoping to achieve something more like 50-70% of it.
Or are our expectations simply wrong?

The DDR timings are calculated based on TI's spreadsheet and the datasheet for the memory devices. I have attached the spreadsheet for platform 1, in case someone is able to spot something obvious. The memory device used on platform 1 is MT41K256M16HA-125IT.

platform1-timing.xls

The platforms are generally stable with no signs of memory integrity issues. DDR3 leveling has been performed where applicable.

  • Hi Carsten,

    Windows Embedded is not supported by TI, and this forum supports Linux only. I can give you the Linux benchmarks for comparison: http://processors.wiki.ti.com/index.php/Processor_SDK_Linux_Kernel_Performance_Guide

  • Hi Biser,

    Thanks, we actually knew about that wiki page, but were unable to find any information about DDR performance under Linux (lots of data on storage, network, graphics etc. though).

    I should also mention that we are aware that TI no longer supports Windows Embedded, but felt that this was a general enough topic that it could go in the Sitara processors forum.

    We are mainly interested in knowing what sort of DDR performance to expect on AM335x (and OMAP3), and what might cause the unexpectedly low rates we are seeing.

    Best regards,
    Carsten
  • The LMBench metrics table provides this information. Different benchmarks are explained here: http://www.bitmover.com/lmbench/man_lmbench.html

  • Thanks again Biser,

    Correct me if I'm wrong, but I believe LMBench only measures cached performance. We ran LMBench on the Starter Kit, and got results that were consistent with cached performance under Windows Embedded.

    What we're asking about is uncached performance.

    Best regards,
    Carsten
  • Those are the only benchmarks currently available. I will ask the factory team if they have more information.
  • Carsten,

    This question comes up very regularly across all platforms (ARMs, DSPs, etc.).  So frequently in fact that I wrote a wiki page to try and explain it clearly:

    http://processors.wiki.ti.com/index.php/Common_Issue_Resulting_in_Slow_External_Memory_Performance

    As a very high level data point, without the cache enabled I typically see performance deltas on the order of 100x.  So I'm not at all surprised by the performance you're seeing.  It's inherent in non-cached accesses.  On a related note, if you configure the memory as "device" (i.e. buffered but non-cached) you can substantially increase your write performance since that will allow the ARM to make use of the write buffer, including write-merging, etc.

    Brad

  • Brad,

    Thanks, appreciate it.

    We had already gone through that particular wiki page as well, but it seems more focused on ensuring that the cache is enabled, and less on measuring actual, uncached performance.

    If I understand you correctly, DDR bandwidth of 40-50 MB/s is not surprising on the platforms we use.

    The reason we are so interested in uncached rather than cached memory bandwidth is because of the requirements of our application on the AM335x, which are:
    * Receive at least 10 (ideally, 20) MBytes/s on one LAN interface (that's DMA from EMAC to DDR3)
    * Process data on the CPU (read data back in, none of which are in cache)
    * Send all the data out on the other LAN interface (so invalidate cache + DMA from DDR3 to EMAC)

    With these amounts of data we quickly hit the limits of the cache, and we are already having difficulty meeting the 10 MB/s requirement. Looking around for bottlenecks, we came across the seemingly poor DDR performance, and benchmarking two other platforms showed similar results.

    Going back to the wiki page, I'm not sure CPU pipelining would be an issue. The test code uses a series of 100 ldm/stm instructions and then repeats in a loop; it's difficult to imagine this would affect the pipeline much.

    Cached performance, as mentioned previously, is completely satisfactory, so we have little doubt that the MMU has been set up correctly.

    We will try to benchmark using DMA though, as the wiki page suggests. Maybe we will see different and more accurate results then.

    Thanks again,
    Carsten
  • Brad,

    Just to clarify, when you mention performance deltas of 100x, does that apply to running /everything/ cached vs. everything uncached?

    In our case, the software (the benchmark application) itself runs from cached memory, but accesses an area mapped as uncached.

    Running the benchmark application itself from uncached memory obviously would incur a much larger penalty as each instruction would have to be fetched from main memory, but that is not our scenario.

    Best regards,
    Carsten
  • DO NOT use uncached memory for DMA! The access speed of the CPU into this type of memory is slow.

    Instead, use normal cached memory:

    memptr = kmalloc(SIZE_XXX, GFP_DMA);

    /* map the buffer for streaming DMA; direction is DMA_FROM_DEVICE or
     * DMA_TO_DEVICE depending on the transfer */
    phys_ptr = dma_map_single(NULL, memptr, SIZE_XXX, DMA_xxx_DEVICE);

    and then, inside your CPU function:

    /* hand the buffer to the CPU (performs the cache maintenance required
     * for the chosen direction) */
    dma_sync_single_for_cpu(NULL, phys_ptr, SIZE_XXX, DMA_xxx_DEVICE);

    do_something(memptr);

    /* hand the buffer back to the device before the next DMA transfer */
    dma_sync_single_for_device(NULL, phys_ptr, SIZE_XXX, DMA_xxx_DEVICE);

  • Carsten Hansen said:
    If I understand you correctly, DDR bandwidth of 40-50 MB/s is not surprising on the platforms we use.

    You have misinterpreted me here...  The AM335x DDR bandwidth is excellent.  There is no issue there.  I am not surprised with your measurement due to the fact that you're using strongly ordered accesses.

    Carsten Hansen said:
    The reason we are so interested in uncached rather than cached memory bandwidth is because of the requirements of our application on the AM335x, which are:
    * Receive at least 10 (ideally, 20) MBytes/s on one LAN interface (that's DMA from EMAC to DDR3)
    * Process data on the CPU (read data back in, none of which are in cache)
    * Send all the data out on the other LAN interface (so invalidate cache + DMA from DDR3 to EMAC)

    I agree with where you started, i.e. that when EMAC puts data into DDR3 that it will not be in the cache (yet).  However, the main point I'm trying to make is that when you go to read that data, it will be roughly 100x faster if the area is marked as cacheable.

    Bottom line, I see no reason in the example above for you to be configuring the memory to be non-cacheable (strongly ordered).  This is decreasing your throughput substantially.  Before your second step (reading the data), you should invalidate that memory region so that you're sure you are getting "fresh" data.  And once you're done making any modifications you should perform a writeback of the data to make sure it's sitting in the DDR3 ready for the EMAC.

    Carsten Hansen said:
    Going back to the wiki page, I'm not sure CPU pipelining would be an issue. The test code uses a series of 100 ldm/stm instructions, then repeats in a loop, it's difficult to imagine this would affect the pipeline much.

    Please mark the memory as cacheable and try it again.

    Carsten Hansen said:
    Cached performance, as mentioned previously, is completely satisfactory, so we have little doubt that the MMU has been set up correctly.

    I think we are having a disconnect in terminology.  Let's please try to use the following terms to discuss:

    • Cacheable: MMU page marked so as to allow its contents to be allocated into the cache
    • Non-cacheable (or "strongly ordered" using ARM terminology): MMU page marked so as to NOT allow its contents to be allocated into the cache
    • Cache hit:  Data already in the cache (extremely fast access)
    • Cache miss: Data not in the cache; its corresponding cache line is fetched from external memory and allocated into the cache

    What I suspect might be the case, is that you're trying to quantify the penalty of a cache miss by doing a non-cacheable memory test.  If that's the case, I don't recommend that...  You would be better off invalidating a big chunk of memory and then reading it (while still marked cacheable).
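
    If it helps, the maintenance operations I'm describing amount to something like the sketch below (untested, privileged code; CACHE_LINE assumes the 64-byte lines of the Cortex-A8, and the buffer should be cache-line aligned so the invalidate can't throw away a neighbour's dirty data):

    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 64u   /* Cortex-A8 L1/L2 line size */

    /* Invalidate before the CPU reads data the EMAC has just DMA'd into DDR3 */
    static void dcache_invalidate_range(uintptr_t addr, size_t size)
    {
        uintptr_t end = addr + size;
        for (addr &= ~(uintptr_t)(CACHE_LINE - 1); addr < end; addr += CACHE_LINE)
            asm volatile("mcr p15, 0, %0, c7, c6, 1" :: "r"(addr) : "memory");  /* DCIMVAC */
        asm volatile("dsb" ::: "memory");
    }

    /* Clean (write back) after the CPU has modified the data, before the TX DMA */
    static void dcache_clean_range(uintptr_t addr, size_t size)
    {
        uintptr_t end = addr + size;
        for (addr &= ~(uintptr_t)(CACHE_LINE - 1); addr < end; addr += CACHE_LINE)
            asm volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(addr) : "memory");  /* DCCMVAC */
        asm volatile("dsb" ::: "memory");
    }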

  • Wolfgang and Brad,

    Thanks very much for taking the time to explain this, it's all starting to make sense now.

    We had no idea that invalidating the cache and then reading the data back in would be so much cheaper than uncached accesses, but we will now modify our application code and do some more testing of real-world scenarios, forgetting the 'simple' benchmark results.

    I'll report back as soon as we have some results to share.

    Have a great weekend,
    Carsten
    The core issue is that the Cortex-A8 L1 memory subsystem is blocking: every ldr instruction stalls until the data is available. As a result, reading from non-cacheable memory is dominated by the "ping time" to RAM rather than by throughput. Using larger reads should help mitigate this somewhat. Stores are buffered and hence not affected by this issue.

    Note that unlike the integer core, the Neon subsystem is designed to accept streaming data from L2 and external memory, hence if you use Neon loads you should see a big speedup. (Do make sure the memory is normal uncacheable, not device or strongly-ordered.) The next fundamental limit you then run into is that the Cortex-A8 can still only have a single non-cacheable load outstanding on the bus interface, but at least then each load can be a big burst of data and the next one will already be queued. Unfortunately, unless you can process the data entirely in Neon, you will pay a big penalty when moving the data from Neon to the integer core.
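
    To give an idea of what a Neon load loop looks like, here is a trivial copy sketch (not the routine I benchmarked; build with -mfpu=neon), moving 32 bytes per load/store pair:

    #include <stddef.h>
    #include <stdint.h>

    /* Sketch of a Neon copy: 'src' should be normal non-cacheable memory to get
     * the streaming behaviour described above; 'bytes' a multiple of 32. */
    static void neon_copy(uint8_t *dst, const uint8_t *src, size_t bytes)
    {
        for (size_t i = 0; i < bytes; i += 32) {
            asm volatile("vld1.8 {d0-d3}, [%0]!\n\t"
                         "vst1.8 {d0-d3}, [%1]!"
                         : "+r"(src), "+r"(dst)
                         :
                         : "d0", "d1", "d2", "d3", "memory");
        }
    }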

    In most cases, I agree using cacheable memory and explicit cache maintenance should be preferred.

    Brad Griffis said:
    • Non-cacheable (or "strongly ordered" using ARM terminology): MMU page marked so as to NOT allow its contents to be allocated into the cache

    That statement is however wrong. Strongly-ordered memory is far more constrained (and much slower) than normal non-cacheable memory. To illustrate, some timings I once did of a dumb bytewise copy routine (cycles/byte @ 800 MHz on a DM814x):

    from:                 strongly-ordered   device   non-cacheable   L1 cacheable
    to strongly-ordered          246.8        240.7       240.7          101.2
    to device                    178.0        178.0       125.8           10.8
    to non-cacheable             178.0        108.8       108.8            5.9

    And for a simple Neon-based copy:

    from:                 strongly-ordered   device   non-cacheable
    to strongly-ordered          17.46        17.34       15.04
    to device                    12.83        12.77        1.10
    to non-cacheable             12.89         9.02        1.31
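
    For reference, with the short-descriptor translation table format and TEX remap disabled, these memory types are selected by the TEX/C/B bits of the descriptor. A summary for section (1 MB) descriptors, taken from the ARMv7-A manual, so double-check against your own MMU setup code:

    /* ARMv7-A short-descriptor *section* attributes, TEX remap disabled.
     * Bit positions in a section descriptor: B = bit 2, C = bit 3, TEX = bits [14:12]. */
    #define SEC_B                  (1u << 2)
    #define SEC_C                  (1u << 3)
    #define SEC_TEX(x)             ((unsigned)(x) << 12)

    #define ATTR_STRONGLY_ORDERED  (SEC_TEX(0))                  /* TEX=000 C=0 B=0 */
    #define ATTR_DEVICE_SHARED     (SEC_TEX(0) | SEC_B)          /* TEX=000 C=0 B=1 */
    #define ATTR_NORMAL_NONCACHED  (SEC_TEX(1))                  /* TEX=001 C=0 B=0 */
    #define ATTR_NORMAL_WB_WA      (SEC_TEX(1) | SEC_C | SEC_B)  /* TEX=001 C=1 B=1, write-back, write-allocate */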
  • All,

    I am extremely happy to report that using cached memory and explicit cache maintenance has more than doubled our throughput. We are now comfortably meeting the 10 MB/s requirement mentioned earlier, and it looks like we might even achieve the 20 MB/s target that seemed so out of reach just a few weeks ago.

    There was some confusion on how our memory was configured (understandably, since, to be honest, we weren't exactly sure ourselves at that point). Before the changes, we were using Normal (i.e. not Device nor Strongly-ordered) Non-shareable Non-cacheable memory.

    As Matthijs pointed out, Strongly-ordered memory comes with a lot of restrictions compared to Normal memory: for instance, all accesses must happen in program order, accesses must not be repeated (unless the program itself repeats them, of course), and speculative fetching is not allowed. This is all described in the ARMv7 reference manual, but it was not an issue in our case because, as it turned out, our memory was set up as Normal.

    Anyway, thanks everyone, really appreciate all your suggestions and help on this one,
    Carsten
  • Good to hear!

    Some final tips:

    Don't pay too much attention to the ARMv7 Architecture Reference Manual. It is overly broad and general, while the true behaviour of the Cortex-A8 is much simpler and more restrictive. If possible, locate revision B (2011) instead of the latest one, to avoid having to read around all the material on virtualization. Keep in mind the A8 is the very first ARMv7-A processor and, unlike later members, still executes in-order. It never performs speculative data accesses (only instruction prefetching).

    If you want more insight into the A8's behaviour, this paper is a good read. Also get the A8 TRM of course, but don't trust it too much.

    If you're processing a dataset that doesn't fit in L2 cache, the A8 has a "prefetch engine" which can load data into or evict data out of L2 cache in the background. Unfortunately its completion irq is not physically connected to anything afaik in TI's instantiation of the A8.

  • Thanks again Matthijs, that's some excellent advice!