
AM3517: Slow memory access

Other Parts Discussed in Thread: AM3517

I'm busy bringing up a new board using the AM3517, with an EVM as a reference design. I am using my own operating system; no third-party sources are used.

I am using the current X-Loader and U-Boot sources as references, together with the current TI docs.

So far all was well. I have 512Mb of DDR2 working fine (no termination resistors, no low-temperature problems; using VTG_DYNAMIC).

As part of the initial validation I run a series of simple tests to gauge performance (a good way of checking that all clocks etc. are set up as they should be). MPU at 600 MHz, Core at 332 MHz, everything else as per X-Loader/U-Boot.

Maximum execution speed using a small ASM loop (no data access) gets close to the expected 1,200 MIPS, so that is fine (with I-cache enabled).

But - with data access things go horribly wrong.


        MOV   R2,#0x100000      ; try for about 1 million accesses

LP:     LDR   R0,=0x40200000    ; internal RAM
        LDR   R1,[R0],#4
        LDR   R1,[R0],#4
        ..                      ; total of 10 LDR R1,[R0],#4 lines
        LDR   R1,[R0],#4
        LDR   R1,[R0],#4
        SUBS  R2,R2,#1
        BNE   LP

This results in around 4 million data accesses per second with the data cache disabled, and 40 million with it enabled (measured using a toggling LED and a scope).

Running the same code against SDRAM is slightly slower still.

That is pathetic. Something is wrong.

Trying the same on the EVM using X-Loader and U-Boot (without a kernel image, so it halts with a panic) and then running the code produces identical results, so I know my init is good (or at least the same as the EVM's).

Only 4 million accesses per second to non-cached internal RAM? That is not much more than the speed I would expect from an old 8-bit CPU. With a core clock of 332 MHz things should be MUCH faster. My guess is that some interface or sync clock is not running at the correct speed.

After fiddling with about every register I can think of, I'm close to giving up. Does anybody see the light that I am missing?

A further hint is that DSS accesses to memory seem to run at high speed, so this (together with the fact that the issue appears with both internal and external memories) points to the CPU side of the L3 interconnect.

Any suggestion or idea to try out is most welcome...

Rainier

  • Hi Rainier,

I see from your post that you are using X-Loader and U-Boot. They do not enable the MMU. On the Cortex-A8, you have to enable the MMU in order to use the L2 cache, and I think this may be affecting your performance. There are references in StarterWare on how to set up a simple flat-mapping configuration for the MMU. http://processors.wiki.ti.com/index.php/StarterWare

    There is no dedicated release for the AM3517 in StarterWare; however, if you get the AM335x version, low-level functions like the MMU and cache setup will be the same since both are Cortex-A8s. Here are memcpy benchmark results run under Linux.  http://sourceware.org/ml/libc-ports/2009-07/msg00000.html
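    The flat mapping StarterWare uses amounts to filling a 16 KB, 4096-entry level-1 translation table with 1 MB "section" descriptors and then pointing TTBR0 at it before setting the MMU and cache enable bits in CP15. The table-filling part can be sketched in plain C. This is my own illustrative code, not StarterWare's: the descriptor bits follow the ARMv7-A short-descriptor format, the address split and function names are assumptions, and the CP15 writes are only described in comments.

```c
#include <stdint.h>

/* ARMv7-A short-descriptor L1 "section" entry (1 MB granule).
 * Bits [31:20] = section base address, bits [1:0] = 0b10 mark a section.
 * C (bit 3) and B (bit 2) make the region write-back cacheable. */
#define SECTION_TYPE   (2u << 0)
#define SECTION_B      (1u << 2)
#define SECTION_C      (1u << 3)
#define SECTION_AP_RW  (3u << 10)   /* AP[1:0] = 0b11: full read/write */

/* Build one flat-mapped (VA == PA) section entry for megabyte mb_index. */
static uint32_t flat_section(uint32_t mb_index, int cacheable)
{
    uint32_t entry = (mb_index << 20) | SECTION_AP_RW | SECTION_TYPE;
    if (cacheable)
        entry |= SECTION_C | SECTION_B;   /* write-back cacheable */
    return entry;
}

/* Fill the 4096-entry L1 table: cache normal memory, leave the
 * peripheral window (assumed 0x48000000-0x5FFFFFFF here) uncached. */
void build_flat_table(uint32_t table[4096])
{
    for (uint32_t i = 0; i < 4096; i++) {
        uint32_t base = i << 20;
        int is_periph = (base >= 0x48000000u && base < 0x60000000u);
        table[i] = flat_section(i, !is_periph);
    }
    /* Next steps (CP15, not shown): write the table address to TTBR0,
     * open domain 0 in DACR, then set SCTLR.M/C/I to enable MMU+caches. */
}
```

    A real port would mark each region according to the actual AM3517 memory map; the point is simply that without this table and SCTLR.M set, the L2 cache never comes into play.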

    Results of the memcpy benchmark for the AM3517 (CPU clock = 600 MHz, DDR clock = 166 MHz):

    root@am3517-evm:~# ./memcpy_test
    --- Running correctness tests (use '-benchonly' option to skip) ---
    all the correctness tests passed

    --- Running benchmarks (average case/perfect alignment case) ---

    very small data test:
    memcpy_neon : (3 bytes copy) = 91.0 MB/s / 96.3 MB/s
    memcpy_arm : (3 bytes copy) = 82.2 MB/s / 87.0 MB/s
    memcpy_neon : (4 bytes copy) = 93.2 MB/s / 97.0 MB/s
    memcpy_arm : (4 bytes copy) = 64.4 MB/s / 71.0 MB/s
    memcpy_neon : (5 bytes copy) = 116.5 MB/s / 121.4 MB/s
    memcpy_arm : (5 bytes copy) = 73.0 MB/s / 88.6 MB/s
    memcpy_neon : (7 bytes copy) = 132.2 MB/s / 136.7 MB/s
    memcpy_arm : (7 bytes copy) = 89.6 MB/s / 124.4 MB/s
    memcpy_neon : (8 bytes copy) = 127.0 MB/s / 130.6 MB/s
    memcpy_arm : (8 bytes copy) = 100.5 MB/s / 129.9 MB/s
    memcpy_neon : (11 bytes copy) = 150.6 MB/s / 154.2 MB/s
    memcpy_arm : (11 bytes copy) = 121.8 MB/s / 179.5 MB/s
    memcpy_neon : (12 bytes copy) = 144.4 MB/s / 147.3 MB/s
    memcpy_arm : (12 bytes copy) = 132.2 MB/s / 174.3 MB/s
    memcpy_neon : (15 bytes copy) = 161.1 MB/s / 164.2 MB/s
    memcpy_arm : (15 bytes copy) = 151.1 MB/s / 217.2 MB/s
    memcpy_neon : (16 bytes copy) = 151.7 MB/s / 272.1 MB/s
    memcpy_arm : (16 bytes copy) = 159.9 MB/s / 214.4 MB/s
    memcpy_neon : (24 bytes copy) = 230.6 MB/s / 420.6 MB/s
    memcpy_arm : (24 bytes copy) = 203.5 MB/s / 270.1 MB/s
    memcpy_neon : (31 bytes copy) = 299.8 MB/s / 449.9 MB/s
    memcpy_arm : (31 bytes copy) = 235.4 MB/s / 324.3 MB/s

    L1 cached data:
    memcpy_neon : (4096 bytes copy) = 2210.0 MB/s / 2347.9 MB/s
    memcpy_arm : (4096 bytes copy) = 997.4 MB/s / 1677.3 MB/s
    memcpy_neon : (6144 bytes copy) = 2266.7 MB/s / 2357.2 MB/s
    memcpy_arm : (6144 bytes copy) = 1008.5 MB/s / 1694.2 MB/s

    L2 cached data:
    memcpy_neon : (65536 bytes copy) = 803.6 MB/s / 892.6 MB/s
    memcpy_arm : (65536 bytes copy) = 692.6 MB/s / 662.1 MB/s
    memcpy_neon : (98304 bytes copy) = 792.2 MB/s / 875.6 MB/s
    memcpy_arm : (98304 bytes copy) = 672.2 MB/s / 649.9 MB/s

    SDRAM:
    memcpy_neon : (2097152 bytes copy) = 298.4 MB/s / 310.9 MB/s
    memcpy_arm : (2097152 bytes copy) = 207.6 MB/s / 201.6 MB/s
    memcpy_neon : (3145728 bytes copy) = 298.0 MB/s / 310.8 MB/s
    memcpy_arm : (3145728 bytes copy) = 207.9 MB/s / 202.7 MB/s

    (*) 1 MB = 1000000 bytes
    (*) 'memcpy_arm' - an implementation for older ARM cores from glibc-ports

  • Thank you Jeff for your detailed answer.

    No, I am not running U-Boot or anything Linux; I used those only as references. I am porting my own operating system from an ARM9 platform.

    Forget the caches; they work as expected. The trouble is what happens on a cache miss, or with the caches disabled. A single-cycle DDR2 read takes around 300 ns (memory clocked at 166 MHz; on my board I can clock it up to 300 MHz with a 600 MHz core clock, so I have some room).

    Now, 300 ns is double the time my old ARM9 took to do the same thing, and that was running SDRAM at 100 MHz. This is what I don't get; it just HAS to be faster than that. If that memory cycle happens to be a 4-byte aligned read, we are getting little over 12 MB/s (not for a copy, just read and discard).
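    That figure is just the access size divided by the latency. A quick sanity check (the function name is mine, purely illustrative):

```c
/* Effective read bandwidth (bytes/s) when every access of
 * `bytes_per_access` bytes pays the full round-trip latency. */
static double miss_bandwidth(double bytes_per_access, double latency_ns)
{
    return bytes_per_access / (latency_ns * 1e-9);
}
```

    4 bytes every 300 ns works out to about 13.3 MB/s, which matches the roughly 12 MB/s measured.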

    I am using the clock and DRAM setups copied from the current U-Boot code. Just to make sure, I ran my tests on the EVM using the U-Boot startup as well (to confirm that I had not missed a step).

    Would you perhaps know whether Linux (which you are using) changes the U-Boot clock setups (which would indicate that both U-Boot and I are missing something)?

    You are getting 70 MB/s; I am getting 12 MB/s. Why?

    Rainier

  • Rainier,

    I understand your concern, but I believe the latency you are seeing is about what is expected. As I mentioned in your other post on latency, we measure DDR latency at around 290 ns on our AM3517 EVM. I came from the ARM7 all the way up to the Cortex-A8, so I was also surprised to see latency that large. Subsequent Cortex-A8 parts have been designed with improved latency since that part was released; however, you should still be able to match the performance that I posted. The Cortex-A8 has features to minimize cache misses, such as branch prediction buffers, so you rarely have to take these latency hits.

    We have had another customer coming from ARM9 complain about worse performance. In their case, the L2 cache was not properly enabled; that is why I was asking about L2. Sorry, but I have to ask again: have you enabled the MMU? I believe this is the issue.

    I really don't think the memory timing values are the issue. U-Boot sets up the timing and Linux does not change anything after that, so I think you should be good there. Are you using an AM3517 EVM (so the memory timings would be correct), or do you have your own custom board?

    OK, I have to accept these latencies then. It also means I may have to throw away the AM3517.

    You see - the latencies ARE critical. Here is why:

    Texas Instruments refuses to give developers native access to the SGX (and no, it is not Imagination Technologies blocking it; I asked them, and they say TI is allowed to give this information out).

    This leaves me with a problem. I don't use Linux or CE; I use my own operating system, which is used in aircraft flight information displays. I can't use the Linux or CE drivers, since every byte on the system is my own. So I decided the AM3517 should easily be sufficient if I just do what I always do: write the graphics drivers myself. Mostly that is not an issue, as most operations are writes and the AM3517 write buffer works well, giving high performance.

    The trouble comes in texture-mapping triangles with texture filtering. This is a traditional cache problem, where the cache becomes a hindrance rather than a boost: effectively most texture reads are cache misses, and there are lots of them. With a maximum of about 3 million pixel reads per second (assuming all are misses), it simply is a no-go.
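    The pixel budget follows from the same latency arithmetic. A rough worst-case model (names are mine; it assumes every texel fetch pays the full miss latency, which is the scattered-access case described above):

```c
/* Worst-case textured fill rate in pixels/s when every one of
 * `texels_per_pixel` fetches is a cache miss costing `miss_ns` ns. */
static double worst_case_fill_rate(int texels_per_pixel, double miss_ns)
{
    return 1.0 / ((double)texels_per_pixel * miss_ns * 1e-9);
}
```

    One fetch per pixel at ~300 ns gives about 3.3 million pixels per second; bilinear filtering's four texels per pixel pushes that well below a million.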

    In another thread you will notice I am trying to get Linux to work on this board (in order to take advantage of the SGX), so maybe I can construct a very small kernel just to host the SGX drivers. But perhaps the correct path in this case is to ditch this chip and go for the i.MX6 instead. It's a pity; I quite liked the AM3517 for a number of good reasons. I can't use an AM335x, as none of them have the BT.656 interface that I need.

    Rainier

  • There is no unified cache in this system, and what you described for the cache applies only to the CPU.

    Yes, quite. The CPU is all I have mentioned, and it is my concern. I don't have issues with the caches. The CPU->L3->IMIF->L3->CPU link, however, well, sorry, I have to say it: it "sucks".

  • Hello,

    Has anyone had success bringing up StarterWare on the AM35x Sitara?
    I have tried starting from the AM335x StarterWare project, but I get nothing on startup yet.
    I've tried building it both in CCS and IAR, and just changed main to print some characters:
    int main()
    {
    ...

        while (1)
        {
            // wait until there is room in the FIFO
            // (the status register must be read through a volatile pointer)
            while (*(volatile unsigned int *)(0x49020000 + 0x0044) & 0x00000001)
                ;
            // send the character
            *(volatile unsigned char *)(0x49020000 + 0x0000) = 'a';
        }
    }

    Is there some important difference that I should take care of (startup script/compiler)?
    Thanks for any ideas.

    Ran