memcpy() speed on Cortex A15 vs DSP

Other Parts Discussed in Thread: SYSBIOS

Hi,

I am working with the OMAP5432 SoC, running the QNX OS. I noticed that DDR RAM access in a common case is much slower from the DSP than from the Cortex A15.

Frequency settings: 1497/531/531 MHz (Cortex A15/DSP/DDR).

The best memcpy() speed I am able to reach on the DSP is about 60 MBytes/s. The DSP binary is based on sysbios-rpmsg's OMX server example (sysbios-rpmsg_2_00_12_32+glsdk1/src/ti/examples/srvmgr) and was compiled with the -O3 flag. The L1/L2 caches are switched on. I tried different ti.sysbios.hal.ammu.AMMU and ti.sysbios.hal.unicache.Cache settings.

The memcpy() speed on the Cortex A15 is about 600 MBytes/s.

Do those numbers sound reasonable? What may be the reason for a 10x difference in memcpy() speed?
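
For reference, my measurement loop looks roughly like the sketch below. It is a minimal sketch, assuming the TI C6000 compiler's TSCL cycle counter (from c6x.h) and the 531 MHz DSP clock quoted above; memcpy_mbps and its parameters are illustrative names, not my exact test code:

      #include <string.h>
      #include <c6x.h>   /* TSCL cycle counter register (TI C6000 compiler) */

      /* Copies 'size' bytes once and returns the throughput in MBytes/s.
         'dsp_mhz' is the DSP clock in MHz (531 in my case); the buffers are
         assumed to be already allocated in DDR. */
      static float memcpy_mbps(void *dst, const void *src, unsigned int size,
                               unsigned int dsp_mhz)
      {
          unsigned int t0, t1;

          TSCL = 0;              /* any write to TSCL starts the free-running counter */
          t0 = TSCL;
          memcpy(dst, src, size);
          t1 = TSCL;

          /* bytes per cycle * cycles per microsecond = bytes per microsecond = MB/s */
          return ((float)size / (float)(t1 - t0)) * (float)dsp_mhz;
      }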

Thank you.

  • Hello Ivan,

    The MPU subsystem implements an L2 memory system. This memory system consists of an L2 cache and associated L2 cache controller. The MPU L2 cache controller runs at full-CPU speed and is configured to have one 128-bit master port. The L2 cache controller is configurable via CP15 registers and is tightly coupled to the L1 memory system. The MPU L2 memory system supports ARM Instruction Set Architecture (v7).

    High-performance TI DSP (TMS320DMC64x™) derivative (DSP_C0) integrated in a megamodule, including local level 1 (L1) and level 2 (L2) caches and memory controllers, for audio processing and general-purpose imaging and video processing.
    L1 and L2 shared cache (part of the DSP megamodule, in the SCACHE_MMU_DSP block).

    The SCACHE_MMU_DSP supports different page sizes: large, medium, and small. The number of large pages, number of medium pages, etc., is defined at design time. The maximum number of large pages is eight.

    The clock signal for the DSP subsystem is provided by DPLL_IVA.

    The clock signal for the Cortex A15 is provided by DPLL_MPU.

    #Q1: Do those numbers sound reasonable?

    - Yes, the Cortex A15 uses the NEON data engine, which improves the speed of data transfers.
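
    As a rough illustration only (this is not the actual library code), a NEON copy loop written with the standard arm_neon.h intrinsics moves 16 bytes per iteration, for example:

        #include <arm_neon.h>
        #include <stdint.h>

        /* Illustration: 16 bytes per iteration using NEON quad registers;
           assumes 16-byte-aligned buffers and a size that is a multiple of 16. */
        static void copy_neon(uint8_t *dst, const uint8_t *src, uint32_t size)
        {
            uint32_t i;
            for (i = 0; i < size; i += 16)
                vst1q_u8(dst + i, vld1q_u8(src + i));
        }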

    #Q2: What may be the reason for a 10x memcpy() speed difference?

    - The first reason is the clock frequency at which the Cortex A15 and the DSP run (1497 MHz versus 531 MHz, a ratio of roughly 2.8x).

    - The second is the architectural difference between the Cortex A15 subsystem and the DSP subsystem.

    I want to note that the memcpy() function is implemented differently on the two subsystems. See libcstubs or board-support/u-boot/arch/arm/lib/memcpy.S / memcpy.c on the ARM side.
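
    On the DSP side, a comparable wide copy written with the C6000 _amem8/_amem8_const intrinsics (which map to LDDW/STDW) could look like the sketch below; it assumes 8-byte-aligned buffers and a size that is a multiple of 8, and copy_dwords is only an illustrative name:

        #include <stdint.h>

        /* Sketch: 8 bytes per iteration (one LDDW and one STDW), versus
           one byte per iteration for a naive byte-wise copy loop. */
        static void copy_dwords(uint8_t *dst, const uint8_t *src, uint32_t size)
        {
            uint32_t i;
            for (i = 0; i < size; i += 8)
                _amem8(&dst[i]) = _amem8_const(&src[i]);
        }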

    Best regards,

    Yanko

  • Hello Yanko,

    Thanks for the answer. Could you or someone else assist me a bit more with this topic?

    I extended my test to measure DDR memory read and write speed separately. The test code is based on LDDW/STDW instructions (generated from the _amem8/_amem8_const intrinsics below).

    According to my test, the read speed is around 100 MB/s, while the write speed is close to 390 MB/s.

    #Q3: Does 100 MB/s look like an expected value for the DSP's DDR memory read speed, given that the write speed is four times faster? If it doesn't, what kind of issue could slow down read access in such a way?

     

    Some details:

    Read test looks like:

          /* Assumptions: addr is a byte pointer to the test buffer and
             size is the buffer length in bytes (a multiple of 8). */
          uint64_t tmp = 0;
          uint32_t i;
          for (i = 0; i < size; i += 8)
          {
             tmp += _amem8_const(addr + i);   /* aligned 64-bit load (LDDW) */
          }

    and the write test like:

          uint64_t tmp = 0;
          uint32_t i;
          for (i = 0; i < size; i += 8)
          {
             _amem8(addr + i) = tmp++;        /* aligned 64-bit store (STDW) */
          }

    The tested memory region is placed in a TILER 2D memory region.

     

    Cache and AMMU settings:

     

    /* TILER region: Large Page (512M); cacheable, posted */
    /* config large page[2] to map 512MB VA 0x60000000 to L3 0x7FFFFFFF */
    AMMU.largePages[2].pageEnabled = AMMU.Enable_YES;
    AMMU.largePages[2].logicalAddress = 0x60000000;
    AMMU.largePages[2].size = AMMU.Large_512M;
    AMMU.largePages[2].L1_cacheable = AMMU.CachePolicy_CACHEABLE;
    AMMU.largePages[2].L1_posted = AMMU.PostedPolicy_POSTED;
    AMMU.largePages[2].L1_allocate = AMMU.AllocatePolicy_ALLOCATE;
    AMMU.largePages[2].L1_writePolicy = AMMU.WritePolicy_WRITE_THROUGH;
    AMMU.largePages[2].L2_cacheable = AMMU.CachePolicy_CACHEABLE;
    AMMU.largePages[2].L2_posted = AMMU.PostedPolicy_POSTED;
    AMMU.largePages[2].L2_allocate = AMMU.AllocatePolicy_ALLOCATE;
    AMMU.largePages[2].L2_writePolicy = AMMU.WritePolicy_WRITE_THROUGH;

     

    [ 594.962] L1_INFO

    [ 594.962]    version = 1

    [ 594.962]    ways = 4

    [ 594.962]    size = 32 kb

    [ 594.962]    slaves = 2

    [ 594.962]    masters = 1

    [ 594.962] L1_CONFIG

    [ 594.962]    secure = unlocked

    [ 594.962]    bypass = normal

    [ 594.962]    secint = non-secure

    [ 594.962]    secport = non-secure

    [ 594.962]    secmain = non-secure

    [ 594.962] L1_OCP

    [ 594.962]    wrap = non-wrap

    [ 594.962]    wrbuffer = non-buffered writes

    [ 594.962]    prefetch = follow MMU

    [ 594.962]    cleanbuf = no-clean/empty

    [ 594.962] L2_INFO

    [ 594.962]    version = 1

    [ 594.962]    ways = 8

    [ 594.962]    size = 128 kb

    [ 594.962]    slaves = 1

    [ 594.962]    masters = 1

    [ 594.962] L2_CONFIG

    [ 594.962]    secure = unlocked

    [ 594.962]    bypass = normal

    [ 594.962]    secint = non-secure

    [ 594.962]    secport = non-secure

    [ 594.962]    secmain = non-secure

    [ 594.962] L2_OCP

    [ 594.962]    wrap = non-wrap

    [ 594.962]    wrbuffer = non-buffered writes

    [ 594.962]    prefetch = follow MMU

    [ 594.962]    cleanbuf = no-clean/empty

    Thanks,

    Ivan

  • Hello Ivan,

    The DSP subsystem has a private direct memory access controller (DMA_DSP):
    For transfers between the DSP megamodule internal memories, use the DMA_DSP with the locked feature.
    The DMA_DSP is based on two primary components:
    • DMA third-party channel controller (TPCC)
    • DMA third-party transfer controller (TPTC)

    My suggestion is to change the setting of the following TPTC register in the DSP subsystem:

    TPTC_RDRATE[2:0] RDRATE - Read Rate Register

    Read Rate Control:
    Controls the number of cycles between read commands.  This is a global setting that applies to all TRs for this TC.
    0x0: Reads issued as fast as possible.
    0x1: 4 cycles between reads
    0x2: 8 cycles between reads
    0x3: 16 cycles between reads
    0x4: 32 cycles between reads
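
    A minimal sketch of how this field could be written from DSP-side code is shown below; the TPTC base address and the RDRATE register offset are placeholders that must be taken from the OMAP5432 TRM, and tptc_set_rdrate is only an illustrative helper:

        #include <stdint.h>

        /* Placeholders: fill in the DMA_DSP TPTC base address and the
           RDRATE register offset from the OMAP5432 TRM. */
        #define DSP_TPTC0_BASE    0x00000000u
        #define TPTC_RDRATE_OFS   0x0u

        static void tptc_set_rdrate(uint32_t tptc_base, uint32_t rdrate)
        {
            volatile uint32_t *reg =
                (volatile uint32_t *)(tptc_base + TPTC_RDRATE_OFS);
            *reg = rdrate & 0x7u;   /* RDRATE[2:0]: 0x0 = reads issued as fast as possible */
        }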

    Then try to test read speed again.

    Best regards,

    Yanko

  • Hello Yanko,

    Thanks for the suggestion.

    I tested the read speed with different TPTC_RDRATE[2:0] RDRATE settings for both TPTC instances (TPTC0, TPTC1) and noted that the read speed is not affected by this setting at all.

    According to my experiment, it only slightly affects the DMA_DSP DDR-to-DDR memory transfer speed: with TPTC_RDRATE[2:0] RDRATE = 0x0 the DMA_DSP is about 5-10% faster than with TPTC_RDRATE[2:0] RDRATE = 0x4.

    By the way, the DMA_DSP DDR-to-DDR memory transfer speed is up to 500 MB/s on my target.

    Are there any other suggestions?

    What additional information can I provide to help localize this issue?

     

    Thanks,

    Ivan