DM8148 DSP - DSP DDR performance issue

Hi,

I'm seeing slow DDR access from the DSP. What could be the reason? How can I set the priority of the DSP's connection to the L3 bus? Is there any other priority setting that can help?

Many Thanks,

HR

  • HR,

    Have you enabled the DSP cache?

    The DSP performance in the SoC depends on many factors. The most important is the EMIF/DDR characteristics, but the chip-level topology (bridging, interconnect, etc.) also contributes.

    If the cache is completely disabled, every DSP read transaction is subject to the round-trip latency of issuing a read from the CPU through the DSP subsystem, interconnect, bridges, EMIF, DDR, etc.

    Regards,
    Pavel
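
For context on the cache suggestion above: on C64x+/C674x-class DSPs, cacheability of external memory is normally granted per 16 MB region through the MAR registers. The sketch below only works out which MAR register covers a given address. The MAR0 address (0x01848000), the 16 MB granularity and the position of the "permit caching" bit are assumptions recalled from the generic C674x megamodule documentation and should be checked against sprufk5 before use. All function names are hypothetical, and the L1/L2 cache sizes still have to be configured separately.

```c
#include <stdint.h>

/* Assumed values from the C674x megamodule documentation; verify before use. */
#define MAR_BASE_ADDR    0x01848000u  /* assumed address of MAR0 */
#define MAR_REGION_SHIFT 24u          /* assumed: one MAR per 16 MB region */
#define MAR_PC_BIT       0x1u         /* assumed: "permit caching" bit */

/* Index of the MAR register that covers a 32-bit physical address. */
static inline uint32_t mar_index(uint32_t addr)
{
    return addr >> MAR_REGION_SHIFT;
}

/* Address of that MAR register in the DSP memory map. */
static inline uint32_t mar_reg_addr(uint32_t addr)
{
    return MAR_BASE_ADDR + 4u * mar_index(addr);
}

/* Target-only: mark the 16 MB region containing addr as cacheable.
 * Do not call this on a host machine; it writes to a hardware address. */
static inline void mar_enable_caching(uint32_t addr)
{
    volatile uint32_t *mar = (volatile uint32_t *)(uintptr_t)mar_reg_addr(addr);
    *mar |= MAR_PC_BIT;
}
```

For example, under these assumptions an address of 0x80000000 maps to MAR index 128.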

  • You can also configure the DSP static and dynamic pressure control in L3 interconnect.

    For dynamic pressure configuration in the L3 bandwidth regulators, see the DM814x TRM, section 1.12.2.3.3 Bandwidth Regulators. A bandwidth regulator increases the pressure when the actual consumed bandwidth is lower than the expected bandwidth, and decreases the pressure once the expected bandwidth is reached.

    For static pressure configuration use register INIT_PRIORITY_0. Valid values are 0x0 (low), 0x1 (medium) and 0x3 (high).

    Priority control can also be set in the EMIF, through the DMM PEG registers. The priority is a 3-bit field (0 ... 7), where 0 is the highest priority. It determines the prioritization of data transfers in the EMIF. See the DMM_PEG_PRIOx registers.

    Regards,
    Pavel

  • Hi Pavel,

    Yes, we know about using the cache.

    To clarify the L3 priority configuration:

    - For static DSP pressure configuration, should it be INIT_PRIORITY_0 = 0xC0?

    - Which bits in DMM_PEG_PRIOx should be configured to give the DSP a higher priority? There is the Master Connection ID table, but how does it relate to the DMM_PEG_PRIOx registers?

    Thanks,

    HR

  • HR,

    HRi said:
    For static DSP pressure configuration, should it be INIT_PRIORITY_0 = 0xC0?

    The correct value should be 0xFC (0b11111100), assuming that the MMU is involved:

    INIT_PRIORITY_0[7:6] MMU = 0x3 (high priority for the MMU port)

    INIT_PRIORITY_0[5:4] C674x_DSP_CFG = 0x3 (high priority for the DSP CFG port)

    INIT_PRIORITY_0[3:2] C674x_DSP_MDMA = 0x3 (high priority for the DSP data port)
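
As a sanity check on the 0xFC value, the three fields above can be packed with a few shifts. This is a minimal sketch using only the bit positions and encodings quoted in this thread; the macro and function names are hypothetical, and bits outside [7:2] are assumed to belong to other initiators, which is why a read-modify-write helper is included.

```c
#include <stdint.h>

/* Field positions in INIT_PRIORITY_0 as quoted in the post above. */
#define INIT_PRIO0_MMU_SHIFT      6u    /* bits [7:6] */
#define INIT_PRIO0_DSP_CFG_SHIFT  4u    /* bits [5:4] */
#define INIT_PRIO0_DSP_MDMA_SHIFT 2u    /* bits [3:2] */
#define INIT_PRIO0_DSP_MASK       0xFCu /* all three fields together */

/* Pressure encodings listed earlier in the thread. */
enum l3_pressure {
    L3_PRESSURE_LOW    = 0x0,
    L3_PRESSURE_MEDIUM = 0x1,
    L3_PRESSURE_HIGH   = 0x3
};

/* Pack the MMU, DSP CFG and DSP MDMA pressure fields. */
static inline uint32_t init_prio0_dsp_fields(enum l3_pressure mmu,
                                             enum l3_pressure cfg,
                                             enum l3_pressure mdma)
{
    return ((uint32_t)mmu  << INIT_PRIO0_MMU_SHIFT) |
           ((uint32_t)cfg  << INIT_PRIO0_DSP_CFG_SHIFT) |
           ((uint32_t)mdma << INIT_PRIO0_DSP_MDMA_SHIFT);
}

/* Merge new DSP fields into a previously read register value,
 * leaving all other bits (assumed to be other initiators) untouched. */
static inline uint32_t init_prio0_merge(uint32_t old_value, uint32_t dsp_fields)
{
    return (old_value & ~INIT_PRIO0_DSP_MASK) | (dsp_fields & INIT_PRIO0_DSP_MASK);
}
```

Setting all three fields to high yields 0xFC, while setting only the MMU field yields the 0xC0 value from the question.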

    HRi said:
    - Which bits in DMM_PEG_PRIOx should be configured to give the DSP a higher priority? There is the Master Connection ID table, but how does it relate to the DMM_PEG_PRIOx registers?

    Check section 6.2.1.1 Priority Extension Generator (PEG).

    In register DMM_PEG_PRIO1 (offset 0x624), fields P0 (bits 2:0), P1 (bits 6:4) and P2 (bits 10:8) would be used to change the DSP and MMU priority.

    To set the DSP MDMA priority to highest, write 0b1000 to DMM_PEG_PRIO1[3:0].

    Regards,
    Pavel

  • Pavel,

    Thanks. Regarding DMM_PEG_PRIO1, what is the connection between P0, P1, ... and Table 7-83. Master Connection IDs? It is not clear from section 6.2.1.1: "The 16 priority entries are software-programmable with DMM_PEG_PRIO0 (for the first eight entries of the ConnID table) and DMM_PEG_PRIO1 (for the last eight entries)"

    Thanks,

    HR

  • HR,

    HRi said:
    Regarding DMM_PEG_PRIO1, what is the connection between P0, P1, ... and Table 7-83. Master Connection IDs?

    The connection is with Table 1-178. ConnID Values (not 7-83).

    I think the 6-bit values are used for the DMM_PEG_PRIO0/1 registers.

    But you can also try with the 4-bit values, which are common for DSP and MMU : 0x2.

    Best regards,
    Pavel

  • Pavel,

    So if I take the 4-bit ConnID value from Table 1-178, then it means that the DMM_PEG_PRIO0 P2 priority will be set for "GEM (DSP_MDMA)", "GEM_CFG (DSP_CFG)" and "MMU". Is this correct, or am I missing something?

    Thanks,

    HR

  • HRi,

    HRi said:
    So if I take the 4-bit ConnID value from Table 1-178, then it means that the DMM_PEG_PRIO0 P2 priority will be set for "GEM (DSP_MDMA)", "GEM_CFG (DSP_CFG)" and "MMU". Is this correct, or am I missing something?

    This is correct. The priority related to index x is given in the Px field.

    Regards,
    Pavel

  • Pavel,

    So it should be DMM_PEG_PRIO0 = 0x800, since DMM_PEG_PRIO0 handles the first eight entries of the ConnID table. Is this correct?

    Thanks,

    HR

  • HR,

    HRi said:
    So it should be DMM_PEG_PRIO0 = 0x800

    With 0x800 you would also give high priority to P0 (A8 ARM) and P1 (Debug/JTAG), not only to P2 (DSP/MMU), because their fields would be 0, which is the highest priority. The value should be 0x844.

    HRi said:
    DMM_PEG_PRIO0 handles the first eight entries of the ConnID table,  is this correct?

    This is correct.

    Best regards,
    Pavel
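
The 0x800 versus 0x844 discussion is easier to follow when the register is viewed as eight 4-bit slots, one per ConnID index, with Px in the low three bits of slot x as described above. The helpers below are a minimal sketch of that layout with hypothetical names. Note that the values used in this thread also set the fourth bit of the P2 slot (0x8); its exact meaning is not spelled out here, so check the register description in the TRM.

```c
#include <stdint.h>

/* Bit offset of the 4-bit slot for ConnID index x (0..7) within one
 * DMM_PEG_PRIOx register, following the field layout quoted above. */
static inline uint32_t peg_slot_shift(unsigned x)
{
    return 4u * (x & 7u);
}

/* Extract the raw 4-bit slot for index x. */
static inline uint32_t peg_slot_get(uint32_t reg, unsigned x)
{
    return (reg >> peg_slot_shift(x)) & 0xFu;
}

/* Return reg with the 4-bit slot for index x replaced by nibble. */
static inline uint32_t peg_slot_set(uint32_t reg, unsigned x, uint32_t nibble)
{
    uint32_t shift = peg_slot_shift(x);
    return (reg & ~(0xFu << shift)) | ((nibble & 0xFu) << shift);
}
```

With these helpers, 0x844 decodes to slot values 4, 4 and 8 for indices 0, 1 and 2, whereas 0x800 leaves the first two slots at 0.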

  • Hi Pavel,

    What should the cycle count be for the DSP reading/writing data from/to the DDR, with and without cache?

    Thanks,

    HR

  • HR,

    You can refer to the below links:

    http://processors.wiki.ti.com/index.php/TI81XX_PSP_04.04.00.02_Feature_Performance_Guide

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/198610.aspx

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/264204.aspx

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/194014.aspx

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/717/t/105784.aspx

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/717/t/203063.aspx

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/717/t/136690.aspx

    Best regards,
    Pavel

  • Hi Pavel,

    In the thread - http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/198610.aspx there is a comment -

    "There are some issues w.r.t to bus access in DM81xx. AFAIK, it might get fixed in newer silicon revision. However, I cannot guarantee this."

    Was this fixed? Is it mentioned somewhere in the device documentation?

    Thanks,

    Haim

  • Haim,

    The differences between the device silicon revisions are described in the silicon errata document:

    http://www.ti.com/lit/er/sprz343c/sprz343c.pdf

    BR,
    Pavel

  • The changes between DM814x silicon version 2.1 and 3.0 are related to DMM arbitration, DDR Symmetry, HDVPSS lock up when accessed from CCS/JTAG, IO latch up, SATA gen3.

    See the DM814x silicon errata for more details.

    Regards,
    Pavel

  • Pavel,

    I haven't yet found the DSP R/W performance numbers (cycles) for accessing the DDR with/without cache. We will try the MMU bypass according to Advisory 3.0.25 of the device errata sprz343c.pdf.

    BTW, sprz343c.pdf is from March 2013 and the comment at http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/198610.aspx is from Oct. 2013, so I assume the DSP R/W bus issue still exists.

    Thanks,

    HR

  • Hi Pavel,

    We are measuring 222 cycles for a 32-bit non-cacheable DDR read and 26 cycles for a 32-bit non-cacheable DDR write. Are these realistic numbers?

    Thanks,

    Haim

  • Pavel,

    The cycle counts are measured on the DSP side:

    32bit DDR Non-Cacheable Read - 222 cycles

    32bit DDR Non-Cacheable Write - 26 cycles

    Thanks,

    HR

  • Hi Pavel,

    Any update regarding the DSP -> DDR read/write cycle counts? Are the numbers we are getting correct?

    Thanks,

    HR

  • Hello Pavel,

    I encountered the same issue with DDR3. I wrote a benchmark test (in assembly) that runs 2560 loop iterations, each of which reads 32 bits from a non-cached DDR address. The read address is incremented by 4 on each iteration. In my setup the DSP doesn't use the MMU; the MMU is disabled for the DSP.

    Running the same loop but incrementing the address by 16 instead of 4 gave the same cycle count.

    I would expect the 128-bit width between the System MMU / EDMA and the L3 Interconnect to make my first benchmark take far fewer cycles. However, it didn't. From my measurements it seems that the effective width of the DSP's interface to the L3 Interconnect is 32 bits, not 128 bits.

    Is that really the case? Does the DSP have an effective 32-bit interface to the L3 Interconnect when accessing the DDR directly (and not through the DMA mechanism)?

    Thanks in advance,

    Elad.
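
The original benchmark was written in assembly and is not shown in this thread. A rough C approximation of the access pattern described above (a fixed number of 32-bit reads with a configurable address stride) might look like the sketch below. The function name is hypothetical, and the buffer address, the non-cached mapping and the cycle measurement around the call are all left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Perform `count` 32-bit reads starting at `base`, advancing the address
 * by `stride_bytes` (must be a multiple of 4) after each read.
 * The volatile qualifier forces every load to be issued; the sum is
 * returned so the loads have an observable result. */
static inline uint32_t strided_read_sum(const volatile uint32_t *base,
                                        size_t count, size_t stride_bytes)
{
    const volatile uint8_t *p = (const volatile uint8_t *)base;
    uint32_t sum = 0;

    for (size_t i = 0; i < count; i++) {
        sum += *(const volatile uint32_t *)(p + i * stride_bytes);
    }
    return sum;
}
```

On the target, the two cases from the post would correspond to calling this with count 2560 and a stride of 4 or 16 on a non-cached DDR buffer, reading a cycle counter before and after each call.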

  • HRi, Elad,

    I do not have the DSP bandwidth/performance numbers (clock cycles).

    Elad Roichman said:
    Does the DSP have an effective 32-bit interface to the L3 Interconnect when accessing the DDR directly (and not through the DMA mechanism)?

    The DSP Subsystem has MDMA (master DMA) and SDMA (slave DMA) ports, which are used to communicate with the L3 through a 128-bit bus. If these (MDMA and SDMA) are not used, the DSP communicates with the L3 through the 32-bit CFG (configuration) bus.

    All DSP accesses through its MDMA port will be directed through the system MMU module where they are remapped to physical system addresses. When TPTC0/MDMA is routed through the MMU, accesses to the DSP SDMA port will still by-pass the MMU for increased performance.

    MDMA is a "port" name; it has no relation to EDMA or the DSP Subsystem IDMA.

    Refer to the C674x DSP Megamodule/Subsystem TRM for more info:

    http://www.ti.com/lit/ug/sprufk5a/sprufk5a.pdf

    Regards,
    Pavel

  • Pavel,

    Do the measured cycle counts make sense? Is there any chance of having someone at TI measure the DSP<->DDR performance?

    Thanks,

    HR

  • HRi,

    Note that the actual throughput is very scenario dependent, so it is difficult to give general numbers. It depends on the speed the memory is running at, the bus width being used, etc.

    Have a look also in the below e2e thread:

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/212586.aspx

    Best regards,
    Pavel

  • Pavel,

    If you go to the end of the mentioned thread you will see the customer's statement: "The simple answer is, thats just how long it takes....live with it. ;)". TI can probably supply the correct numbers and explain why we are getting them...

    Thanks,

    HR

  • HRi,

    One more document related to performance can be found in the e2e post below. Please have a look; it might help:

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/p/323639/1130127.aspx#1130127

    Best Regards,
    Pavel

  • Pavel,

    This explains the BW but still doesn't explain the delay/cycles count,

    Thanks,

    HR

  • Those numbers are in the ballpark.  The reason for the difference ...

    Reads incur a round trip latency through the entire system:

    DSP CPU->Cache subsystem->Interconnect->EMIF->DDR->EMIF->Interconnect->Cache subsystem->DSP 

    The CPU is stalled the entire time waiting for the request to travel forward, and the read response/data to travel back.  Since non-cached requests are not pipelined, only one read request/response is in-flight at a point in time.

    Writes are "fire-n-forget".  The DSP is effectively unstalled immediately once the data is outside of the DSP CPU's immediate view.  This allows multiple requests in flight at a point in time.

    In general, the latency impact can be mitigated by a) enabling the cache, or b) using the EDMA to transfer data between off-chip and on-chip memory.

    Regards,

    Kyle
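
To put the measured numbers in perspective, a stalled, non-pipelined access pattern turns per-access latency directly into throughput. The arithmetic below is a sketch with hypothetical helper names; the 750 MHz DSP clock is an assumption taken from the C674x@750MHz figure mentioned in this thread, and the 222-cycle read latency is the value reported above.

```c
/* Latency of one access in nanoseconds, given CPU cycles and clock in MHz. */
static inline double cycles_to_ns(double cycles, double cpu_mhz)
{
    return cycles * 1000.0 / cpu_mhz;
}

/* Throughput in MB/s (1 MB = 1e6 bytes) when each access must complete
 * before the next one starts, i.e. only one request is in flight. */
static inline double serial_throughput_mbps(double bytes_per_access,
                                            double cycles, double cpu_mhz)
{
    return bytes_per_access * cpu_mhz / cycles;
}
```

Under these assumptions, a 222-cycle read is 296 ns, and back-to-back 4-byte non-cached reads top out at roughly 13.5 MB/s. The same formula does not describe the 26-cycle writes well, since several writes can be in flight at once.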

  • Hi Kyle,

    Thanks. The issue is that we see the same or better performance on the C64x+ @ 700 MHz with DDR2 (C6424) than on the C674x @ 750 MHz with DDR3 (DM8148). We are using the cache on both. Is there any way to improve the DM8148 DSP<->DDR3 performance?

    Thanks,

    HR

  • HRi,

    HRi said:
    The issue is that we see the same or better performance on the C64x+ @ 700 MHz with DDR2 (C6424) than on the C674x @ 750 MHz with DDR3 (DM8148)

    We do expect that the performance of non-cacheable DDR accesses does not scale directly with frequency across fundamentally different device architectures.

    HRi said:
    any way to improve the DM8148 DSP<->DDR3 performance?

    a) enabling the cache, or b) using the EDMA to transfer data between off-chip and on-chip memory.

    If you are using a CCStudio project for your DSP benchmark, you can also try enabling DSP compiler and linker optimization with the -O3 flag.

    In CCS, Project -> Properties -> Build -> C6000 Compiler -> Optimization
    Optimization level (--opt_level, -O) 3
    
    Select "3" from the drop down menu. 

    Then Project -> Build All. You should see the following in the console window:

    Invoking: C6000 Compiler
    "/home/users/pbotev/ti/ccsv5/tools/compiler/c6000_7.4.2/bin/cl6x" -mv6740 --abi=coffabi -O3 -g ....
    Invoking: C6000 Linker
    "/home/users/pbotev/ti/ccsv5/tools/compiler/c6000_7.4.2/bin/cl6x" -mv6740 --abi=coffabi -O3 -g

    Then load and run the new *.out file.

    BR
    Pavel

  • Hi Pavel,


    I am also trying PEG on DM8168 to give the highest priority to the DSP.

    I tried setting DMM_PEG_PRIO1 (offset 0x624) to 0x00000844 as you described, but it seems to have no effect.

    Should I set DMM_PEG_PRIO0?

    I also found that section 4.2.1.1 Priority Extension Generator (PEG) of sprugx8b.pdf says:

    "However, for HD_VPSS, the peripheral itself generates a priority. The DMM_PEG_PRIOx field for HD_VPSS is bypassed and the priority indicated by HD_VPSS access is sent to SDRAM controller."

    Does this mean that the DSP priority will always be lower than that of HD_VPSS?


    thanks

    Andrew

  • Andrew,

    Andrew Huang said:
    I am also trying PEG on DM8168

    Please open a new thread in the DM816x forum:

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/717.aspx

    Regards,
    Pavel

  • Since this old thread got bumped anyway, I thought I'd mention:

    Pavel Botev said:

    I think the 6-bit values are used for the DMM_PEG_PRIO0/1 registers.

    But you can also try with the 4-bit values

    PEG indeed uses 6-bit ConnIDs, while TILER and PAT use 4-bit ConnIDs.

    (The number of connIDs differentiated by each module can actually be read from the byte at offset 0x208 (TILER) / 0x408 (PAT) / 0x608 (PEG), which on the dm814x therefore read as 16, 16, and 64 respectively.)

    Pavel Botev said:

    INIT_PRIORITY_0[7:6] MMU = 0x3 (high priority for the MMU port)

    My understanding is that since the MMU has a bandwidth regulator to provide dynamic initiator pressure, the static initiator pressure configured in the INIT_PRIORITY has no effect.  (In other words, the MMU having a field in INIT_PRIORITY is a documentation error)

    In case of doubt, the true initiator pressure can be verified by performing an invalid L3 access and inspecting the resulting L3 interconnect error.  The "expansion slot" (memory region 0x45000000-0x45ffffff) is convenient for this purpose since nearly all initiators can reach it and any access will always result in error.  The associated target agent is located at 0x44000600.  When an error is logged, the register at offset 0x48 reads as 0x80001 and the register at offset 0x4C will contain the initiator pressure in bits 6-7.  Write -1 to offset 0x48 to clear the error (no new error can be logged otherwise).
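
The verification procedure described above can be sketched in C as follows. The addresses, offsets, the 0x80001 status value and the bit positions are taken directly from this post; the macro and function names are hypothetical, and the register-access function is meant for the target only, after a deliberately invalid access to the expansion slot has been performed.

```c
#include <stdint.h>

/* Values quoted in the post above; names are hypothetical. */
#define L3_EXP_SLOT_TA_BASE  0x44000600u /* target agent of the expansion slot */
#define L3_TA_ERR_STATUS_OFF 0x48u       /* reads 0x80001 once an error is logged */
#define L3_TA_ERR_INFO_OFF   0x4Cu       /* initiator pressure in bits 7:6 */
#define L3_TA_ERR_LOGGED     0x80001u

/* Extract the initiator pressure (0..3) from the value read at offset 0x4C. */
static inline unsigned l3_errlog_pressure(uint32_t info)
{
    return (info >> 6) & 0x3u;
}

/* Target-only: return the logged pressure and clear the error,
 * or -1 if no error is currently logged. Do not call on a host machine. */
static inline int l3_read_logged_pressure(void)
{
    volatile uint32_t *ta = (volatile uint32_t *)(uintptr_t)L3_EXP_SLOT_TA_BASE;

    if (ta[L3_TA_ERR_STATUS_OFF / 4u] != L3_TA_ERR_LOGGED) {
        return -1;
    }
    unsigned pressure = l3_errlog_pressure(ta[L3_TA_ERR_INFO_OFF / 4u]);
    ta[L3_TA_ERR_STATUS_OFF / 4u] = 0xFFFFFFFFu; /* write -1 to clear the log */
    return (int)pressure;
}
```

Clearing the log after each read matters because, as noted above, no new error can be logged until the previous one is cleared.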

  • Matthijs van Duin said:

    My understanding is that since the MMU has a bandwidth regulator to provide dynamic initiator pressure, the static initiator pressure configured in the INIT_PRIORITY has no effect.  (In other words, the MMU having a field in INIT_PRIORITY is a documentation error)

    Can someone from TI confirm that on the C674x the MMU overrides INIT_PRIORITY, and that INIT_PRIORITY in fact has no effect?

    Does the same statement hold true for EDMA3 transfers?

    Thanks,

    Andrew

  • Andrew Elder said:

    Can someone from TI confirm that on C674x MMU overrides INIT_PRIORITY and in fact INIT_PRIORITY has no effect?

    Does the same statement hold true for EDMA3 transfers?

    I think more generally there's a pressing need to have some kind of authoritative statement from TI on how prioritization in the interconnect works exactly.  Not just how initiator pressure is selected, but also the topology of the L3, since its switches are where arbitration based on these values takes place.  This is essential information for performance analysis, yet the TRM shows the L3 as a kind of "black box".

     

    Here's at least some results of testing I did:

     

    EDMA TC 0 and TC 2 (which do not have bw regulators) use INIT_PRIORITY as expected.

    EDMA TC 1 and TC 3 have bw regulators providing dynamic pressure and ignore INIT_PRIORITY.  Although their prioritized bandwidth is zero by default, they have a 1-byte quotum (minimum) therefore the first access after reset gets high pressure (default 3).

     

    DSP MDMA uses INIT_PRIORITY, however it connects only to the MMU.  If the MMU is truly shut off via PRCM then you'll see this pressure and initiator id 0x08 in the error logged in the MMU target agent on every MDMA access.

    However, if the MMU is turned on, even if "disabled" via its config or using MMU_CFG in the control module, then it proxies the DSP traffic and therefore

    • initiator ID becomes 0x0A (MMU)
    • pressure is obtained from the MMU bandwidth regulator (INIT_PRIORITY ignored)
    • every error shows up twice:  once for the original DSP -> MMU access (in MMU target log) and once for proxied access (with MMU as initiator)

     

    When TC 0 / 1 redirection is enabled in MMU_CFG then (for writes only, reads are unaffected due to an erratum) their situation becomes exactly the same as that of the DSP.  Since their original pressure values are used to reach the MMU, presumably they are used for arbitration of access to the MMU.  In this case the INIT_PRIORITY value for MDMA is therefore not entirely useless; however, I suspect few people redirect TCs through the MMU anyway.