
TDA4AH-Q1: about InnerShareable & OuterShareable

Part Number: TDA4AH-Q1
Other Parts Discussed in Thread: PROCESSOR-SDK-J784S4, AM69, TDA4VH


Hi TI team,
We are using PROCESSOR-SDK-J784S4 (TDA4AH-Q1) and are considering two patterns for how to deploy the OS.
(Attachment: document.xlsx)

I have two questions about the Shareable attributes of the L2 cache.

Q1.
Regarding the Inner-Shareable/Outer-Shareable settings for the A72 clusters,
is my understanding correct?
Please let me know if there are any mistakes.

Q2.
Where can the settings in Q1 be configured?
From the following document, it appears that the relevant block is the MSMC (Multicore Shared Memory Controller), but we could not find the specific configuration procedure.

J784S4, TDA4AP, TDA4VP, TDA4AH, TDA4VH, AM69 Processors
Technical Reference Manual
https://www.ti.com/product/ja-jp/TDA4AH-Q1#tech-docs

document.xlsx

  • Hello,

    We will consult experts internally and get back on this.

    - Keerthy

  • Thanks for the reply. We will wait for the information.

  • Hi TI team,
Is there any update on this?
If possible, could you let us know when you expect to be able to respond?

  • Hi Takeshi-san,

    Apologies for the delayed response.

Inner Shareable is used for SMP across the A72 cores.

    https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1283968/tda4vh-q1-cache-coherency/4876451#4876451

The thread above contains a comprehensive response.

    - Keerthy

  • Hi Keerthy,
    Thank you for your response.

    e2e.ti.com/.../4876451
    > For TDA4VH each A72 cluster is direct connected to the MSMC3 so both A72 clusters are fully coherent between each other
From the above, I understand that the A72 clusters are fully cache-coherent with each other.
Is coherency also maintained between A72 CL0 and A72 CL1?

We are planning to deploy the OS as follows, and would like to know whether any software configuration is needed for coherency between the clusters.
 A72 CL0 : OS#1
 A72 CL1 : OS#1

We are also considering deploying the OS as follows.
Since the clusters are coherent with each other, we do not think any software configuration is necessary.
Is this correct?
 A72 CL0 : OS#1
 A72 CL1 : OS#2

  • Hi Takeshi-san,

The two Cortex-A72 clusters are designed with the ability to be coherent with each other. To run a traditional Arm HLOS in SMP mode across both clusters (CL0: 4 cores + CL1: 4 cores), each Cortex-A72 cluster needs to define the common memory as Inner Shareable. TDA4VH has run SMP Linux and QNX since its launch.

Full coherency across the clusters, and with MSMC-DDR and MSMC-SRAM, does require MSMC3 snooping to be set up properly. MSMC3 snooping also facilitates the partial I/O coherency assists described in the diagrams referenced in the other E2E threads. It is also notable that other out-of-cluster mechanisms, such as exclusives (when not handled by local monitors), rely on the MSMC3 snoop machinery.

Each A72 cluster by default drives the signals BROADCASTINNER = BROADCASTOUTER = 1 (as the Arm Cortex-A72 TRM requires). I would expect this to be the default setting for both 8-core SMP and 4-core-SMP + 4-core-SMP; using any other setting would break coherency features outside of each core's L1/L2. The setup found in the ATF code is critical for stable TDA4VH operation: https://git.ti.com/cgit/atf/arm-trusted-firmware/tree/plat/ti/k3/common/k3_helpers.S#n108; ensure your settings match our reference!

It is possible to experiment with per-cluster settings, as you explore in your xls, by setting up the control registers before booting each cluster. The registers are documented in the register addendum. However, our recommendation is to use the defaults.
    45A0 1028h : COMPUTE_CLUSTER_DMSC_WRAP_0_DMSC_BOOT_PM_CONFIG0.BROADCAST_INNER0
    45A0 2028h : COMPUTE_CLUSTER_DMSC_WRAP_0_DMSC_BOOT_PM_CONFIG1.BROADCAST_INNER1

The above two registers are only meaningful for the Arm clusters, since these signals are tied to A72 implementation details. While the DSPs share the same MMU format, their micro-implementations are not identical to Arm's. For IP connected to native TI CBASS buses, shareability (non/inner/outer/system) is conveyed via the Sdomain bus sideband. Note that if one skims the Arm documentation too quickly, it is easy to confuse inner/outer shareability with inner/outer cacheability; they are distinct attributes.

    Regards,
    Richard W.
  • Hi Richard,
    Thank you for the detailed explanation.
    I understand much better now.

    We are reviewing the answers within our team.

  • Hi Richard,
    Sorry for the late reply.
    We have checked with our team and confirmed that there is no problem with the default settings.

    We are also considering the following pattern.

    [6-core-SMP + 2-core-SMP]
    A72 Cluster0
        Core0:OS#1
        Core1:OS#1
        Core2:OS#1
        Core3:OS#1
    A72 Cluster1
        Core0:OS#1
        Core1:OS#1
        Core2:OS#2
        Core3:OS#2

Since the physical memory regions used by the two OSes are completely separated, we do not expect L1/L2 cache operation to be a problem.
However, we believe that among the cache and TLB maintenance operations, the ones that are not address-based are broadcast to all cores in the cluster.
We therefore expect some performance impact; please let us know if there are any other issues or concerns.

    [Example]
    Cache maintenance
        IC IALLUIS
    AArch64 TLB maintenance
        TLBI <Type:All,VMALL,VMALLS12>

    Best regards,
    Takeshi K.
  • Hi Takeshi-san,

Yes, in a 4+4 arrangement isolation is better because the L1/L2 caches are local to each cluster. There is still L3 and DDR coupling, which can cause some jitter if not handled well. A simple way to plan is to think in 80/20 terms: when headroom is low (high contention), effects can be non-linear, compared to the linear behavior seen at average or low loading.

In your 4+4 or 6+2 use case, using VMIDs would likely help with isolation and also cut off some of the cache and TLB maintenance broadcast costs you cite.

    Regards,
    Richard W.
  • Hi Richard,
    Thanks for the information.
    Using VMID is a good idea.
    We will consider it as soon as possible.

Looking at the Arm specs, I assume the VMID tagged into the TLB comes from VTTBR_EL2.VMID.
    (developer.arm.com/.../1-0)

    We are using PROCESSOR-SDK-LINUX-J784S4, is the VMID set in this environment?
    (I am assuming that it is not set).

    Also, please tell me about the following registers,
    6D00 A300h : COMPUTE_CLUSTER_DRU0_0_TLB_DBG_DATA0.VMID  <- for debugging?
    6D00 B880h : COMPUTE_CLUSTER_DRU0_0_VTBR.VMID           <- VMID value?

    Best regards,
    Takeshi K.

  • Hi Takeshi-san,

An SDK 9.x default build/boot of Linux does not touch the VMID. A quick look in the debugger confirms this.

The DRU (data routing unit) is an MSMC-infrastructure-specific DMA engine. It can do things like pre-warm the L2 and L3 caches (along with other types of transactions). For this to work with cluster data semantics it must be VMID-aware. The DBG registers allow reading the TLB CAMs (content-addressable memories). When the TLB is looked up for a transaction, the VMID is part of the match, and the debug registers let you see the values of the active TLB entries.
    Regards,
    Richard W.
  • Hi Richard,
    Thanks for the information.

We have discussed this within our team and decided not to use VMIDs.
Our system has virtualization disabled, and we judged that enabling VMID use would have too large an impact.

Now that our questions are resolved, we will close this thread.
Thank you very much for your kind response.

    Best regards,
    Takeshi K.

  • Hi Takeshi-san,

    OK, glad you are able to make a more informed decision.

If you have to write all the code from scratch based on your needs (say safety, licensing, ...), it would take up-front time, though it should be low cost after that. It also seems there are more virtualization options these days, which for some applications provides a faster path to usage.

    Regards,
    Richard W.