
TCI6638K2K: SoC Hang problem

Part Number: TCI6638K2K

Hello,

We are using a custom hardware design based on the K2K EVM, with TCI6638K2K Rev 2.0 and MCSDK 3.1.1.4.

We are running various test scenarios and test cases, and under heavy traffic we mainly see two fatal problems:

  • The SoC becomes unresponsive and all cores appear to freeze.
    • When connected via the debugger, we can access L2 SRAM but cannot access DDR3A/B.
    • We have already reviewed the application against Silicon Errata Advisory 36.
    • We verified that the DSP cores running in parallel are all hung at Qmss_queuePushDescSize().
    • If we add delays between consecutive or parallel queue pushes (e.g. an otherwise unnecessary _mfence(), or pushing via VBUSP), the problem becomes rare or disappears entirely (we are not certain of this); see the sketch after this list.
    • If we use DDR3B for the external linking RAM instead of DDR3A, we do not encounter the problem at all, but we are not sure whether this is a permanent fix or only a temporary workaround.
  • The AIF overflow (ingress) and AIF starvation (egress) error counters increment, showing that the AIF experiences a considerable number of stalls.
    • The AIF ingress and egress descriptors are in the MSMC and L2 SRAM regions, respectively.
    • If we move some of the most heavily used descriptors from the external linking RAM region to the internal linking RAM region, the starvation problem still occurs, but much less often than before.
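
For reference, the sketch below shows roughly what the delay workaround looks like in our push path. The queue handle, descriptor pointers, and size are placeholders, and whether a single _mfence() is sufficient is an open question on our side:

```c
/* Sketch: serializing back-to-back QMSS pushes with _mfence() so each
 * push write completes before the next one is issued.
 * txQ, desc0/desc1 and descSize are hypothetical placeholders. */
#include <stdint.h>
#include <c6x.h>                   /* _mfence() intrinsic on C66x      */
#include <ti/drv/qmss/qmss_drv.h>  /* Qmss_queuePushDescSize()         */

void pushTwoDescs(Qmss_QueueHnd txQ, void *desc0, void *desc1,
                  uint32_t descSize)
{
    Qmss_queuePushDescSize(txQ, desc0, descSize);
    _mfence();  /* stall the CPU until the push write leaves the core  */
    Qmss_queuePushDescSize(txQ, desc1, descSize);
}
```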

We cannot find the root cause of either problem; any help would be appreciated.

Thanks & Regards.

  • These kinds of problems can be particularly difficult to solve.  Since QMSS is a central resource to the system, finding cores at a queue push does not necessarily mean QMSS is the problem.  In fact, QMSS push/pops do not block, so if a core is hung on a read/write access to a valid QMSS register, then something catastrophic has likely happened.

    Are you able to determine what happens first, the AIF2 errors or the DSP hangs? From your description, it sounds like AIF2 memory usage may be the majority of the problem. Since AIF2's timing cannot be delayed without causing serious problems (breaking the link, descriptor starvation, etc.), it is recommended to use the lowest-latency resources for it. You should not be using external linking RAM for AIF2 descriptors. K2K has double the linking RAM of most other Keystone devices, so this should not be a problem. You can also give the AIF2 PktDMA priority over all other PktDMAs in the system (see section 4.2.1.4 of the Navigator user guide), but this is often not necessary. A configuration sketch follows below.
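
    For illustration, here is a minimal sketch of a QMSS initialization that keeps all descriptor tracking in the internal linking RAM (in the QMSS LLD, linkingRAM0Base = 0 selects the internal linking RAM and linkingRAM0Size = 0 uses its full size; the descriptor count is illustrative):

    ```c
    /* Sketch: QMSS init with all descriptors tracked in the internal
     * linking RAM, so AIF2 never depends on DDR for linking-RAM
     * lookups. The descriptor count is illustrative. */
    #include <string.h>
    #include <ti/drv/qmss/qmss_drv.h>

    extern Qmss_GlobalConfigParams qmssGblCfgParams; /* qmss_device.c */

    Qmss_Result initQmssInternalOnly(uint32_t numDesc)
    {
        Qmss_InitCfg initCfg;
        memset(&initCfg, 0, sizeof(initCfg));

        initCfg.linkingRAM0Base = 0;  /* 0 = use internal linking RAM  */
        initCfg.linkingRAM0Size = 0;  /* 0 = use maximum internal size */
        initCfg.linkingRAM1Base = 0;  /* no external linking RAM       */
        initCfg.maxDescNum      = numDesc;

        return Qmss_init(&initCfg, &qmssGblCfgParams);
    }
    ```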

    If you can make the problem disappear by changing the resources used by AIF2, then delays and/or starvation were likely the issue.

  • Hi,

    I think we are facing two different problems: the SoC hang problem and the AIF starvation problem. We are not sure whether they are correlated.

    After we moved the QMSS external linking RAM from DDR3A to DDR3B, the SoC hang problem disappeared, but we then faced the AIF starvation problem.

    AIF starvation occurs rarely (at 20 MHz, 8 AxC, 2 links, the pe_db_starvation counter increments about once every 8-10 hours). The AIF2 descriptors are allocated in internal linking RAM, and the AIF PktDMA has the highest priority in the SoC. Additionally, we are sure that the SoC hang occurs before the AIF starvation.

    The main problem here is the SoC hang, since we are not sure about its root cause. Adding otherwise unnecessary delays (_mfence(), etc.) between QMSS queue pushes, or moving the external linking RAM from DDR3A to DDR3B, has somehow helped us overcome(?) the SoC hang in our application, but we do not know whether we have merely postponed the problem or solved it permanently. (A sketch of our DDR3B linking RAM configuration follows.)
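
    For reference, this is roughly how we moved the external linking RAM to DDR3B (the DDR3B base address and descriptor counts below are illustrative placeholders, not our exact values):

    ```c
    /* Sketch: QMSS init with the overflow (region 1) linking RAM placed
     * in DDR3B instead of DDR3A. Address and counts are placeholders. */
    #include <string.h>
    #include <ti/drv/qmss/qmss_drv.h>

    #define EXT_LINKRAM_DDR3B_BASE  0x60000000u  /* DDR3B window on K2K */
    #define NUM_INTERNAL_DESC       16384u
    #define NUM_TOTAL_DESC          32768u

    extern Qmss_GlobalConfigParams qmssGblCfgParams; /* qmss_device.c */

    Qmss_Result initQmssExtLinkRamDdr3B(void)
    {
        Qmss_InitCfg initCfg;
        memset(&initCfg, 0, sizeof(initCfg));

        initCfg.linkingRAM0Base = 0;                      /* internal  */
        initCfg.linkingRAM0Size = NUM_INTERNAL_DESC;
        initCfg.linkingRAM1Base = EXT_LINKRAM_DDR3B_BASE; /* overflow  */
        initCfg.maxDescNum      = NUM_TOTAL_DESC;

        return Qmss_init(&initCfg, &qmssGblCfgParams);
    }
    ```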

    Is there any possibility that some sort of bus congestion could be the reason for this? For example, I have read a statement regarding this in the “KeyStone Connectivity and Priorities” document.

    In addition, I have read a similar statement in the “KeyStone II Architecture Antenna Interface 2 (AIF2)” document.

    This is actually why we posted these two problems under the same topic.

    We captured the external linking RAM write transactions for DDR3A and DDR3B via ProTrace. The average access-to-access duration is almost the same for both.

    Thanks in advance.