TDA4VM: MSMC Register Bus Error

Part Number: TDA4VM
Other Parts Discussed in Thread: DRA829

Hi TI:

    We have run into a problem with MSMC. A few hours after the TDA4 started, the A72 hung without any console log output. After connecting with Trace32, we found that we can "attach" to the A72 and C71 but cannot "break". We also found that registers such as GPIO/PCIE/GIC can be accessed correctly, but accesses to MSMC/CLEC registers generate a bus error. We want to know what kinds of events can cause accesses to MSMC registers (and also MSMC SRAM) to generate a bus error. By the way, after we rolled back some of the CNN Stereo code, the problem disappeared, so it may be related to the CNN Stereo feature. We really hope TI can provide some ideas to help us locate this problem.

    Thanks.

Best regards

lvan.zhang

  • Hi:

         We found the following two items in the document "J721E DRA829/TDA4VM Processors Silicon Revision 1.1/1.0": "The C71x Memory System and CPU May Stall Indefinitely in the Presence L1D Snoops" and "DMA Accesses to L1D SRAM May Stall Indefinitely in the Presence Cache Mode Change Or Global Writeback in Specific Conditions".

         Could these issues cause a bus error on MSMC register accesses?

  • Hello,  I was asked to review your question.  Are you seeing this behavior on a TI-EVM or is it on your own customer design?

    What you describe appears to be a hang caused by some kind of bus protocol violation (an endpoint accepted a transaction but did not complete it). The CPU will not enter debug halt until all outstanding transactions have completed. This matches the situation you describe, where you can attach but not break, yet can still read some other system bus endpoints via the DAP. When did this issue start happening? Was some new code added which destabilized your system? Identifying what changed may be critical to understanding the issue.

    This type of issue can be triggered by things like a PDN (power delivery network) issue at the board level, or sometimes by a programming sequence violation. To check the PDN angle, you might try increasing the voltage to the CPU domains by something like 100mV (by changing the PMIC settings) and then re-testing. If the issue goes away, then it's possible that some kind of transient noise event caused a localized brownout which triggered the failure. Another place to explore is the stability of your system's DDR, as it will be sensitive to noise events in a similar way to the CPU. You might try slowing it down from 4266MHz to something slower like 3733MHz.

    Whenever this issue happens, you might also check whether the code flow path was the same on the A72 and the C7x from test to test (including the last Program Counter (PC) for the A72). This is possible even if you cannot 'break' but can only 'attach' with TRACE32. The easy way to get the last PCs is to use TRACE32's 'snooper' PC sampling method. The snooper, when configured to do PC sampling via DAP (not the stop-and-go method), will read the CPU's PC through a DAP port address export. If the CPU PC 'sticks' waiting for a completion, that same (virtual) PC address will be continuously re-sampled. This will be the last instruction executed. If you look back in the snooper log you will get the sample-based flow of PCs into the hang. You should check whether the system follows the same path into the hang each time. I would do this on the A72 first, then try the C7x.

    The resolution of the sample-based history is proportional to the JTAG's maximum sampling rate. On a clean design you can probably push the clock to just below 50MHz; in TRACE32 you can run system.detect.jtagclock to find the fastest rate.

    For a 'precise' history into the hang you can also set up the CPUs to trace into their local onchip trace buffer. It is possible to read out the onchip trace buffer from TRACE32 even if the CPU is hung. If you have loaded your ELF file with the /plusvm option, the debugger can decode the onchip symbols with the setting onchip.ACCESS.VM even if the target CPU and memory are not accessible. The onchip buffer will hold the last few thousand instructions, which can provide a good clue.
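
    As a rough illustration of the comparison described above, here is a minimal sketch that takes PC samples exported from a debugger log and reports the address where the PC stopped changing, plus the path leading into it. The function names, the `min_repeats` and `depth` parameters, and the input format (a plain list of integers, oldest sample first) are assumptions for illustration, not a TRACE32 export format.

```python
def find_stuck_pc(samples, min_repeats=3):
    """Return (stuck_pc, lead_in) if the newest samples all show the same PC.

    samples: sampled PC values, oldest first (hypothetical export format).
    lead_in: PCs seen before the hang, with consecutive duplicates collapsed.
    Returns (None, []) when the tail does not repeat at least min_repeats times.
    """
    if not samples:
        return None, []
    last = samples[-1]
    # Walk backwards while the sampled PC does not change.
    i = len(samples)
    while i > 0 and samples[i - 1] == last:
        i -= 1
    if len(samples) - i < min_repeats:
        return None, []
    # Collapse consecutive duplicates in the history before the hang.
    lead_in = []
    for pc in samples[:i]:
        if not lead_in or lead_in[-1] != pc:
            lead_in.append(pc)
    return last, lead_in


def same_path(run_a, run_b, depth=4):
    """True if two runs hang at the same PC after the same last `depth` PCs."""
    pc_a, path_a = find_stuck_pc(run_a)
    pc_b, path_b = find_stuck_pc(run_b)
    return pc_a is not None and pc_a == pc_b and path_a[-depth:] == path_b[-depth:]


if __name__ == "__main__":
    run1 = [0x1000, 0x1004, 0x1004, 0x2000, 0x2008, 0x2008, 0x2008, 0x2008]
    run2 = [0x1000, 0x1004, 0x2000, 0x2008, 0x2008, 0x2008]
    pc, path = find_stuck_pc(run1)
    print(hex(pc), [hex(p) for p in path])
    print(same_path(run1, run2))
```

    Because the samples are taken at the JTAG sampling rate, the lead-in is only a sparse view of the real flow; two runs that match here are a hint that the path is the same, not proof.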

    An issue which only happens sparsely (after a few hours) can be challenging to debug. I would recommend first trying big-knob items like 'voltage boost' and 'frequency reduction' (CPU and/or DDR), and also carefully comparing a working and a non-working version of your system software. Additionally, finding ways to accelerate the issue will likely be needed to resolve it if it persists. Using the debugger to understand the final failure point and the history just before the issue is a good way to find clues.

    Regards,

    Richard W.

  • Hi Richard:

    I got some records of this issue from my colleagues, and the PC always points to the same value.

  • Hi,

    There is an erratum related to this, i2064: https://www.ti.com/lit/pdf/sprz455

    i2064 C71x: DMA Accesses to L1D SRAM May Stall Indefinitely in the Presence Cache Mode Change or Global Writeback in Specific Conditions

    Details: DMA reads or writes to L1D SRAM may stall indefinitely. These transactions are required to sensitize this condition:

    1. L1D Cache Mode Change or Global Writeback/Writeback w/ invalidate. These are initiated by ECR writes to CPU registers.
    2. CPU loads while the cache mode change or global Writeback is in progress. This can be due to a CPU transaction that is scheduled in parallel with the MOVC instruction that writes to the ECR register.
    3. DMA Reads or Writes to a buffer in L1D SRAM.

    These transactions do not need to be to the same address, but #2 and #3 have to be in flight when #1 is in progress. In this case, the DMAs stall indefinitely even after the cache mode change or global Writeback finishes.

    Workaround(s): Avoid doing DMAs to buffers mapped to L1D SRAM.

    This could be one possibility as well.
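
    To audit whether a design respects the i2064 workaround quoted above, one option is to list the DMA transfers the software programs and flag any whose source or destination overlaps the L1D SRAM window. The sketch below assumes the transfers have already been collected as tuples. The window base and size are placeholders and must be replaced with the values from the device memory map, and the function and variable names are hypothetical.

```python
# Placeholder window: NOT the real TDA4VM addresses. Take the actual
# C71x L1D SRAM base and size from the device memory map.
L1D_SRAM_BASE = 0x6400_0000  # hypothetical
L1D_SRAM_SIZE = 0x4000       # hypothetical (16 KiB)


def overlaps(base, size, win_base, win_size):
    """True if [base, base+size) intersects [win_base, win_base+win_size)."""
    return size > 0 and base < win_base + win_size and win_base < base + size


def dma_touching_l1d(transfers, win_base=L1D_SRAM_BASE, win_size=L1D_SRAM_SIZE):
    """Return the names of transfers whose source or destination hits the window.

    transfers: iterable of (name, src_addr, dst_addr, nbytes) tuples.
    """
    return [
        name
        for name, src, dst, nbytes in transfers
        if overlaps(src, nbytes, win_base, win_size)
        or overlaps(dst, nbytes, win_base, win_size)
    ]


if __name__ == "__main__":
    transfers = [
        ("feature_in", 0x8000_0000, 0x6400_1000, 256),  # lands in the window
        ("result_out", 0x8000_1000, 0x8000_2000, 256),  # DDR to DDR
    ]
    print(dma_touching_l1d(transfers))
```

    Any transfer this flags is a candidate for moving its buffer out of L1D SRAM, which is what the erratum's workaround asks for.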

    - Keerthy

  • Hi Keerthy:

    Is there some exception-handling code in the MSMC SRAM? We observed that every time an EL3-level error is entered, the PC is in the address range starting at 0x70000000.

  • Hi,

    Can you share the logs for our reference?

    - Keerthy

  • Hello,

    Yes. 0x70000000 is the start of MSMC, and ATF (Arm Trusted Firmware) is loaded there. ATF contains the exception handling, which is probably why you see the PC at that address inside ATF when the crash occurs.

    - Keerthy
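
    When going through many crash records, a small helper can tag each recorded PC as inside or outside the MSMC SRAM window. This is a minimal sketch: the base address 0x70000000 comes from this thread, while the 8 MiB size and the function name are assumptions that should be checked against the device memory map.

```python
MSMC_SRAM_BASE = 0x7000_0000      # start of MSMC, as stated in this thread
MSMC_SRAM_SIZE = 8 * 1024 * 1024  # assumed size; verify against the memory map


def in_msmc_sram(pc, base=MSMC_SRAM_BASE, size=MSMC_SRAM_SIZE):
    """True if the address falls inside [base, base + size)."""
    return base <= pc < base + size


if __name__ == "__main__":
    for pc in (0x7000_1234, 0x8000_0000):
        where = "MSMC SRAM (ATF region)" if in_msmc_sram(pc) else "elsewhere"
        print(hex(pc), where)
```

    A PC inside this window at EL3 is consistent with ATF's exception handler running, so the more useful clue is usually the faulting address or exception return address recorded by the handler rather than the handler's own PC.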