
AM5K2E04: Keystone II SOC lockup

Part Number: AM5K2E04

Hello. I am experiencing occasional CPU lock-ups in the AM5K2E04, and I'm trying to work out why.

I've been reading the ARM errata... developer.arm.com/.../

I think that the most likely candidate is 814169: "A series of store or PLDW instructions hitting the L2 cache in shared state in an ACE system might cause a deadlock".

I have a few questions...

1) Is that plausible - is this SOC affected by this erratum? Or alternatively, are there any other known causes of CPU lockups?

2) If this erratum is relevant, is there any information about how the other elements of the SOC are connected to the ACE system? Are there caching masters in the system apart from the A15 CorePac? Which peripherals can put lines of the L2 cache into the required "shared state", and which can emit the relevant snoops?

3) Also, is there any mitigation for this in the chip's design or the Linux drivers? And if so, what is it?

4) Is it possible to disable the ACE interconnect? (If it fixes these hangs, I'd be prepared to do that even if it meant coherency had to be handled in software).

Any help for any of the above questions would be really useful. Many thanks!

  • Hello Tim,

    CPU lock-up could be due to multiple reasons. Can you share the logs seen when the lock-up occurs? Also, can you tell us what software SDK you are running on the CPU and who provided it?

    Please note that this is a pretty old SOC and we have not heard of any lockup issues from any other source.

    Thanks.

  • Hi. Thanks for your help. Yes indeed, it does seem that there could be multiple reasons for the CPU lock-up. This erratum seems like a possible candidate, and that's why we'd like to understand a bit more about how the elements within the SOC are connected to ACE, so that we could confirm or deny the possibility.

    Regarding your questions...

    We are writing our own software from scratch with the Arm GNU toolchain: developer.arm.com/.../GNU Toolchain

    Nothing is logged when the lock-ups occur, because there does not seem to be any prior warning that the CPU will lock. We have attempted to debug over JTAG, but JTAG cannot do anything with any of the CPU cores - presumably because of the lock-up. We can access the program counters, but they are not advancing - every core is stuck. The PCs indicate that the lock-up mostly happens around queue-management register accesses, but occasionally we see it around other register accesses - GIC or EDMA registers, for example.

    In all cases I have seen, this seems to happen with INTs disabled, but that may just be a coincidence, since these register accesses typically happen in the interrupt handlers.

  • Hi, thanks for the details. I will pass this on to our experts to provide their views. Please bear with us.

    Thanks.

  • Hello,

    A frozen Cortex-A as you describe is something which does happen during new system development. The underlying cause tends to be a memory transaction which is stuck and unable to complete. This can be triggered by your board if its power delivery networks have dips due to capacity issues or noise. A quick check is to boost your main rail voltages by about 10% and see if the issue heals or the rate changes. An alternative is to slow down the clocks of big current consumers like your 4x ARM.

    Another issue point can be a mis-programmed peripheral (say, too fast a clock, or a local reset issued without first gracefully de-configuring active interfaces); this can result in a transaction not being properly completed, and the ARM might hang waiting for it.

    The debugger can help find some of these. It can get PCs for the A15 cores - it seems you are already doing this - and if you use ETM you can get quite a bit of history into a hang. One other thing to do, at A15 stall time, is to attach with the debugger to the DAP (a hidden target in CCS; other JTAG debuggers have other methods) and probe around your slave address spaces. If you find that a slave which should be alive has collapsed (all accesses return bus errors), its collapse might just be the reason the ARM is stuck. Code audits around that slave then make sense.

    If you have written your own MMU code, that is an area which should be checked: a poorly created pagetable can cause a fetch into some incorrect region, which could also trigger a hang. Yes, errata can also result in hangs, but the other sources listed above tend to be much more likely.

    You might try running some kind of commodity mature code (like a TI reference Linux image) to see if it hangs on your board. This might help sort out whether it is some kind of board PDN hardware issue or whether it comes from the new code you are developing.

    Regards,
    Richard W.
  • Thanks Richard. This is very helpful. We'll look into the power and clocking of the SOC and the RAM in the first instance and see if that helps.

    One final question - do any of the other components within the SOC act as caching masters within ACE, or is the ARM CorePac the only caching master?

    Many thanks.

  • Hello Tim,

    The ARM complex (ARM cores + SCU + L2) provides full data coherency inside the cluster and sends coherency messages as necessary to the MSMC2. The MSMC2 interconnect provides coherency services for transactions tagged with the shareable attribute. This allows IO coherency with other masters in the system. For example, an EDMA (which has no cache of its own) can trigger, thanks to the MSMC2, coherency traffic which can force the ARM to write out dirty data, or cause the ARM to refetch data if a newer copy is presented to the MSMC2.

    The MSMC2 has tracking resources for the DDR address space and for the shared SRAMs contained within it. Transactions from the ARM marked as shared participate; those which are non-shared bypass it. Other masters set the shared attribute by programming MPAX regions: the C6x DSPs have embedded MPAX arrays, and external masters in the SOC fabric get sharing added or not based on per-interconnect-port MPAX arrays.

    This level of IO coherency removes the need for the ARM to do data cache operations for shared data, and lowers the amount another core may need to do. A DSP which has a local cache still has to do some cache operations, but fewer, as it also gets a lift from IO coherence. If sharing is not set then full SW maintenance is needed.
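
    To make that last point concrete, here is a minimal sketch of the kind of manual cache maintenance the A15 side needs around an EDMA transfer when a buffer is NOT marked shareable (so MSMC IO coherency is not in play). The cache line size and the choice of maintenance-to-PoC operations are assumptions to check against your own configuration; the EDMA programming itself is not shown.

        #include <stdint.h>
        #include <stddef.h>

        #define CACHE_LINE 64u                      /* assumed A15 D-cache line size */

        static inline void dsb(void) { __asm__ volatile("dsb" ::: "memory"); }

        /* DCCMVAC: clean (write back) one line by MVA to the point of coherency */
        static inline void dccmvac(uintptr_t mva)
        {
            __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(mva) : "memory");
        }

        /* DCIMVAC: invalidate one line by MVA to the point of coherency */
        static inline void dcimvac(uintptr_t mva)
        {
            __asm__ volatile("mcr p15, 0, %0, c7, c6, 1" :: "r"(mva) : "memory");
        }

        /* Before the EDMA reads 'buf' from memory: push any dirty CPU data out. */
        void cache_clean_range(const void *buf, size_t len)
        {
            uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
            uintptr_t end = (uintptr_t)buf + len;
            for (; p < end; p += CACHE_LINE)
                dccmvac(p);
            dsb();
        }

        /* Before the CPU reads data the EDMA wrote: drop any stale CPU copies.
         * The buffer should be line-aligned and line-sized so the invalidate
         * does not throw away neighbouring data. */
        void cache_invalidate_range(const void *buf, size_t len)
        {
            uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(CACHE_LINE - 1);
            uintptr_t end = (uintptr_t)buf + len;
            for (; p < end; p += CACHE_LINE)
                dcimvac(p);
            dsb();
        }

    With the shareable attribute set and MSMC IO coherency doing its job, none of this is needed on the ARM side, which is the point above.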

    Regards,
    Richard W.
  • Hello Richard,

    Thanks for the clarification. Our SOC doesn't have DSPs and we're not using HyperLink, so we have no external masters in the SOC fabric. Since the MSMC provides coherency to other peripherals like EDMA and the network controller, I think the ARM CorePac is the only caching master in our case.

    Kind regards,
    Tim

  • Hello Richard. Thanks again for your help with this.

    We have some further information. Our board connects the PCIe bus to a socket for an optional NVMe hard drive. We have discovered that:

    1) Over-clocking the CPU to 1.6GHz significantly increases the rate of freezes.

    2) Under-clocking the CPU does not seem to stop freezes occurring. (Freezes are rare and sporadic so it's difficult to know if their prevalence has changed with under-clocking).

    3) Fitting the NVMe drive drastically reduces the rate of freezes, even when over-clocked to 1.6GHz (possibly this eliminates freezes totally but it's too early to be sure).

    4) Fitting the NVMe drive also increases noise on our 0.85V supply - I mention this for completeness, although possibly it is unrelated since we'd expect noise to make things worse not better.

    5) If we don't power up the PCIe sub-system at all in software, then we drastically reduce the rate of freezes, but they do still occur very rarely with no NVMe drive fitted.

    Do you have any suggestions why fitting an NVMe drive or not enabling the PCIe would make our system more stable?

    Many thanks.

  • Hello Tim,

    On TI 'cousin' SOCs to the AM5K I've seen hangs like you describe when the PCIe IP is clocked but no device is present. When PCIe is not clocked, an access to its address space (deliberate or errant) should return an abort - a null response in CBA terms. When the address space is clocked but the module is not configured properly, the access can pass into the subsystem and hang forever in some conditions. On some SOCs I recall being told that a loopback at the pads needs to be configured so that an access can complete (as a failure) in a way which does not hang when no module is installed/fitted.

    One general rule of thumb is that if system software is not prepared to install a driver for an IP, then that module should NOT be clocked and brought out of reset. Once a module is clocked and out of reset, there can be a stateful, module-specific way to safely park the IP. Often I see a bootloader thinking it is doing a service by setting up a bunch of PLLs and blindly turning on all the module clocks, and really that is often a disservice, as things like DSPs might try to boot from random memory addresses. Sometimes that results in a hang on boot if the address they fetch from happens to point into some bad memory region. This also extends to bootloaders lazily writing 0xFFFFFFFF (enable everything) to control registers where some of the bit fields are reserved. Sometimes reserved bits are unconnected logic, and writing something there can result in an unclearable status or some other bad effect.
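
    As an illustration of that last point, here is a sketch of the safer pattern; the register address and field below are hypothetical placeholders, not real AM5K2E04 registers.

        #include <stdint.h>

        #define CTRL_REG     (*(volatile uint32_t *)0x02000000u) /* hypothetical control register      */
        #define CTRL_EN_MASK 0x00000007u                         /* only the documented enable field   */
        #define CTRL_EN_ON   0x00000001u                         /* documented 'enable' code            */

        static void enable_block(void)
        {
            /* Bad: CTRL_REG = 0xFFFFFFFFu; - also sets every reserved bit.           */
            /* Better: masked read-modify-write; reserved bits are written back as read. */
            uint32_t v = CTRL_REG;
            v = (v & ~CTRL_EN_MASK) | CTRL_EN_ON;
            CTRL_REG = v;
        }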

    In the past I have tracked down similar hangs to the PCIe range when the A15 MMU tables were not well formed. IO ranges should have the NX bit set so speculative fetches from the Cortex-A cannot happen to them. Also, when dynamically changing an MMU entry, the 'break before make' rule needs to be followed so as not to open a speculative access window. I recall catching such an issue on QNX plus an automotive application a long time back, where a SW driver had mapped the PCI region without setting NX. Its hang was sparse (about one in two days). For that chip I could attach via JTAG at hang time and see, based on bus status, that the hanging segment was PCIe. I was able to use a bus monitor to capture transactions in a loop, and when it hung a day later the issue was root-caused to an i-cache speculative fetch into the PCIe region (the transaction was caught in the buffer). In that case the i-cache burst access was illegal for PCI (but it is legal for the ARM to guess wrongly into non-NX memory), and that caused the target space to collapse. Simply setting the MMU to map the region with the IO + NX attributes removed the hang.
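
    For reference, here is a minimal break-before-make sketch for one ARMv7-A short-descriptor section entry, mapping an IO window as Device memory with the execute-never bit set (the 'NX' bit above; XN in ARM terms). It assumes a flat mapping, cacheable page-table walks, domain 0 with client access, 1 MiB section mappings and a single core - on SMP the inner-shareable TLB operations should be used instead - and none of the constants have been checked against your actual MMU setup.

        #include <stdint.h>

        #define SECT_DESC   0x2u          /* bits[1:0] = 0b10: section descriptor       */
        #define SECT_XN     (1u << 4)     /* execute never                              */
        #define SECT_B      (1u << 2)     /* TEX=0, C=0, B=1: Shareable Device memory   */
        #define SECT_AP_RW  (0x3u << 10)  /* AP[1:0] = 0b11: read/write access          */

        static inline void dsb(void) { __asm__ volatile("dsb" ::: "memory"); }
        static inline void isb(void) { __asm__ volatile("isb" ::: "memory"); }

        /* TLBIMVA: invalidate unified TLB entry by MVA (single core;
         * use the inner-shareable form on SMP). */
        static inline void tlbimva(uint32_t mva)
        {
            __asm__ volatile("mcr p15, 0, %0, c8, c7, 1" :: "r"(mva) : "memory");
        }

        void map_io_section(uint32_t *l1_table, uint32_t va, uint32_t pa)
        {
            uint32_t idx = va >> 20;              /* one level-1 entry per 1 MiB         */

            l1_table[idx] = 0;                    /* 1. break: install a faulting entry  */
            dsb();
            tlbimva(va);                          /* 2. drop any cached translation      */
            dsb();
            isb();

            /* 3. make: Device, read/write, execute-never */
            l1_table[idx] = (pa & 0xFFF00000u) | SECT_AP_RW | SECT_XN
                          | SECT_B | SECT_DESC;
            dsb();
            isb();
        }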

    Finally, I have seen hangs with PCIe targets installed. In those cases the target was supplying a clock to the system, and sometimes that external target would crash in a way in which the clock it supplied stopped. If the clock stopped while the PCI device was mastering a transaction to the DDR, the system hung. If the PCI clock was somehow restarted, the hang resolved.

    I would suggest validating at run time that the live MMU tables are well formed. I usually use the TRACE32 MMU decoder for this, as all the possible formatting cases are non-trivial to decode by hand. If NX is not set for that range, fix it. MMU.info <address> is a fast way to tell, but mmu.list.pageable may also show it. If you dynamically build tables, more effort is needed to verify. I would also suggest setting up some kind of transaction monitoring for the PCIe range from the CPU. If your code uses VIRT=PHYS, then a range watchpoint (two debug comparators) might catch it if it comes from normal code. Speculative accesses will bypass the comparators, but a CP tracer might be able to catch them. I've used cptracer2 and OCP watchpoints on cousin SOCs to check the PCI space; I'm not sure if that is an option on the AM5K.
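
    If TRACE32 is not to hand, a rough software self-check is also possible. The sketch below walks the live level-1 table and counts any section covering an assumed PCIe window whose execute-never bit is clear. It assumes TTBCR.N = 0, short-descriptor format, section mappings only and a flat mapping so the table can be read directly; the window addresses are placeholders to replace with the real ones from the data sheet.

        #include <stdint.h>

        #define PCIE_VA_START 0x50000000u   /* placeholder window start - check the data sheet */
        #define PCIE_VA_END   0x60000000u   /* placeholder window end                          */

        static inline uint32_t read_ttbr0(void)
        {
            uint32_t v;
            __asm__ volatile("mrc p15, 0, %0, c2, c0, 0" : "=r"(v));
            return v;
        }

        /* Returns the number of 1 MiB sections in the window that are NOT
         * execute-never (or are not plain sections at all). 0 means the window
         * looks safe from speculative instruction fetches. */
        unsigned check_pcie_window_xn(void)
        {
            const uint32_t *l1 = (const uint32_t *)(read_ttbr0() & 0xFFFFC000u);
            unsigned bad = 0;

            for (uint32_t va = PCIE_VA_START; va < PCIE_VA_END; va += 0x100000u) {
                uint32_t desc = l1[va >> 20];
                if ((desc & 0x3u) != 0x2u)         /* not a section entry: review by hand */
                    bad++;
                else if ((desc & (1u << 4)) == 0)  /* section XN bit clear                */
                    bad++;
            }
            return bad;
        }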

    Regards,
    Richard W.
  • Thanks Richard, that's very helpful. Regarding not clocking PCIe when there is no device, the datasheet says that there are multiple clocks, so I guess we need to make sure none of them is running. We haven't enabled PCIe in the LPSC and we haven't set up the PCIe SerDes - would that be sufficient to ensure it isn't causing problems? Or should we explicitly do something like resetting the PCIe SerDes lanes and/or disabling the SerDes PLL? Anything else?

    Not enabling it in the LPSC and not setting up the SerDes seems to have helped in some cases, but we've still seen a few hangs. So either this isn't the correct way to disable PCIe, or we have another, separate problem.

  • Hello Tim,

    IP bring-up typically has only a few proper init sequencing steps. Doing the steps in an arbitrary order at best results in an isolated failure and at worst one with broader system issues. The TRM (and hopefully example code) typically documents what we test and what should work; we do not look for and document all the negative combinations. As such, I'd say if you are not using a module, leave all its resources in the safe start-up parked state. If you are trying to do minimal surgery on the existing code, then target the clocks which gate the front end of the traffic, so that the interconnect stops inbound traffic.
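
    As a concrete (and heavily hedged) example of parking an unused module, here is the shape of a KeyStone-style LPSC sequence that requests the Disabled state for a module. The PSC base address, register offsets and the PCIe power-domain/module numbers are assumptions that must be checked against the AM5K2E04 TRM, and the real PCIe shutdown may also involve the SerDes configuration you mentioned.

        #include <stdint.h>

        #define PSC_BASE       0x02350000u                      /* assumed PSC base address   */
        #define PSC_PTCMD      (*(volatile uint32_t *)(PSC_BASE + 0x120u))
        #define PSC_PTSTAT     (*(volatile uint32_t *)(PSC_BASE + 0x128u))
        #define PSC_MDSTAT(n)  (*(volatile uint32_t *)(PSC_BASE + 0x800u + 4u * (n)))
        #define PSC_MDCTL(n)   (*(volatile uint32_t *)(PSC_BASE + 0xA00u + 4u * (n)))

        #define MD_NEXT_MASK   0x1Fu   /* MDCTL next-state field                             */
        #define MD_STATE_MASK  0x3Fu   /* MDSTAT current-state field                         */
        #define MD_DISABLE     0x02u   /* 'Disable' state; 0x00 (SwRstDisable) also holds
                                          the module in reset                                */

        /* Request the Disabled state for module 'md' in power domain 'pd'.
         * The pd/md numbers for PCIe must come from the TRM. */
        static void psc_module_disable(unsigned pd, unsigned md)
        {
            while (PSC_PTSTAT & (1u << pd))      /* wait out any prior transition      */
                ;
            PSC_MDCTL(md) = (PSC_MDCTL(md) & ~MD_NEXT_MASK) | MD_DISABLE;
            PSC_PTCMD = 1u << pd;                /* kick the power-domain transition   */
            while (PSC_PTSTAT & (1u << pd))      /* wait for it to complete            */
                ;
            while ((PSC_MDSTAT(md) & MD_STATE_MASK) != MD_DISABLE)
                ;                                /* confirm the module is disabled     */
        }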

    Having multiple long-run endurance issues in a non-trivial product is common in my sampling. Usually a spike of issues of various types happens with new code or conditions; after all the common errors are worked out, a few hard-to-find long-run issues remain. I do not know of any non-trivial and successful (high-volume, actively used in the market) system which didn't have endurance issues during ramp. For system hangs, it is usually possible to attach to the JTAG TAP controller and probe slave addresses you expect to be alive (using the debug subsystem's DAP port initiator: the AHB-AP for KS2). Using these probes you can sometimes map out the extent of a hang and use that information to craft experiments and reviews to fix or work around the issue.

    Regards,
    Richard W.