Hello there,
18 months later(!) I thought I'd post an update because we appear to have solved our problem. Perhaps this information will be useful to someone else.
The root cause of our SOC lockup turned out to be a bug buried in our MMU code. Occasionally we were loading an incorrect table entry where there should have been an empty (page fault) entry.
This could cause the CPU cores to lock up, but (until we discovered the cause) it was very hard for us to reproduce under test conditions. It only showed up rarely, in certain real-life usage patterns, when we were updating our memory map in particular ways.
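To illustrate what I mean by an "empty" entry, here is a rough sketch (generic ARMv8-A long-descriptor format with a 4KB granule, not our actual table code): whether a 64-bit descriptor faults or maps memory comes down to its low bits, so a single wrong word written where a fault entry belongs is enough to create a live mapping.

/* Rough sketch only: ARMv8-A long-descriptor format, 4KB granule.
 * Bit 0 of a 64-bit entry decides whether it faults or maps memory,
 * and bit 1 (at levels 0-2) decides block vs. next-level table, so one
 * wrong word where a fault entry belongs silently creates e.g. a 2MB block.
 */
#include <stdbool.h>
#include <stdint.h>

#define DESC_VALID  (1ULL << 0)   /* 0 = fault ("empty") entry, 1 = valid        */
#define DESC_TABLE  (1ULL << 1)   /* levels 0-2: 0 = block, 1 = next-level table */

static inline bool desc_is_fault(uint64_t desc)
{
    return (desc & DESC_VALID) == 0;    /* any access through this entry faults */
}

static inline bool desc_is_block(uint64_t desc)
{
    /* valid but not a table pointer, e.g. a 2MB block at level 2 */
    return (desc & (DESC_VALID | DESC_TABLE)) == DESC_VALID;
}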
A few interesting things to note...
The broken table entry covered the physical addresses of the MMU tables themselves. I have no evidence that we ever corrupted the MMU tables, but I can't be sure because after the cores locked up, I could only examine the RAM and had no way of knowing what was in the CPU caches.
A potential result of this incorrect table entry was that the same virtual address might end up referenced by two TLB entries (as a non-global 2MB region, and as 4K pages with a specific ASID). I did lots of testing and I am convinced that our code never explicitly accessed the affected virtual addresses (or any nearby addresses) while the page table was incorrect. So if we did end up with multiple TLB entries, I can only assume it was due to some kind of speculative pre-load into the TLB rather than an explicit memory access.
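For anyone who runs into something similar: the ARM-documented way to avoid this hazard when changing a live table entry is the "break-before-make" sequence. The sketch below is not our production code, just a minimal illustration (assuming AArch64 at EL1, Inner Shareable domain, GCC/Clang inline assembly) of the ordering that stops the walker from ever holding the old and new translations at the same time.

/* Minimal "break-before-make" sketch: AArch64, EL1, Inner Shareable.
 * Not our real code -- just the ordering that matters when a live,
 * possibly speculatively-walked entry is changed.
 */
#include <stdint.h>

static inline void dsb_ish(void) { __asm__ volatile("dsb ish" ::: "memory"); }
static inline void isb(void)     { __asm__ volatile("isb"     ::: "memory"); }

/* Invalidate cached translations for one VA, all ASIDs, Inner Shareable.
 * (If the old mapping was a set of 4K pages, each page's VA needs this,
 * or use an ASID-wide / all-entries invalidation instead.)
 */
static inline void tlbi_vaae1is(uint64_t va)
{
    __asm__ volatile("tlbi vaae1is, %0" :: "r"(va >> 12) : "memory");
}

void bbm_update_entry(volatile uint64_t *entry, uint64_t new_desc, uint64_t va)
{
    *entry = 0;         /* 1. "break": install a fault (empty) entry first */
    dsb_ish();          /*    make that write visible to the table walker  */
    tlbi_vaae1is(va);   /* 2. drop any TLB entry covering this address     */
    dsb_ish();          /*    wait for the invalidation to complete        */
    isb();
    *entry = new_desc;  /* 3. "make": only now install the new descriptor  */
    dsb_ish();
}

Skipping the intermediate fault entry, the barriers, or the TLB invalidation is exactly the kind of thing that lets a speculative walk cache conflicting entries even though no explicit access ever touches the address.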
By deliberately and repeatedly rewriting the table entry as wrong and then correct, I was able to trigger lockups as well as other strange symptoms such as data access faults, including synchronous and asynchronous bus decode errors and access permission faults.
We initially thought that the problem was related to the PCIe bus, because when the SOC locked up, the PCIe registers were often inaccessible over JTAG, and disabling PCIe made our system drastically more stable. However, fixing our MMU code removed all the problems we were having with PCIe, so I now think that this was a symptom rather than a cause. It seems likely that our faulty MMU table could somehow cause spurious bus accesses, including to the PCIe registers.
For the sake of completeness, I'll mention that whilst carrying out these tests, we also discovered we could produce very similar symptoms by another means: locking up the CPU by reconfiguring the 10Gb Ethernet hardware after soft reboots. This is documented in the SOC silicon errata and seems unrelated to our MMU problem.
Finally, I'd like to say a big thank you to Richard Woodruff for all his help and insights, without which we might never have worked it out.
Best wishes,
Tim