
AM5K2E04: Keystone II SOC lockup

Part Number: AM5K2E04

Hi Richard,

We've done some further investigation of our problem. We believe that the PCIe issue is now fixed, but there is still [at least] one other lock-up issue that we're finding very difficult to hit in testing.

So far, we've been able to investigate six of these CPU lock-ups over JTAG, with varying symptoms.

----
(1) In this case, many of the peripheral memory mapped registers were not accessible.

Not accessible: I2C, UART, SPI, any SERDES, PLL controllers, NET CoProcessor, Power sleep controller, GPIO, Semaphore system, Queuing system, DDR PHY

However we could access some systems: CIC0, CIC2, EDMA0, MSMC, SDRAM (these all seem to be on TeraNet3_C).

We then tried accessing the DDR RAM. Some accesses were fine, but the first time we tried to access a CPU3-specific memory region (our first memory access that might have been cached on CPU3) the debug system hung and we couldn't debug any further. We could no longer connect to the JTAG TAP.

----
(2), (3) On these lock-ups, we had more success with the peripheral memory mapped registers, until we tried to access the USB PHY; then the debug system hung again and we couldn't debug any further. Before that, we'd already tried a few DDR RAM accesses and several peripherals, and everything we had tried was OK - NET CoProcessor, all the SerDes, PLL controller, Power sleep controller, I2C, UART, INT controller, CIC, GPIO, BOOTCFG.

We haven’t even enabled the USB subsystem in the code, so it's surprising that it hung the debug system.

----
(4), (5), (6) For the other 3 lock-ups, we seem to be able to access all the DDR RAM and all the memory mapped registers of peripherals that are enabled. Accessing the USB PHY returns an error (we'd expect that because it's not enabled) but it doesn't hang the debug system as in (2) and (3).

Nothing appears to be obviously wrong over JTAG, except that all of the CPU cores are stuck. The symptoms don't point to any particular sub-system, as they are all in the state we'd expect (accessible if enabled, errors if not enabled). We currently have one machine in this state. Is there anything else you think we could try to work out why the CPU cores locked up?


Many thanks for any suggestions you might have,
Tim

  • Hello Tim,

    You have described some observations from several units at the time of the issue. It would also be good to capture snapshots from several 'good-reference' units and compare good-a vs. good-b, then good-a vs. issue-a, and so on. I would suggest trying to capture details from good units at approximately the points where the issue units fail. Differential analysis of registers and address-space accessibility can provide valuable clues.
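
    As a rough illustration of that differential analysis, a minimal sketch (not a TI tool) of the comparison is below. It assumes each snapshot has been saved as a plain text file of hex "ADDRESS VALUE" pairs, one register per line, captured in the same order on both units; the file format and names are assumptions for illustration only.

    /*
     * Minimal sketch: compare two register snapshots (e.g. good-a vs. issue-a)
     * and print every address whose value differs between the two captures.
     * Assumed input: plain text files of "ADDRESS VALUE" hex pairs, one
     * register per line, listed in the same order in both files.
     */
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s good.txt issue.txt\n", argv[0]);
            return 1;
        }

        FILE *good  = fopen(argv[1], "r");
        FILE *issue = fopen(argv[2], "r");
        if (!good || !issue) {
            perror("fopen");
            return 1;
        }

        unsigned long ga, gv, ia, iv;
        /* Walk both dumps line by line; they must list the same registers. */
        while (fscanf(good,  "%lx %lx", &ga, &gv) == 2 &&
               fscanf(issue, "%lx %lx", &ia, &iv) == 2) {
            if (ga != ia) {
                fprintf(stderr, "snapshots out of step at 0x%08lx / 0x%08lx\n", ga, ia);
                break;
            }
            if (gv != iv)
                printf("0x%08lx: good=0x%08lx issue=0x%08lx\n", ga, gv, iv);
        }

        fclose(good);
        fclose(issue);
        return 0;
    }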

    Next, some information needs to be collected about the scope and frequency of the issue. How many units are under test, how many fail, and at what frequency (1 min, 1 hour, 1 day, ...)? Are there any units which do NOT fail? What is different about them? What commonalities exist? Do they freeze at some similar point in the application data flow?

    Regards,

    Richard W.
  • Hi Richard,

    I've attached here what information we have so far, comparing good units against the 6 lock-ups...

    Lockups.rtf

    How many units are under test

    Since we fixed the PCIe issue, we haven't been able to reproduce the failures under test conditions. The only failures we see are when the devices are deployed in the real world, which makes it hard to debug.

    There are, so far, about 23 real-world units in heavy use (they are being used to route network traffic). Of those, 7 are routing customer internet traffic, and these seem susceptible to the lock-up; other network-usage patterns haven't caused any lock-ups since the PCIe fix. Of those 7, four seem to lock up regularly - across them all we typically see a couple of lock-ups every 8-9 days. Two of the 7 are newly deployed but have been in use for two weeks without a lock-up. The final unit was up for 104 days before it locked up.

    The 3 units that rarely lock up, or haven't yet, are running earlier software than the other 4 (built with an older ARM toolchain), so we're trying to isolate whether a software change might have caused this. Because our time between failures is many days, this is proving difficult.

    Do they freeze at some similar point in the application data flow?

    They don't always freeze at the same point. However, there seems to be a subset of places where the freezes occur - always at or very near a register or RAM access. Often we are accessing the queuing system or reading from one of our own network buffers (our network buffers are memory which can be written by the 1G or 10G network systems and can pass between CPU cores). We're not sure why certain accesses seem to be problematic when there are plenty of other memory accesses where we never see lock-ups.

    Also, we don't know which of the 4 CPU cores' accesses originated the freeze (or whether it was a combination of them), or whether a peripheral froze the memory system and the CPUs then just got stuck the next time they made a particular type of access.

    One thing worth saying is that even the exact same software version will not lock up at the same place each time. However, the failure mode, in terms of which memory and registers are/aren't accessible, has so far been consistent per software revision - although we don't have much data, so that might be coincidence.

    As I mentioned, we currently have one machine locked up and attached to JTAG, although it is due to be returned to service very soon. If you can think of anything else that we could probe on that machine specifically then let me know. Otherwise we'll probably need to wait several days before we get another chance.

    Many thanks,
    Tim

  • Hello Tim,

    The Lockups.rtf and your comments improve the context a lot. It looks like lockup 1 may be different, or maybe the emulator's JTAG probing triggered a general address-space collapse. If I remove all the paths which can't be touched in the good case, it largely looks like all the slaves are alive. I would suggest using the DAP to read the PCs of the stalled cores to understand what they were last doing when they froze up (DBGPCSR for the ARM cores; most cores provide a path). I would also run a PC sampler over the JTAG DAP to log core activity over a long time. At hang time this can provide some useful clues about the current hang and a rough history leading into it.

    You mention the CPUs in this case are stuck. Does this mean you can connect to them or are they unhaltable (waiting for a pipeline clear to enter debug mode)? Even if a core can't be halted, it is possible to read its DBGPCSR over the APB or DAP. In a debugger like Lauterbach TRACE32 you just give the qualifier APB:0x<address> or DAP:0x<address>, whereas with CCS GEL you use *<addr>@DATA or *<addr>@System_View. I like to use the TRACE32 'snooper', as it logs each core's PC over the DAP to a file and can even survive a power cycle, which allows nearly unlimited logging. You would need to code something up for CCS to get the same, and its sample rate will be roughly 100x slower.
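
    If you do code something up, a rough host-side sketch of that kind of PC sampler is below. The dap_read32() call is a hypothetical stand-in for whatever memory-read call your probe or scripting API provides, and the DBGPCSR addresses are placeholders - the real debug-APB addresses for each A15 have to come from the AM5K2E04 debug memory map / System_View.

    /*
     * Rough sketch of a PC sampler: periodically read each core's DBGPCSR over
     * the DAP and append the samples to a log file for later analysis.
     * dap_read32() and the DBGPCSR addresses are placeholders (see note above).
     */
    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>
    #include <unistd.h>

    #define NUM_CORES 4

    /* Placeholder values - use the real debug-APB addresses of each DBGPCSR. */
    static const uint32_t DBGPCSR_ADDR[NUM_CORES] = { 0x0, 0x0, 0x0, 0x0 };

    /* Hypothetical probe call: read a 32-bit word over the DAP/APB. */
    extern uint32_t dap_read32(uint32_t address);

    int main(void)
    {
        FILE *log = fopen("pc_samples.log", "a");  /* append so logs survive restarts */
        if (!log)
            return 1;

        for (;;) {
            fprintf(log, "%ld", (long)time(NULL));
            for (int c = 0; c < NUM_CORES; c++)
                fprintf(log, " core%d=0x%08x", c, dap_read32(DBGPCSR_ADDR[c]));
            fputc('\n', log);
            fflush(log);           /* keep the log usable even if the host dies */
            usleep(10 * 1000);     /* ~100 samples/s; tune to what the probe sustains */
        }
    }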

    In general, now that the PCIe issue is removed, it's good to try some big-knob tests to shake out any timing or electrical issues in the PCB. Boosting the logic VDD rails by around +50 mV is a good test; reducing the CPU PLL frequency by around 10% would also be good. If the failure rate shifts or heals, then you dig into that angle.

    Regards,
    Richard W.
  • Thanks Richard. We'll try to look into those things.

    You mention the CPUs in this case are stuck.  Does this mean you can connect to them or are they unhaltable

    All 4 CPU cores are unhaltable, but we have read the program counters so we know where they are locked up. Every time it locks up the program counters are different, although there does seem to be some commonality in the sorts of places where it freezes. We often see it in the queuing system (getting or queuing network descriptors) or in accesses to our own network buffers (memory which might often require snoops).

    Kind regards,

    Tim

  • Hello Tim,

    Chances are good the cores are stuck waiting for a memory transaction to complete. If you can read/write the base of a memory (assume DDR, but also the SRAMs) then that slave is likely not fully collapsed (which, relatively speaking, is a more common failure). You should re-verify at hang time that you can read/write the memories the A15s might touch. If the gross memory works at its base, then there might be a small sub-range which is somehow deadlocked between users. If you do a pattern write from the base of memory to the top from the debugger, you might find a range which does not respond. That range would be the one to zero in on, looking at how it is accessed from each core. Something like: another master stalls while accessing it, and a core then tries to make an access of its own and is stuck waiting.
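
    A rough sketch of that pattern scan is below. The jtag_write32()/jtag_read32() calls are hypothetical stand-ins for whatever debugger-side memory access you have (a GEL *<addr>@DATA access, a TRACE32 Data.Set, or a probe API), and the DDR range and step size are placeholders for your actual memory map. It overwrites memory, so only run it on a unit that is already written off as locked.

    /*
     * Sketch of a base-to-top pattern scan: write an address-derived pattern
     * through the range from the debugger and read it back, flagging any
     * location that miscompares. If the debugger itself stalls, the last
     * printed address marks the sub-range to zero in on.
     * jtag_write32()/jtag_read32() and the DDR range are placeholders.
     */
    #include <stdint.h>
    #include <stdio.h>

    extern void     jtag_write32(uint64_t addr, uint32_t value);  /* hypothetical */
    extern uint32_t jtag_read32(uint64_t addr);                   /* hypothetical */

    #define DDR_BASE  0x80000000ULL   /* placeholder: start of DDR window  */
    #define DDR_TOP   0xFFFFFFFFULL   /* placeholder: end of range to scan */
    #define STEP      0x00100000ULL   /* probe one word per 1 MB to start  */

    int main(void)
    {
        for (uint64_t addr = DDR_BASE; addr < DDR_TOP; addr += STEP) {
            uint32_t pattern = (uint32_t)(addr ^ 0xA5A5A5A5u);
            jtag_write32(addr, pattern);
            uint32_t readback = jtag_read32(addr);
            if (readback != pattern)
                printf("0x%010llx: wrote 0x%08x, read 0x%08x\n",
                       (unsigned long long)addr, pattern, readback);
        }
        return 0;
    }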

    Regards,
    Richard W.