
AM5K2E02: ARM Core Stall

Part Number: AM5K2E02
We are seeing a mysterious lockup issue with the AM5K2E02. The processor occasionally hangs in the same section of code. When it hangs, we are not able to access any core registers using the debugger, but we can access other peripheral registers and DDR memory. Reviewing the errata, we found an ARM-specific erratum that we had not yet fully dispositioned; it is copied below for reference. After implementing the workaround that uses the hazard timeout detection, the frequency of lockups was reduced significantly, but we continue to see core stalls in the same area of the code. The difference now, with the hazard timeout detection enabled, is that the core starts executing again after an external interrupt causes the program counter to jump to an interrupt handler. By that time, however, the core has stalled for too long, so we get cycle-overrun problems.
We believe the sequence of events is as follows:
1) A function is executed which causes the core to stall.
2) After 1024 clock cycles the hazard detect timeout reissues the instruction to the L2.
3) The core is either released from the deadlock and immediately deadlocks again, or the reissued instruction does not release it at all.
4) An external interrupt causes the PC to jump to an interrupt handler; however, the core is still stalled.
5) The hazard detect timeout triggers again, and the core now starts executing the code in the interrupt handler. In this state we are able to attach to the core and debug normally.
Prior to enabling the hazard detect timeout, the core appeared to hang at step 3 above. We would eventually see the PC change to the interrupt handler address after the external interrupt, but the core would remain hung and we could not attach to it with the debugger.
The function that is being executed when the core appears to stall is fairly standard code, nothing out of the ordinary, although quite a few pointers are dereferenced, which may be causing L2 congestion at the time. The last instruction we see in the instruction trace buffer before the jump to the interrupt handler is a "beq". It should also be noted that this function is executed 100 times per second, and it is only after about 8 hours of running that the core hangs long enough to overrun the external interrupt. The SoC has two ARM cores, but code is executing on only one of them; the second core is in a disabled state.
How might we determine what is causing the core to deadlock?
798870: A memory read can stall indefinitely in the L2 cache
Category B
Products Affected: Cortex-A15 MPCore - NEON.
Present in: r2p0, r2p1, r2p2, r2p3, r2p4
Description
If back-to-back speculative cache line fills (fill A and fill B) are issued from the L1 data cache of a CPU to the L2 cache, the second request (fill B) is then cancelled, and the second request would have detected a hazard against a recent write or eviction (write B) to the same cache line as fill B, then the L2 logic might deadlock.
Configurations affected:
This erratum does not affect the processor if the revision and variant reported by the MIDR is r2p0, r2p1, r2p2, r2p3 or r2p4 and REVIDR[2] is set to 1.
To be affected by this erratum, the “L2 arbitration register slice” configuration option must be included.
Conditions
1) The L2 memory system is congested, typically due to requests from other cores, or slow memory responses
2) Four data read requests are issued from one CPU (CPUA) to the L2 and added to the Load Request Queue (LRQ)
3) The four read requests are issued from the LRQ to the L2 pipe, but all stall due to the congestion
4) An eviction or write to line B (write B) is issued from CPUA to the L2 and is added to the Write Request Queue (WRQ)
5) Two speculative cache line fills (fill A and fill B) are issued from CPUA in sequential cycles
6) Fill B is cancelled and the actual fill is not issued
7) Fill B is to the same cache line as write B
8) Write B completes
9) A read request (read C) is issued from CPUA
Implications
If the erratum conditions are met, a false hazard is created between the final read request (read C) and write B (which has completed and deallocated). Read C cannot be issued from the LRQ until a deallocation event is seen for write B. However, as write B has already deallocated, this never occurs. Read C therefore deadlocks. Other cores are likely to stall at the next DSB instruction. Interrupts will not break the deadlock.
Workaround
When Bit[7] of the L2 Auxiliary Control Register (L2ACTLR[7]: Enable hazard detect timeout) is set, any memory transaction in the L2 that has been stalled for 1024 cycles is reissued to verify that its hazard condition still exists. In the erratum case, if the final read (read C) reissues, it determines that no hazard exists and completes normally.
However, setting this bit is not a complete workaround because the L2 may go into idle state if there is no activity for 256 cycles. If any core is active and doing any memory transactions (including cacheable loads hitting the L1 data cache), or if there are any ACP requests or ACE coherence requests, the L2 will stay active. If the L2 is idle, however, the reissue of Read C will be blocked.
For a single core A15 without ACE or ACP traffic there is no complete workaround. Setting L2ACTLR[7] will help, but will not completely remove the erratum. However it is not expected that any single core configuration will include the “L2 arbitration register slice” configuration option.
For an A15 with two or more cores active, or with ACE or ACP traffic, setting L2ACTLR[7] to 1’b1 is a good workaround. The L2 may go idle, but will be periodically active and allow the request to reissue. Typical worst case delay would be the next timer interrupt. If all but one of the A15 cores are powered down (such that there will be no further L2 traffic or interrupts to other cores), a watchdog timer outside of A15 should be used to wake up an additional core if the single active core becomes unresponsive.
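
Applying the workaround amounts to a read-modify-write of L2ACTLR early in boot. A minimal sketch only, assuming execution in secure PL1 before the L2 is in use (the CP15 encoding MCR p15, 1, Rt, c15, c0, 0 for L2ACTLR is from the Cortex-A15 TRM; the function name is ours):

    #include <stdint.h>

    /* Set L2ACTLR[7] (Enable hazard detect timeout) early in boot.
     * Sketch only: assumes secure PL1 on a Cortex-A15, where L2ACTLR
     * is CP15 c15 with opc1 = 1 (see the Cortex-A15 TRM). */
    static inline void enable_l2_hazard_detect_timeout(void)
    {
        uint32_t l2actlr;
        __asm__ volatile("mrc p15, 1, %0, c15, c0, 0" : "=r"(l2actlr));
        l2actlr |= (1u << 7); /* L2ACTLR[7]: Enable hazard detect timeout */
        __asm__ volatile("mcr p15, 1, %0, c15, c0, 0" :: "r"(l2actlr));
        __asm__ volatile("isb");
    }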
  • Hi Nuri,

    Do you use the TI RTOS on your device? Which version?

    Is there a particular application that causes the A15 to stall, or is this random (different processes stalling the Cortex-A15 on different occasions)?
    Is it possible to share the code?

    Best Regards,
    Yordan
  • Hi Yordan,

    We are currently running a minimal custom scheduler, so no OS. The stall we are seeing always happens at the same point in the code, though only on occasion; the code can run hundreds of thousands of times before the core stalls, but when it stalls it is always in the same spot. Prior to enabling the hazard detect timeout described in erratum 798870, the core would get into a state where we could not connect to it using the debugger. After enabling the hazard detect timeout, the core appears to stall until an external interrupt happens, after which we can connect to the core using a debugger. We are trying to determine if we are seeing an instance of erratum 798870. Is there a way we can determine whether the hazard detect timeout occurred? Perhaps a counter that increments every time it happens, or some other status register bit that gets set?

    Thanks,
    Nuri
  • Hi,

    Do you have a description of your system, and possibly the code snippet that caused the core stall? The A15 core in the TI K2E is a Cortex-A15 processor, revision r2p4, and its REVIDR is 0x20A. Since REVIDR[2] = 0, it looks like the ARM erratum applies to this device.
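
    For reference, a minimal sketch of how software running at PL1 can make this check itself (the CP15 encodings for MIDR and REVIDR are from the Cortex-A15 TRM; the function names are just for illustration):

    #include <stdint.h>

    static inline uint32_t read_midr(void)
    {
        uint32_t v;
        __asm__ volatile("mrc p15, 0, %0, c0, c0, 0" : "=r"(v)); /* MIDR */
        return v;
    }

    static inline uint32_t read_revidr(void)
    {
        uint32_t v;
        __asm__ volatile("mrc p15, 0, %0, c0, c0, 6" : "=r"(v)); /* REVIDR */
        return v;
    }

    /* Affected configurations: r2p0..r2p4 with REVIDR[2] == 0 */
    int erratum_798870_may_apply(void)
    {
        uint32_t midr     = read_midr();
        uint32_t variant  = (midr >> 20) & 0xF; /* 'x' in rxpy */
        uint32_t revision = midr & 0xF;         /* 'y' in rxpy */

        return (variant == 2) && (revision <= 4) &&
               ((read_revidr() & (1u << 2)) == 0);
    }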

    Regards, Eric
  • Hi,

    This issue needs a software workaround:

    "Write the value '1' to L2ACTLR[7]: Enable hazard detect timeout. When this bit is set to 1, any memory transaction in the L2 that has been stalled for 1024 cycles is reissued to verify that its hazard condition still exists.
    In the erratum case, when the final read (read C) reissues, it determines that no hazard exists and completes normally.
    For an A15 with two or more cores active, or with ACE or ACP traffic, setting L2ACTLR[7] to 1 is a good workaround. The L2 may go idle, but will be periodically active and allow the request to reissue. Typical worst case delay would be the next timer interrupt. If all but one of the A15 cores are powered down (meaning there will be no further L2 traffic or interrupts to other cores), a watchdog timer outside A15 should be used to wake up an additional core if the single active core becomes unresponsive."

    Regards, Eric
    It’s tough to share the actual C code as there are restrictions, and by itself, I suspect it is only part of the issue. The code, among plenty of other code, runs at a 50Hz rate, and the core may run anywhere from 10 minutes to 8 hours before the stall appears, though it typically happens after 1-2 hours. This is the only code sequence at which we’ve detected the stall. I suspect the remaining variables have to do with what happens to be cached at the time, whether data, instructions, branch-prediction state, etc., all creating just the right scenario to go along with the code to cause the stall. We don’t know if the stall is specifically the same as the erratum, though it does fit the symptoms. This code structure is new, although the functionality existed before: previously, the functionality of “func1” was inlined in the “calling” function. The change that began to introduce the stalls occurred when that inlined functionality was moved into “func1”. When the code had been inlined, a number of floating point values remained in registers, whereas after the change, a number of those values are written to memory prior to the function call and then read inside the function, so a number of memory transactions were certainly introduced that didn’t exist before.

    I’ve posted an assembly listing leading up to the stall point. This includes not only the function where the stall happens, but also the code leading to the call of this function. Trace data shows the stall only happens along this path, so even though the function where the stall occurs is called from a dozen locations (including twice from the calling function shown by the trace), it always occurs on the second call from this specific function; I’ve therefore included some of the assembly leading up to it as well. The assembly is more useful to pass along, as I would expect compiler option changes, such as optimizations, to reorder statements and make this issue disappear. The code is built with gcc at its “-O1” optimization level. At the C level there is nothing particularly interesting about this code sequence, which is why we’re trying to track down what is actually happening to cause this stall.

    We do have both instruction and data cache enabled, as well as branch prediction and the MMU. HW cache coherency is configured as well, though at the time this code runs there are no DMA transactions in progress, nor are there any interrupts. We have only 4 interrupts enabled: 2 are asynchronous, tied to discretes, and designed to signal the code to shut down. Of the two periodic interrupts, one comes from a discrete at 200Hz, though we know this code runs in between those interrupts; the other is a timer interrupt that does not fire during the period this code is executing. Only one core is enabled. This code executes in User mode.

    From the calling function, the steps leading up to the second call of “func1”:
    Float64 localVar01;
    Float64 localVar02;
    Float64 localVar03;
    
      LocalVar02 = LocalVar03 - LocalVar01;
    82a11a78:	ed5b0b09 	vldr	d16, [fp, #-36]	; 0xffffffdc
    
      LocalVar03 = (LocalVar03 + LocalVar01) / globalVar01;
    82a11a7c:	ee380b20 	vadd.f64	d0, d8, d16
    
      globalVar02 = (globalVar03 * LocalVar02) / fmax(LocalVar03, globalVar04);
    82a11a80:	e3003bf0 	movw	r3, #3056	; 0xbf0
    82a11a84:	e34835a0 	movt	r3, #34208	; 0x85a0
    
      LocalVar02 = LocalVar03 - LocalVar01;
    82a11a88:	ee388b60 	vsub.f64	d8, d8, d16
    
      globalVar02 = (globalVar03 * LocalVar02) / fmax(LocalVar03, globalVar04);
    82a11a8c:	edd30b00 	vldr	d16, [r3]
    82a11a90:	ee288b20 	vmul.f64	d8, d8, d16
    
      LocalVar03 = (LocalVar03 + LocalVar01) / globalVar01;
    82a11a94:	e3003b28 	movw	r3, #2856	; 0xb28
    82a11a98:	e34835a0 	movt	r3, #34208	; 0x85a0
    82a11a9c:	edd30b00 	vldr	d16, [r3]
    
      globalVar02 = (globalVar03 * LocalVar02) / fmax(LocalVar03, globalVar04);
    82a11aa0:	e3003ad8 	movw	r3, #2776	; 0xad8
    82a11aa4:	e34835a0 	movt	r3, #34208	; 0x85a0
    82a11aa8:	ee800b20 	vdiv.f64	d0, d0, d16
    82a11aac:	ed931b00 	vldr	d1, [r3]
    82a11ab0:	eb010ef0 	bl	82a55678 <fmax>
    82a11ab4:	e30b3438 	movw	r3, #46136	; 0xb438
    82a11ab8:	e34835a0 	movt	r3, #34208	; 0x85a0
    82a11abc:	ee880b00 	vdiv.f64	d0, d8, d0
    82a11ac0:	ed830b00 	vstr	d0, [r3]
    
      func1(&globalVar02, (&(GlobalVar05)),
    82a11ac4:	e58d6000 	str	r6, [sp]
    82a11ac8:	e24b202c 	sub	r2, fp, #44	; 0x2c
    82a11acc:	e58d2004 	str	r2, [sp, #4]
    82a11ad0:	e2855010 	add	r5, r5, #16
    82a11ad4:	e58d5008 	str	r5, [sp, #8]
    82a11ad8:	e1a00003 	mov	r0, r3
    82a11adc:	e30018a8 	movw	r1, #2216	; 0x8a8
    82a11ae0:	e34815a0 	movt	r1, #34208	; 0x85a0
    82a11ae4:	e30028a0 	movw	r2, #2208	; 0x8a0
    82a11ae8:	e34825a0 	movt	r2, #34208	; 0x85a0
    82a11aec:	eb0105b7 	bl	82a531d0 <func1>
    
    *** [jump to func1] ***
    
    82a531d0 <func1>:
    
    void func1(const float64 *p1, const float64 *p2, const
      float64 *p3, const float64 *p4, const fcBool *p5, float64
      *p6, typedef_struct *p7)
    82a531d0:	e52d400c 	str	r4, [sp, #-12]!
    82a531d4:	e58db004 	str	fp, [sp, #4]
    82a531d8:	e58de008 	str	lr, [sp, #8]
    82a531dc:	e28db008 	add	fp, sp, #8
    82a531e0:	e59be008 	ldr	lr, [fp, #8]
    82a531e4:	e59bc00c 	ldr	ip, [fp, #12]
      if (*p5) {
    82a531e8:	e59b4004 	ldr	r4, [fp, #4]
    82a531ec:	e1d440b0 	ldrh	r4, [r4]
    82a531f0:	e3540000 	cmp	r4, #0
    82a531f4:	0a000002 	beq	82a53204 <func1+0x34>  *** [ON STALL, SHOWS AS LAST INSTRUCTION IN TRACE] ***
        *p6 = *p4;
    82a531f8:	e1c320d0 	ldrd	r2, [r3]
    82a531fc:	e1ce20f0 	strd	r2, [lr]
    82a53200:	ea000008 	b	82a53228 <func1+0x58>
      } else {
        *p6 = ((*p3) * p7->var1) + ((*p1) * ...
    82a53204:	edd22b00 	vldr	d18, [r2]  *** [ON STALL, PC SHOWS THIS IS NEXT INSTRUCTION TO BE EXECUTED] ***
    82a53208:	eddc1b00 	vldr	d17, [ip]
          ((*p1) + globalVar01));
    82a5320c:	edd00b00 	vldr	d16, [r0]
    82a53210:	eddc3b02 	vldr	d19, [ip, #8]
    82a53214:	ee700ba3 	vadd.f64	d16, d16, d19
      if (*p5) {
        *p6 = *p4
      } else {
        *p6 = ((*p2) * p7->var1) + ((*p2) * ...
    82a53218:	edd13b00 	vldr	d19, [r1]
    82a5321c:	ee600ba3 	vmul.f64	d16, d16, d19
    82a53220:	ee420ba1 	vmla.f64	d16, d18, d17
    82a53224:	edce0b00 	vstr	d16, [lr]
      }
    
      p7->var1 = *p6;
    82a53228:	e1ce20d0 	ldrd	r2, [lr]
    82a5322c:	e1cc20f0 	strd	r2, [ip]
      p7->var1 = *p1;
    82a53230:	e1c020d0 	ldrd	r2, [r0]
    82a53234:	e1cc20f8 	strd	r2, [ip, #8]
    }
    82a53238:	e24bd008 	sub	sp, fp, #8
    82a5323c:	e59d4000 	ldr	r4, [sp]
    82a53240:	e59db004 	ldr	fp, [sp, #4]
    82a53244:	e28dd008 	add	sp, sp, #8
    82a53248:	e49df004 	pop	{pc}		; (ldr pc, [sp], #4)
    

    As we tried to find the cause of the stall, we looked at ARM erratum 798870, as mentioned in earlier posts. Because we don’t have the ability to look at the precise state of the L2 memory system (I assume there isn’t a method for us to do that), we made the assumption that we were running into this erratum, and we tried a number of different configurations. Those are listed below along with the results.

    Original code (no mitigations): While time to stall varied from a few minutes to a few hours, a stall would typically occur around an hour in. Trace data showed the stall point. It also showed the arrival of the 200Hz interrupt and the transition of the PC to the IRQ handler, but the handler code did not execute. The debugger could not gain control of the core; however, via the DAP, DDR3 and MMRs could be read.

    Updated boot code to set the hazard detection bit L2ACTLR[7]: Time to stall didn’t appear to change significantly, and trace data showed the same stall point. However, upon the arrival of the 200Hz interrupt, the PC did change to the handler, and the handler executed (the handler is basically a watchdog, and it errs out if it detects that the code that had been running did not execute to completion, which it hadn’t, since it stalled). With the core freed, we were able to access the core from the debugger.

    Included L2ACTLR[7], and also enabled the second core: Turned on its MMU and caches, and put it into a loop where it just increments a counter in DDR3 (including a cache clean operation to force the data out of the L1 cache and through the L2 memory system) before executing a “WFI” (wait for interrupt, basically sleep until interrupt); a sketch of the loop is shown below. The first core configured one of the timers to generate an event every 30us, and the interrupt controller routed that event to the second core. The debugger view of the counter in DDR3 showed it incrementing at the expected rate for a timer firing every 30us. With this happening, the software ran for 72 hours before we stopped the test.
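
    A rough sketch of the second core’s keep-alive loop (the function name and counter address are hypothetical; the cache clean by MVA uses the CP15 DCCMVAC operation):

    #include <stdint.h>

    /* Second-core keep-alive loop (sketch): increment a counter in DDR3,
     * clean the line out of the L1 data cache so the write goes through
     * the L2 memory system, then sleep until the 30us timer event. */
    void second_core_keepalive(void)
    {
        volatile uint32_t *counter = (uint32_t *)0xA0000000; /* example DDR3 address */

        for (;;) {
            (*counter)++;
            /* DCCMVAC: clean data cache line by MVA to point of coherency */
            __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" :: "r"(counter));
            __asm__ volatile("dsb");
            __asm__ volatile("wfi"); /* wake on the routed 30us timer event */
        }
    }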

    Enabled the second core as above, but did not set L2ACTLR[7]: The core stalled at the typical rate, and trace data showed it remained stalled even after the 200Hz interrupt. This was as expected, but it was done to show that the second core by itself didn’t change the environment and mask (versus resolve) the core stalls.

    Included L2ACTLR[7], but instead of using the second core, had the 30us timer initiate a 32-byte DMA copy from DDR3 to DDR3 (both unused locations) as a DMA event. The timer interrupt wasn’t enabled at the interrupt controller. Because HW cache coherency is on, the expectation was that the snoop request generated by the DMA would wake the L2 memory system so that the hazard detection reissue would occur, much like in the second-core test. Like the second-core test, this ran without a crash for over 48 hours before we stopped it.

    Used DMA as above, but did not set L2ACTLR[7]: Similar to the second-core version of this test, this was done to show that the DMA itself didn’t alter the timing or the conditions that caused the stall. As expected, the core did stall.

    The empirical evidence shows that the erratum workaround does appear to work. However, we’re going to need your help in resolving the following questions, which are critical for us to understand as we pursue root cause:

    (1)    While the software without the workaround crashed and the software with the workaround did not, is there some means of determining for certain that stalls occurred and that the hazard detection was invoked to clear them? We need that to positively identify that the workaround is actually being invoked, and we are looking for some means of identifying how often these stalls happen, and where. Right now we assume the stall is happening and that the workaround is clearing it, but we have no means of verifying either.

    (2)    Is there some means of determining whether we are positively running into erratum 798870, or whether we are running into some other circumstance that happens to be mitigated by the same workaround? We severely lack information on the L2 memory system and its design, and we also lack the ability to understand precisely what its state was when we stalled. We can infer that the erratum might be happening, but we lack the under-the-hood knowledge that would show it is.

    (3)    Is there a means of identifying the particular code constructs that help lead to these stall scenarios? To date, we’ve detected the stalls only at this specific location, following this specific path. By itself, it doesn’t trigger the stall, but it appears to be the only code segment we’ve encountered that, when aligned with the other unknown-to-us conditions, has triggered it.

    (4)    Is there a known compiler switch that instructs the compiler to avoid a particular code sequence known to open up the possibility of the stall? I understand we’re not using the TI compiler, but this would be a good question for that team: has the TI compiler identified a sequence that makes the stall possible, and does it have a switch that keeps it from generating that sequence?

  • Hi,

    We got the feedback: "I'm afraid there isn't a software visible way to know if the time out has happened. If the deadlock will disappear when L2ACTLR[7] is set, I think it is very likely that the erratum 798870 is hit."

    Unfortunately, there is no direct way in the ARM-provided IP to count occurrences of the hazard. If the workaround, or alternatively second-core activity, removes the problem, that is a strong indication that this is the issue. From the TI side, you might utilize our trace infrastructure to catch the sequence leading to the stall.

    Not sure if you have any experience using CP Tracer. You could do something like: start tracing, watch for reads to a suitably constrained memory region, and parse the log for 2 back-to-back reads 1024 cycles apart with nothing in between, to see that the hazard was hit. The trick will be capturing the right sequence, as you are looking for proof that the hazard timeout reissues a command. I don’t think you need every hazard timeout, just one. So, rather than tracing everything (too much bandwidth to capture), capture reads from the ARM cluster L2 into DDR that have the same address. Then parse offline to see that there is a long (1024-cycle) gap with nothing going on in between those reads. It could also be writes, but I suspect a read is a lot more likely to stall for a long period.
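
    As an illustration only, the offline parse could look something like this, assuming the trace has already been exported to a text file of "<cycle> <R|W> <hex address>" records (that record format is hypothetical, not what the tools emit):

    #include <stdio.h>
    #include <stdint.h>
    #include <inttypes.h>

    /* Scan an exported trace log for the erratum signature: two
     * consecutive reads of the same address separated by >= 1024
     * cycles (adjacency in the log means nothing came in between). */
    int main(void)
    {
        uint64_t cycle, prev_cycle = 0, addr, prev_addr = 0;
        char type, prev_type = 0;

        while (scanf("%" SCNu64 " %c %" SCNx64, &cycle, &type, &addr) == 3) {
            if (type == 'R' && prev_type == 'R' && addr == prev_addr &&
                (cycle - prev_cycle) >= 1024) {
                printf("possible hazard-timeout reissue: addr 0x%" PRIx64
                       ", gap %" PRIu64 " cycles\n", addr, cycle - prev_cycle);
            }
            prev_cycle = cycle;
            prev_addr = addr;
            prev_type = type;
        }
        return 0;
    }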

    The setup of trace is not trivial, you may look at some materials:
    www.ti.com/.../spruhm4.pdf
    processors.wiki.ti.com/.../Using_System_Trace_(STM)
    processors.wiki.ti.com/.../Debugging_With_Trace
    processors.wiki.ti.com/.../Embedded_Trace_Buffer
    processors.wiki.ti.com/.../Debug_Handbook_for_CCS

    Regards, Eric