It’s tough to share the actual C code because of restrictions, and by itself I suspect it is only part of the issue. The code, among plenty of other code, runs at a 50Hz rate, and the core may run anywhere from 10 minutes to 8 hours before the stall appears, though it typically happens after 1-2 hours. This is the only code sequence at which we’ve detected the stall. I suspect the remaining variables have to do with whatever happens to be cached at the time (data, instructions, branch-prediction state, and so on), all creating just the right scenario, together with the code, to cause the stall. We don’t know whether the stall is specifically the one described in the erratum, though it does fit the symptoms. The code structure is new, although the functionality existed before: previously, the body of “func1” was inlined in the calling function, and the stalls began when that inlined functionality was moved into “func1”. When the code was inlined, a number of floating-point values stayed in registers; with the change, several of those values are written to memory before the function call and read back inside the function, so the change certainly introduced memory transactions that didn’t exist before.
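To make the restructuring concrete, here is a minimal sketch of the before/after shape. The names and arithmetic are hypothetical stand-ins, not our actual code:

typedef double Float64;

static Float64 globalVar01 = 2.0;   /* hypothetical */
static Float64 globalVar02;         /* hypothetical */

/* Before the change: the work was inlined in the caller, so the
   intermediates could live entirely in VFP registers. */
static void caller_inlined(Float64 x, Float64 y)
{
    Float64 sum = (x + y) / globalVar01;  /* stays in a d-register */
    globalVar02 = sum * (x - y);
}

/* After the change: the same work moved into a func1-style helper.
   The caller now spills the doubles to memory and the helper reloads
   them through pointers, introducing the extra load/store traffic
   described above. */
static void func1_like(const Float64 *p1, const Float64 *p2, Float64 *out)
{
    *out = ((*p1 + *p2) / globalVar01) * (*p1 - *p2);
}

static void caller_outlined(Float64 x, Float64 y)
{
    func1_like(&x, &y, &globalVar02);  /* x and y written to the stack first */
}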
I’ve posted an assembly listing leading to the stall point. It includes not only the function where the stall happens but also the code leading to the call of that function. Trace data shows the stall only happens along this path: even though the function where the stall occurs is called from a dozen locations (including twice from the calling function shown in the trace), it always stalls on the second call from this specific function, so I’ve included some of the assembly leading up to it as well. The assembly is the more useful thing to pass along, as I would expect compiler option changes, such as different optimization levels, to reorder statements and make the issue disappear. The code is built with gcc at “-O1”. At the C level there is nothing particularly interesting about this code sequence, which is why we’re trying to track down what is actually happening to cause the stall.
We have both instruction and data caches enabled, as well as branch prediction and the MMU. HW cache coherency is also configured, though at the time this code runs there are no DMA transactions in progress and no interrupts firing. We have only 4 interrupts enabled: 2 are asynchronous, tied to discretes, and exist to signal the code to shut down. Of the two periodic interrupts, one comes from a discrete at 200Hz, though we know this code runs in between those interrupts; the other is a timer interrupt that does not go off while this code is executing. Only one core is enabled, and this code executes in User mode.
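As a sanity check on that configuration, the relevant enables can be read back from SCTLR. This is standard ARMv7-A CP15 access (bit positions per the ARM Architecture Reference Manual), and since CP15 reads fault in User mode, a snippet like this has to live in privileged boot or kernel code:

#include <stdint.h>

/* Read SCTLR (CP15 c1, opc1=0) and extract the enables:
   bit 0 = M (MMU), bit 2 = C (data cache),
   bit 11 = Z (branch prediction), bit 12 = I (instruction cache). */
static inline uint32_t read_sctlr(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(v));
    return v;
}

uint32_t config_snapshot(void)
{
    uint32_t sctlr = read_sctlr();
    /* pack the four enables into a nibble for logging */
    return  ((sctlr >> 0) & 1u)          /* MMU */
         | (((sctlr >> 2) & 1u) << 1)    /* D-cache */
         | (((sctlr >> 11) & 1u) << 2)   /* branch prediction */
         | (((sctlr >> 12) & 1u) << 3);  /* I-cache */
}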
From the calling function, the steps leading up to the second call of “func1”:

Float64 localVar01;
Float64 localVar02;
Float64 localVar03;

    LocalVar02 = LocalVar03 - LocalVar01;
82a11a78: ed5b0b09  vldr     d16, [fp, #-36]   ; 0xffffffdc
    LocalVar03 = (LocalVar03 + LocalVar01) / globalVar01;
82a11a7c: ee380b20  vadd.f64 d0, d8, d16
    globalVar02 = (globalVar03 * LocalVar02) / fmax(LocalVar03, globalVar04);
82a11a80: e3003bf0  movw     r3, #3056         ; 0xbf0
82a11a84: e34835a0  movt     r3, #34208        ; 0x85a0
    LocalVar02 = LocalVar03 - LocalVar01;
82a11a88: ee388b60  vsub.f64 d8, d8, d16
    globalVar02 = (globalVar03 * LocalVar02) / fmax(LocalVar03, globalVar04);
82a11a8c: edd30b00  vldr     d16, [r3]
82a11a90: ee288b20  vmul.f64 d8, d8, d16
    LocalVar03 = (LocalVar03 + LocalVar01) / globalVar01;
82a11a94: e3003b28  movw     r3, #2856         ; 0xb28
82a11a98: e34835a0  movt     r3, #34208        ; 0x85a0
82a11a9c: edd30b00  vldr     d16, [r3]
    globalVar02 = (globalVar03 * LocalVar02) / fmax(LocalVar03, globalVar04);
82a11aa0: e3003ad8  movw     r3, #2776         ; 0xad8
82a11aa4: e34835a0  movt     r3, #34208        ; 0x85a0
82a11aa8: ee800b20  vdiv.f64 d0, d0, d16
82a11aac: ed931b00  vldr     d1, [r3]
82a11ab0: eb010ef0  bl       82a55678 <fmax>
82a11ab4: e30b3438  movw     r3, #46136        ; 0xb438
82a11ab8: e34835a0  movt     r3, #34208        ; 0x85a0
82a11abc: ee880b00  vdiv.f64 d0, d8, d0
82a11ac0: ed830b00  vstr     d0, [r3]
    func1(&globalVar02, (&(GlobalVar05)),
82a11ac4: e58d6000  str      r6, [sp]
82a11ac8: e24b202c  sub      r2, fp, #44       ; 0x2c
82a11acc: e58d2004  str      r2, [sp, #4]
82a11ad0: e2855010  add      r5, r5, #16
82a11ad4: e58d5008  str      r5, [sp, #8]
82a11ad8: e1a00003  mov      r0, r3
82a11adc: e30018a8  movw     r1, #2216         ; 0x8a8
82a11ae0: e34815a0  movt     r1, #34208        ; 0x85a0
82a11ae4: e30028a0  movw     r2, #2208         ; 0x8a0
82a11ae8: e34825a0  movt     r2, #34208        ; 0x85a0
82a11aec: eb0105b7  bl       82a531d0 <func1>

*** [jump to func1] ***

82a531d0 <func1>:
void func1(const float64 *p1, const float64 *p2, const float64 *p3,
           const float64 *p4, const fcBool *p5, float64 *p6, typedef_struct *p7)
82a531d0: e52d400c  str      r4, [sp, #-12]!
82a531d4: e58db004  str      fp, [sp, #4]
82a531d8: e58de008  str      lr, [sp, #8]
82a531dc: e28db008  add      fp, sp, #8
82a531e0: e59be008  ldr      lr, [fp, #8]
82a531e4: e59bc00c  ldr      ip, [fp, #12]
    if (*p5) {
82a531e8: e59b4004  ldr      r4, [fp, #4]
82a531ec: e1d440b0  ldrh     r4, [r4]
82a531f0: e3540000  cmp      r4, #0
82a531f4: 0a000002  beq      82a53204 <func1+0x34>

*** [ON STALL, SHOWS AS LAST INSTRUCTION IN TRACE] ***
        *p6 = *p4;
82a531f8: e1c320d0  ldrd     r2, [r3]
82a531fc: e1ce20f0  strd     r2, [lr]
82a53200: ea000008  b        82a53228 <func1+0x58>
    } else {
        *p6 = ((*p3) * p7->var1) + ((*p1) * ...
82a53204: edd22b00  vldr     d18, [r2]

*** [ON STALL, PC SHOWS THIS IS NEXT INSTRUCTION TO BE EXECUTED] ***
82a53208: eddc1b00  vldr     d17, [ip]
        ((*p1) + globalVar01));
82a5320c: edd00b00  vldr     d16, [r0]
82a53210: eddc3b02  vldr     d19, [ip, #8]
82a53214: ee700ba3  vadd.f64 d16, d16, d19
    if (*p5) { *p6 = *p4 } else { *p6 = ((*p2) * p7->var1) + ((*p2) * ...
82a53218: edd13b00  vldr     d19, [r1]
82a5321c: ee600ba3  vmul.f64 d16, d16, d19
82a53220: ee420ba1  vmla.f64 d16, d18, d17
82a53224: edce0b00  vstr     d16, [lr]
    }
    p7->var1 = *p6;
82a53228: e1ce20d0  ldrd     r2, [lr]
82a5322c: e1cc20f0  strd     r2, [ip]
    p7->var1 = *p1;
82a53230: e1c020d0  ldrd     r2, [r0]
82a53234: e1cc20f8  strd     r2, [ip, #8]
}
82a53238: e24bd008  sub      sp, fp, #8
82a5323c: e59d4000  ldr      r4, [sp]
82a53240: e59db004  ldr      fp, [sp, #4]
82a53244: e28dd008  add      sp, sp, #8
82a53248: e49df004  pop      {pc}              ; (ldr pc, [sp], #4)
As we tried to find the cause of the stall, we looked at ARM erratum 798870, as mentioned in earlier posts. Because we have no way to observe the precise state of the L2 memory system (I assume there isn’t a method for us to do that), we assumed we were running into this erratum and tried a number of different configurations. They are listed below along with the results.
Original code (no mitigations): While the time to stall varied from a few minutes to a few hours, a stall would typically occur around the one-hour mark. Trace data showed the stall point. It also showed the arrival of the 200Hz interrupt and the PC transitioning to the IRQ handler, but the handler code did not execute. The debugger could not gain control of the core; however, via the DAP, DDR3 and MMRs could still be read.
Updated boot code to set the hazard detection bit L2ACTLR[7]: The time to stall didn’t appear to change significantly, and trace data showed the same stall point. However, upon arrival of the 200Hz interrupt, the PC did change to the handler, and the handler executed. (The handler is basically a watchdog, and it errs out if it detects that the foreground code had not run to completion, which it hadn’t, since it stalled.) With the core freed, we were able to access it from the debugger.
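For reference, the boot-code change amounts to something like the sketch below. The CP15 encoding for L2ACTLR is from the Cortex-A15 TRM, and bit 7 is the "Enable hazard detect timeout" bit named by the erratum 798870 work-around. Note that L2ACTLR is generally only writable from secure privileged code early in boot (on some SoCs only via a secure monitor call), so where this lands depends on the platform’s boot flow:

#include <stdint.h>

/* Set L2ACTLR[7] (hazard detect timeout) per the 798870 work-around.
   L2ACTLR is CP15 c9, c0, 2 with opc1=1 on Cortex-A15. Must run in
   secure privileged state, before the caches/SMP bit are enabled. */
static inline void enable_l2_hazard_timeout(void)
{
    uint32_t v;
    __asm__ volatile("mrc p15, 1, %0, c9, c0, 2" : "=r"(v));
    v |= (1u << 7);
    __asm__ volatile("mcr p15, 1, %0, c9, c0, 2" : : "r"(v));
    __asm__ volatile("isb");  /* ensure the system-control write takes effect */
}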
Included L2ACTLR[7], and also enabled the second core: Turned on its MMU and caches and put it into a loop where it increments a counter in DDR3 (including a cache clean operation to force the data out of L1 and through the L2 memory system) before executing WFI (wait for interrupt, i.e., sleep until an interrupt arrives). The first core configured one of the timers to generate an event every 30us, and the interrupt controller routed that event to the second core. The debugger view of the counter in DDR3 showed it incrementing at the expected rate for a 30us timer. With this running, the software continued for 72 hours before we stopped the test.
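The second core’s loop looked roughly like this (a sketch; COUNTER_ADDR is a placeholder for the unused DDR3 location, and the timer/interrupt-controller routing is set up elsewhere by the first core):

#include <stdint.h>

#define COUNTER_ADDR 0x90000000u  /* placeholder DDR3 address */

static void second_core_loop(void)
{
    volatile uint32_t *counter = (volatile uint32_t *)COUNTER_ADDR;

    for (;;) {
        (*counter)++;
        /* DCCMVAC: clean the line by MVA to the point of coherency,
           forcing the write out of L1 and through the L2 memory system */
        __asm__ volatile("mcr p15, 0, %0, c7, c10, 1" : : "r"(counter));
        __asm__ volatile("dsb");
        __asm__ volatile("wfi");  /* sleep until the routed 30us timer event */
    }
}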
Enabled the second core as above, but did not set L2ACTLR[7]: The core stalled at the typical rate, and trace data showed it remained stalled even after the 200Hz interrupt. This was as expected; the test was done to show that the second core’s activity by itself didn’t change the environment and mask (rather than resolve) the core stalls.
Included L2ACTLR[7], but instead of using the second core, had the 30us timer trigger a 32-byte DMA copy from DDR3 to DDR3 (both unused locations) as a DMA event; the timer interrupt was not enabled at the interrupt controller. Because HW cache coherency is on, the expectation was that the snoop request generated by the DMA would wake the L2 memory system so that the hazard-detection reissue would occur, much as in the second-core test. Like the second-core test, this ran without a stall for over 48 hours before we stopped it.
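For what it’s worth, the 32-byte copy itself reduces to a single EDMA3 PaRAM set along these lines. This is only a sketch: the field layout follows TI’s EDMA3 documentation, but the PaRAM base address and the source/destination addresses are placeholders, and the SoC-specific timer-event-to-EDMA routing is omitted:

#include <stdint.h>

typedef struct {            /* EDMA3 PaRAM set layout */
    uint32_t opt;
    uint32_t src;
    uint32_t a_b_cnt;       /* ACNT in [15:0], BCNT in [31:16] */
    uint32_t dst;
    uint32_t src_dst_bidx;
    uint32_t link_bcntrld;
    uint32_t src_dst_cidx;
    uint32_t ccnt;
} edma3_param_t;

#define PARAM_SET ((volatile edma3_param_t *)0x02704000u)  /* placeholder */

static void setup_dma_copy(void)
{
    PARAM_SET->src          = 0x90100000u;       /* unused DDR3, placeholder */
    PARAM_SET->dst          = 0x90100100u;       /* unused DDR3, placeholder */
    PARAM_SET->a_b_cnt      = (1u << 16) | 32u;  /* one 32-byte array */
    PARAM_SET->src_dst_bidx = 0;
    PARAM_SET->link_bcntrld = 0xFFFFu;           /* null link */
    PARAM_SET->src_dst_cidx = 0;
    PARAM_SET->ccnt         = 1;
    PARAM_SET->opt          = (1u << 3);         /* STATIC; no completion IRQ */
}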
Used DMA as above, but did not set L2ACTLR[7]: Similar to the second-core version of this test, this was done to show the DMA itself didn’t alter the timing or the conditions that cause the stall. As expected, the core did stall.
The empirical evidence shows that the errata work-around does appear to work. However, we need your help in resolving the following questions, which are critical for us to get to root cause:
(1) While the software without the work-around stalled and the software with the work-around did not, is there some means of determining for certain that stalls occurred and that the hazard-detection timeout was what cleared them? We need that to positively confirm the work-around is actually being invoked, and we are looking for a way to identify how often these stalls happen and where. Right now we assume the stall is happening and the work-around is clearing it, but we have no means of verifying that.
(2) Is there some means of determining whether we are positively hitting erratum 798870, or whether we are running into some other circumstance that happens to be mitigated by the same work-around? We severely lack information on the L2 memory system and its design, and we also lack the ability to see precisely what its state was when we stalled. We can infer that the erratum might be occurring, but we lack the under-the-hood knowledge that would show it is.
(3) Is there a means of identifying the particular code constructs that help lead to these stall scenarios? To date we’ve only detected the stall at this specific location along this specific path. By itself the code doesn’t trigger the stall, but it appears to be the only code segment we’ve encountered that, when aligned with the other conditions unknown to us, has triggered it.
(4) Is there a known compiler switch that instructs the compiler to avoid a particular code sequence known to open up the possibility of the stall? I understand we’re not using the TI compiler, but this would be a good question to pass to the compiler team: has TI identified a sequence that makes the stall possible, and created a switch that keeps the compiler from generating it?
Hi,
We got the feedback: "I'm afraid there isn't a software visible way to know if the time out has happened. If the deadlock will disappear when L2ACTLR[7] is set, I think it is very likely that the erratum 798870 is hit."
Unfortunately there is no direct way in the ARM-provided IP to count occurrences of the hazard. If the work-around, or alternatively second-core activity, removes the problem, that is a strong indication that this is the issue. From the TI side, you may be able to use our trace infrastructure to catch the sequence leading to the stall.
Not sure if you have any experience using CP trace. The idea: start tracing, look for reads to a suitably constrained memory region, then parse the log for two back-to-back reads to the same address roughly 1024 cycles apart with nothing in between, which would show the hazard timeout reissuing a command. The trick will be capturing the right sequence. I don’t think you need to catch every hazard timeout, just one. So rather than tracing everything (too much bandwidth to capture), capture reads from the ARM cluster L2 into DDR that have the same address, then parse offline to confirm there is a long (~1024-cycle) gap with nothing in between those reads. It could also be writes, but I suspect a read is far more likely to stall for a long period.
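If the trace can be decoded into per-transaction records, that offline pass is simple. A sketch, assuming a hypothetical decoded format of one "cycle-count hex-address R|W" record per line (the real format depends on the trace tooling used):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Flag adjacent same-address reads separated by >= 1024 idle cycles,
   the signature of the hazard timeout reissuing a command. */
int main(void)
{
    uint64_t cycle, prev_cycle = 0, addr, prev_addr = 0;
    char rw, prev_rw = 0;

    while (scanf("%" SCNu64 " %" SCNx64 " %c", &cycle, &addr, &rw) == 3) {
        if (prev_rw == 'R' && rw == 'R' && addr == prev_addr &&
            cycle - prev_cycle >= 1024) {
            printf("possible reissue: addr 0x%" PRIx64 ", gap %" PRIu64 " cycles\n",
                   addr, cycle - prev_cycle);
        }
        prev_cycle = cycle;
        prev_addr  = addr;
        prev_rw    = rw;
    }
    return 0;
}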
Setting up trace is not trivial; you may want to look at these materials:
www.ti.com/.../spruhm4.pdf
processors.wiki.ti.com/.../Using_System_Trace_(STM)
processors.wiki.ti.com/.../Debugging_With_Trace
processors.wiki.ti.com/.../Embedded_Trace_Buffer
processors.wiki.ti.com/.../Debug_Handbook_for_CCS
Regards, Eric