This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

DMTimer on am335x counter goes backward causing clock to jump ahead a day and a half

Other Parts Discussed in Thread: AM3352

Hello,

We've got a custom board running an AM3352 processor booting Linux.  We see sporadic jumps in the Linux system time of about a day and a half.  I figured out that at 32 kHz, 2^32 ticks roughly equates to the time period we're seeing.  After instrumenting the kernel and looking at how the timekeeper and clocksources are managed, it looks like we're seeing negative deltas with subsequent reads to the dmtimer counter, which of course equates to a large unsigned 32-bit value.   It seems a bit arbitrary as to when they happen, but they do sometimes clump together.  Examples of the log messages I've added look like this:

root@hub-000165:~# uname -a
Linux hub-000165 3.2.42 #2 Fri Aug 16 16:45:20 MDT 2013 armv7l GNU/Linux
root@hub-000165:~# dmesg | grep clock
[    0.000000] OMAP clockevent source: GPTIMER2 at 24000000 Hz
[    0.000000] OMAP clocksource: GPTIMER1 at 32768 Hz
[    0.000000] sched_clock: 32 bits at 32kHz, resolution 30517ns, wraps every 131071999ms
[    0.120086] Switching to clocksource gp timer
[    1.244506] omap_rtc omap_rtc: setting system clock to 2013-08-21 20:01:07 UTC (1377115267)
...

[416422.022918] ACB: mach-omap2/timer.c:clocksource_read_cycles(): current time 760414958 < previous 760414971 (delta -13) in counter!! setting cur = prev+1 760414972
Still running at: Wed Aug 21 18:26:47 UTC 2013
Still running, date = Wed Aug 21 2013 18:27:16 GMT+0000 (UTC)
[416471.883392] ACB: mach-omap2/timer.c:clocksource_read_cycles(): current time 762048786 < previous 762050542 (delta -1756) in counter!! setting cur = prev+1 762050543
[416471.898864] ACB: mach-omap2/timer.c:clocksource_read_cycles(): current time 762049294 < previous 762050543 (delta -1249) in counter!! setting cur = prev+1 762050544
[416471.914306] ACB: mach-omap2/timer.c:clocksource_read_cycles(): current time 762049800 < previous 762050544 (delta -744) in counter!! setting cur = prev+1 762050545
[416471.934051] ACB: mach-omap2/timer.c:clocksource_read_cycles(): current time 762050446 < previous 762050545 (delta -99) in counter!! setting cur = prev+1 762050546
Still running, date = Wed Aug 21 2013 18:28:16 GMT+0000 (UTC)
Still running at: Wed Aug 21 18:28:52 UTC 2013
Still running, date = Wed Aug 21 2013 18:29:16 GMT+0000 (UTC)
Still running, date = Wed Aug 21 2013 18:30:16 GMT+0000 (UTC)
Still running at: Wed Aug 21 18:30:58 UTC 2013

While these logs are part of my work around (see below) at a higher level (plus user space output from our scripts testing for time jumps), I've also added printks showing negative jumps as low as __omap_dm_timer_read() in arch/arm/plat-omap/include/plat/dmtimer.h, which is basically reading straight out of the counter register:

static inline u32 __omap_dm_timer_read(struct omap_dm_timer *timer, u32 reg,
                                                int posted)
{
        if (posted)
                while (__raw_readl(timer->pend) & (reg >> WPSHIFT))
                        cpu_relax();

        return __raw_readl(timer->func_base + (reg & 0xff));
}

It looks like the kernel does proper casting to a volatile pointer and such. I instrumented this by adding a static to hold the previous value and checked that the new value was never less than the previous value.  Though it's susceptible to a race condition, the results were consistent with what the upper-layer timekeeper.c code, which also tracks a previous clock value, was seeing.

I've also followed several posts suggesting it might be an issue with POSTED mode, for example this thread:

http://e2e.ti.com/support/arm/sitara_arm/f/791/t/270632.aspx

I applied the patch referenced in there (http://arago-project.org/git/projects/?p=linux-am33x.git;a=commit;h=506c079b8421c901cbf81ebd4c8c27f1315dc769) and like others have reported, it didn't seem to have any effect on the negative time jumps we're seeing.  Also, it doesn't seem to address the overflow case, which is going to happen every day and a half anyway and also causes a huge jump in the time since the delta is very large number - very small number.  This seems like an issue others have seen, I just haven't been able to find a real fix for it.

Here's how I worked around both the negative deltas and the overflow. It's a hack and can slowly cause the time to drift ahead as the kernel "waits" for the time to catch up, but it's usually on the order of a couple of seconds a day.  At least it's continuous and monotonically increasing!  Some of the ideas for this came from drivers/clocksource/sh_cmt.c which seems to implement a clocksource correctly with overflow timers and such.

arch/arm/mach-omap2/timer.c:

static cycle_t clocksource_read_cycles(struct clocksource *cs)
{
        struct omap_dm_timer_priv *p = &timer_priv;
        unsigned long flags;
        cycle_t cycles;
        u32 cur;

        spin_lock_irqsave(&p->lock, flags);
        cur = __omap_dm_timer_read_counter(p->clksrc, 1);
        if (cur < p->prev) {
                /* we either wrapped, or it's an error.  If prev is "close" to UINT_MAX and current is "close" to 0, we assume it wrapped. */
                /* hopefully this is read at least every 10 seconds */
                if (cur < 0x8000*10 && p->prev > (~0U) - (0x8000*10)) {
                        p->wraps++;
                        pr_info("mach-omap2/timer.c:clocksource_read_cycles(): overflow detected cur = %u, prev = %u, new wraps=%u\n", cur, p->prev, p->wraps);
                } else {
                        pr_info("mach-omap2/timer.c:clocksource_read_cycles(): current time %u < previous %u (delta %d) in counter! setting cur = prev+1 %u\n", cur, p->prev, cur - p->prev, p->prev+1);
                        cur = p->prev + 1;
                }
        }
        p->prev = cur;
        cycles = (cycle_t)(((cycle_t)p->wraps << 32) + cur);
        spin_unlock_irqrestore(&p->lock, flags);

        return cycles;
}

If anyone can provide an explanation for why the counter might be appearing to go backwards, or clarify how overflows are supposed to be handled, I would love to have a more elegant fix for this problem.  I've exhausted my google-fu on this as well, so if I missed any obvious discussions with a proposed fix I'm willing to take a look at those too.  We're running a 3.2 Angstrom kernel which is a bit dated perhaps, but we're stuck with that for now.

Thanks in advance!

Sincerely,

Adam Boggs