Reading 32KHz Sync Timer on OMAP3530

Chris Elmquist

Other Parts Discussed in Thread: OMAP3530, TPS65950

We are noticing hangs with USB and other subsystems on Linux 2.6.32, 3.0 and 3.2 with OMAP3530 and when these hangs occur, we see processes stuck waiting for high resolution timers that have huge negative expiry endpoints.

Investigation reveals that two successive calls to the function omap34xx_32k_read() in .../arch/arm/plat-omap sometimes return a second value lower than the first, meaning time moved backwards, or the register moves forward in a huge, non-linear jump. The 32K Sync Timer register is supposed to be a monotonically increasing, read-only register clocked at 32KHz.
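One way to flag such a backwards step is a wrap-aware comparison of two successive reads. The following is only a minimal sketch: the helper names counter_delta and counter_went_backwards are hypothetical and are not taken from counter_32k.c.

```c
#include <stdint.h>
#include <stdbool.h>

/* Signed distance from prev to now on a free-running 32-bit counter.
 * Unsigned subtraction wraps modulo 2^32, so a legitimate rollover
 * (0xFFFFFFFF -> 0x00000000) gives +1 instead of a huge negative value.
 * The cast to int32_t assumes the usual two's-complement conversion. */
static int32_t counter_delta(uint32_t prev, uint32_t now)
{
    return (int32_t)(now - prev);
}

/* True when the second read appears to be earlier than the first. */
static bool counter_went_backwards(uint32_t prev, uint32_t now)
{
    return counter_delta(prev, now) < 0;
}
```

With this check, a normal 32-bit rollover is not reported as an error, while a second read that is a few counts lower than the first is.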

The TRM for the OMAP3530 states in section 16.6.1.1 that the 32K Sync Timer count register needs to be read with three separate 16-bit transfers, first reading the 16-bit LSB and then two 16-bit reads of the MSB. None of the Linux kernels I listed do it this way. They all do a single 32-bit read of this register.

Can someone confirm whether the requirement to read the 32K Sync Timer with three separate 16-bit transfers is real and, if so, comment on why most of the Linux kernels with OMAP support are not doing it correctly? Or can someone confirm that the requirement is not real, that a single 32-bit read should be OK, and then suggest why we see the register move backwards and/or make huge jumps?

Chris Elmquist

LogicPD

over 13 years ago

0 Chris Elmquist over 13 years ago

Prodigy 105 points

Oops, correcting the path to the offending function:

.../arch/arm/plat-omap/counter_32k.c

0 Brad Griffis over 13 years ago in reply to Chris Elmquist

TI__Guru*** 125430 points

Chris,

Good question. I'm trying to track down the internal specs as well as the IP owner. It might take a bit given we're in the midst of the holidays. The faster approach is probably to just make the change in the code to implement the reads as discussed in 16.6.1.1. That's a pretty bizarre procedure -- in particular having to access the upper 16 bits twice!

Can you share some of the specific values you've read where it seemingly goes back in time? I'm trying to gauge if the issue is related to the 16-bit boundary inside the register, i.e. do the issues occur when the lower 16 bits are transitioning from 0xFFFF to 0x0000.

As a side note, I've seen some 64-bit timers where reading the lower 32 bits "captures" the upper 32 such that you're able to get a single consistent value. Perhaps there's something similar implemented here, but using the upper and lower 16 bits. It still seems a bit odd that you would need to read the upper 16 bits twice!

Brad

0 Mike Brudevold over 13 years ago in reply to Brad Griffis

Expert 1365 points

Here are some values we see (r0 is read first, r1 is then read immediately afterwards):

[ 1709.722351] omap34xx_32k_read, r1:0x03a336f0, r0:0x03a336f6

[ 2769.035339] omap34xx_32k_read, r1:0x05b4df00, r0:0x05b4df3e

[ 3899.328308] omap34xx_32k_read, r1:0x07ea0480, r0:0x07ea04be

[ 7530.898132] omap34xx_32k_read, r1:0x0f01cd70, r0:0x0f01cd72

Some comments from Chris:

Chris35513 said:

We tried the 16-bit read approach as discussed in the TRM and this did not pan out because we were seeing the 16-bit LSB coming up wrong similar to what the values above show.

It appears to us that the first read of the 32K Sync Timer can have a large number of 1-bits errantly set and only after second or third read, have those bits settled out to the zeros that they should be.
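The "extra 1-bits on the first read" observation can be checked against the logged pairs with a few bit operations. This is a sketch only, and the helper names are hypothetical:

```c
#include <stdint.h>
#include <stdbool.h>

/* Bits that differ between the first read (r0) and the second read (r1). */
static uint32_t changed_bits(uint32_t r0, uint32_t r1)
{
    return r0 ^ r1;
}

/* True if every bit set in r1 is also set in r0 and the values differ,
 * i.e. the first read looks like the second read plus extra 1-bits. */
static bool first_read_has_only_extra_ones(uint32_t r0, uint32_t r1)
{
    return (r0 & r1) == r1 && r0 != r1;
}

/* True if the two reads differ only within the least significant byte. */
static bool differs_only_in_low_byte(uint32_t r0, uint32_t r1)
{
    return (changed_bits(r0, r1) & ~UINT32_C(0xFF)) == 0;
}
```

For the four r0/r1 pairs logged above, the XOR masks come out as 0x06, 0x3e, 0x3e and 0x02. In every pair the first read is the second read with additional bits set, and the difference is confined to the low byte.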

This suggests that we are facing the advisory requiring a delay before reading the register, except that we believe we are not in IDLE mode.

We also do not understand why all of the Linux code from at least kernel 2.6.32 does not follow the TRM's guidance on how to read this register, if indeed it must be read in 16-bit chunks.

Of note, here are some combinations we have tried and their results. The Linux kernel needs two timers, one is continuously running and is the basis for the time and date of the system. The other timer is used in one-shot mode to schedule the next event.

The two errors we see are either clock wrap, where the clock jumps by ~36 hours (close to the amount of time it would take the 32-bit counter @32KHz to wrap), or the scheduler stalls as it scheduled the one-shot timer for a ridiculously long time (probably a day or more).
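As a cross-check on the ~36 hour figure, the rollover period of the counter can be computed directly. The sketch below assumes the "32KHz" clock is the usual 32768 Hz; the function name is hypothetical.

```c
#include <stdint.h>

/* Seconds until a counter of the given width, clocked at rate_hz,
 * rolls over. bits must be less than 64 and rate_hz must be non-zero. */
static uint64_t wrap_period_seconds(unsigned bits, uint32_t rate_hz)
{
    return (UINT64_C(1) << bits) / rate_hz;
}
```

For a 32-bit counter at 32768 Hz this gives 131072 seconds, which is 36 hours, 24 minutes and 32 seconds. A single bad read that corrupts the wrap accounting would therefore shift the clock by about the amount observed.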

Continuous: 32k Sync @ 32K
One-shot: GPTIMER1 @ 32K
Errors seen: Clock wrap, Scheduler Stall

Continuous: 32k Sync @ 32K
One-shot: GPTIMER12 @ 32K
Errors seen: Clock wrap

Continuous: GPTIMER1 @ Sys_clk
One-shot: GPTIMER2 @ Sys_clk
Errors seen: None

Continuous: GPTIMER1 @ 32K
One-shot: GPTIMER12 @ 32K
Errors seen: Clock wrap

All of these tests were run with posted mode enabled on the timers. There is some concern that we are running into Advisory 3.1.1.4, but we have not yet been able to confirm that the timer interface clock ever enters a stopped state. Are there known registers we should be looking at to make sure the L4 interface clock never shuts off?

0 Mike Brudevold over 13 years ago in reply to Mike Brudevold

Expert 1365 points

To provoke this condition, we are continuously writing a large file to a USB stick (USB OTG port in HOST mode) and then reading it back.

I just ran an overnight test with GPTIMERs 1 and 12 running at 32K but in non-posted mode. The scheduler never hung and the date is accurate still. This is in contrast to posted mode where we previously saw clock wrap.

So this does lead me to believe we're seeing a similar condition to Advisory 3.1.1.4, but we have CPU idle (wfi) disabled, so the processor never goes to sleep. The interface and functional clock for these timers should always be on, so not sure what conditions would allow L4 to sleep.

Do the values above match the type of corruption seen when investigating Advisory 3.1.1.4?

0 Brad Griffis over 13 years ago in reply to Mike Brudevold

TI__Guru*** 125430 points

I tracked down the spec. It shows 2 different ways to access the CR register:

Single 32-bit access -OR-
Two 16-bit accesses -- least significant 16-bits must be accessed FIRST in this case as that will cause the upper 16 to be simultaneously captured into a temp register such that reading the upper half returns the temp register.

That said, I think the Linux code performing the single 32-bit access is correct. Furthermore, looking at the values you provided it looks to me like the corruption always occurs on the least significant BYTE (i.e. not the low 16 bits). So I think between what you're seeing and what I saw in the spec, that it's safe to assume the issue is not related to performing 16-bit accesses vs 32-bit accesses.
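To illustrate why the low half has to be read first in the two-access scheme, here is a small software model of the capture behaviour described above. It is a sketch of the idea only, not the actual hardware or driver code, and all names in it are hypothetical.

```c
#include <stdint.h>

/* Software model of a 32-bit counter exposed through 16-bit accesses. */
struct latched_counter {
    uint32_t count;     /* free-running counter value */
    uint16_t hi_latch;  /* upper half captured by the last low-half read */
};

/* Reading the low half also captures the upper half into the latch. */
static uint16_t read_lo16(struct latched_counter *c)
{
    c->hi_latch = (uint16_t)(c->count >> 16);
    return (uint16_t)(c->count & 0xFFFFu);
}

/* Reading the high half returns the latch, not the live counter. */
static uint16_t read_hi16(const struct latched_counter *c)
{
    return c->hi_latch;
}

/* Low-first split read; ticks_between models the counter advancing
 * between the two bus accesses. */
static uint32_t read_split(struct latched_counter *c, uint32_t ticks_between)
{
    uint32_t lo = read_lo16(c);
    c->count += ticks_between;
    uint32_t hi = read_hi16(c);
    return (hi << 16) | lo;
}
```

In this model, if the counter is at 0x0001FFFF and ticks over to 0x00020000 between the two accesses, the split read still returns the consistent value 0x0001FFFF. Without the latch, the same sequence would combine the new upper half with the old lower half and produce 0x0002FFFF.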

How often do you see this issue? If you disable power management in the kernel does the issue go away? What are the values of CM_ICLKEN_WKUP and CM_AUTOIDLE_WKUP?

0 Mike Brudevold over 13 years ago in reply to Brad Griffis

Expert 1365 points

We see the issue anywhere between a few minutes into the test to taking 4 hours or more to show up.

On my currently running system (using GPT1/12, not the 32k sync):

CM_ICLKEN_WKUP: 0x0000002B

CM_AUTOIDLE_WKUP: 0x0000023F

If we'd like to go back to the original issue (and what the mainline kernel uses), I can switch back to using the 32k sync clock and GPTIMER1.

0 Mike Brudevold over 13 years ago in reply to Mike Brudevold

Expert 1365 points

Oh, and which power management in the kernel are you referring to? CPU Idle is disabled, I don't think cpufreq is working right now in this kernel, so that's likely not active.

0 Brad Griffis over 13 years ago in reply to Mike Brudevold

TI__Guru*** 125430 points

Looking at the register values you shared previously for ICLKEN and AUTOIDLE it looks like Timer1 is configured such that the ICLK will be auto-idled. Generally speaking the ICLK is only needed for the MPU to access registers of the peripherals so I imagine it's pretty likely that the ICLK is being turned off. In other words, I think you're correct that you are being stung by Advisory 3.1.1.4. That said, it sounds like you've already implemented one workaround which is to switch to one of the GPTimers such that you can use the non-posted mode of operation. I think that's probably the best way around it.

If you switch back to using the 32 kHz timer then I think the only workaround there would be to force the ICLK to stay on all the time, e.g. write CM_AUTOIDLE_WKUP[AUTO_32KSYNC]=0 to put that clock in the "enabled" state rather than the "auto" state. This will increase current consumption and probably force you to make other changes as well in order to get the device into low power modes. So I think what you've already done is the better way to go.
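A sketch of that bit manipulation on a register value is below. The bit positions are assumptions (AUTO_GPT1 at bit 0, AUTO_GPT12 at bit 1, AUTO_32KSYNC at bit 2) and should be verified against the CM_AUTOIDLE_WKUP description in the TRM; the macro and function names are hypothetical.

```c
#include <stdint.h>

/* ASSUMED CM_AUTOIDLE_WKUP bit positions; verify against the TRM. */
#define AUTO_GPT1     (UINT32_C(1) << 0)
#define AUTO_GPT12    (UINT32_C(1) << 1)
#define AUTO_32KSYNC  (UINT32_C(1) << 2)

/* Return reg with the selected auto-idle bits cleared, so that the
 * corresponding interface clocks are no longer automatically idled. */
static uint32_t autoidle_disable(uint32_t reg, uint32_t mask)
{
    return reg & ~mask;
}
```

Starting from the 0x0000023F value reported earlier, clearing only the assumed AUTO_32KSYNC bit gives 0x0000023B, and clearing all three of the low bits gives 0x00000238. On real hardware this would be a read-modify-write of the memory-mapped register rather than a pure function.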

0 Mike Brudevold over 13 years ago in reply to Brad Griffis

Expert 1365 points

Just to make sure we fully understand the issue, I want to get the 32k sync timer and GPTIMER1 working properly. I reset the configuration to use the 32k sync timer and GPTIMER1. I also changed CM_AUTOIDLE_WKUP to 0x00000238 (after Linux booted; I haven't hunted through the source enough to know where it gets set). It took about 10 minutes before the scheduler messed up. I verified this was still the value in that register after the scheduler stalled. Any more ideas?

0 Brad Griffis over 13 years ago in reply to Mike Brudevold

TI__Guru*** 125430 points

After it fails can you please look at CM_AUTOIDLE_WKUP and CM_ICLKEN_WKUP again? It's conceivable that these registers might get modified at run-time, i.e. that it overwrites your quick hack.

0 Mike Brudevold over 13 years ago in reply to Brad Griffis

Expert 1365 points

CM_AUTOIDLE_WKUP: 0x00000238
CM_ICLKEN_WKUP: 0x0000002F

I'll work on getting it out of the kernel code so that no one sets it. But it doesn't look like anything changed it.

0 Mike Brudevold over 13 years ago in reply to Mike Brudevold

Expert 1365 points

I also noticed each of the GPTIMERS had its own autoidle bit set. I'm running now with that disabled on GPTIMER1. We'll see how it goes.

0 Mike Brudevold over 13 years ago in reply to Mike Brudevold

Expert 1365 points

Nope, that didn't help. Are these the only registers that really matter to this issue? If so, I guess I'll have to hunt down where they get set in the kernel to make sure they aren't being set later on.

0 Brad Griffis over 13 years ago in reply to Mike Brudevold

TI__Guru*** 125430 points

I'm pretty sure those 2 registers are the only ones that affect those clocks. They're derived straight from SYS_CLK so there's not much in between. (I didn't see anything else in either the TRM or Clock Tree Tool.) I've been grepping the code and only found one place where that register was being written. Perhaps you could put some code into the timer read function to check those registers before reading the register, e.g. print an error if it ever sees that ICLK isn't enabled or if AUTOIDLE has been turned back on. I wonder if perhaps it's not related to that Advisory after all. I agree with you that it sure feels that way.
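The suggested sanity check could look roughly like the sketch below, which takes the two register values as plain arguments so it can be exercised off-target. The bit position for the 32K sync timer (bit 2 in both registers) is an assumption to be verified against the TRM, and every name here is hypothetical.

```c
#include <stdint.h>

/* ASSUMED position of the 32K sync timer bit in both CM_ICLKEN_WKUP
 * and CM_AUTOIDLE_WKUP; verify against the TRM. */
#define WKUP_32KSYNC_BIT (UINT32_C(1) << 2)

enum clk_fault {
    CLK_OK            = 0,
    CLK_ICLK_DISABLED = 1 << 0,  /* interface clock not enabled */
    CLK_AUTOIDLE_ON   = 1 << 1   /* auto-idle has been turned back on */
};

/* Return a bitmask of clk_fault flags for the given register snapshot. */
static unsigned check_32ksync_clock(uint32_t iclken, uint32_t autoidle)
{
    unsigned faults = CLK_OK;

    if (!(iclken & WKUP_32KSYNC_BIT))
        faults |= CLK_ICLK_DISABLED;
    if (autoidle & WKUP_32KSYNC_BIT)
        faults |= CLK_AUTOIDLE_ON;
    return faults;
}
```

In the timer read path, a non-zero return value would be the trigger for printing an error. Under the assumed bit layout, the 0x0000002F / 0x00000238 pair reported above passes the check, while the earlier 0x0000002B / 0x0000023F pair (captured while running on GPT1/12 rather than the 32k sync timer) raises both flags.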

Are you able to reproduce this issue on an EVM? It sounds like this issue causes some major issues (scheduler stops?!). That said, why aren't we seeing tons of people having this problem? There must be something specific about your configuration that's causing the issue. Trying to reproduce the issue on the EVM might help you narrow down the issue. If nothing else, if you reproduce the issue on the EVM then it makes it easier for us to help you fix it.

Brad

0 Mike Brudevold over 13 years ago in reply to Brad Griffis

Expert 1365 points

The one-shot timer for the scheduler gets set to some long value that takes a day or so to resolve -- the system always comes back up when the timer finally gets there!

So far we've seen it on our omap35x torpedo. I think we saw it in the past on an omap35x LV-SOM. I will have to hunt down an EVM to see if we can reproduce it there. We tested a dm37x SOM and did not see the issue and since it is the same code base, it might be specific to the hardware.

We didn't notice this problem until we were debugging some USB issues. Interestingly, we can also reproduce it on a 2.6.32 kernel, so it is a little surprising that we are only now seeing it during tests of our 3.0 kernel. The script below provokes it. Of note, we have never seen this issue when simply reading from a USB stick, likely because reading does not schedule a timeout the way writing does when the page cache is full of dirty pages to write out. Any process that doesn't rely on setting a timer happily runs while the scheduler is out to lunch.

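#!/bin/sh
# Stress test: repeatedly write a large file to the USB stick, then read it back.
# WARNING: this writes directly to /dev/sda and destroys its contents.

FILE=/dev/zero
BS=1M
COUNT=110

CYCLE=0
while true
do
        # POSIX arithmetic; 'let' and 'echo -e' are not available in every /bin/sh
        CYCLE=$((CYCLE + 1))
        printf '\n\nCycle: %s\n' "$CYCLE"
        date

        echo "dd from $FILE to usb..."
        dd if="$FILE" of=/dev/sda bs="$BS" count="$COUNT"
        echo "read back..."
        dd if=/dev/sda bs="$BS" count="$COUNT" > /dev/null
done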

0 Mike Brudevold over 13 years ago in reply to Mike Brudevold

Expert 1365 points

I have also modified x-loader to add an infinite loop that does nothing but read the 32k sync timer count register. Although we see occasional wrong values when running the Linux kernel, I have yet to see a wrong value after running this test for a few days (the values always increment, never decrement, save for the wrap at 32 bits). The initial test read the values as fast as possible; the latest test has a slight delay loop between reads. There are many obvious register differences, and only a minimal peripheral set is active, but it does make us wonder whether the extra bus traffic from USB is provoking the issue.

0 Brad Griffis over 13 years ago in reply to Mike Brudevold

TI__Guru*** 125430 points

Great tests. Yes, it seems there's something specific to Linux (or maybe just your Linux configuration) that's causing the issue to happen. FYI, Monday/Tuesday are TI holidays so I would be surprised if you heard anything before Wednesday. I'm not sure what your holiday schedule looks like, but if you have any updates please let us know. Happy holidays!

0 Brad Griffis over 13 years ago in reply to Brad Griffis

TI__Guru*** 125430 points

Mike/Chris,

Sorry that there's been no other TI activity on this thread. I've been traveling a lot and it's not letting up, so if the issue is still on-going I probably need to find someone else in TI to help out. Can you please provide an update on the current status?

Brad

0 Chris Elmquist over 13 years ago in reply to Brad Griffis

Prodigy 105 points

Hi Brad. Thanks for checking in.

Although we implemented a work-around for our immediate customer problem relating to this issue, the approach we took is not a good general solution. We ended up changing the kernel scheduling mechanism to use GPTIMER1 and GPTIMER2 and abandoned the 32K SYNC TIMER. The result of this is that we cannot achieve low power states like we would have been able to with the 32K SYNC TIMER as the scheduling timer. Lower power was not an issue for this particular customer so this solution works in this case. However for the general case and many of our customer projects, this will continue to be an issue. We would like to continue to pursue what is going on here and why it appears we cannot read the 32K SYNC TIMER reliably. Our products group will eventually get this issue into scope but for now, since the immediate customer issue is behind us, it is not a priority. However if you can continue to research and perhaps explore the relationship to IDLE modes or other subsystems that could affect the 32K SYNC TIMER access, this could help greatly for when we return to the issue.

Chris

0 Jon Hunter over 13 years ago in reply to Mike Brudevold

TI__Expert 3245 points

Hi Mike,

I have been looking into a similar problem on our OMAP3/4 devices. It definitely sounds like i103 is the problem here.

I know that it has been a while since you posted this and you have probably moved on with life. However, I am curious to know a couple of things ...

1. Should it be possible to see this on any omap3 device if left running for a day or more?
2. How long does it take to occur with your script?
3. Have you tried any recent kernels such as v3.4/v3.5?

Cheers
Jon

0 Mike Brudevold over 13 years ago in reply to Jon Hunter

Expert 1365 points

Hi Jon,

We saw the issue on every OMAP3530 we tested. If a device is just left running, I would guess the only possible visible defect would be if the date changed. We didn't specifically test leaving a unit powered on for days. I do recall another project that had similar issues using a 2.6.32 kernel on an OMAP3530, and that one would take a few days to exhibit issues. That particular project was running its normal application and not a test script. We hadn't yet figured out the issue with the 32k clock at that time, and I think a switch to GPTIMER12 was sufficient to make the issues go away.

With our script, the issue could take anywhere from a few minutes to overnight. I don't think I ever saw it take more than 24 hours.

We have not tried any more recent kernels to my knowledge.

Mike

0 Jon Hunter over 13 years ago in reply to Mike Brudevold

TI__Expert 3245 points

Hi Mike,

Thanks for the feedback. I will give your script a try. Did you ever try to reproduce with a SD card? I am curious if you tried any other storage devices? If not can you confirm your USB drive was connected to the USB OTG port or host port (if you happen to recall)?

Cheers
Jon

0 Mike Brudevold over 13 years ago in reply to Jon Hunter

Expert 1365 points

Hi Jon,

I do not recall if we tried SD cards. We might have. The particular failure case was an upgrade process that used the USB port heavily, so that is what we focused on.

The USB drive was connected through a TPS65950 on HSUSB0.

Mike

0 Jon Hunter over 13 years ago in reply to Mike Brudevold

TI__Expert 3245 points

Hi Mike,

Thanks, that tells me it is the OTG port. I will give it a try. By the way, when it failed, I am assuming that the entire system was hung and that is the symptom?

Jon

0 Mike Brudevold over 13 years ago in reply to Jon Hunter

Expert 1365 points

No, the entire system was not hung. There are two symptoms:

The system date/time has advanced by ~36 hours
Any process that was waiting on a timer will likely be stuck -- I think this resolves after a while (few days?), but it's been too long to remember the exact mechanics of the scheduler

With this in mind, your console could still be functional, but if a foreground process was waiting on a timer, it will appear unresponsive. What we found useful was SysRq 'q' (Display all active high-resolution timers and clock sources) and 't' (Output a list of current tasks and their information to the console). The timers will show that events should have fired but did not. The task output will show you that tasks are likely waiting on a timer.

0 Jon Hunter over 13 years ago in reply to Mike Brudevold

TI__Expert 3245 points

Hi Mike,

Quick update: I have been running your script for 4 days on my beagle-board, but I have not observed any problems so far. We have had a similar issue on omap4 and we have a workaround in place. I don't see any workarounds in place for omap3 and I am double checking on this at the moment. By the way, I am testing on the latest v3.5 kernel.

Cheers
Jon
