Reading 32KHz Sync Timer on OMAP3530

Other Parts Discussed in Thread: OMAP3530, TPS65950

We are seeing hangs in USB and other subsystems on Linux 2.6.32, 3.0 and 3.2 running on the OMAP3530. When a hang occurs, processes are stuck waiting on high resolution timers whose expiry times are huge negative values.

Investigation reveals that two successive calls to the function omap34xx_32k_read() in .../arch/arm/plat-omap will sometimes return a second value that is less than the first, meaning time moved backwards. At other times the register moves forward in a huge, non-linear jump. The 32K Sync Timer register is supposed to be a monotonically increasing, read-only register clocked at 32KHz.

The TRM for the OMAP3530 states in section 16.6.1.1 that the 32K Sync Timer count register needs to be read with three separate 16-bit transfers, first reading the 16-bit LSB and then two 16-bit reads of the MSB.   None of the Linux kernels I listed do it this way.  They all do a single 32-bit read of this register.

Can someone confirm whether the requirement to read the 32K Sync Timer with three separate 16-bit transfers is real and, if so, comment on why most of the Linux kernels with OMAP support do not do it that way? Or can someone confirm that the requirement is not real, that a single 32-bit read should be OK, and then suggest why we see the register move backwards and/or make huge jumps?

Chris Elmquist

LogicPD

  • oops.  correcting path to offending function,

               .../arch/arm/plat-omap/counter_32k.c

  • Chris,

    Good question.  I'm trying to track down the internal specs as well as the IP owner.  It might take a bit given we're in the midst of the holidays.  The faster approach is probably to just make the change in the code to implement the reads as discussed in 16.6.1.1.  That's a pretty bizarre procedure -- in particular having to access the upper 16 bits twice!

    Can you share some of the specific values you've read where it seemingly goes back in time?  I'm trying to gauge if the issue is related to the 16-bit boundary inside the register, i.e. do the issues occur when the lower 16 bits are transitioning from 0xFFFF to 0x0000.

    As a side note, I've seen some 64-bit timers where reading the lower 32 bits "captures" the upper 32 such that you're able to get a single consistent value.  Perhaps there's something similar implemented here, but using the upper and lower 16 bits.  It still seems a bit odd that you would need to read the upper 16 bits twice!

    Brad

  • Here are some values we see (r0 is read first, r1 is then read immediately afterwards):

    [ 1709.722351] omap34xx_32k_read, r1:0x03a336f0, r0:0x03a336f6

    [ 2769.035339] omap34xx_32k_read, r1:0x05b4df00, r0:0x05b4df3e

    [ 3899.328308] omap34xx_32k_read, r1:0x07ea0480, r0:0x07ea04be

    [ 7530.898132] omap34xx_32k_read, r1:0x0f01cd70, r0:0x0f01cd72
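
One quick way to see which bits disagree in the four pairs above is to XOR each first read with the second. The sketch below is only an illustration: the values are copied from the log lines, while the helper names (`differing_bits`, `summarize`) are made up for this example.

```python
# Read pairs copied from the log lines above, as (r0 read first, r1 read second).
PAIRS = [
    (0x03A336F6, 0x03A336F0),
    (0x05B4DF3E, 0x05B4DF00),
    (0x07EA04BE, 0x07EA0480),
    (0x0F01CD72, 0x0F01CD70),
]


def differing_bits(r0: int, r1: int) -> int:
    """Return a mask of the bit positions where the two reads disagree."""
    return (r0 ^ r1) & 0xFFFFFFFF


def summarize(pairs):
    """Describe each pair: which bits differ and how they differ."""
    rows = []
    for r0, r1 in pairs:
        diff = differing_bits(r0, r1)
        rows.append({
            "diff": diff,
            # True when every differing bit lies in bits 7..0.
            "low_byte_only": (diff & ~0xFF) == 0,
            # True when every differing bit is a 1 in the first read.
            "extra_ones_in_first": (r0 & diff) == diff,
            # How far the second read appears to step backwards, in ticks.
            "backward_ticks": r0 - r1,
        })
    return rows


if __name__ == "__main__":
    for (r0, r1), row in zip(PAIRS, summarize(PAIRS)):
        print(f"r0={r0:#010x} r1={r1:#010x} diff={row['diff']:#04x} "
              f"low_byte_only={row['low_byte_only']} "
              f"extra_ones_in_first={row['extra_ones_in_first']}")
```

For these four samples the differing bits all fall in the least significant byte, and in every case they are set in the first read and clear in the second.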

    Some comments from Chris:

    Chris35513 said:

    We tried the 16-bit read approach as discussed in the TRM and this did not pan out because we were seeing the 16-bit LSB coming up wrong similar to what the values above show.

    It appears to us that the first read of the 32K Sync Timer can have a large number of 1-bits errantly set and only after second or third read, have those bits settled out to the zeros that they should be.

    This suggests that we are facing the advisory requiring a delay before reading the register, except that we believe we are not in IDLE mode.

    We also do not understand why all of the Linux code from at least kernel 2.6.32 does not follow the TRM's guidance on how to read this register, if indeed it must be read in 16-bit chunks.

    Of note, here are some combinations we have tried and their results.  The Linux kernel needs two timers, one is continuously running and is the basis for the time and date of the system.  The other timer is used in one-shot mode to schedule the next event.

    The two errors we see are either clock wrap, where the clock jumps by ~36 hours (close to the amount of time it would take the 32-bit counter @32KHz to wrap), or the scheduler stalls as it scheduled the one-shot timer for a ridiculously long time (probably a day or more).
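
The ~36 hour figure can be checked with a little arithmetic. The sketch below assumes a nominal 32768 Hz clock and assumes that elapsed time is derived from the masked (modulo 2^32) difference of two successive reads, as timekeeping code commonly does; the function names are made up for this example. Under those assumptions, a second read that lands only a few ticks behind the first looks like almost a full counter wrap.

```python
COUNTER_BITS = 32
CLOCK_HZ = 32768  # assumed nominal rate of the 32 kHz clock
MASK = (1 << COUNTER_BITS) - 1


def wrap_period_hours(bits: int = COUNTER_BITS, hz: int = CLOCK_HZ) -> float:
    """Time for a free-running counter of the given width to wrap, in hours."""
    return (1 << bits) / hz / 3600


def masked_delta(previous: int, current: int) -> int:
    """Elapsed ticks as unsigned modular arithmetic would compute them."""
    return (current - previous) & MASK


if __name__ == "__main__":
    print(f"wrap period: {wrap_period_hours():.2f} hours")
    # First pair from the log earlier in the thread: the second read is 6 ticks behind.
    delta = masked_delta(0x03A336F6, 0x03A336F0)
    print(f"apparent elapsed: {delta} ticks, "
          f"{delta / CLOCK_HZ / 3600:.2f} hours")
```

Both numbers come out at roughly 36.4 hours, which is consistent with the clock jump described above.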

    Continuous timer      One-shot timer        Errors seen
    32k Sync @ 32K        GPTIMER1 @ 32K        Clock wrap, Scheduler Stall
    32k Sync @ 32K        GPTIMER12 @ 32K       Clock wrap
    GPTIMER1 @ Sys_clk    GPTIMER2 @ Sys_clk    None
    GPTIMER1 @ 32K        GPTIMER12 @ 32K       Clock wrap

    All of these tests were run with posted mode enabled on the timers.  There is some concern that we are running into Advisory 3.1.1.4, but we have not yet been able to confirm that the timer interface clock ever enters a stopped state.  Are there known registers we should be looking at to make sure the L4 interface clock never shuts off?

  • To provoke this condition, we are continuously writing a large file to a USB stick (USB OTG port in HOST mode) and then reading it back.

    I just ran an overnight test with GPTIMERs 1 and 12 running at 32K but in non-posted mode.  The scheduler never hung and the date is still accurate.  This is in contrast to posted mode, where we previously saw clock wrap.

    So this does lead me to believe we're seeing a similar condition to Advisory 3.1.1.4, but we have CPU idle (wfi) disabled, so the processor never goes to sleep.  The interface and functional clock for these timers should always be on, so not sure what conditions would allow L4 to sleep.

    Do the values above match the type of corruption seen when investigating Advisory 3.1.1.4?

  • I tracked down the spec.  It shows 2 different ways to access the CR register:

    1. Single 32-bit access -OR-
    2. Two 16-bit accesses -- least significant 16-bits must be accessed FIRST in this case as that will cause the upper 16 to be simultaneously captured into a temp register such that reading the upper half returns the temp register.

    That said, I think the Linux code performing the single 32-bit access is correct.  Furthermore, looking at the values you provided it looks to me like the corruption always occurs on the least significant BYTE (i.e. not the low 16 bits).  So I think between what you're seeing and what I saw in the spec, that it's safe to assume the issue is not related to performing 16-bit accesses vs 32-bit accesses.
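
The two-access sequence in item 2 can be pictured with a toy model. This is only a sketch of the capture behaviour as described above, not something taken from the internal spec: the class and method names are hypothetical, and `read_high16_live` exists purely to show what an uncaptured read of the upper half could return if the counter carries between the two accesses.

```python
class SyncCounterModel:
    """Toy model of a 32-bit counter read through two 16-bit accesses."""

    def __init__(self, value: int = 0):
        self.value = value & 0xFFFFFFFF
        self._captured_high = 0

    def tick(self, n: int = 1) -> None:
        """Advance the counter by n ticks, wrapping at 32 bits."""
        self.value = (self.value + n) & 0xFFFFFFFF

    def read_low16(self) -> int:
        """Read the low half and capture the high half at the same instant."""
        self._captured_high = self.value >> 16
        return self.value & 0xFFFF

    def read_high16(self) -> int:
        """Return the high half captured by the last low-half read."""
        return self._captured_high

    def read_high16_live(self) -> int:
        """Hypothetical uncaptured read of the high half, for contrast."""
        return self.value >> 16


if __name__ == "__main__":
    counter = SyncCounterModel(0x0001FFFF)
    low = counter.read_low16()
    counter.tick()  # the low half carries into the high half between accesses
    captured = (counter.read_high16() << 16) | low
    live = (counter.read_high16_live() << 16) | low
    print(f"captured: {captured:#010x}, live: {live:#010x}")
```

In the model, the captured read reassembles the value the counter held at the first access, while the live read is off by 0x10000.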

    How often do you see this issue?  If you disable power management in the kernel does the issue go away?  What are the values of CM_ICLKEN_WKUP and CM_AUTOIDLE_WKUP?

  • We see the issue anywhere between a few minutes into the test to taking 4 hours or more to show up.

    On my currently running system (using GPT1/12, not the 32k sync):

    CM_ICLKEN_WKUP: 0x0000002B

    CM_AUTOIDLE_WKUP: 0x0000023F

    If we'd like to go back to the original issue (and what the mainline kernel uses), I can switch back to using the 32k sync clock and GPTIMER1.

  • Oh, and which power management in the kernel are you referring to?  CPU Idle is disabled, I don't think cpufreq is working right now in this kernel, so that's likely not active.

  • Looking at the register values you shared previously for ICLKEN and AUTOIDLE it looks like Timer1 is configured such that the ICLK will be auto-idled.  Generally speaking the ICLK is only needed for the MPU to access registers of the peripherals so I imagine it's pretty likely that the ICLK is being turned off.  In other words, I think you're correct that you are being stung by Advisory 3.1.1.4.  That said, it sounds like you've already implemented one workaround which is to switch to one of the GPTimers such that you can use the non-posted mode of operation.  I think that's probably the best way around it.

    If you switch back to using the 32 kHz timer then I think the only workaround there would be to force the ICLK to stay on all the time, e.g. write CM_AUTOIDLE_WKUP[AUTO_32KSYNC]=0 to put that clock in the "enabled" state rather than the "auto" state.  This will increase current consumption and probably force you to make other changes as well in order to get the device into low power modes.  So I think what you've already done is the better way to go.

  • Just to make sure we fully understand the issue, I want to get the 32k sync timer and GPTIMER1 working properly.  I reset the configuration to use the 32k sync timer and GPTIMER1.  I also changed CM_AUTOIDLE_WKUP to 0x00000238 (after Linux booted; I haven't hunted through the source enough to know where it gets set).  It took about 10 minutes before the scheduler messed up.  I verified this was still the value in that register after the scheduler stalled.  Any more ideas?
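
To make the register change concrete, the sketch below compares the CM_AUTOIDLE_WKUP value reported earlier (0x0000023F) with the new one (0x00000238). The bit positions are an assumption that the thread does not state: bit 0 for GPTIMER1, bit 1 for GPTIMER12 and bit 2 for the 32k sync timer. Only the name AUTO_32KSYNC appears in the thread; the other two names are placeholders, and all three positions should be checked against the TRM.

```python
# ASSUMED bit positions, not stated in the thread; verify against the TRM.
AUTO_BITS = {
    0: "AUTO_GPT1",      # placeholder name
    1: "AUTO_GPT12",     # placeholder name
    2: "AUTO_32KSYNC",   # name used in the thread
}


def auto_idle_enabled(value: int) -> list:
    """List the known modules whose interface clock is in the 'auto' state."""
    return [name for bit, name in AUTO_BITS.items() if value & (1 << bit)]


def cleared_bits(before: int, after: int) -> list:
    """Bit numbers that were 1 in 'before' and are 0 in 'after'."""
    changed = before & ~after
    return [bit for bit in range(32) if changed & (1 << bit)]


if __name__ == "__main__":
    before, after = 0x0000023F, 0x00000238
    print("auto-idle before:", auto_idle_enabled(before))
    print("auto-idle after: ", auto_idle_enabled(after))
    print("bits cleared:    ", cleared_bits(before, after))
```

If the assumed layout is right, writing 0x238 takes all three timer interface clocks out of the "auto" state, not only the 32k sync timer.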

  • After it fails can you please look at CM_AUTOIDLE_WKUP and CM_ICLKEN_WKUP again?  It's conceivable that these registers might get modified at run-time, i.e. that it overwrites your quick hack.

  • CM_AUTOIDLE_WKUP: 0x00000238
    CM_ICLKEN_WKUP: 0x0000002F

    I'll work on getting it out of the kernel code so that no one sets it. But it doesn't look like anything changed it.

  • I also noticed each of the GPTIMERS had its own autoidle bit set.  I'm running now with that disabled on GPTIMER1.  We'll see how it goes.

  • Nope, that didn't help.  Are these the only registers that really matter to this issue?  If so, I guess I'll have to hunt down where they get set in the kernel to make sure they aren't being set later on.

  • I'm pretty sure those 2 registers are the only ones that affect those clocks.  They're derived straight from SYS_CLK so there's not much in between. (I didn't see anything else in either the TRM or Clock Tree Tool.)  I've been grepping the code and only found one place where that register was being written.  Perhaps you could put some code into the timer read function to check those registers before reading the register, e.g. print an error if it ever sees that ICLK isn't enabled or if AUTOIDLE has been turned back on.  I wonder if perhaps it's not related to that Advisory after all.  I agree with you that it sure feels that way.

    Are you able to reproduce this issue on an EVM?  It sounds like this issue causes some major issues (scheduler stops?!).  That said, why aren't we seeing tons of people having this problem?  There must be something specific about your configuration that's causing the issue.  Trying to reproduce the issue on the EVM might help you narrow down the issue.  If nothing else, if you reproduce the issue on the EVM then it makes it easier for us to help you fix it.

    Brad

  • The one-shot timer for the scheduler gets set to some long value that takes a day or so to resolve -- the system always comes back up when the timer finally gets there!

    So far we've seen it on our omap35x torpedo.  I think we saw it in the past on an omap35x LV-SOM.  I will have to hunt down an EVM to see if we can reproduce it there.  We tested a dm37x SOM and did not see the issue and since it is the same code base, it might be specific to the hardware.

    We didn't notice this problem until we were debugging some USB issues.  It's interesting that we can reproduce it on a 2.6.32 kernel, so it is a little surprising that we are only now seeing it during tests of our 3.0 kernel.  The script below provokes it.  Of note, we have never seen this issue when we are simply reading from a USB stick, likely because reading does not schedule a timeout the way writing does when the page cache is full of dirty pages to write out.  Any process that doesn't rely on setting a timer happily runs while the scheduler is out to lunch.

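    #!/bin/sh
    # Stress test: repeatedly write a large file to the USB stick, then read it back.
    # WARNING: this overwrites the start of /dev/sda.

    FILE=/dev/zero
    BS=1M
    COUNT=110

    CYCLE=0
    while true
    do
        # POSIX arithmetic and printf, so the script does not depend on
        # the shell supporting 'let' or 'echo -e'.
        CYCLE=$((CYCLE + 1))
        printf '\n\nCycle: %s\n' "$CYCLE"
        date

        echo "dd from data0 to usb..."
        dd if="$FILE" of=/dev/sda bs="$BS" count="$COUNT"
        echo "read back..."
        dd if=/dev/sda bs="$BS" count="$COUNT" > /dev/null
    done
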
  • I have also modified x-loader to add in an infinite loop where all it does is read the 32k sync timer count register.  Despite the fact that we can see occasional wrong values when running the Linux kernel, I have yet to see a wrong value in running this test for a few days (the values always increment, never decrement -- save for the wrap at 32-bits).  The initial test read the values as fast as possible, the latest test has a slight delay loop in between reads.  There are many obvious register differences and the fact that a minimal peripheral set is active, but it does make us wonder if the extra traffic on the bus from USB is provoking issues.
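
The check performed by that read loop (values only ever increase, apart from the wrap at 32 bits) can be sketched as follows. This is not the x-loader code, which is not shown in the thread; the function name and the `max_step` threshold are made up for this example, and the threshold would need tuning to the actual interval between reads.

```python
MASK32 = 0xFFFFFFFF
MAX_STEP = 32768  # hypothetical threshold: one second of ticks at 32768 Hz


def find_anomalies(samples, max_step: int = MAX_STEP):
    """Return (index, previous, current) for each suspicious step.

    The difference is taken modulo 2**32, so a genuine wrap from
    0xFFFFFFFF to 0 yields a small positive step and is not flagged.
    A backward step shows up as a huge modular difference, just like
    a large forward jump, so one comparison catches both.
    """
    anomalies = []
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        delta = (cur - prev) & MASK32
        if delta > max_step:
            anomalies.append((i, prev, cur))
    return anomalies


if __name__ == "__main__":
    wrap_ok = [0xFFFFFFF0, 0xFFFFFFFE, 0x00000005]   # legitimate wrap
    backward = [0x03A336F0, 0x03A336F6, 0x03A336F0]  # last read steps back
    print(find_anomalies(wrap_ok))
    print(find_anomalies(backward))
```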

  • Great tests.  Yes, it seems there's something specific to Linux (or maybe just your Linux configuration) that's causing the issue to happen.  FYI, Monday/Tuesday are TI holidays so I would be surprised if you heard anything before Wednesday.  I'm not sure what your holiday schedule looks like, but if you have any updates please let us know.  Happy holidays!

  • Mike/Chris,

    Sorry that there's been no other TI activity on this thread.  I've been traveling a lot and it's not letting up, so if the issue is still on-going I probably need to find someone else in TI to help out.  Can you please provide an update on the current status?

    Brad

  • Hi Brad.  Thanks for checking in.

    Although we implemented a work-around for our immediate customer problem relating to this issue, the approach we took is not a good general solution.  We ended up changing the kernel scheduling mechanism to use GPTIMER1 and GPTIMER2 and abandoned the 32K SYNC TIMER.  The result of this is that we cannot achieve low power states like we would have been able to with the 32K SYNC TIMER as the scheduling timer.   Lower power was not an issue for this particular customer so this solution works in this case.  However for the general case and many of our customer projects, this will continue to be an issue.   We would like to continue to pursue what is going on here and why it appears we cannot read the 32K SYNC TIMER reliably.  Our products group will eventually get this issue into scope but for now, since the immediate customer issue is behind us, it is not a priority.   However if you can continue to research and perhaps explore the relationship to IDLE modes or other subsystems that could affect the 32K SYNC TIMER access, this could help greatly for when we return to the issue.

    Chris

  • Hi Mike,

    I have been looking into a similar problem on our OMAP3/4 devices. It definitely sounds like i103 is the problem here.

    I know that it has been a while since you posted this and you have probably moved on with life. However, I am curious to know a couple of things ...

    1. Should it be possible to see this on any omap3 device if left running for a day or more?
    2. How long does it take to occur with your script?
    3. Have you tried any recent kernels such as v3.4/v3.5?

    Cheers
    Jon

  • Hi Jon,

    We saw the issue on every OMAP3530 we tested.  If a device is just left running, I would guess the only possible visible defect would be if the date changed.  We didn't specifically test leaving a unit powered on for days.  I do recall another project that had similar issues using a 2.6.32 kernel on an OMAP3530, and that one would take a few days to exhibit issues.  That particular project was running its normal application and not a test script.  We hadn't yet figured out the issue with the 32k clock at that time, and I think a switch to GPTIMER12 was sufficient to make the issues go away.

    With our script, the issue could take anywhere from a few minutes to overnight.  I don't think I ever saw it take more than 24 hours.

    We have not tried any more recent kernels to my knowledge.

    Mike

  • Hi Mike,

    Thanks for the feedback. I will give your script a try. Did you ever try to reproduce with a SD card? I am curious if you tried any other storage devices? If not can you confirm your USB drive was connected to the USB OTG port or host port (if you happen to recall)?

    Cheers
    Jon

  • Hi Jon,

    I do not recall if we tried SD cards.  We might have.  The particular failure case was an upgrade process that used the USB port heavily, so that is what we focused on.

    The USB drive was connected through a TPS65950 on HSUSB0.

    Mike

  • Hi Mike,

    Thanks, that tells me it is the OTG port. I will give it a try. By the way, I am assuming that when it failed the entire system was hung, and that this is the symptom.

    Jon

  • No, the entire system was not hung.  There are two symptoms:

    • The system date/time has advanced by ~36 hours
    • Any process that was waiting on a timer will likely be stuck -- I think this resolves after a while (few days?), but it's been too long to remember the exact mechanics of the scheduler

    With this in mind, your console could still be functional, but if a foreground process was waiting on a timer, it will appear unresponsive.  What we found useful was SysRq 'q' (Display all active high-resolution timers and clock sources) and 't' (Output a list of current tasks and their information to the console).  The timers will show that events should have fired but did not.  The task output will show you that tasks are likely waiting on a timer.

  • Hi Mike,

    Quick update: I have been running your script for 4 days on my beagle-board but I have not observed any problems so far. We have had a similar issue on omap4 and we have a workaround in place. I don't see any workarounds in place for omap3 and I am double checking on this at the moment. By the way, I am testing on the latest v3.5 kernel.

    Cheers
    Jon