We are seeing hangs in USB and other subsystems on Linux 2.6.32, 3.0 and 3.2 on the OMAP3530. When these hangs occur, we see processes stuck waiting on high-resolution timers that have huge negative expiry endpoints.
Investigation reveals that two successive calls to the function omap34xx_32k_read() in .../arch/arm/plat-omap will sometimes return a second value that is less than the first, meaning time moved backwards. At other times the register moves forward in a huge, non-linear jump. The 32K Sync Timer register is supposed to be a monotonically increasing, read-only register clocked at 32 kHz.
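To illustrate why even a small backward step matters, here is a minimal sketch in C. It assumes the code consuming the counter computes elapsed ticks by unsigned 32-bit subtraction, which is a common wrap-tolerant idiom; whether every path in the kernels listed above does exactly this is an assumption. The names ticks_elapsed and ticks_to_seconds are hypothetical and do not come from the kernel source.

```c
#include <stdint.h>

#define SYNC32K_HZ 32768u  /* nominal rate of the 32K Sync Timer */

/* Elapsed ticks between two reads, computed modulo 2^32 so that a
 * normal counter wrap still yields a small positive result. */
static uint32_t ticks_elapsed(uint32_t prev, uint32_t cur)
{
    return cur - prev;
}

/* Whole seconds represented by a tick count at 32768 Hz. */
static uint32_t ticks_to_seconds(uint32_t ticks)
{
    return ticks / SYNC32K_HZ;
}
```

With prev = 1000 and cur = 999, ticks_elapsed() returns 4294967295, which at 32768 Hz is just under 131072 seconds, or roughly 36 hours. Under this assumption a one-tick backward read is indistinguishable from a forward jump of about a day and a half.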
The TRM for the OMAP3530 states in section 16.6.1.1 that the 32K Sync Timer count register needs to be read with three separate 16-bit transfers, first reading the 16-bit LSB and then two 16-bit reads of the MSB. None of the Linux kernels I listed do it this way. They all do a single 32-bit read of this register.
Can someone confirm whether reading the 32K Sync Timer with three separate 16-bit transfers is a real requirement and, if so, comment on why most of the Linux kernels with OMAP support are not doing it that way? Or can someone confirm that the requirement is not real and that a single 32-bit read should be OK, and then suggest why we see the register move backwards and/or make huge jumps?
Chris Elmquist
LogicPD
I have also modified x-loader to add in an infinite loop where all it does is read the 32k sync timer count register. Despite the fact that we can see occasional wrong values when running the Linux kernel, I have yet to see a wrong value in running this test for a few days (the values always increment, never decrement -- save for the wrap at 32-bits). The initial test read the values as fast as possible, the latest test has a slight delay loop in between reads. There are many obvious register differences and the fact that a minimal peripheral set is active, but it does make us wonder if the extra traffic on the bus from USB is provoking issues.
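For anyone wanting to repeat this kind of test, the check applied to each pair of successive reads can be sketched as below. This is not the code that was added to x-loader; classify_step, the step_kind values and the max_step threshold are hypothetical names chosen for illustration. The subtraction is done modulo 2^32 so that the wrap at 32 bits is treated as a normal increment.

```c
#include <stdint.h>

enum step_kind { STEP_OK, STEP_BACKWARD, STEP_JUMP };

/* Classify the change between two successive counter reads.
 * max_step is the largest forward delta (in ticks) considered normal
 * for the polling interval; the caller picks it. */
static enum step_kind classify_step(uint32_t prev, uint32_t cur,
                                    uint32_t max_step)
{
    uint32_t delta = cur - prev;   /* modulo 2^32: wrap is just +1 */

    if (delta <= max_step)
        return STEP_OK;            /* includes delta == 0 on fast reads */
    if (delta > 0x7FFFFFFFu)
        return STEP_BACKWARD;      /* nearer to a negative step */
    return STEP_JUMP;              /* forward, but larger than expected */
}
```

A read of 0 following 0xFFFFFFFF is classified as STEP_OK, while 100 following 101 is STEP_BACKWARD. The split at half the counter range is a design choice: a genuine forward jump of more than 2^31 ticks would be reported as a backward step.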
Great tests. Yes, it seems there's something specific to Linux (or maybe just your Linux configuration) that's causing the issue to happen. FYI, Monday/Tuesday are TI holidays so I would be surprised if you heard anything before Wednesday. I'm not sure what your holiday schedule looks like, but if you have any updates please let us know. Happy holidays!
Mike/Chris,
Sorry that there's been no other TI activity on this thread. I've been traveling a lot and it's not letting up, so if the issue is still ongoing I probably need to find someone else in TI to help out. Can you please provide an update on the current status?
Brad
Hi Brad. Thanks for checking in.
Although we implemented a work-around for our immediate customer problem relating to this issue, the approach we took is not a good general solution. We ended up changing the kernel scheduling mechanism to use GPTIMER1 and GPTIMER2 and abandoned the 32K SYNC TIMER. The result of this is that we cannot achieve low power states like we would have been able to with the 32K SYNC TIMER as the scheduling timer. Lower power was not an issue for this particular customer so this solution works in this case. However for the general case and many of our customer projects, this will continue to be an issue. We would like to continue to pursue what is going on here and why it appears we cannot read the 32K SYNC TIMER reliably. Our products group will eventually get this issue into scope but for now, since the immediate customer issue is behind us, it is not a priority. However if you can continue to research and perhaps explore the relationship to IDLE modes or other subsystems that could affect the 32K SYNC TIMER access, this could help greatly for when we return to the issue.
Chris
Hi Mike,
I have been looking into a similar problem on our OMAP3/4 devices. It definitely sounds like i103 is the problem here.
I know that it has been a while since you posted this and you have probably moved on with life. However, I am curious to know a couple of things:
1. Should it be possible to see this on any omap3 device if left running for a day or more?
2. How long does it take to occur with your script?
3. Have you tried any recent kernels such as v3.4/v3.5?
Cheers
Jon
Hi Jon,
We saw the issue on every OMAP3530 we tested. If a device is just left running, I would guess the only possible visible defect would be if the date changed. We didn't specifically test leaving a unit powered on for days. I do recall another project that had similar issues using a 2.6.32 kernel on an OMAP3530, and that one would take a few days to exhibit issues. That particular project was running its normal application and not a test script. We hadn't yet figured out the issue with the 32k clock at that time, and I think a switch to GPTIMER12 was sufficient to make the issues go away.
With our script, the issue could take anywhere from a few minutes to overnight. I don't think I ever saw it take more than 24 hours.
We have not tried any more recent kernels to my knowledge.
Mike
Thanks for the feedback. I will give your script a try. Did you ever try to reproduce with an SD card, or with any other storage devices? If not, can you confirm whether your USB drive was connected to the USB OTG port or the host port (if you happen to recall)?
I do not recall if we tried SD cards. We might have. The particular failure case was an upgrade process that used the USB port heavily, so that is what we focused on.
The USB drive was connected through a TPS65950 on HSUSB0.
Thanks, that tells me it is the OTG port. I will give it a try. By the way, I am assuming that when it failed, the symptom was that the entire system hung.
Jon
No, the entire system was not hung. There are two symptoms:
With this in mind, your console could still be functional, but if a foreground process was waiting on a timer, it will appear unresponsive. What we found useful was SysRq 'q' (Display all active high-resolution timers and clock sources) and 't' (Output a list of current tasks and their information to the console). The timers will show that events should have fired but did not. The task output will show you that tasks are likely waiting on a timer.
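If the console cannot send the SysRq key sequence, the same dumps can usually be requested by writing the command character to /proc/sysrq-trigger as root on a kernel built with magic SysRq support. Below is a minimal sketch in C; the helper name sysrq_trigger is hypothetical, and the path is passed in as a parameter so the function can be tried against an ordinary file first.

```c
#include <stdio.h>

/* Write a single SysRq command character to the given trigger file.
 * Returns 0 on success, -1 if the file cannot be opened or written. */
static int sysrq_trigger(const char *path, char cmd)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fputc(cmd, f) != EOF);
    if (fclose(f) != 0)   /* the write is flushed here, so check it */
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling sysrq_trigger("/proc/sysrq-trigger", 'q') followed by sysrq_trigger("/proc/sysrq-trigger", 't') should produce the timer and task dumps described above in the kernel log.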
Quick update: I have been running your script for 4 days on my beagle-board, but I have not observed any problems so far. We have had a similar issue on omap4 and we have a workaround in place. I don't see any workarounds in place for omap3, and I am double-checking on this at the moment. By the way, I am testing on the latest v3.5 kernel.