Linux/PROCESSOR-SDK-AM335X: Timer interrupts stop and system becomes unresponsive periodically

Part Number: PROCESSOR-SDK-AM335X


Tool/software: Linux

Hi All,

We have a custom board utilising an AM335x SoC. From time to time (it is very hard to reproduce), the system enters a state where the gp_timer interrupt stalls completely: if I repeatedly issue the command cat /proc/interrupts, the interrupt count for gp_timer stays the same permanently.
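The check described above can be sketched in Python as follows. This is only an illustration: the helper names `irq_count` and `counter_stalled` are hypothetical, and the parser assumes the usual /proc/interrupts layout (an IRQ number followed by a colon, then the per-CPU counters, then descriptive fields ending in the interrupt name).

```python
def irq_count(interrupts_text, name="gp_timer"):
    """Sum the per-CPU counts of every /proc/interrupts row labelled `name`.

    Returns None if no row carries that label.
    """
    total = None
    for line in interrupts_text.splitlines():
        fields = line.split()
        # Data rows start with "<irq>:"; the CPU header row does not.
        if len(fields) < 2 or not fields[0].endswith(":"):
            continue
        if name not in fields[1:]:
            continue
        # The leading numeric fields are the per-CPU counters.
        row_total = 0
        for field in fields[1:]:
            if not field.isdigit():
                break
            row_total += int(field)
        total = row_total if total is None else total + row_total
    return total


def counter_stalled(before, after):
    """True if the counter was found both times and did not advance."""
    return before is not None and before == after


if __name__ == "__main__":
    # Print the current count; run this twice a few seconds apart and
    # compare the two values by hand or with counter_stalled().
    with open("/proc/interrupts") as f:
        print(irq_count(f.read()))
```

The script deliberately avoids sleeping between samples, since a sleep depends on the very timer that is suspected of stalling.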

Once in this state, the device becomes unresponsive for a certain time and then becomes responsive again. The unresponsive period is almost constant at about 1 minute 55 seconds. When the system becomes responsive again, the system time has jumped back 62 seconds, and exactly 62 seconds later the system freezes again.

Once in this state, the serial console does not respond at all. I am connected to the device via ssh, and over that connection I reliably see the freezes followed by the date jumping back.

I have found a thread on the forum in which this issue is described, but there is no resolution posted. The thread is 

Has anyone else noticed this behaviour?

Thanks in advance

Hamish

  • Hi,

    Which kernel is this? Can you summarize the changes you've made to port the TISDK to your custom board?

    Best Regards,
    Yordan
  • Hi Yordan,

    The kernel we see the problem on is 4.1.13. We build our systems using Yocto, with the git://git.yoctoproject.org/linux-yocto-4.1.git kernel repo on the standard/beaglebone branch. Unfortunately, I do not have the complete history of the porting process to our platform: I only joined the team a year ago, whereas the project started in 2013.

    Best Regards,
    Hamish

  • Hi Yordan,

    I have managed to dig up some more history on the porting process. When the project started, the am335x was evaluated on Beaglebone devices, and Yocto denzil was used at that point. Our custom hardware was then developed, and porting started once the first prototypes arrived. At that point, the kernel was at 3.2.23 - linux-ti33x-psp. Our custom hardware is essentially based on the am335x-evm, but we have added an Altera Cyclone FPGA as well as PHYs to support PTP. In addition, we added Modbus support and a few other items I cannot quite remember off the top of my head, but as I say, it is mainly based on the evm. The FPGA is connected to the GPMC bus, so the pinmuxing is different to the evm; the NAND pinmuxing is also slightly different to the evm.

    The original port was done in a local repository and directly modified the am335xevm code base. As the port progressed, the developers kept up to date with Yocto and bumped kernel versions as Yocto releases did so. As the port matured, it became increasingly difficult for the developers to integrate the changes into newer kernel versions; at that stage SVN was being used for the project. It was then decided to switch to git. A new Yocto layer was created for the custom board, and patches were created against the then-current TI kernel and stored in the layer metadata.

    Sometime in 2014 it was decided to use the Yocto kernel repos instead of linux-ti33x-psp. Development has continued, and kernels have been updated relatively frequently but conservatively.

    Two products we have on the market are based on a platform release from autumn last year, which includes the 4.1.13 kernel. Both products exhibit the issue described but, as I say, it is not a frequent occurrence, and we would like to understand what causes it. We are able to revive a device in this state immediately by syncing its clock to our PTP clock, which simply involves a clock_settime call on CLOCK_REALTIME. I guess any call to clock_settime might work; however, as far as we are aware the problem occurs so infrequently that I have not had much opportunity to try other mechanisms to revive the devices.
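    A minimal sketch of that revival step, using Python's standard `time.clock_settime` wrapper around the same system call. The function name `step_realtime_clock` and its `set_clock` parameter are hypothetical, and the reference time is assumed to come from some external source such as the PTP clock; this is not the code actually used on the devices. Setting the clock normally requires root (CAP_SYS_TIME).

```python
import time


def step_realtime_clock(reference_seconds, set_clock=None):
    """Set CLOCK_REALTIME to reference_seconds (seconds since the Unix epoch).

    set_clock is injectable so the logic can be exercised without privileges;
    by default the real time.clock_settime is used.
    """
    if set_clock is None:
        set_clock = time.clock_settime
    set_clock(time.CLOCK_REALTIME, reference_seconds)
    return reference_seconds


if __name__ == "__main__":
    # Placeholder reference: re-apply the current wall-clock time.
    # A real system would substitute the time obtained from its PTP clock.
    step_realtime_clock(time.clock_gettime(time.CLOCK_REALTIME))
```

    Re-applying the current time, as in the placeholder above, would be one way to test the guess that any clock_settime call is enough; whether it actually revives a stalled device is untested here.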

    I have now written code that uses the real-time clock to generate interrupts, so that I can check whether the gp_timer interrupt has locked up and, if it has, send a remote notification that this has happened. I will be putting a number of devices onto long-running tests with the 4.1.13 kernel.
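    One way such a detector could look is sketched below; it is an illustration under stated assumptions, not the code referred to above. It paces itself with RTC update interrupts from /dev/rtc0 (assuming the RTC driver supports them) instead of a kernel timer, and flags a stall after the gp_timer count has stayed unchanged for a few consecutive RTC ticks. `StallDetector`, `read_gp_timer_count`, `watch`, and the threshold of 3 are hypothetical names and values; the ioctl numbers are those of RTC_UIE_ON and RTC_UIE_OFF in <linux/rtc.h>.

```python
import fcntl
import os

# ioctl request numbers from <linux/rtc.h>: _IO('p', 0x03) and _IO('p', 0x04).
RTC_UIE_ON = 0x7003
RTC_UIE_OFF = 0x7004


class StallDetector:
    """Flags a stall once the count is unchanged for `threshold` samples in a row."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.last = None
        self.unchanged = 0

    def update(self, count):
        if self.last is not None and count == self.last:
            self.unchanged += 1
        else:
            self.unchanged = 0
        self.last = count
        return self.unchanged >= self.threshold


def read_gp_timer_count(path="/proc/interrupts", name="gp_timer"):
    """Return the CPU0 count for `name` (the AM335x has a single core), or None."""
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) > 1 and fields[0].endswith(":") and name in fields[1:]:
                return int(fields[1])
    return None


def watch(notify, rtc_path="/dev/rtc0"):
    """Call notify() once when gp_timer appears stalled, then return."""
    detector = StallDetector()
    fd = os.open(rtc_path, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, RTC_UIE_ON)  # request one RTC interrupt per second
        try:
            while True:
                os.read(fd, 8)  # blocks until the next RTC update interrupt
                if detector.update(read_gp_timer_count()):
                    notify()
                    return
        finally:
            fcntl.ioctl(fd, RTC_UIE_OFF)
    finally:
        os.close(fd)
```

    Calling `watch(lambda: print("gp_timer stalled"))` as root would block until a stall is detected; the notification callback is where a remote report would be sent.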

    We do have two further development branches, one on a 4.1.35 kernel and another on a 4.8.13 kernel (which we will shortly be updating to 4.9.x). I have also completely reworked our patches so that we patch the existing evm code base, including our board detection code, which means we are in fact able to run our u-boot and kernel binaries on our board as well as on the beaglebone and the evm. As a result, if there are any changes in the am335x code base, we see them immediately when our patches fail to apply, and we are able to rebase our patches against newer kernel versions.

    We have never noticed this phenomenon on either of those newer branches, but I must very quickly add that we have no long-running tests on either of them, so I cannot rule out that those branches also exhibit the problem. I am in the process of setting up infrastructure so that I will have a few devices on the 4.1.35 and 4.8.13 kernel versions running long-term tests, with my failure detection code in place.

    I am just very curious that exactly this problem was reported in this thread:

    e2e.ti.com/.../2108951

    but there was never any conclusive report of resolution to that problem.

    I thank you again for your interest in this issue.

    Hamish

  • Hi,

    Sorry for the late reply. Let me check this further.

    Best Regards,
    Yordan
  • Hi Yordan,

    I am eager for any additional information.

    Best regards

    Hamish