This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

Horrible interrupt latency (>10ms) on OMAP linux

We're using an OMAP L138 running a 2.6.37 kernel and we're experiencing horrible interrupt latencies. This manifests itself in various drivers.

Handling an interrupt typically take less than 100us on the platform. If I attach a waveform generator to a GPIO line and handle the interrupt by counting, it can keep up with frequencies up to 100kHz (that is, 100000 interrupts each second). However, once every few seconds or so, the system appears to stall and the interrupt takes several milliseconds (I've seen 10ms on occasion) to arrive at the (kernel) handler.

The easiest to verify manifestation of the problem is simply sending data over the serial terminal. I wrote a small Python program that writes lines of text to the serial port and reads them back for comparison. If i delay about 50ms between each line, all is right, but at higher rates serial data will be lost. I can upload the python script for anyone interested.

I have experienced the same problem on an OMAP3 (gumstix overo) where we could work around the problem by moving the implementation into an FPGA. In that case, we had a A/D convertor connected that we failed to read within 2ms after an interrupt.

I have tried disabling (and enabling) each and every power management option in the kernel (most notably, turning OFF all power management). I tried raising the IRQ priorities of the serial ports from "7" to "4" in arch/arm/mach-davinci/da850.c.  I also tried a 3.3 kernel (davinci-next branch of arago) which also made no difference.

What could be causing this horrible behaviour?

  • Mike,

    Check the kernel .config file whether CONFIG_NO_HZ is enabled or not. If enabled, disable this. Also check the value of CONFIG_HZ see whether its 100 or 128. If so, please change it to 1000. This can be done by editing the HZ entry in arch/arm/Kconfig file

    Try these and observe the latencies.

  • CONFIG_NO_HZ is enabled, HZ was 100. Tried with HZ set to 1000, which did not make any difference.

    Also disabled NO_HZ (tickless kernel). Made no difference in the "serial" test either. Only difference is that the "clockevent" IRQ now occurs 1000x per second.

  • I noticed that there is NO interrupt (other than serial) occuring when the "glitch" occurs (either lost serial data or an extra "0" byte inserted in the reponse). I cannot find a relation to other interrupt routines, which also explains why raising the priority of the serial IRQ does not make any difference.

    I suspect it is related to caching or RAM access, but I have no idea how I can test that.

  • Mike,

    10ms is too high to suspect RAM or caching causing the issue. Can you describe in more detail on the hardware setup, the serial port baudrate and how the data is transferred and received back. Also where have you added the code for GPIO toggling in your interrupt handler etc. Please share as much details as possible.

  • Here's the source for the serial "ping pong" test program. It exeutes "cat" on the other side (just to have something to talk to) and just writes a line and reads it back.

    6786.pingpong.py.gz

    Example usage:

    pingpong.py --device /dev/ttyUSB0 --delay 25 --loops 999 --data "The quick fox jumped over the lazy brown dog"

    It's to be run on a host with a serial connection to a board running some kind of shell (one that supports "cat"). I derived it from a similar program that I wrote to upload files through a terminal connection, which works fine on most development boards, but not on the OMAPs because their serial connection isn't reliable.

  • If you have any suggestion for measuring or otherwise diagnosing this problem, that'd be welcome.

    On the omap-3 we had the problem that there was a AD converter chip connected to the SPI bus, which generated an IRQ for each sample. The omap then had to read from the chip via SPI. It often "missed" that deadline, even at 500Hz sampling.

    When I replaced the code for the SPI transfer with a simple "gettimeofday" call, it became clear that the interrupt did not arrive in time, sometimes they were several milliseconds apart (so multiple interrupts were lost). We never found a solution for that either.

  • Losing data on the serial bus happens on all three UARTs, on baud rates above 9600. It's easiest to reproduce on the serial console, but it's not particular to that one.

    It does not matter whether I run the CPU at its lowest possible speed of 96MHz or at 456MHz. Frequency scaling is already disabled (because the drivers for SPI and SD cannot handle it).

  • Mike,

    Can you give more clarity on how exactly you are reading the characters and echoing it back? If you are using a script to access it, then I can almost tell you what might be happening. Please share more details about the script or the commands that is reading at the device end.

  • I posted the script above. It's python code, so it should be easier to read than my English.

    There's nothing on the device side (script runs on host), it just executes "cat" on the shell and exploits the "echo" feature of the terminal connection to receive back what it sends (well almost, when sending a "\n" you get a "\r\n" back).

  • Mike,

    The problem with scripts and continuously executing shell commands, especially which handle serial port are listed below.

    1. Each time you run the cat or echo command it will do a open(), read/write() and close() of the serial device. 

    2. If you check the Linux kernel serial driver code, generally the serial driver gets initialized after it calls open(). During open it instantiates the serial_interrupt_handler routine. 

    3. The handler is freed up when you do a close(). 

    So in your case since you are doing a open->read/write->close,  open->read/write->close sequence continuously instead of open->read/write->read/write->read/write...->close, this behavior can occur. 

    Your handler itself may not exist when you are seeing higher latency, which in no way is related to the interrupt latency.

    What I recommend is to write a small C program to do a send and receive at high frequency at both PC and target and verify the latency by toggling the GPIO. I hope I have explained it properly. 

  • The script executes "cat" ONLY ONCE and keeps talking to that one instance. Since this is running on the console, the serial device is already open, so there's no opening/closing of the serial port involved.

    In the serial port test, I'm not measuring latency. I'm seeing "lost data".

    Please read the whole story and don't focus on the serial part. It happens on two different OMAP platforms, with ANYTHING that uses an interrupt handler. I'm just using the serial port because they're on any board and anyone can run the test.

  • Mike,

    You are loosing the data because there is no interrupt handler available to receive the data when you are receiving it. Please enable a print inside the function uart_open() inside the file drivers/serial/serial_core.c in your kernel and run your script and share the logs.

  • Here's the "body" of the script ("dev" is the serial device)

            for i in xrange(options.loops):
                dev.write(data + '\n')
                line = dev.readline()[:-2] # Strip away \r\n
                print i, "\r",
                if data != line:
                print "\nMismatch:\ndata=%s\nline=%s" % (repr(data),repr(line))
                sys.stdout.flush()
                if delay:
                    time.sleep(delay)


  • Right after booting with the extra logging in uart_open:

    # dmesg -c | grep uart
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called

    Running the script (on the host) yields:

    bin/pingpong.py --device /dev/ttyUSB0 --delay 1 --loop 20

    1
    Mismatch:
    data='0-hello.1-hello.2-hello.3-hello.4-hello.5-hello.6-hello.7-hello.8-hello.9-hello'
    line='0-hello.1-hello.2-hello.3-hello.4-hello.5-hello0-hello.1-hello.2-hello.3-hello.4-hello.5-hello.6-hello.7-hello.8-hello.9-hello'

    Then getting the log again on the device:

    # dmesg -c | grep uart
    #

    As you see no opening/closing of the serial port.

  • MIke,

    I'm not worried about the host side. I'm referring to the target side. If you see, there are multiple calls to uart_open(). This makes my point clear.

    Mike Looijmans said:

    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called
    uart_open(2) called

    At the target side, it should just invoke one uart_open(2) call. So can you write a simple C application for the target to implement the logic. 

  • Those uart_open() were !!!!!!! BEFORE !!!!!! i ran the test program. They occurred during system startup, and serve no other purpose than to show you that the log message was working.

    It's the SECOND dmesg that you should be "concerned" about. As you can see, op "uart_open" whatsoever.

  • Mike,

    If you are not seeing any uart_open() when you are running the script, how is the device doing the read and write? 

    Can you simply try creating the test C code and observe the same? This will clear all confusions.

  • I'm sending to the console (where you usually would run PuTTY or minicom). Linux runs a getty process on the device side, which echoes characters back to the terminal program (=my script). By running a neutral command (cat) on the device, the device acts as a simple "echo" program. Is this clear?

    I've posted the source code already, you can run it on any system (linux or windows).

    I can make a set of C programs that do the same, if that is what it takes to convince you guys that the serial port is dropping characters for no apparent reason.

    I was mostly hoping for good suggestions on to analyze what's going on.

  • There was a bug in that test script, I forgot to "drop" the ouput, hence the "cat" command is showing every data line twice. With a change to the code to do that, the problem no longer reproduces as easily. So I'll have to do some C coding anyway.

    Don't read this the wrong way: We ARE losing data on the serial inputs in our software, while ther eis absolutely no load on the system, that is the problem I'm really after. We also see high latencies on other interrupts, I have to come up with code for that so it can be demonstrated on other platforms.

  • Mike,

    I'll wait for the result of C program.

  • Hi Renjith,

    We are facing a problem here

    We have ported JB 4.2.p1 with kernel 3.4 onto our custom board having omap4430.  We have connected an interrupt pin of a  device on our board to gpi 80 pin of omap. The device is configured to send out interrupts to the omap gpi 80 at every 10ms i.e 100hz pulse train. We toggled another gpio 3 pin of omap on receiving every interrupt  pulse as mentioned above in the interrupt handler function of our device . We noticed on probing that the gpio 3 did not toggle consistently at 10ms, and it skipped a few pulses, and at times the delay was as high as 200ms.

    However on repeating the same experiment n ics with kernel 3.0.4 we noticed that gpio 3 toggeled consistently at 10ms without any issues.

    We have to correct this inconsistent behaviour and the interrupt latency. Could you provide suggestions on this?

    Thank You

    Prakash

  • Prakash,

    Can you check your kernel config in both cases? Can you compare CONFIG_HZ related changes? Also can you explain how are you waiting for 10ms in the kernel?

  • Hi Renjith,


    CONFIG_HZ is 128 in both cases, there is no change in that. The interrupt pin of the device gives out interrupts every 10ms to gpio in omap. We have configured the device at 100Hz. Ideally we have to get interrupts every 10ms. When we probed the interrupt pin of the device, we were consistently getting the interrupt at 10ms but not in the interrupt handler function of the device.

    Regards

    Prakash

  • Prakash,

    Please try with CONFIG_HZ as 1000 instead of 128. 

    Prakash Divate said:
    We have configured the device at 100Hz.

    How did you configure the device.

  • Renjith,


    I tried CONFIG_HZ as 1000, its not making any difference.

    The device we are using is MPU9150. The driver provided by them has provision to hard code the frequency.

    Regards

    Prakash

  • Hi Prakash,

    Can you share the piece of code that you've written?

    Also can you let me know whether you are using bottom half handlers for implementing the GPIO toggle? If so, please move the toggling to the interrupt handler function instead of putting in bottom half handler.

    Also can you summarize what is the CPU load in Linux during this test?

  • Renjith,

    I used the following code to toggle the gpio

    gpio_set_value(82, 1);

    gpio_set_value(82, 0);

    i tried using the above code in the interrupt handler and the thread function, but it made no difference.

    Regards

    Prakash

  • Prakash,

    Did you check the CPU load on ARM?

  • Renjith,

    The CPU load is 1%.

    I tried toggling the gpio in the gpio driver file ie. in the gpio_irq_handler function in the drivers/gpio/gpio-omap.c file. But still i observe the same problem , I don't see the gpio toggling consistently at all. There seems to be a problem with the irq handling in the gpio driver.  Is there a patch available for the gpio driver file for this kernel release?

    Regards

    Prakash

  • Prakash,

    I'm not aware of a ready made patch for fixing this issue. But this seems to be something strange with the kernel. This needs further debugging.