Very high serial tx latency on DM355 (Linux 2.6.18)

I'm implementing a multi-drop serial protocol over a tty serial port on an embedded Linux platform running a 2.6.18 kernel.  The protocol is half duplex - the host sends ID polling packets periodically (about 100 per second) and if a device has data to send it waits for its ID before sending it.  Pretty standard simple stuff which I've done many times on microcontrollers.
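(For context, a polling packet is only a few bytes. Purely as an illustration - the field names below are hypothetical, not the real protocol definition - it's something along these lines, matching the 3-byte write in the code further down:)

#include <stdint.h>

/* Hypothetical 3-byte poll packet layout, for illustration only */
struct poll_pkt {
    uint8_t sof;   /* start-of-frame marker */
    uint8_t id;    /* ID of the device being polled */
    uint8_t crc;   /* simple checksum */
} __attribute__((packed));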
The problem is that the Linux host is caching my transmit data (the polling packets) and spitting them out back to back at full data rate in bursts.  So, instead of one packet every 10ms, I'm getting 20 packets back to back, then a ~200ms delay, then another 20, and so on.  When I send a packet I need Linux to flush it to the serial port immediately, not cache it for a later burst.  In any case, 200ms seems like an infeasibly huge delay.
I looked into the ASYNC_LOW_LATENCY flag, but with it set the port stops working altogether.  From grepping the kernel source it doesn't look like this is implemented for DaVinci.
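(For reference, setting that flag from userspace goes via the serial_struct ioctls - roughly like this; the snippet is a sketch, not my exact code:)

#include <sys/ioctl.h>
#include <linux/serial.h>

struct serial_struct ser;
ioctl(fd_target, TIOCGSERIAL, &ser);   /* read the current port settings */
ser.flags |= ASYNC_LOW_LATENCY;        /* ask the tty layer to push data through immediately */
ioctl(fd_target, TIOCSSERIAL, &ser);   /* write them back - on this kernel the port then stops working */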
Here's an extract of my code (with error handling removed for brevity):
fd_target=open(TARGET_TTY, O_RDWR | O_NDELAY | O_NOCTTY);
tcgetattr(fd_target, &config);
config.c_iflag &= ~(IGNBRK | BRKINT | ICRNL | INLCR | PARMRK | INPCK | ISTRIP | IXON | IXOFF);
config.c_oflag &= ~(OPOST | OLCUC | ONLCR | OCRNL | ONOCR | ONLRET | OFILL | OFDEL | NLDLY | CRDLY | TABDLY | XTABS | BSDLY | VTDLY | FFDLY);
config.c_cflag &= ~(CSIZE | PARENB | CSTOPB | HUPCL);
config.c_cflag |= CS8 | CLOCAL | CREAD;
config.c_lflag &= ~(ECHO | ECHOE | ECHOK | ECHONL | ICANON | IEXTEN | ISIG | NOFLSH | TOSTOP);
config.c_cc[VMIN]  = 1;
config.c_cc[VTIME] = 0;
cfsetispeed(&config, TARGET_BPS);
cfsetospeed(&config, TARGET_BPS);
tcsetattr(fd_target, TCSAFLUSH, &config);
FD_ZERO(&target_fds);
FD_SET(fd_target, &target_fds);
fdmax = fd_target;
while(1)
{
    /* Poll for incoming data, unblock anyway after 10ms */
    read_fds = target_fds;
    tv.tv_sec = 0;
    tv.tv_usec = 10000;
    select(fdmax+1, &read_fds, NULL, NULL, &tv);
    /* Check for serial data */
    if(FD_ISSET(fd_target, &read_fds))
    {
        /* Empty the device into our state machine */
        do
        {
            n = read(fd_target, buf, 1);
            if(n>0) target_fsm(*buf);
        }
        while(n>0);
    }
    else
    {
        /* Poll targets if there's nothing incoming - pkt is set up with the appropriate data */
        write(fd_target, &pkt, 3);
    }
}
I tried the same thing in blocking mode (without the inner loop around the read) - same behaviour.  I need something to force that write through to the port.  I tried tcdrain() but that just causes my thread to block until the FIFO empties after 200ms or so.
I checked that the scheduler isn't causing the latency - the thread is waking 100 times per second as expected and the CPU load is very low, so I'm pretty sure this behaviour is somewhere in the tty.
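(One way to verify that - shown only as a sketch, log_wakeup_interval() is not from the real code - is to timestamp each pass of the select() loop:)

#include <stdio.h>
#include <time.h>

static struct timespec prev;

/* call once per pass of the select() loop and log the interval since the last pass */
void log_wakeup_interval(void)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    if(prev.tv_sec || prev.tv_nsec)
    {
        long us = (now.tv_sec - prev.tv_sec) * 1000000L
                + (now.tv_nsec - prev.tv_nsec) / 1000L;
        fprintf(stderr, "loop interval: %ld us\n", us);   /* expect ~10000 */
    }
    prev = now;
}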
Any advice appreciated!  Hardware is the DM355 EVM, software is the RidgeRun SDK with a 2.6.18 MontaVista kernel.
Cheers
Rob

  • I have hit precisely the same problem. I wonder if you solved it, and if so, how did you do it?

    I have also discovered that this problem exists on /dev/tts/0 and /dev/tts/1.

    Getting /dev/tts/1 to work at all was an adventure.

  • We also discovered this exact problem a few months ago in our DM368-based project. I don't know whether anyone will still be interested in this stuff in the future, but since these SoCs seem to be in widespread use, you never know. So I will share our findings and solution 'briefly' here (including the code changes, just as the license requires):

    At first we also checked/tried 'everything' on the userland side (numerous flags, the flush/drain APIs etc., as well as process/thread priorities, since 200ms somehow 'smelled' like scheduling). But nothing made any difference; the 200ms latency seemed to be carved in stone.

    After moving to the kernel side ("TTY/serial stack") I first suspected the highest layer of the TTY stack (the one following the generic VFS and character-device layers), the (TTY) line discipline. But after fully instrumenting the TX path of this layer (using single GPIO register writes plus a logic analyzer instead of UART/printk output, which allows 'real-time' tracing with exact timestamps), I found that it is fine and there is no additional buffering there: given the correct port configuration (the TTY_HW_COOK_OUT bit is set), "write_chan()" in [/drivers/char/n_tty.c] directly calls "tty->driver->write()", which is "uart_write()" in [/drivers/serial/serial_core.c].
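    (For completeness, each probe point was essentially a single register write that the LA can timestamp. The sketch below shows the idea only - the base address, register offsets and pin mask are placeholders, not verified DM355 values, so take the real ones from the TRM:)

    #include <asm/io.h>

    #define TRACE_GPIO_BASE 0x01C67000 /* placeholder: GPIO controller base */
    #define TRACE_SET_DATA  0x18       /* placeholder: SET_DATA register offset */
    #define TRACE_CLR_DATA  0x1C       /* placeholder: CLR_DATA register offset */
    #define TRACE_PIN_MASK  (1 << 4)   /* placeholder: pin wired to the LA probe */

    static void __iomem *trace_base;   /* ioremap(TRACE_GPIO_BASE, 0x1000), done once at init */

    /* one register write per probe point - no locking, no printk,
       usable from ISR/softirq context as well as from task context */
    static inline void trace_mark(int level)
    {
        writel(TRACE_PIN_MASK,
               trace_base + (level ? TRACE_SET_DATA : TRACE_CLR_DATA));
    }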

    Then I suspected it might be an ISR ("top half") vs. deferred procedure ("bottom half": softirq/tasklet) issue, where the latter could theoretically lead to scheduling delays again. But I only found sane, expected behaviour there as well, just as the textbook theory had led me to expect.

    Finally I GPIO-instrumented the TTY stack's TX path all the way down to the low-level UART driver [/drivers/serial/8250.c], where the data is handed over byte-by-byte to the peripheral's HW TX FIFO. After some (painful) analysis of the LA-captured data, combined with KGDB/DDD kernel debugging to untangle the non-synchronous (i.e. ISR-/softirq-/tasklet-driven) call stacks of the deeply nested routines (such as "transmit_chars()") that are used from both softirq and regular kernel-task contexts, I eventually came to the root cause of the evil:

    - The 8250.c driver is a 'quite generic' driver used for all PC-standard-compatible (8250/16450/16550A) UARTs, which the DM368 UART claims to be.
    - However, there seems to be a vast number of partially buggy clones which the driver tries to account for with various workarounds. E.g.
    #define UART_BUG_TXEN (1 << 1) /* UART has buggy TX IIR status */
    is active in DM368's MontaVista 5.0 / kernel 2.6.18 config.
    - If this flag is set, "serial8250_start_tx()" calls "transmit_chars()" (which clocks the data out to the UART FIFO) only if two other conditions are met:
    if (lsr & UART_LSR_TEMT && iir & UART_IIR_NO_INT)
        transmit_chars(up);

    - But this path seems to be taken only on the first usermode write() request to the TTY. On all subsequent calls UART_IIR_NO_INT is LOW at the time of this check, i.e. it indicates that _some_ IRQ is pending (which is/was supposed to handle the transfer once it is processed - that is the design idea, I guess). That is why the UART output is not synchronous with the write() call, except in the first-call case.

    - But in fact, for all subsequent write() calls there seems to be **no** pending IRQ that will soon be serviced _and_ take care of transferring the data. Instead, the fact that the UART outputs anything at all (but only after a 200ms delay) looks to me like a mere 'side effect' of the driver's existing "backup timer" mechanism:

    Maybe the DM3xx 'supposedly 16550-compatible' UARTs have this problem (quoted from the description in "serial8250_startup()"):
    "UARTs that do not reassert THRE when the transmitter is idle and the interrupt has already been cleared. Real 16550s should always reassert this interrupt whenever the transmitter is idle and the interrupt is enabled."
    => For those the driver uses the "serial8250_backup_timeout()"-timer to
    "setup a timer to kick the UART on a regular basis".
    "Standard timer interval plus 0.2s to keep the port running"
    mod_timer(&up->timer, jiffies + poll_timeout(up->port.timeout) + (HZ/5));

    => Here we go: the 200ms delay is actually hard-coded in the driver as a 'magic number' - twice, once when the timer is first set up and again when it is reloaded! And since it is (correctly) expressed in terms of "HZ", it was invariant to my earlier experiments with changing the kernel's system tick rate (CONFIG_HZ) from its ARM default of 100 (10ms tick) to 250 (4ms) and 1000 (1ms): with HZ=100 the HZ/5 add-on is 20 jiffies = 200ms, and with HZ=1000 it is 200 jiffies = still 200ms.

    So apparently DM3xx UART transmission depends almost entirely on this 200ms backup-timer mechanism! From here we directly had (at least) two feasible workaround options, both of which seem to work OK:

    (1) We could minimize the delay of the backup timer (i.e. increase its frequency). Tests with add-on values of (HZ/200)=5ms and (HZ/1000)=1ms [1ms with the non-default CONFIG_HZ=1000 setting, but it also worked with HZ=100, where it effectively becomes a 0ms add-on] were successful and brought the latency from completed write() call to UART clock-out down to <10ms, which would be sufficient for our use cases (see the sketch below).
    However, this still depends on the backup timer and - even if minimized - incurs unnecessary delays and CPU (softirq) load.
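    (Sketch of that change, not a verbatim diff - the same HZ/5 'magic number' is used in both places where the backup timer is armed in [/drivers/serial/8250.c], and both get the same treatment:)

    /* was: jiffies + poll_timeout(up->port.timeout) + (HZ/5)  -> 200ms regardless of CONFIG_HZ */
    mod_timer(&up->timer,
              jiffies + poll_timeout(up->port.timeout) + (HZ / 200)); /* ~5ms add-on */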

    And since our use case is the simplest possible "ping-pong" (request-response) protocol scheme (i.e. multi-frame throughput doesn't matter), we neither need nor want any queuing/asynchronous operation of the UART. So we also tested another option:

    (2) Always force transmit_chars() on start_tx - regardless of the IIR bit 0 'interrupt pending' state, which seems to be either buggy or not an exact enough check (it reflects _any_ interrupt source) - by disabling the test for UART_IIR_NO_INT in "serial8250_start_tx()":
    if (lsr & UART_LSR_TEMT /*&& iir & UART_IIR_NO_INT*/ )
    transmit_chars(up);

    => This seems to work OK _so far_ (in our use case), giving 'zero latency' (fully synchronous operation of userland write() calls), since the backup timer is never needed.
  • Gosh, I wasn't expecting a solution after all this time! Thanks for posting. We have long since moved to a platform with better Linux support, but there must be a lot of legacy users around still.

    Cheers

    Rob