Linux - Latency and performance issues on the L137 processor

Other Parts Discussed in Thread: DA8XX, OMAPL138, OMAP3530

Hi, we are having performance issues on the L137 processor and eval card, using both the open source GIT Linux kernel and the MontaVista version.

Basically we have a driver which receives an interrupt every 125uS on a GPIO line. The interrupt handler wakes up a process waiting on a wait queue. The write call simply puts the calling process to sleep on the wait queue.

I'm toggling a GPIO line in the ISR and in the write call to get an idea of the time spent in the IRQ and the time spent in the call to write (the line is asserted when the process is awoken and deasserted when it is put to sleep).

I'm making comparisons with something similar we did on a 266MHz IXP420 processor running an old 2.4.20 kernel, which had no such problems and received two interrupt sources at 8kHz.

I'm concerned that the OMAP ARM has a smaller cache (16k each for instruction and data) compared to the IXP425 (ARM v5, 32k), and that the processor / Linux is unsuitable.

Below is a full outline of our problems. My concern is that we will not have the power to run our application code.

I apologise if I've put this on the wrong forum.

Has anyone else had similar experiences with this processor / linux?

Regards

Dave

 

Full details ....

OMAP L137 / Linux performance issues

 

Outline of a simple driver / userspace application:

Driver wakes up userspace process every 125uS. Performance is measured by driving a GPIO line in the ISR and driving another line in the context of the write process.

Issues:

ISR and process latency seem poor, and the time spent in the write process context (15uS at best) seems excessive for what is being done here. The worry is that there will be little processing time left for our user space application.

Although the process is scheduled as real-time, it is affected by other logins and apps such as top.

Shell commands such as ps are a little slow to respond.
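One way to cross-check the scope measurements from userspace is to time periodic wake-ups against an absolute deadline, in the manner of cyclictest. The following is a minimal sketch only, assuming a POSIX system with CLOCK_MONOTONIC. The names `timespec_add_ns`, `timespec_diff_ns` and `measure_wakeup_latency_ns` are hypothetical, and the code measures timer wake-ups rather than the GPIO interrupt path described in this post.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/* Advance an absolute timespec by 'ns' nanoseconds. */
static void timespec_add_ns(struct timespec *ts, long ns)
{
    ts->tv_nsec += ns;
    while (ts->tv_nsec >= NSEC_PER_SEC) {
        ts->tv_nsec -= NSEC_PER_SEC;
        ts->tv_sec++;
    }
}

/* Signed difference a - b in nanoseconds. */
static int64_t timespec_diff_ns(const struct timespec *a,
                                const struct timespec *b)
{
    return (int64_t)(a->tv_sec - b->tv_sec) * NSEC_PER_SEC
           + (a->tv_nsec - b->tv_nsec);
}

/* Sleep to 'loops' absolute deadlines spaced 'period_ns' apart and
 * return the worst observed lateness in nanoseconds, or -1 on error. */
static int64_t measure_wakeup_latency_ns(long period_ns, int loops)
{
    struct timespec next, now;
    int64_t worst = 0;

    assert(period_ns > 0 && loops > 0);
    if (clock_gettime(CLOCK_MONOTONIC, &next) != 0)
        return -1;

    for (int i = 0; i < loops; i++) {
        timespec_add_ns(&next, period_ns);
        /* TIMER_ABSTIME avoids accumulating drift across iterations. */
        if (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL) != 0)
            return -1;
        if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
            return -1;
        int64_t late = timespec_diff_ns(&now, &next);
        if (late > worst)
            worst = late;
    }
    return worst;
}
```

Calling `measure_wakeup_latency_ns(125000L, 8000)` from the real-time thread would cover one second at the 8kHz rate. Running it with and without top or a second login active gives a rough picture of how much other activity disturbs the wake-up time.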

Results Summary (Typical):

ISR Latency = 15uS

User Process wakeup Latency = 35uS

Time in ISR = 10uS

Time in write process context = 15uS
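For scale, these typical figures can be set against the 125uS period. The sketch below does that arithmetic under one stated assumption that the measurements themselves do not confirm: that the four intervals occur back to back within a single period. `struct timing_us`, `typical`, `budget_left_us` and `busy_percent` are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Typical measurements from the summary above, in microseconds. */
struct timing_us {
    int period;          /* 8kHz sync pulse -> 125 us */
    int isr_latency;
    int isr_time;
    int wakeup_latency;
    int write_time;
};

static const struct timing_us typical = { 125, 15, 10, 35, 15 };

/* Time left per period if the four intervals run back to back
 * (an assumption; in practice the intervals may overlap). */
static int budget_left_us(const struct timing_us *t)
{
    assert(t->period > 0);
    return t->period - (t->isr_latency + t->isr_time +
                        t->wakeup_latency + t->write_time);
}

/* Percentage of the period measured inside the ISR plus the
 * write process context (the two GPIO-asserted intervals). */
static int busy_percent(const struct timing_us *t)
{
    assert(t->period > 0);
    return 100 * (t->isr_time + t->write_time) / t->period;
}
```

With the typical numbers, `budget_left_us(&typical)` gives 50 and `busy_percent(&typical)` gives 20. On that reading, roughly 50uS of each 125uS period would remain for application work before anything else runs.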

 

We have tried the above scenario on the MontaVista 2.6.18 kernel shipped with the eval card with PREEMPT_RT enabled, which gives very similar results.

We have very similar processing on a 266MHz IXP420 running Linux 2.4.20 (MontaVista) and 2.6.26 (open source), which handles this in an existing application (two ISRs at an 8kHz rate) with a large multi-threaded application on top. No problems running telnet logins etc.

 

Kernel Config

OMAP opensource GIT Kernel with spi support based on 2.6.37

Built with the CodeSourcery toolchain, running from a ramdisk, with PREEMPT (low latency desktop), high-res timers and tickless kernel enabled.

 

Driver:

Sets up McASP 1 to give an 8kHz pulse, which is wired into GPIO(0, 13).

Interrupt handler simply wakes up user space process waiting on wait queue.

 

// Handle a 125uS sync pulse
//  . Enable the uart transmitter
//  . Set up dma transfer if data is available
//  . Set user index of which buffer to write to
//  . Wake up tx process

static irqreturn_t sync_handler(int irq, void *dev_id)
{
    struct srm600_device_config *dev_config = (struct srm600_device_config *) dev_id;

    // Assert GPIO(0,15) to mark ISR entry (direct register write)
    //gpio_set_value(GPIO_TO_PIN(0,15), 1);
    __raw_writel(0x8000, gpio_base_addr + BANK0_SET_OFFSET);

    // for test purposes simply wake up the write process so we are scheduling on 125uS
    // wake up process waiting on the write queue
    wake_up_interruptible(&dev_config->write_wq);

    // Deassert GPIO(0,15) to mark ISR exit
    //gpio_set_value(GPIO_TO_PIN(0,15), 0);
    __raw_writel(0x8000, gpio_base_addr + BANK0_CLR_OFFSET);

    return IRQ_HANDLED;
}


 

Write process is put to sleep awaiting wake up:

static ssize_t srm600_uart_write (struct file *filp, const char __user *buf, size_t count,
                loff_t *f_pos)
{
    DEFINE_WAIT(wait);
    struct srm600_device_config *dev_config = (struct srm600_device_config *) filp->private_data;

    // Deassert GPIO(0,12): process is about to sleep
#if 0
    gpio_set_value(GPIO_TO_PIN(0,12), 0);
#else
    __raw_writel(0x1000, gpio_base_addr + BANK0_CLR_OFFSET);
#endif

    // Sleep until the sync ISR wakes the queue (unconditional wait)
    prepare_to_wait(&dev_config->write_wq, &wait, TASK_INTERRUPTIBLE);
    schedule();
    finish_wait(&dev_config->write_wq, &wait);

    // Assert GPIO(0,12): process is running again
#if 0
    gpio_set_value(GPIO_TO_PIN(0,12), 1);
#else
    __raw_writel(0x1000, gpio_base_addr + BANK0_SET_OFFSET);
#endif

    return count;
}
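The write call above sleeps without testing a condition, so a wake-up that arrives before `prepare_to_wait()` would not be seen until the next pulse. In the kernel the usual guard is `wait_event_interruptible()` with a flag or counter set by the ISR. As a hedged illustration only, here is the same predicate-guarded pattern as a userspace analogue using a pthread condition variable. This is not driver code, and `struct tick_queue`, `TICK_QUEUE_INIT`, `tick_post` and `tick_wait` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the driver's wait queue.
 * 'ticks' counts sync pulses, so a pulse that fires while the
 * consumer is busy is still seen on the next wait. */
struct tick_queue {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    unsigned long   ticks;   /* incremented by the producer ("ISR") */
};

#define TICK_QUEUE_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0 }

/* Producer side: record the pulse, then wake the waiter. */
static void tick_post(struct tick_queue *q)
{
    assert(q != NULL);
    pthread_mutex_lock(&q->lock);
    q->ticks++;
    pthread_cond_signal(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side: sleep only while no unhandled pulse is recorded.
 * 'seen' is the number of pulses the consumer has handled so far;
 * the return value is the updated count. */
static unsigned long tick_wait(struct tick_queue *q, unsigned long seen)
{
    assert(q != NULL);
    pthread_mutex_lock(&q->lock);
    while (q->ticks == seen)            /* predicate re-checked on every wake */
        pthread_cond_wait(&q->cond, &q->lock);
    pthread_mutex_unlock(&q->lock);
    return seen + 1;
}
```

Because the predicate is checked before sleeping, a `tick_post()` that happens before `tick_wait()` is not lost: the wait returns immediately. Comparing `ticks` with `seen` also shows how many pulses the consumer has fallen behind, which the unconditional sleep cannot report.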

 

Userspace

Write thread simply loops and prints out a message every 1 second

 

while(!exitting)
{
    if(write(fd, txbuffer, strlen(txbuffer) + 1) > 0)
    {
        txcount++;

        if(txcount == 8000)
        {
            txcount = 0;
            printf("Write woken 8000x\n");
        }

#ifdef ADD_DELAY
        for(z = 0; z < 400; z++)
        {
            t = z;
        }
#endif
    }
    else
    {
        printf("Error writing to uart\n");
    }
}

 

 

Thread is started as a real-time process

 

 

 

#ifdef REAL_TIME
    pthread_attr_init(&attr);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);

    schedp.sched_priority = 99;

    pthread_attr_setschedparam(&attr, &schedp);
    pthread_create(&txthread, &attr, processTx, NULL);

    pthread_attr_destroy(&attr);
#else
    pthread_create(&txthread, NULL, processTx, NULL);
#endif
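The thread setup above does not check any return values, so a failure to obtain SCHED_FIFO (for example EPERM when the process lacks real-time privileges) would only show up as a thread that never starts. A minimal sketch of the same sequence with error checking follows. `start_fifo_thread` and `example_worker` are hypothetical names, and the priority is passed in by the caller rather than fixed at 99.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Trivial placeholder thread body (stands in for processTx). */
static void *example_worker(void *arg)
{
    return arg;
}

/* Create a SCHED_FIFO thread and report failures instead of
 * ignoring them. Returns 0 on success, or the error number from
 * the first failing pthread call (typically EPERM when the caller
 * is not allowed to use real-time scheduling). */
static int start_fifo_thread(pthread_t *thread, int priority,
                             void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    struct sched_param sp = { 0 };
    int rc;

    assert(thread != NULL && fn != NULL);

    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    sp.sched_priority = priority;

    /* Without EXPLICIT_SCHED the policy set below is ignored and
     * the new thread inherits the creator's scheduling. */
    rc = pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    if (rc == 0)
        rc = pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    if (rc == 0)
        rc = pthread_attr_setschedparam(&attr, &sp);
    if (rc == 0)
        rc = pthread_create(thread, &attr, fn, arg);

    pthread_attr_destroy(&attr);
    return rc;
}
```

A caller would invoke `start_fifo_thread(&txthread, 99, processTx, NULL)` and print `strerror()` of any non-zero result. That at least confirms the write thread really is running under SCHED_FIFO before the latency figures are interpreted.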

 


  • Also in the kernel bootup I get the following:

    Dentry cache hash table entries: 4096 (order: 2, 16384 bytes)
    Inode-cache hash table entries: 2048 (order: 1, 8192 bytes)

    I'm wondering if this means the instruction cache is only using 8k of the available 16k. ... I need to investigate this.

  • (I work with David)

    Apologies: Dentry and Inode-cache refer to the filesystem caches, not the instruction/data cache. Sorry if this caused any confusion.

     

    Attached are our bootup messages for a 2.6.18 MontaVista kernel; please note the lines highlighted regarding our kernel configuration.

    Hope this information is of help!

     

    Linux version 2.6.18_pro500-da830_omapl137_evm-arm_v5t_le (xxxx@yyyyyy) (gcc version 4.2.0 (MontaVista 4.2.0-16.0.32.0801914 2008-08-30)) #1 PREEMPT Tue Jan 25 12:06:28 GMT 2011

    CPU: ARM926EJ-S [41069265] revision 5 (ARMv5TEJ), cr=00053177

    Machine: DaVinci DA8XX EVM

    Memory policy: ECC disabled, Data cache writethrough

    On node 0 totalpages: 8192

      DMA zone: 8192 pages, LIFO batch:1

    DA830 variant 0x9

    CPU0: D VIVT write-back cache

    CPU0: I cache: 16384 bytes, associativity 4, 32 byte lines, 128 sets

    CPU0: D cache: 16384 bytes, associativity 4, 32 byte lines, 128 sets

    Built 1 zonelists.  Total pages: 8192

    Kernel command line: console=ttyS2,115200n8 noinitrd rw ip=aa.bb.cc.dd root=/dev/nfs nfsroot=aa.bb.cc.dd:/home/xxxxx/omap_evaluation/fileSystem, nolock mem=32M

    PID hash table entries: 256 (order: 8, 1024 bytes)

    Clock event device timer0_0 configured with caps set: 07

    Console: colour dummy device 80x30

    Dentry cache hash table entries: 4096 (order: 2, 16384 bytes)

    Inode-cache hash table entries: 2048 (order: 1, 8192 bytes)

    Memory: 32MB = 32MB total

    Memory: 28744KB available (2892K code, 604K data, 176K init)

    Calibrating delay loop... 149.50 BogoMIPS (lpj=747520)

    Security Framework v1.0.0 initialized

    Capability LSM initialized

    Mount-cache hash table entries: 512

    CPU: Testing write buffer coherency: ok

    NET: Registered protocol family 16

    DaVinci: 128 gpio irqs

    Generic PHY: Registered new driver

    usbcore: registered new driver usbfs

    usbcore: registered new driver hub

    NET: Registered protocol family 2

    IP route cache hash table entries: 256 (order: -2, 1024 bytes)

    TCP established hash table entries: 1024 (order: 0, 4096 bytes)

    TCP bind hash table entries: 512 (order: -1, 2048 bytes)

    TCP: Hash tables configured (established 1024 bind 512)

    TCP reno registered

    NetWinder Floating Point Emulator V0.97 (double precision)

    VFS: Disk quotas dquot_6.5.1

    Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)

    squashfs: version 3.1 (2006/08/19) Phillip Lougher

    JFFS version 1.0, (C) 1999, 2000  Axis Communications AB

    JFFS2 version 2.2. (NAND) (SUMMARY)  (C) 2001-2006 Red Hat, Inc.

    yaffs Jan 25 2011 12:03:17 Installing.

    SGI XFS with no debug enabled

    Initializing Cryptographic API

    io scheduler noop registered

    io scheduler anticipatory registered (default)

    LTT : ltt-facilities init

    LTT : ltt-facility-core init in kernel

    DAVINCI-WDT: DaVinci Watchdog Timer: heartbeat 60 sec

    Serial: 8250/16550 driver $Revision: 1.90 $ 3 ports, IRQ sharing disabled

    serial8250.0: ttyS0 at MMIO map 0x1c42000 mem 0xfec42000 (irq = 25) is a 16550A

    serial8250.0: ttyS1 at MMIO map 0x1c20400 mem 0xfed0c000 (irq = 53) is a 16550A

    serial8250.0: ttyS2 at MMIO map 0x1d0d000 mem 0xfed0d000 (irq = 61) is a 16550A

    RAMDISK driver initialized: 1 RAM disks of 32768K size 1024 blocksize

    Davinci EMAC MII Bus: probed

    MAC address is 00:0e:99:03:12:76

    TI DaVinci EMAC Linux version updated 4.0

    i2c /dev entries driver

    Creating 3 MTD partitions on "Windbond spi nand flash":

    0x00000000-0x00040000 : "U-Boot"

    0x00040000-0x00044000 : "U-Boot Environment"

    0x00044000-0x00400000 : "Linux"

    dm_spi.0: davinci SPI Controller driver at 0xc285c000 (irq = 20) use_dma=1

    dm_spi.1: davinci SPI Controller driver at 0xc285e000 (irq = 56) use_dma=1

    ohci_hcd: 2006 August 04 USB 1.1 'Open' Host Controller (OHCI) Driver

    ohci ohci.0: DA8xx OHCI

    ohci ohci.0: new USB bus registered, assigned bus number 1

    Waiting for USB PHY clock good...

    ohci ohci.0: irq 59, io mem 0x01e25000

    usb usb1: configuration #1 chosen from 1 choice

    hub 1-0:1.0: USB hub found

    hub 1-0:1.0: 1 port detected

    usbcore: registered new driver libusual

    musb_hdrc: version 6.0, cppi4.1-dma, host, debug=0

    Waiting for USB PHY clock good...

    musb_hdrc: ConfigData=0x06 (UTMI-8, dyn FIFOs, SoftConn)

    musb_hdrc: MHDRC RTL version 1.800

    musb_hdrc: setup fifo_mode 2

    musb_hdrc: 8/9 max ep, 3392/4096 memory

    musb_hdrc: hw_ep 0shared, max 64

    musb_hdrc: hw_ep 1tx, max 512

    musb_hdrc: hw_ep 1rx, max 512

    musb_hdrc: hw_ep 2tx, max 512

    musb_hdrc: hw_ep 2rx, max 512

    musb_hdrc: hw_ep 3tx, max 512

    musb_hdrc: hw_ep 3rx, max 512

    musb_hdrc: hw_ep 4shared, max 256

    musb_hdrc: USB Host mode controller at c2860000 using DMA, IRQ 58

    musb_hdrc musb_hdrc: MUSB HDRC host driver

    musb_hdrc musb_hdrc: new USB bus registered, assigned bus number 2

    usb usb2: configuration #1 chosen from 1 choice

    hub 2-0:1.0: USB hub found

    hub 2-0:1.0: 1 port detected

    rtc-da8xx rtc-da8xx.0: rtc intf: proc

    rtc-da8xx rtc-da8xx.0: rtc intf: dev (254:0)

    rtc-da8xx rtc-da8xx.0: rtc core: registered rtc-da8xx as rtc0

    rtc-da8xx rtc-da8xx.0: TI DA8xx Real Time Clock driver.

    davinci-mmc davinci-mmc.0: Supporting 8-bit mode

    davinci-mmc davinci-mmc.0: Supporting 4-bit mode

    davinci-mmc davinci-mmc.0: Using DMA mode

    IPv4 over IPv4 tunneling driver

    TCP bic registered

    NET: Registered protocol family 1

    NET: Registered protocol family 17

    rtc-da8xx rtc-da8xx.0: setting the system clock to 2000-01-08 01:09:38 (947293778)

    Time: timer0_1 clocksource has been installed.

    Clock event device timer0_0 configured with caps set: 08

    Switched to high resolution mode on CPU 0

    mmcblk0: mmc0:aaaa SD02G 1931264KiB

     mmcblk0: p1

    IP-Config: Guessing netmask 255.255.0.0

    IP-Config: Complete:

          device=eth0, addr=aa.bb.cc.dd, mask=255.255.0.0, gw=255.255.255.255,

         host=aa.bb.cc.dd, domain=, nis-domain=(none),

         bootserver=255.255.255.255, rootserver=aa.bb.cc.dd, rootpath=

    Looking up port of RPC 100003/2 on aa.bb.cc.dd

    Looking up port of RPC 100005/1 on aa.bb.cc.dd

    VFS: Mounted root (nfs filesystem).

    Freeing init memory: 176K

  • I did a lot of work with the OMAPL137 when it first came out, prior to there being much documentation.  You are correct about the cache problem with the part.  TI should have made the cache much bigger.

    From a Linux perspective the chip is good for slow Linux operation.  If you want performance you have to utilize the DSP.  It is very fast and, depending upon what you want to do, can be integrated with your Linux application.  That gives you true real-time on the DSP side, with communications, user interface and slow real-time handled by Linux.  TI provides DSPLink to interact with the DSP, but the better way is to write your own simple shared memory driver and use the host interrupt capability and dual ported RAM.

    Another alternative: a couple of years ago I ported FreeRTOS to the ARM side of the OMAPL137 and made it available on this forum (search on FreeRTOS).  If you don't really need Linux then you can avoid the overhead and run FreeRTOS on the ARM9 and DSP/BIOS on the DSP side.  Either way, the only way you will get performance out of the L137 is to lean heavily on the DSP; the cache is too limiting and Linux places too much of a load on it.

    For our newer application we have started to work with the OMAP3530.  We were torn between an OMAPL138 discrete design and an OMAP3530 module you can buy for $80. The 3530 DSP is not as good but the Linux performance will be much better.

    Now if we could only get TI to use the Cortex A8 in the OMAPL137/138 with a larger cache and the same DSP/dual ported memory, you'd have an awesome chip.  

    Kev