(DM6446) High ARM CPU load in Ethernet driver when DSP is doing intensive memcopy

Good morning,
I'm having some trouble with the Ethernet driver on the DM6446 with Linux 2.6.18pro500 PREEMPT.
The system captures from a CMOS sensor, runs the previewer to grab a YCbCr image and then does some post-processing on the DSP. The resulting image is JPEG-encoded (also on the DSP) and sent over a socket on the ARM side. Overall the system is heavily loaded on the DSP/VPSS side, while the ARM side is essentially free apart from the socket processing.
The generated UDP stream is about 1.5 MB/s. Under these conditions the ARM CPU usage is more or less negligible in user space, but it reaches 50%, with peaks of 60% and above, in the kernel. The ARM thread that does the socket processing is scheduled round robin (SCHED_RR) with priority 95 and maximum task nice. The task is made up of other threads as well, but changing their relative priorities or the global task nice does not change the load much. I also tried changing the nice of the softirq-net-tx/rx threads, with no relevant improvement.
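
A minimal sketch (placeholder code, not the real application) of how the socket-processing thread can be switched to SCHED_RR at priority 95 from within the thread itself; the helper name set_rr_priority is made up for illustration:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Called at the start of the socket-processing thread: move the calling
 * thread to SCHED_RR at the given priority (95 in the description above). */
int set_rr_priority(int prio)
{
    struct sched_param sp = { .sched_priority = prio };
    int err = pthread_setschedparam(pthread_self(), SCHED_RR, &sp);

    if (err != 0)
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
    return err;
}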
I made some runs with oprofile, and these are the kernel functions with the most hits:

1 0.0052 vmlinux vmlinux skb_dequeue
2 0.0105 vmlinux vmlinux n_tty_chars_in_buffer
2 0.0105 vmlinux vmlinux __queue_work
3 0.0157 vmlinux vmlinux skb_queue_tail
3 0.0157 vmlinux vmlinux __wake_up_sync
8 0.0419 vmlinux vmlinux add_wait_queue_exclusive
23 0.1205 vmlinux vmlinux tty_ldisc_deref
23 0.1205 vmlinux vmlinux tty_ldisc_try
30 0.1572 vmlinux vmlinux videobuf_qbuf
51 0.2672 vmlinux vmlinux hrtimer_try_to_cancel
51 0.2672 vmlinux vmlinux n_tty_receive_buf
73 0.3825 vmlinux vmlinux emac_poll
621 3.2539 vmlinux vmlinux __mod_timer
622 3.2591 vmlinux vmlinux hrtimer_start
810 4.2442 vmlinux vmlinux del_timer
1452 7.6081 vmlinux vmlinux remove_wait_queue
1691 8.8604 vmlinux vmlinux __up_read
1964 10.2908 vmlinux vmlinux add_wait_queue
2485 13.0207 vmlinux vmlinux emac_dev_tx
4428 23.2015 vmlinux vmlinux emac_tx_bdproc
4742 24.8467 vmlinux vmlinux __wake_up
18886 5.5619 vmlinux vmlinux __spin_unlock_irqrestore
18886 99.9947 vmlinux vmlinux __spin_unlock_irqrestore [self]
1 0.0053 vmlinux vmlinux preempt_schedule
-------------------------------------------------------------------------------
84 1.2247 vmlinux vmlinux dev_queue_xmit
116 1.6912 vmlinux vmlinux __ip_route_output_key
121 1.7641 vmlinux vmlinux __spin_lock_bh
124 1.8078 vmlinux vmlinux __read_lock_bh
6414 93.5122 vmlinux vmlinux local_bh_disable
6729 1.9817 vmlinux vmlinux __local_bh_disable
6729 100.000 vmlinux vmlinux __local_bh_disable [self]
-------------------------------------------------------------------------------
164 2.5119 vmlinux vmlinux ip_generic_getfrag
6365 97.4881 vmlinux vmlinux csum_partial_copy_fromiovecend
6447 1.8986 vmlinux vmlinux csum_partial_copy_from_user
6447 100.000 vmlinux vmlinux csum_partial_copy_from_user [self]

I made an ARM test program that uses neither the DSP nor the VPSS: it simply sends a constant 1.5 MB/s UDP stream
to a client. In this case the kernel part of the ARM CPU load never exceeds 10%, with a mean load of 6%.
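
A minimal sketch of such a test sender (the address, port and packet size are placeholders, and the nanosleep()-based pacing is an assumption, though it would be consistent with the do_nanosleep/hrtimer hits in the profile below):

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#define DEST_IP   "192.168.1.10"   /* placeholder client address */
#define DEST_PORT 5000             /* placeholder port           */
#define PKT_SIZE  1472             /* max UDP payload on a 1500-byte MTU */

int main(void)
{
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in dst;
    unsigned char buf[PKT_SIZE];
    /* One 1472-byte datagram per millisecond is roughly 1.5 MB/s. */
    struct timespec period = { .tv_sec = 0, .tv_nsec = 1000 * 1000 };

    if (sock < 0) {
        perror("socket");
        return 1;
    }

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(DEST_PORT);
    dst.sin_addr.s_addr = inet_addr(DEST_IP);
    memset(buf, 0xA5, sizeof(buf));

    for (;;) {
        if (sendto(sock, buf, sizeof(buf), 0,
                   (struct sockaddr *)&dst, sizeof(dst)) < 0)
            perror("sendto");
        nanosleep(&period, NULL);
    }

    close(sock);   /* not reached */
    return 0;
}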
The kernel functions with the most hits in this case are:

2 0.2039 vmlinux emac_tx_bdproc
5 0.5097 vmlinux emac_poll
141 14.3731 vmlinux hrtimer_try_to_cancel
261 26.6055 vmlinux emac_dev_tx
572 58.3078 vmlinux hrtimer_start
293 18.5678 vmlinux __spin_unlock_irqrestore
293 100.000 vmlinux __spin_unlock_irqrestore [self]
-------------------------------------------------------------------------------
257 100.000 vmlinux __schedule
159 10.0760 vmlinux __spin_unlock_irq
159 100.000 vmlinux __spin_unlock_irq [self]
-------------------------------------------------------------------------------
6 1.2346 vmlinux do_nanosleep
480 98.7654 vmlinux schedule
80 5.0697 vmlinux __schedule
257 64.5729 vmlinux __spin_unlock_irq
80 20.1005 vmlinux __schedule [self]
22 5.5276 vmlinux add_preempt_count
11 2.7638 vmlinux arm_return_addr
11 2.7638 vmlinux profile_hit
10 2.5126 vmlinux sched_clock
2 0.5025 vmlinux in_lock_functions
2 0.5025 vmlinux sub_preempt_count
2 0.5025 vmlinux __spin_lock_irq
1 0.2513 vmlinux debug_smp_processor_id

After this, I kept the UDP test program running on the ARM and, with Code Composer, I made a
simple DSP program that continuously copies data, with the intent of loading the DDR memory
and checking whether it could be a cause of the problem. With the memcopy program running on the DSP, I
found that the CPU load of the ARM-side UDP test program rose to a mean of 20% with peaks of 23% (from a 6% mean load).
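
The DSP test is roughly the following sketch (the buffer size and the ".ddr_data" section name are assumptions; the section is meant to be mapped to the DDR2 range in the linker command file):

#include <string.h>

#define BUF_WORDS (1024 * 1024)   /* 4 MB per buffer, assumed size */

/* Both buffers live in external memory: ".ddr_data" is assumed to be
 * mapped to DDR2 in the linker command file. */
#pragma DATA_SECTION(src_buf, ".ddr_data")
#pragma DATA_SECTION(dst_buf, ".ddr_data")
unsigned int src_buf[BUF_WORDS];
unsigned int dst_buf[BUF_WORDS];

int main(void)
{
    /* Copy back and forth forever so the DSP keeps the DDR bus busy. */
    for (;;) {
        memcpy(dst_buf, src_buf, sizeof(src_buf));
        memcpy(src_buf, dst_buf, sizeof(dst_buf));
    }
}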
In the production software this condition limits what the ARM can do, because most of its cycles seem to be
lost spinning in the kernel driver. Is this simply an unavoidable effect of the heavy memory load, or can I somehow
tune the kernel Ethernet driver to limit the problem?

Thanks,

Matteo

  • Moving this post to DM64x forum
  • Hi Matteo,
    Sorry for the delayed response here.
    I think it is due to heavy loading of the DDR by both the ARM and the DSP.

    After this, I kept the UDP test program running on the ARM and, with Code Composer, I made a
    simple DSP program that continuously copies data, with the intent of loading the DDR memory
    and checking whether it could be a cause of the problem.

    Could you try using internal memory for the DSP copy when you run the test in CCS?
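
    A minimal sketch of what that could look like, keeping the same copy loop but placing both buffers in on-chip memory (the ".iram_data" section name and the sizes are assumptions; the section has to be mapped to internal RAM in the linker command file):

    #define IRAM_WORDS (4 * 1024)   /* 16 KB per buffer, assumed to fit in internal RAM */

    /* ".iram_data" is assumed to be mapped to IRAM/L2 in the linker command
     * file, so the copy loop never touches the DDR. */
    #pragma DATA_SECTION(src_buf, ".iram_data")
    #pragma DATA_SECTION(dst_buf, ".iram_data")
    unsigned int src_buf[IRAM_WORDS];
    unsigned int dst_buf[IRAM_WORDS];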
  • Hi Titusrathinaraj,

    Thanks for the answer,

    Did you mean an IRAM-to-IRAM memcopy?

    If so, I tried it, limited to the few kilobytes available, and it indeed does not change the ARM-side CPU usage, which confirms your answer.

    So, in my application, could this mean that the ARM profiler's higher hit counts are due to memory-accessing instructions taking longer, which proportionally increases their hit rate? Can I still hope for a synchronization issue (such as spinlocks) in the driver, or should I give up?

    Best regards,

    Matteo


  • So, in my application, could this mean that the ARM profiler's higher hit counts are due to memory-accessing instructions taking longer, which proportionally increases their hit rate?

    Yes, I suspect the same.
    I've seen similar issues on other processors, for example data being lost because the DDR was busy.