Performance tests with new network stack

Sebastian Huber sebastian.huber at embedded-brains.de
Thu Sep 25 06:15:06 UTC 2014


Hello,

I used simple FTP transfers to and from the target to measure the TCP
performance of the new network stack on a PowerPC MPC8309.  The new network
stack is a port from FreeBSD 9.2.  It is highly optimized for SMP and uses
fine-grained locking, which is of no benefit on uni-processor systems.  About
2000 mutexes are present in the idle state of the stack.  It turned out that
the standard RTEMS semaphores are a major performance bottleneck, so I added a
lightweight alternative (rtems_bsd_mutex).  For fine-grained locking it is
important that the uncontended mutex obtain/release is as fast as possible.
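
To illustrate what a fast uncontended path means, here is a minimal sketch
in portable C11.  It is not the rtems_bsd_mutex implementation; the names
sketch_mutex, sketch_mutex_try_lock(), sketch_mutex_lock() and
sketch_mutex_unlock() are made up for this example, and the slow path simply
yields instead of blocking on a thread queue.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lightweight mutex: 0 = unlocked, 1 = locked. */
typedef struct {
  atomic_int state;
} sketch_mutex;

#define SKETCH_MUTEX_INITIALIZER { 0 }

/* One atomic compare-and-swap; succeeds only if the mutex was free. */
static inline bool sketch_mutex_try_lock(sketch_mutex *m)
{
  int expected = 0;

  return atomic_compare_exchange_strong_explicit(
    &m->state, &expected, 1, memory_order_acquire, memory_order_relaxed);
}

static inline void sketch_mutex_lock(sketch_mutex *m)
{
  /* Fast path: uncontended obtain costs a single atomic operation. */
  if (sketch_mutex_try_lock(m)) {
    return;
  }

  /* Slow path (assumption: yield and retry; a real RTOS would block). */
  while (!sketch_mutex_try_lock(m)) {
    sched_yield();
  }
}

static inline void sketch_mutex_unlock(sketch_mutex *m)
{
  /* Release is a single atomic exchange; the mutex must have been held. */
  int previous = atomic_exchange_explicit(&m->state, 0, memory_order_release);

  assert(previous == 1);
  (void) previous;
}
```

The point of the sketch is only that the common case never touches an object
table, a thread queue or the scheduler.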

With the latest version (struct timespec and rtems_bsd_mutex) I get this:

curl -o /dev/null  ftp://anonymous@192.168.100.70/dev/zero
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
   0     0    0 1194M    0     0  9101k      0 --:--:--  0:02:14 --:--:-- 9158k

       perf disabled   coverage: 100.000%  runtime:  99.998%   covtime: 100.000%
name________________________|ratio___|1%_____2%________5%_____10%_____20%_____|
in_cksumdata                | 11.137%|==========================              |
memcpy                      | 10.430%|=========================               |
tcp_output                  |  7.189%|=====================                   |
ip_output                   |  3.241%|=============                           |
uma_zalloc_arg              |  2.710%|===========                             |
ether_output                |  2.533%|==========                              |
tcp_do_segment              |  2.121%|========                                |
m_copym                     |  2.062%|========                                |
uma_zfree_arg               |  2.062%|========                                |
bsd__mtx_unlock_flags       |  2.062%|========                                |
tcp_input                   |  2.003%|=======                                 |
Thread_Dispatch             |  1.885%|=======                                 |
rtalloc1_fib                |  1.649%|=====                                   |
ip_input                    |  1.708%|======                                  |
memmove                     |  1.532%|====                                    |
rn_match                    |  1.473%|====                                    |
tcp_addoptions              |  1.414%|====                                    |
arpresolve                  |  1.355%|===                                     |
in_cksum_skip               |  1.296%|===                                     |
memset                      |  1.296%|===                                     |
mb_dupcl                    |  1.178%|==                                      |
uec_if_dequeue              |  1.178%|==                                      |
in_lltable_lookup           |  1.119%|=                                       |
rtfree                      |  1.001%|<                                       |
ether_nh_input              |  1.001%|<                                       |
uec_if_bd_wait_and_free     |  1.001%|<                                       |
quicc_bd_tx_submit_and_wait |  1.001%|<                                       |
TOD_Get_with_nanoseconds    |  1.001%|<                                       |
uec_if_interface_start      |  0.942%|<                                       |
bsd__mtx_lock_flags         |  0.883%|<                                       |
bzero                       |  0.883%|<                                       |
mb_ctor_mbuf                |  0.824%|<                                       |
mb_free_ext                 |  0.824%|<                                       |
netisr_dispatch_src         |  0.824%|<                                       |
in_pcblookup_hash_locked.isr|  0.766%|<                                       |
bsd_critical_enter          |  0.766%|<                                       |
rw_runlock                  |  0.707%|<                                       |
if_transmit                 |  0.707%|<                                       |
Timespec_Add_to             |  0.707%|<                                       |
in_delayed_cksum            |  0.648%|<                                       |
tcp_timer_active            |  0.648%|<                                       |
ether_demux                 |  0.648%|<                                       |
ppc_clock_nanoseconds_since_|  0.648%|<                                       |
RBTree_Find                 |  0.648%|<                                       |
Thread_Enable_dispatch      |  0.648%|<                                       |
rw_rlock                    |  0.589%|<                                       |
callout_reset_on            |  0.589%|<                                       |
in_clsroute                 |  0.589%|<                                       |

We have 3% processor load due to mutex operations (_bsd__mtx_lock_flags() and 
_bsd__mtx_unlock_flags()).

With the 64-bit nanoseconds timestamp I get this:

curl -o /dev/null  ftp://anonymous@192.168.100.70/dev/zero
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
   0     0    0  830M    0     0  8834k      0 --:--:--  0:01:39 --:--:-- 8982k

       perf disabled   coverage: 100.000%  runtime:  99.998%   covtime: 100.000%
name____________________________________|ratio___|1%_____2%________5%_____10%_|
in_cksumdata                            | 10.130%|=========================   |
memcpy                                  |  9.786%|========================    |
tcp_output                              |  8.890%|=======================     |
ip_output                               |  5.031%|=================           |
ether_output                            |  2.618%|==========                  |
Thread_Dispatch                         |  2.549%|==========                  |
__divdi3                                |  2.205%|========                    |
bsd__mtx_unlock_flags                   |  2.136%|========                    |
__moddi3                                |  2.067%|========                    |
tcp_input                               |  1.998%|=======                     |
uma_zalloc_arg                          |  1.929%|=======                     |
m_copym                                 |  1.654%|=====                       |
tcp_do_segment                          |  1.654%|=====                       |
tcp_addoptions                          |  1.516%|====                        |
sbdrop_internal                         |  1.447%|====                        |
mb_free_ext                             |  1.378%|===                         |
uma_zfree_arg                           |  1.309%|===                         |
ip_input                                |  1.240%|==                          |
in_cksum_skip                           |  1.171%|=                           |
uec_if_interface_start                  |  1.171%|=                           |
quicc_bd_tx_submit_and_wait             |  1.171%|=                           |
callout_reset_on                        |  1.102%|=                           |
rtfree                                  |  1.033%|                            |
uec_if_dequeue                          |  1.102%|=                           |
rn_match                                |  0.964%|<                           |
rtalloc1_fib                            |  0.964%|<                           |
ether_nh_input                          |  0.964%|<                           |
uec_if_bd_wait_and_free                 |  0.964%|<                           |
mb_ctor_mbuf                            |  0.895%|<                           |
in_lltable_lookup                       |  0.895%|<                           |
memset                                  |  0.895%|<                           |
uec_if_bd_wait.constprop.9              |  0.827%|<                           |
mb_dupcl                                |  0.758%|<                           |
cc_ack_received.isra.0                  |  0.758%|<                           |
tcp_timer_active                        |  0.758%|<                           |
bsd__mtx_lock_flags                     |  0.689%|<                           |
netisr_dispatch_src                     |  0.689%|<                           |
in_pcblookup_hash_locked.isra.1         |  0.689%|<                           |
tcp_xmit_timer                          |  0.689%|<                           |
sosend_generic                          |  0.620%|<                           |
rtems_bsd_chunk_get_info                |  0.620%|<                           |
Thread_Enable_dispatch                  |  0.620%|<                           |
bzero                                   |  0.620%|<                           |
rw_runlock                              |  0.551%|<                           |
uma_find_refcnt                         |  0.551%|<                           |
arpresolve                              |  0.551%|<                           |
chunk_compare                           |  0.551%|<                           |
ether_demux                             |  0.551%|<                           |
rtems_clock_get_uptime_timeval          |  0.551%|<                           |
TOD_Get_with_nanoseconds                |  0.551%|<                           |
memcmp                                  |  0.551%|<                           |
mb_ctor_clust                           |  0.482%|<                           |
in_pcblookup_hash                       |  0.482%|<                           |
in_clsroute                             |  0.482%|<                           |

So we have about 4.3% processor load due to the 64-bit divisions (__divdi3 and
__moddi3) and the throughput drops by 3%.
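
A rough sketch of where such divisions can come from, assuming the timestamp
is a 64-bit nanoseconds counter that has to be converted to seconds and
nanoseconds.  The function names ns_to_timespec() and timespec_add_to() are
made up for this example and are not the RTEMS functions.  On a 32-bit target
the compiler typically turns the 64-bit "/" and "%" into libgcc calls, while
adding two normalized struct timespec values needs only a compare and a
subtract.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NS_PER_SEC 1000000000L

/* 64-bit nanoseconds to timespec: one 64-bit division and one modulo. */
static struct timespec ns_to_timespec(int64_t ns)
{
  struct timespec ts;

  ts.tv_sec = (time_t) (ns / NS_PER_SEC);
  ts.tv_nsec = (long) (ns % NS_PER_SEC);

  return ts;
}

/* Add a normalized delta to a normalized timespec without any division. */
static void timespec_add_to(struct timespec *t, const struct timespec *delta)
{
  assert(delta->tv_nsec >= 0 && delta->tv_nsec < NS_PER_SEC);

  t->tv_sec += delta->tv_sec;
  t->tv_nsec += delta->tv_nsec;

  /* Both inputs are below one second, so one carry step is enough. */
  if (t->tv_nsec >= NS_PER_SEC) {
    t->tv_nsec -= NS_PER_SEC;
    ++t->tv_sec;
  }
}
```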

With the standard RTEMS objects I get this:

curl -o /dev/null  ftp://anonymous@192.168.100.70/dev/zero
   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                  Dload  Upload   Total   Spent    Left  Speed
   0     0    0  927M    0     0  8438k      0 --:--:--  0:01:52 --:--:-- 8528k

       perf disabled   coverage: 100.000%  runtime:  99.997%   covtime: 100.000%
name____________________________________|ratio___|1%_____2%________5%_____10%_|
in_cksumdata                            | 10.184%|=========================   |
memcpy                                  |  9.052%|========================    |
tcp_output                              |  8.382%|=======================     |
ip_output                               |  3.310%|=============               |
rtems_semaphore_obtain                  |  3.017%|============                |
ether_output                            |  2.598%|==========                  |
Thread_Dispatch                         |  2.430%|=========                   |
uma_zalloc_arg                          |  1.844%|======                      |
uma_zfree_arg                           |  1.634%|=====                       |
quicc_bd_tx_submit_and_wait             |  1.634%|=====                       |
tcp_do_segment                          |  1.550%|=====                       |
uec_if_dequeue                          |  1.508%|====                        |
in_lltable_lookup                       |  1.466%|====                        |
rn_match                                |  1.424%|====                        |
rtalloc1_fib                            |  1.424%|====                        |
ip_input                                |  1.424%|====                        |
in_cksum_skip                           |  1.424%|====                        |
rtems_semaphore_release                 |  1.424%|====                        |
CORE_mutex_Surrender                    |  1.383%|===                         |
Thread_queue_Dequeue                    |  1.341%|===                         |
m_copym                                 |  1.257%|==                          |
bsd__mtx_lock_flags                     |  1.173%|=                           |
mb_free_ext                             |  1.173%|=                           |
arpresolve                              |  1.173%|=                           |
memset                                  |  1.173%|=                           |
tcp_input                               |  1.131%|=                           |
tcp_addoptions                          |  1.089%|=                           |
bsd__mtx_unlock_flags                   |  1.047%|                            |
ether_nh_input                          |  1.047%|                            |
bzero                                   |  0.963%|<                           |
rtfree                                  |  0.922%|<                           |
netisr_dispatch_src                     |  0.880%|<                           |
mb_dupcl                                |  0.838%|<                           |
rtalloc_ign_fib                         |  0.838%|<                           |
in_broadcast                            |  0.838%|<                           |
uec_if_interface_start                  |  0.838%|<                           |
memmove                                 |  0.838%|<                           |
mb_ctor_mbuf                            |  0.796%|<                           |
tcp_timer_active                        |  0.796%|<                           |
chunk_compare                           |  0.712%|<                           |
callout_reset_on                        |  0.712%|<                           |
in_pcblookup_hash_locked                |  0.712%|<                           |
uec_if_bd_wait_and_free                 |  0.712%|<                           |
RBTree_Find                             |  0.712%|<                           |
tcp_dooptions                           |  0.670%|<                           |
sbsndptr                                |  0.628%|<                           |
if_transmit                             |  0.586%|<                           |
Objects_Get_isr_disable                 |  0.544%|<                           |

So we have 8.5% processor load due to mutex operations and the throughput drops
by 7%.

In all configurations we see that the UMA zone allocator used for mbuf/mcluster 
allocations produces a high processor load.  If we replace it with a simple 
freelist, then we will likely be on par with the old network stack in terms of 
throughput on this target.
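
A minimal sketch of such a free list for fixed-size buffers.  All names and
sizes here (sketch_freelist, SKETCH_BUF_SIZE, SKETCH_BUF_COUNT) are made up
for illustration; this is not code from either network stack, and locking is
left out on purpose.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BUF_SIZE 2048  /* assumed buffer size */
#define SKETCH_BUF_COUNT 4    /* assumed pool size, tiny for the example */

/* A free buffer stores the link to the next free buffer in its own memory. */
typedef union sketch_buf {
  union sketch_buf *next;
  unsigned char data[SKETCH_BUF_SIZE];
} sketch_buf;

typedef struct {
  sketch_buf *head;
  sketch_buf pool[SKETCH_BUF_COUNT];
} sketch_freelist;

/* Put every buffer of the pool on the free list. */
static void sketch_freelist_init(sketch_freelist *fl)
{
  fl->head = NULL;

  for (size_t i = 0; i < SKETCH_BUF_COUNT; ++i) {
    fl->pool[i].next = fl->head;
    fl->head = &fl->pool[i];
  }
}

/* Pop the first free buffer; returns NULL if the pool is exhausted. */
static void *sketch_freelist_get(sketch_freelist *fl)
{
  sketch_buf *b = fl->head;

  if (b != NULL) {
    fl->head = b->next;
  }

  return b;
}

/* Push a buffer that came from this pool back onto the free list. */
static void sketch_freelist_put(sketch_freelist *fl, void *p)
{
  sketch_buf *b = p;

  assert(b >= &fl->pool[0] && b < &fl->pool[SKETCH_BUF_COUNT]);
  b->next = fl->head;
  fl->head = b;
}
```

Get and put are a handful of loads and stores each, which is the whole
attraction compared to a general-purpose zone allocator.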

In the new network stack, in_cksumdata() is a generic C implementation.  The
old network stack uses an optimized variant with inline assembler.
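
For reference, a generic Internet checksum (RFC 1071) can be sketched in
portable C as follows.  This is not the FreeBSD in_cksumdata() code; the name
sketch_in_cksum() and the byte-wise loop are choices made for this example.
An optimized variant would process wider words per iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * One's complement checksum over a byte buffer, treating the data as a
 * sequence of big-endian 16-bit words.  The result is returned as a host
 * integer value.
 */
static uint16_t sketch_in_cksum(const uint8_t *data, size_t len)
{
  uint32_t sum = 0;

  /* Limit the length so that the 32-bit accumulator cannot overflow. */
  assert(len <= 65535);

  while (len > 1) {
    sum += ((uint32_t) data[0] << 8) | data[1];
    data += 2;
    len -= 2;
  }

  /* An odd trailing byte is padded with a zero byte on the right. */
  if (len == 1) {
    sum += (uint32_t) data[0] << 8;
  }

  /* Fold the carries back into the lower 16 bits. */
  while ((sum >> 16) != 0) {
    sum = (sum & 0xffff) + (sum >> 16);
  }

  return (uint16_t) ~sum;
}
```

With the example bytes from RFC 1071 (00 01 f2 03 f4 f5 f6 f7) the folded sum
is 0xddf2, so the function returns 0x220d.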

Modern network interface controllers support TCP/UDP checksum generation and
checks in hardware.  This can also be used with the new network stack.

-- 
Sebastian Huber, embedded brains GmbH

Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone   : +49 89 189 47 41-16
Fax     : +49 89 189 47 41-09
E-Mail  : sebastian.huber at embedded-brains.de
PGP     : Public key available on request.

This message is not a business communication within the meaning of the EHUG.

