Performance tests with new network stack
Sebastian Huber
sebastian.huber at embedded-brains.de
Thu Sep 25 06:15:06 UTC 2014
Hello,
I used simple FTP transfers to/from the target to measure the TCP performance
of the new network stack on a PowerPC MPC8309. The new network stack is a port
from FreeBSD 9.2. It is highly optimized for SMP and uses fine-grained
locking. For uniprocessor systems this is not a benefit. About 2000 mutexes
are present in the idle state of the stack. It turned out that the standard
RTEMS semaphores are a major performance bottleneck, so I added a lightweight
alternative (rtems_bsd_mutex). For fine-grained locking it is important that
the uncontested mutex obtain/release is as fast as possible.
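For illustration, such a fast path looks roughly like this. This is a minimal
sketch using C11 atomics, not the actual rtems_bsd_mutex code; the names
(lw_mutex, lw_mutex_obtain_slow) are made up for this example:

#include <stdbool.h>
#include <stdint.h>
#include <stdatomic.h>

typedef struct {
  /* 0 means unlocked, otherwise the owning thread; the wait queue for the
     contested case is omitted here. */
  atomic_uintptr_t owner;
} lw_mutex;

/* Hypothetical slow path (block on a wait queue), provided elsewhere. */
void lw_mutex_obtain_slow(lw_mutex *m, uintptr_t self);

static inline bool lw_mutex_try_obtain(lw_mutex *m, uintptr_t self)
{
  uintptr_t expected = 0;

  /* Uncontested obtain: a single compare-and-swap, no object lookup, no
     thread dispatch, no wait queue operations. */
  return atomic_compare_exchange_strong_explicit(
    &m->owner, &expected, self,
    memory_order_acquire, memory_order_relaxed);
}

static inline void lw_mutex_obtain(lw_mutex *m, uintptr_t self)
{
  if (!lw_mutex_try_obtain(m, self)) {
    lw_mutex_obtain_slow(m, self);
  }
}

static inline void lw_mutex_release(lw_mutex *m)
{
  /* Uncontested release: a single store with release semantics.  A real
     implementation must first check for waiters and hand over ownership
     if there are any. */
  atomic_store_explicit(&m->owner, 0, memory_order_release);
}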
With the latest version (struct timespec timestamps and rtems_bsd_mutex) I get this:
curl -o /dev/null ftp://anonymous@192.168.100.70/dev/zero
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 1194M 0 0 9101k 0 --:--:-- 0:02:14 --:--:-- 9158k
perf disabled coverage: 100.000% runtime: 99.998% covtime: 100.000%
name________________________|ratio___|1%_____2%________5%_____10%_____20%_____|
in_cksumdata | 11.137%|========================== |
memcpy | 10.430%|========================= |
tcp_output | 7.189%|===================== |
ip_output | 3.241%|============= |
uma_zalloc_arg | 2.710%|=========== |
ether_output | 2.533%|========== |
tcp_do_segment | 2.121%|======== |
m_copym | 2.062%|======== |
uma_zfree_arg | 2.062%|======== |
bsd__mtx_unlock_flags | 2.062%|======== |
tcp_input | 2.003%|======= |
Thread_Dispatch | 1.885%|======= |
rtalloc1_fib | 1.649%|===== |
ip_input | 1.708%|====== |
memmove | 1.532%|==== |
rn_match | 1.473%|==== |
tcp_addoptions | 1.414%|==== |
arpresolve | 1.355%|=== |
in_cksum_skip | 1.296%|=== |
memset | 1.296%|=== |
mb_dupcl | 1.178%|== |
uec_if_dequeue | 1.178%|== |
in_lltable_lookup | 1.119%|= |
rtfree | 1.001%|< |
ether_nh_input | 1.001%|< |
uec_if_bd_wait_and_free | 1.001%|< |
quicc_bd_tx_submit_and_wait | 1.001%|< |
TOD_Get_with_nanoseconds | 1.001%|< |
uec_if_interface_start | 0.942%|< |
bsd__mtx_lock_flags | 0.883%|< |
bzero | 0.883%|< |
mb_ctor_mbuf | 0.824%|< |
mb_free_ext | 0.824%|< |
netisr_dispatch_src | 0.824%|< |
in_pcblookup_hash_locked.isr| 0.766%|< |
bsd_critical_enter | 0.766%|< |
rw_runlock | 0.707%|< |
if_transmit | 0.707%|< |
Timespec_Add_to | 0.707%|< |
in_delayed_cksum | 0.648%|< |
tcp_timer_active | 0.648%|< |
ether_demux | 0.648%|< |
ppc_clock_nanoseconds_since_| 0.648%|< |
RBTree_Find | 0.648%|< |
Thread_Enable_dispatch | 0.648%|< |
rw_rlock | 0.589%|< |
callout_reset_on | 0.589%|< |
in_clsroute | 0.589%|< |
We have about 3% processor load due to mutex operations (bsd__mtx_lock_flags()
and bsd__mtx_unlock_flags()).
With the 64-bit nanosecond timestamps I get this:
curl -o /dev/null ftp://anonymous@192.168.100.70/dev/zero
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 830M 0 0 8834k 0 --:--:-- 0:01:39 --:--:-- 8982k
perf disabled coverage: 100.000% runtime: 99.998% covtime: 100.000%
name____________________________________|ratio___|1%_____2%________5%_____10%_|
in_cksumdata | 10.130%|========================= |
memcpy | 9.786%|======================== |
tcp_output | 8.890%|======================= |
ip_output | 5.031%|================= |
ether_output | 2.618%|========== |
Thread_Dispatch | 2.549%|========== |
__divdi3 | 2.205%|======== |
bsd__mtx_unlock_flags | 2.136%|======== |
__moddi3 | 2.067%|======== |
tcp_input | 1.998%|======= |
uma_zalloc_arg | 1.929%|======= |
m_copym | 1.654%|===== |
tcp_do_segment | 1.654%|===== |
tcp_addoptions | 1.516%|==== |
sbdrop_internal | 1.447%|==== |
mb_free_ext | 1.378%|=== |
uma_zfree_arg | 1.309%|=== |
ip_input | 1.240%|== |
in_cksum_skip | 1.171%|= |
uec_if_interface_start | 1.171%|= |
quicc_bd_tx_submit_and_wait | 1.171%|= |
callout_reset_on | 1.102%|= |
rtfree | 1.033%| |
uec_if_dequeue | 1.102%|= |
rn_match | 0.964%|< |
rtalloc1_fib | 0.964%|< |
ether_nh_input | 0.964%|< |
uec_if_bd_wait_and_free | 0.964%|< |
mb_ctor_mbuf | 0.895%|< |
in_lltable_lookup | 0.895%|< |
memset | 0.895%|< |
uec_if_bd_wait.constprop.9 | 0.827%|< |
mb_dupcl | 0.758%|< |
cc_ack_received.isra.0 | 0.758%|< |
tcp_timer_active | 0.758%|< |
bsd__mtx_lock_flags | 0.689%|< |
netisr_dispatch_src | 0.689%|< |
in_pcblookup_hash_locked.isra.1 | 0.689%|< |
tcp_xmit_timer | 0.689%|< |
sosend_generic | 0.620%|< |
rtems_bsd_chunk_get_info | 0.620%|< |
Thread_Enable_dispatch | 0.620%|< |
bzero | 0.620%|< |
rw_runlock | 0.551%|< |
uma_find_refcnt | 0.551%|< |
arpresolve | 0.551%|< |
chunk_compare | 0.551%|< |
ether_demux | 0.551%|< |
rtems_clock_get_uptime_timeval | 0.551%|< |
TOD_Get_with_nanoseconds | 0.551%|< |
memcmp | 0.551%|< |
mb_ctor_clust | 0.482%|< |
in_pcblookup_hash | 0.482%|< |
in_clsroute | 0.482%|< |
So we have about 4.2% processor load due to the 64-bit divisions (__divdi3 and
__moddi3) and the throughput drops by 3%.
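The divisions come from splitting the 64-bit nanosecond count into seconds and
nanoseconds, which a 32-bit PowerPC has to do in software via libgcc. A rough
sketch of the two code paths (not the actual RTEMS timestamp code):

#include <stdint.h>
#include <time.h>

/* 64-bit nanosecond timestamp: converting it to a struct timespec needs a
   64-bit division and modulo, which become libgcc calls on a 32-bit PowerPC
   and show up in the profile above. */
static void nanoseconds_to_timespec(int64_t ns, struct timespec *ts)
{
  ts->tv_sec = (time_t) (ns / 1000000000);   /* -> __divdi3 */
  ts->tv_nsec = (long) (ns % 1000000000);    /* -> __moddi3 */
}

/* struct timespec timestamp: addition only needs a compare and a conditional
   carry, no 64-bit division at all. */
static void timespec_add(struct timespec *t, const struct timespec *delta)
{
  t->tv_sec += delta->tv_sec;
  t->tv_nsec += delta->tv_nsec;
  if (t->tv_nsec >= 1000000000L) {
    t->tv_nsec -= 1000000000L;
    ++t->tv_sec;
  }
}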
With the standard RTEMS objects I get this:
curl -o /dev/null ftp://anonymous@192.168.100.70/dev/zero
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 927M 0 0 8438k 0 --:--:-- 0:01:52 --:--:-- 8528k
perf disabled coverage: 100.000% runtime: 99.997% covtime: 100.000%
name____________________________________|ratio___|1%_____2%________5%_____10%_|
in_cksumdata | 10.184%|========================= |
memcpy | 9.052%|======================== |
tcp_output | 8.382%|======================= |
ip_output | 3.310%|============= |
rtems_semaphore_obtain | 3.017%|============ |
ether_output | 2.598%|========== |
Thread_Dispatch | 2.430%|========= |
uma_zalloc_arg | 1.844%|====== |
uma_zfree_arg | 1.634%|===== |
quicc_bd_tx_submit_and_wait | 1.634%|===== |
tcp_do_segment | 1.550%|===== |
uec_if_dequeue | 1.508%|==== |
in_lltable_lookup | 1.466%|==== |
rn_match | 1.424%|==== |
rtalloc1_fib | 1.424%|==== |
ip_input | 1.424%|==== |
in_cksum_skip | 1.424%|==== |
rtems_semaphore_release | 1.424%|==== |
CORE_mutex_Surrender | 1.383%|=== |
Thread_queue_Dequeue | 1.341%|=== |
m_copym | 1.257%|== |
bsd__mtx_lock_flags | 1.173%|= |
mb_free_ext | 1.173%|= |
arpresolve | 1.173%|= |
memset | 1.173%|= |
tcp_input | 1.131%|= |
tcp_addoptions | 1.089%|= |
bsd__mtx_unlock_flags | 1.047%| |
ether_nh_input | 1.047%| |
bzero | 0.963%|< |
rtfree | 0.922%|< |
netisr_dispatch_src | 0.880%|< |
mb_dupcl | 0.838%|< |
rtalloc_ign_fib | 0.838%|< |
in_broadcast | 0.838%|< |
uec_if_interface_start | 0.838%|< |
memmove | 0.838%|< |
mb_ctor_mbuf | 0.796%|< |
tcp_timer_active | 0.796%|< |
chunk_compare | 0.712%|< |
callout_reset_on | 0.712%|< |
in_pcblookup_hash_locked | 0.712%|< |
uec_if_bd_wait_and_free | 0.712%|< |
RBTree_Find | 0.712%|< |
tcp_dooptions | 0.670%|< |
sbsndptr | 0.628%|< |
if_transmit | 0.586%|< |
Objects_Get_isr_disable | 0.544%|< |
So we have about 8.5% processor load due to mutex operations and the throughput
drops by 7%.
In all configurations we see that the UMA zone allocator used for mbuf/mcluster
allocations produces a high processor load. If we replace it with a simple
freelist, then we will likely be on par with the old network stack in terms of
throughput on this target.
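A simple freelist for the fixed-size buffers could look roughly like this; a
sketch only, with a placeholder buffer type and without the locking (on a
uniprocessor target an interrupt disable around these few instructions would
suffice):

#include <stddef.h>
#include <stdlib.h>

/* Placeholder for an mbuf or mbuf cluster; the real stack would keep the
   existing struct mbuf layout. */
struct net_buf {
  struct net_buf *next;
  char data[2048 - sizeof(struct net_buf *)];
};

static struct net_buf *free_head;

static struct net_buf *net_buf_alloc(void)
{
  struct net_buf *b = free_head;

  if (b != NULL) {
    /* Common case: one load and one store, no zone lookup, no per-CPU
       cache management. */
    free_head = b->next;
    return b;
  }

  /* Freelist empty: fall back to the general allocator. */
  return malloc(sizeof(*b));
}

static void net_buf_free(struct net_buf *b)
{
  b->next = free_head;
  free_head = b;
}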
The in_cksumdata() function in the new network stack is a generic C
implementation. The old network stack uses an optimized variant with inline
assembly.
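For reference, the computation is the usual 16-bit one's complement sum; a
simplified sketch (the actual FreeBSD code additionally handles alignment and
unrolls the loop):

#include <stddef.h>
#include <stdint.h>

static uint16_t internet_cksum(const void *data, size_t len)
{
  const uint16_t *p = data;
  uint32_t sum = 0;

  while (len > 1) {
    sum += *p++;
    len -= 2;
  }

  if (len == 1) {
    /* Pad the odd trailing byte with a zero byte. */
    uint16_t last = 0;

    *(uint8_t *) &last = *(const uint8_t *) p;
    sum += last;
  }

  /* Fold the carries back into 16 bits. */
  while (sum >> 16) {
    sum = (sum & 0xffff) + (sum >> 16);
  }

  return (uint16_t) ~sum;
}

An optimized PowerPC variant can use the carry-propagating add instructions
(addc/adde) and process several words per iteration.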
Modern network interface controllers support TCP/UDP checksum generation and
verification in hardware. This can also be used with the new network stack.
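In the FreeBSD-derived stack this works via the interface hardware-assist flags
and per-packet mbuf flags. A sketch of the driver side, assuming the FreeBSD
9.x ifnet/mbuf API (the hardware-specific parts are placeholders, this is not
code from the uec/quicc driver seen in the profiles above):

#include <sys/param.h>
#include <sys/mbuf.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/if_var.h>

/* At attach time: advertise TX/RX checksum offload to the stack. */
static void example_enable_offload(struct ifnet *ifp)
{
  ifp->if_capabilities |= IFCAP_TXCSUM | IFCAP_RXCSUM;
  ifp->if_capenable = ifp->if_capabilities;
  ifp->if_hwassist = CSUM_IP | CSUM_TCP | CSUM_UDP;
}

/* Receive path: mark checksums the hardware has already verified so that the
   software checksum is skipped for these packets. */
static void example_mark_rx_csum(struct mbuf *m, int ip_ok, int l4_ok)
{
  if (ip_ok) {
    m->m_pkthdr.csum_flags |= CSUM_IP_CHECKED | CSUM_IP_VALID;
  }

  if (l4_ok) {
    m->m_pkthdr.csum_flags |= CSUM_DATA_VALID | CSUM_PSEUDO_HDR;
    m->m_pkthdr.csum_data = 0xffff;
  }
}

On the transmit side the stack sets CSUM_TCP/CSUM_UDP in m_pkthdr.csum_flags
when if_hwassist allows it, and the driver then asks the controller to insert
the checksum.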
--
Sebastian Huber, embedded brains GmbH
Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone : +49 89 189 47 41-16
Fax : +49 89 189 47 41-09
E-Mail : sebastian.huber at embedded-brains.de
PGP : Public key available on request.
This message is not a business communication within the meaning of the EHUG.