libbsd network stack optimization tips & tricks

Chris Johns chrisj at rtems.org
Mon Apr 29 01:50:31 UTC 2019


On 25/4/19 7:37 am, Jonathan Brandmeyer wrote:
> Any good tips & tricks I should know about how to optimize the
> rtems-libbsd networking stack?

I use the stack defaults with an /etc/rc.conf of:

 TELn [/] # cat /etc/rc.conf
 #
 # Hydra LibBSD Configuration
 #

 hostname="XXX-880452-0014"
 ifconfig_cgem0="DHCP rxcsum txcsum"
 ifconfig_cgem0_alias0="ether 20:c3:05:11:00:25"

 dhcpcd_priority="200"
 dhcpcd_options="--nobackground --timeout 10"

 telnetd_enable="YES"

 TELn [/] # ifconfig
 cgem0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
         options=68008b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
         ether 20:c3:05:11:00:25
         hwaddr 0e:b0:ba:5e:ba:11
         inet6 fe80::72b3:d5ff:fec1:6029%cgem0 prefixlen 64 scopeid 0x1
         inet 10.10.5.189 netmask 0xffffff00 broadcast 10.10.5.255
         nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
         media: Ethernet autoselect (1000baseT <full-duplex>)
         status: active

Using a recent kernel and libbsd with a custom protobufs protocol, some simple
testing I did showed a sustained TX rate of 800 Mbps, with higher peaks, on both
TCP and UDP. I am not sure what the socket buffer sizes are set to.
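
If you want to see what the stack actually hands out, getsockopt()/setsockopt()
on SO_RCVBUF/SO_SNDBUF is a quick check. A minimal sketch; the 256 kB figure is
only illustrative, and on a FreeBSD-derived stack the ceiling comes from
kern.ipc.maxsockbuf:

  #include <stdio.h>
  #include <sys/types.h>
  #include <sys/socket.h>

  /* Print the default socket buffer sizes, then try to enlarge them. */
  static void show_and_set_bufs(int fd)
  {
    int rcv = 0, snd = 0;
    socklen_t len = sizeof(rcv);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcv, &len) == 0 &&
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &snd, &len) == 0)
      printf("defaults: SO_RCVBUF=%d SO_SNDBUF=%d\n", rcv, snd);

    int want = 256 * 1024;  /* illustrative value, capped by kern.ipc.maxsockbuf */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want)) != 0)
      perror("SO_RCVBUF");
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &want, sizeof(want)) != 0)
      perror("SO_SNDBUF");
  }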

Note, the interface set-up above enables hardware checksum offload on the TX and
RX paths.

> Case:
> - Cortex-A9, dual-core, SMP mode, using the zynq BSP on microzed hardware.
> - RTEMS v5, using the libbsd networking layer.
> - Network is otherwise idle
> - Test case is a trivial program that just read()'s from a socket in a
> loop into a 10 kB buffer, while using netcat or iperf on the sender to
> stream data through the pipe.  Nothing is done with the data after it
> is read, we just read another buffer.
> 
> The throughput isn't great.  I'm seeing ~390 Mbps with default
> settings.  When testing with iperf as the client, I see that one IRQS
> server is credited with almost exactly the same amount of runtime as
> the test duration, and that the SHEL task (implementing the server
> side of the socket) is credited with about 40% of that time as well.
> 
> Without a detailed CPU profiler, it's hard to know exactly where the
> time is being spent in the networking stack, but it clearly is
> CPU-limited.  Enabling hardware checksum offload improved throughput
> from ~390 Mbps to ~510 Mbps.  Our dataflow is such that jumbo frames
> would be an option, but the Cadence device doesn't support an MTU
> larger than 1500 bytes.  Disabling the fancy networking features used
> by the libbsd test programs had no effect.

I use the shell command `top` to look at the CPU load. With a single core I had
capacity left, i.e. IDLE was not 0%. I think I was limited by the data feed from
the PL.

> Ethernet is not used in our field configuration, but in our testing
> configuration we were aiming for about 500 Mbps throughput with about
> 1.5 cores left for additional processing.  Are there any other tunable
> knobs that can get some more throughput?  XAPP1082 suggests that
> inbound throughput in the 750+ range is achievable... on a completely
> different OS and network stack.
> 
> Speaking of tunables, I do see via `sysctl` that
> `dev.cgem.0.stats.rx_resource_errs` and `dev.cgem.0._rxnobufs` are
> nonzero after a benchmark run.  But if the test is CPU limited, then I
> wouldn't expect throwing buffers at the problem to help.
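
Those counters can also be sampled from code during a run rather than from the
shell. A sketch using sysctlbyname(), assuming the usual FreeBSD interface is
available through libbsd; the OID name is taken from your output:

  #include <stdio.h>
  #include <stdint.h>
  #include <sys/types.h>
  #include <sys/sysctl.h>

  /* Read and print one driver statistics counter by its sysctl OID name. */
  static void print_cgem_counter(const char *oid)
  {
    uint64_t val = 0;                 /* width assumed 64-bit; adjust if the
                                         driver exports 32-bit counters */
    size_t len = sizeof(val);

    if (sysctlbyname(oid, &val, &len, NULL, 0) == 0)
      printf("%s = %llu\n", oid, (unsigned long long)val);
    else
      perror(oid);
  }

  /* e.g. print_cgem_counter("dev.cgem.0.stats.rx_resource_errs"); */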

I would attempt to separate the networking performance testing from your app's
ability to consume the data; that should help isolate where the bottleneck is.
For example, a sink task that only reads and discards (see the sketch below)
gives you a throughput ceiling that is independent of your protocol handling.
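
A rough sketch of such a sink; the port number and buffer size are arbitrary:

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/types.h>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>

  /* Accept one TCP connection and discard everything received. */
  static void sink_task(void)
  {
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa;

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(5001);        /* arbitrary test port */
    sa.sin_addr.s_addr = htonl(INADDR_ANY);

    if (lfd < 0 || bind(lfd, (struct sockaddr *)&sa, sizeof(sa)) != 0 ||
        listen(lfd, 1) != 0) {
      perror("listen");
      return;
    }

    int fd = accept(lfd, NULL, NULL);
    if (fd < 0) {
      perror("accept");
      close(lfd);
      return;
    }

    static char buf[64 * 1024];       /* discard buffer, size arbitrary */
    ssize_t n;
    unsigned long long total = 0;

    while ((n = read(fd, buf, sizeof(buf))) > 0)
      total += (unsigned long long)n;

    printf("received %llu bytes\n", total);
    close(fd);
    close(lfd);
  }

Driving this with iperf or netcat from the sender gives a baseline figure to
compare against the full application path.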

Chris

