libbsd network stack optimization tips & tricks
chrisj at rtems.org
Mon Apr 29 01:50:31 UTC 2019
On 25/4/19 7:37 am, Jonathan Brandmeyer wrote:
> Any good tips & tricks I should know about how to optimize the
> rtems-libbsd networking stack?
I use the stack defaults with an /etc/rc.conf of:
TELn [/] # cat /etc/rc.conf
# Hydra LibBSD Configuration
ifconfig_cgem0="DHCP rxcsum txcsum"
dhcpcd_options="--nobackground --timeout 10"
TELn [/] # ifconfig
cgem0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
inet6 fe80::72b3:d5ff:fec1:6029%cgem0 prefixlen 64 scopeid 0x1
inet 10.10.5.189 netmask 0xffffff00 broadcast 10.10.5.255
media: Ethernet autoselect (1000baseT <full-duplex>)
Using a recent kernel, libbsd, and a custom protobufs protocol, some simple
testing I did showed a sustained TX of 800 Mbps, with higher peaks, on both TCP
and UDP. I am not sure what the socket buffer sizes are set to.
Note that the interface setup above enables hardware checksum offload for the tx and rx paths.
> - Cortex-A9, dual-core, SMP mode, using the zynq BSP on microzed hardware.
> - RTEMS v5, using the libbsd networking layer.
> - Network is otherwise idle
> - Test case is a trivial program that just read()'s from a socket in a
> loop into a 10 kB buffer, while using netcat or iperf on the sender to
> stream data through the pipe. Nothing is done with the data after it
> is read, we just read another buffer.
> The throughput isn't great. I'm seeing ~390 Mbps with default
> settings. When testing with iperf as the client, I see that one IRQS
> server is credited with almost exactly the same amount of runtime as
> the test duration, and that the SHEL task (implementing the server
> side of the socket) is credited with about 40% of that time as well.
> Without a detailed CPU profiler, it's hard to know exactly where the
> time is being spent in the networking stack, but it clearly is
> CPU-limited. Enabling hardware checksum offload improved throughput
> from ~390 Mbps to ~510 Mbps. Our dataflow is such that jumbo frames
> would be an option, but the cadence device doesn't support an MTU
> larger than 1500 bytes. Disabling the fancy networking features used
> by the libbsd test programs had no effect.
I use the shell command `top` to look at the CPU load. With a single core I had
capacity left, i.e. IDLE was not 0%. I think I was limited by the data feed from the PL.
> Ethernet is not used in our field configuration, but in our testing
> configuration we were aiming for about 500 Mbps throughput with about
> 1.5 cores left for additional processing. Are there any other tunable
> knobs that can get some more throughput? XAPP1082 suggests that
> inbound throughput in the 750+ range is achievable... on a completely
> different OS and network stack.
> Speaking of tunables, I do see via `sysctl` that
> `dev.cgem.0.stats.rx_resource_errs` and `dev.cgem.0._rxnobufs` are
> nonzero after a benchmark run. But if the test is CPU limited, then I
> wouldn't expect throwing buffers at the problem to help.
I would attempt to separate the networking performance testing from your app's
ability to consume the data. This may help isolate the performance issue.