libbsd network stack optimization tips & tricks
chrisj at rtems.org
Mon Apr 29 01:50:31 UTC 2019
On 25/4/19 7:37 am, Jonathan Brandmeyer wrote:
> Any good tips & tricks I should know about how to optimize the
> rtems-libbsd networking stack?
I use the stack defaults with an /etc/rc.conf of:
TELn [/] # cat /etc/rc.conf
# Hydra LibBSD Configuration
ifconfig_cgem0="DHCP rxcsum txcsum"
dhcpcd_options="--nobackground --timeout 10"
TELn [/] # ifconfig
cgem0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
inet6 fe80::72b3:d5ff:fec1:6029%cgem0 prefixlen 64 scopeid 0x1
inet 10.10.5.189 netmask 0xffffff00 broadcast 10.10.5.255
media: Ethernet autoselect (1000baseT <full-duplex>)
Using a recent kernel, libbsd, and a custom protobufs protocol, some simple
testing I did showed a sustained TX of 800 Mbps, with higher peaks, on both TCP
and UDP. I am not sure what the socket buffer sizes are set to.
Note that the interface setup above enables hardware checksum offload for the tx and rx paths.
> - Cortex-A9, dual-core, SMP mode, using the zynq BSP on microzed hardware.
> - RTEMS v5, using the libbsd networking layer.
> - Network is otherwise idle
> - Test case is a trivial program that just read()'s from a socket in a
> loop into a 10 kB buffer, while using netcat or iperf on the sender to
> stream data through the pipe. Nothing is done with the data after it
> is read, we just read another buffer.
> The throughput isn't great. I'm seeing ~390 Mbps with default
> settings. When testing with iperf as the client, I see that one IRQS
> server is credited with almost exactly the same amount of runtime as
> the test duration, and that the SHEL task (implementing the server
> side of the socket) is credited with about 40% of that time as well.
> Without a detailed CPU profiler, it's hard to know exactly where the
> time is being spent in the networking stack, but it clearly is
> CPU-limited. Enabling hardware checksum offload improved throughput
> from ~390 Mbps to ~510 Mbps. Our dataflow is such that jumbo frames
> would be an option, but the cadence device doesn't support an MTU
> larger than 1500 bytes. Disabling the fancy networking features used
> by the libbsd test programs had no effect.
I use the shell command `top` to look at the CPU load. With a single core I had
capacity left, i.e. IDLE was not 0%. I think I was limited by the data feed from the PL.
> Ethernet is not used in our field configuration, but in our testing
> configuration we were aiming for about 500 Mbps throughput with about
> 1.5 cores left for additional processing. Are there any other tunable
> knobs that can get some more throughput? XAPP1082 suggests that
> inbound throughput in the 750+ range is achievable... on a completely
> different OS and network stack.
> Speaking of tunables, I do see via `sysctl` that
> `dev.cgem.0.stats.rx_resource_errs` and `dev.cgem.0._rxnobufs` are
> nonzero after a benchmark run. But if the test is CPU limited, then I
> wouldn't expect throwing buffers at the problem to help.
I would attempt to separate the networking performance testing from your app's
ability to consume the data. This may help isolate the performance issue.