libbsd network stack optimization tips & tricks

Wed Apr 24 21:37:40 UTC 2019

Any good tips & tricks I should know about how to optimize the
rtems-libbsd networking stack?

Case:
- Cortex-A9, dual-core, SMP mode, using the zynq BSP on microzed hardware.
- RTEMS v5, using the libbsd networking layer.
- Network is otherwise idle.
- Test case is a trivial program that just read()'s from a socket in a
loop into a 10 kB buffer, while using netcat or iperf on the sender to
stream data through the pipe.  Nothing is done with the data after it
is read, we just read another buffer.
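
For reference, the receive loop described in the last item amounts to
roughly the following.  This is a minimal sketch rather than the exact
test program: the names `drain_fd`, `mbps` and `RX_BUF_SIZE` are made up
for illustration, "10 kB" is assumed to mean 10 KiB, and the descriptor
is assumed to be an already-connected socket.

```c
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

/* Assumption: "10 kB" is taken to mean 10 KiB. */
#define RX_BUF_SIZE (10 * 1024)

/*
 * Read from fd until end-of-file and discard the data.
 * Returns the total number of bytes read, or -1 on a read error.
 */
int64_t drain_fd(int fd)
{
    static char buf[RX_BUF_SIZE];
    int64_t total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0) {
            total += n;          /* data is dropped, only counted */
        } else if (n == 0) {
            return total;        /* peer closed the connection */
        } else if (errno != EINTR) {
            return -1;           /* real error; EINTR just retries */
        }
    }
}

/* Convert a byte count and a duration in seconds to megabits/second. */
double mbps(int64_t bytes, double seconds)
{
    return seconds > 0.0 ? (double)bytes * 8.0 / seconds / 1e6 : 0.0;
}
```

The caller would time the call to `drain_fd()` and pass the byte count
and elapsed time to `mbps()` to get a figure comparable to the ones
quoted below.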

The throughput isn't great.  I'm seeing ~390 Mbps with default
settings.  When testing with iperf as the client, I see that one IRQS
server is credited with almost exactly the same amount of runtime as
the test duration, and that the SHEL task (implementing the server
side of the socket) is credited with about 40% of that time as well.

Without a detailed CPU profiler, it's hard to know exactly where the
time is being spent in the networking stack, but it clearly is
CPU-limited.  Enabling hardware checksum offload improved throughput
from ~390 Mbps to ~510 Mbps.  Our dataflow is such that jumbo frames
would be an option, but the Cadence device doesn't support an MTU
larger than 1500 bytes.  Disabling the fancy networking features used
by the libbsd test programs had no effect.
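
As a rough sanity check on the CPU-limited theory, the numbers above can
be turned into a per-frame time budget.  The sketch below assumes every
frame carries a full 1500 bytes (slightly optimistic, since headers eat
into that) and that the IRQS server is busy for 100% of one core, as the
runtime accounting suggests.  The function names are hypothetical.

```c
/*
 * Frames per second needed to carry `mbps` megabits/second when each
 * frame carries `frame_bytes` bytes.
 */
double frames_per_second(double mbps, double frame_bytes)
{
    return mbps * 1e6 / 8.0 / frame_bytes;
}

/*
 * Microseconds of CPU time spent per frame by a task that is busy for
 * `cpu_fraction` (0.0 to 1.0) of one core at the given throughput.
 */
double usec_per_frame(double mbps, double frame_bytes, double cpu_fraction)
{
    return cpu_fraction * 1e6 / frames_per_second(mbps, frame_bytes);
}
```

With these assumptions, `frames_per_second(390.0, 1500.0)` is 32500, so
`usec_per_frame(390.0, 1500.0, 1.0)` works out to roughly 31
microseconds of IRQS time per frame, plus about 40% of that again in the
SHEL task.  At ~510 Mbps the same calculation gives about 23.5
microseconds per frame.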

Ethernet is not used in our field configuration, but in our testing
configuration we were aiming for about 500 Mbps throughput with about
1.5 cores left for additional processing.  Are there any other tunable
knobs that can get some more throughput?  XAPP1082 suggests that
inbound throughput in the 750+ Mbps range is achievable... on a
completely different OS and network stack.

Speaking of tunables, I do see via `sysctl` that
`dev.cgem.0.stats.rx_resource_errs` and `dev.cgem.0._rxnobufs` are
nonzero after a benchmark run.  But if the test is CPU-limited, then I
wouldn't expect throwing buffers at the problem to help.

Thanks,
-- 
Jonathan Brandmeyer