libbsd network stack optimization tips & tricks
sebastian.huber at embedded-brains.de
Mon May 6 06:16:14 UTC 2019
On 24/04/2019 23:37, Jonathan Brandmeyer wrote:
> Any good tips & tricks I should know about how to optimize the
> rtems-libbsd networking stack?
> - Cortex-A9, dual-core, SMP mode, using the zynq BSP on microzed hardware.
> - RTEMS v5, using the libbsd networking layer.
> - Network is otherwise idle
> - Test case is a trivial program that just read()'s from a socket in a
> loop into a 10 kB buffer, while using netcat or iperf on the sender to
> stream data through the pipe. Nothing is done with the data after it
> is read, we just read another buffer.
> The throughput isn't great. I'm seeing ~390 Mbps with default
> settings. When testing with iperf as the client, I see that one IRQS
> server is credited with almost exactly the same amount of runtime as
> the test duration, and that the SHEL task (implementing the server
> side of the socket) is credited with about 40% of that time as well.
> Without a detailed CPU profiler, it's hard to know exactly where the
> time is being spent in the networking stack, but it clearly is
> CPU-limited. Enabling hardware checksum offload improved throughput
> from ~390 Mbps to ~510 Mbps. Our dataflow is such that jumbo frames
> would be an option, but the Cadence device doesn't support an MTU
> larger than 1500 bytes. Disabling the fancy networking features used
> by the libbsd test programs had no effect.
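For reference, the hardware checksum offload mentioned above is typically toggled from the libbsd shell with ifconfig; a minimal sketch, assuming the Zynq Cadence GEM interface shows up as cgem0 (adjust the name for your BSP):

```shell
# Assumed interface name for the Zynq Cadence GEM driver; adjust for your BSP.
ifconfig cgem0 rxcsum txcsum
# Show the interface flags to confirm RXCSUM/TXCSUM are now listed.
ifconfig cgem0
```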
One goal of the GSoC tracing project this year is to be able to analyse
these kinds of problems. Some known problems are:
1. RTEMS is supposed to be a real-time operating system, and this has
implications for the mutex synchronization primitives. In RTEMS, an
atomic read-modify-write operation is used in the fast-path mutex obtain
and release operations. Linux and FreeBSD provide only random fairness
(futexes) or no fairness at all, but in exchange need only one atomic
read-modify-write operation in the fast path.
2. The NETISR(9) service is currently disabled in libbsd.
3. If you use TCP, then lock contention on the TCP socket locks may
appear. This leads to costly priority inheritance operations.
4. There is no zero-copy receive available yet (implementing it is just
a matter of time).
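To illustrate point 1, here is a minimal sketch (illustrative only, not the actual RTEMS or FreeBSD code) of a lock with no fairness guarantees, which needs just a single atomic read-modify-write operation on the uncontended fast path:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative sketch only: a lock without fairness guarantees can be
 * acquired with one compare-and-swap on the uncontended fast path.
 * Owner value 0 means "unowned"; any nonzero value identifies a thread. */
typedef struct {
  atomic_uint owner;
} simple_lock;

static bool simple_lock_try_acquire(simple_lock *lock, unsigned self)
{
  unsigned expected = 0;

  /* One atomic read-modify-write operation: succeed only if unowned. */
  return atomic_compare_exchange_strong(&lock->owner, &expected, self);
}

static void simple_lock_release(simple_lock *lock)
{
  atomic_store(&lock->owner, 0);
}
```

A fair, priority-inheriting mutex as used by RTEMS cannot get away this cheaply, which is the cost the first point alludes to.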
> Ethernet is not used in our field configuration, but in our testing
> configuration we were aiming for about 500 Mbps throughput with about
> 1.5 cores left for additional processing. Are there any other tunable
> knobs that can get some more throughput? XAPP1082 suggests that
> inbound throughput in the 750+ range is achievable... on a completely
> different OS and network stack.
> Speaking of tunables, I do see via `sysctl` that
> `dev.cgem.0.stats.rx_resource_errs` and `dev.cgem.0._rxnobufs` are
> nonzero after a benchmark run. But if the test is CPU limited, then I
> wouldn't expect throwing buffers at the problem to help.
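The statistics quoted above can be snapshotted around a benchmark run from the libbsd shell; a sketch, assuming the standard FreeBSD sysctl names (the dev.cgem.0 prefix comes from the thread above):

```shell
# Snapshot the driver's receive statistics before and after a benchmark run.
sysctl dev.cgem.0.stats
# Inspect the mbuf cluster limit; raising it only helps if the run is
# buffer-starved rather than CPU-bound.
sysctl kern.ipc.nmbclusters
```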
I would try to use UDP and make sure your packet consumer runs on the
same processor as the IRQS server task. Maybe implement a zero-copy
receive as well.
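A minimal sketch of the processor-pinning part of this advice, using the POSIX affinity call (RTEMS also offers rtems_task_set_affinity() for classic API tasks; the CPU index used here is an assumption and should match where the IRQS server actually runs):

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>

/* Hedged sketch: pin the calling thread (the packet consumer) to one
 * processor so it shares a CPU with the IRQS server task. The CPU index
 * is an assumption; pick the processor the IRQS server runs on. */
static int pin_self_to_cpu(int cpu_index)
{
  cpu_set_t cpus;

  CPU_ZERO(&cpus);
  CPU_SET(cpu_index, &cpus);
  return pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
}
```

Keeping producer and consumer on the same processor avoids cross-CPU cache traffic and inter-processor interrupts on every received packet.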
Sebastian Huber, embedded brains GmbH
Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone : +49 89 189 47 41-16
Fax : +49 89 189 47 41-09
E-Mail : sebastian.huber at embedded-brains.de
PGP : Public key available on request.
This message is not a business communication within the meaning of the EHUG.