The "Out of mbuf clusters" problem

Tue Sep 7 17:08:10 UTC 2004

Phil Torre wrote:
> (These comments pertain to RTEMS-4.6.1 on MPC860 with an Ethernet driver
> swiped from the one in eth_comm.)
> 
> We have a problem right now with an application which sends log data
> to a client using the file send hook in the FTP server.  (The hook
> function reads some raw data from flash memory, formats it into
> human-readable text, and passes it back to ftpd, over and over again
> until all of the flash data has been read out.)
> 
> With the default mbuf and mbuf cluster allocations, and the MPC860's
> caches turned off, it looks like the receive daemon can grab all of
> the available mbuf clusters before the network task has a chance to
> process any of them, resulting in deadlock.
> 
> There are at least two ways out of this:  Turn the caches on so we run
> fast enough to keep up with the received data, or increase the amount
> of memory allocated to the mbuf cluster pool.  Either one works to
> prevent the deadlock from occurring.
> 
> My problem is this:  In experimenting with a system in the deadlocked
> state, it seems that running out of clusters is an unrecoverable
> situation.  Even if I set always_keepalive so that the TCP connection
> times out eventually, that causes the TCP connection to be dropped
> but *doesn't deallocate the mbufs/clusters*.  Once it's locked up,
> it's gonna stay that way until the system is rebooted.
> 
> Googling the FreeBSD mailing lists, I see people complaining about this
> problem (which will kernel panic a FreeBSD box) from back in the 3.x
> days.  The reply is usually "That's the system's way of telling you
> that you haven't tuned it properly.  Bad sysadmin!".  That may be
> acceptable for a web server, but not for a mission-critical embedded
> product.
> 
> As stated above, I can push the problem back at least two ways, but
> I'd sure prefer to know that it will self-recover even in the worst
> case.  Our networking code is up to date with what was in CVS as of
> August 16th, so we should have the latest and greatest.  Can anyone
> comment on fixes for this?  One obvious thing to try is hacking tcp_drop()
> so that it frees all clusters associated with the stuck TCP connection.

I don't see that there is any other alternative.  If the system is out
of mbufs, then either some have to be freed or more have to be magically
added to the pool.  Since we don't have any magic mbuf dispenser at hand,
trying to forcibly reclaim some seems the only way out.

I have no idea what side effects this will have, though.

Is the RTEMS tcp_drop() source the same as that currently in FreeBSD and
NetBSD?  Perhaps they have added a solution.
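
To make the reclaim idea concrete, here is a toy model in plain C of a
fixed cluster pool that a receive path drains and a drop path refills.
It is only a sketch under stated assumptions: none of these names
(pool_init, conn_receive, conn_drop, POOL_CLUSTERS) come from RTEMS or
the BSD stack, the pool size is arbitrary, and real mbuf chains,
locking, and socket buffers are ignored.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_CLUSTERS 8   /* hypothetical pool size, not an RTEMS default */

/* One cluster; 'next' links it into the free list or a connection's queue. */
struct cluster {
    struct cluster *next;
};

/* Fixed-size pool: nothing is ever added after pool_init(). */
struct pool {
    struct cluster storage[POOL_CLUSTERS];
    struct cluster *free_list;
    size_t free_count;
};

/* A connection holding received-but-unprocessed clusters. */
struct conn {
    struct cluster *queued;
    size_t queued_count;
};

/* Put every cluster in the backing array on the free list. */
void pool_init(struct pool *p)
{
    p->free_list = NULL;
    p->free_count = 0;
    for (size_t i = 0; i < POOL_CLUSTERS; i++) {
        p->storage[i].next = p->free_list;
        p->free_list = &p->storage[i];
        p->free_count++;
    }
}

/* Receive path: move one cluster from the pool onto the connection.
 * Returns 0 on success, -1 when the pool is exhausted. */
int conn_receive(struct pool *p, struct conn *c)
{
    struct cluster *cl = p->free_list;
    if (cl == NULL)
        return -1;
    p->free_list = cl->next;
    p->free_count--;
    cl->next = c->queued;
    c->queued = cl;
    c->queued_count++;
    return 0;
}

/* Drop path: hand every cluster still queued on the connection back
 * to the pool. Returns the number of clusters reclaimed. */
size_t conn_drop(struct pool *p, struct conn *c)
{
    size_t reclaimed = 0;
    while (c->queued != NULL) {
        struct cluster *cl = c->queued;
        c->queued = cl->next;
        cl->next = p->free_list;
        p->free_list = cl;
        p->free_count++;
        reclaimed++;
    }
    c->queued_count = 0;
    assert(p->free_count <= POOL_CLUSTERS);  /* never more than we started with */
    return reclaimed;
}
```

In this model, calling conn_receive() until it fails leaves the pool
empty, and it stays empty until conn_drop() is called on the connection
holding the clusters.  That is the behaviour being asked of tcp_drop();
whether the real stack can do this safely is exactly the open question.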

> -Phil
> 

-- 
Joel Sherrill, Ph.D.             Director of Research & Development
joel at OARcorp.com                 On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
    Support Available             (256) 722-9985