MBUF Cluster Network freeze problem

bob bobwis at ascweb.co.uk
Thu Dec 14 10:55:44 UTC 2000


An update on our progress tracking the mysterious MBUF cluster
allocation/deallocation problem:

We have built the 1st December snapshot, but it looks like it was our
application that was the root of the problem - it was still "badly behaved"
with respect to its calls into the BSD/TCP stack. I have incorporated the
application changes and retro-tested them against the RTEMS 4.5 Beta, and it
seems to keep running OK so far in both RTEMS versions.

We think the problem stems from not properly checking and dealing with all
possible return values from calls to recv, send and select. For example, it
is possible for either a zero return or an error return to come back from recv
on a peer close.  I guess this is "old hat" to most of you old-timers out there.
Also, a peer close can show up in any of the stack calls (send, recv, select).
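
In case it helps anyone else, the checking we ended up adding looks roughly
like the sketch below. handle_peer_close() and process_data() are just
placeholder names for our own application code, not stack calls:

#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>

/* placeholders for our application's own handlers */
extern void handle_peer_close(int sock);
extern void process_data(const char *data, ssize_t len);

static void service_socket(int sock)
{
    char    buf[1024];
    ssize_t n = recv(sock, buf, sizeof(buf), 0);

    if (n == 0) {
        /* zero return: orderly close from the peer */
        handle_peer_close(sock);
    } else if (n < 0) {
        if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK) {
            /* transient - just go back round the select() loop */
        } else {
            /* hard error such as ECONNRESET - treat it like a close */
            handle_peer_close(sock);
        }
    } else {
        /* n bytes of real data */
        process_data(buf, n);
    }
}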

One thing we noted, though, is that it seems we have to keep calling recv after
detecting a close, to clear out any buffered data that has not yet been taken
out of the stack by our application. Correspondingly, I am not sure whether sent
data still queued in the stack automagically gets freed if the socket closes (or
the wire breaks) before the stack has emptied it, or whether there is a system
call we should use to clear it explicitly.
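
For what it's worth, the drain we bolted on is something like this (assuming a
blocking socket whose peer has already closed, so recv eventually returns 0;
drain_socket() is just our own helper name):

#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

static void drain_socket(int sock)
{
    char    scratch[512];
    ssize_t n;

    /* discard anything still queued for us in the stack */
    do {
        n = recv(sock, scratch, sizeof(scratch), 0);
    } while (n > 0);

    /* only then release the descriptor */
    close(sock);
}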

The point is, I guess, that under BSD, where mbufs can live in virtual memory,
it doesn't matter much if a few mbufs/clusters never get freed over a period of
time. With our 24/7 BSP, which has only a few tens of clusters in RAM, it could
be significant.

Does my hypothesis make sense? If so, has anyone got any ideas for an
MBUF/cluster cleanup daemon, or should the stack take care of this itself
when the internal socket timeout occurs (because I don't think it does at
the moment)?

Thanks for sticking with us on this one, all help greatly appreciated!
Regards to all.
Bob Wisdom



