MBUF Cluster Network freeze problem

Tue May 1 20:23:31 UTC 2001

Bob,
I downloaded the latest snapshot and in the ChangeLog for libnetworking I could
find no mention of fixes regarding mbufs/clusters.  The libnetworking
ChangeLog for 4.5.0 (which I am using) seems to contain nothing newer than
1998. 

When I flood ping the unit with nothing going on, it seems to recover with
no mbuf/cluster problems reported. The problem seems to occur with a lot of
clients connected and sending/receiving data. Could it be that the unit's
consumption of data is being preempted by the ping -f and that the unit is
actually experiencing the problem you describe below as "kills the entire
network"? Do you see the "still waiting for mbuf clusters" message when
this problem occurs?

Guess I need more information about what has changed and when it was
changed.

-gene

-----Original Message-----
From: bob [mailto:bobwis at asczone.com]
Sent: Tuesday, May 01, 2001 12:21 PM
To: Smith, Gene
Cc: rtems-users at oarcorp.com
Subject: RE: MBUF Cluster Network freeze problem

Hello Gene,
It's been a while, but I am pretty sure that with snapshot 3 at least, the
system does recover if you fill all the MBUFs to the point of the error
message, close the connections (sender end), wait a while and then start
again. There *might* be a problem with connections left open after sending
something, where the net data is left unconsumed by the RTEMS application:
I think that the MBUFs stay allocated forever and it can soon become a
problem. This happens when the application is waiting for something else
and not consuming network data. This is not a TCP stack problem as such,
just an annoying side-effect that kills the entire network.
It would be nice if there was a daemon to free MBUFs that were too stale,
or something to prevent one "stream" from hogging the whole MBUF pool if
its associated application had stopped consuming buffers for a while.
I am very pleased you are looking into it by further testing, as it would
be nice to state what the rules of the game really are!
Hope this helps.
Bob Wisdom
bobwis at asczone.com
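
One way to keep a single "stream" from hogging the pool, as wished for
above, might be to cap each socket's receive buffer with the standard BSD
SO_RCVBUF option. This is only a sketch under assumptions: the helper
names limit_socket_rcvbuf() and get_socket_rcvbuf() are hypothetical, the
stack may round or adjust the requested size, and the sum of the caps over
all open sockets would still have to fit inside the configured cluster
pool for the cap to help.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Cap the receive buffer of one socket so that a stalled reader can hold
 * at most roughly `bytes` of unread data.
 * Returns 0 on success, -1 on error (errno is set by setsockopt).
 * Hypothetical helper; the stack may adjust the value it really applies.
 */
int limit_socket_rcvbuf(int fd, int bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}

/* Read back the receive buffer size currently in effect, or -1 on error. */
int get_socket_rcvbuf(int fd)
{
    int bytes = 0;
    socklen_t len = sizeof(bytes);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, &len) != 0)
        return -1;
    return bytes;
}
```

A server task would call limit_socket_rcvbuf() on each descriptor returned
by accept(), before handing it to the task that services the connection.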

-----Original Message-----
From: Smith, Gene [mailto:Gene.Smith at sea.siemens.com]
Sent: 01 May 2001 16:20
To: joel.sherrill at oarcorp.com
Cc: rtems-users at oarcorp.com
Subject: FW: MBUF Cluster Network freeze problem

Joel,

I have been doing the final stress test to finish up my project. I have
encountered the problem described by "bob" back in Dec (see below).  What I
have is about 64 tcp connections to the rtems unit with data flowing back
and forth on all connections. This seems to work fine. The mbufs and
clusters just seem moderately used. However, when I flood ping the unit, I
quickly start seeing the message from rtems_glue.c "Still waiting for mbuf
clusters" which repeats about every 30 seconds. Even after I stop the flood
ping and disconnect all clients, I still see the messages and it never
seems to recover. I have to reboot the unit.

I also see this when I stop my 80186 processor. The 186 receives the data
from the rtems processor (a 386) via dual port ram. The sockets are all
still connected but no read()s are occurring when the 186 stops, so data
backs up and eventually depletes the mbuf clusters, which causes the "Still
waiting..." messages to occur. Also, in this situation I have to reboot.  I
can see this with just one connection and flood ping is not needed to
trigger it.
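
If the application can afford to lose data while the downstream consumer
is stopped, one possible workaround for the backlog described above is to
keep reading and simply discard what arrives. The sketch below is an
assumption, not something from the RTEMS sources: drain_socket() is a
hypothetical helper, and it assumes MSG_DONTWAIT is supported by the
target stack. Whether discarding is acceptable at all is an application
policy decision.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Discard whatever is currently queued on `fd` without blocking.
 * Returns the number of bytes thrown away, or -1 on a real error.
 * Hypothetical helper; assumes MSG_DONTWAIT works on the target stack.
 */
long drain_socket(int fd)
{
    char scratch[512];
    long discarded = 0;

    for (;;) {
        ssize_t n = recv(fd, scratch, sizeof(scratch), MSG_DONTWAIT);

        if (n > 0) {
            discarded += n;   /* data read into scratch is dropped */
            continue;
        }
        if (n == 0)
            break;            /* peer closed the connection */
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            break;            /* nothing more queued right now */
        if (errno == EINTR)
            continue;         /* interrupted, try again */
        return -1;            /* genuine error, errno is left set */
    }
    return discarded;
}
```

The task that normally forwards data to the second processor could call
this periodically on each socket for as long as the consumer is known to
be stopped.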

"bob" seems to indicate that possibly this had been corrected in
a post-4.5.0 snapshot but it is somewhat unclear from the postings.  Do you
or Eric know the status of this problem?  It seems like the system should
recover from mbuf cluster depletion. I am using 4.5.0.

-gene

-----Original Message-----
From: bob [mailto:bobwis at ascweb.co.uk]
Sent: Friday, December 15, 2000 8:33 AM
To: Eric_Norum at young.usask.ca
Cc: 'rtems mail-list'
Subject: MBUF Cluster Network freeze problem

Hello Eric / RTEMS users
I have been testing again this morning (snapshot 20001201) and it is all
looking very positive. I can now confirm that we don't need the recv before
a close, to empty the data to save MBUFs.  I am fairly certain this was not
always the case, but I could be wrong. The MBUF pool also seems to cope
with the cable being pulled - that is, it recovers the used MBUFs all by
itself after the timeout has occurred.
The only problem we are seeing now is not a BSD stack problem as such; it's
when the task servicing the open socket stops calling read (because it has
frozen). The open socket still allows incoming data into free MBUFs, fills
the clusters and locks up the lot after a while. The only recovery seems to
be a system reset. While the MBUF clusters are filling, the master
application task still allows accept() to spawn new tasks and sockets, and
so the "big lockup" comes quite a while after this. This had us going for a
while ;-)
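
Since closing a socket appears to release its MBUFs without a prior recv,
an application-level guard against a frozen service task could be a small
supervisor that closes sockets nobody has read from for too long. The
following is a minimal sketch under assumptions: struct conn_slot and
reap_stalled() are hypothetical names, each service task is assumed to
update last_read after every successful read, and access to the table is
assumed to be serialized by the application.

```c
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/* One entry per service task; all names here are hypothetical. */
struct conn_slot {
    int    fd;         /* socket owned by the service task, -1 if unused */
    time_t last_read;  /* updated by the task after each successful read */
};

/*
 * Close every socket whose task has not read for more than `max_idle`
 * seconds and mark its slot unused.
 * Returns the number of connections closed.
 */
int reap_stalled(struct conn_slot *slots, size_t count,
                 time_t now, time_t max_idle)
{
    int closed = 0;
    size_t i;

    for (i = 0; i < count; i++) {
        if (slots[i].fd < 0)
            continue;                      /* slot not in use */
        if (now - slots[i].last_read > max_idle) {
            close(slots[i].fd);            /* stack can free queued data */
            slots[i].fd = -1;
            closed++;
        }
    }
    return closed;
}
```

The master task that calls accept() could run reap_stalled() with
time(NULL) as `now` on a periodic basis, and could also refuse new
connections while the table is full.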

To conclude, the TCP stack looks very solid again, now that we have
isolated the problems to our application.
Thanks again for all your help.
Regards
Bob Wisdom