MBUF Cluster Network freeze problem

Tue May 1 16:20:32 UTC 2001

Hello Gene,
It's been a while, but I am pretty sure that with snapshot 3 at least, the
system does recover if you fill all the MBUFs to the point of the error
message, close the connections (sender end), wait a while and then start
again. There *might* be a problem with connections left open after sending
something where the net data is unconsumed by the RTEMS application. I think
that the MBUFs then stay allocated forever, and it can soon become a problem.
This happens when the application is waiting for something else and not
consuming network data. This is not a TCP stack problem as such, just an
annoying side-effect that kills the entire network.
It would be nice if there was a daemon to free MBUFs that were too stale, or
something to prevent one "stream" from hogging the whole MBUF pool if its
associated application had stopped consuming buffers for a while.
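
One way to approximate the "prevent one stream from hogging the pool" idea at
the application level is to cap each socket's receive buffer with the standard
SO_RCVBUF socket option. The sketch below is only an illustration: the helper
names are hypothetical, and whether such a cap is enough to prevent cluster
exhaustion on a given RTEMS snapshot is an assumption that this thread does
not confirm.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Ask the stack to cap the receive buffer of one socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt).
 * The limit is chosen by the caller; no value is suggested by the thread.
 */
int cap_receive_buffer(int fd, int limit_bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                      &limit_bytes, sizeof(limit_bytes));
}

/* Read back the limit the stack actually applied, or -1 on failure. */
int get_receive_buffer(int fd)
{
    int value = 0;
    socklen_t len = sizeof(value);

    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &value, &len) != 0)
        return -1;
    return value;
}
```

An application would call cap_receive_buffer() right after socket() or
accept(), with a limit sized so that all connections together cannot queue
more than the pool holds. Some stacks round or scale the requested value, so
reading it back with get_receive_buffer() is worthwhile.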
I am very pleased you are looking into it by further testing as it would be
nice to state what the rules of the game really are!
Hope this helps.
Bob Wisdom
bobwis at asczone.com

-----Original Message-----
From: Smith, Gene [mailto:Gene.Smith at sea.siemens.com]
Sent: 01 May 2001 16:20
To: joel.sherrill at oarcorp.com
Cc: rtems-users at oarcorp.com
Subject: FW: MBUF Cluster Network freeze problem

Joel,

I have been doing the final stress test to finish up my project. I have
encountered the problem described by "bob" back in Dec (see below). What I
have is about 64 TCP connections to the RTEMS unit with data flowing back and
forth on all connections. This seems to work fine. The mbufs and clusters
just seem moderately used. However, when I flood ping the unit, I quickly
start seeing the message from rtems_glue.c, "Still waiting for mbuf
clusters", which repeats about every 30 seconds. Even after I stop the flood
ping and disconnect all clients, I still see the messages and the unit never
seems to recover. I have to reboot the unit.

I also see this when I stop my 80186 processor. The 186 receives the data
from the rtems processor (a 386) via dual port ram. The sockets are all still
connected but no read()s are occurring when the 186 stops, so data backs up
and eventually depletes the mbuf clusters which causes the "Still waiting..."
messages to occur. Also, in this situation I have to reboot.  I can see this
with just one connection and flood ping is not needed to trigger it.

"bob" seems to indicate that possibly this had been corrected in a
post-4.5.0 snapshot, but it is somewhat unclear from the postings.  Do you or
Eric know the status of this problem?  It seems like the system should
recover from mbuf cluster depletion. I am using 4.5.0.

-gene

-----Original Message-----
From: bob [mailto:bobwis at ascweb.co.uk]
Sent: Friday, December 15, 2000 8:33 AM
To: Eric_Norum at young.usask.ca
Cc: 'rtems mail-list'
Subject: MBUF Cluster Network freeze problem

Hello Eric / RTEMS users
I have been testing again this morning (snapshot 20001201) and it is all
looking very positive. I can now confirm that we don't need the recv before
a close, to empty the data to save MBUFs.  I am fairly certain this was not
always the case, but I could be wrong. The MBUF pool also seems to cope with
the cable being pulled - that is, it recovers the used MBUFs all by itself
after the timeout has occurred.
The only problem we are seeing now is not a BSD stack problem as such; it's
when the task servicing the open socket stops calling read (because it has
frozen). The open socket still allows incoming data into free MBUFs, fills
the clusters and locks up the lot after a while. The only recovery seems to
be a system reset. While the MBUF clusters are filling, the master
application task still allows accept() to spawn new tasks and sockets, and
so the "big lockup" comes quite a while after this. This had us going for a
while ;-)
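
Since a close now releases a socket's queued MBUFs without a prior recv (as
noted above), an application-level watchdog could close connections whose
servicing task has stopped consuming data. A minimal sketch, assuming the
application keeps a table of its connections; conn_slot, the function names
and the idle limit are all hypothetical and not part of RTEMS:

```c
#include <stddef.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical per-connection bookkeeping kept by the application. */
typedef struct {
    int    fd;            /* socket descriptor, -1 when the slot is free */
    time_t last_consumed; /* last time the servicing task consumed data  */
} conn_slot;

/* Called by the servicing task after each successful read(). */
void conn_mark_consumed(conn_slot *c, time_t now)
{
    c->last_consumed = now;
}

/* True when the slot is in use and nothing was consumed for max_idle seconds. */
int conn_is_stalled(const conn_slot *c, time_t now, double max_idle)
{
    return c->fd >= 0 && difftime(now, c->last_consumed) > max_idle;
}

/*
 * Close every stalled connection and return how many were closed.
 * Intended to be called periodically from a separate watchdog task.
 */
size_t conn_reap_stalled(conn_slot *slots, size_t count,
                         time_t now, double max_idle)
{
    size_t closed = 0;

    for (size_t i = 0; i < count; i++) {
        if (conn_is_stalled(&slots[i], now, max_idle)) {
            close(slots[i].fd);
            slots[i].fd = -1;   /* mark the slot as free */
            closed++;
        }
    }
    return closed;
}
```

This is a plain idle timeout, so it would also close a healthy connection
whose peer simply sent nothing for max_idle seconds, and closing a descriptor
that another task may still be using needs care. It recovers buffers only if
close() frees them, as reported here for snapshot 20001201.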

To conclude, the TCP Stack looks very solid again, now that we have isolated
the problems to our application.
Thanks again for all your help.
Regards
Bob Wisdom