FW: MBUF Cluster Network freeze problem

Smith, Gene Gene.Smith at sea.siemens.com
Tue May 1 15:19:50 UTC 2001


I have been doing the final stress test to finish up my project, and I have
hit the problem described by "bob" back in December (see below). What I have
is about 64 TCP connections to the RTEMS unit with data flowing back and
forth on all connections. This seems to work fine; the mbufs and clusters
appear only moderately used. However, when I flood ping the unit, I quickly
start seeing the message from rtems_glue.c, "Still waiting for mbuf
clusters", which repeats about every 30 seconds. Even after I stop the flood
ping and disconnect all clients, I still see the messages and the unit never
seems to recover. I have to reboot it.

I also see this when I stop my 80186 processor. The 186 receives the data
from the RTEMS processor (a 386) via dual-port RAM. The sockets all stay
connected, but no read()s occur once the 186 stops, so data backs up and
eventually depletes the mbuf clusters, which causes the "Still waiting..."
messages. In this situation I also have to reboot. I can reproduce this with
just one connection; a flood ping is not needed to trigger it.

"bob" seemed to indicate that this may have been corrected in a post-4.5.0
snapshot, but it is somewhat unclear from the postings. Do you or Eric know
the status of this problem? It seems like the system should recover from
mbuf cluster depletion. I am using 4.5.0.


-----Original Message-----
From: bob [mailto:bobwis at ascweb.co.uk] 
Sent: Friday, December 15, 2000 8:33 AM
To: Eric_Norum at young.usask.ca
Cc: 'rtems mail-list'
Subject: MBUF Cluster Network freeze problem

Hello Eric / RTEMS users
I have been testing again this morning (snapshot 20001201) and it is all
looking very positive. I can now confirm that we don't need the recv before
a close to empty the data and save MBUFs. I am fairly certain this was not
always the case, but I could be wrong. The MBUF pool also seems to cope with
the cable being pulled; that is, it recovers the used MBUFs all by itself
after the timeout has occurred.
The only problem we are seeing now is not a BSD stack problem as such: it's
when the task servicing the open socket stops calling read (because it has
frozen). The open socket still accepts incoming data into free MBUFs, fills
the clusters, and locks up the lot after a while. The only recovery seems to
be a system reset. While the MBUF clusters are filling, the master
application task still allows accept() to spawn new tasks and sockets, so
the "big lockup" comes quite a while after this. This had us going for a
while ;-)

To conclude, the TCP Stack looks very solid again, now that we have isolated
the problems to our application.
Thanks again for all your help.
Bob Wisdom
