MBUF Cluster Network freeze problem

bob.wisdom at asczone.com
Tue May 15 15:28:22 UTC 2001


Hello Gene,
If you need me to do some specific testing, I will have to set up the
system - it will take me a day or two I'm afraid, but I am happy to do it.
We have had so much help from the whole RTEMS community, especially Eric,
that it's the least we can do in return to help everyone else.

My memories of our network problem are quite similar to your own. At one
point we were convinced that the problem was irrecoverable and that the
"Still waiting for mbuf clusters" message would never go away. RTEMS itself
appeared to almost grind to a halt, and the system was effectively unusable
and dead.

I am convinced that we ultimately tracked it down to the fact that the
application side had not emptied waiting MBUFs, or had not properly closed
sockets (for whatever reason). We certainly tried multiple flood pings and
concurrent TCP sessions to try to break it. The message appeared, but
provided we closed the sockets properly it recovered every time, once we
had sorted out our application code.
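
To sketch what I mean (this is an illustrative example against the standard
BSD sockets API, not code from our application - on a blocking socket you
would also want a receive timeout such as SO_RCVTIMEO so the drain loop
cannot hang forever):

    #include <sys/socket.h>
    #include <unistd.h>

    /* Drain any queued receive data, then close the socket, so the
     * stack can release the MBUF clusters it holds on our behalf. */
    static void drain_and_close(int sd)
    {
        char buf[512];

        shutdown(sd, SHUT_WR);         /* we are finished sending       */
        while (recv(sd, buf, sizeof(buf), 0) > 0)
            ;                          /* discard anything still queued */
        close(sd);                     /* releases the socket buffers   */
    }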

One thought I had was that because RTEMS won't run lower-priority tasks
while it is servicing higher-priority ones (even if timeslicing is enabled),
a flood ping will effectively take all the CPU time, since the network runs
as high-priority tasks. This means that the application layer can neither
consume data nor terminate a connection. I am not sure that I ever proved
this, however. I once tried running an application task at network priority
and all hell broke loose, so I didn't try it again. (I can't remember what
broke, unfortunately.)
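
To put some numbers on that (the field names are from the RTEMS BSD stack
configuration structure; the values and "netdriver_config" are illustrative
placeholders, not our real settings): the stack daemons run at
network_task_priority, so an application task with a numerically higher
(i.e. lower) RTEMS priority can be starved while the receive daemon is
saturated by a flood ping.

    #include <rtems/rtems_bsdnet.h>

    extern struct rtems_bsdnet_ifconfig netdriver_config; /* BSP driver attach (placeholder) */

    struct rtems_bsdnet_config rtems_bsdnet_config = {
        .ifconfig               = &netdriver_config,
        .network_task_priority  = 100,         /* stack daemons run here      */
        .mbuf_bytecount         = 64 * 1024,   /* MBUF pool size (example)    */
        .mbuf_cluster_bytecount = 128 * 1024,  /* cluster pool size (example) */
    };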

The bottom line is that I have certainly seen the dreaded message and
effectively had a dead network, and I have also seen it come back to life.
Afterwards, when idle, the MBUF / cluster count was just the same as before
the incident (i.e. no leaks).

Have you tried stopping the flood pings after the message has appeared? Are
you also sure your application code has not frozen or is not looping
somewhere (maybe on a blocking call to the TCP stack), or that it is not
getting an unexpected condition back from the stack - one that is not
explicitly handled when the MBUFs run out - and thus never servicing the
stack properly thereafter? My bet would be on the latter; we certainly had
all sorts of these kinds of problems ourselves. Getting the application code
"right" seemed very sensitive, but I have to say most of the problems were
due to me in the end.
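
The failure shape I have in mind is a service loop that only handles the
happy path. Purely as a sketch (the errno values are the standard ones; the
"give up and close" policy is just one possible reaction):

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Treat every unexpected return from the stack as a reason to tear
     * the connection down, rather than spinning without consuming data. */
    static void service_socket(int sd)
    {
        char buf[512];

        for (;;) {
            ssize_t n = recv(sd, buf, sizeof(buf), 0);
            if (n == 0)
                break;                  /* clean EOF from the peer        */
            if (n < 0) {
                if (errno == EINTR)
                    continue;           /* interrupted, just retry        */
                break;                  /* ENOBUFS, ETIMEDOUT, ...: bail  */
            }
            /* ... process n bytes of buf here ... */
        }
        close(sd);                      /* give the socket's MBUFs back   */
    }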

Bob Wisdom

From: "Smith, Gene" <Gene.Smith at sea.siemens.com>
To: <bobwis at ascweb.co.uk>
Sent: Monday, May 14, 2001 5:20 PM
Subject: RE: MBUF Cluster Network freeze problem


> Hi Bob,
> Sorry to bother you again about this stuff but Eric, Joel and I have
> yet to find a solution to my problem. I just want to find out from you
> if your problem is the same as mine. Specifically, I am referring to
> what you call the "big lockup" below.
>
> What I see is the console message "Still waiting for mbuf clusters"
> repeated about every 30 seconds. When this occurs, I have to reboot
> the RTEMS unit. I can get my unit into this mode by heavily loading
> it with 54 TCP/IP connections and about 10 UDP clients and then sending
> a flood ping to it.
>
> Is this the same thing you observe when your "big lockup" occurs?
>
> Thanks,
>
> Gene
>
> >-----Original Message-----
> >From: bob [mailto:bobwis at ascweb.co.uk]
> >Sent: Friday, December 15, 2000 8:33 AM
> >To: Eric_Norum at young.usask.ca
> >Cc: 'rtems mail-list'
> >Subject: MBUF Cluster Network freeze problem
> >
> >
> >Hello Eric / RTEMS users
> >I have been testing again this morning (snapshot 20001201) and it is
> >all looking very positive. I can now confirm that we don't need the
> >recv before a close to empty the data to save MBUFs. I am fairly
> >certain this was not always the case, but I could be wrong. The MBUF
> >pool also seems to cope with the cable being pulled - that is, it
> >recovers the used MBUFs all by itself after the timeout has occurred.
> >The only problem we are seeing now is not a BSD stack problem as such;
> >it's when the task servicing the open socket stops calling read
> >(because it has frozen). The open socket still allows incoming data
> >into free MBUFs, fills the clusters and locks up the lot after a
> >while. The only recovery seems to be a system reset. While the MBUF
> >clusters are filling, the master application task still allows
> >accept() to spawn new tasks and sockets, and so the "big lockup"
> >comes quite a while after this. This had us going for a while ;-)
> >
> >To conclude, the TCP stack looks very solid again, now that we have
> >isolated the problems to our application.
> >Thanks again for all your help.
> >Regards
> >Bob Wisdom
> >
> >
>



