Patches for TCP buffer exhaustion problem

Mon Nov 8 21:17:11 UTC 2004

Many people have written to me requesting my patches for the "out of
mbuf clusters" problem that I experienced with TCP.  I am posting them
to the list rather than submitting a PR because I would like someone
who knows more about TCP than I do to review them first.  For all I
know, there are bugs lurking in there that will cause us to crash,
corrupt data, or interoperate badly with other implementations.  All
that I can report is that I've been running these patches for several
months now with good results.

Also, while the libnetworking part should be portable between 
architectures, the libbsp part is specific to mpc8xx using SCC1 for
Ethernet.

-Phil

-- 

=====================================================================
Phil Torre                               phone: 425-820-6363 x234
Design Engineer                          email: ptorre at zetron.com
Switching Systems Group                    fax: 425-820-7031
Zetron, Inc.                               web: http://www.zetron.com

-------------- next part --------------

In reference to my previous message, here's what I ended up doing to
"fix" it.

The deadlocked state that I was observing arose when the RTEMS
system was doing sustained file transmission via FTP while receiving
a mix of TCP ACKs and broadcast traffic (from chatty MS Windows boxes
on our LAN).  With the default mbuf/cluster pool sizes, we quickly
run out of clusters.  (Our Ethernet driver only allocates clusters
for receive data, which makes matters even worse.)

As soon as all clusters are exhausted, the receive task goes into
its "waiting for clusters" loop.  As incoming ACKs are processed,
outbound packets are freed from the sockbuf by TCP, which frees up
some clusters.  But, there is a race condition between the receive
thread and the application writing to the socket; they both want
clusters, and the application is winning too much of the time.  So,
the incoming ACKs get lost, the outbound packets stay in the sockbuf
pending retransmission, and there we sit.

I expected that TCP would eventually time out and drop the connection,
which should bring us back to life.  It does, but manages not to free
the outbound packets from the sockbuf.  (This makes no sense to me,
as it seems to guarantee that we will leak memory if a remote client
hangs.  But, it sat there wedged for 16 hours without recovering.  
That's close enough to forever for me.)

So, I applied two fixes:

1) Deadlock recovery.  I shortened tcp_keepidle to 30 seconds and
   tcp_keepintvl to 10 seconds, and set always_keepalive to 1.  This
   makes the connection time out in a few minutes rather than many
   hours.  Then I modified tcp_drop() so that if the connection is
   being dropped due to timeout, both the receive and send sockbufs
   and any mbufs/clusters they hold are explicitly freed.

2) Deadlock avoidance.  To resolve the "receive thread is losing the
   fight for clusters" problem, I modified m_clalloc() to respect a
   global flag set by the receive thread when it is waiting for a
   cluster.  No one but the receive thread can get a cluster so long
   as that flag is true.

The attached patch "rtems_tcp.diff" modifies these files:

	c/src/lib/libbsp/powerpc/nisqually/network/network.c
	cpukit/libnetworking/netinet/tcp_subr.c
	cpukit/libnetworking/netinet/tcp_timer.c
	cpukit/libnetworking/rtems/rtems_glue.c

The network.c file is specific to our BSP, but is very similar to the
network.c file from the other in-tree mpc8xx BSPs.  A little hand editing
will be required.

In addition to patching RTEMS, I also added code to our application to
tweak the TCP timers.  This code is in the attached file "net_startup.c",
and can be inserted in your application after the call to
rtems_bsdnet_initialize_network().
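
For anyone wondering how the timer values translate into a timeout,
here is a rough worked calculation.  It assumes the usual 4.4BSD
conventions, which I have not verified against every RTEMS version:
the keepalive timers are counted in slow-timer ticks (PR_SLOWHZ,
assumed to be 2 per second), and a silent peer is dropped after the
idle time plus TCPTV_KEEPCNT (assumed to be 8) probe intervals.  The
macro and function names below are made up for illustration.

```c
#include <assert.h>

/* Assumed values; check tcp_timer.h and protosw.h in your own tree. */
#define SLOW_TICKS_PER_SECOND 2  /* assumed value of PR_SLOWHZ */
#define KEEPALIVE_PROBE_COUNT 8  /* assumed value of TCPTV_KEEPCNT */

/* Convert seconds into the tick units that the sysctl values use. */
static int seconds_to_slow_ticks(int seconds)
{
	assert(seconds >= 0);
	return seconds * SLOW_TICKS_PER_SECOND;
}

/* Approximate seconds of silence before the connection is dropped:
 * the idle period, then KEEPALIVE_PROBE_COUNT unanswered intervals. */
static int keepalive_drop_seconds(int keepidle_ticks, int keepintvl_ticks)
{
	assert(keepidle_ticks >= 0 && keepintvl_ticks >= 0);
	return (keepidle_ticks + KEEPALIVE_PROBE_COUNT * keepintvl_ticks)
	       / SLOW_TICKS_PER_SECOND;
}
```

Under those assumptions, 30 seconds is 60 ticks and 10 seconds is 20
ticks, and keepalive_drop_seconds(60, 20) comes out at 110 seconds,
i.e. roughly two minutes before a dead connection is dropped.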

With those changes, my application is now rock-solid even under
sustained heavy load with default pool sizes.  I don't know whether
these changes would be desirable to merge into RTEMS.

Last, a warning and disclaimer:  I am not a TCP expert.  The patches
described above are working well for me, but I do not know if they have
negative side-effects.  They may contain bugs which will cause 
problems in the RTEMS system, or they may break the rules of TCP in
a way that turns us into a bad network neighbor.  I would welcome 
analysis from anyone who knows more about such things.

-------------- next part --------------
*** c/src/lib/libbsp/powerpc/nisqually/network/network.c	Mon Nov  8 10:19:46 2004
--- /usr/local/src/rtems-4.6.0/tools/rtems-4.6.0pre4/c/src/lib/libbsp/powerpc/nisqually/network/network.c	Fri Oct  8 13:29:19 2004
***************
*** 137,150 ****
  static rtems_isr
  m8xx_scc1_enet_interrupt_handler (rtems_vector_number v)
  {
- static unsigned32 droppedFrames = 0;
- 
- 	// Frame dropped because no buffers available?
- 	if (m8xx.scc1.scce & 0x4)
- 	{
- 		m8xx.scc1.scce = 0x4;
- 		droppedFrames++;
- 	}	

  	/*
  	 * Frame received?
--- 137,142 ----
***************
*** 724,732 ****
--- 716,758 ----
    }
  }

+ /* This is a private version of m_clalloc() which is only called by rxDaemon.
+    It doesn't wait on the rxDaemonIsWaitingForCluster flag. */
+ int
+ rxDaemon_m_clalloc(ncl, nowait)
+ {
+ 	m_reclaim ();
+ 	if (mclfree == NULL)
+ 	{
+ 		if (nowait)
+ 			return 0;
+ 
+ 		mbstat.m_wait++;
+ 		for (;;)
+ 		{
+ 			rtems_bsdnet_semaphore_release ();
+ 			rtems_task_wake_after (1);
+ 			rtems_bsdnet_semaphore_obtain ();
+ 			if (mclfree)
+ 				break;
+ 		}
+ 	}
+ 	else
+ 		mbstat.m_drops++;
+ 
+ 	return 1;
+ }
+ 
  /*
   * reader task
   */
+ 
+ // This flag is set by scc_rxDaemon() when it is waiting for a cluster to become available.
+ // Everyone else besides scc_rxDaemon() must wait for this flag to become cleared before
+ // allocating a cluster.  (This ensures that the rxDaemon will get the next available cluster
+ // regardless of task priority.)  The flag is defined in cpukit/libnetworking/rtems/rtems_glue.c.
+ extern int rxDaemonIsWaitingForCluster;
+ 
  static void
  scc_rxDaemon (void *arg)
  {
***************
*** 736,742 ****
  	rtems_unsigned16 status;
  	m8xxBufferDescriptor_t *rxBd;
  	int rxBdIndex;
!   
  	// Allocate space for incoming packets and start reception
  	for (rxBdIndex = 0 ; ;)
  	{
--- 762,770 ----
  	rtems_unsigned16 status;
  	m8xxBufferDescriptor_t *rxBd;
  	int rxBdIndex;
! 
! 	rxDaemonIsWaitingForCluster = 0;
! 
  	// Allocate space for incoming packets and start reception
  	for (rxBdIndex = 0 ; ;)
  	{
***************
*** 802,810 ****
  			m->m_data += sizeof(struct ether_header);
  			ether_input (ifp, eh, m);

! 			// Allocate a new mbuf
  			MGETHDR (m, M_WAIT, MT_DATA);
! 			MCLGET (m, M_WAIT);
  			m->m_pkthdr.rcvif = ifp;
  			sc->rxMbuf[rxBdIndex] = m;
  			rxBd->buffer = mtod (m, void *);
--- 830,863 ----
  			m->m_data += sizeof(struct ether_header);
  			ether_input (ifp, eh, m);

! 			// Allocate a new mbuf cluster.  First, set a flag which will keep other threads 
! 			// from getting there first.
! 			rxDaemonIsWaitingForCluster = 1;
! 
  			MGETHDR (m, M_WAIT, MT_DATA);
! 			MBUFLOCK(
! 				if (mclfree == 0)
! 					(void)rxDaemon_m_clalloc(1, (M_WAIT));
! 				if ((((m)->m_ext.ext_buf) = (caddr_t)mclfree) != 0)
! 				{
! 					++mclrefcnt[mtocl((m)->m_ext.ext_buf)];
! 					mbstat.m_clfree--;
! 					mclfree = ((union mcluster *)((m)->m_ext.ext_buf))->mcl_next;
! 				}
! 			)
! 
! 			if ((m)->m_ext.ext_buf != NULL)
! 			{
! 				(m)->m_data = (m)->m_ext.ext_buf;
! 				(m)->m_flags |= M_EXT;
! 				(m)->m_ext.ext_free = NULL;
! 				(m)->m_ext.ext_ref = NULL;
! 				(m)->m_ext.ext_size = MCLBYTES;
! 			}
! 
! 			// If we reach here, we got a cluster.
! 			rxDaemonIsWaitingForCluster = 0;
! 
  			m->m_pkthdr.rcvif = ifp;
  			sc->rxMbuf[rxBdIndex] = m;
  			rxBd->buffer = mtod (m, void *);
***************
*** 1231,1236 ****
--- 1284,1290 ----
  		rtems_bsdnet_event_receive (START_TRANSMIT_EVENT, RTEMS_EVENT_ANY | RTEMS_WAIT,
  									RTEMS_NO_TIMEOUT, &events);

+ 
  		// Send packets till queue is empty
  		for (;;)
  		{
*** cpukit/libnetworking/rtems/rtems_glue.c	Mon Nov  8 10:28:32 2004
--- /usr/local/src/rtems-4.6.0/tools/rtems-4.6.0pre4/cpukit/libnetworking/rtems/rtems_glue.c	Fri Oct  8 13:35:43 2004
***************
*** 1197,1227 ****
  	return 1;
  }

  int
  m_clalloc(ncl, nowait)
  {
- 	if (nowait)
- 		return 0;
  	m_reclaim ();
! 	if (mclfree == NULL) {
! 		int try = 0;
! 		int print_limit = 30 * rtems_bsdnet_ticks_per_second;

  		mbstat.m_wait++;
! 		for (;;) {
  			rtems_bsdnet_semaphore_release ();
  			rtems_task_wake_after (1);
  			rtems_bsdnet_semaphore_obtain ();
! 			if (mclfree)
  				break;
- 			if (++try >= print_limit) {
- 				printf ("Still waiting for mbuf cluster.\n");
- 				try = 0;
- 			}
  		}
  	}
! 	else {
  		mbstat.m_drops++;
! 	}
  	return 1;
  }
--- 1197,1229 ----
  	return 1;
  }

+ // This flag is set by scc_rxDaemon() when it is waiting for a cluster to become available.
+ // Everyone else besides scc_rxDaemon() must wait for this flag to become cleared before
+ // allocating a cluster.  (This ensures that the rxDaemon will get the next available cluster
+ // regardless of task priority.)
+ int rxDaemonIsWaitingForCluster = 0;
+ 
  int
  m_clalloc(ncl, nowait)
  {
  	m_reclaim ();
! 	if (mclfree == NULL)
! 	{
! 		if (nowait)
! 			return 0;

  		mbstat.m_wait++;
! 		for (;;)
! 		{
  			rtems_bsdnet_semaphore_release ();
  			rtems_task_wake_after (1);
  			rtems_bsdnet_semaphore_obtain ();
! 			if (mclfree && (rxDaemonIsWaitingForCluster == 0))
  				break;
  		}
  	}
! 	else
  		mbstat.m_drops++;
! 
  	return 1;
  }
*** cpukit/libnetworking/netinet/tcp_subr.c	Mon Nov  8 10:28:28 2004
--- /usr/local/src/rtems-4.6.0/tools/rtems-4.6.0pre4/cpukit/libnetworking/netinet/tcp_subr.c	Fri Oct  8 13:35:41 2004
***************
*** 308,313 ****
--- 308,331 ----
  	if (errnum == ETIMEDOUT && tp->t_softerror)
  		errnum = tp->t_softerror;
  	so->so_error = errnum;
+ 
+ 
+ 	/* experimental hack to flush the socket buffers when the tcp goes away */
+ 	if (errnum == ETIMEDOUT)
+ 	{	
+ 		rtems_bsdnet_semaphore_obtain ();
+ 		{
+ 			register struct sockbuf *sockBufPtr = &tp->t_inpcb->inp_socket->so_snd;
+ 			sbunlock(sockBufPtr);
+ 			sbflush(sockBufPtr);
+ 			sockBufPtr = &tp->t_inpcb->inp_socket->so_rcv;
+ 			sbunlock(sockBufPtr);
+ 			sbflush(sockBufPtr);
+ 		}
+ 		rtems_bsdnet_semaphore_release ();
+ 	}
+ 
+ 	
  	return (tcp_close(tp));
  }

*** cpukit/libnetworking/netinet/tcp_timer.c	Mon Nov  8 10:28:28 2004
--- /usr/local/src/rtems-4.6.0/tools/rtems-4.6.0pre4/cpukit/libnetworking/netinet/tcp_timer.c	Fri Oct  8 13:35:41 2004
***************
*** 81,87 ****
  SYSCTL_INT(_net_inet_tcp, TCPCTL_KEEPINTVL, keepintvl,
  	CTLFLAG_RW, &tcp_keepintvl , 0, "");

! static int	always_keepalive = 0;
  SYSCTL_INT(_net_inet_tcp, OID_AUTO, always_keepalive,
  	CTLFLAG_RW, &always_keepalive , 0, "");

--- 81,87 ----
  SYSCTL_INT(_net_inet_tcp, TCPCTL_KEEPINTVL, keepintvl,
  	CTLFLAG_RW, &tcp_keepintvl , 0, "");

! int	always_keepalive = 0;
  SYSCTL_INT(_net_inet_tcp, OID_AUTO, always_keepalive,
  	CTLFLAG_RW, &always_keepalive , 0, "");

-------------- next part --------------
////////////////////////////////////////////////////////////////////////////////////
// This section tweaks the BSD kernel's TCP variables to get
// more robust self-recovery in cases of memory exhaustion, client hang, etc.

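		// Set always_keepalive to 1 (TRUE).  This is one of those new-fangled
		// dynamic sysctl variables, and RTEMS' sysctlbyname() doesn't seem to be
		// finished yet.  Sigh.  So we poke the variable directly; the tcp_timer.c
		// patch removes its "static" qualifier to make that possible.
		extern int always_keepalive;
		int mib[4], value;
		size_t len;
		int result;	// sysctl() returns 0 on success, -1 on failure

		always_keepalive = 1;

		mib[0] = CTL_NET;
		mib[1] = PF_INET;
		mib[2] = IPPROTO_TCP;

		// Both values below are in slow-timer ticks (two ticks per second).

		// set tcp_keepintvl to 20 (10 seconds between keepalive probes)
		mib[3] = TCPCTL_KEEPINTVL;
		value = 20;
		len = sizeof(value);
		result = sysctl(mib, 4, NULL, 0, &value, len);
		if (result != 0)
			printf("Failed to set tcp_keepintvl\n");

		// set tcp_keepidle to 60 (30 seconds idle before first probe)
		mib[3] = TCPCTL_KEEPIDLE;
		value = 60;
		len = sizeof(value);
		result = sysctl(mib, 4, NULL, 0, &value, len);
		if (result != 0)
			printf("Failed to set tcp_keepidle\n");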

// End of TCP tweaks.
////////////////////////////////////////////////////////////////////////////////////