More NFS Stuff

Wed Dec 6 00:24:23 UTC 2006

Hi,

We were seeing an application crash when using the NFS daemon. The cause 
was a latent race condition in the RPC IO daemon, which our changes to 
the time-outs exposed.

The details are:

This is how we understand the NFS/RPC code to work.

In the NFS call:
 1. Retrieve an XACT (transaction item) from a pool of XACTs. (The pool
    is a message queue of these objects. If the message queue is empty,
    create a new object.)
 2. Set the timeouts, transaction ID and the ID of the calling thread in
    the XACT and place a pointer to it into another message queue.
 3. Send a TX_EVENT event to the RPC daemon.
 4. Wait for an RPC event.
 5. On receipt of the event, if the XACT is not marked as timed out,
    process the XACT input buffer and release the buffer.
    (This is where the code died, because it assumed that if we got an
    RPC event and the XACT was not timed out, then the XACT had a valid
    buffer.)
 6. Put the XACT back into the XACT pool message queue.
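
Step 5 can be written defensively, so that an RPC event without a reply 
buffer is treated as an error instead of being dereferenced. The sketch 
below is only an illustration: the struct layout, the status enum and the 
function name xact_consume_reply are hypothetical simplifications, not 
the real rpcio.c definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the real rpcio.c types. */
enum xact_status { XACT_OK, XACT_TIMEDOUT, XACT_FAILED };

struct xact {
    enum xact_status status;
    const char      *ibuf;      /* reply data; NULL until a reply arrives */
    size_t           ibuf_len;
};

/*
 * Called by the NFS thread after it has been woken by an RPC event.
 * Copies the reply into dst and releases the buffer.
 * Returns the number of bytes copied, or -1 if there is no usable reply.
 */
long xact_consume_reply(struct xact *x, char *dst, size_t dst_len)
{
    size_t n;

    assert(x != NULL && dst != NULL);

    if (x->status != XACT_OK)
        return -1;              /* timed out or transmit failed */
    if (x->ibuf == NULL)
        return -1;              /* event arrived but no reply is attached */

    n = x->ibuf_len < dst_len ? x->ibuf_len : dst_len;
    memcpy(dst, x->ibuf, n);

    x->ibuf     = NULL;         /* "release" the buffer in this sketch */
    x->ibuf_len = 0;
    return (long)n;
}
```

With a check like the one on ibuf, a stale event would make the call fail 
cleanly rather than crash, although it would not remove the race itself.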

In the RPC daemon.

Wait for RX or TX events.  RX events are generated by a callback from 
the socket on receipt of data or on timeout.

TX_EVENT processing.
   Stage One
   On receipt of a TX_EVENT, pull all XACTs out of the message queue and
   put them on a list of XACTs needing processing. Mark each of them
   with trip set to "FIRST_TIME".

   Stage Two
   Ensure that a second list of XACTs (newList) is empty.
   Go through the list of XACTs built on receipt of TX_EVENTs:
      If the XACT has timed out (tolive < 0)
         Mark the XACT as timed out and send the RPC event to the
         thread ID of the XACT. (This is where our fix goes: change
         the XACT's transaction ID.)
      else
         Send the output buffer of the XACT to the daemon's server. (If
         the transmit fails, mark the XACT as failed and send an RPC
         event to the caller thread.)
         If the XACT's trip is not FIRST_TIME, then this is a
         retransmit, so adjust the retry_period, keeping it below the
         maximum period.
         Now set the trip, age and retry_period of the XACT.
         Add the XACT to the head of newList.

   Stage Three
   Sort newList by age.
   Go back and wait for events.
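
Stages Two and Three can be sketched roughly as follows. This is a 
simplified model, not the rpcio.c code: the struct fields, the constants 
FIRST_TIME, MAX_PERIOD and XACT_HASHS, the doubling of retry_period on a 
retransmit, and the "woken" flag that stands in for sending the RPC event 
are all assumptions made for illustration. The actual transmit is left 
out.

```c
#include <assert.h>
#include <stddef.h>

/* All names and values below are illustrative assumptions. */
#define FIRST_TIME  (-1L)   /* trip value of an XACT that was never sent */
#define MAX_PERIOD  800L    /* cap on retry_period, in ticks */
#define XACT_HASHS  32UL    /* size of the xid hash table */

struct xact {
    struct xact  *next;
    unsigned long xid;
    long          tolive;        /* < 0 means the XACT has expired */
    long          trip;          /* time of last transmit, or FIRST_TIME */
    long          age;           /* time at which the next retry is due */
    long          retry_period;
    int           timed_out;
    int           woken;         /* stands in for sending the RPC event */
};

/* Stage Three: insertion sort of a singly linked list by ascending age. */
struct xact *sort_by_age(struct xact *list)
{
    struct xact *sorted = NULL;

    while (list != NULL) {
        struct xact  *x  = list;
        struct xact **pp = &sorted;

        list = x->next;
        while (*pp != NULL && (*pp)->age <= x->age)
            pp = &(*pp)->next;
        x->next = *pp;
        *pp = x;
    }
    return sorted;
}

/* Stage Two followed by Stage Three; returns the sorted newList. */
struct xact *tx_process(struct xact *pending, long now)
{
    struct xact *new_list = NULL;

    assert(now >= 0);

    while (pending != NULL) {
        struct xact *x = pending;

        pending = x->next;
        x->next = NULL;

        if (x->tolive < 0) {
            x->xid      += XACT_HASHS;  /* the fix: retire the old xid */
            x->timed_out = 1;
            x->woken     = 1;           /* wake the caller thread */
            continue;
        }

        /* The real daemon transmits the output buffer here. */
        if (x->trip != FIRST_TIME) {    /* retransmit: back off */
            x->retry_period *= 2;
            if (x->retry_period > MAX_PERIOD)
                x->retry_period = MAX_PERIOD;
        }
        x->trip = now;
        x->age  = now + x->retry_period;

        x->next  = new_list;            /* add to the head of newList */
        new_list = x;
    }
    return sort_by_age(new_list);
}
```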

RX event processing.

   Get data from the socket.
   If there is data, extract the transaction ID (xid) from it and
   compare it to the xids stored in a hash table of the XACT objects
   (XACTs are added to the hash table as they are created).
      If we find an XACT in the hash table whose xid matches, and whose
      server address and port also match:
         Set the XACT's ibuf to the data we have received.
         Remove the XACT from the transaction list.
         Change its xid.
         Recalculate the server timeouts based on how long this one took.
         Mark the XACT as received good.
         Send the RPC event to the XACT's caller thread ID.
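
The lookup on the RX side might look like the sketch below. It assumes a 
chained hash table indexed by the xid modulo XACT_HASHS; the table size of 
32, the field names and the helper functions are hypothetical, and the 
list removal, timeout recalculation and event send are only noted in a 
comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XACT_HASHS 32u   /* assumed table size; a power of two */

struct xact {
    struct xact *hash_next;   /* chain within one hash bucket */
    uint32_t     xid;
    uint32_t     srv_addr;    /* server IPv4 address */
    uint16_t     srv_port;
    const void  *ibuf;        /* reply data, NULL until a reply arrives */
    int          rx_good;
};

static struct xact *xact_hash_tbl[XACT_HASHS];

/* Link a newly created XACT into its bucket. */
void xact_hash_add(struct xact *x)
{
    unsigned b;

    assert(x != NULL);
    b = x->xid % XACT_HASHS;
    x->hash_next = xact_hash_tbl[b];
    xact_hash_tbl[b] = x;
}

/* Find the XACT matching xid, server address and port, or NULL. */
struct xact *xact_lookup(uint32_t xid, uint32_t addr, uint16_t port)
{
    struct xact *x;

    for (x = xact_hash_tbl[xid % XACT_HASHS]; x != NULL; x = x->hash_next)
        if (x->xid == xid && x->srv_addr == addr && x->srv_port == port)
            return x;
    return NULL;
}

/* Returns 1 if the reply was attached to an XACT, 0 if it was dropped. */
int rx_deliver(uint32_t xid, uint32_t addr, uint16_t port, const void *data)
{
    struct xact *x = xact_lookup(xid, addr, port);

    if (x == NULL)
        return 0;             /* unknown or retired xid: drop the reply */

    x->ibuf    = data;
    x->xid    += XACT_HASHS;  /* new ID, same bucket */
    x->rx_good = 1;
    /* The real daemon also unlinks the XACT from the transaction list,
     * recalculates the server timeouts and sends the RPC event. */
    return 1;
}
```

Because the xid is changed as soon as a reply is accepted, a duplicate 
reply with the same xid finds no match and is dropped.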

And that's about it.

The problem we had was this. If the XACT timed out, we sent it back to 
the caller marked as bad, and the caller failed the read request. But if 
the reply still arrived afterwards, it still matched the XACT via the xid 
in the hash table, so the XACT was marked as good, given a buffer of 
data, and an RPC event was sent to the caller thread. The next time an 
NFS call is processed, it grabs a new XACT, sends it to the daemon and 
then waits on an RPC event. There is already a pending event for that 
thread, so the new XACT is processed immediately, but its buffer is 
invalid (NULL). We got a DTLB miss processing the garbage data and 
subsequently crashed.
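
The sequence can be reproduced in a small single-threaded model. 
Everything here is a simplification made up for illustration: events are 
modelled as a counter on the caller, the hash lookup is reduced to a 
direct xid comparison, and XACT_HASHS is assumed to be 32. The apply_fix 
argument switches the xid change on timeout on or off.

```c
#include <assert.h>
#include <stddef.h>

#define XACT_HASHS 32u   /* assumed hash table size */

struct xact {
    unsigned    xid;
    int         timed_out;
    const char *ibuf;
};

struct caller {
    int pending_events;  /* RPC events posted but not yet consumed */
};

/* Daemon side: the XACT has expired. */
void daemon_timeout(struct xact *x, struct caller *c, int apply_fix)
{
    if (apply_fix)
        x->xid += XACT_HASHS;   /* a late reply can no longer match */
    x->timed_out = 1;
    c->pending_events++;
}

/* Daemon side: a reply carrying reply_xid arrives from the server. */
void daemon_rx(struct xact *x, struct caller *c,
               unsigned reply_xid, const char *data)
{
    if (x->xid != reply_xid)
        return;                 /* no match: the reply is dropped */
    x->ibuf      = data;
    x->timed_out = 0;           /* XACT is marked as good */
    c->pending_events++;
}

/*
 * Returns 1 if the second NFS call is woken by a stale event and finds
 * a non-timed-out XACT with a NULL buffer, 0 if it blocks as it should.
 */
int second_call_sees_null_buffer(int apply_fix)
{
    struct caller thread = { 0 };
    struct xact   first  = { 5, 0, NULL };
    struct xact   second = { 6, 0, NULL };

    /* First call: the request times out and the caller fails the read. */
    daemon_timeout(&first, &thread, apply_fix);
    thread.pending_events--;    /* caller consumes the timeout event */
    assert(first.timed_out);

    /* The late reply to the first request arrives afterwards. */
    daemon_rx(&first, &thread, 5, "late");

    /* Second call: a new XACT is queued; no reply for it exists yet. */
    if (thread.pending_events == 0)
        return 0;               /* caller blocks waiting for its reply */
    thread.pending_events--;    /* woken at once by the stale event */
    return !second.timed_out && second.ibuf == NULL;
}
```

In this model second_call_sees_null_buffer(0) reports the stale wake-up, 
while second_call_sees_null_buffer(1) does not.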

The following patch to rpcio.c fixes the bug.

--- rpcio.c    2006-11-20 16:50:29.000000000 +1000
+++ rpcio.c    2006-12-04 17:30:53.633652216 +1000
@@ -1263,6 +1263,11 @@
         srv = xact->server;
 
         if (xact->tolive < 0) {
+        /* change the ID - there might still be
+         * a reply on the way. When it arrives we must not find it's ID
+         * in the hashtable
+         */
+          xact->obuf.xid        += XACT_HASHS;
           /* this one timed out */
           xact->status.re_errno  = ETIMEDOUT;
           xact->status.re_status = RPC_TIMEDOUT;
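
A note on why the patch adds XACT_HASHS rather than 1. Assuming the hash 
bucket is chosen as the xid modulo XACT_HASHS, and that XACT_HASHS is a 
power of two, the XACT stays in the same bucket after the change and does 
not need to be re-linked, while a late reply carrying the old xid no 
longer matches. The value 32 and the two helper names below are 
assumptions used only to show the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define XACT_HASHS 32u   /* assumed table size; must be a power of two */

/* Bucket index for an xid, assuming a simple modulo hash. */
unsigned xact_bucket(uint32_t xid)
{
    return xid % XACT_HASHS;
}

/* New xid for a timed-out XACT: different ID, same bucket. */
uint32_t xact_retire_xid(uint32_t xid)
{
    /* With a power-of-two size the bucket also survives 32-bit wrap. */
    assert((XACT_HASHS & (XACT_HASHS - 1u)) == 0u);
    return xid + XACT_HASHS;
}
```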

We are also investigating adding some new functionality to the NFS server:

1. A function to return all of the NFS/RPC statistics kept by the 
daemon, rather than just printing them out to a file.
2. A function to allow the default hardcoded timeouts to be changed at 
runtime.  We find the current time-outs far too long.  For example, in 
our application NFS always replies within 100 us, so there is no point 
waiting hundreds of milliseconds for a timeout. (Are there any problems 
with us adding these two functions?)
3. NFS read caching.  We see two options.  The first is to make the 
rpcio daemon NFS-read aware and to handle the read-ahead caching there, 
on NFS read RPC calls.  The second is to do it at the NFS layer, but as 
the calls are synchronous, we would need multiple threads to do the 
caching without blocking the original caller and defeating the point of 
caching.  Does anyone have any comments on these two approaches?  Which 
one would be more acceptable?
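
To make items 1 and 2 concrete, here is a rough sketch of what the two 
functions could look like. Nothing here exists in the current code: the 
struct fields, the function names, the units and the placeholder default 
values are all hypothetical, and a real version would need to take the 
daemon's lock around the copies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters; the daemon's real statistics may differ. */
struct rpc_stats {
    unsigned long sent;
    unsigned long retransmits;
    unsigned long timeouts;
    unsigned long replies;
};

/* Hypothetical runtime-tunable timeouts, in microseconds. */
struct rpc_timeouts {
    unsigned long retry_us;   /* initial retransmit period */
    unsigned long max_us;     /* upper bound for the retry period */
};

static struct rpc_stats    daemon_stats;
static struct rpc_timeouts daemon_timeouts = { 100000UL, 800000UL };

/* Item 1: copy a snapshot of the statistics to the caller. */
int rpc_get_stats(struct rpc_stats *out)
{
    if (out == NULL)
        return -1;
    *out = daemon_stats;
    return 0;
}

/* Item 2: replace the timeouts at runtime after a sanity check. */
int rpc_set_timeouts(const struct rpc_timeouts *t)
{
    if (t == NULL || t->retry_us == 0 || t->retry_us > t->max_us)
        return -1;
    daemon_timeouts = *t;
    assert(daemon_timeouts.retry_us <= daemon_timeouts.max_us);
    return 0;
}

/* Read back the timeouts currently in effect. */
int rpc_get_timeouts(struct rpc_timeouts *out)
{
    if (out == NULL)
        return -1;
    *out = daemon_timeouts;
    return 0;
}
```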

Steven J