More NFS Stuff

Till Straumann strauman at slac.stanford.edu
Wed Dec 6 03:17:05 UTC 2006


Steven Johnson wrote:
> Hi,
>
> We were getting an application crash using the NFS daemon.  It was due 
> to us changing time-outs, which exacerbated a potential race condition 
> in the RPC I/O daemon.
>
> The details are:
>
> This is how we understand the NFS/RPC to work.
>
> In the NFS call:
>  Retrieve an XACT (transaction item) from a pool of XACTs. (There is
> a message queue of these objects. If the message queue is empty,
> create a new object.)
>  Set the timeouts, the transaction ID and the ID of the calling thread
> in the XACT and place a pointer to it into another message queue.
>  Send a TX_EVENT event to the RPC daemon.
>  Wait for an RPC event.
>  On receipt of the event, if the XACT is not marked as timed out,
> process the XACT input buffer and release the buffer.
> (This is where the code died, because it was assumed that if we got an
> RPC event and the XACT was not timed out, then it had a valid buffer.)
>  Put the XACT back into the XACT pool message queue.
>
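For readers without the source handy, the caller side boils down to
roughly the following. This is only a sketch using the classic RTEMS
message-queue/event API; the names (xact_t, xact_pool_q, request_q,
rpc_daemon_tid, TX_EVENT, RPC_DONE_EVENT) are made up for illustration
and are not the actual rpcio.c identifiers.

#include <rtems.h>
#include <stdlib.h>

#define TX_EVENT        RTEMS_EVENT_1   /* wakes the RPC daemon       */
#define RPC_DONE_EVENT  RTEMS_EVENT_2   /* daemon -> caller           */

typedef struct {
    long      tolive;      /* remaining lifetime (ticks)              */
    int       timed_out;   /* set by the daemon on timeout            */
    int       failed;      /* set by the daemon on transmit failure   */
    rtems_id  requestor;   /* task to wake with RPC_DONE_EVENT        */
    void     *ibuf;        /* reply buffer, filled in by the daemon   */
} xact_t;

extern rtems_id xact_pool_q;     /* message queue used as free pool   */
extern rtems_id request_q;       /* XACTs queued for the daemon       */
extern rtems_id rpc_daemon_tid;  /* task id of the RPC I/O daemon     */

void nfs_call_sketch(void)
{
    xact_t         *x;
    size_t          sz = sizeof(x);
    rtems_event_set got;

    /* 1) get an XACT from the pool, or create one if the pool is empty */
    if (rtems_message_queue_receive(xact_pool_q, &x, &sz,
                                    RTEMS_NO_WAIT, 0) != RTEMS_SUCCESSFUL)
        x = calloc(1, sizeof(*x));

    /* 2) set the timeout and remember which task is asking */
    x->tolive    = 100;   /* ticks; whatever the configured timeout is */
    x->timed_out = 0;
    rtems_task_ident(RTEMS_SELF, RTEMS_SEARCH_ALL_NODES, &x->requestor);

    /* 3) queue the XACT for the daemon and wake the daemon up */
    rtems_message_queue_send(request_q, &x, sizeof(x));
    rtems_event_send(rpc_daemon_tid, TX_EVENT);

    /* 4) block until the daemon signals completion (or timeout) */
    rtems_event_receive(RPC_DONE_EVENT, RTEMS_WAIT | RTEMS_EVENT_ANY,
                        RTEMS_NO_TIMEOUT, &got);

    /* 5) the caller assumed "not timed out" implies "ibuf is valid";
     *    that is exactly the assumption the race below invalidated   */
    if (!x->timed_out && x->ibuf != NULL) {
        /* ... process x->ibuf, then release it ... */
    }

    /* 6) return the XACT to the pool */
    rtems_message_queue_send(xact_pool_q, &x, sizeof(x));
}
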
> In the RPC daemon.
>
> Wait for RX or TX events.  RX events are generated by a callback from
> the socket on receipt of data or on timeout.
>
> TX_EVENT processing.
>    Stage One.
>    On receipt of a TX_EVENT, pull all XACTs in the message queue out of 
> the queue and stick them in a list of XACTs needing processing. Mark 
> them with trip set to "FIRST_TIME".
>
>    Stage Two
>    Ensure another list of XACTs ("newList") is empty.
>    Go through the list of XACTs built on receipt of TX_EVENTs:
>       See if any of the XACTs has timed out (tolive < 0);
>          if so, mark the XACT as timed out and send the RPC event to
> the thread ID of the XACT. (This is where our patch adds the change
> of the XACT's transaction ID.)
>       else
>          Send the output buffer of the XACT to the daemon's server.
> (If the tx fails, mark the XACT as failed and send an RPC event to
> the caller thread.)
>          If the XACT's trip is not FIRST_TIME, then this is a
> retransmit, so adjust the retry_period, keeping it below the maximum
> period.
>          Now set the trip, age and retry_period of the XACT.
>          Add the XACT to the head of the "newList".
>
>    Stage Three
>        Sort the newList by age.
>        Go back and wait for events.
>
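Continuing the sketch from above (and assuming a few extra fields on
the hypothetical xact_t: trip, age and retry_period, plus the obuf
xid), stage two amounts to roughly this per XACT. The timeout branch
is also the spot where the patch further down adds the xid change:

/* Illustrative only: what stage two does for each XACT pulled off the
 * request queue. send_request() and newlist_add_head() stand in for
 * the daemon's real transmit and list handling. */
#define FIRST_TIME        0
#define MAX_RETRY_PERIOD  200               /* ticks, arbitrary here   */

extern int  send_request(xact_t *x);        /* sendto() the output buf */
extern void newlist_add_head(xact_t *x);    /* collected for stage 3   */

static void tx_stage_two(xact_t *x, rtems_interval now)
{
    if (x->tolive < 0) {
        /* Timed out. A reply may still arrive later; this is where the
         * patch below also bumps obuf.xid so that such a late reply can
         * no longer match this XACT in the hash table. */
        x->timed_out = 1;
        rtems_event_send(x->requestor, RPC_DONE_EVENT);
        return;
    }

    if (send_request(x) < 0) {
        /* transmit failed: report straight back to the caller */
        x->failed = 1;
        rtems_event_send(x->requestor, RPC_DONE_EVENT);
        return;
    }

    if (x->trip != FIRST_TIME) {
        /* retransmission: back off, but keep below the maximum period */
        x->retry_period = (2 * x->retry_period < MAX_RETRY_PERIOD)
                              ? 2 * x->retry_period
                              : MAX_RETRY_PERIOD;
    }

    /* update trip/age bookkeeping, then put the XACT at the head of
     * newList; stage three sorts that list by age */
    x->trip += 1;
    x->age   = now + x->retry_period;
    newlist_add_head(x);
}
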
> RX Event processing.
>
>    Get data from the socket.
>    If there is data, extract the transaction ID (xid) from the data
> and compare it to the xids stored in a hash table of the XACT objects
> (as XACTs are created they are added to the hash table).
>       If we find an XACT in the hash table whose xid matches and whose
> server address and port also match:
>          Set the XACT's ibuf to the data we have received.
>          Remove the XACT from the transaction list.
>          Change its xid.
>          Recalculate the server timeouts based on how long this one took.
>          Mark the XACT as received OK.
>          Send the RPC event to the XACT's caller thread ID.
>
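And the RX side, in the same illustrative style (socket_recv(),
hashtab_lookup(), same_server() and friends are placeholders for the
daemon's real receive and lookup code):

/* Illustrative only: match an incoming reply against the hash table of
 * outstanding XACTs and wake up the requestor. */
#include <stdint.h>

extern void    *socket_recv(uint32_t *xid);     /* NULL when drained    */
extern xact_t  *hashtab_lookup(uint32_t xid);   /* xid -> XACT, or NULL */
extern int      same_server(xact_t *x, void *data);
extern void     list_remove(xact_t *x);
extern void     update_rtt_estimate(xact_t *x);
extern void     drop_buffer(void *data);

static void rx_event(void)
{
    void     *data;
    uint32_t  xid;
    xact_t   *x;

    while ((data = socket_recv(&xid)) != NULL) {
        x = hashtab_lookup(xid);
        if (x != NULL && same_server(x, data)) {
            x->ibuf = data;               /* hand the reply buffer over */
            list_remove(x);               /* no longer outstanding      */
            /* change its xid, recalculate the server timeouts from the
             * round-trip time, mark it received OK and wake the caller */
            update_rtt_estimate(x);
            x->timed_out = 0;
            rtems_event_send(x->requestor, RPC_DONE_EVENT);
        } else {
            drop_buffer(data);            /* no matching XACT: discard  */
        }
    }
}
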
> And that's about it.
>
> The problem we had was that if the XACT timed out, we sent the XACT back
> to the caller marked as bad and the caller failed the read request. But
> if the data still came back, it still matched the XACT via the xid in
> the hash table, so the XACT was marked as good, given a buffer of data,
> and another RPC event was sent to the caller thread. The next time the
> nfs call was made it grabbed a new XACT, sent it to the daemon and then
> waited on an RPC_EVENT; but there was already a pending event for that
> thread, so the new XACT was processed even though its buffer was still
> invalid (NULL), and we got a DTLB miss processing the garbage data and
> subsequently crashed.
>
> Following is a patch against rpcio.c with the change that fixes the bug.
>   
Thanks -- good catch!
> --- rpcio.c    2006-11-20 16:50:29.000000000 +1000
> +++ rpcio.c    2006-12-04 17:30:53.633652216 +1000
> @@ -1263,6 +1263,11 @@
>          srv = xact->server;
>  
>          if (xact->tolive < 0) {
> +        /* change the ID - there might still be
> +         * a reply on the way. When it arrives we must not find it's ID
> +         * in the hashtable
> +         */
> +          xact->obuf.xid        += XACT_HASHS;
>            /* this one timed out */
>            xact->status.re_errno  = ETIMEDOUT;
>            xact->status.re_status = RPC_TIMEDOUT;
>
> We are also investigating adding some new functionality to the NFS server:
>
> 1. A function to return all of the NFS/RPC statistics kept by the 
> daemon, rather than just printing them out to a file.
>   
sounds reasonable
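Something along these lines, perhaps (the struct layout and the name
rpcio_get_stats() are just a suggestion; nothing like it exists in
rpcio.c today):

struct rpcio_stats {
    unsigned long requests;       /* RPC calls issued                 */
    unsigned long retransmits;    /* calls that had to be resent      */
    unsigned long timeouts;       /* calls that gave up (ETIMEDOUT)   */
    unsigned long replies;        /* good replies received            */
    unsigned long xacts_created;  /* XACT objects ever allocated      */
};

/* Copy a snapshot of the daemon's counters into *s and return 0;
 * i.e. the same numbers the existing report routine prints to a
 * file, but handed back to the caller instead. */
int rpcio_get_stats(struct rpcio_stats *s);
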
> 2. A function to allow the default hardcoded timeouts to be changed at
> runtime. We find the current timeouts way too long. For example, in
> our application NFS replies always arrive within 100us, so there is no
> point waiting for hundreds of milliseconds to time out. (Any problems
> with us adding these two functions?)
>   
sounds reasonable
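For the timeouts, an interface shaped roughly like this would do
(the name and the microsecond units are only a suggestion):

#include <stdint.h>

/* Replace the hardcoded defaults used for new transactions.
 *   retry_us - initial retransmission period
 *   max_us   - ceiling the retransmission backoff is clamped to
 *   total_us - overall lifetime before a call fails with ETIMEDOUT
 * Returns 0 on success, nonzero if a value is out of range. */
int rpcio_set_timeouts(uint32_t retry_us, uint32_t max_us, uint32_t total_us);

/* e.g. on a LAN where replies arrive within ~100us:
 *   rpcio_set_timeouts(500, 10000, 1000000);
 */
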
> 3. NFS read caching. There are two options we identify: one is to make
> the rpcio daemon nfs-read aware and to handle the read-ahead caching
> in there on nfs-read RPC calls. The other is to do it at the NFS
> layer, but as the calls are synchronous, we would need multiple threads
> to deal with the caching without blocking the original caller and
> defeating the point of caching. Does anyone have any comments on these
> two approaches, i.e. which one would be more acceptable?
>   
I thought about this but I believe it is non-trivial to
implement a good solution.

I would rather not let the RPC daemon know anything about NFS.
Instead, I would implement a 'cache manager' (with potentially
multiple threads doing the actual I/O) in between the NFS
filesystem handlers and the RPC daemon.

The NFS layer would then do synchronous calls to the cache manager.

The cache manager would have to maintain a buffer/block cache,
and a natural idea would be to try to use the existing 'bdbuf'
facility from 'libblock'. However, that facility disables task
preemption for code sequences long and complex enough
for me to shy away from using it; latency is my religion.

(I wonder if bdbuf couldn't be modified to use mutexes instead
of disabling preemption).
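
To make the idea a bit more concrete, the interface I'm thinking of
would look roughly like this (purely hypothetical, none of it exists):

#include <stddef.h>
#include <sys/types.h>

typedef struct cache_mgr cache_mgr_t;        /* opaque to the NFS layer */

/* Create a cache manager with 'nbufs' blocks of 'blocksize' bytes and
 * 'nworkers' threads that perform the actual RPC I/O (read-ahead). */
cache_mgr_t *cmgr_create(size_t nbufs, size_t blocksize, int nworkers);

/* Synchronous call made by the NFS file handlers: satisfy the read
 * from cached blocks if possible; otherwise issue the nfs-read RPCs
 * (possibly scheduling read-ahead on the worker threads) and block
 * only the calling task until its own data is available. */
ssize_t cmgr_read(cache_mgr_t *cm, void *nfs_node,
                  off_t offset, void *buf, size_t len);

/* Drop cached blocks for a file, e.g. after a write or on close. */
void cmgr_invalidate(cache_mgr_t *cm, void *nfs_node);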

-- Till
> Steven J