Failed writes to block devices don't report an error

Chris Johns chrisj at rtems.org
Wed Jun 23 01:18:02 UTC 2010


On 22/06/10 11:03 PM, Arnout Vandecappelle wrote:
>   Hoi all,
>
>   When you write to a file on a block device, failures of the device itself
> don't result in an error of the write.  That's because the bdbuf layer
> caches the blocks that are written and does the actual writeback
> asynchronously after some time.  At that point, of course, it's impossible
> to report an error because the write() call has already returned.
>
>   This behaviour can be pretty annoying, though.  For instance, I'm logging
> to an SD card.  When the SD card is removed, the logging just continues
> without any error at all...
>

The standards talk in terms of asynchronous I/O (the aio_* type calls) 
and "Synchronous Input and Output". The standard describes the need 
for AIO-type functionality in real-time applications that require high 
performance, and it defines "Synchronous Input and Output" as:

  "A determinism and robustness improvement mechanism to enhance
   the data input and output mechanisms, so that an application
   can ensure that the data being manipulated is physically present
   on secondary mass storage devices."

I see us needing both of these. Further, the description of 'write' has:

  "[SIO] If the O_DSYNC bit has been set, write I/O operations
   on the file descriptor shall complete as defined by synchronized
   I/O data integrity completion."

This is an extension. I cannot find any further detail about the SIO 
extensions, but to me it makes sense that we look at a standards-based 
way of managing this. We have an AIO effort underway, so we should also 
consider the O_DSYNC flag.

The benefit is that all file systems will need to support it, and this 
is important. The base definition states for "Synchronized I/O Data 
Integrity Completion":

  "For read, when the operation has been completed or diagnosed
   if unsuccessful. The read is complete only when an image of
   the data has been successfully transferred to the requesting
   process. If there were any pending write requests affecting the
   data to be read at the time that the synchronized read operation
   was requested, these write requests are successfully transferred
   prior to reading the data.

   For write, when the operation has been completed or diagnosed if
   unsuccessful. The write is complete only when the data specified
   in the write request is successfully transferred and all file
   system information required to retrieve the data is successfully
   transferred."

This means a read that follows a write also has to wait for the write 
to sync. I like this approach because it lets the application control 
the data it considers important without everything being affected.

>
>   I have a few ideas of how this can be solved but I'd like some feedback.
>

I think we will need to stage the development so we end up with a 
working O_DSYNC flag. I cannot see a simple, direct path to 
implementing it; we need to work on each file system to make it work.

> 1. Make sure rtems_bdbuf_sync() returns an error if the write fails.
> rtems_bdbuf_sync() is called when disks or files are sync'ed.  It writes
> back a block and waits for the writeback to finish.  Currently, it always
> returns successfully but we could put error detection code in here.

This would be good to have. I see it being needed whichever way we go. 
I also see that the following:

  if (write(fd, buf, size) != (ssize_t) size)
    goto handle_error;
  if (fsync(fd) < 0)
    goto handle_error;

in an application could be a work-around until a more standard approach 
is in place. I prefer this over the other suggestions.

>
> 2. Mark blocks as error and do synchronous writes for error blocks.  After a
> block is written it is not flushed from the cache, but remains there for
> future accesses.  We could mark the block as 'error'.  On the next write to
> that block, we write immediately instead of going through the cache.  That
> allows us to report any error occurring at that time.
>

I do not like this. Written blocks should not be flushed after being 
written; they stay until cache pressure reuses them. If you keep error 
blocks around, I can see all blocks slowly being used up, starving the 
cache of buffers. Solving that would add further complexity. I wonder 
if the O_DSYNC flag would be easier, because the application knows the 
write failed and can manage it.

I suppose the key point is getting feedback to the application.

> 3. In the filesystem (dosfs) add a mount option that makes sure that all
> writes are done synchronously, i.e. they use rtems_bdbuf_sync() rather than
> rtems_bdbuf_release_modified().  This of course relies on #1 so
> rtems_bdbuf_sync() returns an error.

I do not see any long-term value in this. All file systems need to be 
made to work, and the application still does not get any feedback. I 
prefer something that is per file and supported by all file systems.

>
> 4. In the block device driver itself (spi-sd-card.c in my case), detect that
> the media is removed and take appropriate action.  I'm not entirely sure,
> though, what 'appropriate action' would be.
>

I know Sebastian has something like this for USB devices. Maybe he can 
help here.

I also think we should consider adding error-stat reporting at the 
device layer in a standard manner, a little like network drivers have, 
so an application can monitor the devices and manage any issues it sees 
as important. I see this as a device issue, not a bdbuf thing. You open 
the device node and perform the error-stat ioctl call to get the stats.

Chris
