RFC: Bdbuf transfer error handling

Sun Nov 22 00:21:09 UTC 2009

Thomas Doerfler wrote:
> Chris,
> 
> Chris Johns wrote:
>> Sebastian Huber wrote:
>>> R3. Read Ahead Request and No User
>>>
>>> We discard the buffer.  This is the current approach.
>>>
>> As you know I wish to move the read ahead logic out of the cache into
>> the file systems. I propose to change the API to have a
>> rtems_chain_control passed in for gets and reads and the buffers
>> returned linked on the chain. This means a file system can determine the
>> number of buffers it wants and the cache will attempt to do this. If it
>> cannot it does what it can which could be 0 buffers returned because of
>> read errors. It is up to the file system to manage this, typically with
>> an EIO. Note, a resource issue in the cache would block the requester
>> until it can be completed. The way to return ENXIO when the device is
>> not available is something I need to figure out.
> 
> I fear that this would make the filesystems code more complicated,
> because then they are responsible for keeping track which read-ahead
> buffers they have requested.

Not in the file systems we have for RTEMS, including RFS. This is done by the 
cache.

> 
> Example:
> 
> - You open a big file and read the first 1024 bytes.
> - the filesystem will request to read ahead from bdbuf.
> - therefore, the bdbuf read call will only return to the FS code, when
> ALL sectors are available.

This is no different to read ahead being in the cache.

> - then you process the 1024 bytes and it takes a VERY long time to do so
> (e.g. because you transfer them over a slow network connection or do
> complicated math, or send them to a slow output device or....)
> -> since these read-ahead blocks are requested and occupied from the
> file system, these blocks are not available for other caching.
> 
> In this scenario, a lot of buffer space gets eaten up.
> 

This is not what happens currently, nor would I propose it for the future. Any
file system code that holds bds when it releases its internal file system 
lock is a bug. The MSDOS file system currently does this and it is a bug.

> If you change this scenario slightly, because the read data is processed
> quickly, you get a performance gain since the read-ahead requires less
> transactions between bdbuf and the storage hardware.

Sure, this is the purpose of read ahead, but it breaks down when the file system
knows it only wants 1 block. For example a small RFS (or ext2fs) partition
with a block size such that you only have 1 bitmap allocator for blocks but 
you have a read ahead of 4 blocks. Every time you allocate a block you end up 
reading 3 blocks that you may never need at the cost of data already in the 
cache you may need. Also a single cache setting for read ahead has to fit all 
disk sizes, file systems and needs on a single system. That is difficult to 
get right.

> So my point is that it would make sense to have the "read-ahead" buffers
> marked as "reusable".

I am sorry I do not follow. Are you suggesting we implement read locks and 
write locks on bds ? This is a complication I have managed to avoid.

My suggestion is simple. Change the get, read and release calls to pass a
chain. On the get and read calls the file system asks for the number of blocks
to be returned. The cache will always attempt to return the first block and if
it cannot, it returns an error code. This is part of the thread with Sebastian.
The file system needs to manage getting less data than it asked for with a
further request for data. The release call using a chain allows a single lock
of the cache to handle more than one bd. This is a win. Currently the file
system can only request single blocks even if the hardware and cache have read
more. The overhead is repeated cache lock/unlock calls for no real gain. It is
rare to have 2 users accessing the same device at the cache level. It can
happen with tools like dd and hexdump. The file system must release all bds
back to the cache before returning to the user.
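
A minimal sketch of how such a chain based read and release could look from
the file system side. Every name here (bd, bd_chain, mock_cache,
mock_read_chain, mock_release_chain, fs_read_blocks) is a hypothetical
stand-in and not the RTEMS bdbuf or chain API. The mock cache returns a short
chain when it runs out of buffers, whereas the proposal above would block the
requester on a resource issue, so treat this only as an illustration of the
"ask for N, get at least the first or an error, ask again for the rest" flow.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical stand-ins, not the RTEMS API. */
typedef struct bd { struct bd *next; unsigned block; } bd;
typedef struct { bd *head; bd *tail; size_t count; } bd_chain;

#define MOCK_BDS 8
typedef struct {
  bd       bds[MOCK_BDS];
  bd      *free_list;
  unsigned bad_block;  /* a read of this block number fails */
  unsigned locks;      /* how many times the cache lock was taken */
} mock_cache;

static void chain_init(bd_chain *c) { c->head = c->tail = NULL; c->count = 0; }

static void chain_append(bd_chain *c, bd *b)
{
  b->next = NULL;
  if (c->tail != NULL) c->tail->next = b; else c->head = b;
  c->tail = b;
  c->count++;
}

static void mock_cache_init(mock_cache *cache, unsigned bad_block)
{
  cache->free_list = NULL;
  for (size_t i = 0; i < MOCK_BDS; i++) {
    cache->bds[i].next = cache->free_list;
    cache->free_list = &cache->bds[i];
  }
  cache->bad_block = bad_block;
  cache->locks = 0;
}

/* Read up to 'want' blocks starting at 'first' onto 'out'.
 * Returns 0 if at least the first block was read, -1 otherwise. */
static int mock_read_chain(mock_cache *cache, unsigned first, size_t want,
                           bd_chain *out)
{
  assert(want > 0);
  chain_init(out);
  cache->locks++;  /* one lock covers the whole request */
  for (size_t i = 0; i < want && cache->free_list != NULL; i++) {
    if (first + i == cache->bad_block) break;  /* simulated read error */
    bd *b = cache->free_list;
    cache->free_list = b->next;
    b->block = first + (unsigned) i;
    chain_append(out, b);
  }
  return out->count > 0 ? 0 : -1;
}

/* Hand every buffer on the chain back to the cache under one lock. */
static void mock_release_chain(mock_cache *cache, bd_chain *chain)
{
  cache->locks++;
  while (chain->head != NULL) {
    bd *b = chain->head;
    chain->head = b->next;
    b->next = cache->free_list;
    cache->free_list = b;
  }
  chain_init(chain);
}

/* File system side: returns the number of blocks consumed, or -1 if not
 * even the first block could be read (where a file system would use EIO). */
static long fs_read_blocks(mock_cache *cache, unsigned first, size_t want)
{
  size_t done = 0;
  while (done < want) {
    bd_chain chain;
    if (mock_read_chain(cache, first + (unsigned) done, want - done,
                        &chain) != 0)
      return done > 0 ? (long) done : -1;
    done += chain.count;  /* a real file system would copy data out here */
    mock_release_chain(cache, &chain);  /* nothing held across requests */
  }
  return (long) done;
}
```

In this sketch a request for 4 blocks with no errors costs 2 cache locks (one
read, one release) rather than one lock per get and release of each block.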

It is similar to the way the read and write calls work. You provide a buffer 
of a specific size and the file system fills as much data as it can returning 
the amount read or written. The user needs to handle the case where only some 
of the data was processed.

A file system with control of read size could map the data read to the amount
requested by the user. For example a user passes in a 64K buffer to read a
file that is only 1K long. The file system can request just 1K. If the file is
large it could request 64K. Currently the file systems sit in a loop doing
this so the time is similar for the user.
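
To make the sizing concrete, here is a small sketch of how a file system might
pick the number of blocks to ask for. The helper name blocks_to_request and
its parameters are assumptions for illustration only, and it assumes the read
starts on a block boundary. The cap parameter stands for whatever upper limit
the file system chooses to put on a single request.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many blocks to request for one user read.
 * user_len       bytes the user asked for
 * file_remaining bytes left in the file from the current offset
 * block_size     cache block size in bytes
 * cap            most blocks the file system will request at once */
static size_t blocks_to_request(size_t user_len, size_t file_remaining,
                                size_t block_size, size_t cap)
{
  assert(block_size > 0);
  /* Never read past the end of the file or past the user's buffer. */
  size_t bytes = user_len < file_remaining ? user_len : file_remaining;
  /* Round up to whole blocks without risking overflow. */
  size_t blocks = bytes / block_size + (bytes % block_size != 0 ? 1 : 0);
  return blocks < cap ? blocks : cap;
}
```

With 512 byte blocks and a cap of 16, a 64K buffer on a 1K file asks for 2
blocks, while the same buffer on a large file asks for the capped 16.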

> ----------------
> 
> I agree that only the file system knows, if and how much read ahead
> really makes sense. OTOH, the switch from sector based bdbuf to block
> (cluster?) based bdbuf also reads bigger chunks, which partially solves
> the read-ahead requirements.

It does but we can go a step further for those file systems that can handle this.

> 
> Would it make sense that the filesystems code simply passes an
> additional "hint" parameter to each read/get call and the bdbuf layer is
> again responsible to do the read-ahead (or not) and to keep track of the
> available buffers?
> 

We could but it only adds complication. My proposal is to remove the logic 
from the cache to lower its complexity. In the file system all it needs to do 
is pop the first buffer from the chain and release the remaining buffers on 
the chain. This would behave the same as read ahead in the cache and I think
it is simpler.
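
A sketch of the "pop the first buffer, release the rest" step for a file
system that only wants one block. The types and functions (ra_buf, ra_chain,
ra_chain_pop_first, ra_release_all, fs_take_first_only) are hypothetical and
do not correspond to the RTEMS chain or bdbuf calls. The release is modelled
by setting a flag on each buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types; not the RTEMS chain or bdbuf API. */
typedef struct ra_buf {
  struct ra_buf *next;
  unsigned       block;
  int            released;  /* set when handed back to the cache */
} ra_buf;

typedef struct { ra_buf *head; } ra_chain;

/* Detach and return the first buffer, or NULL if the chain is empty. */
static ra_buf *ra_chain_pop_first(ra_chain *chain)
{
  ra_buf *first = chain->head;
  if (first != NULL) {
    chain->head = first->next;
    first->next = NULL;
  }
  return first;
}

/* Stand-in for a chain release call: marks every buffer left on the
 * chain as released and returns how many there were. */
static size_t ra_release_all(ra_chain *chain)
{
  size_t n = 0;
  ra_buf *b;
  while ((b = ra_chain_pop_first(chain)) != NULL) {
    b->released = 1;
    n++;
  }
  return n;
}

/* File system wanting a single block: keep the first buffer and hand
 * the remaining ones straight back. */
static ra_buf *fs_take_first_only(ra_chain *chain, size_t *handed_back)
{
  assert(handed_back != NULL);
  ra_buf *wanted = ra_chain_pop_first(chain);
  *handed_back = ra_release_all(chain);
  return wanted;
}
```

The buffer that is kept still has to be released before the file system is
unlocked.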

> I have already discussed with Sebastian, I think it would make sense to
> define some usage scenarios for the file systems/bdbuf/blockdev area,

In the RFS I already maintain a list of recently used and shared bds. It is a
chain that holds about 5 bds at most. I define a file system transaction as the
time from the file system being locked to being unlocked. The RFS holds
buffers only during the transaction, that is, all buffers must be released when
the file system is unlocked.

For meta-data accesses such as bitmap allocators and inodes I only want
single blocks. For mapped data, including the map's own data, the number of
blocks read would be capped. The exact number read would depend on the
amount of data requested by the user and the size of the map being read.

> so
> we can discuss the pros and cons of the different architectures from a
> common basis.

I think you will need to give me an example.

> And before you get the wrong impression: I really appreciate the great
> improvement you are doing in that area,

Yeah I agree and please do not only thank me. Sebastian is also doing a great 
job. It has been so good having a peer review the code and improve it.

> it is just that we have
> different use cases in mind when doing certain design decisions and
> therefore we tend to different paths.

What file system are you using ?
How does this all affect the USB disk access ?

> 
> With kind regards,
> 
> Thomas.
> 
>