tasskjapp at gmail.com
Fri Mar 17 13:01:08 UTC 2017
Thanks for the notice. I'll try to get around to writing a test case.
Here's a bit more detailed info from the dev who has done the deep
digging into this:
In msdos_shut_down() (msdos_fsunmount.c) there is a call to fat_file_close(..) which attempts to close a file
descriptor and write a range of metadata to that file's directory entry, located in another cluster:
The problem is that this is the root node, which of course has no corresponding parent directory entry.
In addition, the "parent directory entry" cluster number is initialised to 0x1 (FAT_ROOTDIR_CLUSTER_NUM),
which does not conform to the FAT specification (cluster numbering starts at 2).
This creates a critical bug that overwrites unrelated data in higher sectors: 2 is subtracted from the cluster
number (1) to calculate the sector number of the cluster, and through a series of function calls this yields a
sector number at the end of FAT2 (just below the start of the cluster region). The driver believes this sector is
in a FAT region (in fat_buf_release), writes it to what it "thinks" is FAT1, then proceeds to copy the change to
FAT2 by adding FAT_LENGTH (8161) to the sector number, producing a write well into the cluster region that
randomly overwrites files.
The three function calls above lead to fsck complaining about the disk structure:
fsck from util-linux 2.27.1
fsck.fat 3.0.28 (2015-05-16)
0x41: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
1) Remove dirty bit
2) No action
There are differences between boot sector and its backup.
This is mostly harmless. Differences: (offset:original/backup)
1) Copy original to backup
2) Copy backup to original
3) No action
Truncating second to 0 bytes because first is FAT32 root dir.
File size is 4096 bytes, cluster chain length is 0 bytes.
Truncating file to 0 bytes.
Perform changes ? (y/n) n
/dev/sdm1: 14 files, 1600/1044483 clusters
In particular, the "shared cluster" problem is caused by fat_file_write_first_cluster_num, which adds a directory
entry to the root directory cluster pointing at itself; e.g. there is a directory entry in cluster 2 pointing to
a file in cluster 2. (Note: this occurs after we fixed the "points to cluster #1" issue by reading the actual
location of the root cluster node from the FAT volume info structure.)
Removing the call in msdos_shut_down(..) that closes the root file descriptor solves the problem completely
(clean fsck). However, we are a bit unsure about the intent behind closing the root directory.
On 03/17/17 10:29, Sebastian Huber wrote:
> I fixed a couple of FAT file system bugs yesterday on the Git master.
> It would be great if you could provide a self-contained test case for
> your issue. See for example "testsuite/fstests/fsdosfsname02".
> On 17/03/17 10:20, Tasslehoff Kjappfot wrote:
>> We have narrowed this down a bit, and I want to run something by you. It
>> seems the unmount of a FAT filesystem can cause random overwrites.
>> The sequence msdos_shutdown -> fat_file_close -> fat_file_update causes
>> the driver to operate on cluster #1 (set in fat_fd->dir_pos.sname.cln).
>> The rootdir cluster is not #1; the value seems to be taken from the define
>> FAT_ROOTDIR_CLUSTER_NUM, which is used in a couple of places in the code.
>> The rootdir cluster found in fat.c is #2. // vol->rdir_cl =
>> We seem to get no corruption if we add the following line at the top of
>> fat_fd->dir_pos.sname.cln = 2 // should get this from rdir_cl
>> I suspect that the other places FAT_ROOTDIR_CLUSTER_NUM is used can also
>> cause problems.
>> Are we on to something?
>> On 03/13/17 16:36, Gedare Bloom wrote:
>>> On Mon, Mar 13, 2017 at 11:05 AM, Tasslehoff Kjappfot
>>> <tasskjapp at gmail.com> wrote:
>>>> On Mon, Mar 13, 2017 at 3:48 PM, Gedare Bloom <gedare at rtems.org>
>>>>> On Mon, Mar 13, 2017 at 9:42 AM, Tasslehoff Kjappfot
>>>>> <tasskjapp at gmail.com> wrote:
>>>>>> A little update on this. I found out that if I do the following, the
>>>>>> MD5 is wrong the second time I check it.
>>>>>> 1. Write upgrade files
>>>>>> 2. Check MD5
>>>>>> 3. Unmount
>>>>>> 4. Mount
>>>>>> 5. Check MD5
>>>>> What is the return value from unmount?
>>>> unmount is successful every time.
>>> I did not think dosfs supported the unmount() function, so this is
>>> surprising to me. How do you call it?
>>>>>> If I do not unmount/mount, the MD5 is ok, even after a reboot.
>>>>>> With JTAG I discovered that after I have initiated an unmount, the
>>>>>> bdbuf_swapout_task tries to do 3 writes into blocks inside the file where
>>>>>> the MD5 check fails. If I just ignore those writes, it also works.
>>>>> Now that is strange. It may be worth it to inspect the
>>>>> bdbuf_cache.modified and bdbuf_cache.sync chains. Those are what the
>>>>> swapout task processes. A guess is maybe there is a race condition
>>>>> between the two lists when the sync happens, and you are getting a
>>>>> couple of extra writes.
>>>> Sounds plausible. Is it possible to bypass/disable the bdbuf cache
>>>> altogether? I have not configured anything related to SWAPOUT in my
>>>> application, and the BDBUF setup is the following.
>>> You can't entirely avoid it without changing the filesystem you use.
>>>> #define CONFIGURE_BDBUF_MAX_READ_AHEAD_BLOCKS (16)
>>>> #define CONFIGURE_BDBUF_MAX_WRITE_BLOCKS (64)
>>>> #define CONFIGURE_BDBUF_BUFFER_MIN_SIZE (512)
>>>> #define CONFIGURE_BDBUF_BUFFER_MAX_SIZE (32 * 1024)
>>>> #define CONFIGURE_BDBUF_CACHE_MEMORY_SIZE (4 * 1024 * 1024)
>>> You may like to define a smaller CONFIGURE_SWAPOUT_BLOCK_HOLD
>>> (and a smaller CONFIGURE_SWAPOUT_SWAP_PERIOD).
>>> These two control the delay before swapout writes to disk.
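[Editor's note: a bdbuf configuration along the lines suggested above might look like the following. The two swapout macros are the ones named in the reply; the values shown are illustrative, not recommendations, and the authoritative defaults and units are in the RTEMS confdefs documentation.]

```c
/* Illustrative bdbuf tuning: shorter hold/period means modified buffers
 * are flushed to disk sooner (values in milliseconds). */
#define CONFIGURE_SWAPOUT_BLOCK_HOLD  250
#define CONFIGURE_SWAPOUT_SWAP_PERIOD 250
```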
>>>>> You might also like to enable RTEMS_BDBUF_TRACE at the top of bdbuf.c
>>>> Thanks for the tip.