<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<p>Thanks for the notice. I'll try to get around to writing a test
case.</p>
<p>Here's a bit more detailed info from the dev who has done the
deep digging into this:</p>
<p><tt>===</tt><br>
</p>
<pre style="color: rgb(0, 0, 0); font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">In msdos_shut_down ( msdos_fsunmount.c ) there is a call to fat_file_close( .. ) which attempts to close a file
descriptor and write a range of metadata to that file's directory entry located in another cluster:
* fat_file_write_first_cluster_num
* fat_file_write_file_size
* fat_file_write_time_and_date
The problem is that this is the root node, which of course has no corresponding parent directory entry.
In addition, the "parent directory entry" cluster number is initialised to 0x1 (FAT_ROOTDIR_CLUSTER_NUM),
which is not valid according to the FAT specification (cluster numbering starts at 2).
This creates a critical bug that overwrites seemingly random sectors: 2 is subtracted from 1
to calculate the sector number of the cluster, which, through a series of function calls, leads to a sector number at
the end of FAT2 (just below the start of the cluster region). The driver believes this is a FAT region (in fat_buf_release),
writes the sector to what it "thinks" is FAT1, then proceeds to copy the changes to FAT2 by adding FAT_LENGTH (8161) to the sector number,
leading to a write well into the cluster region, randomly overwriting files.
The three function calls above lead to fsck complaining about disk structure:
#######
fsck from util-linux 2.27.1
fsck.fat 3.0.28 (2015-05-16)
0x41: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
1) Remove dirty bit
2) No action
? 2
There are differences between boot sector and its backup.
This is mostly harmless. Differences: (offset:original/backup)
65:01/00
1) Copy original to backup
2) Copy backup to original
3) No action
? 3
/ and
/APPLICAT.ION
share clusters.
Truncating second to 0 bytes because first is FAT32 root dir.
/APPLICAT.ION
File size is 4096 bytes, cluster chain length is 0 bytes.
Truncating file to 0 bytes.
Perform changes ? (y/n) n
/dev/sdm1: 14 files, 1600/1044483 clusters
########
In particular, the "shared cluster" problem is caused by fat_file_write_first_cluster_num, which adds a directory
entry to the root directory cluster pointing at itself; i.e. there is a directory entry in cluster 2 pointing to
a file in cluster 2. (Note: this occurs because we have fixed the "point to cluster #1" issue by reading the relative
location of the root cluster node from the FAT volume info structure.)
Removing the function call in msdos_shut_down ( .. ) to close the root file descriptor solves the problem perfectly
(clean fsck). However, we're a bit unsure about the intent behind closing the root directory. </pre>
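<p>The arithmetic described above can be sketched as follows. This is a
  minimal illustration, not the RTEMS code: the helper names are
  hypothetical, and RESERVED_SECTORS and SECTORS_PER_CLUSTER are assumed
  values chosen only to make the numbers concrete. Only FAT_LENGTH (8161)
  comes from the report. With unsigned arithmetic, cluster 1 maps to a
  sector just below the data region, and mirroring that sector by adding
  the FAT length lands inside the data region.</p>
<pre>
```c
#include <stdint.h>

/* Illustrative volume layout; every value except FAT_LENGTH is assumed. */
#define FAT_LENGTH          8161u  /* sectors per FAT, from the report above */
#define RESERVED_SECTORS    32u    /* assumed */
#define SECTORS_PER_CLUSTER 8u     /* assumed */
#define FAT1_START          RESERVED_SECTORS
#define FAT2_START          (FAT1_START + FAT_LENGTH)
#define DATA_START          (FAT2_START + FAT_LENGTH)

/* Hypothetical helper: first sector of a cluster. Only valid for cln >= 2;
 * for cln == 1 the unsigned subtraction wraps around. */
static uint32_t cluster_to_sector(uint32_t cln)
{
    return DATA_START + (cln - 2u) * SECTORS_PER_CLUSTER;
}

/* Hypothetical helper: true if the sector lies in FAT1 or FAT2. */
static int sector_in_fat_region(uint32_t sector)
{
    return sector >= FAT1_START && sector < DATA_START;
}

/* Hypothetical helper: sector written when a change that is believed to be
 * in FAT1 is copied to FAT2. */
static uint32_t fat2_mirror_sector(uint32_t sector)
{
    return sector + FAT_LENGTH;
}
```
</pre>
<p>Under these assumptions, cluster_to_sector(1) is treated as a FAT
  sector, and fat2_mirror_sector() of that value falls past DATA_START,
  which matches the observed overwrite of file data.</p>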
<p><br>
</p>
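<p>One possible shape for a fix, given that removing the close of the
  root descriptor gives a clean fsck, is to skip the directory-entry
  updates for any node that has no parent entry. The sketch below uses
  made-up toy types and names (toy_file, toy_file_close); it is not the
  RTEMS dosfs API and only illustrates the guard.</p>
<pre>
```c
#include <stdbool.h>
#include <stdint.h>

#define FIRST_VALID_CLUSTER 2u  /* FAT cluster numbering starts at 2 */

/* Toy stand-in for a file descriptor; not the RTEMS structure. */
struct toy_file {
    uint32_t parent_entry_cluster; /* cluster holding the directory entry */
    bool     is_root;              /* the root node has no parent entry */
};

/* True only if there is a real directory entry to update. */
static bool toy_has_parent_entry(const struct toy_file *f)
{
    return !f->is_root && f->parent_entry_cluster >= FIRST_VALID_CLUSTER;
}

/* Counts the metadata writes a close would perform: first cluster number,
 * file size, and time and date. The root node performs none. */
static int toy_file_close(const struct toy_file *f, unsigned *metadata_writes)
{
    if (!toy_has_parent_entry(f)) {
        return 0; /* nothing to write back for the root directory */
    }
    *metadata_writes += 3u;
    return 0;
}
```
</pre>
<p>Whether the real driver should guard inside the close path or simply
  not close the root node at unmount is an open question here.</p>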
<br>
<div class="moz-cite-prefix">On 03/17/17 10:29, Sebastian Huber
wrote:<br>
</div>
<blockquote cite="mid:58CBAC7D.8060303@embedded-brains.de"
type="cite">I fixed a couple of FAT file system bugs yesterday on
the Git master. It would be great if you could provide a
self-contained test case for your issue. See for example
"testsuite/fstests/fsdosfsname02".
<br>
<br>
On 17/03/17 10:20, Tasslehoff Kjappfot wrote:
<br>
<blockquote type="cite">We have narrowed this down a bit, and I
want to run something by you. It
<br>
seems the unmount of a FAT filesystem can cause random
overwrites.
<br>
<br>
The sequence msdos_shutdown -> fat_file_close ->
fat_file_update causes
<br>
      the driver to operate on cluster #1 (set in
        fat_fd->dir_pos.sname.cln).
<br>
      The rootdir cluster is not #1; that value seems to be taken from the
        define
      <br>
      FAT_ROOTDIR_CLUSTER_NUM, which is used in a couple of places in the
        code.
<br>
<br>
The rootdir cluster found in fat.c is #2. // vol->rdir_cl =
<br>
FAT_GET_BR_FAT32_ROOT_CLUSTER(boot_rec);
<br>
<br>
We seem to get no corruption if we add the following line at the
top of
<br>
msdos_shutdown:
<br>
<br>
      fat_fd->dir_pos.sname.cln = 2; // should get this from rdir_cl
<br>
<br>
I suspect that the other places FAT_ROOTDIR_CLUSTER_NUM is used
can also
<br>
cause problems.
<br>
<br>
Are we on to something?
<br>
<br>
Tasslehoff
<br>
<br>
On 03/13/17 16:36, Gedare Bloom wrote:
<br>
<blockquote type="cite">On Mon, Mar 13, 2017 at 11:05 AM,
Tasslehoff Kjappfot
<br>
        <a class="moz-txt-link-rfc2396E" href="mailto:tasskjapp@gmail.com">&lt;tasskjapp@gmail.com&gt;</a> wrote:
<br>
<blockquote type="cite">On Mon, Mar 13, 2017 at 3:48 PM,
            Gedare Bloom <a class="moz-txt-link-rfc2396E" href="mailto:gedare@rtems.org">&lt;gedare@rtems.org&gt;</a> wrote:
<br>
<blockquote type="cite">On Mon, Mar 13, 2017 at 9:42 AM,
Tasslehoff Kjappfot
<br>
            <a class="moz-txt-link-rfc2396E" href="mailto:tasskjapp@gmail.com">&lt;tasskjapp@gmail.com&gt;</a> wrote:
<br>
<blockquote type="cite">A little update on this. I found
out that if I do the following, the
<br>
md5sum
<br>
is wrong the second time I check it.
<br>
<br>
1. Write upgrade files
<br>
2. Check MD5
<br>
3. Unmount
<br>
4. Mount
<br>
5. Check MD5
<br>
<br>
</blockquote>
What is the return value from unmount?
<br>
</blockquote>
unmount is successful every time.
<br>
</blockquote>
        I did not think dosfs supported the unmount() function, so this is
<br>
surprising to me. How do you call it?
<br>
<br>
<blockquote type="cite">
<blockquote type="cite">
<blockquote type="cite">If I do not unmount/mount, the MD5
is ok, even after a reboot.
<br>
<br>
With JTAG I discovered that after I have initiated an
unmount, the
<br>
bdbuf_swapout_task tries to do 3 writes into blocks
inside the file
<br>
where
<br>
the MD5 check fails. If I just ignore those writes, it
also works.
<br>
<br>
</blockquote>
Now that is strange. It may be worth it to inspect the
<br>
bdbuf_cache.modified and bdbuf_cache.sync chains. Those
are what the
<br>
swapout task processes. A guess is maybe there is a race
condition
<br>
between the two lists when the sync happens, and you are
getting a
<br>
couple of extra writes.
<br>
</blockquote>
Sounds plausible. Is it possible to bypass/disable the bdbuf
cache
<br>
altogether? I have not configured anything related to
SWAPOUT in my
<br>
application, and the BDBUF setup is the following.
<br>
<br>
</blockquote>
You can't entirely avoid it without changing the filesystem
you use.
<br>
<br>
<blockquote type="cite">#define
CONFIGURE_BDBUF_MAX_READ_AHEAD_BLOCKS (16)
<br>
#define CONFIGURE_BDBUF_MAX_WRITE_BLOCKS (64)
<br>
#define CONFIGURE_BDBUF_BUFFER_MIN_SIZE (512)
<br>
#define CONFIGURE_BDBUF_BUFFER_MAX_SIZE (32 * 1024)
<br>
#define CONFIGURE_BDBUF_CACHE_MEMORY_SIZE (4 * 1024 *
1024)
<br>
<br>
</blockquote>
You may like to define a smaller CONFIGURE_SWAPOUT_BLOCK_HOLD
<br>
(and a smaller CONFIGURE_SWAPOUT_SWAP_PERIOD).
<br>
<br>
These two control the delay before swapout writes to disk.
<br>
<br>
<blockquote type="cite">
<blockquote type="cite">You might also like to enable
RTEMS_BDBUF_TRACE at the top of bdbuf.c
<br>
file.
<br>
</blockquote>
Thanks for the tip.
<br>
<br>
<br>
</blockquote>
</blockquote>
</blockquote>
<br>
</blockquote>
<br>
</body>
</html>