[RTEMS Project] #2944: FAT data corruption during unmount()

Tue Mar 28 08:15:10 UTC 2017

#2944: FAT data corruption during unmount()
-----------------------------+-----------------------
 Reporter:  Sebastian Huber  |       Owner:  chrisj@…
     Type:  defect           |      Status:  new
 Priority:  normal           |   Milestone:  4.12
Component:  filesystem       |     Version:  4.11
 Severity:  normal           |  Resolution:
 Keywords:                   |
-----------------------------+-----------------------

Comment (by slemstick):

 Replying to [comment:1 Chris Johns]:
 > Replying to [ticket:2944 Sebastian Huber]:
 > > Removing the function call in msdos_shut_down ( .. ) to close the root
 > > file descriptor solves the problem perfectly (clean fsck).
 >
 > I assume you mean fat_file_close?
 Yes.
 >
 > > However, we're a bit unsure about the intent behind closing the root
 > > directory.
 >
 > There is memory allocated in fat_file_open which you would leak.

 We fixed this issue by creating a special "root file close" function that
 omits the call to fat_file_update() made in fat_file_close() (the call
 that caused the corruption).
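
 To illustrate the idea only, here is a minimal sketch with stand-in types.
 The names demo_fd, demo_open, demo_update, demo_close and demo_root_close
 are hypothetical and are not the RTEMS API. The point is that the root
 variant still frees what open allocated, but skips the metadata write-back
 step that fat_file_update() represents.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for an open FAT file descriptor (not the RTEMS type). */
typedef struct {
    bool is_root;
    int  updates_written; /* counts simulated metadata write-backs */
} demo_fd;

/* Allocate a descriptor, as an open call would. Returns NULL on failure. */
demo_fd *demo_open(bool is_root)
{
    demo_fd *fd = calloc(1, sizeof *fd);
    if (fd != NULL) {
        fd->is_root = is_root;
    }
    return fd;
}

/* Stand-in for the write-back step that fat_file_update() represents. */
static void demo_update(demo_fd *fd)
{
    fd->updates_written++;
}

/* Regular close: write metadata back, then free the descriptor.
 * Returns the number of write-backs performed, or -1 if fd is NULL. */
int demo_close(demo_fd *fd)
{
    if (fd == NULL) {
        return -1;
    }
    demo_update(fd);
    int n = fd->updates_written;
    free(fd);
    return n;
}

/* Root close: skip the write-back, but still free the descriptor so the
 * allocation made at open time is not leaked. */
int demo_root_close(demo_fd *fd)
{
    if (fd == NULL) {
        return -1;
    }
    int n = fd->updates_written;
    free(fd);
    return n;
}
```

 In this sketch, closing a regular file reports one write-back, while
 closing the root reports none; both release the memory.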

 >
 > I see the fat_file_close calls fat_buf_release and if the fs_info cache
 > is not empty it is holding a bdbuf buffer so this would cause a leak of
 > buffers.
 >
 > What about the fat_file_close calls in the msdos init call on error?
 > Would they also cause the same problem?

 Yes, these will cause the same issues.

 To update / summarise this ticket a bit here:

 We originally attempted to fix this problem by setting the hard-coded
 root directory cluster number to 2, in addition to the change above
 (removing the fat_file_update() call in fat_file_close() that caused the
 corruption on unmount).

 However, our attempt to fix the broken root cluster numbering breaks a
 hashing mechanism in fat_file_open(..). This mechanism indexes open file
 descriptors based on 1) parent directory cluster number and 2) offset into
 that directory structure. The issue is that the root directory, and the
 file pointed to by the first directory entry in the root directory, may
 construct their hashes based on the same indexes:

 > Root directory: cluster number 2, offset 0
 > First file in root directory: cluster number 2, offset 0

 Before, this was of course not a problem: the root directory had the
 hard-coded cluster number 1, so the keys were always unique. With
 cluster number 2, the keys can collide, which can cause a number of new
 issues.
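
 The collision can be sketched as follows. This is a simplified,
 hypothetical key function (pos_t, make_key and keys_collide are invented
 for illustration and do not reproduce the formula RTEMS actually uses);
 it only assumes, as described above, that the key is derived from the
 parent directory cluster number and the offset into that directory.

```c
#include <stdint.h>

/* Hypothetical position of an entry: directory cluster plus byte offset. */
typedef struct {
    uint32_t cln; /* cluster number used for the key */
    uint32_t ofs; /* byte offset of the directory entry */
} pos_t;

/* Hypothetical key, NOT the actual RTEMS formula: cluster number in the
 * high bits, directory entry index (32-byte entries) in the low bits. */
static uint32_t make_key(pos_t p)
{
    return (p.cln << 16) | ((p.ofs >> 5) & 0xFFFFu);
}

/* Returns 1 if the root directory, keyed with root_cln at offset 0, gets
 * the same key as the first entry of a root directory that lives in
 * cluster 2 (offset 0); returns 0 otherwise. */
int keys_collide(uint32_t root_cln)
{
    pos_t root       = { root_cln, 0 };
    pos_t first_file = { 2, 0 };
    return make_key(root) == make_key(first_file);
}
```

 With this sketch, keys_collide(2) reports a collision, while
 keys_collide(1) does not, which matches the behaviour described above.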

 Rather than drastically changing the key hashing method, function calls
 and data structures, our fix is to set the hard-coded root directory
 cluster number back to 1 and to trust that removing the calls to
 fat_file_update(on_root_node) is sufficient to avoid the data corruption
 issue described above.

 However, there are two other places in msdos_misc.c where the hardcoded
 root directory cluster number - FAT_ROOTDIR_CLUSTER_NUM - is used:

 > msdos_get_name_node()
 > msdos_get_dotdot_dir_info_cluster_num()

 Like this:

     if ( (MSDOS_EXTRACT_CLUSTER_NUM(dotdot_node)) == 0)
     {
         /*
          * we handle root dir for all FAT types in the same way with the
          * ordinary directories ( through fat_file_* calls )
          */
         fat_dir_pos_init(dir_pos);
         dir_pos->sname.cln = FAT_ROOTDIR_CLUSTER_NUM;
     }

 To my understanding, this condition will never be true, as you should
 never have a cluster number below 2 in a compliant (msdos) FAT file
 system. Does anyone know the intent behind this?

--
Ticket URL: <http://devel.rtems.org/ticket/2944#comment:2>
RTEMS Project <http://www.rtems.org/>