Free considered dangerous was:: RTEMS-Users Need help: Race condition in ATA.c

Thomas Doerfler Thomas.Doerfler at embedded-brains.de
Tue Nov 21 08:19:53 UTC 2006


Hello,

some time ago I was looking for the reason of a problem concerning the
DOS filesystem, or in detail a locking problem in libblock. In rare
conditions, the managment of bdbf structures in libblock gets
inconsistent. Details:

At a piece of code, where preemption is disabled, obviously the
processor got rescheduled to another task, which, at rare occasions,
also started to fuzz around in libblock.

The reason for the preemption was a call to "free". When we reviewd the
relevant code, we considered "free" a non-blocking call in all
situations, but this is not true. By default, malloc/free also maintains
some statistics, and therefore queries some information about the heap
region. In this query, the current task can be suspended.

For libblock, I will apply a patch to rtems-4.7-branch and HEAD to fix
this situation. But our warning may be useful for other RTEMS users.

wkr,
Thomas Doerfler.


Thomas Doerfler schrieb:
> Hello,
> 
> currently we are experiencing a severe problem with libchip/ide/ata.c.
> In rare conditions, a block transfer to/from disk is started twice,
> which leads to an immediate hangup of the ata subsystem.
> 
> We are using RTEMS-4.6.99.2 with a bunch of patches (including the
> memory barrier stuff) on a MPC5200 PowerPC platform and GCC4.0.1 and
> friends.
> 
> I have tracked down the problem to two functions. Here are the details:
> 
> Function 1: ata_add_to_controller_queue:
> -----------------------------
> static void
> ata_add_to_controller_queue(rtems_device_minor_number  ctrl_minor,
>                             ata_req_t                 *areq)
> {
>     Chain_Append(&ata_ide_ctrls[ctrl_minor].reqs, &areq->link);
>     if (Chain_Has_only_one_node(&ata_ide_ctrls[ctrl_minor].reqs))
>     {
> 
>         ata_queue_msg_t msg;
>         ATA_SEND_EVT(msg, ATA_MSG_PROCESS_NEXT_EVT, ctrl_minor, 0);
>     }
> }
> -----------------------------
> 
> 
> Function 2:
> -----------------------------
> static inline void
> ata_request_done(ata_req_t *areq, rtems_device_minor_number ctrl_minor,
>                  rtems_status_code status, int error)
> {
>     preemption_key key;
> 
>     assert(areq);
> 
>     DISABLE_PREEMPTION(key);
>     ATA_EXEC_CALLBACK(areq, status, error);
>     Chain_Extract(&areq->link);
>     free(areq);
> 
>     if (!Chain_Is_empty(&ata_ide_ctrls[ctrl_minor].reqs))
>     {
>         ENABLE_PREEMPTION(key);
>         ata_process_request(ctrl_minor);
>         return;
>     }
>     ENABLE_PREEMPTION(key);
> }
> -----------------------------
> 
> Short description:
> 
> ata_add_to_controller_queue is called (from within libblock/bdbuf.c) in
> the context of the task, which issues a request to the ata driver. It
> adds a new data transfer request to the request queue and, if the
> request queue was empty, it triggers the ATA task to wake up and handle it.
> 
> ata_request_done is called in the context of the ATA task, whenever a
> request has been completely processed. Then it removes the data transfer
> request from the request queue, checks whether there are further
> requests to process (!Chain_Is_empty) and, if yes, immediately starts
> processing the next one.
> 
> ------------------
> What we have found out is, that in rare conditions it seems that:
> 
> 1.) ata_request_done extracts the last element from the reuqest chain
> 
> 2.) then it gets preempted (!!!) and another task calls
> add_to_controller_queue, which in turn adds a new request to the queue
> 
> 3.) and it also sends a trigger to the ATA task to process this request
> 
> 4.) when the ATA task gets processed again, handles the rest of the
> "ata_request_done" code, finds a request in the chain and handles/starts
> it directly.
> 
> 5.) then, in the ATA main function gets the trigger to look into the
> request chain, finds the (already started) data transfer request and
> start processing it again which leads to the disaster.
> 
> ------------------
> I have found various weaknesses in the two functions:
> 
> A) add_to_controller_queue does not disable preemption, although this
> would be the conforming way to ensure cherent accesses to the reuqest
> chain. BUT: when this function is called, preemption is currently always
> disabled (I have checked this with an added test function)
> 
> B) The chain data structure is not volatile, although it is accessed
> from two/multiple tasks. BUT: I have inspected the created assembler
> code, and it seems correct. The chain accesses are not moved out of the
> "preemption prison" in "ata_request_done"
> 
> C) The code is not very robust. BUT: It should work.
> 
> ------------------
> So now here I am. What I have proven with test variables is, that
> "add_to_controller_queue" is called, while the code in
> "ata_request_done" is inside the "preemption prison", which can not
> happen, because this should be avoided by the non-preemtive mode of the
> tasks involved.
> 
> Does anybody have an idea, what might go wrong? How can it be that the
> ata task gets preempted in a section of code, where preemption is disabled?
> 
> wkr,
> Thomas.
> 
> 


-- 
--------------------------------------------
embedded brains GmbH
Thomas Doerfler           Obere Lagerstr. 30
D-82178 Puchheim          Germany
Tel. : +49-89-18 90 80 79-2
Fax  : +49-89-18 90 80 79-9
email: Thomas.Doerfler at embedded-brains.de
PGP public key available on request



More information about the users mailing list