[RTEMS Project] #2556: Implement the O(m) Independence-Preserving Protocol (OMIP)
RTEMS trac
trac at rtems.org
Wed Jan 27 08:45:28 UTC 2016
#2556: Implement the O(m) Independence-Preserving Protocol (OMIP)
-----------------------------+-----------------------------
Reporter: sebastian.huber | Owner: sebastian.huber
Type: enhancement | Status: new
Priority: normal | Milestone: 4.12
Component: cpukit | Version: 4.12
Severity: normal | Keywords:
-----------------------------+-----------------------------
= Background =
The O(m) Independence-Preserving Protocol
([http://www.mpi-sws.org/~bbb/papers/pdf/ecrts13b.pdf OMIP]) is a
generalization of the priority inheritance protocol to clustered
scheduling which avoids the non-preemptive sections present with priority
boosting. The m denotes the number of processors in the system. Its
implementation requires an extension of the scheduler helping protocol
already used for the
[http://www-users.cs.york.ac.uk/~burns/MRSPpaper.pdf MrsP] semaphores.
However, the current implementation of the scheduler helping protocol has
two major issues, see Catellani, Sebastiano, Luca Bonato, Sebastian Huber,
and Enrico Mezzetti: Challenges in the Implementation of MrsP. In
Reliable Software Technologies - Ada-Europe 2015, pages 179–195, 2015.
Firstly, the run-time of some scheduler operations depends on the size of
the resource dependency tree. Secondly, the scheduler operations of
threads which don't use shared resources must deal with the scheduler
helping protocol in case an owner of a shared resource is somehow
involved.
To illustrate the second issue, let us look at the following example. We
have a system with eight processors and two L2 caches. We assign
processor 0 to a partition P for latency-sensitive real-time tasks (e.g.
sensor and actuator handling), processors 1, 2 and 3 to a cluster C,,A,,,
and the remaining processors to a cluster C,,B,, for soft real-time
worker tasks. The worker tasks use a shared resource, e.g. a file system
for data storage. Let us suppose a task R of partition P sends a message
to the workers. This may make a waiting worker ready, which in turn
pre-empts the owner of a shared resource. In this case the scheduler
helping protocol takes action and is carried out by the task R. This
contradicts the intended isolation of the scheduler instances.
The reason for this unfortunate coupling is a design issue of the
scheduler helping protocol implementation. Some scheduler operations may
return a thread in need of help. For example, if a thread is unblocked
which pre-empts an owner of a shared resource, then the pre-empted thread
is returned. Once a thread in need of help is returned, the ask for help
operation of the scheduler is executed. An alternative to this
return-value-based approach is the introduction of a pre-emption
intervention
during thread dispatching. Threads taking part in the scheduler helping
protocol indicate this with a positive resource count value. In case a
thread dispatch occurs and pre-empts an owner of a shared resource, the
scheduler ask for help operation is invoked. So, the work is carried out
on behalf of the thread which takes part in the scheduler helping
protocol.
To overcome the first issue, an improved resource dependency tracking is
required. One approach is to use a recursive red-black tree based data
structure, see #2412.
= Implementation =
There are several steps necessary to implement OMIP.
* Introduce per-scheduler locks.
* Enable context switches with interrupts enabled.
* Add a pre-emption intervention to the thread dispatch.
* Add a table for priority nodes to the thread control block. For each
scheduler instance there is one priority node.
* Update the table in case the thread blocks on a resource, a timeout
while waiting for a resource occurs, or ownership of a resource is
transferred to the thread.
* Use this table in the pre-emption intervention.
* Update the MrsP implementation to the new infrastructure.
Currently, only one scheduler lock for all scheduler instances is used.
This simplified the MrsP implementation, and due to the presence of the
Giant lock it was not an issue. With the elimination of the Giant lock,
however, we need one scheduler lock per scheduler instance to really
profit from the decoupling offered by clustered scheduling.
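The following is a minimal sketch of such a per-scheduler lock, using C11
atomics; the `Scheduler_Instance` type and the acquire/release helpers are
illustrative names only and not the existing RTEMS API.
{{{
#include <stdatomic.h>

/* One lock per scheduler instance instead of one global scheduler lock,
 * so that the clusters do not contend with each other. */
typedef struct Scheduler_Instance {
  atomic_flag lock;
  /* ... ready queues and other per-instance scheduler state ... */
} Scheduler_Instance;

static inline void scheduler_instance_acquire( Scheduler_Instance *s )
{
  while ( atomic_flag_test_and_set_explicit( &s->lock, memory_order_acquire ) ) {
    /* busy wait until the lock is released */
  }
}

static inline void scheduler_instance_release( Scheduler_Instance *s )
{
  atomic_flag_clear_explicit( &s->lock, memory_order_release );
}
}}}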
The current implementation of thread dispatching has some implications
with respect to the interrupt latency. It is crucial to preserve the
system invariant that a thread can execute on at most one processor in the
system at a time. This is accomplished with a boolean indicator in the
thread context. The processor architecture-specific context switch code
marks that a thread context is no longer executing and waits until the
heir context has stopped executing before it restores the heir context
and resumes execution of the heir thread (the boolean indicator is
basically a TTAS lock). So, there is one point in time in which a
processor is
without a thread. This is essential to avoid cyclic dependencies in case
multiple threads migrate at once. Otherwise some supervising entity is
necessary to prevent deadlocks. Such a global supervisor would lead to
scalability problems so this approach is not used. Currently the context
switch is performed with interrupts disabled. Thus in case the heir
thread is currently executing on another processor, the time of disabled
interrupts is prolonged since one processor has to wait for another
processor to make progress.
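The hand-over described above can be sketched as follows; the
`is_executing` flag and the function names are illustrative, the real code
lives in the architecture-specific context switch.
{{{
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
  atomic_bool is_executing; /* true while some processor runs this context */
  /* ... saved registers ... */
} Thread_Context;

static void context_switch( Thread_Context *executing, Thread_Context *heir )
{
  /* ... save the registers of the executing context ... */

  /* From now on this processor owns no thread context. */
  atomic_store_explicit( &executing->is_executing, false, memory_order_release );

  /* TTAS-style wait until the heir context stopped executing elsewhere.
   * With interrupts disabled this wait prolongs the interrupt latency. */
  while ( atomic_load_explicit( &heir->is_executing, memory_order_acquire ) ) {
    /* busy wait */
  }
  atomic_store_explicit( &heir->is_executing, true, memory_order_relaxed );

  /* ... restore the registers of the heir context and resume it ... */
}
}}}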
If we add pre-emption intervention to the thread dispatch sequence, then
there is an even greater need to avoid this issue with the interrupt
latency. Interrupts normally store the context of the interrupted thread
on its stack. In case a thread is marked as not executing, we must not
use its thread stack to store such an interrupt context. We cannot use
the heir's stack before the heir has stopped executing on another
processor. If we enable interrupts during this transition, then we have
to provide an alternative, thread-independent stack for interrupts in
this time frame.
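One way to provide such a stack is a per-processor interrupt stack area
that belongs to no thread; the following declaration is only an
illustration, the sizes and names are assumptions.
{{{
#include <stdint.h>

#define PROCESSOR_COUNT      8          /* assumption for this example */
#define INTERRUPT_STACK_SIZE (4 * 1024) /* assumption for this example */

/* Per-processor interrupt stack that is independent of any thread stack. */
typedef struct {
  uint8_t stack_area[ INTERRUPT_STACK_SIZE ];
} Per_CPU_Interrupt_Stack;

static Per_CPU_Interrupt_Stack interrupt_stacks[ PROCESSOR_COUNT ];

/* The architecture-specific interrupt prologue would switch the stack
 * pointer to the top of interrupt_stacks[ cpu_index ] before saving the
 * interrupted context, so a context marked as not executing never has its
 * thread stack used for an interrupt frame. */
}}}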
The pre-emption intervention should be added to `_Thread_Do_dispatch()`
before the heir is read, and it should perform the following pseudo-code
actions.
{{{
pre_emption_intervention(executing):
  if executing.resource_count > 0:
    executing.lock()
    if executing.is_ready():
      for scheduler in executing.schedulers:
        scheduler.lock()
      if !executing.is_scheduled():
        for scheduler in executing.schedulers:
          scheduler.ask_for_help(executing)
      for scheduler in executing.schedulers:
        scheduler.unlock()
    else if executing.active_help_level > 0:
      idle.use(executing.scheduler_node)
    executing.unlock()
}}}
The scheduler help operation affects multiple scheduler instances. In
terms of locking we have only two options:
* use a global scheduler lock, or
* obtain multiple per-scheduler locks at once.
A global scheduler lock is not an option. To avoid deadlocks, the
per-scheduler locks must be obtained in a fixed order. However, in this
case the per-scheduler locks will observe different worst-case and
average-case acquire times (depending on the order).
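A sketch of the fixed-order acquisition, reusing the illustrative
per-scheduler lock helpers from the sketch above:
{{{
#include <stddef.h>

/* Illustrative per-scheduler lock operations, see the sketch above. */
typedef struct Scheduler_Instance Scheduler_Instance;
void scheduler_instance_acquire( Scheduler_Instance *s );
void scheduler_instance_release( Scheduler_Instance *s );

/* Acquire several per-scheduler locks without deadlock by always taking
 * them in one global, fixed order, e.g. ascending scheduler instance
 * index.  Every code path that needs more than one of these locks must
 * use the same order. */
static void acquire_all_in_fixed_order( Scheduler_Instance **sorted, size_t count )
{
  for ( size_t i = 0; i < count; ++i ) {
    scheduler_instance_acquire( sorted[ i ] );
  }
}

/* The release order does not matter for deadlock freedom. */
static void release_all_locks( Scheduler_Instance **sorted, size_t count )
{
  for ( size_t i = 0; i < count; ++i ) {
    scheduler_instance_release( sorted[ i ] );
  }
}
}}}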
Use a recursive data structure to determine the highest priority available
to a thread for each scheduler instance, e.g.
{{{
typedef struct Thread_Priority_node {
  Priority_Control current_priority;
  Priority_Control real_priority;
  struct Thread_Priority_node *owner;
  RBTree_Node Node;
  RBTree_Control Inherited_priorities;
} Thread_Priority_node;

typedef struct {
  ...
  Thread_Priority_node *priority_nodes; /* One per scheduler instance */
  ...
} Thread_Control;
}}}
Initially a thread has a priority node reflecting its real priority. The
`Thread_Priority_node::owner` is `NULL`. The
`Thread_Priority_node::current_priority` is set to the real priority. The
`Thread_Priority_node::Inherited_priorities` is empty.
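A sketch of this initial state; `rbtree_initialize_empty()` stands in for
whatever operation empties the red-black tree and is an illustrative name
only.
{{{
static void thread_priority_node_initialize(
  Thread_Priority_node *node,
  Priority_Control      real_priority
)
{
  node->real_priority = real_priority;
  node->current_priority = real_priority; /* equals the real priority */
  node->owner = NULL;                     /* not waiting for any mutex owner */
  rbtree_initialize_empty( &node->Inherited_priorities );
}
}}}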
In case the thread must wait for ownership of a mutex, then it enqueues
its priority node in `Thread_Priority_node::Inherited_priorities` of the
mutex owner.
In case the thread is dequeued from the wait queue of a mutex, then it
dequeues its priority node in `Thread_Priority_node::Inherited_priorities`
of the previous mutex owner (ownership transfer) or the current mutex
owner (acquire timeout).
In case the minimum of the `Thread_Priority_node::real_priority` and the
`Thread_Priority_node::Inherited_priorities` changes, then
`Thread_Priority_node::current_priority` is updated. In case the
`Thread_Priority_node::owner` is not `NULL`, the priority change
propagates to the owner, and so on. In case
`Thread_Priority_node::current_priority` changes, the corresponding
scheduler is notified.
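A sketch of this propagation along the owner chain, using the
`Thread_Priority_node` structure above; `inherited_priorities_minimum()`
and `scheduler_notify_priority_change()` are illustrative helpers, not
existing functions.
{{{
/* Illustrative helpers: the minimum priority stored in the
 * Inherited_priorities tree (or the lowest possible priority if the tree
 * is empty) and the notification of the corresponding scheduler. */
Priority_Control inherited_priorities_minimum( const RBTree_Control *tree );
void scheduler_notify_priority_change( Thread_Priority_node *node );

/* Recompute current_priority as the minimum of the real priority and the
 * inherited priorities and propagate a change along the owner chain. */
static void thread_priority_node_update( Thread_Priority_node *node )
{
  while ( node != NULL ) {
    Priority_Control new_priority = node->real_priority;
    Priority_Control inherited =
      inherited_priorities_minimum( &node->Inherited_priorities );

    if ( inherited < new_priority ) {
      new_priority = inherited; /* a lower value means a higher priority */
    }

    if ( new_priority == node->current_priority ) {
      break; /* nothing changed, stop the propagation */
    }

    node->current_priority = new_priority;
    scheduler_notify_priority_change( node );

    node = node->owner; /* continue with the current mutex owner, if any */
  }
}
}}}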
Use the thread lock to protect the priority nodes.
--
Ticket URL: <http://devel.rtems.org/ticket/2556>
RTEMS Project <http://www.rtems.org/>