Red-Black-Tree for Priority Queues

Sebastian Huber sebastian.huber at embedded-brains.de
Mon Nov 18 09:12:30 UTC 2013


Hello,

thanks for all your input.  I am currently evaluating the RTEMS mutex with 
priority queues and priority inheritance as a role model.  My aim is to 
determine what must be done to introduce fine-grained locking on SMP.

Currently we have one Giant lock (a recursive spin lock) that protects 
virtually ALL kernel operations, so it is the Giant bottleneck in the 
system.  The goal is to introduce object- and subsystem-specific locks. 
E.g. a mutex obtain sequence should look like this:

obj.lock()
	must_block = obj.operation()
obj.unlock()
if must_block:
	perform complex blocking operation

So in the good case of an uncontested object we don't interfere with the rest 
of the system.
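
A minimal C sketch of this pattern, assuming a simple test-and-set spin 
lock as the object-specific lock; the names obj_t, obj_obtain() and 
enqueue_and_block() are hypothetical and only illustrate the split between 
the short critical section and the complex blocking path:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
  atomic_flag lock;   /* object-specific lock, initialize with ATOMIC_FLAG_INIT */
  int         count;  /* some object state, e.g. a semaphore count */
} obj_t;

void enqueue_and_block(obj_t *obj);  /* hypothetical blocking path */

static void obj_lock(obj_t *obj)
{
  while (atomic_flag_test_and_set_explicit(&obj->lock, memory_order_acquire)) {
    /* busy wait */
  }
}

static void obj_unlock(obj_t *obj)
{
  atomic_flag_clear_explicit(&obj->lock, memory_order_release);
}

void obj_obtain(obj_t *obj)
{
  bool must_block;

  obj_lock(obj);
  /* short, constant-time operation under the object-specific lock */
  must_block = (obj->count == 0);
  if (!must_block) {
    --obj->count;
  }
  obj_unlock(obj);

  if (must_block) {
    /* complex blocking operation outside the object lock */
    enqueue_and_block(obj);
  }
}

In the uncontended case only the object lock is touched, so unrelated 
parts of the system are not affected.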

The current RTEMS structure doesn't allow this, so a massive restructuring 
of critical sections is necessary (after the RTEMS 4.11 branch creation). 
Some parts of the RTEMS blocking operation sequence are too complex.  You 
can see this in bugs in core components that went undiscovered for many 
years, e.g.

https://www.rtems.org/bugzilla/show_bug.cgi?id=2140
https://www.rtems.org/bugzilla/show_bug.cgi?id=2136
https://www.rtems.org/bugzilla/show_bug.cgi?id=2151

It might be an option to enlarge the interrupt-disabled critical sections 
a bit to simplify the code.

Two metrics are of interest here:

1. the worst-case interrupt latency, and

2. the worst-case thread dispatch latency.

The worst-case interrupt latency is determined by the longest section 
executed with interrupts disabled.  As far as I know, all these sections 
have a constant execution time in RTEMS (there are no loops in them), so 
there is no problem here.

The worst-case thread dispatch latency is much more difficult to determine. 
The most problematic part that I have identified is the delta chain 
handling of the watchdogs.  The priority queues are actually not that bad 
if we keep the maximum of 256 priorities.
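
For reference, a watchdog delta chain stores each timeout as a tick delta 
relative to its predecessor, so an insert has to walk the chain.  A 
minimal sketch, with structure and function names of my own choosing (not 
the actual RTEMS watchdog code):

#include <stddef.h>
#include <stdint.h>

/* Each node stores the tick delta to its predecessor; the absolute
 * expiry time is the sum of all deltas up to and including the node.
 */
typedef struct delta_node {
  struct delta_node *next;
  uint32_t           delta;
} delta_node_t;

/* Insert a timeout that expires "ticks" from now.  The walk makes the
 * execution time depend on the number of active watchdogs, which is
 * what makes the worst-case dispatch latency hard to bound.
 */
void delta_chain_insert(delta_node_t **head, delta_node_t *node,
                        uint32_t ticks)
{
  delta_node_t **link = head;

  while (*link != NULL && (*link)->delta <= ticks) {
    ticks -= (*link)->delta;   /* consume the predecessor's delta */
    link = &(*link)->next;
  }

  node->delta = ticks;
  node->next = *link;
  if (*link != NULL) {
    (*link)->delta -= ticks;   /* rebalance the successor's delta */
  }
  *link = node;
}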

On 2013-11-18 01:04, Pavel Pisa wrote:
> Hello Gedare and Sebastian,
>
> there is my opinion and list of other relevant documentation
> sources from Linux kernel.
>
> On Friday 15 of November 2013 18:40:40 Gedare Bloom wrote:
>> On Fri, Nov 15, 2013 at 12:27 PM, Joel Sherrill
>>
>> <joel.sherrill at oarcorp.com> wrote:
>>> With 256 priorities, what does that mean in practical search
>>> length? The insertion time is bounded once you find it.
>>
>> I'm not certain, but here is a guess. If you had a perfectly balanced
>> tree of 256 nodes the tree would be 8 levels deep thus 8 node accesses
>> for a search. Since the imbalance in a red-black tree can be 2x, I
>> suppose there could be some node at level 8 and some node at level 16
>> in a really badly balanced tree. So the upper bound would be 16 nodes
>> accessed during a search. The insert/remove code will be similar, plus
>> the cost of at most 3 rotations.
>>
>> I'm guessing the search through the block2N structure is much worse,
>> but probably the insert/remove is smaller once you know the slot?
>
> I expect that you do not attach a (doubly linked) list to each allocated
> priority, i.e. waiters would be inserted directly into the R-B tree.
> Then the complexity is not bounded by the maximal number of priorities
> but by the maximal number of threads waiting on the given queue.
> In theory this can mean up to
>    2 x log2(CONFIGURE_MAXIMUM_POSIX_THREADS + CONFIGURE_MAXIMUM_TASKS)
> compare operations for a search. An insert requires the same maximal
> number for search and traceback, plus a fixed maximal number of
> rotations to rebalance. Delete by node pointer when a parent pointer is
> present (if I remember correctly this is the RTEMS R-B case) requires
> only a limited number of rotations and the same maximal traceback as an
> insertion.

Yes, the worst-case search/insert/delete times for an R-B tree are 
O(log2(N)), where N is the number of objects in the tree (here the number 
of waiting threads).
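
As a quick sanity check of the constants involved: a red-black tree with 
N nodes has height at most 2 x log2(N + 1), so the worst-case number of 
node visits per search can be computed directly.  A small sketch:

#include <math.h>
#include <stdio.h>

/* Upper bound on the height of a red-black tree with n nodes:
 * h <= 2 * log2(n + 1).  This bounds the node visits per search.
 */
static unsigned rb_height_bound(unsigned n)
{
  return (unsigned) ceil(2.0 * log2((double) (n + 1)));
}

int main(void)
{
  /* e.g. 255 waiting threads: at most 16 node visits, which matches
   * the estimate for a 256 node tree quoted above */
  printf("%u\n", rb_height_bound(255));
  return 0;
}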

>
> But anyway, for a larger number of waiters the actual result would be
> much better than a linear search and insert in a doubly linked list.
> If you consider that competing task groups can differ in only a single
> priority level, then the solution with fixed O(1) insert and removal
> time is to have a doubly linked list for each priority (256) for each
> wait queue. This is nonsense. 4 queues are some compromise, but in most
> cases it is a waste of list heads. There is no option other than a full
> priority queue for EDF and other schedulers, because their "priorities"
> are fine-grained and cannot be distributed to a fixed number of queues.
>
> The above means that the only reason to choose between an R-B tree and
> a doubly linked list is the R-B overhead for the case of 1 and up to 4
> waiters, where a simpler linear search most probably wins. On more
> powerful CPUs with deeper queues the main problem is the misprediction
> of the left-or-right condition in the search. This can be optimized by
> a conditional move or a computed index into an array of left/right
> child pointers.

Yes, one chain head for each priority and object is not an option ;-)

The priority queues in RTEMS perform a linear search, but its length is 
independent of the number of threads.  Each thread has a wait information 
structure (Thread_Wait_information) which contains a chain control.  The 
first thread of a given priority waiting on a thread queue lends this 
chain control to the thread queue, and additional threads of the same 
priority queue up on this chain (not on the one of the thread queue).  So 
with 256 priority levels the search ends after at most 256 / 4 (chain 
controls of the thread queue) / 2 (forward/backward search) = 32 steps.
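
A rough sketch of how such a search could pick its starting header and 
direction; this is my illustration of the scheme described above, not the 
actual RTEMS code:

#include <stdbool.h>

#define PRIORITY_LEVELS       256
#define HEADERS_PER_QUEUE       4
#define PRIORITIES_PER_HEADER (PRIORITY_LEVELS / HEADERS_PER_QUEUE) /* 64 */

/* Pick one of the 4 chain headers for a given priority.  Each header
 * covers 64 priority levels.
 */
static unsigned header_index(unsigned priority)
{
  return priority / PRIORITIES_PER_HEADER;
}

/* Search from the nearer end of the chain: priorities in the lower half
 * of the 64-level band are reached faster from the front, the upper
 * half from the back.  This halves the worst case from 64 to 32 steps.
 */
static bool search_forward(unsigned priority)
{
  return (priority % PRIORITIES_PER_HEADER) < (PRIORITIES_PER_HEADER / 2);
}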

> I have considered the actual wait queue implementation in RTEMS an
> acceptable compromise, but with a significant weakness, and even a show
> stopper for larger applications where 10 or more tasks wait for the
> same semaphore. I would be very happy if the implementation were
> changed to a more scalable one, but some minimal timing tests should be
> run to check that the overhead for the case of 1, 2, 3, and 4 waiters
> is acceptable. When an R-B tree is used, the optimized case of 1 waiter
> would have minimal overhead, 2 the maximal overhead, and 4 would
> already win for some priority insert order sequences.
>
> One reason for the choice of an R-B tree for priority queues can be
> that the Linux kernel uses them in many similar places as well
> (scheduler, timers, etc.). But for mutexes with full priority
> inheritance another structure is used in the Linux kernel. It is a
> plist, which holds one doubly linked list of all nodes (i.e. waiters)
> and another doubly linked list where only the first node of a given
> priority is stored.
>
>   * pl:prio_list (only for plist_node)
>   * nl:node_list
>   *   HEAD|             NODE(S)
>   *       |
>   *       ||------------------------------------|
>   *       ||->|pl|<->|pl|<--------------->|pl|<-|
>   *       |   |10|   |21|   |21|   |21|   |40|   (prio)
>   *       |   |  |   |  |   |  |   |  |   |  |
>   *       |   |  |   |  |   |  |   |  |   |  |
>   * |->|nl|<->|nl|<->|nl|<->|nl|<->|nl|<->|nl|<-|
>   * |-------------------------------------------|
>
> http://lxr.free-electrons.com/source/include/linux/plist.h
>
> What is more important about the Linux RT-mutex implementation is the
> immediate priority adjustment/decrease when one of multiple held
> mutexes is released by the holding task. RTEMS does the adjustment only
> when all mutexes held by the task are released.
> The behavior is correct when RTEMS_STRICT_ORDER_MUTEX is enabled,
> but then the code has to strictly release mutexes in the reverse order
> of acquisition.
>
> Linux solves this problem by keeping a priority list in each mutex
> which holds a sorted list of all waiting tasks, and yet another
> priority list in each task which tracks all held mutexes and thus
> provides immediate information about the required actual task priority.
> But this costs the manipulation of all these lists on each mutex
> acquire and each wait before an acquire.
>
> http://lxr.free-electrons.com/source/Documentation/rt-mutex-design.txt

Ok, I will have a look at this, thanks for the pointers.
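
For readers unfamiliar with plist: the structures boil down to roughly 
the following simplified sketch (based on include/linux/plist.h, not a 
verbatim copy):

#include <stddef.h>

struct list_head {
  struct list_head *next, *prev;
};

/* node_list links every waiter in priority order; prio_list links only
 * the first waiter of each distinct priority, so an insert searches one
 * node per distinct priority level, not one per waiter.
 */
struct plist_head {
  struct list_head node_list;
};

struct plist_node {
  int              prio;
  struct list_head prio_list;
  struct list_head node_list;
};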

>
> In theory, using an R-B tree for both of these list structures (instead
> of a plist) should be the better choice from the algorithmic complexity
> point of view. The list heads can even be a single pointer for an R-B
> tree; I think this is not the case for the RTEMS R-B tree right now.
> The node structures can be preallocated in the task and mutex
> structures. A task can wait on only one mutex at a time, and a mutex
> can be owned by at most one task as well.

Yes, I think the R-B tree implementation will change if we have other use cases.
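
Regarding the branch misprediction point raised above, one possible 
change would be to store the children in an array, so that the descent 
direction becomes a computed index instead of a branch.  A hedged sketch, 
not the current RTEMS implementation:

#include <stddef.h>

typedef struct rb_node {
  struct rb_node *child[2];   /* child[0] = left, child[1] = right */
  int             key;
} rb_node_t;

/* Branch-light search: the comparison result selects the child index
 * directly, which a compiler can turn into a conditional move instead
 * of a hard-to-predict branch.
 */
rb_node_t *rb_find(rb_node_t *node, int key)
{
  while (node != NULL && node->key != key) {
    node = node->child[key > node->key];
  }
  return node;
}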

>
> But doing this right and testing it fully requires significant
> effort.

Yes, there is a lot to do.

-- 
Sebastian Huber, embedded brains GmbH

Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone   : +49 89 189 47 41-16
Fax     : +49 89 189 47 41-09
E-Mail  : sebastian.huber at embedded-brains.de
PGP     : Public key available on request.

This message is not a business communication within the meaning of the EHUG.


