Red-Black-Tree for Priority Queues

Mon Nov 18 00:04:39 UTC 2013

Hello Gedare and Sebastian,

there is my opinion and list of other relevant documentation
sources from Linux kernel.

On Friday 15 of November 2013 18:40:40 Gedare Bloom wrote:
> On Fri, Nov 15, 2013 at 12:27 PM, Joel Sherrill
>
> <joel.sherrill at oarcorp.com> wrote:
> > With 256 priorities, what does that mean in practical search
> > length? The insertion time is bounded once you find it.
>
> I'm not certain, but here is a guess. If you had a perfectly balanced
> tree of 256 nodes the tree would be 8 levels deep thus 8 node accesses
> for a search. Since the imbalance in a red-black tree can be 2x, I
> suppose there could be some node at level 8 and some node at level 16
> in a really badly balanced tree. So the upper bound would be 16 nodes
> accessed during a search. The insert/remove code will be similar, plus
> the cost of at most 3 rotations.
>
> I'm guessing the search through the block2N structure is much worse,
> but probably the insert/remove is smaller once you know the slot?

I expect that you do not attach (double linked) list to each allocated
priority. I.e. waiters would be directly inserted into R-B tree.
Then complexity is not bounded by maximal number of priorities
but by maximal number of threads waiting on given queue.
This can (in the theory) mean to use
  2 x log2(CONFIGURE_MAXIMUM_POSIX_THREADS + CONFIGURE_MAXIMUM_TASKS)
compare operations for search. The insert requires same maximal number
for search and traceback + that fixed maximal number of rotations
to balance. Delete by node pointer when parent pointer is present (if
I remember well this is RTEMS R-B case) requires only limited
number of rotation and same maximal traceback as insertion.

But anyway the actual result would be much better for larger amount
of waiters than linear search and insert in double linked list.
If you consider that competing tasks groups can differ in only in single
priority level then solution with fixed O(1) insert and removal time
is to have double linked list for each priority (256) for each wait queue.
This is nonsense. 4 queues are some compromise but in most cases it is
waste with list heads. There is no other option then full priority queue
for EDF and other schedulers because their "priorities" are finegrained
and cannot be distributed to the fixed number of queues.

Above means that only reason to consider between R-B and double linked
list selection is R-B overhead for case of 1 and up to 4 waiters where
simpler linear search wins most probably. On more powerfull CPUs
with deeper queues is the main problem miss-prediction in left or right
condition in search. This can be optimized by conditional move/computed
index into left/right child pointers array.

I have considered actual wait queues implementation in RTEMS as acceptable 
compromise but with significant weakness and even show stopper for larger
applications where 10 or more tasks waits for same semaphore.
I would be very happy if implementation is changed to more scalable 
one but some minimal timing tests should be run to that overhead for
case of 1, 2, 3, and 4 waiters is acceptable. When R-B tree is used
then optimized case for 1 would be with minimal overhead, for 2 maximal
and for 4 already winning for some priorities insert order sequences.

One reason for choice of R-B tree for priority queues can be that Linux
kernel uses these in many similar places as well (scheduler, timers, etc).
But for mutexes with full priority inheritance other structure is used
in Linux kernel. It is plist which holds one double linked list of
all nodes (i.e. waiters) and other double linked list where only
the first node of given priority is stored.

 * pl:prio_list (only for plist_node)
 * nl:node_list
 *   HEAD|             NODE(S)
 *       |
 *       ||------------------------------------|
 *       ||->|pl|<->|pl|<--------------->|pl|<-|
 *       |   |10|   |21|   |21|   |21|   |40|   (prio)
 *       |   |  |   |  |   |  |   |  |   |  |
 *       |   |  |   |  |   |  |   |  |   |  |
 * |->|nl|<->|nl|<->|nl|<->|nl|<->|nl|<->|nl|<-|
 * |-------------------------------------------|

http://lxr.free-electrons.com/source/include/linux/plist.h

What is more important on Linux RT-mutexes implementation is
immediate priority adjustment/decrease when one of multiple
held mutexes is released by holding task. RTEMS does adjustment
only when all mutexes held by task are released.
The behavior is correct when RTEMS_STRICT_ORDER_MUTEX is enabled
but then code has to strictly follow release in reverse order
than acquire.

Linux solves this problem that it has priority list in each mutex
which holds sorted list of all waiting tasks and yet another
priority list for each task which provides information about
all held mutexes which provides immediate information
what is required actual task priority. But it costs all these
lists manipulation for each mutex acquire and wait before acquire.

http://lxr.free-electrons.com/source/Documentation/rt-mutex-design.txt

In theory, use of RB-tree for both these list structures (instead of plist)
should be better choice from algorithmic complexity point of view.
The lists heads can be even single pointer for R-B tree. I think not
the case for RTEMS R-B tree now. The node structures can be preallocated
in task and mutex tsructures. Task can wait on only one mutex at time
and mutex can be owned only by one task in maximum as well.

But doing this right and to do full testing requires significant
effort.

Best wishes,

               Pavel