Red-Black-Tree for Priority Queues

Mon Nov 18 17:31:36 UTC 2013

On Sun, Nov 17, 2013 at 7:04 PM, Pavel Pisa <ppisa4lists at pikron.com> wrote:
> Hello Gedare and Sebastian,
>
> there is my opinion and list of other relevant documentation
> sources from Linux kernel.
>
> On Friday 15 of November 2013 18:40:40 Gedare Bloom wrote:
>> On Fri, Nov 15, 2013 at 12:27 PM, Joel Sherrill
>>
>> <joel.sherrill at oarcorp.com> wrote:
>> > With 256 priorities, what does that mean in practical search
>> > length? The insertion time is bounded once you find it.
>>
>> I'm not certain, but here is a guess. If you had a perfectly balanced
>> tree of 256 nodes the tree would be 8 levels deep thus 8 node accesses
>> for a search. Since the imbalance in a red-black tree can be 2x, I
>> suppose there could be some node at level 8 and some node at level 16
>> in a really badly balanced tree. So the upper bound would be 16 nodes
>> accessed during a search. The insert/remove code will be similar, plus
>> the cost of at most 3 rotations.
>>
>> I'm guessing the search through the block2N structure is much worse,
>> but probably the insert/remove is smaller once you know the slot?
>
> I expect that you do not attach (double linked) list to each allocated
> priority. I.e. waiters would be directly inserted into R-B tree.
> Then complexity is not bounded by maximal number of priorities
> but by maximal number of threads waiting on given queue.
> This can (in the theory) mean to use
>   2 x log2(CONFIGURE_MAXIMUM_POSIX_THREADS + CONFIGURE_MAXIMUM_TASKS)
> compare operations for search. The insert requires same maximal number
> for search and traceback + that fixed maximal number of rotations
> to balance. Delete by node pointer when parent pointer is present (if
> I remember well this is RTEMS R-B case) requires only limited
> number of rotation and same maximal traceback as insertion.
>
Thanks, you're right about the limit being due to threads not priorities.

> But anyway the actual result would be much better for larger amount
> of waiters than linear search and insert in double linked list.
> If you consider that competing tasks groups can differ in only in single
> priority level then solution with fixed O(1) insert and removal time
> is to have double linked list for each priority (256) for each wait queue.
> This is nonsense. 4 queues are some compromise but in most cases it is
> waste with list heads. There is no other option then full priority queue
> for EDF and other schedulers because their "priorities" are finegrained
> and cannot be distributed to the fixed number of queues.
>
Right. The current block protocol in RTEMS uses 4 queues with (IIRC) a
non-uniform distribution of the priorities among the 4.

> Above means that only reason to consider between R-B and double linked
> list selection is R-B overhead for case of 1 and up to 4 waiters where
> simpler linear search wins most probably. On more powerfull CPUs
> with deeper queues is the main problem miss-prediction in left or right
> condition in search. This can be optimized by conditional move/computed
> index into left/right child pointers array.
>
> I have considered actual wait queues implementation in RTEMS as acceptable
> compromise but with significant weakness and even show stopper for larger
> applications where 10 or more tasks waits for same semaphore.
> I would be very happy if implementation is changed to more scalable
> one but some minimal timing tests should be run to that overhead for
> case of 1, 2, 3, and 4 waiters is acceptable. When R-B tree is used
> then optimized case for 1 would be with minimal overhead, for 2 maximal
> and for 4 already winning for some priorities insert order sequences.
>
> One reason for choice of R-B tree for priority queues can be that Linux
> kernel uses these in many similar places as well (scheduler, timers, etc).
> But for mutexes with full priority inheritance other structure is used
> in Linux kernel. It is plist which holds one double linked list of
> all nodes (i.e. waiters) and other double linked list where only
> the first node of given priority is stored.
>
>  * pl:prio_list (only for plist_node)
>  * nl:node_list
>  *   HEAD|             NODE(S)
>  *       |
>  *       ||------------------------------------|
>  *       ||->|pl|<->|pl|<--------------->|pl|<-|
>  *       |   |10|   |21|   |21|   |21|   |40|   (prio)
>  *       |   |  |   |  |   |  |   |  |   |  |
>  *       |   |  |   |  |   |  |   |  |   |  |
>  * |->|nl|<->|nl|<->|nl|<->|nl|<->|nl|<->|nl|<-|
>  * |-------------------------------------------|
>
> http://lxr.free-electrons.com/source/include/linux/plist.h
>
> What is more important on Linux RT-mutexes implementation is
> immediate priority adjustment/decrease when one of multiple
> held mutexes is released by holding task. RTEMS does adjustment
> only when all mutexes held by task are released.
> The behavior is correct when RTEMS_STRICT_ORDER_MUTEX is enabled
> but then code has to strictly follow release in reverse order
> than acquire.
>
There is a small bug in RTEMS implementation:
https://www.rtems.org/bugzilla/show_bug.cgi?id=2124

> Linux solves this problem that it has priority list in each mutex
> which holds sorted list of all waiting tasks and yet another
> priority list for each task which provides information about
> all held mutexes which provides immediate information
> what is required actual task priority. But it costs all these
> lists manipulation for each mutex acquire and wait before acquire.
>
> http://lxr.free-electrons.com/source/Documentation/rt-mutex-design.txt
>
This was one design proposed for fixing the bug. It is a lot of management.

> In theory, use of RB-tree for both these list structures (instead of plist)
> should be better choice from algorithmic complexity point of view.
> The lists heads can be even single pointer for R-B tree. I think not
> the case for RTEMS R-B tree now. The node structures can be preallocated
> in task and mutex tsructures. Task can wait on only one mutex at time
> and mutex can be owned only by one task in maximum as well.
>
> But doing this right and to do full testing requires significant
> effort.
>
Thanks

-Gedare
> Best wishes,
>
>                Pavel
>
>
>