RTEMS scheduler bug ?

Tue Apr 9 08:54:53 UTC 2019

Hi,
My device just passed a 24-hour long-duration test, so I can now say that
this issue is history.
Since we were looking at this for quite a while, I would like to thank you,
Sebastian, for your prompt and useful support!

regards,
Catalin

On Fri, Mar 29, 2019 at 11:56 AM Catalin Demergian <demergian at gmail.com>
wrote:

> Hi,
> Some time ago (Sept/Oct 2018) we had a long discussion in which I
> suspected a scheduler issue (subject
> "rtems_message_queue_receive/rtems_event_receive issues").
>
> We got to the point where I realized that _Chain_Append_unprotected might
> fail to add an element to the chain. The effect is a task in an odd state:
> state=READY, but the task is not in the ready chain. Such a task never
> gets CPU time again, because a task needs to be blocked in order to be
> unblocked when new data arrives.
>
> We were using USB then, but the issue became hot again because we just
> hit the same problem over serial :)
> I believe there is a possible chain of events that can make
> _Chain_Append_unprotected fail; the explanation follows.
>
> /*
>  * @note It does NOT disable interrupts to ensure the atomicity of the
>  *       append operation.
>  */
> RTEMS_INLINE_ROUTINE void _Chain_Append_unprotected(
>   Chain_Control *the_chain,
>   Chain_Node    *the_node
> )
> {
>   Chain_Node *tail = _Chain_Tail( the_chain );
>   Chain_Node *old_last = tail->previous;
>
>   the_node->next = tail;
>   tail->previous = the_node;
>   old_last->next = the_node;
>   the_node->previous = old_last;
> }
>
> The lines
>
>   tail->previous = the_node;
>   old_last->next = the_node;
>
> are the ones that actually add the element to the ready chain.
>
> If a thread executes those lines, but just before it executes
>
>   the_node->previous = old_last;
>
> another thread comes in to add another node to this chain, that thread
> will set another node in tail->previous and old_last->next. As a result,
> when the interrupted thread resumes and executes the last line, it will
> be for nothing, because the initial node will not be in the ready chain.
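
To make the hazard concrete, here is a minimal host-side sketch in C of one interleaving that drops a node. It is not RTEMS code: the `node` and `chain` types and all function names are hypothetical stand-ins, and the preemption is simulated by calling the second append by hand. It assumes the preemption lands right after `old_last` has been read, which is slightly earlier than the point named above; with that timing both appends work from the same stale `old_last`, and one node ends up unreachable from the head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the RTEMS chain types (illustrative only). */
typedef struct node {
  struct node *next;
  struct node *previous;
} node;

typedef struct {
  node head; /* sentinel before the first element */
  node tail; /* sentinel after the last element */
} chain;

static void chain_init(chain *c) {
  c->head.next = &c->tail;
  c->head.previous = NULL;
  c->tail.previous = &c->head;
  c->tail.next = NULL;
}

/* Same sequence of stores as _Chain_Append_unprotected. */
static void append_unprotected(chain *c, node *n) {
  assert(c != NULL && n != NULL);
  node *tail = &c->tail;
  node *old_last = tail->previous;

  n->next = tail;
  tail->previous = n;
  old_last->next = n;
  n->previous = old_last;
}

/* Walk forward from the head and report whether n is linked in. */
static bool chain_contains(const chain *c, const node *n) {
  for (const node *it = c->head.next; it != &c->tail; it = it->next) {
    if (it == n) {
      return true;
    }
  }
  return false;
}

/*
 * Assumed interleaving: the append of `a` is preempted right after it has
 * read old_last, the preempting context appends `b` completely, and then
 * `a` resumes with its stale old_last.
 */
static void interleaved_append(chain *c, node *a, node *b) {
  node *tail = &c->tail;
  node *old_last = tail->previous; /* a: snapshot taken */

  append_unprotected(c, b);        /* simulated preemption: b fully appended */

  a->next = tail;                  /* a resumes and overwrites b's links */
  tail->previous = a;
  old_last->next = a;
  a->previous = old_last;
}

/* Returns true when `a` is linked in and `b` has become unreachable. */
bool demo_lost_node(void) {
  chain c;
  node a, b;

  chain_init(&c);
  interleaved_append(&c, &a, &b);
  return chain_contains(&c, &a) && !chain_contains(&c, &b);
}
```

In this model `demo_lost_node()` returns true: `a` is found by a forward walk from the head, while `b` is not, even though both appends ran to completion.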
>
>
> If this chain of events occurs (and after a while it will), we get
> starvation for that task.
>
> I can reproduce this issue in a long-duration test; the time before it
> happens varies from run to run, but it always happens.
>
>
> What I'm proposing is the following: call _Chain_Append instead of
> _Chain_Append_unprotected in the _Scheduler_priority_Ready_queue_enqueue
> function in schedulerpriorityimpl.h.
>
>
> void _Chain_Append(
>   Chain_Control *the_chain,
>   Chain_Node    *node
> )
> {
>   ISR_Level level;
>
>   _ISR_Disable( level );
>     _Chain_Append_unprotected( the_chain, node );
>   _ISR_Enable( level );
> }
>
>
> This way the add-element-to-chain operation becomes atomic.
>
> With this fix I was able to successfully run a long-duration test
> (8 hours) in my setup.
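
The effect of the critical section can be sketched on a host as well. This is only a model under stated assumptions: `irq_disable`, `irq_enable` and `raise_irq` are hypothetical stand-ins for _ISR_Disable, _ISR_Enable and an arriving interrupt, the types are simplified, and nesting of interrupt levels is not modelled. An interrupt that arrives while masked is recorded and only runs once the append has finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the RTEMS chain types (illustrative only). */
typedef struct node {
  struct node *next;
  struct node *previous;
} node;

typedef struct {
  node head; /* sentinel before the first element */
  node tail; /* sentinel after the last element */
} chain;

/* Hypothetical interrupt model: one pending append at most. */
static bool irq_masked;
static chain *pending_chain;
static node *pending_node;

static void chain_init(chain *c) {
  c->head.next = &c->tail;
  c->head.previous = NULL;
  c->tail.previous = &c->head;
  c->tail.next = NULL;
}

static void append_unprotected(chain *c, node *n) {
  assert(c != NULL && n != NULL);
  node *tail = &c->tail;
  node *old_last = tail->previous;

  n->next = tail;
  tail->previous = n;
  old_last->next = n;
  n->previous = old_last;
}

/* Simulated interrupt that wants to append n: deferred while masked. */
static void raise_irq(chain *c, node *n) {
  if (irq_masked) {
    pending_chain = c;
    pending_node = n;
    return;
  }
  append_unprotected(c, n);
}

static void irq_disable(void) { irq_masked = true; }

/* Unmask and run the deferred append, if any. */
static void irq_enable(void) {
  irq_masked = false;
  if (pending_node != NULL) {
    node *n = pending_node;
    pending_node = NULL;
    append_unprotected(pending_chain, n);
  }
}

/* Protected append of `a`; the interrupt for `b` fires in the middle. */
static void append_protected_with_irq(chain *c, node *a, node *b) {
  irq_disable();
  node *tail = &c->tail;
  node *old_last = tail->previous;

  raise_irq(c, b); /* arrives mid-append, gets deferred */

  a->next = tail;
  tail->previous = a;
  old_last->next = a;
  a->previous = old_last;
  irq_enable();    /* b is appended only now, after a is fully linked */
}

/* Returns true when the chain ends up as head -> a -> b -> tail. */
bool demo_protected_append(void) {
  chain c;
  node a, b;

  chain_init(&c);
  append_protected_with_irq(&c, &a, &b);
  return c.head.next == &a && a.next == &b && b.next == &c.tail &&
         c.tail.previous == &b && b.previous == &a;
}
```

In this model `demo_protected_append()` returns true: because the second append cannot start until the first has completed, both nodes stay linked in both directions.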
>
>
> What do you think?
>
>
> regards,
> Catalin
>
>