RTEMS message queues and interrupt safety
ptorre at zetron.com
Tue Apr 18 18:24:01 UTC 2006
We've got a weird bug that may interest someone. I'd be interested
in hearing from anyone who has seen something similar.
The setup: rtems-4.6.0 with various patches merged in from CVS,
running on an MPC855T, with an unsubmitted BSP.
The PowerPC host processor receives interrupts from an external DSP
at a fairly high rate. The ISR that services that interrupt sends
messages to a queue that is read by a "classic API" task. Several
other foreground tasks also send messages to that queue. When
running at high load (lots of interrupts firing) for extended periods
of time, we sometimes see messages that have already been read from
the queue "reappear", as much as tens of milliseconds later. This
seems to be happening because the number_of_pending_messages member
of the CORE_message_queue_Control struct is zero, but the chain of
pending messages is non-empty. When a new message is submitted, it
goes to the end of the chain, and number_of_pending_messages becomes
1. The next time the queue is read, the count is decremented back
to zero, but the wrong message is returned.
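
To make the symptom concrete, here is a toy model in C of a queue that keeps a counter separate from its chain. It is only a sketch of the failure mode described above, not the RTEMS implementation: toy_queue, toy_submit(), toy_seize() and toy_consistent() are hypothetical names, and the chain is a small array rather than a linked list. Starting from the bad state (count zero, one stale entry still on the chain), a submit followed by a seize hands back the stale entry.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: NOT the RTEMS code. All names here are hypothetical. */
#define TOY_CAPACITY 8

typedef struct {
    int    chain[TOY_CAPACITY]; /* pending messages, oldest first */
    size_t chain_len;           /* real number of entries on the chain */
    size_t pending_count;       /* separately maintained counter */
} toy_queue;

/* Append to the tail of the chain and bump the counter. */
void toy_submit(toy_queue *q, int msg)
{
    assert(q->chain_len < TOY_CAPACITY);
    q->chain[q->chain_len++] = msg;
    q->pending_count++;
}

/* Trust the counter, then take the head of the chain.
 * Returns 1 on success, 0 if the queue looks empty. */
int toy_seize(toy_queue *q, int *out)
{
    if (q->pending_count == 0 || q->chain_len == 0)
        return 0;
    q->pending_count--;
    *out = q->chain[0];
    for (size_t i = 1; i < q->chain_len; i++)
        q->chain[i - 1] = q->chain[i];
    q->chain_len--;
    return 1;
}

/* The invariant the instrumentation wants to hold at all times. */
int toy_consistent(const toy_queue *q)
{
    return q->pending_count == q->chain_len;
}

/* Start from the corrupted state and show which message comes back. */
int toy_demo_stale_read(void)
{
    /* Count is zero, but message 111 is still on the chain. */
    toy_queue q = { .chain = {111}, .chain_len = 1, .pending_count = 0 };
    int got = -1;

    toy_submit(&q, 222);   /* goes to the tail, count becomes 1 */
    toy_seize(&q, &got);   /* count goes back to 0, head is returned */
    return got;            /* the stale 111, not the new 222 */
}
```

In this model the new message is left behind on the chain with the count back at zero, so the queue stays one message out of step until something resets it.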
I don't know exactly how the "count is zero but list is not empty"
condition comes about. I put in a bunch of instrumentation to try
to catch it in the act. But interrupts firing in the middle of my
debug code were causing the debug code itself to false-trigger.
So, I resorted to turning interrupts off for the entire duration
of both _CORE_message_queue_Submit() and _CORE_message_queue_Seize().
Now my debug code doesn't false-trigger, but the actual bug doesn't
happen any more. We got pretty good at reproducing it, but with
interrupts disabled in those two functions, we can't make the bug
manifest any more. I don't know if I have actually fixed something,
or just forced the bug into hiding, biding its time.
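
As an illustration of why disabling interrupts could hide a race like this, here is a small single-threaded simulation in C. Everything in it is an assumption made for the sketch: the sim_* names are hypothetical, the "interrupt" is just a function call placed at the worst possible moment, and the non-atomic counter update is invented for the example. It is not a claim about where, or whether, such a window exists in _CORE_message_queue_Seize() or _CORE_message_queue_Submit().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simulation; none of this is RTEMS code. */
typedef struct {
    int pending_count;  /* separately maintained counter */
    int chain_len;      /* stands in for the real chain of messages */
} sim_queue;

static int        sim_irq_enabled = 1;
static int        sim_irq_latched = 0;
static sim_queue *sim_isr_target  = NULL;

/* The simulated ISR: submits one message to the queue. */
void sim_isr(void)
{
    assert(sim_isr_target != NULL);
    sim_isr_target->chain_len++;
    sim_isr_target->pending_count++;
}

/* The point at which the hardware interrupt arrives. */
void sim_interrupt_arrives(void)
{
    if (sim_irq_enabled)
        sim_isr();            /* serviced immediately */
    else
        sim_irq_latched = 1;  /* held off until interrupts are re-enabled */
}

void sim_irq_disable(void) { sim_irq_enabled = 0; }

void sim_irq_enable(void)
{
    sim_irq_enabled = 1;
    if (sim_irq_latched) {
        sim_irq_latched = 0;
        sim_isr();
    }
}

/* A seize with a deliberately non-atomic update of the counter. */
void sim_seize(sim_queue *q, int protect)
{
    if (protect)
        sim_irq_disable();

    int snapshot = q->pending_count;  /* read the counter */
    sim_interrupt_arrives();          /* worst-case interrupt timing */
    q->pending_count = snapshot - 1;  /* write back a possibly stale value */
    q->chain_len--;                   /* remove the head of the chain */

    if (protect)
        sim_irq_enable();
}

/* Returns 1 if counter and chain agree afterwards, 0 otherwise. */
int sim_run(int protect)
{
    sim_queue q = { .pending_count = 1, .chain_len = 1 };

    sim_isr_target  = &q;
    sim_irq_enabled = 1;
    sim_irq_latched = 0;
    sim_seize(&q, protect);
    return q.pending_count == q.chain_len;
}
```

Calling sim_run(0) leaves the counter at zero with one message still on the chain, which is the state described above. Calling sim_run(1) defers the simulated ISR until the update is complete, and the two stay in agreement. A real lost-update window, if there is one, would be closed by the same change, so the experiment alone cannot distinguish a fix from a timing shift.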
Looking at the queue insert/remove code, I don't see a window. I may
be missing it, or there may not be one there and I've just changed
the timing enough with my interrupt disabling that we can't make
the bug show itself the same way.
Any comments would be welcome.
Phil Torre phone: 425-820-6363 x234
Design Engineer email: ptorre at zetron.com
Switching Systems Group fax: 425-820-7031
Zetron, Inc. web: http://www.zetron.com