RTEMS message queues and interrupt safety

Wed Apr 19 01:49:18 UTC 2006

Hi Phil,

You are correct that PR904 only affects optimised code, but it would 
be worth checking that when you do a debug build, you are actually 
passing -O0 to gcc when building RTEMS.

Would it be possible to provide more information on the queue in 
question, such as:

1. The call parameters you used to set up your queue?
2. The call parameters for sending a queue message from your ISR?
3. The call parameters for sending a queue message from your other tasks?
4. The call parameters for receiving a queue message?
5. Is there ever more than one task receiving from the queue?

All of these affect the path the program takes through the code, so 
knowing them would help in trying to narrow down the issue.

From just a quick look through the msg code, every place where 
number_of_pending_messages is changed appears to be interrupt protected.
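
As a rough host-side model of the symptom you describe (count is zero 
but the chain is not empty), something like the following might be 
useful for reasoning about it.  This is only a sketch: model_msg, 
model_queue and the model_* functions are made-up stand-ins, not the 
real RTEMS structures or score routines, and no interrupt locking is 
modelled.  On the target, a check like model_consistent() would have to 
run with interrupts disabled to avoid false triggers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins, NOT the real RTEMS structures. */
typedef struct model_msg {
    struct model_msg *next;
    int               payload;
} model_msg;

typedef struct {
    model_msg *head;
    model_msg *tail;
    uint32_t   number_of_pending_messages;
} model_queue;

/* Append to the end of the chain and bump the count. */
static void model_submit(model_queue *q, model_msg *m)
{
    assert(q != NULL && m != NULL);
    m->next = NULL;
    if (q->tail != NULL)
        q->tail->next = m;
    else
        q->head = m;
    q->tail = m;
    q->number_of_pending_messages++;
}

/* Take from the front of the chain, but only if the count is non-zero. */
static model_msg *model_seize(model_queue *q)
{
    model_msg *m;

    if (q->number_of_pending_messages == 0 || q->head == NULL)
        return NULL;
    q->number_of_pending_messages--;
    m = q->head;
    q->head = m->next;
    if (q->head == NULL)
        q->tail = NULL;
    return m;
}

/* Returns 1 when the count matches the chain length, 0 otherwise. */
static int model_consistent(const model_queue *q)
{
    uint32_t         n = 0;
    const model_msg *m;

    for (m = q->head; m != NULL; m = m->next)
        n++;
    return n == q->number_of_pending_messages;
}
```

If the count is forced to zero while one message is still chained, the 
next submit raises the count to 1 and the next seize hands back the old 
message rather than the new one, which matches what you are seeing.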

Is it possible that your queue is filling to full capacity under load 
and if so, do any of your queue sends wait if they can't put the message 
in straight away?

Also, the code in coremsgseize is pretty much completely covered by an 
interrupt disable already.  Could you disable interrupts just around 
_CORE_message_queue_Submit() to see if this alone also "fixes" the issue?

If it does, then move the interrupt disable out of the submit and into 
the tasks that call the send, to see whether it is an interaction issue 
between the ISR and the tasks sending messages.
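
For the task-side experiment, a wrapper along these lines is roughly 
what I have in mind.  It is a diagnostic sketch only, not a proposed 
fix.  locked_send is a made-up name, and the #else branch contains 
host-only stand-ins (the stub_* variables and simplified rtems_* 
definitions) so the fragment compiles off target; on the target the 
real <rtems.h> definitions are used instead, and the exact parameter 
types there depend on your RTEMS version.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef __rtems__
#include <rtems.h>
#else
/* Host-only stand-ins so the sketch compiles off target. */
typedef uint32_t rtems_id;
typedef uint32_t rtems_status_code;
typedef uint32_t rtems_interrupt_level;
#define RTEMS_SUCCESSFUL 0

static int stub_irq_disabled;        /* 1 while "interrupts" are off */
static int stub_sends_with_irq_off;  /* sends seen while they were off */

/* Save the previous state in level, then mark interrupts as off. */
#define rtems_interrupt_disable(level)                          \
    do {                                                        \
        (level) = (rtems_interrupt_level) stub_irq_disabled;    \
        stub_irq_disabled = 1;                                  \
    } while (0)

/* Restore whatever state was saved in level. */
#define rtems_interrupt_enable(level)                           \
    do {                                                        \
        stub_irq_disabled = (int) (level);                      \
    } while (0)

static rtems_status_code rtems_message_queue_send(
    rtems_id id, void *buffer, uint32_t size)
{
    (void) id;
    (void) buffer;
    (void) size;
    if (stub_irq_disabled)
        stub_sends_with_irq_off++;
    return RTEMS_SUCCESSFUL;
}
#endif

/* Hypothetical task-side wrapper: send with interrupts masked. */
static rtems_status_code locked_send(rtems_id queue, void *msg, uint32_t size)
{
    rtems_interrupt_level level;
    rtems_status_code     sc;

    rtems_interrupt_disable(level);
    sc = rtems_message_queue_send(queue, msg, size);
    rtems_interrupt_enable(level);
    return sc;
}
```

The tasks would call locked_send() in place of their current send while 
the ISR keeps sending as it does now.  If the problem stays away with 
only this in place, that points at the ISR interrupting a task-level 
send.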

regards,

Ian Caddy

Phil Torre wrote:
> Greetings All,
> 
> We've got a weird bug that may interest someone.  I'd be interested
> in hearing from anyone who has seen similar.
> 
> The setup:  rtems-4.6.0 with various patches merged in from CVS,
> running on MPC855T, with an unsubmitted BSP.
> 
> The PowerPC host processor receives interrupts from an external DSP
> at a fairly high rate.  The ISR that services that interrupt sends
> messages to a queue that is read by a "classic API" task.  Several
> other foreground tasks also send messages to that queue.  When 
> running at high load (lots of interrupts firing) for extended periods
> of time, we sometimes see messages that have already been read from
> the queue "reappear", as much as tens of milliseconds later.  This
> seems to be happening because the number_of_pending_messages member 
> of the CORE_message_queue_Control struct is zero, but the chain of
> pending messages is non-empty.  When a new message is submitted, it
> goes to the end of the chain, and number_of_pending_messages becomes
> 1.  The next time the queue is read, the count is decremented back
> to zero, but the wrong message is returned.
> 
> I don't know exactly how the "count is zero but list is not empty"
> condition comes about.  I put in a bunch of instrumentation to try
> and catch it in the act of happening.  But, interrupts firing in
> the middle of my debug code was causing my debug code to false trigger.
> So, I resorted to turning interrupts off for the entire duration
> of both _CORE_message_queue_Submit() and _CORE_message_queue_Seize().
> Now my debug code doesn't false-trigger, but the actual bug doesn't
> happen any more.  We got pretty good at reproducing it, but with
> interrupts disabled in those two functions, we can't make the bug
> manifest any more.  I don't know if I have actually fixed something,
> or just forced the bug into hiding, biding its time.
> 
> Looking at the queue insert/remove code, I don't see a window.  I may
> be missing it, or there may not be one there and I've just changed 
> the timing enough with my interrupt disabling that we can't make
> the bug show itself the same way.
> 
> Any comments would be welcome.
> 
> Thanks,
> -Phil
>  

-- 
Ian Caddy
Goanna Technologies Pty Ltd
+61 8 9221 1860