Event not timing out with a wait of 1 tick (and sometimes 2)
Joel Sherrill <joel@OARcorp.com>
joel.sherrill at OARcorp.com
Tue Dec 13 17:36:08 UTC 2005
Hi,
I was out of town last week and didn't see any email until yesterday.
Plus I had to think about this one.
I can only think of one scenario where this is possible. If you nest
clock tick ISRs, I can see this happening. Is this possibly happening?
Is it possible to add a bit of diagnostic code to detect this before the
decrement so you have a break point? Then we could see the stack trace.
In general, I believe your fix is OK and safe. It is definitely avoiding a
horrible situation. But it is also resulting in a lost tick. I just
want to understand what is happening and make sure we aren't missing
something.
--joel
Steven Johnson wrote:
> Hi,
>
> We were making some networking test code for our application, so that an
> end user could perform loop back diagnostics on our Ethernet interface.
> We were sending a packet with UDP, and then waiting to receive it.
> Waiting to receive used a socket wait, which uses an event to trigger if
> the packet was in, or it timed out if it wasn't. Our timeout ended up
> being 1 tick.
>
> Once every now and then, sometimes after a couple of hundred iterations,
> sometimes after a couple of thousand iterations our code would lock up
> in the socket wait.
>
> We tracked it to the event not occurring, and the timeout also not
> occurring.
>
> What it turned out to be is for some reason, on rare occasions
> the_watchdog->delta_interval of the head of the watchdog chain is 0. So
> on entry to watchdog tickle, it is decremented. 0 - 1 (unsigned) is a
> very big number. This meant that the timeout wasn't going to occur for a
> very long time (about 2^32 more ticks) instead of immediately. To fix it, we
> added a test to prevent the delta_interval being decremented if it was
> already zero. This fixed the problem. Also, because the delta_interval
> was so big, any events in the chain following it, would not be reached
> to timeout, as the loop to remove them would fail as soon as it hit the
> ~2^32 value near the head, effectively stalling these other events. (We
> never saw this occur, but it is our supposition from what we saw of the
> error.)
>
> The test "if (the_watchdog->delta_interval != 0)" is added to prevent
> this from occurring.
>
> We were not able to categorically identify the situation that causes
> this, but proved it to be true empirically. So this check causes
> correct behavior in this circumstance.
>
> The belief is that a race condition exists whereby an event at the head
> of the chain is removed (by a pending ISR or higher priority task)
> during the _ISR_Flash( level ); in _Watchdog_Insert, but the watchdog to
> be inserted has already had its delta_interval adjusted to 0, and so is
> added to the head of the chain with a delta_interval of 0.
>
> The attached patch is our fix. I'm sure there are other answers, but it
> works for us, and as we were not able to readily identify the exact
> location of the race condition we could not produce a known reliable fix
> to prevent the head having an interval of 0.
>
> This is in RTEMS 4.6.5, using GCC 3.2.3 (the standard tool chain
> distribution) optimization level -O3, on a MPC862 PowerPC Target.
>
> Steven Johnson
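
For readers following the thread, here is a minimal sketch of the failure mode and the guard described above. It is not the RTEMS source and not the attached patch: the `watchdog` struct, `tickle_unguarded`, `tickle_guarded`, `fire_expired`, `head_delta_after_one_tick`, and the `zero_head_seen` counter are hypothetical names made up for illustration. Only the `delta_interval` field name and the `!= 0` test come from the message. The counter is one possible place to set the break point asked for in the reply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified delta chain: each node stores the number of
 * ticks remaining after the node in front of it has fired. */
typedef struct watchdog {
    uint32_t delta_interval;
    int fired;
    struct watchdog *next;
} watchdog;

/* Diagnostic counter (an assumption, not part of the patch): counts how
 * often the head was already 0 on entry. A break point on the increment
 * would give a stack trace for the suspect case. */
static unsigned zero_head_seen;

/* Remove and fire every node at the head whose interval has reached 0. */
static void fire_expired(watchdog **head) {
    while (*head != NULL && (*head)->delta_interval == 0) {
        (*head)->fired = 1;
        *head = (*head)->next;
    }
}

/* Unconditional decrement: a head of 0 wraps to UINT32_MAX, so neither
 * the head nor anything queued behind it fires on this tick. */
static void tickle_unguarded(watchdog **head) {
    assert(head != NULL);
    if (*head == NULL) return;
    (*head)->delta_interval--;
    fire_expired(head);
}

/* Guarded decrement, following the test quoted in the message. */
static void tickle_guarded(watchdog **head) {
    assert(head != NULL);
    if (*head == NULL) return;
    if ((*head)->delta_interval != 0)
        (*head)->delta_interval--;
    else
        zero_head_seen++;          /* break point candidate */
    fire_expired(head);
}

/* Build a two-node chain whose head is already 0, apply one tick with
 * the given tickle function, and report the head's resulting state. */
static uint32_t head_delta_after_one_tick(void (*tickle)(watchdog **),
                                          int *head_fired) {
    watchdog second = { 5, 0, NULL };
    watchdog first  = { 0, 0, &second };
    watchdog *head  = &first;
    tickle(&head);
    *head_fired = first.fired;
    return first.delta_interval;
}
```

Calling `head_delta_after_one_tick(tickle_unguarded, &fired)` leaves the head at `UINT32_MAX` and unfired, which matches the stall described in the message. With `tickle_guarded` the head fires on that tick and `zero_head_seen` records that the zero-head case was hit.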
--
Joel Sherrill, Ph.D.             Director of Research & Development
joel at OARcorp.com              On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
Support Available                (256) 722-9985