Event not timing out with a wait of 1 tick (and sometimes 2)

Tue Dec 13 17:36:08 UTC 2005

Hi,

I was out of town last week and didn't see any email until yesterday.
Plus I had to think about this one.

I can only think of one scenario where this is possible.  If you nest 
clock tick ISRs, I can see this happening.  Is this possibly happening?

Could you add a bit of diagnostic code that detects a zero delta_interval
just before the decrement, so you have somewhere to set a break point?
Then we could see the stack trace.
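
Something along these lines is what I have in mind. It is only a sketch: watchdog_model, zero_delta_trap and tickle_head are made-up names standing in for the real watchdog structure and tickle routine, and only the delta_interval field is modeled. The point is just to have a function to break on before the decrement happens.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the watchdog control block; only the field
 * discussed in this thread is modeled. */
typedef struct {
  uint32_t delta_interval;
} watchdog_model;

/* Hypothetical counter, so the number of hits is visible afterwards. */
unsigned long zero_delta_hits = 0;

/* Set the break point here to capture the stack trace. */
void zero_delta_trap(const watchdog_model *head)
{
  (void) head;
  zero_delta_hits++;
}

/* Model of the head-of-chain decrement with the check placed in front.
 * The check is diagnostic only; the decrement itself is unchanged. */
void tickle_head(watchdog_model *head)
{
  assert(head != NULL);
  if (head->delta_interval == 0)
    zero_delta_trap(head);
  head->delta_interval--;
}
```

With a break point on zero_delta_trap, the debugger stops while the bad head is still intact, which should show whether a nested clock tick ISR is on the stack.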

In general, I believe your fix is OK and safe.  It is definitely avoiding a 
horrible situation.  But it is also resulting in a lost tick.  I just 
want to understand what is happening and make sure we aren't missing
something.
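
To make the lost tick concrete, here is a rough model of a guarded tick. It assumes the chain can be represented as a plain array of delta_interval values, head first; guarded_tick and that array layout are hypothetical and are not the RTEMS code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: chain[i] is the delta_interval of the i-th
 * watchdog, head first.  Fired entries are dropped from the front of
 * the array.  Returns how many entries fired on this tick. */
size_t guarded_tick(uint32_t *chain, size_t *count)
{
  size_t fired = 0;

  assert(chain != NULL && count != NULL);
  if (*count == 0)
    return 0;

  /* The guard: never decrement a delta that is already 0. */
  if (chain[0] != 0)
    chain[0]--;

  /* Fire every leading entry whose delta has reached 0. */
  while (fired < *count && chain[fired] == 0)
    fired++;

  /* Drop the fired entries from the front of the array. */
  for (size_t i = fired; i < *count; i++)
    chain[i - fired] = chain[i];
  *count -= fired;

  return fired;
}
```

Starting from the deltas {0, 3}, one tick fires the head, but the tick is not charged to anything, so the follower still has 3 ticks to go and expires one tick later than it should. That is one way to read the lost tick.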

--joel

Steven Johnson wrote:
> Hi,
> 
> We were making some networking test code for our application, so that an 
> end user could perform loop back diagnostics on our Ethernet interface.  
> We were sending a packet with UDP, and then waiting to receive it.  
> Waiting to receive used a socket wait, which uses an event to trigger if 
> the packet was in, or it timed out if it wasn't.  Our timeout ended up 
> being 1 tick.
> 
> Once every now and then, sometimes after a couple of hundred iterations, 
> sometimes after a couple of thousand iterations our code would lock up 
> in the socket wait.
> 
> We tracked it to the event not occurring, and the timeout also not 
> occurring.
> 
> It turned out that, for some reason, on rare occasions 
> the_watchdog->delta_interval of the head of the watchdog chain is 0.  So 
> on entry to watchdog tickle, it is decremented, and 0 - 1 (unsigned) is 
> a very big number.  This meant that the timeout wasn't going to occur 
> for a very long time (about 2^32 more ticks) instead of immediately.  To 
> fix it, we added a test to prevent the delta_interval being decremented 
> if it was already zero.  This fixed the problem.  Also, because the 
> delta_interval was so big, any events following it in the chain would 
> never be reached to time out, as the loop that removes them would stop 
> as soon as it hit the ~2^32 value near the head, effectively stalling 
> these other events.  (We never saw this occur, but it is our 
> supposition from what we saw of the error.)
> 
> The test "if (the_watchdog->delta_interval != 0)" is added to prevent 
> this from occurring.
> 
> We were not able to categorically identify the situation that causes 
> this, but proved it to be true empirically.  So this check causes 
> correct behavior in this circumstance.
> 
> The belief is that a race condition exists whereby an event at the head 
> of the chain is removed (by a pending ISR or higher priority task) 
> during the _ISR_Flash( level ); in _Watchdog_Insert, but the watchdog to 
> be inserted has already had its delta_interval adjusted to 0, and so is 
> added to the head of the chain with a delta_interval of 0.
> 
> The attached patch is our fix.  I'm sure there are other answers, but it 
> works for us, and as we were not able to readily identify the exact 
> location of the race condition we could not produce a known reliable fix 
> to prevent the head having an interval of 0.
> 
> This is in RTEMS 4.6.5, using GCC 3.2.3 (the standard tool chain 
> distribution) optimization level -O3, on an MPC862 PowerPC Target.
> 
> Steven Johnson
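
For reference, the wrap-around and the stall described in the quoted message can be reproduced with a small model. It assumes the same kind of simplified representation, an array of delta_interval values with the head first; unguarded_tick is a hypothetical name and this is not the actual tickle routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a tick with no guard on the decrement.
 * chain[i] is the delta_interval of the i-th watchdog, head first.
 * Returns how many leading entries would fire on this tick. */
size_t unguarded_tick(uint32_t *chain, size_t count)
{
  size_t fired = 0;

  assert(chain != NULL);
  if (count == 0)
    return 0;

  /* Unsigned arithmetic: 0 - 1 wraps around to UINT32_MAX. */
  chain[0]--;

  /* Entries fire only while the leading deltas are 0, so a wrapped
   * head also blocks everything queued behind it. */
  while (fired < count && chain[fired] == 0)
    fired++;

  return fired;
}
```

With the deltas {0, 0, 5}, a single tick leaves the head at 4294967295 (2^32 - 1) and nothing fires, including the second entry, whose own delta is 0.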

-- 
Joel Sherrill, Ph.D.             Director of Research & Development
joel at OARcorp.com                 On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
    Support Available             (256) 722-9985