Event not timing out with a wait of 1 tick (and sometimes 2)

Thu Dec 8 04:11:55 UTC 2005

Hi,

We were writing some networking test code for our application, so that
an end user could perform loopback diagnostics on our Ethernet
interface.  We were sending a packet with UDP, and then waiting to
receive it.  Waiting to receive used a socket wait, which uses an event
that triggers if the packet has arrived, or times out if it hasn't.  Our
timeout ended up being 1 tick.

Every now and then, sometimes after a couple of hundred iterations, 
sometimes after a couple of thousand iterations, our code would lock up 
in the socket wait.

We tracked it to the event not occurring, and the timeout also not 
occurring.

What it turned out to be is that, for some reason, on rare occasions 
the_watchdog->delta_interval of the head of the watchdog chain is 0.  On 
entry to watchdog tickle it is decremented, and 0 - 1 (unsigned) is a 
very big number.  This meant that the timeout wasn't going to occur for 
a very long time (about 2^32 more ticks) instead of immediately.  To fix 
it, we added a test to prevent the delta_interval being decremented if 
it was already zero.  This fixed the problem.  Also, because the 
delta_interval was so big, any events following it in the chain would 
never be reached to time out, as the loop that removes them stops as 
soon as it hits the ~2^32 value at the head, effectively stalling these 
other events.  (We never saw this occur, but it is our supposition from 
what we saw of the error.)

The test "if (the_watchdog->delta_interval != 0)" is what we added 
around the decrement to prevent this from occurring.
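
As a rough illustration of the wraparound and of the guard, here is a 
minimal sketch in C.  It is a toy model of a delta chain, not the RTEMS 
source and not the contents of the attached patch: the model_watchdog 
struct and the model_* functions are hypothetical names invented for 
this example, and only the field name delta_interval is borrowed from 
RTEMS.  It also assumes delta_interval is a 32-bit unsigned value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model of one entry on a delta chain.  This is NOT the
 * RTEMS Watchdog_Control structure; only the field name delta_interval
 * is borrowed from it. */
typedef struct model_watchdog {
    uint32_t               delta_interval; /* ticks after the previous entry */
    int                    fired;          /* set to 1 once timed out */
    struct model_watchdog *next;
} model_watchdog;

/* Fire every leading entry whose delta has reached 0 and return the new
 * head.  The loop stops at the first entry with a non-zero delta. */
static model_watchdog *model_fire_expired(model_watchdog *head)
{
    while (head != NULL && head->delta_interval == 0) {
        head->fired = 1;
        head = head->next;
    }
    return head;
}

/* Unguarded tick: always decrements the head.  If the head delta is
 * already 0, the unsigned subtraction wraps to UINT32_MAX. */
static model_watchdog *model_tickle_unguarded(model_watchdog *head)
{
    if (head == NULL)
        return NULL;
    head->delta_interval--;
    return model_fire_expired(head);
}

/* Guarded tick: skip the decrement when the head delta is already 0, so
 * the entry fires on this tick instead of wrapping. */
static model_watchdog *model_tickle_guarded(model_watchdog *head)
{
    if (head == NULL)
        return NULL;
    if (head->delta_interval != 0)
        head->delta_interval--;
    return model_fire_expired(head);
}
```

In this model, ticking a chain whose head has a delta_interval of 0 
with the unguarded version leaves the head at 0xFFFFFFFF and fires 
nothing, including the entries behind it.  The guarded version fires 
the head, and any zero-delta entries directly behind it, on the same 
tick.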

We were not able to categorically identify the situation that causes 
this, but we confirmed empirically that it does happen.  So this check 
gives correct behavior in this circumstance.

Our belief is that a race condition exists whereby an event at the head 
of the chain is removed (by a pending ISR or higher-priority task) 
during the _ISR_Flash( level ); in _Watchdog_Insert, but the watchdog to 
be inserted has already had its delta_interval adjusted to 0, and so it 
is added to the head of the chain with a delta_interval of 0.

The attached patch is our fix.  I'm sure there are other answers, but it 
works for us.  As we were not able to readily identify the exact 
location of the race condition, we could not produce a known reliable 
fix to prevent the head having an interval of 0.

This is in RTEMS 4.6.5, using GCC 3.2.3 (the standard tool chain 
distribution), optimization level -O3, on an MPC862 PowerPC target.

Steven Johnson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rtems-4.6.5-eventtimeoutbug.patch
Type: text/x-patch
Size: 2242 bytes
Desc: not available
URL: <http://lists.rtems.org/pipermail/users/attachments/20051208/09a4f6dc/attachment.bin>