Event not timing out with a wait of 1 tick (and sometimes 2)

Wed Dec 14 04:43:52 UTC 2005

Hi ,

Here is our stack trace, although I dont think it is very useful as GDB 
seems to have great difficulty stack tracing through our ISR.

#0  _Watchdog_Tickle (header=0x5db1ec) at 
../../../../../../rtems-distribution/cpukit/score/src/watchdogtickle.c:53
#1  0x0009a7f4 in rtems_clock_tick () at 
../../../../bare/lib/include/rtems/score/watchdog.inl:113
#2  0x0002b3e0 in Clock_isr (vector=0) at 
Application/RTEMS/MPC860/BSP/Drivers/Clock/clock.cpp:74
#3  0x00004594 in save_stackframe ()
#4  0x00004594 in save_stackframe ()
Previous frame identical to this frame (corrupt stack?)

Im not sure it is a nested clock tick ISR in our target, because the way 
we have set it up our target shouldnt allow nesting ISR's, but i cant be 
sure.  In "watchdog insert" there is an "ISR Flash" with a suspicious 
comment about "if this causes problems comment it out"  and that is what 
we thought might be the root cause of the race condition, but we couldnt 
think of a way to be sure of this.

In answer to the question on whats the best fix, the perfectionist in me 
says 2), but the paranoiac in me says 1) so the pragmatist in me would 
suggest both 1) and 2).

Rationale, the overhead of fix 1) is minimal (a compare with 0 and 
branch if eq, 2 instructions on most processors im used to) so it wont 
really have much of a speed effect on watchdog_tickle, so on the off 
chance that fix 2) doesnt actually fix all possibilities that cause this 
circumstance, we are still protected from the unforeseen.

Another way to fix it (2 style) that wouldnt add any overhead, would be 
to change the delta time to be signed, and then pull it off if the 
decrement sends it 0 or -ve.  The only problem with this i can see is it 
halves the maximum duration of the delta.

I also dont think that missing the tick is very important.  The 
guaranteable precision on the timer is -1/+0 Clock Ticks anyway, so a 
missed clock tick has no worse latency than the watchdog being set 
immediately following a clock tick, which is the timeout requested (1 tick).

Steven

Jennifer Averett wrote:

>Hello,
>
>I have been talking with Joel about this and I think I have verified his
>theory.  I added code to break at and was able to duplicate this problem
>using the psim simulator and an application that has several Timer Server
>Routines.  I have attached the backtrace to the bottom of the message.
>
>There are multiple ways to fix this problem.
>  1) Use the patch that drops the clock tick
>  2) Put in special code to not drop the clock tick.
>
>Any input to which would be the better solution is appreciated.
>
>Thanks
>Jennifer
>
>
>
>(gdb) bt
>#0  _Watchdog_Tickle (header=0x15d09c)
>    at
>../../../../../../rtems-4.6.4-work/cpukit/score/src/watchdogtickle.c:61
>#1  0x000f8fbc in rtems_clock_tick ()
>    at ../../../../psim/lib/include/rtems/score/watchdog.inl:113
>#2  0x000f642c in clockIsr (unused=0x2042)
>    at
>../../../../../../../../../rtems-4.6.4-work/c/src/lib/libcpu/powerpc/mpc6xx/clock/c_clock.c:91
>#3  0x000f6ed4 in C_dispatch_irq_handler (frame=0x2042, excNum=0)
>    at
>../../../../../../../../../rtems-4.6.4-work/c/src/lib/libbsp/powerpc/psim/irq/irq.c:334
>#4  0x000f8118 in nested ()
>#5  0x0010b5f0 in _Watchdog_Tickle (header=0x15d09c)
>    at
>../../../../../../rtems-4.6.4-work/cpukit/score/src/watchdogtickle.c:76
>#6  0x000f8fbc in rtems_clock_tick ()
>    at ../../../../psim/lib/include/rtems/score/watchdog.inl:113
>#7  0x000f642c in clockIsr (unused=0x2042)
>    at
>../../../../../../../../../rtems-4.6.4-work/c/src/lib/libcpu/powerpc/mpc6xx/clock/c_clock.c:91
>#8  0x000f6ed4 in C_dispatch_irq_handler (frame=0x2042, excNum=0)
>    at
>../../../../../../../../../rtems-4.6.4-work/c/src/lib/libbsp/powerpc/psim/irq/irq.c:334
>#9  0x000f8118 in nested ()
>
>
>  
>
>>Hi,
>>
>>I was out of town last week and didn't see any email until yesterday.
>>Plus I had to think about this one.
>>
>>I can only think of one scenario where this is possible.  If you nest
>>clock tick ISRs, I can see this happening.  Is this possibly happening?
>>
>>Is it possible to add a bit of diagnostic code to detect this before the
>>decrement so you have a break point?  Then we could see the stack trace.
>>
>>In general, I believe your is OK and safe.  It is definitely avoiding a
>>horrible situation.  But it is also resulting in a lost tick.  I just
>>want to understand what is happening and make sure we aren't missing
>>something.
>>
>>--joel
>>
>>Steven Johnson wrote:
>>    
>>
>>>Hi,
>>>
>>>We were making some networking test code for our application, so that an
>>>end user could perform loop back diagnostics on our Ethernet interface.
>>>We were sending a packet with UDP, and then waiting to receive it.
>>>Waiting to receive used a socket wait, which uses an event to trigger if
>>>the packet was in, or it timed out if it wasnt.  Our timeout ended up
>>>being 1 tick.
>>>
>>>Once every now and then, sometimes after a couple of hundred iterations,
>>>sometimes after a couple of thousand iterations our code would lock up
>>>in the socket wait.
>>>
>>>We tracked it to the event not occurring, and the timeout also not
>>>occurring.
>>>
>>>What it turned out to be is for some reason, on rare occasions
>>>the_watchdog->delta_interval of the head of the watchdog chain is 0.  So
>>>on entry to watchdog tickle, it is decremented. 0 - 1 (unsigned) is a
>>>very big number.  This meant that the timeout wasnt going to occur for a
>>>very long time 2^32 more ticks, instead of immediately.  To fix it, we
>>>added a test to prevent the delta_interval being decremented if it was
>>>already zero.  This fixed the problem.  Also, because the delta_interval
>>>was so big, any events in the chain following it, would not be reached
>>>to timeout, as the loop to remove them would fail as soon as it hit the
>>>~2^32 value near the head, effectively stalling these other events.  (We
>>>never saw this occur, but it is our supposition from what we saw of the
>>>error.)
>>>
>>>The test "if (the_watchdog->delta_interval != 0)" is added to prevent
>>>this from occurring.
>>>
>>>We were not able to categorically identify the situation that causes
>>>this, but proved it to be true empirically.  So this check causes
>>>correct behavior in this circumstance.
>>>
>>>The belief is that a race condition exists whereby an event at the head
>>>of the chain is removed (by a pending ISR or higher priority task)
>>>during the _ISR_Flash( level ); in _Watchdog_Insert, but the watchdog to
>>>be inserted has already had its delta_interval adjusted to 0, and so is
>>>added to the head of the chain with a delta_interval of 0.
>>>
>>>The attached patch is our fix, im sure there are other answers, but it
>>>works for us, and as we were not able to readily identify the exact
>>>location of the race condition we could not produce a known reliable fix
>>>to prevent the head having an interval of 0.
>>>
>>>This is in Rtems 4.6.5, using GCC 3.2.3 (the standard tool chain
>>>distribution) optimization level -O3, on a MPC862 PowerPC Target.
>>>
>>>Steven Johnson
>>>      
>>>
>>--
>>Joel Sherrill, Ph.D.             Director of Research & Development
>>joel at OARcorp.com                 On-Line Applications Research
>>Ask me about RTEMS: a free RTOS  Huntsville AL 35805
>>    Support Available             (256) 722-9985
>>
>>
>>    
>>
>
>
>  
>

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: WatchdogTickle.bt
URL: <http://lists.rtems.org/pipermail/users/attachments/20051214/8265352a/attachment-0001.ksh>