Event not timing out with a wait of 1 tick (and sometimes 2)

Wed Dec 14 13:29:08 UTC 2005

Steven Johnson wrote:
> Hi ,
> 
> Here is our stack trace, although I dont think it is very useful as GDB 
> seems to have great difficulty stack tracing through our ISR.
> 
> #0  _Watchdog_Tickle (header=0x5db1ec) at 
> ../../../../../../rtems-distribution/cpukit/score/src/watchdogtickle.c:53
> #1  0x0009a7f4 in rtems_clock_tick () at 
> ../../../../bare/lib/include/rtems/score/watchdog.inl:113
> #2  0x0002b3e0 in Clock_isr (vector=0) at 
> Application/RTEMS/MPC860/BSP/Drivers/Clock/clock.cpp:74
> #3  0x00004594 in save_stackframe ()
> #4  0x00004594 in save_stackframe ()
> Previous frame identical to this frame (corrupt stack?)
> 
> Im not sure it is a nested clock tick ISR in our target, because the way 
> we have set it up our target shouldnt allow nesting ISR's, but i cant be 
> sure.  In "watchdog insert" there is an "ISR Flash" with a suspicious 
> comment about "if this causes problems comment it out"  and that is what 
> we thought might be the root cause of the race condition, but we couldnt 
> think of a way to be sure of this.

Since you are hacking on the source anyway, you could add a variable 
which is set to note that you are at the flash and clear it when 
interrupts are redisabled.  Then when the 0 at the head of the delta 
chain occurs there is a bread crumb.  Several places could be marked 
this way.

When you get the fault, check _ISR_Nest_level and check it.

> In answer to the question on whats the best fix, the perfectionist in me 
> says 2), but the paranoiac in me says 1) so the pragmatist in me would 
> suggest both 1) and 2).

Me too.  But I would really like to narrow down on what is causing this 
in your case.  Jennifer is on a simulator where the number of 
instructions between clock ticks is deliberately very low so the 
application doesn't take forever when simulating.

> Rationale, the overhead of fix 1) is minimal (a compare with 0 and 
> branch if eq, 2 instructions on most processors im used to) so it wont 
> really have much of a speed effect on watchdog_tickle, so on the off 
> chance that fix 2) doesnt actually fix all possibilities that cause this 
> circumstance, we are still protected from the unforeseen.
> 
> Another way to fix it (2 style) that wouldnt add any overhead, would be 
> to change the delta time to be signed, and then pull it off if the 
> decrement sends it 0 or -ve.  The only problem with this i can see is it 
> halves the maximum duration of the delta.

Yes.  I thought about explicitly checking for 0xffffffff but am thinking 
that your fix of 0 or 1 is sufficient.  I just want to know what is 
causing this.

My overall suspicion is that a timer routine is firing and taking long 
enough (or a set is taking long enough) that another ISR occurs and they
are nesting.  With Jennifer, it was two clock ticks.  In yours, we need 
to narrow it down.

> I also dont think that missing the tick is very important.  The 
> guaranteable precision on the timer is -1/+0 Clock Ticks anyway, so a 
> missed clock tick has no worse latency than the watchdog being set 
> immediately following a clock tick, which is the timeout requested (1 
> tick).
> 
> Steven
> 
> 
> Jennifer Averett wrote:
> 
>> Hello,
>>
>> I have been talking with Joel about this and I think I have verified his
>> theory.  I added code to break at and was able to duplicate this problem
>> using the psim simulator and an application that has several Timer Server
>> Routines.  I have attached the backtrace to the bottom of the message.
>>
>> There are multiple ways to fix this problem.
>>  1) Use the patch that drops the clock tick
>>  2) Put in special code to not drop the clock tick.
>>
>> Any input to which would be the better solution is appreciated.
>>
>> Thanks
>> Jennifer
>>
>>
>>
>> (gdb) bt
>> #0  _Watchdog_Tickle (header=0x15d09c)
>>    at
>> ../../../../../../rtems-4.6.4-work/cpukit/score/src/watchdogtickle.c:61
>> #1  0x000f8fbc in rtems_clock_tick ()
>>    at ../../../../psim/lib/include/rtems/score/watchdog.inl:113
>> #2  0x000f642c in clockIsr (unused=0x2042)
>>    at
>> ../../../../../../../../../rtems-4.6.4-work/c/src/lib/libcpu/powerpc/mpc6xx/clock/c_clock.c:91 
>>
>> #3  0x000f6ed4 in C_dispatch_irq_handler (frame=0x2042, excNum=0)
>>    at
>> ../../../../../../../../../rtems-4.6.4-work/c/src/lib/libbsp/powerpc/psim/irq/irq.c:334 
>>
>> #4  0x000f8118 in nested ()
>> #5  0x0010b5f0 in _Watchdog_Tickle (header=0x15d09c)
>>    at
>> ../../../../../../rtems-4.6.4-work/cpukit/score/src/watchdogtickle.c:76
>> #6  0x000f8fbc in rtems_clock_tick ()
>>    at ../../../../psim/lib/include/rtems/score/watchdog.inl:113
>> #7  0x000f642c in clockIsr (unused=0x2042)
>>    at
>> ../../../../../../../../../rtems-4.6.4-work/c/src/lib/libcpu/powerpc/mpc6xx/clock/c_clock.c:91 
>>
>> #8  0x000f6ed4 in C_dispatch_irq_handler (frame=0x2042, excNum=0)
>>    at
>> ../../../../../../../../../rtems-4.6.4-work/c/src/lib/libbsp/powerpc/psim/irq/irq.c:334 
>>
>> #9  0x000f8118 in nested ()
>>
>>
>>  
>>
>>> Hi,
>>>
>>> I was out of town last week and didn't see any email until yesterday.
>>> Plus I had to think about this one.
>>>
>>> I can only think of one scenario where this is possible.  If you nest
>>> clock tick ISRs, I can see this happening.  Is this possibly happening?
>>>
>>> Is it possible to add a bit of diagnostic code to detect this before the
>>> decrement so you have a break point?  Then we could see the stack trace.
>>>
>>> In general, I believe your is OK and safe.  It is definitely avoiding a
>>> horrible situation.  But it is also resulting in a lost tick.  I just
>>> want to understand what is happening and make sure we aren't missing
>>> something.
>>>
>>> --joel
>>>
>>> Steven Johnson wrote:
>>>   
>>>
>>>> Hi,
>>>>
>>>> We were making some networking test code for our application, so 
>>>> that an
>>>> end user could perform loop back diagnostics on our Ethernet interface.
>>>> We were sending a packet with UDP, and then waiting to receive it.
>>>> Waiting to receive used a socket wait, which uses an event to 
>>>> trigger if
>>>> the packet was in, or it timed out if it wasnt.  Our timeout ended up
>>>> being 1 tick.
>>>>
>>>> Once every now and then, sometimes after a couple of hundred 
>>>> iterations,
>>>> sometimes after a couple of thousand iterations our code would lock up
>>>> in the socket wait.
>>>>
>>>> We tracked it to the event not occurring, and the timeout also not
>>>> occurring.
>>>>
>>>> What it turned out to be is for some reason, on rare occasions
>>>> the_watchdog->delta_interval of the head of the watchdog chain is 
>>>> 0.  So
>>>> on entry to watchdog tickle, it is decremented. 0 - 1 (unsigned) is a
>>>> very big number.  This meant that the timeout wasnt going to occur 
>>>> for a
>>>> very long time 2^32 more ticks, instead of immediately.  To fix it, we
>>>> added a test to prevent the delta_interval being decremented if it was
>>>> already zero.  This fixed the problem.  Also, because the 
>>>> delta_interval
>>>> was so big, any events in the chain following it, would not be reached
>>>> to timeout, as the loop to remove them would fail as soon as it hit the
>>>> ~2^32 value near the head, effectively stalling these other events.  
>>>> (We
>>>> never saw this occur, but it is our supposition from what we saw of the
>>>> error.)
>>>>
>>>> The test "if (the_watchdog->delta_interval != 0)" is added to prevent
>>>> this from occurring.
>>>>
>>>> We were not able to categorically identify the situation that causes
>>>> this, but proved it to be true empirically.  So this check causes
>>>> correct behavior in this circumstance.
>>>>
>>>> The belief is that a race condition exists whereby an event at the head
>>>> of the chain is removed (by a pending ISR or higher priority task)
>>>> during the _ISR_Flash( level ); in _Watchdog_Insert, but the 
>>>> watchdog to
>>>> be inserted has already had its delta_interval adjusted to 0, and so is
>>>> added to the head of the chain with a delta_interval of 0.
>>>>
>>>> The attached patch is our fix, im sure there are other answers, but it
>>>> works for us, and as we were not able to readily identify the exact
>>>> location of the race condition we could not produce a known reliable 
>>>> fix
>>>> to prevent the head having an interval of 0.
>>>>
>>>> This is in Rtems 4.6.5, using GCC 3.2.3 (the standard tool chain
>>>> distribution) optimization level -O3, on a MPC862 PowerPC Target.
>>>>
>>>> Steven Johnson
>>>>     
>>>
>>> -- 
>>> Joel Sherrill, Ph.D.             Director of Research & Development
>>> joel at OARcorp.com                 On-Line Applications Research
>>> Ask me about RTEMS: a free RTOS  Huntsville AL 35805
>>>    Support Available             (256) 722-9985
>>>
>>>
>>>   
>>
>>
>>
>>  
>>
> 

-- 
Joel Sherrill, Ph.D.             Director of Research & Development
joel at OARcorp.com                 On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
    Support Available             (256) 722-9985