Event not timing out with a wait of 1 tick (and sometimes 2)
Joel Sherrill <joel@OARcorp.com>
joel.sherrill at OARcorp.com
Wed Dec 14 13:29:08 UTC 2005
Steven Johnson wrote:
> Hi,
>
> Here is our stack trace, although I don't think it is very useful, as GDB
> seems to have great difficulty stack tracing through our ISR.
>
> #0 _Watchdog_Tickle (header=0x5db1ec) at
> ../../../../../../rtems-distribution/cpukit/score/src/watchdogtickle.c:53
> #1 0x0009a7f4 in rtems_clock_tick () at
> ../../../../bare/lib/include/rtems/score/watchdog.inl:113
> #2 0x0002b3e0 in Clock_isr (vector=0) at
> Application/RTEMS/MPC860/BSP/Drivers/Clock/clock.cpp:74
> #3 0x00004594 in save_stackframe ()
> #4 0x00004594 in save_stackframe ()
> Previous frame identical to this frame (corrupt stack?)
>
> I'm not sure it is a nested clock tick ISR in our target, because the way
> we have set it up our target shouldn't allow nested ISRs, but I can't be
> sure. In "watchdog insert" there is an "ISR Flash" with a suspicious
> comment about "if this causes problems comment it out", and that is what
> we thought might be the root cause of the race condition, but we couldn't
> think of a way to be sure of this.
Since you are hacking on the source anyway, you could add a variable that
is set to note that you are at the flash and cleared when interrupts are
re-disabled. Then, when the 0 at the head of the delta chain occurs, there
is a breadcrumb. Several places could be marked this way.
When you get the fault, also check _ISR_Nest_level.
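Something along these lines is what I have in mind. This is only a sketch,
not a tested patch: the flag name is made up, and exactly where it goes
depends on your copies of watchdoginsert.c and watchdogtickle.c.

  /* watchdoginsert.c: breadcrumb around the ISR flash (sketch only) */
  volatile int _Watchdog_Insert_in_flash = 0;   /* made-up name */

  ...
  _Watchdog_Insert_in_flash = 1;    /* the window is about to open */
  _ISR_Flash( level );
  _Watchdog_Insert_in_flash = 0;    /* interrupts are disabled again */
  ...

  /* watchdogtickle.c: just before the decrement (sketch only) */
  if ( the_watchdog->delta_interval == 0 ) {
    /* Breakpoint here.  Record _ISR_Nest_level and
       _Watchdog_Insert_in_flash so we know how we got into this state. */
  }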
> In answer to the question of what's the best fix, the perfectionist in me
> says 2), but the paranoiac in me says 1), so the pragmatist in me would
> suggest both 1) and 2).
Me too. But I would really like to narrow down what is causing this in
your case. Jennifer is on a simulator where the number of
instructions between clock ticks is deliberately very low so the
application doesn't take forever when simulating.
> Rationale: the overhead of fix 1) is minimal (a compare with 0 and a
> branch-if-equal, 2 instructions on most processors I'm used to), so it
> won't really have much of a speed effect on watchdog_tickle. So, on the
> off chance that fix 2) doesn't actually fix all the possibilities that
> cause this circumstance, we are still protected from the unforeseen.
>
> Another way to fix it (2-style) that wouldn't add any overhead would be
> to change the delta time to be signed, and then pull it off if the
> decrement sends it to 0 or negative. The only problem with this I can
> see is that it halves the maximum duration of the delta.
Yes. I thought about explicitly checking for 0xffffffff but am thinking
that your fix of 0 or 1 is sufficient. I just want to know what is
causing this.
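For reference, the 0 check being discussed would sit at the top of
_Watchdog_Tickle, roughly like this. This is a sketch from memory of the
4.6 score code, not the attached patch, so treat the surrounding lines as
approximate:

  /* watchdogtickle.c (sketch): guard the head-of-chain decrement so an
     already-zero delta_interval cannot wrap around to ~2^32 */
  the_watchdog = _Watchdog_First( header );
  if ( the_watchdog->delta_interval != 0 ) {
    the_watchdog->delta_interval--;
    if ( the_watchdog->delta_interval != 0 )
      return;                   /* nothing has expired on this tick */
  }
  /* delta_interval is 0 here: fire the head and whatever else is due */

The unsigned wrap is the whole problem: with a 32-bit Watchdog_Interval,
0 - 1 is 0xffffffff, which is also why everything behind the head stalls.
Checking for 0xffffffff after the fact, or going signed as you suggest,
would work too, but the guard above is the least invasive.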
My overall suspicion is that a timer routine is firing and taking long
enough (or a set is taking long enough) that another ISR occurs and they
are nesting. With Jennifer, it was two clock ticks. In yours, we need
to narrow it down.
> I also don't think that missing the tick is very important. The
> guaranteeable precision on the timer is -1/+0 clock ticks anyway, so a
> missed clock tick has no worse latency than the watchdog being set
> immediately following a clock tick, which is the timeout requested (1
> tick).
>
> Steven
>
>
> Jennifer Averett wrote:
>
>> Hello,
>>
>> I have been talking with Joel about this, and I think I have verified his
>> theory. I added code to break at that point and was able to duplicate
>> this problem using the psim simulator and an application that has several
>> Timer Server Routines. I have attached the backtrace to the bottom of the
>> message.
>>
>> There are multiple ways to fix this problem:
>> 1) Use the patch that drops the clock tick.
>> 2) Put in special code to not drop the clock tick.
>>
>> Any input on which would be the better solution is appreciated.
>>
>> Thanks
>> Jennifer
>>
>>
>>
>> (gdb) bt
>> #0 _Watchdog_Tickle (header=0x15d09c)
>> at
>> ../../../../../../rtems-4.6.4-work/cpukit/score/src/watchdogtickle.c:61
>> #1 0x000f8fbc in rtems_clock_tick ()
>> at ../../../../psim/lib/include/rtems/score/watchdog.inl:113
>> #2 0x000f642c in clockIsr (unused=0x2042)
>> at
>> ../../../../../../../../../rtems-4.6.4-work/c/src/lib/libcpu/powerpc/mpc6xx/clock/c_clock.c:91
>>
>> #3 0x000f6ed4 in C_dispatch_irq_handler (frame=0x2042, excNum=0)
>> at
>> ../../../../../../../../../rtems-4.6.4-work/c/src/lib/libbsp/powerpc/psim/irq/irq.c:334
>>
>> #4 0x000f8118 in nested ()
>> #5 0x0010b5f0 in _Watchdog_Tickle (header=0x15d09c)
>> at
>> ../../../../../../rtems-4.6.4-work/cpukit/score/src/watchdogtickle.c:76
>> #6 0x000f8fbc in rtems_clock_tick ()
>> at ../../../../psim/lib/include/rtems/score/watchdog.inl:113
>> #7 0x000f642c in clockIsr (unused=0x2042)
>> at
>> ../../../../../../../../../rtems-4.6.4-work/c/src/lib/libcpu/powerpc/mpc6xx/clock/c_clock.c:91
>>
>> #8 0x000f6ed4 in C_dispatch_irq_handler (frame=0x2042, excNum=0)
>> at
>> ../../../../../../../../../rtems-4.6.4-work/c/src/lib/libbsp/powerpc/psim/irq/irq.c:334
>>
>> #9 0x000f8118 in nested ()
>>
>>
>>
>>
>>> Hi,
>>>
>>> I was out of town last week and didn't see any email until yesterday.
>>> Plus I had to think about this one.
>>>
>>> I can only think of one scenario where this is possible. If you nest
>>> clock tick ISRs, I can see this happening. Is this possibly happening?
>>>
>>> Is it possible to add a bit of diagnostic code to detect this before the
>>> decrement so you have a break point? Then we could see the stack trace.
>>>
>>> In general, I believe your patch is OK and safe. It is definitely avoiding a
>>> horrible situation. But it is also resulting in a lost tick. I just
>>> want to understand what is happening and make sure we aren't missing
>>> something.
>>>
>>> --joel
>>>
>>> Steven Johnson wrote:
>>>
>>>
>>>> Hi,
>>>>
>>>> We were making some networking test code for our application, so that an
>>>> end user could perform loopback diagnostics on our Ethernet interface.
>>>> We were sending a packet with UDP, and then waiting to receive it.
>>>> Waiting to receive used a socket wait, which uses an event to trigger if
>>>> the packet was in, or times out if it wasn't. Our timeout ended up being
>>>> 1 tick.
>>>>
>>>> Every now and then, sometimes after a couple of hundred iterations and
>>>> sometimes after a couple of thousand, our code would lock up in the
>>>> socket wait.
>>>>
>>>> We tracked it to the event not occurring, and the timeout also not
>>>> occurring.
>>>>
>>>> What it turned out to be is that, for some reason, on rare occasions
>>>> the_watchdog->delta_interval of the head of the watchdog chain is 0. So
>>>> on entry to watchdog tickle it is decremented, and 0 - 1 (unsigned) is a
>>>> very big number. This meant that the timeout wasn't going to occur for a
>>>> very long time (about 2^32 more ticks) instead of immediately. To fix it,
>>>> we added a test to prevent the delta_interval from being decremented if
>>>> it was already zero. This fixed the problem. Also, because the
>>>> delta_interval was so big, any events in the chain following it would not
>>>> be reached to time out, as the loop to remove them would stop as soon as
>>>> it hit the ~2^32 value near the head, effectively stalling these other
>>>> events. (We never saw this occur, but it is our supposition from what we
>>>> saw of the error.)
>>>>
>>>> The test "if (the_watchdog->delta_interval != 0)" is added to prevent
>>>> this from occurring.
>>>>
>>>> We were not able to categorically identify the situation that causes
>>>> this, but we proved it empirically. So this check gives correct
>>>> behavior in this circumstance.
>>>>
>>>> The belief is that a race condition exists whereby an event at the head
>>>> of the chain is removed (by a pending ISR or higher priority task)
>>>> during the _ISR_Flash( level ); in _Watchdog_Insert, but the
>>>> watchdog to
>>>> be inserted has already had its delta_interval adjusted to 0, and so is
>>>> added to the head of the chain with a delta_interval of 0.
>>>>
>>>> The attached patch is our fix. I'm sure there are other answers, but it
>>>> works for us, and as we were not able to readily identify the exact
>>>> location of the race condition, we could not produce a known reliable
>>>> fix to prevent the head having an interval of 0.
>>>>
>>>> This is in RTEMS 4.6.5, using GCC 3.2.3 (the standard tool chain
>>>> distribution) at optimization level -O3, on an MPC862 PowerPC target.
>>>>
>>>> Steven Johnson
>>>>
>>>
>>> --
>>> Joel Sherrill, Ph.D. Director of Research & Development
>>> joel at OARcorp.com On-Line Applications Research
>>> Ask me about RTEMS: a free RTOS Huntsville AL 35805
>>> Support Available (256) 722-9985
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
--
Joel Sherrill, Ph.D. Director of Research & Development
joel at OARcorp.com On-Line Applications Research
Ask me about RTEMS: a free RTOS Huntsville AL 35805
Support Available (256) 722-9985