lockup in delta chain using signal timers

Wed Feb 17 09:39:39 UTC 2016

Hello Sebastian,

I'll just pop in here a bit.

________________________________________
>From: Sebastian Huber [sebastian.huber at embedded-brains.de]
>Sent: Wednesday, February 17, 2016 09:30
>To: Martin Werner; users at rtems.org
>Cc: David Hennerström; Jakob Viketoft
>Subject: Re: lockup in delta chain using signal timers
>
>Hello Martin,
>
>which RTEMS version do you use (Git commit)?
>

We started our port on commit 74f5eaffb920390a7a33da6f7ed4a9fa96ccc286 and have held off moving to a newer RTEMS while we got the BSP and related pieces a bit more stable. There were a number of bugs in the original OpenRISC implementation which we have dealt with as well (patches are on their way). We were also hoping to wait for 4.11 before making the move, but I'm updating now to see how this performs with the latest on the 4.11 branch.

>On 15/02/16 16:53, Martin Werner wrote:
>> We're seeing an issue in RTEMS where heavy use of signal timers causes
>> the internal RTEMS delta chain (_Watchdog_Ticks_chain) to end up with a
>> self-referencing node, which subsequently blocks insertion and locks the
>> application.
>
>If you build RTEMS with the --enable-rtems-debug option, do you get an
>assertion failure?

I'll try that as well, but I can't see an assert in watchdoginsert.c. Is that done somewhere else?

>>
>> I've cobbled together a testcase[1] based on samples/ticker, and have
>> seen the issue when running on qemu/i386.
>>
>> Is this testcase valid, or is the usage of the signal timers here
>> incorrect?
>
>I was not able to reproduce the described error (self-referencing node),
>instead the test program overloaded the interrupt service with a delta
>chain length of about 300 so that the clock tick interrupt service time
>exceeded the clock tick interval.

There is obviously a big difference between our actual hardware/BSP and qemu/i386, but we were hoping to find a more generic testcase by stressing it a bit more. I'll try again with an updated RTEMS and a matching toolchain, but is the gist of what you write that if the clock tick interrupt service time exceeds the clock tick interval, then things won't work? I might also mention that we added a breakpoint in the for-loop in Watchdog_Insert to identify when an object being added is already on the list, and that is where we have been stopping. On qemu/i386 the error also turned up when running only the software we're porting to RTEMS, but then only after a couple of days.
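To make the "already on the list" check concrete, here is a minimal sketch of a delta chain with that kind of guard. It is not the RTEMS implementation: the types timer_node and delta_chain and the function delta_chain_insert are hypothetical names for illustration, and the chain is singly linked for brevity. Each node stores the number of ticks after its predecessor, which is what makes it a delta chain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified node; not the RTEMS Watchdog_Control. */
typedef struct timer_node {
    struct timer_node *next;
    unsigned delta;              /* ticks after the previous node expires */
} timer_node;

typedef struct {
    timer_node *head;
} delta_chain;

/*
 * Insert n so that it expires `ticks` from now.
 * Returns false and leaves the chain untouched if n is already on it.
 */
static bool delta_chain_insert(delta_chain *c, timer_node *n, unsigned ticks)
{
    assert(c != NULL && n != NULL);

    /* Guard: refuse a double insert. This is the condition the
       breakpoint described above would stop on. */
    for (timer_node *it = c->head; it != NULL; it = it->next) {
        if (it == n) {
            return false;
        }
    }

    /* Walk forward, consuming the deltas of nodes that expire first. */
    timer_node **link = &c->head;
    while (*link != NULL && (*link)->delta <= ticks) {
        ticks -= (*link)->delta;
        link = &(*link)->next;
    }

    /* Splice in and keep the successor's absolute expiry unchanged. */
    n->delta = ticks;
    n->next = *link;
    if (n->next != NULL) {
        n->next->delta -= ticks;
    }
    *link = n;
    return true;
}
```

Inserting timers at 5, 8 and then 6 ticks yields the chain 5, 1, 2 (absolute expiries 5, 6, 8). A second insert of a node that is still on the chain is rejected instead of corrupting the links.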

>> If it is valid, does anyone have a suggestion as to what may be the core
>> issue, assuming that the delta chain behaviour is only a symptom?
>
>I fixed two bugs in the delta chain code recently, but they were not
>related to a self-referencing node problem. You can get a
>self-referencing node, if you do a double insert or remove from the
>chain. However, this should be detected by debug asserts.

That's what we saw: a double insert. I'll try the latest 4.11 and get back to you.
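As an illustration of how a double insert produces a self-referencing node, here is a small sketch using a circular doubly linked list with a sentinel head. It is an assumption-laden model, not the RTEMS chain code: node, chain_init, insert_after_unchecked, insert_after_checked and the off-chain convention (next == NULL) are all hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal chain node; not the RTEMS Chain_Node. */
typedef struct node {
    struct node *next;
    struct node *prev;
} node;

/* An empty chain is a sentinel that points at itself. */
static void chain_init(node *head)
{
    head->next = head;
    head->prev = head;
}

/* Convention assumed here: a node that is on no chain has next == NULL. */
static void node_set_off_chain(node *n)
{
    n->next = NULL;
    n->prev = NULL;
}

static bool node_is_off_chain(const node *n)
{
    return n->next == NULL;
}

/* Plain insert with no sanity check. */
static void insert_after_unchecked(node *after, node *n)
{
    n->prev = after;
    n->next = after->next;   /* if n is already after->next, this is n itself */
    after->next->prev = n;
    after->next = n;
}

/* Same insert, but a debug assert catches a node that is still linked. */
static void insert_after_checked(node *after, node *n)
{
    assert(node_is_off_chain(n));
    insert_after_unchecked(after, n);
}

static bool node_is_self_referencing(const node *n)
{
    return n->next == n;
}
```

Inserting the same node twice after the same predecessor with insert_after_unchecked leaves it with next and prev both pointing at itself, so any loop that walks the chain until it gets back to the sentinel never terminates. With insert_after_checked the second insert trips the assertion instead, provided the node is marked off-chain whenever it is removed.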

>> We originally saw this issue on our custom or1k hardware, where due to
>> various circumstances it seems much more easy to provoke (there with
>> only 3 threads), and from there have tracked it back to something which
>> seems to be non-or1k-specific.
>
>--
>Sebastian Huber, embedded brains GmbH
>
>Address : Dornierstr. 4, D-82178 Puchheim, Germany
>Phone   : +49 89 189 47 41-16
>Fax     : +49 89 189 47 41-09
>E-Mail  : sebastian.huber at embedded-brains.de
>PGP     : Public key available on request.
>
>This message is not a business communication within the meaning of the EHUG.

Jakob Viketoft
Senior Engineer in RTL and embedded software

ÅAC Microtec AB
Dag Hammarskjölds väg 48
SE-751 83 Uppsala, Sweden

T: +46 702 80 95 97
http://www.aacmicrotec.com