Random lwIP Crashes in _POSIX_Mutex_Lock_support()

Fri Oct 23 14:16:11 UTC 2015

We finally have a culprit. Enabling the CAN driver causes the same tests to fail.

Thank you everyone for all the help! We actually have something to go on now.

Isaac
On 10/23/2015 09:37 AM, Joel Sherrill wrote:
>
>
> On 10/23/2015 8:01 AM, Isaac Gutekunst wrote:
>> On 10/23/2015 08:50 AM, Joel Sherrill wrote:
>>>
>>>
>>> On October 23, 2015 7:02:34 AM CDT, Joel Sherrill <joel.sherrill at oarcorp.com> wrote:
>>>>
>>>>
>>>> On October 22, 2015 1:37:18 PM CDT, Isaac Gutekunst
>>>> <isaac.gutekunst at vecna.com> wrote:
>>>>> I think I may have some information that's actually useful.
>>>>>
>>>>> I've managed to actually execute some tests.... and lots of them are
>>>>> failing.
>>>>>
>>>>> sp01 and sp02 fail quite quickly, as an assertion fails.
>>>>>
>>>>> assertion "first != _Chain_Tail( &ready_queues[ index ] )" failed:
>>>> file
>>>>>
>>>>> "../../cpukit/../../../stm32f7x/lib/
>>>>> include/rtems/score/schedulerpriorityimpl.h", line
>>>>> 166, function: _Scheduler_priority_Ready_queue_first
>>>>>
>>>>> This failure is common to many of the failed tests so far. What does
>>>>> this mean?
>>>>>
>>>>
>>>> Does hello run?
>>>
>> Yes
>>
>> Yes. Here are the results of running all the samples:
>>
>>
>> RTEMS Testing - Tester, v0.2.0
>> [1/9] p:0 f:0 t:0 i:0 | arm/stm32f7x: base_sp.exe
>> [2/9] p:1 f:0 t:0 i:0 | arm/stm32f7x: capture.exe
>> [3/9] p:1 f:1 t:0 i:0 | arm/stm32f7x: fileio.exe
>> [4/9] p:2 f:1 t:0 i:0 | arm/stm32f7x: hello.exe
>> [5/9] p:3 f:1 t:0 i:0 | arm/stm32f7x: minimum.exe
>> [6/9] p:4 f:1 t:0 i:0 | arm/stm32f7x: nsecs.exe
>> [7/9] p:4 f:2 t:0 i:0 | arm/stm32f7x: paranoia.exe
>> [8/9] p:5 f:2 t:0 i:0 | arm/stm32f7x: ticker.exe
>> [9/9] p:5 f:3 t:0 i:0 | arm/stm32f7x: unlimited.exe
>> Passed: 5
>> Failed: 4
>> Timeouts: 0
>> Invalid: 0
>> Total: 9
>> Average test time: 0:00:13.072150
>> Testing time : 0:01:57.649357
>>
>>
>>> Sorry to follow up so quickly.  Set your console driver to polling mode. Run hello. It
>>> should work before you move on. There should be no interrupts at all.
>>>
>>> Then turn console interrupts on and get it running again. There should only be console
>>> interrupts.
>>>
>>> Then move on to ticker. Clock tick and console if not in polled.
>>
>> Ticker does not seem to work. This is a concrete area to investigate. I have some suspicions.
>
> Once this works, all of the standard tests should work.
>
>
>>>
>>> Did all interrupt handlers get installed using RTEMS ISR installation method? Are there any
>>> which do not use this?
>> I think they all use the RTEMS method. There might be some lurking in the HAL code from ST. If
>> so, that's a bit scary and could maybe explain a lot.
>
> This would break things in an interesting way.
>
> Any method marked as an interrupt handler via the gcc attribute would be
> suspect. But theymay have their own infrastructure as well.
>
>>>
>>> For all RTEMS standard tests, there should be no drivers except those explicitly configured.
>>> All other devices may need to be tickled to be quiet but usually are not touched or referenced.
>>>
>>> If the minimum test has driver symbols in it, it shouldn't. :)
>>>
>> We'll clean that up.
>>>>> Isaac
>>>>>
>>>>> On 10/22/2015 09:16 AM, Jay Doyle wrote:
>>>>>>
>>>>>>
>>>>>> On 10/22/2015 01:40 AM, Sebastian Huber wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 21/10/15 15:48, Jay Doyle wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/21/2015 09:35 AM, Sebastian Huber wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 21/10/15 15:08, Isaac Gutekunst wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 10/21/2015 09:00 AM, Sebastian Huber wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 21/10/15 14:56, Isaac Gutekunst wrote:
>>>>>>>>>>>> On 10/21/2015 08:24 AM, Sebastian Huber wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 21/10/15 14:13, Isaac Gutekunst wrote:
>>>>>>>>>>>>>> Thanks for the reply.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 10/21/2015 01:50 AM, Sebastian Huber wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 20/10/15 16:02, Isaac Gutekunst wrote:
>>>>>>>>>>>>> [...]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> As far as I can tell this would only occur if the caller
>>>> of
>>>>>>>>>>>>>>>> pthread_mutex_lock was in a
>>>>>>>>>>>>>>>> "bad"
>>>>>>>>>>>>>>>> state. I don't believe it is in an interrupt context, and
>>>>>>>>>>>>>>>> don't know what other bad states
>>>>>>>>>>>>>>>> could exist.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We have
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> #define _CORE_mutex_Check_dispatch_for_seize(_wait) \
>>>>>>>>>>>>>>>      (!_Thread_Dispatch_is_enabled() \
>>>>>>>>>>>>>>>        && (_wait) \
>>>>>>>>>>>>>>>        && (_System_state_Get() >= SYSTEM_STATE_UP))
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> What is the thread dispatch disable level and the system
>>>>> state
>>>>>>>>>>>>>>> at this point?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In case the thread dispatch disable level is not zero, then
>>>>>>>>>>>>>>> something is probably broken
>>>>>>>>>>>>>>> in the
>>>>>>>>>>>>>>> operating system code which is difficult to find. Could be
>>>> a
>>>>>>>>>>>>>>> general memory corruption
>>>>>>>>>>>>>>> problem
>>>>>>>>>>>>>>> too. Which RTEMS version do you use?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The thread dispatch disable level is usually -1 or -2.
>>>>>>>>>>>>>> (0xFFFFFFFE or 0xFFFFFFD).
>>>>>>>>>>>>>
>>>>>>>>>>>>> A negative value is very bad, but easy to detect via manual
>>>>>>>>>>>>> instrumentation (only an hand full
>>>>>>>>>>>>> of spots touch this variable) or hardware
>>>>>>>>>>>>> breakpoints/watchpoints. Looks the rest of
>>>>>>>>>>>>> _Per_CPU_Information all right?
>>>>>>>>>>>>>
>>>>>>>>>>>> It looks like it's only the thread_dispatch_disable_level
>>>>> that's
>>>>>>>>>>>> broken.
>>>>>>>>>>>>
>>>>>>>>>>>> We'll go and grep for all places for all the places it's
>>>>> touched,
>>>>>>>>>>>> and look for something.
>>>>>>>>>>>>
>>>>>>>>>>>> The problem with watchpoints is they fire exceptionally often,
>>>>> and
>>>>>>>>>>>> putting in a conditional
>>>>>>>>>>>> watchpoint slows the code to a crawl, but that may be worth
>>>> it.
>>>>>>>>>>>>
>>>>>>>>>>>> Here are some printouts of the relevant structs right after a
>>>>> crash:
>>>>>>>>>>>>
>>>>>>>>>>>> $4 = {
>>>>>>>>>>>>     cpu_per_cpu = {<No data fields>},
>>>>>>>>>>>>     isr_nest_level = 0,
>>>>>>>>>>>>     thread_dispatch_disable_level = 4294967295,
>>>>>>>>>>>>     executing = 0xc01585c8,
>>>>>>>>>>>>     heir = 0xc0154038,
>>>>>>>>>>>>     dispatch_necessary = true,
>>>>>>>>>>>>     time_of_last_context_switch = {
>>>>>>>>>>>>       sec = 2992,
>>>>>>>>>>>>       frac = 10737447432380511034
>>>>>>>>>>>>     },
>>>>>>>>>>>>     Stats = {<No data fields>}
>>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> No, this doesn't look good. According to the stack trace you
>>>> are
>>>>> in
>>>>>>>>>>> thread context. However, we
>>>>>>>>>>> have executing != heir and dispatch_necessary == true. This is
>>>> a
>>>>>>>>>>> broken state itself. I guess,
>>>>>>>>>>> something is wrong with the interrupt level so that a context
>>>>>>>>>>> switch is blocked. On ARMv7-M
>>>>>>>>>>> this is done via the system call exception.
>>>>>>>>>>>
>>>>>>>>>> This is a bit beyond my RTEMS knowledge. What would you advise
>>>>>>>>>> looking into?
>>>>>>>>>
>>>>>>>>> I would try to instrument the code to figure out where the thread
>>>>>>>>> dispatch disable level goes negative.
>>>>>>>>>
>>>>>>>>
>>>>>>>> We just did.  I added a check in _ARMV7M_Interrupt_service_leave
>>>> to
>>>>>>>> see if the _Thread_Dispatch_disable_level is positive before the
>>>>>>>> decrementing it and this eventually fails.
>>>>>>>>
>>>>>>>> I'm not sure if this tells us much because I think the call itself
>>>>>>>> correct.  In this particular case it is processing an I2C
>>>>> interrupt.
>>>>>>>> I will try to see if we can capture information about the sequence
>>>>> of
>>>>>>>> changes to the _Thread_Dispatch_disable_level just before the
>>>> point
>>>>> in
>>>>>>>> which we know something is clearly wrong (i.e., decreasing it
>>>> below
>>>>>>>> zero.)
>>>>>>>
>>>>>>> Since the isr_nest_level is 0, I don't think its a problem with the
>>>>> spots that use
>>>>>>> _ARMV7M_Interrupt_service_leave(). Did you check the interrupt
>>>>> priorities? See also
>>>>>>>
>>>>>>> https://lists.rtems.org/pipermail/users/2015-June/029155.html
>>>>>>>
>>>>>> Thanks for the pointer to this posting.  It seems like a very
>>>> similar
>>>>> situation to what we are
>>>>>> experiencing -- especially considering that we invoke an RTEMS call
>>>>> in our ethernet isr.
>>>>>> Unfortunately, all our interrupts use the default interrupt priority
>>>>> level set in the bsp
>>>>>> header file as:
>>>>>>
>>>>>> #define BSP_ARMV7M_IRQ_PRIORITY_DEFAULT (13 << 4)
>>>>>>
>>>>>> which should be mean that they are all non-NMIs unless we explicitly
>>>>> set their interrupt level
>>>>>> lower.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> devel at rtems.org
>>>>>> http://lists.rtems.org/mailman/listinfo/devel
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> devel at rtems.org
>>>>> http://lists.rtems.org/mailman/listinfo/devel
>>>>
>>>> --joel
>>>> _______________________________________________
>>>> devel mailing list
>>>> devel at rtems.org
>>>> http://lists.rtems.org/mailman/listinfo/devel
>>>
>>> --joel
>>>
>