Random lwIP Crashes in _POSIX_Mutex_Lock_support()

Jay Doyle jay.doyle at vecna.com
Wed Oct 21 14:21:43 UTC 2015




-------- Forwarded Message --------
Subject: 	Re: Random lwIP Crashes in _POSIX_Mutex_Lock_support()
Date: 	Wed, 21 Oct 2015 09:48:42 -0400
From: 	Jay Doyle <jay.doyle at vecna.com>
To: 	Sebastian Huber <sebastian.huber at embedded-brains.de>



On 10/21/2015 09:35 AM, Sebastian Huber wrote:
>
>
> On 21/10/15 15:08, Isaac Gutekunst wrote:
>>
>>
>> On 10/21/2015 09:00 AM, Sebastian Huber wrote:
>>>
>>>
>>> On 21/10/15 14:56, Isaac Gutekunst wrote:
>>>> On 10/21/2015 08:24 AM, Sebastian Huber wrote:
>>>>>
>>>>>
>>>>> On 21/10/15 14:13, Isaac Gutekunst wrote:
>>>>>> Thanks for the reply.
>>>>>>
>>>>>> On 10/21/2015 01:50 AM, Sebastian Huber wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 20/10/15 16:02, Isaac Gutekunst wrote:
>>>>> [...]
>>>>>>>
>>>>>>>>
>>>>>>>> As far as I can tell this would only occur if the caller of
>>>>>>>> pthread_mutex_lock was in a "bad" state. I don't believe it is
>>>>>>>> in an interrupt context, and don't know what other bad states
>>>>>>>> could exist.
>>>>>>>
>>>>>>> We have
>>>>>>>
>>>>>>> #define _CORE_mutex_Check_dispatch_for_seize(_wait) \
>>>>>>>    (!_Thread_Dispatch_is_enabled() \
>>>>>>>      && (_wait) \
>>>>>>>      && (_System_state_Get() >= SYSTEM_STATE_UP))
>>>>>>>
>>>>>>> What is the thread dispatch disable level and the system state
>>>>>>> at this point?
>>>>>>>
>>>>>>> In case the thread dispatch disable level is not zero, then
>>>>>>> something is probably broken in the operating system code, which
>>>>>>> is difficult to find. Could be a general memory corruption
>>>>>>> problem too. Which RTEMS version do you use?
>>>>>>>
>>>>>>
>>>>>> The thread dispatch disable level is usually -1 or -2.
>>>>>> (0xFFFFFFFF or 0xFFFFFFFE).
>>>>>
>>>>> A negative value is very bad, but easy to detect via manual
>>>>> instrumentation (only a handful of spots touch this variable) or
>>>>> hardware breakpoints/watchpoints. Does the rest of
>>>>> _Per_CPU_Information look all right?
>>>>>
>>>> It looks like it's only the thread_dispatch_disable_level that's
>>>> broken.
>>>>
>>>> We'll grep for all the places it's touched and look for something.
>>>>
>>>> The problem with watchpoints is that they fire exceptionally often,
>>>> and putting in a conditional watchpoint slows the code to a crawl,
>>>> but that may be worth it.
>>>>
>>>> Here are some printouts of the relevant structs right after a crash:
>>>>
>>>> $4 = {
>>>>   cpu_per_cpu = {<No data fields>},
>>>>   isr_nest_level = 0,
>>>>   thread_dispatch_disable_level = 4294967295,
>>>>   executing = 0xc01585c8,
>>>>   heir = 0xc0154038,
>>>>   dispatch_necessary = true,
>>>>   time_of_last_context_switch = {
>>>>     sec = 2992,
>>>>     frac = 10737447432380511034
>>>>   },
>>>>   Stats = {<No data fields>}
>>>> }
>>>
>>> No, this doesn't look good. According to the stack trace you are in
>>> thread context. However, we
>>> have executing != heir and dispatch_necessary == true. This is a
>>> broken state itself. I guess,
>>> something is wrong with the interrupt level so that a context switch
>>> is blocked. On ARMv7-M
>>> this is done via the system call exception.
>>>
>> This is a bit beyond my RTEMS knowledge. What would you advise
>> looking into?
>
> I would try to instrument the code to figure out where the thread
> dispatch disable level goes negative.
>

We just did.  I added a check in _ARMV7M_Interrupt_service_leave to see
whether _Thread_Dispatch_disable_level is positive before decrementing
it, and this check eventually fails.
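
For reference, the check is roughly of the following shape.  This is
only a sketch, not the actual patch: the helper name is made up, and the
way the disable level is reached here (via _Per_CPU_Get() and the
thread_dispatch_disable_level field shown in the dump above) may differ
between RTEMS versions.

#include <rtems/score/percpu.h>

/* Hypothetical helper, called right before the decrement in
 * _ARMV7M_Interrupt_service_leave.  If the level is already zero, the
 * next decrement would wrap to 0xFFFFFFFF, so break into the debugger
 * while the offending call chain is still on the stack.
 */
static inline void check_disable_level_before_decrement( void )
{
  if ( _Per_CPU_Get()->thread_dispatch_disable_level == 0 ) {
    __asm__ volatile ( "bkpt #0" );
  }
}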

I'm not sure this tells us much, because I think the call itself is
correct.  In this particular case it is processing an I2C interrupt.  I
will try to see if we can capture information about the sequence of
changes to _Thread_Dispatch_disable_level just before the point at which
we know something is clearly wrong (i.e., decrementing it below zero).
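
One way to capture that sequence is a small trace buffer filled from the
handful of places that increment or decrement the level.  The following
is only a rough sketch; all names are made up for illustration.

#include <stdint.h>

/* Hypothetical trace buffer: every change to the disable level is
 * recorded together with the caller, so the last few entries before the
 * wrap to 0xFFFFFFFF can be dumped from the debugger once the check in
 * _ARMV7M_Interrupt_service_leave fires.
 */
#define LEVEL_TRACE_ENTRIES 64

typedef struct {
  uint32_t level_after;   /* value after the increment/decrement */
  void    *caller;        /* e.g. __builtin_return_address( 0 ) */
} level_trace_entry;

static level_trace_entry level_trace[ LEVEL_TRACE_ENTRIES ];
static uint32_t level_trace_index;

static void trace_disable_level( uint32_t level_after, void *caller )
{
  uint32_t i = level_trace_index++ % LEVEL_TRACE_ENTRIES;

  level_trace[ i ].level_after = level_after;
  level_trace[ i ].caller      = caller;
}

Alternatively, a conditional hardware watchpoint on the
thread_dispatch_disable_level field of _Per_CPU_Information should catch
the same spot, at the cost of the slowdown mentioned above.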
