Random lwIP Crashes in _POSIX_Mutex_Lock_support()

Joel Sherrill joel.sherrill at oarcorp.com
Wed Oct 21 13:58:56 UTC 2015



On 10/21/2015 8:35 AM, Sebastian Huber wrote:
>
>
> On 21/10/15 15:08, Isaac Gutekunst wrote:
>>
>>
>> On 10/21/2015 09:00 AM, Sebastian Huber wrote:
>>>
>>>
>>> On 21/10/15 14:56, Isaac Gutekunst wrote:
>>>> On 10/21/2015 08:24 AM, Sebastian Huber wrote:
>>>>>
>>>>>
>>>>> On 21/10/15 14:13, Isaac Gutekunst wrote:
>>>>>> Thanks for the reply.
>>>>>>
>>>>>> On 10/21/2015 01:50 AM, Sebastian Huber wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 20/10/15 16:02, Isaac Gutekunst wrote:
>>>>> [...]
>>>>>>>
>>>>>>>>
>>>>>>>> As far as I can tell this would only occur if the caller of
>>>>>>>> pthread_mutex_lock was in a
>>>>>>>> "bad"
>>>>>>>> state. I don't believe it is in an interrupt context, and don't
>>>>>>>> know what other bad states
>>>>>>>> could exist.
>>>>>>>
>>>>>>> We have
>>>>>>>
>>>>>>> #define _CORE_mutex_Check_dispatch_for_seize(_wait) \
>>>>>>>     (!_Thread_Dispatch_is_enabled() \
>>>>>>>       && (_wait) \
>>>>>>>       && (_System_state_Get() >= SYSTEM_STATE_UP))
>>>>>>>
>>>>>>> What is the thread dispatch disable level and the system state at
>>>>>>> this point?
>>>>>>>
>>>>>>> In case the thread dispatch disable level is not zero, then
>>>>>>> something is probably broken
>>>>>>> in the
>>>>>>> operating system code which is difficult to find. Could be a
>>>>>>> general memory corruption
>>>>>>> problem
>>>>>>> too. Which RTEMS version do you use?
>>>>>>>
>>>>>>
>>>>>> The thread dispatch disable level is usually -1 or -2.
>>>>>> (0xFFFFFFFF or 0xFFFFFFFE).
>>>>>
>>>>> A negative value is very bad, but easy to detect via manual
>>>>> instrumentation (only a handful of spots touch this variable) or
>>>>> hardware breakpoints/watchpoints. Does the rest of
>>>>> _Per_CPU_Information look all right?
>>>>>
>>>> It looks like it's only the thread_dispatch_disable_level that's
>>>> broken.
>>>>
>>>> We'll grep for all the places it's touched and look for
>>>> something.
>>>>
>>>> The problem with watchpoints is that they fire exceptionally
>>>> often, and a conditional watchpoint slows the code to a crawl, but
>>>> that may be worth it.
>>>>
>>>> Here are some printouts of the relevant structs right after a crash:
>>>>
>>>> $4 = {
>>>>    cpu_per_cpu = {<No data fields>},
>>>>    isr_nest_level = 0,
>>>>    thread_dispatch_disable_level = 4294967295,
>>>>    executing = 0xc01585c8,
>>>>    heir = 0xc0154038,
>>>>    dispatch_necessary = true,
>>>>    time_of_last_context_switch = {
>>>>      sec = 2992,
>>>>      frac = 10737447432380511034
>>>>    },
>>>>    Stats = {<No data fields>}
>>>> }
>>>
>>> No, this doesn't look good. According to the stack trace you are in
>>> thread context. However, we have executing != heir and
>>> dispatch_necessary == true. This is a broken state in itself. I
>>> guess something is wrong with the interrupt level, so that a context
>>> switch is blocked. On ARMv7-M the context switch is done via the
>>> system call exception.
>>>
>> This is a bit beyond my RTEMS knowledge. What would you advise looking
>> into?
>
> I would try to instrument the code to figure out where the thread
> dispatch disable level goes negative.
>

The test suite macros check that thread_dispatch_disable_level
is always 0 when a call returns (in a uniprocessor configuration).
If all the tests are passing on this BSP, then my assumption would
be a mismatch somewhere in architecture-specific code related to
the interrupt path. This is not guaranteed but likely.
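
Conceptually, that check amounts to something like the sketch
below (illustrative only; the real macros live in the RTEMS test
support code, and the names here follow the struct dump earlier
in this thread):

#include <assert.h>
#include <rtems/score/percpu.h>

/* Illustrative sketch: in a uniprocessor configuration, no service
 * call should return to the caller while thread dispatching is still
 * disabled or an ISR is still believed to be in progress. */
static void check_dispatch_state_sane( void )
{
  const Per_CPU_Control *cpu = _Per_CPU_Get();

  assert( cpu->thread_dispatch_disable_level == 0 );
  assert( cpu->isr_nest_level == 0 );
}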

The settings above look like something decided to trigger a
context switch and then left a dispatch disable critical section
without ever having entered one. So when it left/decremented,
there was no corresponding enter/increment.
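
One way to do that instrumentation (a sketch with hypothetical
wrapper names; the idea is to route the handful of sites that
touch the level through a check, so execution stops at the
unbalanced exit itself rather than at a later pthread_mutex_lock()
that merely trips over the damage):

#include <rtems.h>
#include <rtems/score/percpu.h>

/* Hypothetical wrappers around the existing increment/decrement
 * sites. The counter is unsigned, so a "negative" level is really a
 * wrap past zero to 0xFFFFFFFF; stop before that can happen. */
static inline void dispatch_disable_enter( Per_CPU_Control *cpu )
{
  ++cpu->thread_dispatch_disable_level;
}

static inline void dispatch_disable_leave( Per_CPU_Control *cpu )
{
  if ( cpu->thread_dispatch_disable_level == 0 ) {
    rtems_fatal( RTEMS_FATAL_SOURCE_APPLICATION, 0xdead );
  }

  --cpu->thread_dispatch_disable_level;
}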

--joel



