Random lwIP Crashes in _POSIX_Mutex_Lock_support()

Wed Oct 21 14:32:22 UTC 2015

On 10/21/2015 9:09 AM, Isaac Gutekunst wrote:
>
>
> On 10/21/2015 09:58 AM, Joel Sherrill wrote:
>>
>>
>> On 10/21/2015 8:35 AM, Sebastian Huber wrote:
>>>
>>>
>>> On 21/10/15 15:08, Isaac Gutekunst wrote:
>>>>
>>>>
>>>> On 10/21/2015 09:00 AM, Sebastian Huber wrote:
>>>>>
>>>>>
>>>>> On 21/10/15 14:56, Isaac Gutekunst wrote:
>>>>>> On 10/21/2015 08:24 AM, Sebastian Huber wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 21/10/15 14:13, Isaac Gutekunst wrote:
>>>>>>>> Thanks for the reply.
>>>>>>>>
>>>>>>>> On 10/21/2015 01:50 AM, Sebastian Huber wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 20/10/15 16:02, Isaac Gutekunst wrote:
>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> As far as I can tell this would only occur if the caller of
>>>>>>>>>> pthread_mutex_lock was in a
>>>>>>>>>> "bad"
>>>>>>>>>> state. I don't believe it is in an interrupt context, and don't
>>>>>>>>>> know what other bad states
>>>>>>>>>> could exist.
>>>>>>>>>
>>>>>>>>> We have
>>>>>>>>>
>>>>>>>>> #define _CORE_mutex_Check_dispatch_for_seize(_wait) \
>>>>>>>>>      (!_Thread_Dispatch_is_enabled() \
>>>>>>>>>        && (_wait) \
>>>>>>>>>        && (_System_state_Get() >= SYSTEM_STATE_UP))
>>>>>>>>>
>>>>>>>>> What is the thread dispatch disable level and the system state at
>>>>>>>>> this point?
>>>>>>>>>
>>>>>>>>> In case the thread dispatch disable level is not zero, then
>>>>>>>>> something is probably broken
>>>>>>>>> in the
>>>>>>>>> operating system code which is difficult to find. Could be a
>>>>>>>>> general memory corruption
>>>>>>>>> problem
>>>>>>>>> too. Which RTEMS version do you use?
>>>>>>>>>
>>>>>>>>
>>>>>>>> The thread dispatch disable level is usually -1 or -2.
>>>>>>>> (0xFFFFFFFF or 0xFFFFFFFE).
>>>>>>>
>>>>>>> A negative value is very bad, but easy to detect via manual
>>>>>>> instrumentation (only a handful
>>>>>>> of spots touch this variable) or hardware breakpoints/watchpoints.
>>>>>>> Does the rest of
>>>>>>> _Per_CPU_Information look all right?
>>>>>>>
>>>>>> It looks like it's only the thread_dispatch_disable_level that's
>>>>>> broken.
>>>>>>
>>>>>> We'll go and grep for all the places it's touched,
>>>>>> and look for something.
>>>>>>
>>>>>> The problem with watchpoints is they fire exceptionally often, and
>>>>>> putting in a conditional
>>>>>> watchpoint slows the code to a crawl, but that may be worth it.
>>>>>>
>>>>>> Here are some printouts of the relevant structs right after a crash:
>>>>>>
>>>>>> $4 = {
>>>>>>     cpu_per_cpu = {<No data fields>},
>>>>>>     isr_nest_level = 0,
>>>>>>     thread_dispatch_disable_level = 4294967295,
>>>>>>     executing = 0xc01585c8,
>>>>>>     heir = 0xc0154038,
>>>>>>     dispatch_necessary = true,
>>>>>>     time_of_last_context_switch = {
>>>>>>       sec = 2992,
>>>>>>       frac = 10737447432380511034
>>>>>>     },
>>>>>>     Stats = {<No data fields>}
>>>>>> }
>>>>>
>>>>> No, this doesn't look good. According to the stack trace you are in
>>>>> thread context. However, we
>>>>> have executing != heir and dispatch_necessary == true. This is a
>>>>> broken state itself. I guess,
>>>>> something is wrong with the interrupt level so that a context switch
>>>>> is blocked. On ARMv7-M
>>>>> this is done via the system call exception.
>>>>>
>>>> This is a bit beyond my RTEMS knowledge. What would you advise looking
>>>> into?
>>>
>>> I would try to instrument the code to figure out where the thread
>>> dispatch disable level goes negative.
>>>
>>
> We have done some testing and found an interrupt that decrements the value below zero. However,
> this may not be the problem, as a previous call may have incorrectly decremented it to zero.
> We'll keep looking.
>
>> The test suite macros check that thread_dispatch_disable_level
>> is always 0 when a call returns (in a uniprocessor configuration).
>> If all the tests are passing on this BSP, then my assumption would
>> be a mismatch somewhere in architecture-specific code related to
>> the interrupt path. This is not guaranteed but likely.
>>
> Unfortunately we have not been able to run the tests for this BSP. This seems like good
> motivation to make them run.
>
> Does it seem likely that there would be problems between different ARMv7-M BSPs, where one
> Cortex-M4 BSP works while a different one doesn't at this level? My understanding is they share
> the same NVIC and should behave the same.

In general terms, the more specific you get with a CPU model, the
fewer and fewer users there are on a specific line of code. It is
possible that this code just hasn't been beaten up that much.

>> The settings above look like something decided to trigger a
>> context switch and left a dispatch disable critical section but was
>> never in a dispatch disable critical section. So when it
>> left/decremented it, there was no corresponding enter/increment.
>>
> Is this something a new BSP that doesn't change anything besides the bare minimum would likely
> break? I'm wondering if we should compare to a different Cortex M BSP.

Everything can be wrong at some point. :(

I suspect a path through the interrupt code that somehow either didn't
increment it (unlikely) or double-decremented it (likely).
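
To make the double-decrement case concrete, here is a minimal stand-alone
model of it. This is only a sketch and not RTEMS code: every model_* name
is hypothetical, the level is assumed to be a 32-bit unsigned counter (which
is what the 4294967295 in the dump suggests), and dispatching is assumed to
be enabled only when the level is exactly zero. The check has the same shape
as the _CORE_mutex_Check_dispatch_for_seize() macro quoted earlier in the
thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the RTEMS internals; not the real definitions. */
enum model_system_state { MODEL_STATE_BEFORE_UP = 0, MODEL_STATE_UP = 1 };

struct model_per_cpu {
    uint32_t thread_dispatch_disable_level; /* assumed unsigned 32-bit */
};

/* Assumption: dispatching is enabled only when the level is exactly zero. */
static bool model_dispatch_is_enabled(const struct model_per_cpu *cpu)
{
    return cpu->thread_dispatch_disable_level == 0;
}

/* Same shape as the quoted macro: true means a blocking seize is illegal. */
static bool model_check_dispatch_for_seize(const struct model_per_cpu *cpu,
                                           bool wait,
                                           enum model_system_state state)
{
    return !model_dispatch_is_enabled(cpu) && wait && state >= MODEL_STATE_UP;
}

/* One balanced enter/leave pair followed by a stray second leave. */
static uint32_t model_double_decrement(struct model_per_cpu *cpu)
{
    assert(cpu->thread_dispatch_disable_level == 0);
    cpu->thread_dispatch_disable_level++; /* enter critical section */
    cpu->thread_dispatch_disable_level--; /* matching leave, back to 0 */
    cpu->thread_dispatch_disable_level--; /* stray leave, wraps around */
    return cpu->thread_dispatch_disable_level;
}
```

After the stray leave the counter holds UINT32_MAX, which a debugger shows
as 4294967295 or -1 depending on the format. From then on every blocking
mutex seize in this model trips the check, even though the caller is an
ordinary thread.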

A desk check that this value is only incremented once and decremented
once per path might turn it up.
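
If the desk check does not turn it up, the same pairing rule can be checked
at run time. The following is a rough sketch of that kind of instrumentation,
written as a stand-alone model rather than a patch: the tracer_* names, the
ring size, and the site strings are all made up for illustration. In a real
system the equivalent hooks would go at the handful of spots that touch the
level.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_DEPTH 16u /* hypothetical ring size */

/* One recorded change of the level. */
struct level_event {
    const char *site;     /* label for the code location, e.g. "isr_exit" */
    int delta;            /* +1 enter, -1 leave, 0 rejected leave */
    uint32_t level_after; /* level once the event was applied */
};

struct level_tracer {
    uint32_t level;
    struct level_event ring[TRACE_DEPTH]; /* most recent events */
    size_t count;                         /* total events recorded */
    const char *first_underflow_site;     /* NULL until a leave is unpaired */
};

/* Store the event in the ring, overwriting the oldest entry when full. */
static void tracer_record(struct level_tracer *t, const char *site, int delta)
{
    struct level_event *e = &t->ring[t->count % TRACE_DEPTH];

    e->site = site;
    e->delta = delta;
    e->level_after = t->level;
    t->count++;
}

/* Enter a dispatch disable critical section. */
static void tracer_enter(struct level_tracer *t, const char *site)
{
    t->level++;
    tracer_record(t, site, +1);
}

/*
 * Leave a critical section. Returns 0 on a paired leave. On an unpaired
 * leave it returns -1, remembers the first offending site, and keeps the
 * level at zero instead of letting it wrap.
 */
static int tracer_leave(struct level_tracer *t, const char *site)
{
    if (t->level == 0) {
        if (t->first_underflow_site == NULL) {
            t->first_underflow_site = site;
        }
        tracer_record(t, site, 0);
        return -1;
    }

    t->level--;
    tracer_record(t, site, -1);
    return 0;
}
```

Calling tracer_enter() once and tracer_leave() twice makes the second leave
return -1 and sets first_underflow_site. As noted earlier in the thread, the
decrement that goes below zero is not necessarily the one that is wrong, so
the ring of recent events is there to look back at what ran before it.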

> Maybe there is a difference between the M4 and M7 that is sufficient to introduce problems.

That is possible. Without looking at the code, be suspicious of
code that is unique to your architectural revision or in conditionals.

--joel