Random lwIP Crashes in _POSIX_Mutex_Lock_support()
Jay Doyle
jay.doyle at vecna.com
Thu Oct 22 13:16:40 UTC 2015
On 10/22/2015 01:40 AM, Sebastian Huber wrote:
>
>
> On 21/10/15 15:48, Jay Doyle wrote:
>>
>>
>> On 10/21/2015 09:35 AM, Sebastian Huber wrote:
>>>
>>>
>>> On 21/10/15 15:08, Isaac Gutekunst wrote:
>>>>
>>>>
>>>> On 10/21/2015 09:00 AM, Sebastian Huber wrote:
>>>>>
>>>>>
>>>>> On 21/10/15 14:56, Isaac Gutekunst wrote:
>>>>>> On 10/21/2015 08:24 AM, Sebastian Huber wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 21/10/15 14:13, Isaac Gutekunst wrote:
>>>>>>>> Thanks for the reply.
>>>>>>>>
>>>>>>>> On 10/21/2015 01:50 AM, Sebastian Huber wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 20/10/15 16:02, Isaac Gutekunst wrote:
>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> As far as I can tell this would only occur if the caller of
>>>>>>>>>> pthread_mutex_lock was in a "bad" state. I don't believe it is
>>>>>>>>>> in an interrupt context, and don't know what other bad states
>>>>>>>>>> could exist.
>>>>>>>>>
>>>>>>>>> We have
>>>>>>>>>
>>>>>>>>> #define _CORE_mutex_Check_dispatch_for_seize(_wait) \
>>>>>>>>>   (!_Thread_Dispatch_is_enabled() \
>>>>>>>>>     && (_wait) \
>>>>>>>>>     && (_System_state_Get() >= SYSTEM_STATE_UP))
>>>>>>>>>
>>>>>>>>> What is the thread dispatch disable level and the system state
>>>>>>>>> at this point?
>>>>>>>>>
>>>>>>>>> If the thread dispatch disable level is not zero, then something
>>>>>>>>> is probably broken in the operating system code, which will be
>>>>>>>>> difficult to find. It could also be a general memory corruption
>>>>>>>>> problem. Which RTEMS version do you use?
>>>>>>>>>
>>>>>>>>
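
(For reference: a minimal sketch of how those two values can be captured
at the failure site, assuming the score-internal names quoted above
resolve in the application's context; headers and the exact access path
differ between RTEMS versions.)

   #include <rtems/bspIo.h>
   #include <rtems/score/threaddispatch.h>
   #include <rtems/score/sysstate.h>

   /* Hypothetical debug helper: dumps the state tested by
    * _CORE_mutex_Check_dispatch_for_seize(). On some RTEMS versions the
    * disable level must be read through the per-CPU structure instead. */
   static void debug_dump_dispatch_state( void )
   {
     printk(
       "disable level = %u, dispatch enabled = %d, system state = %d\n",
       (unsigned) _Thread_Dispatch_disable_level,
       (int) _Thread_Dispatch_is_enabled(),
       (int) _System_state_Get()
     );
   }
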
>>>>>>>> The thread dispatch disable level is usually -1 or -2
>>>>>>>> (0xFFFFFFFF or 0xFFFFFFFE).
>>>>>>>
>>>>>>> A negative value is very bad, but easy to detect via manual
>>>>>>> instrumentation (only a handful of spots touch this variable) or
>>>>>>> hardware breakpoints/watchpoints. Does the rest of
>>>>>>> _Per_CPU_Information look all right?
>>>>>>>
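
(A sketch of the kind of manual instrumentation meant here: an underflow
guard placed immediately before each spot that decrements the level. The
halt loop simply freezes the system in the broken state so a debugger
can inspect the backtrace and _Per_CPU_Information.)

   #include <rtems/bspIo.h>
   #include <rtems/score/threaddispatch.h>

   /* Call this right before each decrement of the disable level */
   static void debug_check_level_before_decrement( void )
   {
     if ( _Thread_Dispatch_disable_level == 0 ) {
       printk( "thread dispatch disable level about to underflow\n" );

       for ( ;; ) {
         /* Park here for the debugger */
       }
     }
   }
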
>>>>>> It looks like it's only the thread_dispatch_disable_level that's
>>>>>> broken.
>>>>>>
>>>>>> We'll go and grep for all the places it's touched and look for
>>>>>> anything suspicious.
>>>>>>
>>>>>> The problem with watchpoints is that they fire exceptionally often,
>>>>>> and making a watchpoint conditional slows the code to a crawl, but
>>>>>> that may be worth it.
>>>>>>
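
(For what it's worth, a conditional hardware watchpoint can catch the
underflow directly, because the unsigned level wraps to a huge value on
the first bad decrement. Something along these lines in GDB, assuming
the symbol is visible to the debugger:

   (gdb) watch _Thread_Dispatch_disable_level if \
         _Thread_Dispatch_disable_level > 0xF0000000

With a hardware watchpoint the code runs at full speed between writes;
only the condition evaluation on each write costs time.)
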
>>>>>> Here are some printouts of the relevant structs right after a crash:
>>>>>>
>>>>>> $4 = {
>>>>>>   cpu_per_cpu = {<No data fields>},
>>>>>>   isr_nest_level = 0,
>>>>>>   thread_dispatch_disable_level = 4294967295,
>>>>>>   executing = 0xc01585c8,
>>>>>>   heir = 0xc0154038,
>>>>>>   dispatch_necessary = true,
>>>>>>   time_of_last_context_switch = {
>>>>>>     sec = 2992,
>>>>>>     frac = 10737447432380511034
>>>>>>   },
>>>>>>   Stats = {<No data fields>}
>>>>>> }
>>>>>
>>>>> No, this doesn't look good. According to the stack trace you are in
>>>>> thread context. However, we have executing != heir and
>>>>> dispatch_necessary == true, which is a broken state in itself. I
>>>>> guess something is wrong with the interrupt level, so that a
>>>>> context switch is blocked. On ARMv7-M the context switch is done
>>>>> via the system call exception.
>>>>>
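
(A sketch of how that could be checked at the crash site: if BASEPRI or
PRIMASK masks the exception RTEMS uses for thread dispatch, a pended
context switch is held off exactly as described above. The ICSR address
is from the ARMv7-M Architecture Reference Manual.)

   #include <stdint.h>
   #include <rtems/bspIo.h>

   /* Interrupt Control and State Register (SCB->ICSR) */
   #define ARMV7M_ICSR ( *(volatile uint32_t *) 0xE000ED04 )

   static void debug_dump_exception_state( void )
   {
     uint32_t basepri;
     uint32_t primask;

     __asm__ volatile ( "mrs %0, basepri" : "=r" ( basepri ) );
     __asm__ volatile ( "mrs %0, primask" : "=r" ( primask ) );

     printk(
       "BASEPRI = 0x%x, PRIMASK = 0x%x, ICSR = 0x%x\n",
       (unsigned) basepri,
       (unsigned) primask,
       (unsigned) ARMV7M_ICSR
     );
   }
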
>>>> This is a bit beyond my RTEMS knowledge. What would you advise
>>>> looking into?
>>>
>>> I would try to instrument the code to figure out where the thread
>>> dispatch disable level goes negative.
>>>
>>
>> We just did. I added a check in _ARMV7M_Interrupt_service_leave to
>> see if the _Thread_Dispatch_disable_level is positive before
>> decrementing it, and this check eventually fails.
>>
>> I'm not sure how much this tells us, because I think the call itself
>> is correct. In this particular case it is processing an I2C interrupt.
>> I will see if we can capture information about the sequence of changes
>> to the _Thread_Dispatch_disable_level just before the point at which
>> we know something is clearly wrong (i.e., the decrement that takes it
>> below zero).
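
(The kind of trace we have in mind is roughly the following -- a sketch
with hypothetical names. Each spot that changes the level would call
level_trace_record(), and the ring buffer can be dumped from the
debugger once the check above fires.)

   #include <stdint.h>

   typedef struct {
     uint32_t level;   /* value after the change */
     void    *caller;  /* return address of the changing spot */
   } level_trace_entry;

   #define LEVEL_TRACE_SIZE 64

   static volatile level_trace_entry level_trace[ LEVEL_TRACE_SIZE ];
   static volatile uint32_t level_trace_index;

   static void level_trace_record( uint32_t new_level )
   {
     uint32_t i = level_trace_index++ % LEVEL_TRACE_SIZE;

     level_trace[ i ].level  = new_level;
     level_trace[ i ].caller = __builtin_return_address( 0 );
   }
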
>
> Since the isr_nest_level is 0, I don't think it's a problem with the
> spots that use _ARMV7M_Interrupt_service_leave(). Did you check the
> interrupt priorities? See also
>
> https://lists.rtems.org/pipermail/users/2015-June/029155.html
>
Thanks for the pointer to this posting. It seems like a very similar
situation to what we are experiencing -- especially considering that we
invoke an RTEMS call in our Ethernet ISR. Unfortunately, all our
interrupts use the default interrupt priority level set in the BSP
header file as

#define BSP_ARMV7M_IRQ_PRIORITY_DEFAULT (13 << 4)

which should mean that they are all non-NMIs unless we explicitly set
their interrupt priority to a numerically lower (more urgent) value.
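
To rule out something changing a priority behind our backs, we can audit
the NVIC priority bytes at runtime. A sketch, where the NVIC_IPR base
address is from the ARMv7-M Architecture Reference Manual and max_irq
stands in for the BSP's external vector count:

   #include <stdint.h>
   #include <bsp.h>          /* BSP_ARMV7M_IRQ_PRIORITY_DEFAULT */
   #include <rtems/bspIo.h>

   /* NVIC Interrupt Priority Registers, one byte per external IRQ */
   #define NVIC_IPR ( (volatile uint8_t *) 0xE000E400 )

   static void debug_audit_irq_priorities( int max_irq )
   {
     int irq;

     for ( irq = 0; irq < max_irq; ++irq ) {
       uint8_t prio = NVIC_IPR[ irq ];

       /* Numerically lower means more urgent on ARMv7-M: such an
        * interrupt can preempt RTEMS critical sections like an NMI. */
       if ( prio < BSP_ARMV7M_IRQ_PRIORITY_DEFAULT ) {
         printk( "IRQ %d priority raised to 0x%x\n", irq, (unsigned) prio );
       }
     }
   }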