Random lwIP Crashes in _POSIX_Mutex_Lock_support()

Wed Oct 21 12:13:07 UTC 2015

Thanks for the reply.

On 10/21/2015 01:50 AM, Sebastian Huber wrote:
>
>
> On 20/10/15 16:02, Isaac Gutekunst wrote:
>> Hi Devel,
>>
>> I'm pretty sure this is a devel question, not users.
>>
>>
>> I'm working with a colleague at Vecna to port lwIP to the STM32F7 BSP we've developed.
>>
>> We have a basic HTTP server that prints out the current list of tasks. We refresh the page at
>> a very high rate, and after about 1-30 minutes, get a crash.
>>
>> Every time, the exception is thrown after _CORE_mutex_Check_dispatch_for_seize( wait ) on
>> line 254 of coremuteximpl.h, and it is always inside a pthread_mutex_lock() call.
>>
>>
>> Here is the full backtrace:
>>
>> stm32fxxxx_fatal_error_handler() at hal-fatal-error-handler.c:126 0x800af92
>> _User_extensions_Fatal_visitor() at userextiterate.c:123 0x803212c
>> _User_extensions_Iterate() at userextiterate.c:166 0x80321c0
>> _User_extensions_Fatal() at userextimpl.h:254 0x802a85e
>> _Terminate() at interr.c:44 0x802a888
>> _CORE_mutex_Seize_body() at coremuteximpl.h:255 0x8068df0
>> _POSIX_Mutex_Lock_support() at mutexlocksupp.c:57 0x806907e
>> pthread_mutex_lock() at mutexlock.c:40 0x8068bee
>> sys_arch_sem_wait() at sys_arch.c:485 0x808da8a
>> sys_arch_mbox_fetch() at sys_arch.c:357 0x808d804
>> sys_timeouts_mbox_fetch() at timers.c:532 0x80883ce
>> tcpip_thread() at tcpip.c:95 0x808c170
>> _Thread_Handler() at threadhandler.c:102 0x806bbe8
>> _User_extensions_Thread_exitted() at userextimpl.h:244 0x806bb60
>> bsp_section_work_begin() at 0xc016a12c
>>
>>
>> The code path calling pthread_mutex_lock() varies, but the caller is always within lwIP.
>>
>>
>> Does this ring any bells?
>
> Normally you get this if you obtain a locked mutex in interrupt context, but your stack trace
> says you are not.

That was my first suspicion as well.
>
>>
>> As far as I can tell this would only occur if the caller of pthread_mutex_lock was in a "bad"
>> state. I don't believe it is in an interrupt context, and don't know what other bad states
>> could exist.
>
> We have
>
> #define _CORE_mutex_Check_dispatch_for_seize(_wait) \
>    (!_Thread_Dispatch_is_enabled() \
>      && (_wait) \
>      && (_System_state_Get() >= SYSTEM_STATE_UP))
>
> What is the thread dispatch disable level and the system state at this point?
>
> In case the thread dispatch disable level is not zero, then something is probably broken in the
> operating system code which is difficult to find. Could be a general memory corruption problem
> too. Which RTEMS version do you use?
>

The thread dispatch disable level is usually -1 or -2.
(0xFFFFFFFE or 0xFFFFFFD).
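
To make the failure condition concrete, here is a minimal sketch of how the quoted macro evaluates for such a value. It assumes the disable level is stored as an unsigned 32-bit integer and that "dispatch enabled" means the level is exactly zero; both are assumptions about the kernel internals, not something verified here. All sketch_* names are hypothetical stand-ins, not RTEMS API. Note that a 32-bit value of -2 prints as 0xFFFFFFFE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the system states; only their ordering matters. */
enum sketch_system_state { SKETCH_STATE_BEFORE_UP = 0, SKETCH_STATE_UP = 1 };

/* Mirrors the quoted macro: dispatch disabled, blocking wait, system up. */
static bool sketch_check_dispatch_for_seize(uint32_t disable_level,
                                            bool wait,
                                            enum sketch_system_state state)
{
    /* Assumption: dispatching counts as enabled only when the level is 0. */
    bool dispatch_enabled = (disable_level == 0u);

    return !dispatch_enabled && wait && state >= SKETCH_STATE_UP;
}

/* Decrement an unsigned level the given number of times (wraps modulo 2^32). */
static uint32_t sketch_decrement(uint32_t level, unsigned times)
{
    while (times-- > 0u) {
        --level;
    }
    return level;
}

/* Usage: two unbalanced decrements from 0 are enough to trip the check. */
static void sketch_seize_demo(void)
{
    uint32_t level = sketch_decrement(0u, 2u);

    assert(level == UINT32_C(0xFFFFFFFE));
    assert(sketch_check_dispatch_for_seize(level, true, SKETCH_STATE_UP));
}
```

Under these assumptions, any nonzero level, including a wrapped-around one, makes a blocking pthread_mutex_lock() look like an obtain from a bad state, even in ordinary thread context.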

We first suspected that _Thread_Dispatch_decrement_disable_level (in threaddispatch.h) was
somehow being called too many times. However, the system always crashes without the check firing.

For the record, I inserted this snippet of code:

     if (disable_level < 0) {
         _Terminate(
             INTERNAL_ERROR_CORE,
             true,
             INTERNAL_ERROR_MUTEX_OBTAIN_FROM_BAD_STATE
         );
         // In case the _Terminate call doesn't work
         __asm__ volatile ("BKPT #01");
     }

This points us toward a general memory corruption issue, so we are a bit stuck. Another
avenue we are exploring is adding checks for a negative disable_level throughout the code,
hoping to get closer to the source of the corruption.
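
One caveat for such checks, as a hedged sketch: if disable_level is declared as an unsigned 32-bit type (an assumption; the declaration was not checked here), then in C the comparison disable_level < 0 is always false, and a check written that way can never fire regardless of the stored value. A version that tests the top bit instead might look like this. The name sketch_level_underflowed is hypothetical, not part of RTEMS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if an unsigned disable level looks like a wrapped-around
 * negative number, i.e. its top bit is set. For an unsigned type,
 * `disable_level < 0` is always false, so test the sign bit explicitly.
 */
static bool sketch_level_underflowed(uint32_t disable_level)
{
    return (disable_level & UINT32_C(0x80000000)) != 0u;
}

/* Usage: small positive nesting levels pass, wrapped values are flagged. */
static void sketch_underflow_demo(void)
{
    assert(!sketch_level_underflowed(0u));
    assert(!sketch_level_underflowed(3u));
    assert(sketch_level_underflowed(UINT32_C(0xFFFFFFFE)));
}
```

Testing the bit avoids relying on an unsigned-to-signed cast, whose result is implementation-defined for out-of-range values. A real check would call _Terminate() or hit a breakpoint where this sketch only returns true.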

We are running a fork based on 314ff3c43ff1c00232e201df68e39cc0e5600d95. Our changes since then 
include the addition of our STM32F BSPs, but no changes to the kernel except a new CAN driver.

Real trace functionality would be really nice, but we lack the hardware (both trace probes, and 
exposed trace lines).

This is probably a stretch, but does anyone have experience getting the ETM or ITM sending data 
to the ETB and getting the data over JTAG? (with RTEMS and GCC)

Isaac