Possible bug in _CORE_mutex_Seize()

Joel Sherrill joel.sherrill at OARcorp.com
Tue Sep 30 01:06:29 UTC 2003


Till Straumann wrote:

> Joel Sherrill wrote:
>
>> Till Straumann wrote:
>>
>>> Joel Sherrill wrote:
>>>
>>>> Following up on my own post, since when I read this early this
>>>> morning I didn't notice the bottom where you actually explained
>>>> where malloc() was called.
>>>>
>>>> Phil Torre wrote:
>>>>
>>>>> As unlikely as it sounds, I think we have found a bug in 
>>>>> _CORE_mutex_Seize()
>>>>> which violates mutual exclusion.
>>>>>  
>>>>>
>>>> It simply assumes that _Thread_Dispatch_disable_level is 0.  So
>>>> calling malloc() from within a directive which uses dispatch disable
>>>> locking is dangerous.  I think I have a solution.  Move the body of
>>>> the libc start hook to the create hook.
>>>
>>> This is not gonna work :-( -- rtems_task_create() also uses dispatch
>>> disabling... 
>>
>> It will work if the allocation mutex allows nesting, so the
>> initialization call in cpukit/score/.../apimutex.h needs to change
>> from nesting-is-error (CORE_MUTEX_NESTING_IS_ERROR).  See my earlier
>> comment below your "-- Till". :)
>>
>
> Hmm - unless I'm missing something, I can't see what this has to do
> with nesting. Could you explain how this should work, please?
>
> (Scenario: Thread 2 holds the allocator mutex and is suspended; thread 1
>            tries to acquire the allocator mutex (nesting allowed) from
>            a dispatching-disabled section -> blocking of 1 is effectively
>            postponed until 1 leaves the dispatching-disabled section.)


It doesn't work unless you move the malloc to the create hook.  Also, your
patch doesn't check for a NULL pointer from Workspace_Allocate, which is a
distinct possibility.
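
Roughly what I have in mind for that check -- an untested sketch only,
with the hook signature and the stand-in names simplified for
illustration rather than the real user extension prototype:

#include <reent.h>     /* newlib's struct _reent and _REENT_INIT_PTR */
#include <stddef.h>

/* Declared here only for the sketch; the real prototype is in
   rtems/score/wkspace.h. */
extern void *_Workspace_Allocate( size_t size );

/* Simplified create-hook shape: allocate the reentrancy structure
   from the workspace and refuse to continue if that fails. */
int libc_create_hook_sketch( struct _reent **reent_out )
{
  struct _reent *ptr = _Workspace_Allocate( sizeof( struct _reent ) );

  if ( ptr == NULL )
    return 0;                 /* report failure instead of crashing later */

  _REENT_INIT_PTR( ptr );     /* initialize the newlib reentrancy struct */
  *reent_out = ptr;
  return 1;                   /* success */
}

Returning failure lets rtems_task_create() back out cleanly rather than
letting the new task run without a reentrancy structure.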

I generally like using the workspace instead of the main heap.  But all
code allocating memory should be in create hooks -- not start hooks.
Then the nesting will work, since the allocation happens inside another
allocator critical section.
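
To make the nesting point concrete, here is the shape of the logic I
mean -- this is not the actual score code, just an illustrative sketch
with made-up names alongside the CORE_MUTEX_NESTING_* idea:

#include <stddef.h>

typedef enum {
  SKETCH_NESTING_ACQUIRES,     /* like CORE_MUTEX_NESTING_ACQUIRES */
  SKETCH_NESTING_IS_ERROR      /* like CORE_MUTEX_NESTING_IS_ERROR */
} Sketch_Nesting_behavior;

typedef struct {
  void                    *holder;      /* owning thread, NULL if free */
  unsigned int             nest_count;
  Sketch_Nesting_behavior  nesting;
} Sketch_Mutex;

extern void *sketch_executing_thread( void );  /* made-up helper */

/* 0 = got it, -1 = nesting error, -2 = must block (the dangerous path
   when dispatching is already disabled). */
int sketch_mutex_seize( Sketch_Mutex *m )
{
  void *self = sketch_executing_thread();

  if ( m->holder == NULL ) {            /* uncontended: take it */
    m->holder     = self;
    m->nest_count = 1;
    return 0;
  }

  if ( m->holder == self ) {            /* nested acquire by the holder */
    if ( m->nesting == SKETCH_NESTING_ACQUIRES ) {
      m->nest_count++;                  /* create-hook malloc lands here */
      return 0;
    }
    return -1;                          /* NESTING_IS_ERROR today */
  }

  return -2;                            /* held by someone else: block */
}

With the allocator mutex created as nesting-allowed, the malloc() issued
by a create hook hits the "already ours" branch, because
rtems_task_create() is still holding the allocator mutex in the same
thread, so it never reaches the blocking path that misbehaves when
dispatching is disabled.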

>
> -- Till
>
>
>> The error checking I proposed won't work without tinkering, though.
>> Let's ignore that for the first pass.
>>
>>> How about allocating the reent structure from the workspace?
>>
>>
>>
>> That's not necessarily a bad idea either.  The workspace is designed
>> for memory allocation inside a directive.  But if the allocation mutex
>> allows nesting, it isn't necessary.
>>
>>>
>>>
>>> -- Till
>>>
>>>>  It is probably also necessary to change
>>>> cpukit/score/include/rtems/score/apimutex.h so that
>>>> _API_Mutex_Allocate() creates the Allocator mutex as nestable rather
>>>> than CORE_MUTEX_NESTING_IS_ERROR.
>>>>
>>>> Finally, it might not be a bad idea for it to be considered a fatal
>>>> RTEMS error if _API_Mutex_Lock() wants to block when
>>>> _Thread_Dispatch_disable_level is non-zero.  That would be easier
>>>> than debugging this the next time it happens.
>>>>
>>>> It might also be valid to consider it a fatal error if a memory
>>>> allocation is attempted when _Thread_Dispatch_disable_level is
>>>> non-zero.
>>>>
>>>>> This pertains to rtems-4.6.0pre4 running on MPC860 with an
>>>>> unsubmitted BSP.  The sequence of events goes like this:
>>>>>
>>>>>
>>>>> 1.  Thread 1 (Init) is running at priority 1.  It creates and starts
>>>>>     thread 2 (notification_task) at priority 196.  Since thread 2 is
>>>>>     at a lower priority, it doesn't start executing yet.
>>>>>
>>>>> 2.  Thread 1 sleeps with rtems_task_wake_after(10 ms) to wait for
>>>>>     some external hardware to do something.  As soon as it goes to
>>>>>     sleep, thread 2 is now runnable and starts executing.
>>>>>
>>>>> 3.  Thread 2 does some stuff, and then calls malloc().  Halfway
>>>>>     through rtems_region_get_segment(), the 10 ms timer set by
>>>>>     thread 1 expires.  We do a context switch and thread 1 is now
>>>>>     running.
>>>>>
>>>>>     ** Before it lost the CPU, thread 2 had successfully called   **
>>>>>     ** _RTEMS_Lock_allocator().  _RTEMS_Allocator_Mutex is held   **
>>>>>     ** by thread 2 when the context switch back to thread 1       **
>>>>>     ** occurs.                                                    **
>>>>>
>>>>> 4.  Thread 1 now calls rtems_task_start(), which invokes malloc(),
>>>>>     which calls rtems_region_get_segment(), which calls
>>>>>     _RTEMS_Lock_allocator().
>>>>>
>>>>>     _RTEMS_Lock_allocator() returns, *without blocking*.  The
>>>>>     allocator mutex is still held by thread 2, yet thread 1 proceeds
>>>>>     in the belief that it has the mutex.
>>>>>
>>>>>     More detail:
>>>>>     When thread 1 calls rtems_task_start() in step #4, that function
>>>>>     calls _Thread_Get() on the task we want to start.  As a side
>>>>>     effect, _Thread_Get() increments _Thread_Dispatch_disable_level
>>>>>     to 1.
>>>>>
>>>>>     Shortly thereafter, _User_extensions_Thread_start() is called,
>>>>>     which calls libc_start_hook(), which calls calloc() -> malloc() ->
>>>>>     rtems_region_get_segment() -> _RTEMS_Lock_allocator() ->
>>>>>     _CORE_mutex_Seize().
>>>>>
>>>>>     (Note that _Thread_Dispatch_disable_level is still 1.)
>>>>>     _CORE_mutex_Seize_interrupt_trylock() returns 1 (as it should),
>>>>>     so we call _Thread_Disable_dispatch() (disable level is now 2!)
>>>>>     followed by _CORE_mutex_Seize_interrupt_blocking() to block on
>>>>>     the mutex.  Because _Thread_Dispatch_disable_level is 2, the
>>>>>     call to _Thread_Enable_dispatch() just decrements it to 1 and
>>>>>     returns without calling _Thread_Dispatch().
>>>>>
>>>>>     Thread 1 now happily proceeds to corrupt the heap free block
>>>>>     chain.
>>>>>     I don't understand the semantics of
>>>>>     _Thread_Dispatch_disable_level well enough to provide a patch.
>>>>>     For now we will work around it by making sure our tasks don't
>>>>>     call malloc() at the same time.  Hopefully those with deep
>>>>>     kernel understanding can take a look at this and tell me if I'm
>>>>>     smoking crack.  :)
>>>>>
>>>>> -Phil
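
For reference, here is a rough sketch of the fatal-error check I
suggested above.  It is untested, the error code and helper are invented
for illustration, and the real check would sit where the allocator lock
(_RTEMS_Lock_allocator() / _API_Mutex_Lock()) discovers it would have to
block:

#include <stdint.h>

/* The real variable lives in the score; declared here only for the
   sketch. */
extern volatile uint32_t _Thread_Dispatch_disable_level;

/* Made-up stand-ins for _Internal_error_Occurred() and an error code. */
extern void sketch_fatal_error( uint32_t the_error );
#define SKETCH_ERROR_BLOCK_WHILE_DISPATCHING_DISABLED 1

/* Call this before _CORE_mutex_Seize() bumps the disable level itself.
   If the allocator mutex is already held by another thread and
   dispatching is already disabled, the later _Thread_Enable_dispatch()
   only decrements the level and never calls _Thread_Dispatch(), so the
   caller keeps running without owning the mutex -- exactly the heap
   corruption traced above.  Better to stop loudly here. */
void sketch_check_allocator_lock( int held_by_another_thread )
{
  if ( held_by_another_thread && _Thread_Dispatch_disable_level != 0 )
    sketch_fatal_error( SKETCH_ERROR_BLOCK_WHILE_DISPATCHING_DISABLED );
}

The check only costs anything on the contended path, and it turns a
silent mutual-exclusion violation into an immediate, debuggable fatal
error.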





