Possible bug in _CORE_mutex_Seize()

Tue Sep 30 00:50:20 UTC 2003

Joel Sherrill wrote:
> Till Straumann wrote:
> 
>> Joel Sherrill wrote:
>>
>>> Following up on my own post since I didn't notice the bottom where 
>>> you actually explained
>>> where malloc() was called when I read this early this morning.
>>>
>>> Phil Torre wrote:
>>>
>>>> As unlikely as it sounds, I think we have found a bug in 
>>>> _CORE_mutex_Seize()
>>>> which violates mutual exclusion.
>>>>  
>>>>
>>> It simply assumes that _Thread_Dispatch_disable_level is 0.  So 
>>> calling malloc() from within
>>> a directive which uses dispatch disable locking is dangerous.
>>> I think I have a solution.  Move the body of  the libc start hook to 
>>> the create hook. 
>>
>>
>>
>>
>> This is not gonna work :-( -- rtems_task_create() also uses dispatch
>> disabling... 
> 
> 
> It will work if the allocation mutex allows nesting so the 
> initialization call in cpukit/score/.../apimutex.h
> needs to change from nesting is error.  See my earlier comment below 
> your -- Till. :)
> 

Hmm - unless I'm missing something, I can't see what this has to do
with nesting. Could you explain how this should work, please?

(Scenario: Thread 2 holds the allocator mutex and is suspended; thread 1
            tries to acquire the allocator mutex (nesting allowed) from
            a dispatching disabled section -> blocking of 1 is effectively
            postponed until 1 leaves the dispatching disabled section).

-- Till

> The error checking I proposed won't work without tinkering though.  
> Let's ignore that for the
> first pass.
> 
>> How about allocating the reent structure from the workspace ?
> 
> 
> That's not necessarily a bad idea either.  It is designed for memory 
> allocation inside a directive.
> But if the allocation mutex allows nesting, it isn't necessary.
> 
>>
>>
>> -- Till
>>
>>>  It is probably
>>> also necessary to change cpukit/score/include/rtems/score/apimutex.h 
>>> so that _API_Mutex_Allocate()
>>> creates the Allocator mutex as nestable rather than 
>>> CORE_MUTEX_NESTING_IS_ERROR.
>>>
>>> Finally, it might not be a bad idea for it to be considered a fatal 
>>> RTEMS error if _API_Mutex_Locks
>>> wants to block when _Thread_Dispatch_disable is non-zero.  That would 
>>> be easier than this
>>> happening again and debugging it.
>>>
>>> It might also be valid to consider it a fatal error in a memory 
>>> allocation is attempted when
>>> _Thread_Dispatch_disable is zero.
>>>
>>>> This pertains to rtems-4.6.0pre4 running on MPC860 with an 
>>>> unsubmitted BSP.
>>>> The sequence of events goes like this:
>>>>
>>>>
>>>> 1.    Thread 1 (Init) is running at priority 1.  It creates and 
>>>> starts     thread 2 (notification_task) at priority 196.  Since 
>>>> thread 2 is
>>>>     at a lower priority, it doesn't start executing yet.
>>>>
>>>> 2.    Thread 1 sleeps with rtems_task_wake_after(10 ms) to wait for 
>>>> some
>>>>     external hardware to do something.  As soon as it goes to sleep,
>>>>     thread 2 is now runnable and starts executing.
>>>>
>>>> 3.    Thread 2 does some stuff, and then calls malloc().  Halfway 
>>>> through
>>>>     rtems_region_get_segment(), the 10ms timer set by thread 1 expires.
>>>>     We do a context switch and thread 1 is now running.
>>>>
>>>>     ** Before it lost the CPU, thread 2 had successfully called
>>>> **
>>>>     ** _RTEMS_Lock_allocator().  _RTEMS_Allocator_Mutex is held 
>>>> by    **
>>>>     ** thread 2 when the context switch back to thread 1 occurs.    **
>>>>
>>>> 4.    Thread 1 now calls rtems_start_task(), which invokes malloc(), 
>>>> which
>>>> calls
>>>>     rtems_region_get_segment(), which calls _RTEMS_Lock_allocator().
>>>>
>>>>     _RTEMS_Lock_allocator() returns, *without blocking*.  The allocator
>>>>     mutex is still held by thread 2, yet thread 1 proceeds in the 
>>>> belief
>>>>     that it has the mutex.
>>>>
>>>>     More detail:
>>>>     When thread 1 calls rtems_task_start() in step #4, that function
>>>>     calls _Thread_Get() on the task we want to start.  As a side 
>>>> effect,
>>>>     _Thread_Get() increments _Thread_Dispatch_disable_level to 1.
>>>>
>>>>     Shortly thereafter, _User_extensions_Thread_start() is called, 
>>>> which
>>>>     calls libc_start_hook(), which calls calloc()->malloc()->
>>>>     
>>>> rtems_region_get_segment()->_RTEMS_Lock_allocator()->_CORE_mutex_Seize(). 
>>>>
>>>>     (Note that _Thread_Dispatch_disable_level is stil 1.)
>>>>     _CORE_mutex_Seize_interrupt_trylock() returns 1 (as it should), so
>>>> we
>>>>     call _Thread_Disable_dispatch() (disable level is now 2!) followed
>>>> by
>>>>     _CORE_mutex_Seize_interrupt_blocking() to block on the mutex.
>>>>         Because _Thread_Dispatch_disable_level is 2, the call to
>>>> _Thread_Enable_dispatch()
>>>>     just decrements it to 1 and returns without calling
>>>> _Thread_Dispatch().
>>>>     Thread 1 now happily proceeds to corrupt the heap free block chain.
>>>>    I don't understand the semantics of 
>>>> _Thread_Dispatch_disable_level well
>>>> enough to
>>>> provide a patch.  For now we will work around it by making sure our 
>>>> tasks
>>>> don't call
>>>> malloc() at the same time.  Hopefully those with deep kernel 
>>>> understanding
>>>> can
>>>> take a look at this and tell me if I'm smoking crack.  :)
>>>>
>>>> -Phil
>>>>
>>>>  
>>>>
>>>
>>>
>>>
>>
>>
> 
> 
>