Possible bug in _CORE_mutex_Seize()

Tue Sep 30 13:14:26 UTC 2003

Till Straumann wrote:
> Joel Sherrill wrote:
> 
>> Till Straumann wrote:
>>
>>> Joel Sherrill wrote:
>>>
>>>> Till Straumann wrote:
>>>>
>>>>> Joel Sherrill wrote:
>>>>>
>>>>>> Following up on my own post since I didn't notice the bottom where 
>>>>>> you actually explained
>>>>>> where malloc() was called when I read this early this morning.
>>>>>>
>>>>>> Phil Torre wrote:
>>>>>>
>>>>>>> As unlikely as it sounds, I think we have found a bug in 
>>>>>>> _CORE_mutex_Seize()
>>>>>>> which violates mutual exclusion.
>>>>>>>  
>>>>>>>
>>>>>> It simply assumes that _Thread_Dispatch_disable_level is 0.  So 
>>>>>> calling malloc() from within
>>>>>> a directive which uses dispatch disable locking is dangerous.
>>>>>> I think I have a solution.  Move the body of  the libc start hook 
>>>>>> to the create hook. 
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> This is not gonna work :-( -- rtems_task_create() also uses dispatch
>>>>> disabling... 
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> It will work if the allocation mutex allows nesting so the 
>>>> initialization call in cpukit/score/.../apimutex.h
>>>> needs to change from nesting is error.  See my earlier comment below 
>>>> your -- Till. :)
>>>>
>>>
>>> Hmm - unless I'm missing something, I can't see what this has to do
>>> with nesting. Could you explain how this should work, please?
>>>
>>> (Scenario: Thread 2 holds the allocator mutex and is suspended; thread 1
>>>            tries to acquire the allocator mutex (nesting allowed) from
>>>            a dispatching disabled section -> blocking of 1 is 
>>> effectively
>>>            postponed until 1 leaves the dispatching disabled section).
>>
>>
>>
>>
>> It doesn't unless you move the malloc to the create hook.   Also your 
>> patch doesn't
>> check for a NULL pointer from Workspace_Allocate which is a distinct 
>> possibility. 
> 
> 
> 
> You're right - 'memset()' should be called after checking the ptr for 
> NULL, of course.

And false returned if it is NULL.

>> I generally like using the workspace instead of the main heap.  But 
>> all code allocating
>> memory should be in create hooks -- not start hooks.  Then the nesting 
>> will work since
>> it is inside another allocator critical section. 
> 
> 
> ??? Sorry for being a pain - I still don't understand. Would you mind 
> explaining
> this in a little more detail, please? When I look at 
> rtems_task_create(), I don't see where the
> allocator mutex is acquired outside of the _Thread_Dispatch_disable() 
> protected section
> (as a matter of fact, I don't see any nested RTEMS_Lock_allocator during 
> 'create' at all)...

You are right.  I am thinking in future tense.  When I implemented the
allocator mutex to address the context switch latency you were seeing
in checking how much free memory was available, I thought about making
all create/delete operations use the allocator mutex.  At the time,
I decided that I had changed enough behavior and wanted to hold off
for stability.  I still think it is the right thing to do and may put 
that on my 4.7 list.

>>
>>
>>>
>>> -- Till
>>>
>>>
>>>> The error checking I proposed won't work without tinkering though.  
>>>> Let's ignore that for the
>>>> first pass.
>>>>
>>>>> How about allocating the reent structure from the workspace ?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> That's not necessarily a bad idea either.  It is designed for memory 
>>>> allocation inside a directive.
>>>> But if the allocation mutex allows nesting, it isn't necessary.
>>>>
>>>>>
>>>>>
>>>>> -- Till
>>>>>
>>>>>>  It is probably
>>>>>> also necessary to change 
>>>>>> cpukit/score/include/rtems/score/apimutex.h so that 
>>>>>> _API_Mutex_Allocate()
>>>>>> creates the Allocator mutex as nestable rather than 
>>>>>> CORE_MUTEX_NESTING_IS_ERROR.
>>>>>>
>>>>>> Finally, it might not be a bad idea for it to be considered a 
>>>>>> fatal RTEMS error if _API_Mutex_Locks
>>>>>> wants to block when _Thread_Dispatch_disable is non-zero.  That 
>>>>>> would be easier than this
>>>>>> happening again and debugging it.
>>>>>>
>>>>>> It might also be valid to consider it a fatal error in a memory 
>>>>>> allocation is attempted when
>>>>>> _Thread_Dispatch_disable is zero.
>>>>>>
>>>>>>> This pertains to rtems-4.6.0pre4 running on MPC860 with an 
>>>>>>> unsubmitted BSP.
>>>>>>> The sequence of events goes like this:
>>>>>>>
>>>>>>>
>>>>>>> 1.    Thread 1 (Init) is running at priority 1.  It creates and 
>>>>>>> starts     thread 2 (notification_task) at priority 196.  Since 
>>>>>>> thread 2 is
>>>>>>>     at a lower priority, it doesn't start executing yet.
>>>>>>>
>>>>>>> 2.    Thread 1 sleeps with rtems_task_wake_after(10 ms) to wait 
>>>>>>> for some
>>>>>>>     external hardware to do something.  As soon as it goes to sleep,
>>>>>>>     thread 2 is now runnable and starts executing.
>>>>>>>
>>>>>>> 3.    Thread 2 does some stuff, and then calls malloc().  Halfway 
>>>>>>> through
>>>>>>>     rtems_region_get_segment(), the 10ms timer set by thread 1 
>>>>>>> expires.
>>>>>>>     We do a context switch and thread 1 is now running.
>>>>>>>
>>>>>>>     ** Before it lost the CPU, thread 2 had successfully called
>>>>>>> **
>>>>>>>     ** _RTEMS_Lock_allocator().  _RTEMS_Allocator_Mutex is held 
>>>>>>> by    **
>>>>>>>     ** thread 2 when the context switch back to thread 1 
>>>>>>> occurs.    **
>>>>>>>
>>>>>>> 4.    Thread 1 now calls rtems_start_task(), which invokes 
>>>>>>> malloc(), which
>>>>>>> calls
>>>>>>>     rtems_region_get_segment(), which calls _RTEMS_Lock_allocator().
>>>>>>>
>>>>>>>     _RTEMS_Lock_allocator() returns, *without blocking*.  The 
>>>>>>> allocator
>>>>>>>     mutex is still held by thread 2, yet thread 1 proceeds in the 
>>>>>>> belief
>>>>>>>     that it has the mutex.
>>>>>>>
>>>>>>>     More detail:
>>>>>>>     When thread 1 calls rtems_task_start() in step #4, that function
>>>>>>>     calls _Thread_Get() on the task we want to start.  As a side 
>>>>>>> effect,
>>>>>>>     _Thread_Get() increments _Thread_Dispatch_disable_level to 1.
>>>>>>>
>>>>>>>     Shortly thereafter, _User_extensions_Thread_start() is 
>>>>>>> called, which
>>>>>>>     calls libc_start_hook(), which calls calloc()->malloc()->
>>>>>>>     
>>>>>>> rtems_region_get_segment()->_RTEMS_Lock_allocator()->_CORE_mutex_Seize(). 
>>>>>>>
>>>>>>>     (Note that _Thread_Dispatch_disable_level is stil 1.)
>>>>>>>     _CORE_mutex_Seize_interrupt_trylock() returns 1 (as it 
>>>>>>> should), so
>>>>>>> we
>>>>>>>     call _Thread_Disable_dispatch() (disable level is now 2!) 
>>>>>>> followed
>>>>>>> by
>>>>>>>     _CORE_mutex_Seize_interrupt_blocking() to block on the mutex.
>>>>>>>         Because _Thread_Dispatch_disable_level is 2, the call to
>>>>>>> _Thread_Enable_dispatch()
>>>>>>>     just decrements it to 1 and returns without calling
>>>>>>> _Thread_Dispatch().
>>>>>>>     Thread 1 now happily proceeds to corrupt the heap free block 
>>>>>>> chain.
>>>>>>>    I don't understand the semantics of 
>>>>>>> _Thread_Dispatch_disable_level well
>>>>>>> enough to
>>>>>>> provide a patch.  For now we will work around it by making sure 
>>>>>>> our tasks
>>>>>>> don't call
>>>>>>> malloc() at the same time.  Hopefully those with deep kernel 
>>>>>>> understanding
>>>>>>> can
>>>>>>> take a look at this and tell me if I'm smoking crack.  :)
>>>>>>>
>>>>>>> -Phil
>>>>>>>
>>>>>>>  
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
> 
> 

-- 
Joel Sherrill, Ph.D.             Director of Research & Development
joel at OARcorp.com                 On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
Support Available                (256) 722-9985