Possible bug in _CORE_mutex_Seize()

Mon Sep 29 21:51:17 UTC 2003

Till Straumann wrote:

> Joel Sherrill wrote:
> 
>> Following up on my own post since I didn't notice the bottom where you 
>> actually explained
>> where malloc() was called when I read this early this morning.
>>
> 
> Funny - I had missed that, too...

Easy enough to do even though Phil did a great job on the write-up.

>> Phil Torre wrote:

Phil.. are you able to implement a patch based upon my suggestion
or do you need me to generate an untested patch for you?

>> I think I have a solution.  Move the body of  the libc start hook to 
>> the create hook.   It is probably
>> also necessary to change cpukit/score/include/rtems/score/apimutex.h 
>> so that _API_Mutex_Allocate()
>> creates the Allocator mutex as nestable rather than 
>> CORE_MUTEX_NESTING_IS_ERROR.
>>
>> Finally, it might not be a bad idea for it to be considered a fatal 
>> RTEMS error if _API_Mutex_Locks
>> wants to block when _Thread_Dispatch_disable is non-zero.  That would 
>> be easier than this
>> happening again and debugging it.
>>
>> It might also be valid to consider it a fatal error in a memory 
>> allocation is attempted when
>> _Thread_Dispatch_disable is zero.
>>
> 
> you mean _Thread_Dispatch_disable_level is non-zero?

Yes.  It is an application or RTEMS design error to get into
this situation so considering it a fatal error encourages it to
be fixed in the lab.

> Joel - are you going to put out a patch for this? We soon are
> going into production and issues like this one make me nervous ;-)

They make me nervous also.  I had already cut 4.6.0 when I noticed
that Solaris did not have a complete set of tool binaries.   :(

> IMHO, it would be nice to have pre5 out ASAP, as there are a
> few critical fixes which have been closed since pre4

I think this is in order.  We need the fix for this in a pre5 though.

Especially with the rtems 4.6 specific
toolset implemented since pre4.  You will now be able to keep 4.6
tools in parallel with snapshot tools.  Plus I rewrote the Windows
chapter in the Getting Started guide.

NOTE: If anyone wants to enhance the sample application discussion
in the getting started guide to include all samples, it would be
welcomed.

> -- Till
> 
>>> This pertains to rtems-4.6.0pre4 running on MPC860 with an 
>>> unsubmitted BSP.
>>> The sequence of events goes like this:
>>>
>>>
>>> 1.    Thread 1 (Init) is running at priority 1.  It creates and 
>>> starts     thread 2 (notification_task) at priority 196.  Since 
>>> thread 2 is
>>>     at a lower priority, it doesn't start executing yet.
>>>
>>> 2.    Thread 1 sleeps with rtems_task_wake_after(10 ms) to wait for some
>>>     external hardware to do something.  As soon as it goes to sleep,
>>>     thread 2 is now runnable and starts executing.
>>>
>>> 3.    Thread 2 does some stuff, and then calls malloc().  Halfway 
>>> through
>>>     rtems_region_get_segment(), the 10ms timer set by thread 1 expires.
>>>     We do a context switch and thread 1 is now running.
>>>
>>>     ** Before it lost the CPU, thread 2 had successfully called
>>> **
>>>     ** _RTEMS_Lock_allocator().  _RTEMS_Allocator_Mutex is held by    **
>>>     ** thread 2 when the context switch back to thread 1 occurs.    **
>>>
>>> 4.    Thread 1 now calls rtems_start_task(), which invokes malloc(), 
>>> which
>>> calls
>>>     rtems_region_get_segment(), which calls _RTEMS_Lock_allocator().
>>>
>>>     _RTEMS_Lock_allocator() returns, *without blocking*.  The allocator
>>>     mutex is still held by thread 2, yet thread 1 proceeds in the belief
>>>     that it has the mutex.
>>>
>>>     More detail:
>>>     When thread 1 calls rtems_task_start() in step #4, that function
>>>     calls _Thread_Get() on the task we want to start.  As a side effect,
>>>     _Thread_Get() increments _Thread_Dispatch_disable_level to 1.
>>>
>>>     Shortly thereafter, _User_extensions_Thread_start() is called, which
>>>     calls libc_start_hook(), which calls calloc()->malloc()->
>>>     
>>> rtems_region_get_segment()->_RTEMS_Lock_allocator()->_CORE_mutex_Seize(). 
>>>
>>>     (Note that _Thread_Dispatch_disable_level is stil 1.)
>>>     _CORE_mutex_Seize_interrupt_trylock() returns 1 (as it should), so
>>> we
>>>     call _Thread_Disable_dispatch() (disable level is now 2!) followed
>>> by
>>>     _CORE_mutex_Seize_interrupt_blocking() to block on the mutex.
>>>         Because _Thread_Dispatch_disable_level is 2, the call to
>>> _Thread_Enable_dispatch()
>>>     just decrements it to 1 and returns without calling
>>> _Thread_Dispatch().
>>>     Thread 1 now happily proceeds to corrupt the heap free block chain.
>>>    
>>> I don't understand the semantics of _Thread_Dispatch_disable_level well
>>> enough to
>>> provide a patch.  For now we will work around it by making sure our 
>>> tasks
>>> don't call
>>> malloc() at the same time.  Hopefully those with deep kernel 
>>> understanding
>>> can
>>> take a look at this and tell me if I'm smoking crack.  :)
>>>
>>> -Phil
>>>
>>>  
>>>
>>
>>
>>
> 
> 

-- 
Joel Sherrill, Ph.D.             Director of Research & Development
joel at OARcorp.com                 On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
Support Available                (256) 722-9985