still problem with ARM and Unlimited Task Test

Joel Sherrill joel.sherrill at OARcorp.com
Thu Jun 16 16:58:08 UTC 2011


On 06/16/2011 11:29 AM, Gedare Bloom wrote:
> On Wed, Jun 15, 2011 at 12:07 PM, Joel Sherrill
> <joel.sherrill at oarcorp.com>  wrote:
>> Chris.. cc'ing you on this due to the unlimited nature.
>>
>> Sebastian.. cc'ing you to confirm the heap side-effect for me.
>>
>> On 06/15/2011 09:31 AM, Joachim Rahn wrote:
>>> On 15.06.2011 15:44, Joel Sherrill wrote:
>>>> On 06/15/2011 08:08 AM, Joachim Rahn wrote:
>>>>> Hi Joel,
>>>>>
>>>>> hope I don't bother you too much...
>>>>>
>>>>> Now I found some time beside my main work to come back to my problem
>>>>> with the
>>>>> failing Unlimited Task Test on our ARM board.
>>>>>
>>>>> Any advice regarding the following would be welcome!!!
>>>>> ------------------------------------------------------
>>>>>
>>>>> In fact our problem has nothing directly to do with the update between
>>>>> RTEMS 4.9.3 and 4.9.5 !
>>>>>
>>>>> BTW: We now use the shared code in the BSP tree to initialize the
>>>>> workspace (BSP_BOOTCARD_HANDLES_RAM_ALLOCATION = TRUE)
>>>>>        and we compile everything with all RTEMS and HEAP debugging on.
>>>>>
>>>>> We now have a setup with RTEMS 4.9.3 on an AT91SAM9263-EK evaluation
>>>>> board from Atmel which reproduces
>>>>> our data abort fault.
>>>>>
>>>>> The problem seems to be that under some circumstances the routine
>>>>>
>>>>>           _RTEMS_tasks_Switch_extension(Thread_Control *executing,
>>>>> Thread_Control *heir)
>>>>>
>>>>> will be called with a reference to an executing task (*executing) after
>>>>> the routine
>>>>>
>>>>>           _RTEMS_tasks_Delete_extension(Thread_Control *executing,
>>>>> Thread_Control *deleted)
>>>>>
>>>>> has already deleted that task, and a subsequent call sequence through
>>>>>
>>>>>           _RTEMS_tasks_Free
>>>>>           _Objects_Free
>>>>>           _Objects_Shrink_information
>>>>>
>>>>> ends up in a call to
>>>>>
>>>>>           _Heap_Free(Heap_Control *the_heap, void *starting_address)
>>>>>
>>>>> which frees the memory holding that task's Thread_Control struct and
>>>>> overwrites the pointer "executing->task_variables" (which should now be
>>>>> NULL) with heap bookkeeping data.
>>>>> Because "executing->task_variables" is now corrupted, the call to
>>>>> _RTEMS_tasks_Switch_extension leads to a data_abort.
>>>>>
>>>>> GDB output for the relevant call sequences, walked with the gdb "up"
>>>>> command, looks like:
>>>>>
>>>>> GDB stack walk: _RTEMS_tasks_Delete_extension (executing=0x23fca500,
>>>>> deleted=0x23fca500)
>>>>>                   _User_extensions_Thread_delete (the_thread=0x23fca500)
>>>>>                   _Thread_Close (information=0x2002fd98,
>>>>> the_thread=0x23fca500)
>>>>>                    rtems_task_delete (id=0)
>>>>>                    test_task (my_number=8)
>>>>>                   _Thread_Handler ()
>>>>>                   _Objects_API_maximum_class (api=536959452)
>>>>>
>>>>> GDB stack walk: _Heap_Free (the_heap=0x2002fe4c,
>>>>> starting_address=0x23fca0a0)
>>>>>                   _Workspace_Free (block=0x23fca0a0)
>>>>>                   _Objects_Shrink_information (information=0x2002fd98)
>>>>>                   _Objects_Free (information=0x2002fd98,
>>>>> the_object=0x23fca500)
>>>>>                   _RTEMS_tasks_Free (the_task=0x23fca500)
>>>>>                    rtems_task_delete (id=0)
>>>>>                    test_task (my_number=8)
>>>>>                   _Thread_Handler ()
>>>>>                   _Objects_API_maximum_class (api=536959452)
>>>>>
>>>>> NOW "executing->task_variables" IS CORRUPTED !!!!!
>>>>>
>>>>> GDB stack walk: _RTEMS_tasks_Switch_extension (executing=0x23fca500,
>>>>> heir=0x23fac488)
>>>>>                   _User_extensions_Thread_switch (executing=0x23fca500,
>>>>> heir=0x23fac488)
>>>>>                   _Thread_Dispatch ()
>>>>>                   _Thread_Enable_dispatch ()
>>>>>                    rtems_task_delete (id=0)
>>>>>                    test_task (my_number=8)
>>>>>                   _Thread_Handler ()
>>>>>                   _Objects_API_maximum_class (api=536959452)
>>>>>
>>>>>
>>>>> By chance, "next_block->prev_size" in the call to _Heap_Free occupies the
>>>>> same location as "executing->task_variables" in the affected
>>>>> Thread_Control struct, so _RTEMS_tasks_Switch_extension tries to access
>>>>> a bad memory location, which of course leads to a data_abort.
>>>>> Maybe under other circumstances one would never stumble upon this?
>>>>>
>>>> The memory has indeed been freed and is not supposed to be used.
>>>> In fact, executing->task_variables should be NULL.  I see it set
>>>> to NULL in _RTEMS_tasks_Delete_extension.  Can you verify that?
>>>>
>>> YES: I've verified it, executing->task_variables is set to NULL by
>>> _RTEMS_tasks_Delete_extension!
>>>
>>> BUT: after _Heap_Free has been called, executing->task_variables is altered
>>>       because the _Heap_Free routine now expects next_block->prev_size at
>>>       the former location of executing->task_variables and alters it to
>>>       3096 or 0xCE0.
>>>
>>>       The following call to _RTEMS_tasks_Switch_extension checks whether
>>>       executing->task_variables is NULL, but it is now 0xCE0, i.e. NOT NULL.
>> I hope this is reproducible enough to verify this.
>>
>> Can you set a watchpoint after the memory location that is set to
>> NULL and then overwritten?  I don't see any reasonable way for the
>> heap to write to this address.   task_variables is the last field
>> in the Thread_Control block.  TCBs are pre-allocated and never
>> freed back to the heap except for "shrinking" in the unlimited
>> object case.
>>
>> What I suspect is happening, is that when the task in question
>> is deleted, _Objects_Shrink_information is being called and freeing
>> a "chunk" of unused TCBs.  If the array of TCBS EXACTLY meets the
>> alignment, then when it is freed, there will be no pad at the end
>> of it since it is a multiple of 4 (not sure what else).  The last four
>> bytes must be getting overwritten when the memory is freed.
>>
>> Chris .. Sebastian.. does that seem possible?
>>
>> One way to check this is to add a few unused fields to the end
>> of the Thread_Control, set them to 0 when initialized and
>> check that they are overwritten.
>>
>> Permanent fixes include:
>>
>> + checking for dormant start in switch extension
> If I understand correctly, the root of the problem is that the switch
> extension is called on a task switch when the executing task has been
> deleted and is being removed from consideration, i.e. its state is dormant.
> The free'd memory should not be accessed by any switch extension. I think
> checking for dormant state is a better fix than the latter. I would put it
> in the score function _User_extensions_Thread_switch; this introduces a
> single if statement but allows extensions to remain unchanged.
>
I thought about this but the problem is that most extensions
are written as:

switch( executing, heir )
{
   save something on executing
   restore something for heir
}

If you skip the switch extensions entirely when executing is
dormant, you lose the ability to do something for the heir.

The check logic would have to be in every switch extension
implementation. :(
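
As an illustration of what such a per-extension check could look like,
here is a minimal sketch. It uses hypothetical stand-in names
(demo_thread, DEMO_STATES_DORMANT, demo_global) rather than the real
Thread_Control and STATES_DORMANT definitions, so treat it as a shape to
adapt and not as RTEMS code. The save half is skipped when the executing
task is dormant; the restore half still runs for the heir.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the RTEMS types; not the real definitions. */
#define DEMO_STATES_DORMANT 0x1u

typedef struct {
  unsigned  current_state;  /* bit set of DEMO_STATES_* flags */
  int      *saved_value;    /* per-task slot; NULL when the task has none */
} demo_thread;

static int demo_global;     /* the value being swapped on each switch */

static bool demo_is_dormant(const demo_thread *t)
{
  return (t->current_state & DEMO_STATES_DORMANT) != 0;
}

/* Switch extension that guards only the "executing" half. */
static void demo_switch_extension(demo_thread *executing, demo_thread *heir)
{
  assert(executing != NULL && heir != NULL);

  /* Save side: skipped when the outgoing task has already been deleted,
   * because its control block may have been handed back to the heap. */
  if (!demo_is_dormant(executing) && executing->saved_value != NULL)
    *executing->saved_value = demo_global;

  /* Restore side: always runs, so the heir still gets its value. */
  if (heir->saved_value != NULL)
    demo_global = *heir->saved_value;
}
```

With this shape the fields of a deleted task are never dereferenced,
while the heir is still serviced. The cost is that the same guard has to
be repeated in each switch extension.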
>> + moving something large-ish like Start down to the
>> end of the TCB structure.  Then when freed like this, it
>> won't matter since it has already been used and won't
>> be referenced.
>>
> This might fix the problem this time, but it is a hack and introduces
> undefined behavior (accessing/using memory that was freed).
>
Yep.  I was only considering this for release branches as
a workaround.  Alternatively adding a pad field to the TCB
on the release branches is a solution but that adds to the
memory requirement for all applications.

The cleanest solution is to defer actually freeing the memory.
This has always been a potential issue with delete(SELF). You
have to continue to use the stack and we accounted for that
on single core.  But SMP makes this even more dangerous.  I want to
do this on the head.
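
One possible shape for deferring the free is sketched below. This is
only an assumption about how it might be structured, not a description
of any existing or planned RTEMS code: the names (demo_tcb, zombie_list,
demo_delete_self, demo_reap_zombies) are hypothetical, malloc/free stand
in for the workspace allocator, and there is no locking, so it models a
single core only. A task deleting itself is queued, and its memory is
released later from some other context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model: a task control block that is reclaimed lazily. */
typedef struct demo_tcb {
  struct demo_tcb *next_zombie;  /* link used only while awaiting reclaim */
  int              id;
} demo_tcb;

static demo_tcb *zombie_list;    /* deleted tasks whose memory is still live */

/* Called by a task deleting itself: queue the TCB instead of freeing it,
 * so the memory stays valid until the context switch has completed. */
static void demo_delete_self(demo_tcb *self)
{
  assert(self != NULL);
  self->next_zombie = zombie_list;
  zombie_list = self;
}

/* Called later from a different context, once the deleted task can no
 * longer be the executing thread. Returns the number of TCBs freed. */
static size_t demo_reap_zombies(void)
{
  size_t freed = 0;

  while (zombie_list != NULL) {
    demo_tcb *dead = zombie_list;
    zombie_list = dead->next_zombie;
    free(dead);
    ++freed;
  }
  return freed;
}
```

The point of the design is that nothing the outgoing task still touches
(its control block or stack) goes back to the heap while it is still
executing.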

We need a simple non-invasive solution for the release branches.

This case is insidious because it is essentially a random
combination of proper alignment, size of the allocation
and the fact it is a task deleting itself and shrinking
the set of unlimited objects.   This case can't occur without
unlimited objects enabled.
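
To make the overlap concrete, here is a simplified model of a
boundary-tag heap in which the last word handed to the caller shares
storage with the next block's prev_size field, as described earlier in
the thread. The layout, the names (demo_*), and the choice to measure
sizes in words are all assumptions made for illustration; this is not
the actual RTEMS heap code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Word offsets inside a block header (illustrative, not the RTEMS layout). */
enum { DEMO_PREV_SIZE = 0, DEMO_SIZE = 1, DEMO_HEADER_WORDS = 2 };

/* The next block starts block[DEMO_SIZE] words after this one. */
static uintptr_t *demo_next_block(uintptr_t *block)
{
  assert(block != NULL && block[DEMO_SIZE] >= DEMO_HEADER_WORDS);
  return block + block[DEMO_SIZE];
}

/* First word handed to the caller. */
static uintptr_t *demo_user_begin(uintptr_t *block)
{
  return block + DEMO_HEADER_WORDS;
}

/* Last word handed to the caller: it aliases the successor's prev_size,
 * which the heap only treats as meaningful once this block is free. */
static uintptr_t *demo_user_last(uintptr_t *block)
{
  return demo_next_block(block) + DEMO_PREV_SIZE;
}

/* Freeing records this block's size in the successor's prev_size field,
 * which overwrites whatever the caller kept in its last word. */
static void demo_free(uintptr_t *block)
{
  demo_next_block(block)[DEMO_PREV_SIZE] = block[DEMO_SIZE];
}
```

In this model, a structure that exactly fills the allocation has its
final field (task_variables in the case discussed here) sitting on the
word that demo_free overwrites. A NULL stored there before the free
reads back as a non-zero size afterwards.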

Joachim.. does the patch I posted even solve the issue?
>> If we decide the second alternative is better, then there
>> needs to be some documentation.
>>
>> If the first is implemented, then unfortunately, I think
>> it needs to be in every switch extension.
>>
>>> <...snip... cpukit/rtems/src/tasks.c>
>>>
>>> void _RTEMS_tasks_Switch_extension(
>>>    Thread_Control *executing,
>>>    Thread_Control *heir
>>> )
>>> {
>>>    rtems_task_variable_t *tvp;
>>>
>>>    /*
>>>     *  Per Task Variables
>>>     */
>>>
>>>    tvp = executing->task_variables;
>>>    while (tvp) {
>>>      tvp->tval = *tvp->ptr;
>>>
>>> <...snip...>
>>>
>>>       therefore the check for NULL fails and the last line of code in the
>>> snippet results in a data abort...
>>>
>>>
>>>> I think the extensions should ensure they are not operating on
>>>> a deleted task.  The extensions pointers and task variable
>>>> pointer should be NULL at this point.  Worst case, they can
>>>> check the state of executing and if it has STATES_DORMANT set,
>>>> then don't do anything for executing.
>>>>
>>>> I checked the 4.9 source for this part of the Classic API extensions.
>>>> They are setting things to NULL and the switch extension is checking
>>>> it.
>>>>
>>>> FWIW there is a PR outstanding spotted on SMP work where the
>>>> thread stack is freed and potentially reallocated for some other
>>>> purpose before the delete(SELF) task is finished switching out.
>>>> I don't think that's happening here but it is worth mentioning.
>>>>
>>>>> BTW: When I change (as a test) the definition of CPU_HEAP_ALIGNMENT in
>>>>>        ..../rtems/cpukit/score/cpu/arm/rtems/score/cpu.h
>>>>>        from CPU_ALIGNMENT (which is 4) to something larger than
>>>>> CPU_ALIGNMENT,
>>>>>        the unlimited test works fine.
>>>>>
>>>>> Any idea or advice ...?
>>>>>
>>>>> Regards,
>>>>> Joachim
>>>>>
>>>>> On 01.03.2011 17:09, Joel Sherrill wrote:
>>>>>> On 03/01/2011 07:48 AM, Joachim Rahn wrote:
>>>>>>> Hi all,
>>>>>>>
>>>>>>> after updating from rtems-4.9.3 to rtems-4.9.5 the "Unlimited Task
>>>>>>> Test" on my
>>>>>>> ARM cpu at91sam9263 fails with a message like...
>>>>>>>
>>>>>>> [...skip...]
>>>>>>> task 19 ending.
>>>>>>> task 20 ending.
>>>>>>> task 21 ending.
>>>>>>> task 7 ending.
>>>>>>> task 8 ending.
>>>>>>>
>>>>>>> INSN_LDR
>>>>>>> data_abort at address 0x20018CD8, instruction: 0xE5932000,   spsr =
>>>>>>> 0x20000013
>>>>>>> active thread thread 0x0A010001
>>>>>>> Previous sp=0x200629A8 lr=0x200135E0 and actual cpsr=60000097
>>>>>>>     0x20038E30 0x20056EA8 0x0000117C 0x200629E0 0x200629C4 0x200135E0
>>>>>>>     0x20018CB8 0x20038E30 0x20056EA8 0x20026EC0 0x20026EC0 0x20062A18
>>>>>>>     0x200629E4 0x20010100 0x200135AC 0x00000000 0x00000000 0x00000000
>>>>>>>     0x00000000 0x20056EA8 0x20038E30 0x60000013 0x600000D3 0x00000000
>>>>>>>     0x00000000 0x20062A28 0x20062A1C 0x2000AE48 0x2000FFF8 0x20062A4C
>>>>>>>     0x20062A2C 0x2000ADA0 0x2000AE24 0x521C9845 0x20056EA8 0x00000000
>>>>>>>     0x00000000 0x2002ACD8 0x20062A64 0x20062A50 0x20000348 0x2000AD28
>>>>>>>     0x00000008 0x00000001 0x20062A84 0x20062A68 0x2001C2D4 0x20000310
>>>>>>>
>>>>>>> [...skip...]
>>>>>>>
>>>>>>> which commonly means the cpu tries to access unavailable memory.
>>>>>>>
>>>>>>> After removing the fix for bug 1718, the "Unlimited Task Test" works
>>>>>>> fine.
>>>>>>>
>>>>>>> (https://www.rtems.org/bugzilla/show_bug.cgi?id=1718)
>>>>>>>
>>>>>>> *** rtems-4.9.3: ./cpukit/sapi/include/confdefs.h *** unlimited task
>>>>>>> test works
>>>>>>> [...skip...]
>>>>>>>
>>>>>>>      #define CONFIGURE_MEMORY_PER_TASK_FOR_POSIX_API \
>>>>>>>        _Configure_From_workspace( \
>>>>>>>          sizeof (POSIX_API_Control) + \
>>>>>>>         (sizeof (void *) * (CONFIGURE_MAXIMUM_POSIX_KEYS)) \
>>>>>>>        )
>>>>>>>
>>>>>>> [...skip...]
>>>>>>>
>>>>>>> *** rtems-4.9.5: ./cpukit/sapi/include/confdefs.h *** unlimited task
>>>>>>> test doesn't work
>>>>>>> [...skip...]
>>>>>>>      #define CONFIGURE_MEMORY_PER_TASK_FOR_POSIX_API \
>>>>>>>        _Configure_From_workspace( \
>>>>>>>          CONFIGURE_MINIMUM_TASK_STACK_SIZE + \
>>>>>>>          sizeof (POSIX_API_Control) + \
>>>>>>>         (sizeof (void *) * (CONFIGURE_MAXIMUM_POSIX_KEYS)) \
>>>>>>>        )
>>>>>>> [...skip...]
>>>>>>>
>>>>>>> Any hints, or does anyone observe the same?
>>>>>>>
>>>>>> That patch wouldn't directly cause that failure.
>>>>>> The only thing I can see is that it does change the
>>>>>> amount of workspace reserved up front (by a lot).
>>>>>>
>>>>>> Is this a BSP which is in the RTEMS tree?  I am
>>>>>> suspicious that there isn't enough memory for
>>>>>> the workspace/heap and the BSP initialization
>>>>>> isn't recognizing this.  Eventually the task stacks,
>>>>>> heap, etc all collide, there is corruption and you crash.
>>>>>>
>>>>>> So we would need to know the following:
>>>>>>
>>>>>> + address of end of BSS
>>>>>> + start of memory for heap and length
>>>>>> + start of memory for RTEMS workspace and length.
>>>>>> + amount of RAM
>>>>>>
>>>>>> Assuming that the workspace/heap are from end of
>>>>>> BSS to the end of RAM.
>>>>>>> Cheers
>>>>>>>
>>>>>>> --
>>>>>>> Joachim
>>>>>>>
>>>>>>> ________________________________
>>>>>>>
>>>>>>> Helmholtz-Zentrum Berlin für Materialien und Energie GmbH
>>>>>>>
>>>>>>> Mitglied der Hermann von Helmholtz-Gemeinschaft Deutscher
>>>>>>> Forschungszentren e.V
>>>>>>>
>>>>>>> Aufsichtsrat: Vorsitzender Prof. Dr. Dr. h.c. mult. Joachim Treusch,
>>>>>>> stv. Vorsitzende Dr. Beatrix Vierkorn- Rudolph
>>>>>>> Geschäftsführer: Prof. Dr. Anke Rita Kaysser-Pyzalla, Prof. Dr. Dr.
>>>>>>> h.c. Wolfgang Eberhardt, Dr. Ulrich Breuer
>>>>>>>
>>>>>>> Sitz Berlin, AG Charlottenburg, 89 HRB 5583
>>>>>>>
>>>>>>> Postadresse:
>>>>>>> Hahn-Meitner-Platz 1
>>>>>>> D-14109 Berlin
>>>>>>>
>>>>>>> http://www.helmholtz-berlin.de
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> rtems-users mailing list
>>>>>>> rtems-users at rtems.org
>>>>>>> http://www.rtems.org/mailman/listinfo/rtems-users
>>
>> --
>> Joel Sherrill, Ph.D.             Director of Research & Development
>> joel.sherrill at OARcorp.com        On-Line Applications Research
>> Ask me about RTEMS: a free RTOS  Huntsville AL 35805
>>    Support Available             (256) 722-9985






