still problem with ARM and Unlimited Task Test

Joachim Rahn Joachim.Rahn at helmholtz-berlin.de
Fri Jun 17 06:39:29 UTC 2011


On 16.06.2011 18:58, Joel Sherrill wrote:
> On 06/16/2011 11:29 AM, Gedare Bloom wrote:
>> On Wed, Jun 15, 2011 at 12:07 PM, Joel Sherrill
>> <joel.sherrill at oarcorp.com>  wrote:
>>> Chris.. cc'ing you on this due to the unlimited nature.
>>>
>>> Sebastian.. cc'ing you to confirm the heap side-effect.
>>>
>>> On 06/15/2011 09:31 AM, Joachim Rahn wrote:
>>>> On 15.06.2011 15:44, Joel Sherrill wrote:
>>>>> On 06/15/2011 08:08 AM, Joachim Rahn wrote:
>>>>>> Hi Joel,
>>>>>>
>>>>>> hope I don't bother you too much...
>>>>>>
>>>>>> Now I have found some time beside my main work to come back to my
>>>>>> problem with the failing Unlimited Task Test on our ARM board.
>>>>>>
>>>>>> Any advice regarding the following would be welcome!!!
>>>>>> ------------------------------------------------------
>>>>>>
>>>>>> In fact our problem has nothing directly to do with the update between
>>>>>> RTEMS 4.9.3 and 4.9.5 !
>>>>>>
>>>>>> BTW: We now use the shared code in the BSP tree to initialize the
>>>>>> workspace (BSP_BOOTCARD_HANDLES_RAM_ALLOCATION = TRUE)
>>>>>>        and we compile everything with all RTEMS and HEAP debugging on.
>>>>>>
>>>>>> We now have a setup with RTEMS 4.9.3 on an AT91SAM9263-EK evaluation
>>>>>> board from Atmel which reproduces
>>>>>> our data abort fault.
>>>>>>
>>>>>> The problem seems to be that under some circumstances the routine
>>>>>>
>>>>>>           _RTEMS_tasks_Switch_extension(Thread_Control *executing,
>>>>>> Thread_Control *heir)
>>>>>>
>>>>>> will be called with a reference to an executing task (*executing) after
>>>>>> the routine
>>>>>>
>>>>>>           _RTEMS_tasks_Delete_extension(Thread_Control *executing,
>>>>>> Thread_Control *deleted)
>>>>>>
>>>>>> has already deleted that task, and the subsequent call sequence
>>>>>>
>>>>>>           _RTEMS_tasks_Free
>>>>>>           _Objects_Free
>>>>>>           _Objects_Shrink_information
>>>>>>
>>>>>> ends up in a call to
>>>>>>
>>>>>>           _Heap_Free(Heap_Control *the_heap, void *starting_address)
>>>>>>
>>>>>> which frees the memory used by that task's Thread_Control struct and
>>>>>> overwrites the pointer "executing->task_variables" (which should now be
>>>>>> NULL) with heap information.
>>>>>> Because "executing->task_variables" is now corrupted, the call to
>>>>>> _RTEMS_tasks_Switch_extension
>>>>>> leads to a data_abort.
>>>>>>
>>>>>> GDB output of the relevant call sequences (walked with the gdb "up"
>>>>>> command) looks like:
>>>>>>
>>>>>> GDB stack walk: _RTEMS_tasks_Delete_extension (executing=0x23fca500,
>>>>>> deleted=0x23fca500)
>>>>>>                   _User_extensions_Thread_delete (the_thread=0x23fca500)
>>>>>>                   _Thread_Close (information=0x2002fd98,
>>>>>> the_thread=0x23fca500)
>>>>>>                    rtems_task_delete (id=0)
>>>>>>                    test_task (my_number=8)
>>>>>>                   _Thread_Handler ()
>>>>>>                   _Objects_API_maximum_class (api=536959452)
>>>>>>
>>>>>> GDB stack walk: _Heap_Free (the_heap=0x2002fe4c,
>>>>>> starting_address=0x23fca0a0)
>>>>>>                   _Workspace_Free (block=0x23fca0a0)
>>>>>>                   _Objects_Shrink_information (information=0x2002fd98)
>>>>>>                   _Objects_Free (information=0x2002fd98,
>>>>>> the_object=0x23fca500)
>>>>>>                   _RTEMS_tasks_Free (the_task=0x23fca500)
>>>>>>                    rtems_task_delete (id=0)
>>>>>>                    test_task (my_number=8)
>>>>>>                   _Thread_Handler ()
>>>>>>                   _Objects_API_maximum_class (api=536959452)
>>>>>>
>>>>>> NOW "executing->task_variables" IS CORRUPTED !!!!!
>>>>>>
>>>>>> GDB stack walk: _RTEMS_tasks_Switch_extension (executing=0x23fca500,
>>>>>> heir=0x23fac488)
>>>>>>                   _User_extensions_Thread_switch (executing=0x23fca500,
>>>>>> heir=0x23fac488)
>>>>>>                   _Thread_Dispatch ()
>>>>>>                   _Thread_Enable_dispatch ()
>>>>>>                    rtems_task_delete (id=0)
>>>>>>                    test_task (my_number=8)
>>>>>>                   _Thread_Handler ()
>>>>>>                   _Objects_API_maximum_class (api=536959452)
>>>>>>
>>>>>>
>>>>>> By chance, next_block->prev_size, which _Heap_Free writes, occupies
>>>>>> the same location as "executing->task_variables" in that
>>>>>> Thread_Control struct, and therefore _RTEMS_tasks_Switch_extension
>>>>>> tries to access a bad memory location, which of course leads to a
>>>>>> data_abort.
>>>>>> Maybe under other circumstances one would never stumble upon this?
>>>>>>
>>>>> The memory has indeed been freed and is not supposed to be used.
>>>>> In fact, executing->task_variables should be NULL.  I see it set
>>>>> to NULL in _RTEMS_tasks_Delete_extension.  Can you verify that?
>>>>>
>>>> YES: I've verified it, executing->task_variables is set to NULL by
>>>> _RTEMS_tasks_Delete_extension!
>>>>
>>>> BUT: after _Heap_Free has been called, executing->task_variables is altered,
>>>>       because _Heap_Free now treats the former location of
>>>>       executing->task_variables as next_block->prev_size and sets it to
>>>>       3296 (0xCE0).
>>>>
>>>>       The following call to _RTEMS_tasks_Switch_extension checks whether
>>>>       executing->task_variables is NULL,
>>>>       but it is now 0xCE0, i.e. NOT NULL.
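A minimal C model of the overlap described above, assuming (as the thread suggests) that the freed TCB chunk ends exactly where the next heap block's header begins, so that the next block's prev_size word coincides with the TCB's last field. The names model_tcb, memory_words and model_heap_free are hypothetical; this is not the RTEMS _Heap_Free code, only the sequence delete extension, free, switch extension replayed on a plain array.

```c
#include <stdint.h>
#include <string.h>

#define TCB_WORDS 8u  /* hypothetical TCB size in 32-bit words */

/* Simplified TCB: task_variables is the last field, as in the thread.
 * It is stored as a uint32_t (0 == NULL) to keep the model portable. */
typedef struct {
  uint32_t other_fields[TCB_WORDS - 1];
  uint32_t task_variables;
} model_tcb;

_Static_assert(sizeof(model_tcb) == TCB_WORDS * sizeof(uint32_t),
               "model_tcb must have no padding");

/* Model memory: the TCB's user area plus one extra word for the rest
 * of the next block's header. In this model the next block's
 * prev_size field sits on the user area's last word. */
static uint32_t memory_words[TCB_WORDS + 1];

/* Hypothetical stand-in for the part of a heap free routine that
 * records the freed block's size in next_block->prev_size. */
static void model_heap_free(uint32_t freed_block_size)
{
  memory_words[TCB_WORDS - 1] = freed_block_size;
}

/* Replays the sequence from the GDB traces and returns the value the
 * switch extension would read from the already freed TCB. */
uint32_t task_variables_seen_by_switch(uint32_t freed_block_size)
{
  model_tcb tcb;

  memset(memory_words, 0, sizeof memory_words);
  memset(&tcb, 0xAA, sizeof tcb);          /* arbitrary live contents */

  /* Delete extension: task_variables is cleared. */
  tcb.task_variables = 0;
  memcpy(memory_words, &tcb, sizeof tcb);

  /* Shrink path: the chunk holding the TCB goes back to the heap. */
  model_heap_free(freed_block_size);

  /* Switch extension: reads the freed TCB. */
  memcpy(&tcb, memory_words, sizeof tcb);
  return tcb.task_variables;
}
```

A nonzero result is what lets `while (tvp)` in the switch extension go on to dereference a bogus pointer.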
>>> I hope this is reproducible enough to verify this.
>>>
>>> Can you set a watchpoint on the memory location after it is set to
>>> NULL, to catch where it gets overwritten?  I don't see any reasonable
>>> way for the heap to write to this address.  task_variables is the last
>>> field in the Thread_Control block.  TCBs are pre-allocated and never
>>> freed back to the heap except for "shrinking" in the unlimited
>>> object case.
>>>
>>> What I suspect is happening is that when the task in question
>>> is deleted, _Objects_Shrink_information is called and frees
>>> a "chunk" of unused TCBs.  If the array of TCBs EXACTLY meets the
>>> alignment, then when it is freed there is no pad at the end
>>> of it, since its size is a multiple of 4 (not sure what else).  The last
>>> four bytes must be getting overwritten when the memory is freed.
>>>
>>> Chris .. Sebastian.. does that seem possible?
>>>
>>> One way to check this is to add a few unused fields to the end
>>> of the Thread_Control, set them to 0 when it is initialized, and
>>> check whether they get overwritten.
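The pad-field check suggested above, sketched on a simplified stand-in for the TCB. The struct, the debug_pad field and both helpers are hypothetical, not part of Thread_Control; the idea is only that if the theory holds, the last pad word rather than task_variables should change after the free.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEBUG_PAD_WORDS 4u   /* hypothetical number of trailing pad words */

/* Simplified TCB with unused debug fields appended after the real
 * last field. Not the real Thread_Control. */
typedef struct {
  uint32_t other_fields[6];
  void    *task_variables;
  uint32_t debug_pad[DEBUG_PAD_WORDS];
} model_tcb;

/* Called when the TCB is initialized: zero everything, pads included. */
void model_tcb_initialize(model_tcb *tcb)
{
  memset(tcb, 0, sizeof *tcb);
}

/* Returns true if any pad word no longer holds 0, i.e. something wrote
 * past task_variables after initialization. */
bool model_tcb_pad_overwritten(const model_tcb *tcb)
{
  for (size_t i = 0; i < DEBUG_PAD_WORDS; i++) {
    if (tcb->debug_pad[i] != 0)
      return true;
  }
  return false;
}
```

In a debugger session the check would be run on the deleted task's TCB right before the switch extension is entered.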
>>>
>>> Permanent fixes include:
>>>
>>> + checking for dormant state in switch extension
>> If I understand correctly, the root of the problem is that the switch
>> extension is called on a task switch when the executing task has been
>> deleted and is being removed from consideration, i.e. its state is
>> dormant. The freed memory should not be accessed by any switch
>> extension. I think checking for the dormant state is a better fix than
>> the latter. I would put it in the score function
>> _User_extensions_Thread_switch; this introduces a single if statement
>> but allows the extensions to remain unchanged.
>>
> I thought about this but the problem is that most extensions
> are written as:
>
> void switch_extension( Thread_Control *executing, Thread_Control *heir )
> {
>   /* save something on executing */
>   /* restore something for heir */
> }
>
> If you skip the switch extensions entirely when executing is
> dormant, you miss the ability to do something for the heir.
>
> The check logic would have to be in every switch extension
> implementation. :(
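The trade-off between the two places for the dormant check can be modeled in a few lines. Everything here is hypothetical (model_thread, MODEL_STATES_DORMANT standing in for STATES_DORMANT, the extension signatures) and only mirrors the structure discussed, not the RTEMS score code. Even the guarded variant still reads current_state from a TCB that may already be freed, so it only works while that field has not been overwritten.

```c
#include <stddef.h>

/* Hypothetical stand-in for the STATES_DORMANT bit. */
#define MODEL_STATES_DORMANT 0x1u

typedef struct {
  unsigned current_state;
  int      saved;     /* an extension saved something for this thread */
  int      restored;  /* an extension restored something for this thread */
} model_thread;

typedef void (*model_switch_extension)(model_thread *executing,
                                       model_thread *heir);

static int model_is_dormant(const model_thread *thread)
{
  return (thread->current_state & MODEL_STATES_DORMANT) != 0;
}

/* Typical unguarded extension: save for executing, restore for heir. */
void model_plain_extension(model_thread *executing, model_thread *heir)
{
  executing->saved = 1;
  heir->restored = 1;
}

/* Variant A, check in the dispatcher: one if statement and unchanged
 * extensions, but the heir's restore is skipped as well. */
void model_dispatch_skip_dormant(model_switch_extension *table, size_t count,
                                 model_thread *executing, model_thread *heir)
{
  if (model_is_dormant(executing))
    return;
  for (size_t i = 0; i < count; i++)
    table[i](executing, heir);
}

/* Variant B, check in every extension: executing is left alone when
 * dormant, but the heir is still handled. */
void model_guarded_extension(model_thread *executing, model_thread *heir)
{
  if (!model_is_dormant(executing))
    executing->saved = 1;
  heir->restored = 1;
}
```

With a dormant executing thread, variant A leaves the heir unrestored while variant B restores it, which is the reason given above for putting the check in each extension.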
>>> + moving something large-ish like Start down to the
>>> end of the TCB structure.  Then when freed like this, it
>>> won't matter since it has already been used and won't
>>> be referenced.
>>>
>> This might fix the problem this time, but it is a hack and introduces
>> undefined behavior (accessing/using memory that was freed).
>>
> Yep.  I was only considering this for release branches as
> a workaround.  Alternatively adding a pad field to the TCB
> on the release branches is a solution but that adds to the
> memory requirement for all applications.
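Whether moving Start (or adding a pad field) takes task_variables out of harm's way is a layout question. A sketch with hypothetical, heavily simplified layouts (not the real Thread_Control), assuming as in the thread that only the final 32-bit word of the freed TCB gets overwritten:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the TCB's Start information. */
typedef struct {
  uint32_t entry_point;
  uint32_t argument;
  uint32_t stack_size;
  uint32_t initial_priority;
} model_start_info;

/* Current order: task_variables is the last field. */
typedef struct {
  uint32_t         state;
  model_start_info start;
  void            *task_variables;
} model_tcb_current;

/* Workaround order: Start moved to the end of the structure. */
typedef struct {
  uint32_t         state;
  void            *task_variables;
  model_start_info start;
} model_tcb_start_last;

/* Returns nonzero if the struct's final 32-bit word falls inside the
 * field occupying [offset, offset + size). That final word is the one
 * assumed to be clobbered when the chunk is freed. */
int last_word_in_field(size_t struct_size, size_t offset, size_t size)
{
  size_t last_word = struct_size - sizeof(uint32_t);
  return last_word >= offset && last_word < offset + size;
}
```

For example, `last_word_in_field(sizeof(model_tcb_start_last), offsetof(model_tcb_start_last, start), sizeof(model_start_info))` is nonzero: the clobbered word moves into Start. As Gedare notes, the freed memory is still being read either way.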
>
> The cleanest solution is to defer actually freeing the memory.
> This has always been a potential issue with delete(SELF). You
> have to continue to use the stack and we accounted for that
> on single core.  But SMP makes this even more dangerous.  I want to
> do this on the head.
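Deferring the free could look roughly like this: delete(SELF) only queues the chunk, and a later context that no longer runs on it performs the release. A minimal sketch under stated assumptions: the queue and all names are hypothetical, the actual heap call is replaced by a flag, and a real kernel would have to protect the list (dispatching disabled on single core, a lock on SMP).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical chunk of memory (e.g. a block of TCBs or a stack) whose
 * release has to wait until its owner has switched out. */
typedef struct model_chunk {
  struct model_chunk *next;   /* link in the pending list */
  bool                freed;  /* stands in for the real heap free */
} model_chunk;

typedef struct {
  model_chunk *pending;       /* chunks queued by delete(SELF) */
} model_deferred_free;

/* delete(SELF) path: queue the chunk instead of freeing it, because the
 * switch extensions and the stack are still used afterwards. */
void model_defer_free(model_deferred_free *queue, model_chunk *chunk)
{
  chunk->next = queue->pending;
  queue->pending = chunk;
}

/* Later, from a context that no longer runs on any queued chunk:
 * release everything that was queued. Returns the number released. */
size_t model_drain_deferred(model_deferred_free *queue)
{
  size_t released = 0;

  while (queue->pending != NULL) {
    model_chunk *chunk = queue->pending;
    queue->pending = chunk->next;
    chunk->freed = true;      /* real code: free the chunk to the heap */
    released++;
  }
  return released;
}
```

Between the two calls the switch extensions can still read the deleted TCB safely, since nothing has been handed back to the heap yet.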
>
> We need a simple non-invasive solution for the release branches.
>
> This case is insidious because it is essentially a random
> combination of proper alignment, size of the allocation
> and the fact it is a task deleting itself and shrinking
> the set of unlimited objects.   This case can't occur without
> unlimited objects enabled.
>
> Joachim.. does the patch I posted even solve the issue?
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
First try this morning looks good: unlimited works without complaints!!!!
I will test some other configurations and report if something fails.

Thanks a lot for the patch!!!!!

>>> If we decide the second alternative is better, then there
>>> needs to be some documentation.
>>>
>>> If the first is implemented, then unfortunately, I think
>>> it needs to be in every switch extension.
>>>
>>>> <...snip... cpukit/rtems/src/tasks.c>
>>>>
>>>> void _RTEMS_tasks_Switch_extension(
>>>>    Thread_Control *executing,
>>>>    Thread_Control *heir
>>>> )
>>>> {
>>>>    rtems_task_variable_t *tvp;
>>>>
>>>>    /*
>>>>     *  Per Task Variables
>>>>     */
>>>>
>>>>    tvp = executing->task_variables;
>>>>    while (tvp) {
>>>>      tvp->tval = *tvp->ptr;
>>>>
>>>> <...snip...>
>>>>
>>>>       therefore the NULL check fails and the last line of code in the
>>>> snippet results in a data abort...
>>>>
>>>>
>>>>> I think the extensions should ensure they are not operating on
>>>>> a deleted task.  The extension pointers and task variable
>>>>> pointer should be NULL at this point.  Worst case, they can
>>>>> check the state of executing and if it has STATES_DORMANT set,
>>>>> then don't do anything for executing.
>>>>>
>>>>> I checked the 4.9 source for this part of the Classic API extensions.
>>>>> They are setting things to NULL and the switch extension is checking
>>>>> it.
>>>>>
>>>>> FWIW there is a PR outstanding spotted on SMP work where the
>>>>> thread stack is freed and potentially reallocated for some other
>>>>> purpose before the delete(SELF) task is finished switching out.
>>>>> I don't think that's happening here but it is worth mentioning.
>>>>>
>>>>>> BTW: When I change (as a test) the definition of CPU_HEAP_ALIGNMENT in
>>>>>>        ..../rtems/cpukit/score/cpu/arm/rtems/score/cpu.h
>>>>>>        from CPU_ALIGNMENT (which is 4) to something larger than
>>>>>> CPU_ALIGNMENT,
>>>>>>        the unlimited test works fine.
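A back-of-the-envelope look at why a larger CPU_HEAP_ALIGNMENT can hide the crash, assuming the heap rounds each request up to that alignment: a chunk whose size is already a multiple of 4 gets no slack at alignment 4, but may get 4 spare bytes at alignment 8, and those spare bytes then absorb the write. The helpers are hypothetical and this is arithmetic only, not the RTEMS heap code.

```c
#include <stdint.h>

/* Rounds size up to alignment, which must be a power of two. */
uintptr_t align_up(uintptr_t size, uintptr_t alignment)
{
  return (size + alignment - 1) & ~(alignment - 1);
}

/* Spare bytes left behind the requested data after rounding. */
uintptr_t tail_slack(uintptr_t size, uintptr_t alignment)
{
  return align_up(size, alignment) - size;
}
```

Under this model `tail_slack(3300, 4)` is 0 and `tail_slack(3300, 8)` is 4, but `tail_slack(3296, 8)` is 0 again, so a larger alignment only moves the problem to other chunk sizes rather than fixing it.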
>>>>>>
>>>>>> Any idea or advice ...?
>>>>>>
>>>>>> Regards,
>>>>>> Joachim
>>>>>>
>>>>>> On 01.03.2011 17:09, Joel Sherrill wrote:
>>>>>>> On 03/01/2011 07:48 AM, Joachim Rahn wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> after updating from rtems-4.9.3 to rtems-4.9.5 the "Unlimited Task
>>>>>>>> Test" on my
>>>>>>>> ARM cpu at91sam9263 fails with a message like...
>>>>>>>>
>>>>>>>> [...skip...]
>>>>>>>> task 19 ending.
>>>>>>>> task 20 ending.
>>>>>>>> task 21 ending.
>>>>>>>> task 7 ending.
>>>>>>>> task 8 ending.
>>>>>>>>
>>>>>>>> INSN_LDR
>>>>>>>> data_abort at address 0x20018CD8, instruction: 0xE5932000,   spsr =
>>>>>>>> 0x20000013
>>>>>>>> active thread thread 0x0A010001
>>>>>>>> Previous sp=0x200629A8 lr=0x200135E0 and actual cpsr=60000097
>>>>>>>>     0x20038E30 0x20056EA8 0x0000117C 0x200629E0 0x200629C4 0x200135E0
>>>>>>>>     0x20018CB8 0x20038E30 0x20056EA8 0x20026EC0 0x20026EC0 0x20062A18
>>>>>>>>     0x200629E4 0x20010100 0x200135AC 0x00000000 0x00000000 0x00000000
>>>>>>>>     0x00000000 0x20056EA8 0x20038E30 0x60000013 0x600000D3 0x00000000
>>>>>>>>     0x00000000 0x20062A28 0x20062A1C 0x2000AE48 0x2000FFF8 0x20062A4C
>>>>>>>>     0x20062A2C 0x2000ADA0 0x2000AE24 0x521C9845 0x20056EA8 0x00000000
>>>>>>>>     0x00000000 0x2002ACD8 0x20062A64 0x20062A50 0x20000348 0x2000AD28
>>>>>>>>     0x00000008 0x00000001 0x20062A84 0x20062A68 0x2001C2D4 0x20000310
>>>>>>>>
>>>>>>>> [...skip...]
>>>>>>>>
>>>>>>>> which usually means the CPU tried to access unavailable memory.
>>>>>>>>
>>>>>>>> After reverting the fix for bug 1718, the "Unlimited Task Test" works
>>>>>>>> fine.
>>>>>>>>
>>>>>>>> (https://www.rtems.org/bugzilla/show_bug.cgi?id=1718)
>>>>>>>>
>>>>>>>> *** rtems-4.9.3: ./cpukit/sapi/include/confdefs.h *** unlimited task
>>>>>>>> test works
>>>>>>>> [...skip...]
>>>>>>>>
>>>>>>>>      #define CONFIGURE_MEMORY_PER_TASK_FOR_POSIX_API \
>>>>>>>>        _Configure_From_workspace( \
>>>>>>>>          sizeof (POSIX_API_Control) + \
>>>>>>>>         (sizeof (void *) * (CONFIGURE_MAXIMUM_POSIX_KEYS)) \
>>>>>>>>        )
>>>>>>>>
>>>>>>>> [...skip...]
>>>>>>>>
>>>>>>>> *** rtems-4.9.5: ./cpukit/sapi/include/confdefs.h *** unlimited task
>>>>>>>> test doesn't work
>>>>>>>> [...skip...]
>>>>>>>>      #define CONFIGURE_MEMORY_PER_TASK_FOR_POSIX_API \
>>>>>>>>        _Configure_From_workspace( \
>>>>>>>>          CONFIGURE_MINIMUM_TASK_STACK_SIZE + \
>>>>>>>>          sizeof (POSIX_API_Control) + \
>>>>>>>>         (sizeof (void *) * (CONFIGURE_MAXIMUM_POSIX_KEYS)) \
>>>>>>>>        )
>>>>>>>> [...skip...]
>>>>>>>>
>>>>>>>> Any hints? Does anyone else observe the same?
>>>>>>>>
>>>>>>> That patch wouldn't directly cause that failure.
>>>>>>> The only thing I can see is that it does change the
>>>>>>> amount of workspace reserved up front (by a lot).
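The size of that change can be checked with simple arithmetic. The sketch evaluates the per-task part of both CONFIGURE_MEMORY_PER_TASK_FOR_POSIX_API variants quoted above with made-up sizes (the MODEL_* values are hypothetical, not taken from Joachim's configuration) and leaves out the _Configure_From_workspace overhead.

```c
#include <stddef.h>

/* Hypothetical sizes: real values depend on the RTEMS version, the BSP
 * and the application configuration. */
#define MODEL_MINIMUM_TASK_STACK_SIZE  4096u
#define MODEL_SIZEOF_POSIX_API_CONTROL 256u
#define MODEL_MAXIMUM_POSIX_KEYS       4u

/* Per-task term of the 4.9.3 macro (without _Configure_From_workspace). */
size_t per_task_posix_4_9_3(void)
{
  return MODEL_SIZEOF_POSIX_API_CONTROL +
         sizeof(void *) * MODEL_MAXIMUM_POSIX_KEYS;
}

/* Per-task term of the 4.9.5 macro: additionally one minimum stack. */
size_t per_task_posix_4_9_5(void)
{
  return MODEL_MINIMUM_TASK_STACK_SIZE + per_task_posix_4_9_3();
}

/* Additional workspace reserved up front for `tasks` tasks. */
size_t extra_reserved_bytes(size_t tasks)
{
  return tasks * (per_task_posix_4_9_5() - per_task_posix_4_9_3());
}
```

With these placeholder numbers, 20 tasks reserve an extra 20 * 4096 = 81920 bytes; in a real configuration the extra amount scales with CONFIGURE_MINIMUM_TASK_STACK_SIZE and with however many tasks the macro is applied to.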
>>>>>>>
>>>>>>> Is this a BSP which is in the RTEMS tree?  I am
>>>>>>> suspicious that there isn't enough memory for
>>>>>>> the workspace/heap and the BSP initialization
>>>>>>> isn't recognizing this.  Eventually the task stacks,
>>>>>>> heap, etc all collide, there is corruption and you crash.
>>>>>>>
>>>>>>> So we would need to know the following:
>>>>>>>
>>>>>>> + address of end of BSS
>>>>>>> + start of memory for heap and length
>>>>>>> + start of memory for RTEMS workspace and length.
>>>>>>> + amount of RAM
>>>>>>>
>>>>>>> This assumes that the workspace/heap extend from the end of
>>>>>>> BSS to the end of RAM.
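The checklist above boils down to a containment and overlap test. A hypothetical helper, assuming the four values are known as plain addresses (for example from the linker map and the BSP start code); the region type and function names are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [start, start + size). */
typedef struct {
  uintptr_t start;
  uintptr_t size;
} model_region;

/* True if r lies completely inside [lo, hi). */
static bool region_within(model_region r, uintptr_t lo, uintptr_t hi)
{
  return r.start >= lo && r.start <= hi && r.size <= hi - r.start;
}

/* True if the two ranges share at least one byte. */
static bool regions_overlap(model_region a, model_region b)
{
  return a.start < b.start + b.size && b.start < a.start + a.size;
}

/* Checks the assumption stated above: workspace and heap both lie
 * between the end of BSS and the end of RAM and do not overlap. */
bool memory_layout_ok(uintptr_t bss_end, uintptr_t ram_end,
                      model_region workspace, model_region heap)
{
  return region_within(workspace, bss_end, ram_end) &&
         region_within(heap, bss_end, ram_end) &&
         !regions_overlap(workspace, heap);
}
```

A false result would point to the kind of collision between stacks, heap and workspace described above.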
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> --
>>>>>>>> Joachim
>>>>>>>>
>>>>>>>> ________________________________
>>>>>>>>
>>>>>>>> Helmholtz-Zentrum Berlin für Materialien und Energie GmbH
>>>>>>>>
>>>>>>>> Mitglied der Hermann von Helmholtz-Gemeinschaft Deutscher
>>>>>>>> Forschungszentren e.V
>>>>>>>>
>>>>>>>> Aufsichtsrat: Vorsitzender Prof. Dr. Dr. h.c. mult. Joachim Treusch,
>>>>>>>> stv. Vorsitzende Dr. Beatrix Vierkorn- Rudolph
>>>>>>>> Geschäftsführer: Prof. Dr. Anke Rita Kaysser-Pyzalla, Prof. Dr. Dr.
>>>>>>>> h.c. Wolfgang Eberhardt, Dr. Ulrich Breuer
>>>>>>>>
>>>>>>>> Sitz Berlin, AG Charlottenburg, 89 HRB 5583
>>>>>>>>
>>>>>>>> Postadresse:
>>>>>>>> Hahn-Meitner-Platz 1
>>>>>>>> D-14109 Berlin
>>>>>>>>
>>>>>>>> http://www.helmholtz-berlin.de
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> rtems-users mailing list
>>>>>>>> rtems-users at rtems.org
>>>>>>>> http://www.rtems.org/mailman/listinfo/rtems-users
>>>
>>> --
>>> Joel Sherrill, Ph.D.             Director of Research & Development
>>> joel.sherrill at OARcorp.com        On-Line Applications Research
>>> Ask me about RTEMS: a free RTOS  Huntsville AL 35805
>>>    Support Available             (256) 722-9985
>>>
>>>
>>>
>
>

--
Joachim Rahn
____________________________________________________________
Joachim.Rahn at Helmholtz-Berlin.de
Albert-Einstein-Strasse 15, 12489 Berlin, Germany
Phone: +49 30 8062 - 14864
Fax:   +49 30 8062 - 14632
