Determining the cause of a segfault in RTEMS

Wed Mar 13 14:36:01 UTC 2013

On 3/13/2013 7:39 AM, Gedare Bloom wrote:
> On Wed, Mar 13, 2013 at 1:59 AM, Mohammed Khoory <mkhoory at eiast.ae> wrote:
>> I've increased the stack size to 32K by defining
>> CONFIGURE_MINIMUM_TASK_STACK_SIZE and CONFIGURE_MINIMUM_STACK_SIZE and
>> CONFIGURE_INIT_TASK_STACK_SIZE .. I also made sure that I was starting new
>> tasks using RTEMS_CONFIGURED_MINIMUM_STACK_SIZE .. I'm still getting the
>> issue however as if nothing has changed. I think this means that there's
>> something wrong in my code, like something somewhere is writing out of
>> bounds, and not from the stack being too small or anything like that... So
>> I'll keep looking
>>
>> I forgot to mention that the arrays in question that involve a lot of
>> copying are around 100-120 chars in size for each task (which is probably
>> nothing compared to the default 2k-4k allocated for the stacks).
>>
> An array of 100 char is 800 bytes. 5 such arrays will overflow 4k
> limit. If you call multiple such functions the stack pressure can grow
> quickly. One easy check is to move all your arrays to global
> variables. Then they will be pre-allocated for you in the .data
> section of your program binary. Of course this won't work if you are
> multitasking or have reentrant functions.
Allocating large buffers on a thread stack is bad form in embedded
systems with fixed task stack size. You have to accommodate them.

If you have enough memory, just increase the minimum to something
monstrous like 512K.

You could also turn on CONFIGURE_STACK_CHECKER_ENABLED. Call
rtems_stack_checker_report_usage() after creating all the tasks but
before they run. This will give you a report showing where in memory
their stacks are. At the crash, look at the LOWER address edge of each
thread stack for corruption.
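
To make the idea behind such a usage report concrete, here is a minimal
host-side sketch in plain C. It is not the RTEMS stack checker: the names
stack_paint and stack_unused_bytes and the 0xA5 fill byte are assumptions
chosen for illustration. It paints a stack area with a known byte and later
counts how many bytes at the low-address edge are still untouched, which is
why the LOWER edge is the place to look on a stack that grows downward.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fill byte; any value unlikely to occur in real data works. */
#define STACK_FILL_BYTE 0xA5u

/* Paint the whole stack area before the task that owns it starts running. */
static void stack_paint(uint8_t *low, size_t size)
{
    memset(low, STACK_FILL_BYTE, size);
}

/*
 * Count the bytes, starting at the lowest address, that still hold the
 * fill pattern. On a stack that grows toward lower addresses this is the
 * headroom that has never been touched; 0 means the task reached (or ran
 * past) the low edge.
 */
static size_t stack_unused_bytes(const uint8_t *low, size_t size)
{
    size_t n = 0;

    while (n < size && low[n] == STACK_FILL_BYTE) {
        n++;
    }
    return n;
}
```

A task that happens to store the fill byte at its deepest point would be
over-counted by this scan, so treat the number as an estimate.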

You could also set various breakpoints and print the stack pointer to
see if it is inside the range.
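
The range check itself is simple arithmetic. A small sketch, assuming you
already have the low address and size of the task's stack (for example from
a stack usage report) and the stack pointer value read at the breakpoint;
the helper names sp_in_stack and sp_headroom are hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if sp lies inside the stack area [low, low + size). */
static bool sp_in_stack(uintptr_t sp, uintptr_t low, size_t size)
{
    return sp >= low && (sp - low) < size;
}

/*
 * Bytes still available below sp on a stack that grows toward lower
 * addresses. Returns 0 if sp is outside the area.
 */
static size_t sp_headroom(uintptr_t sp, uintptr_t low, size_t size)
{
    return sp_in_stack(sp, low, size) ? (size_t)(sp - low) : 0;
}
```

A stack pointer below the low address means the task has already overrun
its stack and is writing into whatever sits beneath it in memory.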

Or in a common routine in a deep path, you could call 
rtems_stack_checker_is_blown(). It returns a bool but I think it actually
prints messages if blown.

But eventually you have to figure out how much stack you are using,
and figure out how large the stack should be or whether you need to
have a dedicated pool of these temporary buffers.
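
A dedicated pool can be as simple as the sketch below. This is plain C
under stated assumptions, not an RTEMS API: the names pool_get and
pool_put and the two sizes are made up for illustration, and the code is
not thread-safe as written, so a real multitasking application would need
to guard it with a semaphore or mutex. RTEMS also has a Partition Manager
for fixed-size buffers, which may be the better fit.

```c
#include <stddef.h>

#define POOL_BUFFERS     8    /* hypothetical: max buffers in use at once */
#define POOL_BUFFER_SIZE 128  /* hypothetical: covers 100-120 char arrays */

/* Statically allocated, so the buffers never consume task stack. */
static char pool_storage[POOL_BUFFERS][POOL_BUFFER_SIZE];
static unsigned char pool_in_use[POOL_BUFFERS];

/* Return a free buffer, or NULL if every buffer is taken. */
static char *pool_get(void)
{
    for (size_t i = 0; i < POOL_BUFFERS; i++) {
        if (!pool_in_use[i]) {
            pool_in_use[i] = 1;
            return pool_storage[i];
        }
    }
    return NULL;
}

/* Hand a buffer back; pointers that did not come from pool_get() are ignored. */
static void pool_put(char *buf)
{
    for (size_t i = 0; i < POOL_BUFFERS; i++) {
        if (buf == pool_storage[i]) {
            pool_in_use[i] = 0;
            return;
        }
    }
}
```

Each function that used a local char array would call pool_get() on entry
and pool_put() before returning, so the stack cost per call drops to one
pointer.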
>> Thanks for the replies, it really helped me look in the right direction.
>>
>> Small question: is it normal for RTEMS_CONFIGURED_MINIMUM_STACK_SIZE to be
>> defined as 0? I've noticed this while stepping through the program, and I
>> was expecting it to be 32768. I assume maybe the RTEMS code considers 0 as
>> "check configuration" or something... I just want to make sure.
>>
> See cpukit/rtems/include/rtems.h where it is defined. I don't actually
> see the macro used anywhere in the tree though, so I don't know if it
> has any effect.
It should have an effect. I am sure there is test code in the tree for
it, but the low_ticker examples in examples-v2 demonstrate using
it to lower the stack size and reduce footprint.

rtems_minimum_stack_size is the variable initialized in confdefs.h
by this configuration setting. Follow that thread.
>>> -----Original Message-----
>>> From: Joel Sherrill [mailto:Joel.Sherrill at OARcorp.com]
>>> Sent: Wednesday, March 13, 2013 11:03 AM
>>> To: Mohammed Khoory
>>> Cc: Chris Johns; rtems-users at rtems.org
>>> Subject: RE: Determining the cause of a segfault in RTEMS
>>>
>>> For architectural reasons, 2k is very likely much too small on any SPARC
>>> BSP. Try increasing the minimum to something like 32k or larger to prove
>>> it is a stack problem.
>>>
>>> If it runs, we can talk about stack checker and usage reports.
>>>
>>> --joel
>>>
>>> Mohammed Khoory <mkhoory at eiast.ae> wrote:
>>>
>>>
>>>>> Normally in general-purpose (not embedded) programming, the most
>>>>> straightforward way to determine the cause of a segfault is to look
>>>>> at its backtrace. However, this approach isn't really helpful in my
>>>>> case.. I'm writing an RTEMS application that has around 4 tasks, and
>>>>> stepping through the program doesn't exactly show context switches.
>>>>> When I get a segfault, the backtrace only shows the following
>>>>>
>>>>> #0  0xcd95a758 in ?? ()
>>>>> #1  0x40000190 in trap_table () at
>>>>> ../../../../../../../../rtems-4.10.2/c/src/lib/libbsp/sparc/leon3/../../sparc/shared/start.S:88
>>>>>
>>>>> Which is extremely unhelpful. Stepping through the program also
>>>>> doesn't really help, because it seems to crash while waiting for
>>>>> events, which makes no sense to me.
>>>>>
>>>> The stack appears corrupt because the exception stack frame is a
>>>> different format to the standard stack frame gdb expects and attempts
>>>> to decode. All the data is present, it is just not available via gdb's
>>>> stack frame printing.
>>>
>>> That is very helpful, thanks. I'm doing some string copying on arrays
>>> allocated on the stack, which is what I suspected is causing it, but then
>>> I dismissed it because I knew for sure that I'm not copying anything
>>> larger than what the array can hold. But I guess I should take a better
>>> look at the copying code now, as I hadn't considered the fact that
>>> embedded targets tend to have small stacks.
>>>
>>> As Angelo Fraietta mentioned, it could be caused by my stack size being
>>> too small. However, I saw that my minimum stack size is configured to be
>>> 1024*2, which should be enough for what I'm doing, but I'll play around
>>> with it a bit more and see how that goes.
>>>
>>>>> Is there any other proper way to figure out what's causing the
>>>>> segfault in RTEMS? I'm thinking maybe using the capture engine might
>>>>> be a good idea because it should tell what task was running last,
>>>>> but I haven't used it yet, I only know what it does.. so I'm not
>>>>> sure if that'll help.
>>>> This is architecture and sometimes BSP specific, so exact details are
>>>> not easy to give. The best solution is to find the address the exception
>>>> is branching to and then set a breakpoint there. The idea is to get as
>>>> close as possible to the point where the exception happens. More often
>>>> than not this lets you see a decent stack frame in gdb. Have a look at
>>>> start.S and see if it is easy to see a possible entry point.
>>> The line in start.S that the backtrace refers to only defines an entry for
>>> a table of traps from what I can tell. In this case it's a DMA access
>>> error, which indicates that something is writing somewhere that it's not
>>> supposed to. But that's the only thing I can figure out from it.
>>>
>>> Generally speaking the start.S file isn't very helpful. It only contains
>>> code related to starting up the SPARC cpu from what I can tell...
>>>
>>> I thought that having a backtrace like this from segfaults on RTEMS was
>>> normal, which is why I sent the message in the first place :)
>>>
>>>> Which version of RTEMS are you using ?
>>> 4.10.2
>>>
>>>> Which BSP are you using ?
>>> LEON3
>>>
>>> _______________________________________________
>>> rtems-users mailing list
>>> rtems-users at rtems.org
>>> http://www.rtems.org/mailman/listinfo/rtems-users

-- 
Joel Sherrill, Ph.D.             Director of Research & Development
joel.sherrill at OARcorp.com        On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
Support Available                (256) 722-9985