does rtems 5.1 support create a core dump file when accessing a invalid address or other fatal errors?

Wed Sep 16 01:45:01 UTC 2020

Thanks very much for your reply. I got a lot of useful information from your suggestions.
Our BSP has 8MB of memory for RTEMS and a 32MB flash device, so dumping the core should not take too long.
The problem is that our application running on RTEMS is very complex. There are several threads and many asynchronous processes.
When a fatal error happens, the backtrace of the current thread may not be enough.
We need to analyse other data structures to debug, such as all the items in a queue or some global variables.
What's more, once our product is sold to customers and a bug triggers a crash, we cannot connect to the user's environment to debug.
So generating a core dump would be the better option.

However, if a core dump is not possible, the method you described is feasible.
I will study it and work out how to integrate it into our system.

smallphd at aliyun.com

From: Chris Johns
Date: 2020-09-16 08:33
To: smallphd at aliyun.com; devel
Subject: Re: does rtems 5.1 support create a core dump file when accessing a invalid address or other fatal errors?
On 15/9/20 8:58 pm, smallphd at aliyun.com wrote:
> I am developing applications on RTEMS 5.1. As we know, my application and the RTEMS
> kernel are both in the same address space.
> So if my application accesses an invalid address or encounters another fatal error,
> I want the kernel not to just hang, but to create a core dump file.
> This file would contain the whole contents of memory, and I could use a debugger to
> analyse the file and track down the bug.
> The question arises because I do not want to always debug RTEMS on the BSP.

This is an interesting question. For production units I think capturing and
reporting an error is important but a full core is not worth the effort.

Core images can be saved with a single address space OS. I remember Cisco's
single address space OS for routers from 20 years ago could capture a complete
core that could be loaded by gdb. Those devices had a Compact flash card
installed to capture the core and I suppose their users did not mind the wait
while the core was saved.

As others have explained, capturing the full address space and saving it so gdb
could be taught to load it is difficult. You need to put aside some memory to
construct the core image as you save it, and you need small standalone
drivers and whatever else it takes to get the image off the target and saved.
RTEMS cannot be relied on for this after a fatal error. Where this approach gets
hard is when you start to consider hardware failure type issues.

My preferred solution is to add a small storage area away from the RTEMS memory
map called the Run Time Error (RTE) store. This is a piece of RAM that can
survive a reset or reboot and is not part of the RTEMS memory map. Internal SoC
memory can often be enough. The memory must not be cleared or corrupted during
reset. The struct is something like:

typedef struct
{
  uint32_t type;    /* The type of error in this trace buffer. */
  uint32_t count;   /* The number of times we have had an error. */
  uint64_t uptime;  /* The period of time we have been up. */
  union
  {
    error_trace_fatal  fatal;  /* A fatal error. */
    error_trace_assert assert; /* An assert error. */
    error_trace_error  error;  /* An error code. */
  } error;
  uint32_t crc;     /* Checksum */
} error_trace;

You provide a struct for a fatal error, an assert or an error. It is a matter of
hooking the error handlers and saving the data. The fatal error is something like:

typedef struct
{
  rtems_fatal_source  source;
  uint32_t            internal;
  rtems_fatal_code    code;
  CPU_Exception_frame frame;
  uint32_t            stack[ET_STACK_SIZE];
} error_trace_fatal;

Hook the fatal error handler, fill in the fields including the crc, then
reset the board. Limit the code you call before the reset.
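
As a rough illustration of the "fill in the fields including the crc" step, here
is a minimal sketch that builds on a host. Everything prefixed with sketch_ or
SKETCH_ is hypothetical: plain integers stand in for the RTEMS types, the
CPU_Exception_frame member is left out, and the ET_STACK_SIZE value and the
choice of CRC-32 are assumptions. On a target, something like
sketch_record_fatal() would presumably be called from the fatal handler of a
user extension, just before the reset.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ET_STACK_SIZE     16u /* assumed number of stack words to keep */
#define SKETCH_TYPE_FATAL 1u  /* hypothetical value for the type field */

/* Host-side stand-in for error_trace_fatal (no CPU exception frame). */
typedef struct {
  uint32_t source;
  uint32_t internal;
  uint32_t code;
  uint32_t stack[ET_STACK_SIZE];
} sketch_fatal;

/* Host-side stand-in for error_trace with only the fatal variant. */
typedef struct {
  uint32_t     type;
  uint32_t     count;
  uint64_t     uptime;
  sketch_fatal fatal;
  uint32_t     crc; /* last member: covers every byte before it */
} sketch_trace;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), no table needed. */
static uint32_t sketch_crc32(const void *data, size_t len)
{
  const uint8_t *p = data;
  uint32_t crc = 0xFFFFFFFFu;

  while (len-- > 0) {
    crc ^= *p++;
    for (int bit = 0; bit < 8; ++bit)
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
  }
  return ~crc;
}

/* True when the stored crc matches the record contents. */
static bool sketch_trace_valid(const sketch_trace *rte)
{
  return sketch_crc32(rte, offsetof(sketch_trace, crc)) == rte->crc;
}

/* Fill in the record and seal it. The caller must make sure that
 * ET_STACK_SIZE words are readable at stack_words. */
static void sketch_record_fatal(sketch_trace *rte, uint32_t source,
                                uint32_t code, uint64_t uptime,
                                const uint32_t *stack_words)
{
  /* Carry the error count forward only when the old record is intact. */
  uint32_t count = sketch_trace_valid(rte) ? rte->count : 0u;

  memset(rte, 0, sizeof(*rte));
  rte->type = SKETCH_TYPE_FATAL;
  rte->count = count + 1u;
  rte->uptime = uptime;
  rte->fatal.source = source;
  rte->fatal.code = code;
  memcpy(rte->fatal.stack, stack_words, sizeof(rte->fatal.stack));
  rte->crc = sketch_crc32(rte, offsetof(sketch_trace, crc));
}
```

The sketch only uses memset() and memcpy(), in line with keeping the code that
runs before the reset small.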

When RTEMS starts, get your application to check the RTE and, if the checksum is
valid, check the error count. If the error count is not zero, you have captured an
error. You now have a working RTEMS that can be used to process and save the
error. I have production systems that save errors to a JFFS2 disk and a web
interface can be used to download it. I also have systems that send the data to
a syslog type server when the devices are networked. In those systems it is
really important to capture _every_ reset.
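
The start-up check could look roughly like the sketch below. It is host-side
and self-contained; all boot_ names are hypothetical, the payload is an opaque
stand-in for the error union, and boot_sum() is only a placeholder for whatever
checksum the fatal handler wrote. Clearing the count once a report has been
saved is an assumption of this sketch, made so the same error is not reported
again on the next boot.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reduced stand-in for the RTE record: header, opaque payload, checksum. */
typedef struct {
  uint32_t type;
  uint32_t count;
  uint64_t uptime;
  uint8_t  payload[64];
  uint32_t crc; /* last member: covers every byte before it */
} boot_trace;

typedef enum {
  BOOT_RTE_INVALID,      /* checksum mismatch: cold power-up or corruption */
  BOOT_RTE_CLEAN,        /* valid record, no error captured */
  BOOT_RTE_ERROR_PENDING /* valid record with an error to save or send */
} boot_rte_state;

/* Placeholder checksum; use the same routine as the fatal handler. */
static uint32_t boot_sum(const void *data, size_t len)
{
  const uint8_t *p = data;
  uint32_t sum = 1u;

  while (len-- > 0)
    sum = sum * 31u + *p++;
  return sum;
}

/* Recompute and store the checksum after changing the record. */
static void boot_rte_seal(boot_trace *rte)
{
  rte->crc = boot_sum(rte, offsetof(boot_trace, crc));
}

/* Classify the store at start-up; an invalid store is reinitialised. */
static boot_rte_state boot_check_rte(boot_trace *rte)
{
  if (boot_sum(rte, offsetof(boot_trace, crc)) != rte->crc) {
    memset(rte, 0, sizeof(*rte));
    boot_rte_seal(rte);
    return BOOT_RTE_INVALID;
  }
  return rte->count != 0u ? BOOT_RTE_ERROR_PENDING : BOOT_RTE_CLEAN;
}

/* Call once the report is safely stored or sent (assumed policy). */
static void boot_rte_mark_reported(boot_trace *rte)
{
  rte->count = 0u;
  boot_rte_seal(rte);
}
```

The application would call boot_check_rte() early in its start-up, save or send
the record when the result is BOOT_RTE_ERROR_PENDING, and only then call
boot_rte_mark_reported().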

Finding the error in the code is a matter of getting the PC address and the ELF
executable image with the DWARF debug information and using `objdump -d
--source`. Disassemble the exe with source and search for the PC address. The
report points to the location and the dumped registers will help you see what
the issue is. Most of the time the issue can be found or investigated further
and resolved but sometimes it cannot be found directly. In those cases you need
to stress the system in a lab to expose the crash and then investigate. The key
information is that the crash happened and where.
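
To make the search for the PC in the disassembly easier, the report can print
the address in the form it usually has at the start of an instruction line in
GNU objdump output, that is lowercase hex without a 0x prefix followed by a
colon. A minimal sketch, assuming a 32-bit PC; sketch_format_pc() is a
hypothetical helper:

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Format the PC as lowercase hex plus ':' so the string can be pasted
 * into a search of the disassembly listing.
 * Returns 0 on success, -1 when the buffer is too small. */
static int sketch_format_pc(uint32_t pc, char *buf, size_t len)
{
  int n = snprintf(buf, len, "%" PRIx32 ":", pc);

  return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

A buffer of 10 bytes is enough for any 32-bit address: eight hex digits, the
colon and the terminating NUL.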

Chris