[RTEMS Project] #2811: More robust thread dispatching on SMP and ARM Cortex-M
trac at rtems.org
Thu Nov 17 01:29:24 UTC 2016
#2811: More robust thread dispatching on SMP and ARM Cortex-M
Reporter: sebastian.huber | Owner: sebastian.huber
Type: enhancement | Status: new
Priority: normal | Milestone: 4.12
Component: cpukit | Version: 4.11
Severity: normal | Resolution:
Comment (by chrisj):
Replying to [comment:6 sebastian.huber]:
> Replying to [comment:5 chrisj]:
> > Replying to [comment:4 sebastian.huber]:
> > > I think a fatal error is more appropriate here.
> > >
> > > * Applications which have this usage error need to be fixed at
compile-time. It makes no sense to ship an SMP application with this bug.
> > A fatal error is still run-time and not a compile time error so you
have lost me here.
> It is an error that must be fixed during development. Otherwise you
have a broken product.
This goes for checking returned error codes as well.
The logical end to this path of discussion is to remove all fatal error
checks from the kernel in production because they should never appear. I
am sure you would agree this is not practical, so there is always a
finite chance of a fatal error happening, and robust systems need to take
this issue seriously.
In the case of this specific check or test we need to find what is
practical. The issues are how it affects a user who encounters it, what it
does to the kernel in terms of overhead and complexity, and what can be
implemented now and what could be implemented given the time and
> > How many fatal error instances are there in the RTEMS kernel? Not
the number of error codes, but the specific locations where a fatal error
can appear, i.e. code/line pairs? I have never audited this.
> See Internal_errors_Core_list, we have a test for every fatal internal
Where is the user documentation? The C User guide is lacking detail for
this and the other generic error cases.
Looking further into this, we have around 300 calls to
`rtems_fatal_error_occurred`, allowing for some noise because the count is
approximate. Of these, around 200 are in the `c/src` tree and about 50 are
under `cpukit`. For example we have `rtems_fatal_error_occurred(0xdeadbeef);` in
Then we have the family of `*_Fatal*` calls. Getting decent counts is not
easy so I will not put them here. We have `_CPU_Fatal_halt`,
`_BSP_Fatal_error`, `_POSIX_Fatal_error`, `MPCI_Fatal`, and `_SMP_Fatal`.
Without decent control we end up with
`_CPU_Fatal_halt(RTEMS_FATAL_SOURCE_EXCEPTION, 0xECC0);` which is in the
NIOS2 interrupt handler.
I am concerned about adding to this without a suitable means to provide
accessible user documentation on what we do.
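
One possible shape for such documentation is a catalogue that maps each
(source, code) pair to a short description a user can look up. The
following is a minimal sketch only. Every name in it (`demo_fatal_source`,
`demo_fatal_entry`, `demo_fatal_describe`) is hypothetical and not part of
RTEMS, and the catalogue holds only the two codes mentioned in this comment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; these are NOT the RTEMS API. */
typedef enum {
  DEMO_SOURCE_APPLICATION,
  DEMO_SOURCE_EXCEPTION
} demo_fatal_source;

typedef struct {
  demo_fatal_source source;
  uint32_t          code;
  const char       *description;
} demo_fatal_entry;

/* One entry per documented (source, code) pair. */
static const demo_fatal_entry demo_fatal_catalogue[] = {
  { DEMO_SOURCE_APPLICATION, 0xdeadbeefu,
    "placeholder code, call site not documented" },
  { DEMO_SOURCE_EXCEPTION, 0xECC0u,
    "raised in the NIOS2 interrupt handler" }
};

/* Return the description for a pair, or a fallback if it is not listed. */
const char *demo_fatal_describe(demo_fatal_source source, uint32_t code)
{
  size_t count = sizeof(demo_fatal_catalogue) / sizeof(demo_fatal_catalogue[0]);
  size_t i;

  for (i = 0; i < count; ++i) {
    if (demo_fatal_catalogue[i].source == source &&
        demo_fatal_catalogue[i].code == code) {
      return demo_fatal_catalogue[i].description;
    }
  }

  return "undocumented fatal error";
}
```

A pair that falls through to the fallback string is exactly the kind of
undocumented fatal error that is the concern here.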
> > This implies testing will highlight the issue because you have a
debugger to give you this data. Currently RTEMS standard or default stack
traces that get called on a fatal error provide little if any information
that could be used to resolve the exact source, eg the thread id executing
or even better an unwinder (dreaming here). Better support for tier 1
archs would help.
> Improved fatal error diagnostics is a different topic.
I do not agree. If you argue a case for using fatal errors then I see it
as reasonable we discuss how this impacts users.
> With a debugger it is a matter of seconds to figure out the problem spot
of a fatal error.
Yes, however the time needed to understand the issue depends on the person
looking, and you and I are not suitable to judge this.
A fatal error could occur in production simply due to the fact there is
code in the build that can generate them, and they do happen when
performing system integration when there are no debuggers connected. How
the error gets translated by a user with no knowledge of the RTEMS
internals is the question being proposed. The more fatal errors we add the
more of a problem this becomes and this ticket wants to add another.
> > >
> > > * This is a new constraint specific to SMP. Existing software may be
simply unaware of this issue. However, it's important to detect this
> > I agree it is important.
> > > * _Thread_Do_dispatch() has no return value. Adding this check to
other places would be much more difficult and error prone, with more space
and time overhead, and labour intensive to test.
> > There are no other similar tests happening now on the blocking paths?
> No, this is a weak area in RTEMS. For example call
rtems_task_wake_after() in an interrupt service routine. You don't get
any status information that this is stupid.
Yes we have a few cases like this.
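
The trade-off discussed above, one check in a dispatch routine that has no
return value versus status codes on every blocking call, can be modelled
in a few lines. This is a hedged sketch and not RTEMS code: the
`demo_*` names, the CPU state structure, and the exact condition treated as
a bad environment are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of per-CPU state; NOT an RTEMS structure. */
typedef struct {
  uint32_t isr_nest_level;     /* greater than 0 while servicing an interrupt */
  bool     interrupts_enabled;
  bool     fatal_raised;       /* set by demo_fatal() so callers can observe it */
  uint32_t fatal_code;
} demo_cpu_state;

#define DEMO_FATAL_BAD_DISPATCH_ENVIRONMENT 1u

/* A real kernel would halt here; the model only records the error. */
static void demo_fatal(demo_cpu_state *cpu, uint32_t code)
{
  cpu->fatal_raised = true;
  cpu->fatal_code = code;
}

/* No return value, so a constraint violation cannot be reported as a status. */
void demo_do_dispatch(demo_cpu_state *cpu)
{
  /* Assumed constraint: dispatch only in thread context with interrupts on. */
  if (cpu->isr_nest_level != 0 || !cpu->interrupts_enabled) {
    demo_fatal(cpu, DEMO_FATAL_BAD_DISPATCH_ENVIRONMENT);
    return;
  }

  /* A real kernel would perform the context switch here. */
}

/* Every blocking call funnels into the single checked dispatch path. */
void demo_task_wake_after(demo_cpu_state *cpu, uint32_t ticks)
{
  (void) ticks; /* the delay itself is not modelled */
  demo_do_dispatch(cpu);
}
```

Calling `demo_task_wake_after()` with `isr_nest_level` set to 1 trips the
check without any change to the blocking call itself, which is the argument
for placing the check in the dispatch path.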
> For now, I think a fatal error is sufficient.
Ok, however user-facing documentation on the error needs to be added to
our user manuals.
I would like to see better default exception output, and after this
discussion I can see a real need for the debug server to catch any
fatal errors and break the system, leaving the user in the thread stack
frame with the error.
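
As a rough idea of what better default output could carry, the sketch
below formats the fatal source, the code, and the id of the executing
thread into one line. It is illustrative only: `demo_format_fatal` and its
parameters are hypothetical, and how a real handler would obtain the
executing thread id is not shown.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, NOT an RTEMS function: format one diagnostic line
 * for a fatal error. Returns the snprintf() result, i.e. the number of
 * characters the full line needs, or a negative value on error.
 */
int demo_format_fatal(
  char    *buffer,
  size_t   size,
  uint32_t source,
  uint32_t code,
  uint32_t executing_thread_id
)
{
  return snprintf(
    buffer,
    size,
    "fatal error: source=%" PRIu32 " code=0x%" PRIx32 " thread=0x%08" PRIx32,
    source,
    code,
    executing_thread_id
  );
}
```

A line like this gives a user without a debugger at least the thread to
start looking at, which the bare source and code pair does not.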
> In case there is really a problem with this in the field, we can still
In terms of the implementation, sure. In terms of supporting users with
suitable documentation, I believe we have a problem now.
> What matters is that this constraint violation gets detected, otherwise
you can spend hours on debugging.
Yes this is the most important item to get resolved.
Ticket URL: <http://devel.rtems.org/ticket/2811#comment:7>
RTEMS Project <http://www.rtems.org/>