[RTEMS Project] #2811: More robust thread dispatching on SMP and ARM Cortex-M

Thu Nov 17 01:29:24 UTC 2016

#2811: More robust thread dispatching on SMP and ARM Cortex-M
-----------------------------+------------------------------
 Reporter:  sebastian.huber  |       Owner:  sebastian.huber
     Type:  enhancement      |      Status:  new
 Priority:  normal           |   Milestone:  4.12
Component:  cpukit           |     Version:  4.11
 Severity:  normal           |  Resolution:
 Keywords:                   |
-----------------------------+------------------------------

Comment (by chrisj):

 Replying to [comment:6 sebastian.huber]:
 > Replying to [comment:5 chrisj]:
 > > Replying to [comment:4 sebastian.huber]:
 > > > I think a fatal error is  more appropriate here.
 > > >
 > > > * Applications which have this usage error needs to be fixed at
 compile-time. It makes no sense to ship an SMP application with this bug.
 > >
 > > A fatal error is still run-time and not a compile time error so you
 have lost me here.
 >
 > It is an error that must be fixed during development.  Otherwise you
 have a broken product.

 This goes for checking returned error codes as well.

 The logical end to this path of discussion is to remove all fatal error
 checks from the kernel in production because they should never appear. I
 am sure you would agree this is not practical. There is therefore always
 a finite chance of a fatal error happening, and robust systems need to
 take this issue seriously.

 In the case of this specific check or test we need to find what is
 practical. The issues are how it affects a user who encounters it, what it
 does to the kernel in terms of overhead and complexity, and what can be
 implemented now and what could be implemented given the time and
 resources.

 >
 > >
 > > How many fatal errors instance are there in RTEMS in the kernel? Not
 the number of error code, but the specific locations a fatal error can
 appear, ie code/line pairs? I have never audited this.
 >
 > See Internal_errors_Core_list, we have a test for every fatal internal
 error.
 >

 Where is the user documentation? The C User guide is lacking detail for
 this and the other generic error cases.

 Looking further into this we have around 300 calls to
 `rtems_fatal_error_occurred`, give or take some noise because the count is
 approximate. Of these, around 200 are in the `c/src` tree and about 50 are
 under `cpukit`. For example we have `rtems_fatal_error_occurred(0xdeadbeef);` in
 `cpukit/libfs/src/nfsclient/src/nfs.c`.

 Then we have the family of `*_Fatal*` calls. Getting decent counts is not
 easy so I will not give them here. We have `_CPU_Fatal_halt`,
 `_BSP_Fatal_error`, `_POSIX_Fatal_error`, `MPCI_Fatal`, and `_SMP_Fatal`.
 Without decent control we end up with
 `_CPU_Fatal_halt(RTEMS_FATAL_SOURCE_EXCEPTION, 0xECC0);` which is in the
 NIOS2 interrupt handler.

 I am concerned about adding to this without a suitable means to provide
 accessible user documentation on what we do.
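
 As a minimal sketch of what "decent control" could look like, the
 standalone C below keeps every fatal source/code pair in one table that
 code and user documentation could share. All names here (`demo_*`) and
 the two table rows are hypothetical and are not part of RTEMS. A bare
 magic number such as `0xdeadbeef` has no row, so the lookup reports it as
 undocumented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout: this is a model, not an RTEMS API. */
typedef enum {
  DEMO_SOURCE_DISPATCH,    /* raised by a (modelled) thread dispatcher */
  DEMO_SOURCE_APPLICATION  /* raised by application or library code */
} demo_fatal_source;

typedef struct {
  demo_fatal_source source;
  uint32_t          code;
  const char       *name;  /* symbolic name a user can search for */
  const char       *help;  /* one-line text for the user manual */
} demo_fatal_entry;

/* One row per documented fatal error; the rows are made up. */
static const demo_fatal_entry demo_fatal_table[] = {
  { DEMO_SOURCE_DISPATCH, 1u, "DEMO_BAD_DISPATCH_ENVIRONMENT",
    "A blocking call was made where dispatching is not allowed." },
  { DEMO_SOURCE_APPLICATION, 2u, "DEMO_OUT_OF_BUFFERS",
    "The buffer pool configured for the application is exhausted." }
};

/* Return the documented entry for a source/code pair, or NULL if none. */
const demo_fatal_entry *demo_fatal_lookup(
  demo_fatal_source source,
  uint32_t          code
)
{
  size_t i;
  size_t n = sizeof(demo_fatal_table) / sizeof(demo_fatal_table[0]);

  for (i = 0; i < n; ++i) {
    const demo_fatal_entry *e = &demo_fatal_table[i];

    /* Every row must carry text, otherwise the table is pointless. */
    assert(e->name != NULL && e->help != NULL);

    if (e->source == source && e->code == code) {
      return e;
    }
  }

  return NULL; /* an undocumented magic number ends up here */
}
```

 A fatal handler could call `demo_fatal_lookup()` and print the name and
 help text, falling back to the raw numbers when the result is NULL.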

 >
 > > This implies testing will highlight the issue because you have a
 debugger to give you this data. Currently RTEMS standard or default stack
 traces that get called on a fatal error provide little if any information
 that could be used to resolve the exact source, eg the thread id executing
 or even better an unwinder (dreaming here). Better support for tier 1
 archs would help.
 >
 > Improved fatal error diagnostics is a different topic.

 I do not agree. If you argue a case for using fatal errors then I see it
 as reasonable we discuss how this impacts users.

 >  With a debugger it is a matter of seconds to figure out the problem spot
 of a fatal error.

 Yes, however the time needed to understand the issue depends on the person
 looking, and you and I are not well placed to judge this.

 A fatal error could occur in production simply due to the fact there is
 code in the build that can generate them, and they do happen when
 performing system integration when there are no debuggers connected. How
 the error gets translated by a user with no knowledge of the RTEMS
 internals is the question being posed. The more fatal errors we add, the
 more of a problem this becomes, and this ticket wants to add another.

 >
 > >
 > > >
 > > > * This is a new constraint specific to SMP. Existing software may be
 simply unaware of this issue. However, its important to detect this
 constraint violation.
 > >
 > > I agree it is important.
 > >
 > > > * _Thread_Do_dispatch() has no return value.  Adding this check to
 other places would be much more difficult, error prone, with more space
 and time overhead, and labour intensive to test.
 > >
 > > There are no other similar tests happening now on the blocking paths?
 >
 > No, this is a weak area in RTEMS.  For example call
 rtems_task_wake_after() in an interrupt service routine.  You don't get
 any status information that this is stupid.

 Yes, we have a few cases like this.
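
 To make the two options in this exchange concrete, here is a toy model in
 standalone C. It is not RTEMS code: every `demo_*` name is hypothetical,
 and an interrupt nest level stands in for whatever condition the real
 check would test. Option 1 returns a status from each blocking directive.
 Option 2 puts one check in a dispatch function that has no return value,
 so the only way to report the violation is a fatal error, modelled here
 as a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model, not an RTEMS API. */
typedef enum {
  DEMO_SUCCESSFUL,
  DEMO_CALLED_FROM_ISR
} demo_status;

static unsigned demo_isr_nest_level; /* > 0 while a handler is running */
static bool     demo_fatal_raised;   /* stand-in for a fatal error */

void demo_isr_enter(void)
{
  ++demo_isr_nest_level;
}

void demo_isr_exit(void)
{
  assert(demo_isr_nest_level > 0);
  --demo_isr_nest_level;
}

/* Option 1: every blocking directive checks and returns a status. */
demo_status demo_wake_after(unsigned ticks)
{
  (void) ticks; /* the delay itself is not modelled */

  if (demo_isr_nest_level > 0) {
    return DEMO_CALLED_FROM_ISR;
  }

  return DEMO_SUCCESSFUL;
}

/* Option 2: a single check at the dispatch point, which returns void. */
void demo_dispatch(void)
{
  if (demo_isr_nest_level > 0) {
    demo_fatal_raised = true; /* a real kernel would not return here */
    return;
  }

  /* A context switch would happen here in a real kernel. */
}

bool demo_fatal_was_raised(void)
{
  return demo_fatal_raised;
}
```

 Option 1 needs a check and a test in every directive that can block.
 Option 2 covers all of them in one place, at the cost of the caller
 getting no status to handle.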

 > For now, I think a fatal error is sufficient.

 Ok, however user-facing documentation on the error needs to be added to
 our user manuals.

 I would like to see better default exception output, and after this
 discussion I can see a real need for the debug server to catch any fatal
 errors and break the system, leaving the user in the thread stack frame
 with the error.

 > In case there is really a problem with this in the field, we can still
 improve things.

 In terms of the implementation, sure. In terms of supporting users with
 suitable documentation, I believe we have a problem now.

 > What matters is that this constraint violation gets detected, otherwise
 you can spend hours on debugging.

 Yes, this is the most important item to get resolved.

--
Ticket URL: <http://devel.rtems.org/ticket/2811#comment:7>
RTEMS Project <http://www.rtems.org/>