Complete stall: root causes and diagnosis

Wed Mar 27 02:02:11 UTC 2019

Dear all,

We are running RTEMS 4.10.2 on an MVME2700 and on an MVME6100 CPU.
The system (we are using EPICS and several VME interface cards) runs for
long periods, up to 2 full months, with no faults, but then it suddenly
stalls completely: no error message, no stack trace, nothing, and I cannot
even connect on the serial port. This happens at a very irregular rate,
sometimes once a month and sometimes 5 stalls in a couple of hours. The
only way to recover is to reset the CPU. This happens on both CPUs.

I have 2 questions about this issue:
A) Has something like this happened to any of you? What was the root cause
of the stalls and how did you figure it out?

B) Is there a way we can somehow get out of this situation to diagnose it,
ideally getting the stack trace of the halted threads? Is there some kind
of non-maskable interrupt I can send to make a postmortem diagnosis without
needing to reboot?

Any ideas are welcome!

With best regards,

Tim D. Gaggstatter
Software Engineer
Gemini Observatory - AURA
