P4 dual Xeon problems

Joel Sherrill joel.sherrill at OARcorp.com
Thu Sep 27 18:47:13 UTC 2001


Gunter Cieters wrote:
> 
> My apologies for posting my first mail with an incorrect
> subject.
> 
> > > We are running the RTEMS native host environment (Linux
> > > target) on a standard Red Hat Linux 7.1 distribution. Lately,
> > > we have been bugged by some RTEMS crashes on a dual P4
> > > Xeon system. The crashes disappear when we boot Linux
> > > in uniprocessor mode.
> > >
> > > Does anybody know of a reason why the RTEMS
> > > host environment would not run on the P4 SMP system? (We
> > > would like to rule out RTEMS and proceed with upgrading the Linux
> > > kernel, but it is a little bit weird that the crashes only
> > > appear when we are doing RTEMS runs.)
> >
> >
> > > This is an interesting one that I have never heard of before.  The
> > > RTEMS unix port (aka synthetic target) runs completely in user space as
> > > a normal Linux process.  It only uses a handful of Linux system services,
> > > such as SIGALRM and signal processing for the clock tick and setjmp/longjmp
> > > for context switches.  I don't know of anything it can do that should trip
> > > up the kernel directly UNLESS ...
> >
> > <hypothesis mode on>
> >
> > RTEMS could be using some service heavily that is not truly MP safe.
> > For example, we repeatedly fire the clock tick.  Say that there is a bug
> > in the signal processing code that our heavy use of SIGALRM is tripping.
> >
> > My gut feeling is that RTEMS is somehow tripping a bug where something
> > is not properly protected in Linux.
> >
> > <hypothesis mode off>
> >
> > Does the kernel crash or just the RTEMS application?
> 
> The RTEMS application halts with a segfault or SIGILL (the place
> where this happens is a bit random, although judging from the stack
> traces there may be a relationship with floating point usage).
> If RTEMS is MP safe, then it is possible that the Linux kernel has a
> problem with correctly saving/restoring the floating point registers. We
> have run into this before when using the MMX extensions on Linux (we got
> around that particular problem by applying a kernel patch).

IMO this is very likely to be the problem.  setjmp/longjmp do not
touch the FPU AFAIK.  It might be worth taking the FPU save/restore
code from the regular x86 port and handling the FPU context in addition
to the integer context.
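
To make the idea concrete, here is a rough sketch of a context that
carries FPU state next to the setjmp/longjmp buffer.  It is not taken
from the RTEMS sources: task_context, fp_area, fp_save, fp_restore and
demo_switch are hypothetical names, and FXSAVE/FXRSTOR is used here as
an assumption (the regular x86 port may well use different
instructions).  It builds only for x86 with GCC or Clang.

```c
#include <setjmp.h>
#include <stdint.h>

#if !defined(__i386__) && !defined(__x86_64__)
#error "This sketch uses x86 FXSAVE/FXRSTOR and only builds for x86 targets."
#endif

/* FXSAVE requires a 512-byte save area aligned on a 16-byte boundary. */
typedef struct {
    uint8_t bytes[512];
} __attribute__((aligned(16))) fp_area;

/* Hypothetical task context: integer state plus x87/SSE state. */
typedef struct {
    jmp_buf regs;  /* integer context handled by setjmp/longjmp */
    fp_area fp;    /* FPU context, saved and restored explicitly */
} task_context;

/* Store the current x87/SSE state into the save area. */
static void fp_save(fp_area *a)
{
    __asm__ volatile("fxsave %0" : "=m"(*a) : : "memory");
}

/* Reload x87/SSE state from the save area.  A real context switch would
 * do this where no compiler-held floating point values are live. */
static void fp_restore(const fp_area *a)
{
    __asm__ volatile("fxrstor %0" : : "m"(*a) : "memory");
}

/* Read the x87 control word. */
static uint16_t x87_get_cw(void)
{
    uint16_t cw;
    __asm__ volatile("fnstcw %0" : "=m"(cw));
    return cw;
}

/* Load a new x87 control word. */
static void x87_set_cw(uint16_t cw)
{
    __asm__ volatile("fldcw %0" : : "m"(cw));
}

/* Simulate "task A" being switched out and back in.  Returns the x87
 * control word task A sees after it resumes. */
uint16_t demo_switch(void)
{
    static task_context task_a;  /* static storage keeps the alignment */
    uint16_t original = x87_get_cw();
    uint16_t seen;

    /* Task A selects double precision (precision control bits = 10b). */
    x87_set_cw((uint16_t)((original & ~0x0300u) | 0x0200u));
    fp_save(&task_a.fp);

    if (setjmp(task_a.regs) == 0) {
        /* "Another task" runs and selects extended precision (11b). */
        x87_set_cw((uint16_t)(original | 0x0300u));
        longjmp(task_a.regs, 1);
    }

    /* Back in task A: put its FPU state back before it continues. */
    fp_restore(&task_a.fp);
    seen = x87_get_cw();

    x87_set_cw(original);  /* leave the caller's FPU settings untouched */
    return seen;
}
```

Calling demo_switch() and checking that (result & 0x0300) == 0x0200
shows that task A gets its own precision setting back once fp_restore
has run.  Whether the setting would survive without the explicit
restore depends on what the C library's setjmp/longjmp preserve.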

> I'll keep this mailing list informed if we find anything related to kernel
> problems.
> 
> thanks for your answer,
> Cieters Gunter

-- 
Joel Sherrill, Ph.D.             Director of Research & Development
joel at OARcorp.com                 On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
   Support Available             (256) 722-9985


