Fwd: Re: Re: Fwd: Problem report: Struct aliasing problem causes Thread_Ready_Chain corruption in 4.6.99.3

Peer Stritzinger peerst at gmail.com
Tue Nov 21 16:19:32 UTC 2006


Ralf,

On 11/21/06, Ralf Corsepius <ralf.corsepius at rtems.org> wrote:
> Sorry, but this is a no-go. strict-aliasing is the default in GCC for a
> very long time (IIRC, since 4.0.0) and must work.

I think it's actually been on by default at -O2 since gcc-3 (I checked
the gcc-3.4 documentation), but back then gcc wasn't smart enough when
optimizing for it to become a problem.

Strict aliasing is a declaration about certain properties of the C code being
compiled, made to ease the business of optimizing for the compiler.  One can
argue that because it's on by default in GCC all code should behave like this,
but as we see it doesn't.

> If some code breaks, this code has to be considered broken and must be
> fixed. I.e. we'd have to have a concrete piece of code to be able to
> analyze and fix what might be going wrong.

I agree.  I looked at the chain code for a while to find a way to fix
it without changing too many things in the system, in order to avoid
-fno-strict-aliasing.  The fundamental problem is that the chain code
breaks the assumption that (gcc manual):

     In particular, an object of one type is assumed never to reside at
     the same address as an object of a different type, unless the
     types are almost the same.  For example, an `unsigned int' can
     alias an `int', but not a `void*' or a `double'.

I'm not sure there is an easy way to fix this without the peril that
next time gcc improves its optimizations it will break again.
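
To make the quoted rule concrete, here is a minimal sketch in plain C.
It is not taken from RTEMS; the function names are made up for
illustration.  With strict aliasing active the compiler may assume that
the store through the float pointer cannot change the int, so it is
free to return the constant without reloading it.  The demo only passes
distinct objects, which is the case the compiler assumes.

```c
#include <assert.h>

/* Hypothetical example, not RTEMS code.
 * Under strict aliasing the compiler may assume *i and *f never overlap,
 * because int and float are not compatible types. */
int store_then_load(int *i, float *f)
{
    *i = 1;
    *f = 2.0f;   /* assumed not to modify *i */
    return *i;   /* may be compiled as "return 1" with no reload */
}

/* Well-defined use: the two pointers refer to distinct objects. */
int demo_distinct_objects(void)
{
    int   i = 0;
    float f = 0.0f;
    int   r = store_then_load(&i, &f);

    assert(f == 2.0f);
    return r;
}
```

If a caller passed the same address for both arguments (via a cast),
the behaviour would be undefined, and the result could differ between
builds with and without -fno-strict-aliasing.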

A concrete piece of code is _Thread_Reset_timeslice(), however only in
conjunction with all its inlined functions and their inlined
functions.  I can prove that when compiled with strict-aliasing active
it will produce invalid PowerPC code (at least on MPC860, probably on
all).  The bug is VERY hard to observe and triggers only on a heavily
loaded production system, after somewhere between 4 and 12 hours.  I
had to build a coredump system into a low-level exception handler that
produced dumps which could be loaded into gdb.  Also I had to use all
the hardware breakpoint registers of the CPU in order to track it back
to the real cause.  Oh yes, and 3 weeks of debugging on a project with
a tight schedule ;-)

The trouble is that I was in quite a hurry and did lots of analysis on paper.

So in order to give you proof I need to either dig out my old paperwork,
try to understand it and type it up again, or redo the analysis.

I will do this but this will take some time.

Another question is: What else is broken?

It looks like the assumption "an object of one type is assumed never
to reside at the same address as an object of a different type" might
be broken in several parts of RTEMS.
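
As a sketch of the kind of layout where this can happen, assume a chain
control made of three pointers whose head and tail sentinel "nodes"
overlap.  This is a simplified, hypothetical model; the type and
function names below are illustrative and not copied from the RTEMS
sources.  The code only computes offsets, so it is well defined in
itself; it shows that a node view of the head and a node view of the
tail would cover the same bytes as a field of the control structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of a doubly linked chain. */
typedef struct node {
    struct node *next;
    struct node *previous;
} node_t;

typedef struct {
    node_t *first;           /* doubles as head.next                  */
    node_t *permanent_null;  /* doubles as head.previous and tail.next */
    node_t *last;            /* doubles as tail.previous              */
} chain_t;

/* Offset inside chain_t at which a tail "node" view would start. */
size_t tail_view_offset(void)
{
    return offsetof(chain_t, permanent_null);
}

/* Returns 1 if the head and tail node views share the middle pointer. */
int sentinel_views_overlap(void)
{
    size_t head_previous = offsetof(node_t, previous);
    size_t tail_next     = tail_view_offset() + offsetof(node_t, next);
    size_t tail_previous = tail_view_offset() + offsetof(node_t, previous);

    /* Assumption: three pointers are laid out without padding. */
    assert(sizeof(chain_t) == 3 * sizeof(node_t *));

    return head_previous == offsetof(chain_t, permanent_null)
        && tail_next     == offsetof(chain_t, permanent_null)
        && tail_previous == offsetof(chain_t, last);
}
```

Code that reaches those bytes through a cast to a node pointer in one
place and through the control structure in another is the sort of
access the quoted rule lets the compiler treat as unrelated.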

For our production systems I will set -fno-strict-aliasing for a while,
until I'm convinced that the code is safe with strict-aliasing.

Regards,
-- Peer


