Problem report: Struct aliasing problem causes Thread_Ready_Chain corruption in 4.6.99.3

Wed Nov 29 18:34:41 UTC 2006

On Wed, 2006-11-29 at 10:05 -0600, Joel Sherrill wrote:
> Thomas Doerfler wrote:
> > Ralf,
> >
> > Ralf Corsepius schrieb:
> >   
> >> On Wed, 2006-11-29 at 08:49 +0100, Thomas Doerfler wrote:
> >>
> >>     
> >>> Ralf Corsepius schrieb:
> >>>       
> >>>> * HUGE projects such as Fedora and OpenSuSE are able to compile 1000's
> >>>> of source tarballs and millions of lines of code with it enabled and are
> >>>> only facing very few packages to break?
> >>>>
> >>>> * GCC and newlib can be compiled with it enabled for RTEMS?
> >>>>         
> >>> Oh, please note: the RTEMS kernel and packages can also be compiled with
> >>> this option. And they also work MOST of the time. But this is not
> >>> sufficient for a reliable RTOS.
> >>>       
> >> Are you seriously trying to say, such fundamental bugs would not be
> >> found in those 1000's of source tarballs, in all those _years_
> >> -fstrict-aliasing is effective, if this was a real problem?
> >>     
> >
> > No, I would not dare to make such a silly statement.
> > Just for the record:
> >
> > - Software that has been designed in a "cleaner" way concerning its data
> > structures and their usage surely has no problems with the strict
> > aliasing rules.
> >
> >   
> The core of RTEMS code was started 10 years before C99 was approved. 
> Was strict aliasing as a valid optimization even addressed before C99?
I don't know, but ... strict-aliasing has been active in GCC at
least since gcc-3.4 (!).

> So those 1000s of packages referred to likely do not even get close to
> problems with aliasing. 
Check X11, gtk, qt, they are full of it (I recall that at one point, many
years ago, strict-aliasing broke X11).

> > - RTEMS definitively has problems.
> >
> >   
> And as always we want RTEMS to be as clean as possible.
In this particular case, I don't see how to achieve this without a
fundamental redesign.

> >>> What is your suggestion to find other potential problem areas?
> >>>       
> >> I can tell you what I've been trying so far (but I am just at the
> >> very beginning):
> >>
> >> Compile RTEMS with and without -fno-strict-aliasing, disassemble the
> >> object files and compare the disassembly. If the disassembled files
> >> differ, the source file qualifies as a candidate to be examined.
> >>     
> >
> > This is a good approach. It will show us which modules might be
> > sensitive to aliasing issues.
> >
> >   
> This is definitely a good approach. 
ATM, I am at ca 300 suspected files ;)

> It is also possible (and likely) that the code in question is either in
> low level code with rippling effects -- like the chain -- or in device
> drivers where it impacts only a specific target.

That's what I suspect. I currently have all "chain" and
"chain-like" structures (Heap, Imfs etc.) on my radar. But so far, this
is not much more than a "suspicion" and a "wild guess".
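
To illustrate why chain-like structures are natural suspects: a common
layout (assumed here for illustration, with hypothetical demo_* names, not
copied from the RTEMS sources) keeps three pointers in the control block
(first, a permanent NULL, last) and treats the overlapping pairs as head
and tail nodes. Reaching those nodes by casting between unrelated struct
types is the kind of access a compiler may treat as non-aliasing under
-fstrict-aliasing. A minimal sketch of one way to state the same overlap in
the type itself, via a union, so that no such cast is needed:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified doubly linked node (not the RTEMS definition). */
typedef struct demo_node {
  struct demo_node *next;
  struct demo_node *previous;
} demo_node;

/*
 * Three pointers of storage: first / permanent NULL / last.
 * head.node covers (first, NULL); tail.node covers (NULL, last).
 * The union makes the overlap explicit instead of hiding it in a cast.
 */
typedef union {
  struct { demo_node node; demo_node *fill; } head;
  struct { demo_node *fill; demo_node node; } tail;
} demo_chain;

demo_node *demo_head(demo_chain *c) { return &c->head.node; }
demo_node *demo_tail(demo_chain *c) { return &c->tail.node; }

void demo_init(demo_chain *c) {
  assert(c != NULL);
  demo_head(c)->next = demo_tail(c);
  demo_head(c)->previous = NULL;        /* same storage as tail.node.next */
  demo_tail(c)->previous = demo_head(c);
}

int demo_is_empty(demo_chain *c) {
  return demo_head(c)->next == demo_tail(c);
}

/* Link node n in front of the tail sentinel. */
void demo_append(demo_chain *c, demo_node *n) {
  demo_node *tail = demo_tail(c);
  demo_node *old_last = tail->previous;
  n->next = tail;
  n->previous = old_last;
  old_last->next = n;
  tail->previous = n;
}

/* Unlink and return the first node, or NULL if the chain is empty. */
demo_node *demo_get_first(demo_chain *c) {
  demo_node *first;
  if (demo_is_empty(c)) return NULL;
  first = demo_head(c)->next;
  demo_head(c)->next = first->next;
  first->next->previous = demo_head(c);
  return first;
}
```

All accesses above go through demo_node lvalues, so the compiler has no
type-based reason to assume that head and tail do not overlap. Whether this
layout suits the real Chain code is a separate question; the sketch only
shows the principle.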

> >>>> Our problem is lack of testing (primary cause: way too long release
> >>>> cycles). 
> >>>>         
> >>> Here I must totally disagree.
> >>>       
> >> Face it: RTEMS users are still using ancient tools with ancient versions
> >> of RTEMS, therefore rtems-4.7 and its toolchain have hardly seen any
> >> public exposure and testing at all.
> >>     
> >
> > I think this is partly due to the fact, that RTEMS is used in embedded
> > devices. When I start a development based on RTEMS, I may be open to use
> > a non-stable version, but when my product finalizes, I need a stable
> > version. This may be a big difference compared with most of the other
> > open source projects.
> >
> >   
> Agreed.  There are active projects today still using 4.5 for development.
Well, ... their problem.

> >>>  You will never fix this problem by
> >>> testing. The effort to track down ONE error has been significantly high.
> >>>       
> >> Yes, and? How many errors are there? 1 ... 10 ... 100s?
> >>
> >> I suspect very few, with most of them orbiting around "Chains" and
> >> "Object", due to their working principle (based on aliasing types).
> >>     
> >
> > How about the network stack, the web server, the filesystem,
> > malloc/free... and the individual BSPs.
The networking stack triggers (un-)surprisingly few "suspects". The web
server is a different story - it even triggers "punned pointer" warnings
and almost every file it consists of appears in my list.

> Ralf's diff'ing of assembly at least narrows down the candidates 
> significantly.
ATM (without having tried to eliminate false positives) about 1/3 of all
*.c files get listed.

> Hopefully,
> we can pick a single CPU to analyze first that we think is very likely to
> have these optimization problems.
To get started, I'd suggest trying the posix BSP under Linux.
This only uses a very limited part of the RTEMS sources, and uses a
native Linux-gcc, which can be assumed to be in far better shape than a
standard FSF-gcc.

My knowledge of i386 is poor, but unless I am completely wrong,
Peer's/Thomas' issue is visible under i386-FC6 and Cygwin.

>   Fix them there and have less to look at
> on other architectures.
> 
> When analyzing the questionable output, adding "-g -Wa,-aldh" to intersperse
> C with assembly might be helpful.  I assume you are just running a 
> script which objdump's each .o in two build trees and compares them now.

Yes, that's basically what I am doing.
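
A minimal sketch of such a comparison, written in C with hypothetical
function names. It assumes objdump is on the PATH, that the object paths
contain no single quotes, and that the only tree-specific line in the
"objdump -d" output is the "file format" header, which embeds the path:

```c
#define _POSIX_C_SOURCE 200809L  /* for popen/pclose */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the next line that is not objdump's "file format" header. */
static char *next_line(FILE *f, char *buf, size_t n) {
  while (fgets(buf, (int)n, f) != NULL) {
    if (strstr(buf, "file format") == NULL) return buf;
  }
  return NULL;
}

/* Return 1 if the two disassembly streams differ, 0 if they match. */
int disasm_differs(FILE *a, FILE *b) {
  char la[512], lb[512];
  for (;;) {
    char *ra = next_line(a, la, sizeof la);
    char *rb = next_line(b, lb, sizeof lb);
    if (ra == NULL || rb == NULL) return ra != rb; /* one ended early? */
    if (strcmp(la, lb) != 0) return 1;
  }
}

/*
 * Disassemble the same object from two build trees and compare.
 * Returns 1 if they differ, 0 if they match, -1 if objdump cannot be run.
 */
int objects_differ(const char *obj_a, const char *obj_b) {
  char cmd_a[1024], cmd_b[1024];
  FILE *pa, *pb;
  int result;

  assert(obj_a != NULL && obj_b != NULL);
  snprintf(cmd_a, sizeof cmd_a, "objdump -d '%s'", obj_a);
  snprintf(cmd_b, sizeof cmd_b, "objdump -d '%s'", obj_b);
  pa = popen(cmd_a, "r");
  pb = popen(cmd_b, "r");
  if (pa == NULL || pb == NULL) {
    if (pa != NULL) pclose(pa);
    if (pb != NULL) pclose(pb);
    return -1;
  }
  result = disasm_differs(pa, pb);
  pclose(pa);
  pclose(pb);
  return result;
}
```

A driver would call objects_differ() for each .o that exists in both trees
and print the names of those that differ. A differing file is only a
suspect: the difference still has to be examined to see whether it is a
real breakage.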

> >> We must provoke these bugs to be able to "nail them down" and not pamper
> >> them with "-fno-strict-aliasing".
> >>     
> >
> > Maybe the following steps would make sense:
> >
> > - Somebody (Ralf?) might track down the suspect modules by Ralf's method
> > of comparing the compiler output (using an architecture with MUCH
> > optimization headroom. PPC is not so bad due to its many general purpose
> > registers, but maybe another architecture is better)

> - Verify difference is a breakage. :)
That's a real issue. I need to think about how we could try to approach
this problem.

Having a list is one thing; working it off and documenting the steps
performed (e.g. files identified as false positives) is a
different one.

If we had a functional bugtracking system, I'd file PRs on serious
suspects such that others who are more familiar with a particular
architecture can have a look into it.

I could also add my list to CVS-HEAD, where it could be reformatted into
a table ...

> > - The various suspect packages could be redesigned by some suitable
> > persons (I would volunteer for some of the code)
> >
> >   
> - Once a problem is fixed, we may need to regenerate suspect list in 
> case it is
> a side-effect of inlining.
> > - In parallel, 4.7 will be cut with -fno-strict-aliasing
> >
> > - the 4.8 development branch will temporarily use -fno-strict-aliasing
> > as well, until the code has been revised
This is completely non-acceptable to me.

> I don't know how Ralf is going to add this compiler flag.
I don't know either. As I've said before on PM (wrt. -g), this is not as
simple as it might appear.

> Random thought -- could it be turned on/off with a configuration flag which
> defaults to no strict aliasing in 4.7 and strict aliasing in 4.8?
No.

> > - Then, the 4.8 development branch will switch back to -fstrict-aliasing
> > AND enable more aliasing warnings (there was some GCC switch to do this)
> >
> > What do all of you think of this?
> >
> >   
> 4.7 needs strict aliasing disabled.  I am inclined to leave it on for 4.8
> but also
> turn that extra warning on. 
> 
> I am afraid that if we let it get turned off by default on 4.8, it will 
> never get turned back on.
Exactly - Therefore I refuse to accept turning it off in 4.8.

Ralf