Problem report: Struct aliasing problem causes Thread_Ready_Chain corruption in 184.108.40.206
ralf.corsepius at rtems.org
Wed Nov 29 18:34:41 UTC 2006
On Wed, 2006-11-29 at 10:05 -0600, Joel Sherrill wrote:
> Thomas Doerfler wrote:
> > Ralf,
> > Ralf Corsepius schrieb:
> >> On Wed, 2006-11-29 at 08:49 +0100, Thomas Doerfler wrote:
> >>> Ralf Corsepius schrieb:
> >>>> * HUGE projects such as Fedora and OpenSuSE are able to compile 1000's
> >>>> of source tarballs and millions of lines of code with it enabled and are
> >>>> only facing very few packages to break?
> >>>> * GCC and newlib can be compiled with it enabled for RTEMS?
> >>> Oh, please note: the RTEMS kernel and packages can also be compiled with
> >>> this option. And they also work MOST of the time. But this is not
> >>> sufficient for a reliable RTOS.
> >> Are you seriously trying to say, such fundamental bugs would not be
> >> found in those 1000's of source tarballs, in all those _years_
> >> -fstrict-aliasing is effective, if this was a real problem?
> > No, I would not dare to make such a silly statement.
> > Just for the record:
> > - Software that has been designed in a "cleaner" way concerning its data
> > structures and their usage surely has no problems with the strict
> > aliasing rules.
> The core of RTEMS code was started 10 years before C99 was approved.
> Was strict aliasing as a valid optimization even addressed before C99?
I don't know, but ... strict-aliasing has been active in GCC at
least since gcc-3.4 (!).
> So those 1000s of packages referred to are likely not even to get close to
> problems with aliasing.
Check X11, gtk, qt, they are full of it (I recall one point, many years
ago, strict-aliasing had broken X11).
> > - RTEMS definitively has problems.
> And as always we want RTEMS to be as clean as possible.
In this particular case, I don't see how to achieve this without a
> >>> What is your suggestion to find other potential problem areas?
> >> I can tell you what I've been trying so far (but I am just at the
> >> very beginning):
> >> Compile RTEMS with and without -fno-strict-aliasing, disassemble the
> >> object files and compare the disassembly. If the disassembled files
> >> differ, the file qualifies as a candidate to be examined.
> > This is a good approach. It will show us, which modules might be
> > sensitive for aliasing issues.
> This is definitely a good approach.
ATM, I am at ca 300 suspected files ;)
> It is also possible (and likely) that the code in question is either in
> low level code
> with rippling effects -- like the chain -- or device drivers where it
> impacts only a specific target.
That's what I suspect. I am currently having all "chain" and
"chain-like" structures (Heap, Imfs etc.) on my radar. But so far, this
is not much more than a "suspicion" and "wild guess".
> >>>> Our problem is lack of testing (primary cause: way too long release
> >>>> cycles).
> >>> Here I must totally disagree.
> >> Face it: RTEMS users are still using ancient tools with ancient version
> >> of RTEMS, therefore rtems-4.7 and its toolchain has hardly seen any
> >> public exposure and testing at all.
> > I think this is partly due to the fact, that RTEMS is used in embedded
> > devices. When I start a development based on RTEMS, I may be open to use
> > a non-stable version, but when my product finalizes, I need a stable
> > version. This may be a big difference compared with most of the other
> > open source projects.
> Agreed. There are active projects today still using 4.5 for development.
Well, ... their problem.
> >>> You will never fix this problem by
> >>> testing. The effort to track down ONE error has been significantly high.
> >> Yes, and? How many errors are there? 1 ... 10 ... 100s?
> >> I suspect very few, with most of them orbiting around "Chains" and
> >> "Object", due to their working principle (based on aliasing types).
> > How about the network stack, the web server, the filesystem,
> > malloc/free... and the individual BSPs.
The networking stack triggers (un-)surprisingly few "suspects". The web
server is a different story - It even triggers "punned pointer" warnings
and almost every file it consists of appears in my list.
> Ralf's diff'ing of assembly at least narrows down the candidates
ATM (without having tried to eliminate false positives) about 1/3 of all
*.c files get listed.
> we can pick a single CPU to analyze first that we think is very likely to
> have these optimization problems.
To get started, I'd suggest trying the posix BSP under Linux.
This only uses a very limited part of the RTEMS sources, and uses a
native Linux-gcc, which can be assumed to be in far better shape than a
My knowledge on i386 is poor, but unless I am completely wrong,
Peer's/Thomas' issue is visible under i386-FC6 and Cygwin.
> Fix them there and have less to look at
> on other architectures.
> When analyzing the questionable output, adding "-g -Wa,-aldh" to intersperse
> C with assembly might be helpful. I assume you are just running a
> script which objdump's each .o in two build trees and compares them now.
Yes, that is basically what I am doing.
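A minimal sketch of such a comparison script, assuming two build trees
already exist (one built with -fno-strict-aliasing, one without) and
contain the same relative .o paths. The function names and the default
"objdump" tool name are illustrative assumptions, not the script actually
used here; a cross build would pass its target-prefixed objdump instead.

```python
import subprocess
from pathlib import Path


def normalize(disassembly):
    """Drop lines that differ between the trees for trivial reasons."""
    lines = []
    for line in disassembly.splitlines():
        # objdump's header line embeds the object file's path, which
        # always differs between two build trees.
        if "file format" in line:
            continue
        line = line.strip()
        if line:
            lines.append(line)
    return lines


def disassemble(obj, tool="objdump"):
    """Return the text of `<tool> -d <obj>` (tool name is an assumption)."""
    result = subprocess.run(
        [tool, "-d", str(obj)], capture_output=True, text=True, check=True
    )
    return result.stdout


def suspects(strict_tree, relaxed_tree, disasm=disassemble):
    """List relative paths of objects whose disassembly differs."""
    strict_tree, relaxed_tree = Path(strict_tree), Path(relaxed_tree)
    found = []
    for obj in sorted(strict_tree.rglob("*.o")):
        rel = obj.relative_to(strict_tree)
        other = relaxed_tree / rel
        if not other.is_file():
            continue  # only compare objects present in both trees
        if normalize(disasm(obj)) != normalize(disasm(other)):
            found.append(rel.as_posix())
    return found
```

Calling suspects("build-strict", "build-no-strict") (hypothetical directory
names) would return the candidate list. A differing file is only a
candidate: the optimizer may legitimately generate different code, so each
hit still has to be examined by hand.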
> >> We must provoke these bugs to be able to "nail them down" and not pamper
> >> them with "-fno-strict-aliasing".
> > Maybe the following steps would make sense:
> > - Somebody (Ralf?) might track down the suspect modules by Ralf's method
> > to compare the compiler output (using an architecture with MUCH
> > optimization headroom. PPC is not so bad due to its many general purpose
> > registers, but maybe another architecture is better)
> - Verify difference is a breakage. :)
That's a real issue. I need to think about how we could try to approach
Having a list is one thing, working it off and documenting the steps
performed (e.g. files identified as false positives) is
If we had a functional bugtracking system, I'd file PRs on serious
suspects such that others who are more familiar with a particular
architecture can have a look into it.
I could also add my list to CVS-HEAD, where it could be reformatted into
a table ...
> > - The various suspect packages could be redesigned by some suitable
> > persons (I would volunteer for some of the code)
> - Once a problem is fixed, we may need to regenerate suspect list in
> case it is
> a side-effect of inlining.
> > - In parallel, 4.7 will be cut with -fno-strict-aliasing
> > - the 4.8 development branch will temporarily use -fno-strict-aliasing
> > as well, until the code has been revised
This is completely non-acceptable to me.
> I don't know how Ralf is going to add this compiler flag.
I don't know either. As I've said before on PM (wrt. -g), this is not as
simple as it might appear.
> Random thought -- could it be turned on/off with a configuration flag which
> defaults to no strict aliasing in 4.7 and strict aliasing in 4.8?
> > - Then, the 4.8 development branch will switch back to -fstrict-aliasing
> > AND enable more aliasing warnings (there was some GCC switch to do this)
> > What do all of you think of this?
> 4.7 needs strict aliasing disabled. I am prone to leaving it on for 4.8
> but also
> turn that extra warning on.
> I am afraid that if we let it get turned off by default on 4.8, it will
> never get turned back on.
Exactly - Therefore I refuse to accept turning it on 4.8.