Problem report: Struct aliasing problem causes Thread_Ready_Chain corruption in 4.6.99.3

Thomas Doerfler Thomas.Doerfler at embedded-brains.de
Wed Nov 29 07:49:08 UTC 2006


Ralf,

Ralf Corsepius wrote:
> On Tue, 2006-11-28 at 11:14 -0600, Joel Sherrill wrote:
> 
>>Eric Norum wrote:
>>
> 
>>+ It is an OPTIMIZATION and an optional one at that.
>>This isn't a test of manhood.  There isn't any shame in
>>disabling it. 
> 
> Then you might be able to explain why
> 
> * HUGE projects such as Fedora and OpenSuSE are able to compile 1000's
> of source tarballs and millions of lines of code with it enabled and are
> only facing very few packages to break?
> 
> * GCC and newlib can be compiled with it enabled for RTEMS?

Oh, please note: the RTEMS kernel and packages can also be compiled with
this option. And they also work MOST of the time. But this is not
sufficient for a reliable RTOS.

Please also note: Peer was the only one who found this bug, in his own
application, and I coached him a bit while he was tracking it down.
He saw a system crash (the RTEMS kernel dumped a task context to NULL)
only on a heavily loaded customer (!) system, about once every few
operating hours. He literally spent weeks instrumenting his software
so that he could track down the problem.

I think we all agree that we cannot cut RTEMS-4.7 while similar problems
might still lie dormant in the kernel and in other packages such as
networking.

What is your suggestion to find other potential problem areas?

- Perform a complete code review? Who has the time to do it?

- Wait, until some other guy tracks down a similar problem in another
package?

- Or just have a look at the compiler warnings? Obviously, the compiler
does not warn about every critical code fragment, and some pieces of
code seem to have been reworked so that the warnings go away while the
code itself is still not really aliasing-proof.
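
To illustrate the last point, here is a minimal sketch (hypothetical
function names, not RTEMS code) of how a violation can be hidden from
the compiler's warning without being fixed. It assumes that float and
uint32_t have the same size and that float uses the IEEE 754 format.
Whether GCC warns about the hidden variant depends on the compiler
version and the -Wstrict-aliasing level, so treat that part as an
assumption.

```c
#include <stdint.h>
#include <string.h>

/*
 * The obvious violation would be
 *     return *(uint32_t *)&f;
 * which GCC can report with -Wstrict-aliasing when optimizing.
 */

/*
 * Same violation, hidden behind a void pointer. The warning often
 * goes away, but a float object is still read through a uint32_t
 * lvalue, so the behavior is still undefined under strict aliasing.
 */
uint32_t float_bits_hidden(float f)
{
    void *p = &f;
    return *(uint32_t *)p;
}

/*
 * Aliasing-clean variant: copy the object representation.
 * Compilers usually turn this memcpy into a plain register move.
 */
uint32_t float_bits_memcpy(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Only float_bits_memcpy() is safe to rely on with strict aliasing
enabled; the hidden variant merely looks clean in the build log.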

> 
> Our problem is lack of testing (primary cause: way too long release
> cycles). 

Here I must totally disagree. You will never fix this problem by
testing. The effort to track down ONE error has been very high.
Once we have made our code aliasing-proof, we will certainly have to
test it. But IMHO we will have to review the code, module by module, to
identify possible problems.

> 
> Instead the RTEMS community seems to prefer to "blindly shoot into the
> crowd" on "hear/say" and to play with symptoms, but to fix causes.

OK, can you already identify the causes? Or do you have a suggestion
for how we can all do that?

> 
> Me suspects very few, but central points in RTEMS to be broken and
> needing to be fundamentally redesigned.

Please list them.

> 
> 
>>Bottom line is that if we want strict-aliasing on for 4.7, we
>>will be delaying the release.  This is a very bad thing.  I
>>am torn between Thomas' suggestions 2 and 3
>>
>>
>>>2.) We set "-fno-strict-aliasing" now and forever
> 
> 
> With all due respect, but to me, this would be "plain stupid".

Ralf, again with due respect, can you please explain to me why it is
stupid?

> 
> 
>>>3.) we use "-fno-strict-aliasing" for RTEMS 4.7 and, ASAP we build a
>>>strategy on how to get ALL code aliasing clean.
> 
> 
> This would be a _temporary_ compromise, I could live with.
> Nevertheless, we need to identify the broken pieces and not to play it
> nice nor to play these breakdowns low.

Ralf, this sounds good; on this point we totally agree. I think my
suggestion 3.) carries the danger that things work fine again and
nobody talks about aliasing-proof code any more. So it will be
important to build a strategy for making our source code more reliable.

> 
> Peer's report (which I seem to have missed initially) and Thomas'
> followup to it are a points to getting started.
> 

I agree, we can start there, but remember how we found this bug. When
Peer first reported his suspicion, I was convinced that the Chain code
couldn't be broken, because it had worked reliably for so long.
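
As a rough illustration of why a chain can look correct and still be
aliasing-sensitive: assume a doubly linked chain whose control block
holds three pointers, where the head and tail sentinel nodes overlap
and share the middle NULL pointer. If the sentinels are obtained by
casting the control block to a node pointer, the same memory is
accessed through two unrelated struct types. The sketch below is NOT
the RTEMS source; all names are hypothetical. It shows one possible
layout that states the overlap explicitly with a union, and it relies
on GCC's documented support for reading a union member other than the
one last written.

```c
#include <stddef.h>

typedef struct node {
    struct node *next;
    struct node *previous;
} node;

/*
 * Head.head.previous and Tail.tail.next occupy the same storage and
 * stay NULL permanently; Head.head.next is the first node and
 * Tail.tail.previous is the last node. The overlap is declared in
 * the type instead of being created by pointer casts.
 */
typedef union {
    struct { node head; node *fill; } Head;
    struct { node *fill; node tail; } Tail;
} chain;

static void chain_init(chain *c)
{
    c->Head.head.next = &c->Tail.tail;
    c->Head.head.previous = NULL;      /* also Tail.tail.next */
    c->Tail.tail.previous = &c->Head.head;
}

static int chain_is_empty(chain *c)
{
    return c->Head.head.next == &c->Tail.tail;
}

/* Append n in front of the tail sentinel. */
static void chain_append(chain *c, node *n)
{
    node *tail = &c->Tail.tail;
    node *old_last = tail->previous;

    n->next = tail;
    n->previous = old_last;
    old_last->next = n;
    tail->previous = n;
}
```

Every access here goes through a node or through the union itself,
so the compiler can see that the head and tail views share storage.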

Ralf, I agree with you that it would be nicer to have aliasing-proof code
from the start, but I see no easy way to get there soon. Therefore
I really vote for getting rtems-4.7 out with -fno-strict-aliasing,
because the risk of either delaying the 4.7 release or shipping it
unstable is much too high in my opinion.


wkr,
Thomas.


> 
> Embarrassed,
> 
> 	Ralf
> 
> 


-- 
--------------------------------------------
Embedded Brains GmbH
Thomas Doerfler           Obere Lagerstr. 30
D-82178 Puchheim          Germany
email: Thomas.Doerfler at embedded-brains.de
Phone: +49-89-18908079-2
Fax:   +49-89-18908079-9


