Problem report: Struct aliasing problem causes Thread_Ready_Chain corruption in 4.6.99.3

Joel Sherrill joel.sherrill at oarcorp.com
Wed Nov 29 16:05:47 UTC 2006


Thomas Doerfler wrote:
> Ralf,
>
> Ralf Corsepius schrieb:
>   
>> On Wed, 2006-11-29 at 08:49 +0100, Thomas Doerfler wrote:
>>
>>     
>>> Ralf Corsepius schrieb:
>>>       
>>>> * HUGE projects such as Fedora and OpenSuSE are able to compile 1000's
>>>> of source tarballs and millions of lines of code with it enabled and are
>>>> only facing very few packages to break?
>>>>
>>>> * GCC and newlib can be compiled with it enabled for RTEMS?
>>>>         
>>> Oh, please note: the RTEMS kernel and packages can also be compiled with
>>> this option. And they also work MOST of the time. But this is not
>>> sufficent for a reliable RTOS.
>>>       
>> Are you seriously trying to say, such fundamental bugs would not be
>> found in those 1000's of source tarballs, in all those _years_
>> -fstrict-aliasing is effective, if this was a real problem?
>>     
>
> No, I would not dare to set such a silly statement.
> Just for the records:
>
> - Software that has been designed in a "cleaner" way concerning its data
> structures and their usage surely has no problems with the strict
> aliasing rules.
>
>   
The core of the RTEMS code was started 10 years before C99 was approved.
Was strict aliasing as a valid optimization even addressed before C99?
And it is really only in the past couple of years that gcc has optimized
well enough to even introduce problems.

Most code is not going to be impacted because most code -- even in RTEMS --
does not do the kind of memory overlay tricks that run afoul of this
optimization.  So those 1000s of packages referred to are unlikely to even
get close to problems with aliasing.
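
A minimal sketch of the kind of memory overlay meant here, assuming an
IEEE-754 single-precision float and a platform that provides uint32_t.
The function names are made up for illustration and are not taken from
RTEMS.  The first variant reads a float through an unrelated pointer type,
which the compiler may treat as non-aliasing under -fstrict-aliasing; the
second expresses the same intent with memcpy, which does not depend on the
aliasing setting.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Overlay through a pointer cast: the float object is read through a
 * uint32_t lvalue.  With -fstrict-aliasing the compiler may assume the
 * two types never refer to the same memory, so the behaviour is
 * undefined.  Shown for contrast only; nothing below calls it. */
static inline uint32_t float_bits_overlay(const float *f)
{
  return *(const uint32_t *) f;
}

/* Same intent without the overlay: copy the bytes into a uint32_t.
 * memcpy is defined for any object type, so this variant does not
 * rely on how the compiler treats aliasing. */
static inline uint32_t float_bits_copy(const float *f)
{
  uint32_t bits;

  assert(sizeof bits == sizeof *f);  /* assumption: 32-bit float */
  memcpy(&bits, f, sizeof bits);
  return bits;
}
```

Code that only ever goes through something like float_bits_copy() should
compile to the same behaviour with or without -fno-strict-aliasing.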

The big example here is the Linux kernel.  From what I understand (which
is limited), a core problem is that the device driver specific information
is tacked onto the end of a common structure.  This is similar in concept
to what RTEMS does with chain_node, object_control, and each API
object class.
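
A rough sketch of the "common structure with specifics tacked on" layout,
using hypothetical names (node, object, list_push, object_from_node)
rather than the real RTEMS or Linux definitions.  Converting between a
pointer to a structure and a pointer to its first member is permitted by
the C standard; trouble tends to start when the same memory is instead
read through an unrelated structure type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical common part shared by every object on a list. */
typedef struct node {
  struct node *next;
} node;

/* Hypothetical object: common part first, specifics tacked on after. */
typedef struct {
  node link;  /* must stay the first member */
  int  id;    /* object-specific data */
} object;

/* Push an element onto a singly linked list via its common part. */
static void list_push(node **head, node *n)
{
  n->next = *head;
  *head = n;
}

/* Recover the enclosing object from its common part.  This relies on
 * link being the first member, so both pointers share one address. */
static object *object_from_node(node *n)
{
  assert(offsetof(object, link) == 0);
  return (object *) n;
}

/* Return the id of the first object on the list, or -1 if it is empty. */
static int first_id(node *head)
{
  return head ? object_from_node(head)->id : -1;
}
```

The list code only ever touches the node part, and the object-specific
fields are reached through a pointer of the object's own type.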

> - RTEMS definitively has problems.
>
>   
And as always we want RTEMS to be as clean as possible.
> - Some people on the RTEMS list have stated, that they are not sure
> whether their application/BSP/OS code will always honor the strict
> aliasing rules.
>
>   
This is definitely a problem.  There are 92 custom files, so nearly that
many BSP variations.  That ignores custom BSPs based on libcpu code that
we don't have a real BSP for.
>> Unfortunately RTEMS is one of these!
>>     
>
> Yes.
>
>   
>>> What is your suggestion to find other potential problem areas?
>>>       
>> I can tell you what I've been trying so far (but I am just at the
>> very beginning):
>>
>> Compile RTEMS with and without -fno-strict-aliasing, disassemble the
>> object files and compare the disassembly. If the disassembled files
>> differ, the file qualifies as a candidate to be examined.
>>     
>
> This is a good approach. It will show us which modules might be
> sensitive to aliasing issues.
>
>   
This is definitely a good approach. 
>> This results in a list of candidate files to be examined (in the order
>> of 100). It definitely contains many false positives, due to
>> -fno-strict-aliasing affecting ordering of asm-instructions,
>> nevertheless this list is better than nothing.
>>     
>
> Keep in mind: It will only list the modules which generate different
> code with the default -fstrict-aliasing on the current GCC version. We
> should track this in future releases to ensure, that better optimizers
> will not bring up new issues.
>
>   
Agreed, but likely only on second-digit changes to gcc: gcc 4.1 -> 4.2,
probably not 4.1.0 to 4.1.1.

It is also possible (and likely) that the code in question is either in
low-level code with rippling effects -- like the chain -- or in device
drivers, where it impacts only a specific target.  It would probably be
more effective to address low-level SuperCore issues first and see how
the list drops.
>>>> Our problem is lack of testing (primary cause: way too long release
>>>> cycles). 
>>>>         
>>> Here I must totally disagree.
>>>       
>> Face it: RTEMS users are still using ancient tools with ancient version
>> of RTEMS, therefore rtems-4.7 and its toolchain has hardly seen any
>> public exposure and testing at all.
>>     
>
> I think this is partly due to the fact, that RTEMS is used in embedded
> devices. When I start a development based on RTEMS, I may be open to use
> a non-stable version, but when my product finalizes, I need a stable
> version. This may be a big difference compared with most of the other
> open source projects.
>
>   
Agreed.  There are active projects today still using 4.5 for development.


> But you are right, a broader test community (and more snapshots) would
> be desirable.
>
>   
I think Ralf's work to have individual tool versions for individual
targets is important for speeding things up here.  Before we waited until
a tool version looked good for all targets and RTEMS was at a stable
point.  This was like herding cats or worms.

And we know that gcc 3.4 and 4.0 weren't good on many targets so that
didn't help find a nice tool point. :(

With strict-aliasing disabled, 4.7 looks to be pretty stable.  I have
tested on simulators where I can and a number of people have run tests on
their own hardware.  There will be bugs but 4.7 looks good so far.
>>>  You will never fix this problem by
>>> testing. The effort to track down ONE error has been significantly high.
>>>       
>> Yes, and? How many errors are there? 1 ... 10 ... 100s?
>>
>> I suspect very few, with most of them orbiting around "Chains" and
>> "Object", due to their working principle (based on aliasing types).
>>     
>
> How about the network stack, the web server, the filesystem,
> malloc/free... and the individual BSPs.
>
>   
Ralf's diff'ing of assembly at least narrows down the candidates
significantly.

One big concern is that this could have large multiples: number of target
CPUs and multilib options.  There are 69 libc.a's installed for 4.7.
Hopefully, we can pick a single CPU to analyze first that we think is very
likely to have these optimization problems.  Fix them there and have less
to look at on other architectures.

When analyzing the questionable output, adding "-g -Wa,-aldh" to
intersperse C with assembly might be helpful.  I assume you are just
running a script which objdump's each .o in two build trees and compares
them now.
>>>>>           
>>>>>> 2.) We set "-fno-strict-aliasing" now and forever
>>>>>>             
>>>> With all due respect, but to me, this would be "plain stupid".
>>>>         
>>> Ralf, again with due respect, can you please explain to me why it is stupid?
>>>       
>> The ... forever ... is stupid. 
>>
>> RTEMS code is dirty and needs to be cleaned up, that's the point.
>>     
>
> Agreed. I also would like to have strict aliasing-proof code as a future
> goal. And I see that setting -fno-strict-aliasing temporarily will take
> the pressure off this goal (which is a benefit for the users but a
> bad thing for reaching this goal).
>
>   
>>> Ralf, I agree with you that it would be nicer to have aliasing-proof code
>>> from the start, but I see no easy way to get it soon.
>>>       
>> Therefore _temporary_, therefore NO -fno-strict-aliasing in rtems-4.8.
>>     
>
> Ok, this is sort of a compromise.
>
>   
I think this is the only way we will see a 4.7.0 in the near future.
>> We must provoke these bugs to be able to "nail them down" and not pamper
>> them with "-fno-strict-aliasing".
>>     
>
> Maybe the following steps would make sense:
>
> - Somebody (Ralf?) might track down the suspect modules using Ralf's
> method of comparing the compiler output (using an architecture with MUCH
> optimization headroom. PPC is not so bad due to its many general purpose
> registers, but maybe another architecture is better)
>   
- Verify difference is a breakage. :)
> - The various suspect packages could be redesigned by some suitable
> persons (I would volunteer for some of the code)
>
>   
- Once a problem is fixed, we may need to regenerate the suspect list in
case it is a side-effect of inlining.
> - In parallel, 4.7 will be cut with -fno-strict-aliasing
>
> - the 4.8 development branch will temporarily use -fno-strict-aliasing
> as well, until the code has been revised
>   
I don't know how Ralf is going to add this compiler flag.  It might go in
as global or not.  If it gets added individually in Makefile.am's then it
will be possible to slowly take it out.  If it is in a single place, it is
either in or out.

Random thought -- could it be turned on/off with a configuration flag which
defaults to no strict aliasing in 4.7 and strict aliasing in 4.8?

I kind of lean toward fixing the chain problem and then turning it on in
4.8.  It is CVS and I want people to catch issues.  Providing a
configuration flag lets people turn it off if they have to.
> - Then, the 4.8 development branch will switch back to -fstrict-aliasing
> AND enable more aliasing warnings (there was some GCC switch to do this)
>
> What do all of you think of this?
>
>   
4.7 needs strict aliasing disabled.  I am inclined to leave it on for 4.8
but also turn that extra warning on.

I am afraid that if we let it get turned off by default on 4.8, it will
never get turned back on.

And we dig at Ralf's reports. :-)
> wkr,
> Thomas.
>
>
>
>   
>> Ralf
>>
>>
>>     
>
>
>   



