Problem report: Struct aliasing problem causes Thread_Ready_Chain corruption in 4.6.99.3

Joel Sherrill joel.sherrill at oarcorp.com
Wed Nov 29 20:33:44 UTC 2006


Ralf Corsepius wrote:
> On Wed, 2006-11-29 at 10:05 -0600, Joel Sherrill wrote:
>   
>> Thomas Doerfler wrote:
>>     
>>> Ralf,
>>>
>>> Ralf Corsepius schrieb:
>>>   
>>>       
>>>> On Wed, 2006-11-29 at 08:49 +0100, Thomas Doerfler wrote:
>>>>
>>>>     
>>>>         
>>>>> Ralf Corsepius schrieb:
>>>>>       
>>>>>           
>>>>>> * HUGE projects such as Fedora and OpenSuSE are able to compile 1000's
>>>>>> of source tarballs and millions of lines of code with it enabled and are
>>>>>> only facing very few packages to break?
>>>>>>
>>>>>> * GCC and newlib can be compiled with it enabled for RTEMS?
>>>>>>         
>>>>>>             
>>>>> Oh, please note: the RTEMS kernel and packages can also be compiled with
>>>>> this option. And they also work MOST of the time. But this is not
>>>>> sufficent for a reliable RTOS.
>>>>>       
>>>>>           
>>>> Are you seriously trying to say, such fundamental bugs would not be
>>>> found in those 1000's of source tarballs, in all those _years_
>>>> -fstrict-aliasing is effective, if this was a real problem?
>>>>     
>>>>         
>>> No, I would not dare to set such a silly statement.
>>> Just for the records:
>>>
>>> - Software that has been designed in a "cleaner" way concerning its data
>>> structures and their usage surely has no problems with the strict
>>> aliasing rules.
>>>
>>>   
>>>       
>> The core of RTEMS code was started 10 years before C99 was approved. 
>> Was strict aliasing as an valid optimization even addressed before C99?
>>     
> I don't know, but ... I've strict-aliasing had been active in GCC at
> least since gcc-3.4 (!).
>
>   
And IMO gcc 3.4 was not stable on enough RTEMS supported CPUs to justify 
releasing it.

gcc 4.0 was better but not good across all architectures we needed.  It 
is only really with
gcc 4.1 that the RTEMS targets have been in good enough shape to begin 
to get feedback
on them.


>> So those 1000s of packages referred to are likely not even to get close to
>> problems with aliasing. 
>>     
> Check X11, gtk, qt, they are full of it (I recall one point, many years
> ago, strict-aliasing had broken X11).
>
>   
And all of those examples are graphics packages likely using the same 
type of memory
overlay tricks as RTEMS.   Just like RTEMS, the original X11 code base 
very old.

I'm not trying to apologize just saying that this type of coding is 
common in low level code
especially if it dates back far enough.  For higher level application 
code, there is normally
no need to resort to this level of coding.
>>> - RTEMS definitively has problems.
>>>
>>>   
>>>       
>> And as always we want RTEMS to be as clean as possible.
>>     
> In this particular case, I don't see how to achieve this without a
> fundamental redesign.
>
>   
If the fact we take a Chain_Node off a list generically and can "upcast" 
it all the way
to an API object breaks with strict aliasing, then yes.  But as this 
link shows, there
are ways around the strict aliasing:

  http://gcc.gnu.org/ml/gcc/2006-11/msg00880.html

So we could use memcpy when we really need to alias memory. 

My hope is that we use the most effective technique to address each 
particular
situation.  The memcpy method above is simple, still represents what you
wanted, and optimizes quite well with gcc.  Not the most elegant solution
possibly but it is a true representation and valid.  It would certainly be
a good solution to apply mechanically.  Especially if we wrapped it in
a macro that made it obvious we were doing an alias copy.

If that doesn't cause aliasing problems, then it is OK. 



>>>>> What is your suggestion to find other potential problem areas?
>>>>>       
>>>>>           
>>>> I can tell you what I've been trying so far (but I am at just at the
>>>> very beginning):
>>>>
>>>> Compile RTEMS with and with out -fno-strict-aliasing, disassemble the
>>>> object files and compare the disassembly. If these disassembled files
>>>> differ, this a files is qualified to be candidate to be examined.
>>>>     
>>>>         
>>> This is a good aproach. It will show us, which modules might be
>>> sensitive for aliasing issues.
>>>
>>>   
>>>       
>> This is definitely a good approach. 
>>     
> ATM, I am at ca 300 suspected files ;)
>
>   
I count 2200 in the entire tree and 1740 if you ignore tests.  So you 
narrowed it down considerably.
>> It is also possible (and likely) that the code in question is either in 
>> low level code code
>> with rippling effects -- like the chain -- or device drivers where it is 
>> impacts only a specific target.
>>     
>
> That's what I suspect. I am currently having all "chain" and
> "chain-like" structures (Heap, Imfs etc.) on my radar. But so far, this
> not much more than a "suspicion" and "wild guess".
>
>   
I think this has to drive the order things are examined and info 
regenerated.
>>>>>> Our problem is lack of testing (primary cause: way too long release
>>>>>> cycles). 
>>>>>>         
>>>>>>             
>>>>> Here I must totally disagree.
>>>>>       
>>>>>           
>>>> Face it: RTEMS users are still using ancient tools with ancient version
>>>> of RTEMS, therefore rtems-4.7 and its toolchain has hardly seen any
>>>> public exposure and testing at all.
>>>>     
>>>>         
>>> I think this is partly due to the fact, that RTEMS is used in embedded
>>> devices. When I start a development based on RTEMS, I may be open to use
>>> a non-stable version, but when my product finalizes, I need a stable
>>> version. This may be a big difference compared with most of the other
>>> open source projects.
>>>
>>>   
>>>       
>> Agreed.  There are active projects today still using 4.5 for development.
>>     
> Well, ... their problem.
>
>   
Yes but just a reminder that people lock down RTEMS versions for a LONG 
time and
invest large amounts of time and money into ensuring that it is stable 
on their hardware.
So they are not prone to upgrade. 
>>>>>  You will never fix this problem by
>>>>> testing. The effort to track down ONE error has been significantly high.
>>>>>       
>>>>>           
>>>> Yes, and? How many errors are there? 1 ... 10 ... 100s?
>>>>
>>>> I suspect very few, with most of them orbiting around "Chains" and
>>>> "Object", due to their working principle (based on aliasing types).
>>>>     
>>>>         
>>> How about the network stack, the web server, the filesystem,
>>> malloc/free... and the individual BSPs.
>>>       
> The networking stack triggers (un-)surprisingly few "suspects". The web
> server is a different story - It even triggers "punned pointer" warnings
> and almost any file it consists of, appears in my list.
>
>   
Put the GoAhead webserver last on your list.  Steven Johnson is having 
success with
shttpd and the maintainer wants the RTEMS port.  So GoAhead will at 
least become
a secondary httpd if not dropped.  Probably before we sort this mess out.
>> Ralf's diff'ing of assembly at least narrows down the candidates 
>> significantly.
>>     
> ATM (without having tried to eliminate false positive) about 1/3 of all
> *.c files get listed.
>
>   
Ahhh... so you are only looking at cpukit that certainly makes the 
percentage worse. :(
>> Hopefully,
>> we can pick a single CPU to analyze on first that we think is very likely to
>> have these optimization problems.
>>     
> To getting started, I'd suggest to try the posix BSP under Linux.
> This only uses a very limited part of the RTEMS sources, and uses a
> native Linux-gcc, which can be assumed to be in far better shape than a
> standard FSF-gcc.
>
> My knowledge on i386 is poor, but unless I am completely wrong,
> Peer's/Thomas issue is visible under i386-FC6 and Cygwin.
>
>   
I thought their analysis was on PowerPC code.  But it may show up on x86.
>>   Fix them there and have less to look at
>> on other architectures.
>>
>> When analyzing the questionable output, adding "-g -Wa,-aldh" to intersperse
>> C with assembly might be helpful.  I assume you are just running a 
>> script which objdump's each .o in two build trees and compares them now.
>>     
>
> Yes, what I am basically doing.
>   
Well it does seem to narrow the field and it's the only automated way we 
know of.
>   
>>>> We must provoke these bugs to be able to "nail them down" and not pamper
>>>> them with "-fno-strict-aliasing".
>>>>     
>>>>         
>>> Maybe the following steps would make sense:
>>>
>>> - Somebody (Ralf?) might track down the suspect modules by Ralfs method
>>> to compare the compiler output (using an archtiecture with MANY
>>> optimization headroom. PPC is not so bad due to its many general purpose
>>> registers, but maybe another architecture is better)
>>>       
>    
>   
>> - Verify difference is a breakage. :)
>>     
> That's a real issue. I need to think about how we could try to approach
> this problem.
>   
I cc'ed you on the gcc list asking that person who said they had a patch 
to improve
the warnings.  Don't know if it will help us or not but it doesn't hurt 
to ask.

It might also be interesting to see if we can get a warning out of the code
we know breaks. Should gcc have generated a warning?
> Having a list is one thing, working it off and documenting steps having
> performed (e.g. files having been identified as false positives) is
> different one.
>
> If we had a functional bugtracking system, I'd file PRs on serious
> suspects such that others who are more familiar with a particular
> architecture can have a look into it.
>
>   
Chris is working. 
> I could also add my list to CVS-HEAD, where it could be reformated into
> a table ...
>
>   
The Wiki should be useful for this.  You want a table and everyone can 
edit it
whether or not they have CVS write access. 

Did you see the table I put together for newlib memory allocations (that no
one else has editted :( )?  That is probably a good model for this table.
>>> - The various suspect packages could be redesigned by some suitable
>>> persons (I would volunteer for some of the code)
>>>
>>>   
>>>       
>> - Once a problem is fixed, we may need to regenerate suspect list in 
>> case it is
>> a side-effect of inlining.
>>     
>>> - In parallel, 4.7 will be cut with -fno-strict-aliasing
>>>
>>> - the 4.8 development branch will temporarily use -fno-strict-aliasing
>>> aswell, until the code has been revised
>>>       
> This is completely non-acceptable to me.
>
>   
>> I don't know how Ralf is going to add this compiler flag.
>>     
> I don't know either. As I've said before on PM (wrt. -g), this is not as
> simple as it might appear.
>
>   
>> Random thought -- could it be turned on/off with a configuration flag which
>> defaults on no strict aliasing in 4.7 and strict aliasing in 4.8?
>>     
> No.
>
>   
>>> - Then, the 4.8 development branch will switch back to -fstrict-aliasing
>>> AND enable more aliasing warnings (there was some GCC switch to do this)
>>>
>>> What do all of you think of this?
>>>
>>>   
>>>       
>> 4.7 needs strict aliasing disabled.  I am prone to leaving it on for 4.8 
>> but also
>> turn that extra warning on. 
>>
>> I am afraid that if we let it get turned off by default on 4.8, it will 
>> never get turned back on.
>>     
> Exactly - Therefore I refuse to accept turning it on 4.8.
>
>   
At the current time, we have EXACTLY ONE!!! proven case of breakage from 
strict-aliasing.
Anything else is pure speculation.  So let's get the chain fix in -- 
Thomas' patch and Pavel's
check for further degradation.  Both are in hand and improve the 
situation.  That by itself
might be enough to lower your count some.

cpukit/score is the most important place to review first from a general 
viewpoint.

If someone wants to review a particular BSP, then we will need to 
provide instructions for that.
And a place to document that it has been reviewed.

--joel

> Ralf
>
>
>
>   




More information about the users mailing list