Problem report: Struct aliasing problem causes Thread_Ready_Chain corruption in 4.6.99.3

Tue Nov 21 23:50:13 UTC 2006

FYI

linux (2.6.18) is built with -fno-strict-aliasing and I believe that's 
what we
should do. It is extremely hard to find all problematic cases of aliasing
and as someone pointed out, as gcc gets smarter we'll run into more
problems.

Look e.g., at networking code where a checksum of a header is computed.
Type-punning is pretty common in these situations and I wouldn't trust
old BSD code to be written with strict aliasing in mind.

-- Till

Joel Sherrill wrote:
> Thomas Doerfler (nt) wrote:
>   
>> Joel,
>>
>> I am not convinced. I have looked at your patch, and I think it just
>> overlays two "Chain_Node" structures to the existing "Chain_Control"
>> structure, but as far as I see, all accesses to the Chain_Control are
>> still performed directly to the three pointers first/permanent_null/last
>> and since you did NOT change the accesses to Chain_Control, the aliasing
>> problem will still be there.
>>
>>   
>>     
> Which version?  #1 or #2? 
>
> You are right for version #1.  All it did was try to address permanent 
> head/tail nodes.
>
> Version #2 put the current structure in an anonymous structure which 
> should have
> improved the overlay.  But I didn't try to eliminate the use of 
> first/last as you did.
>
>   
>> Let me try to describe the problem the compiler will have with this code
>> when optimizing.
>>
>> Assume we have a chain with two linked nodes:
>>
>> Chain_Control *the_chain;
>> Chain_Node *node1,*node2;
>> Chain_Node *node_ptr;
>>
>>   node1     = malloc(sizeof(*node1));
>>   node2     = malloc(sizeof(*node2));
>>   the_chain = malloc(sizeof(*the_chain));
>>
>> Lets say, that the nodes are properly linked:
>>
>>   the_chain->first          = node1;
>>   the_chain->permanent_null = NULL;
>>   the_chain->last           = node2;
>>
>>   node1->previous = (Chain_Node *)&(the_chain->first);
>>   node1->next     = node2;
>>   node2->previous = node1;
>>   node2->next      = (Chain_Node *)&(the_chain->permanent_null);
>>
>> So, now everything is set up.
>>
>> Now, let's assume that in some piece of code we want to remove node2
>> from the chain and directly afterwards we want to check, whether the
>> chain is empty. The code might look like this:
>>
>>   node_ptr->previous->next = node_ptr->next;
>>   node_ptr->next->previous = node_ptr->previous;
>>   chain_is_empty=(the_chain->last == (Chain_Node *)&(the_chain->first));
>>
>> And as far as I understood, here we MAY run into a problem: For the
>> compiler, the operations done via the node_ptr have absolutely nothing
>> to do with the operations done via the "the_chain" pointer. The logic
>> behind this is as follows:
>>
>>   1.) node_ptr is of type "Chain_Node *".
>>   2.) the_chain is of type "Chain_Control *".
>>   3.) Both data types are totally distinct
>>   4.) Strict type control makes sure, that any modification done using
>> the "node_ptr" will make absolutely NO modification in the
>> "Chain_Control" data structure, because the node_ptr can only point to a
>> Chain_Node data structure.
>>   5.) And here we are: As a consequence of Rule 4.) the compiler is
>> allowed to change the order of the three lines of code, because
>> apparently they have nothing to do with each other. It is legal to keep
>> the modifications via node_ptr in registers (that's what actually
>> happens with GCC4 according to Peer) and therefore the code breaks.
>>
>>
>> What can we do?
>>
>> I have the impression, that we ust tell the compiler, that Chain_Control
>> actually consists of two (or at least one and a half) Chain_Node
>> structures, and (!!!) that we are actually accessing these structures
>> when we work with Chain_Control (!!!). Therefore, in your patch, all
>> accesses that currently still use Chain_Control.first MUST be redirected
>> to Chain_Control.Permanent_Head.Node.next. Only then the compiler has
>> the knowledge that an access using a node_ptr MAY influence the elements
>> in Chain_Control that are acutally used.
>>
>> I have worked out a patch that implements this, I have also added two
>> macros _Chain_First(Chain_Control *) and _Chain_last(Chain_Control *) to
>> encapsulate all accesses to "first" and "last".
>>
>> And I have gone through about 100 code lines that were accessing these
>> elements directly and (hopefully) replaced them with the proper macros.
>>
>> Now I can compile the MBX8xx BSP again, but I have not yet tried out to
>> run it.
>>
>>   
>>     
> I would rather have this as the fix but was trying to avoid modifying 
> all the code
> you had to on my first pass.  I believe it is needed as well. 
>
> I think this is better from a maintenance and compiler correctness 
> viewpoint.
>
> But I am still generally concerned that something else will break with 
> strict-alias.
> Will the heap code break?  It is certainly abusive of pointers but maybe 
> those
> are all void * conversions to a block pointer.
>
> Are there BSPs or device drivers which will break?
>
> I would technically prefer to be able to have strict-alias on but the 
> release
> manager side of me says to turn it off so we can get stable and have a 
> release.
>
> I agree that completely technically correct code should be OK with 
> strict aliasing.
> But I don't believe that every part of RTEMS including BSPs, device drivers,
> etc are strict aliasing clean.  So it is a pragmatic  matter to turn it 
> off. 
> If Linux and/or *BSD is currently compiled with strict alias'ing disabled,
> then that would encourage me to leave it off even for 4.8.
>
> For 4.8, we can at least turn on more strict alias warnings.  Turn it on 
> or off in 4.8 is a
> community decision but if we leave it on in 4.8 and there is a BSP issue,
> I would be prone to tell people to turn it off and try it.  That narrows 
> down the
> problem.  But again as Peer's problem indicates, it can be VERY 
> insidious and
> that bothers me.  He has spent a lot of debug time and it could be this 
> hard to
> find each individual place strict aliasing breaks.
>
>   
>> ------------------
>>
>> Now to Pavel's hint and your comments:
>>
>> Joel Sherrill schrieb:
>>   
>>     
>>> Thomas Doerfler (nt) wrote:
>>>
>>> shared between the nodes), but maybe it is not SO important. In that
>>> case, Chain_Control might also be defined as follows:
>>>
>>> typedef union {
>>>   struct Chain_Node_struct head;
>>>   struct Chain_Node_struct tail;
>>> } Chain_Control;
>>>
>>> What would get broken if we change to a 16 byte Chain_Control?
>>>   
>>>
>>>     
>>>       
>>>> My first thought is that it would break every operation involving the
>>>> first or last element on the list.
>>>>       
>>>>         
>> I don't see this: in the 12 byte version, Chain_Control.head.previous
>> and Chain_Control.tail.next share the same memory location (which is
>> permanently NULL). In a 16 byte version, this location would not be
>> shared and both pointers must be initialized to NULL idependently. But
>> this should be the only difference. The rest of the operations should be
>> more or less the same, because up to now, the "middle" pointer stays
>> untouched in all situations (that's why it is called "permanent_null")
>> and in the 16 byte version, there are two distinct pointers which are
>> NULL anyway.
>>
>>
>>
>> This was rather lengthy, I hope I could make my thoughts rather clear
>> (although it is getting late here and my head get empty :-)
>>
>> wkr,
>> Thomas.
>>   
>>     
>
> _______________________________________________
> rtems-users mailing list
> rtems-users at rtems.com
> http://rtems.rtems.org/mailman/listinfo/rtems-users
>