Problem report: Struct aliasing problem causes Thread_Ready_Chain corruption in 4.6.99.3

Tue Nov 21 21:30:36 UTC 2006

Joel,

I am not convinced. I have looked at your patch, and I think it just
overlays two "Chain_Node" structures to the existing "Chain_Control"
structure, but as far as I see, all accesses to the Chain_Control are
still performed directly to the three pointers first/permanent_null/last
and since you did NOT change the accesses to Chain_Control, the aliasing
problem will still be there.

Let me try to describe the problem the compiler will have with this code
when optimizing.

Assume we have a chain with two linked nodes:

Chain_Control *the_chain;
Chain_Node *node1,*node2;
Chain_Node *node_ptr;

  node1     = malloc(sizeof(*node1));
  node2     = malloc(sizeof(*node2));
  the_chain = malloc(sizeof(*the_chain));

Lets say, that the nodes are properly linked:

  the_chain->first          = node1;
  the_chain->permanent_null = NULL;
  the_chain->last           = node2;

  node1->previous = (Chain_Node *)&(the_chain->first);
  node1->next     = node2;
  node2->previous = node1;
  node2->next      = (Chain_Node *)&(the_chain->permanent_null);

So, now everything is set up.

Now, let's assume that in some piece of code we want to remove node2
from the chain and directly afterwards we want to check, whether the
chain is empty. The code might look like this:

  node_ptr->previous->next = node_ptr->next;
  node_ptr->next->previous = node_ptr->previous;
  chain_is_empty=(the_chain->last == (Chain_Node *)&(the_chain->first));

And as far as I understood, here we MAY run into a problem: For the
compiler, the operations done via the node_ptr have absolutely nothing
to do with the operations done via the "the_chain" pointer. The logic
behind this is as follows:

  1.) node_ptr is of type "Chain_Node *".
  2.) the_chain is of type "Chain_Control *".
  3.) Both data types are totally distinct
  4.) Strict type control makes sure, that any modification done using
the "node_ptr" will make absolutely NO modification in the
"Chain_Control" data structure, because the node_ptr can only point to a
Chain_Node data structure.
  5.) And here we are: As a consequence of Rule 4.) the compiler is
allowed to change the order of the three lines of code, because
apparently they have nothing to do with each other. It is legal to keep
the modifications via node_ptr in registers (that's what actually
happens with GCC4 according to Peer) and therefore the code breaks.

What can we do?

I have the impression, that we ust tell the compiler, that Chain_Control
actually consists of two (or at least one and a half) Chain_Node
structures, and (!!!) that we are actually accessing these structures
when we work with Chain_Control (!!!). Therefore, in your patch, all
accesses that currently still use Chain_Control.first MUST be redirected
to Chain_Control.Permanent_Head.Node.next. Only then the compiler has
the knowledge that an access using a node_ptr MAY influence the elements
in Chain_Control that are acutally used.

I have worked out a patch that implements this, I have also added two
macros _Chain_First(Chain_Control *) and _Chain_last(Chain_Control *) to
encapsulate all accesses to "first" and "last".

And I have gone through about 100 code lines that were accessing these
elements directly and (hopefully) replaced them with the proper macros.

Now I can compile the MBX8xx BSP again, but I have not yet tried out to
run it.

------------------

Now to Pavel's hint and your comments:

Joel Sherrill schrieb:
> Thomas Doerfler (nt) wrote:
> 
> shared between the nodes), but maybe it is not SO important. In that
> case, Chain_Control might also be defined as follows:
> 
> typedef union {
>   struct Chain_Node_struct head;
>   struct Chain_Node_struct tail;
> } Chain_Control;
> 
> What would get broken if we change to a 16 byte Chain_Control?
>   
> 
>> My first thought is that it would break every operation involving the
>> first or last element on the list.

I don't see this: in the 12 byte version, Chain_Control.head.previous
and Chain_Control.tail.next share the same memory location (which is
permanently NULL). In a 16 byte version, this location would not be
shared and both pointers must be initialized to NULL idependently. But
this should be the only difference. The rest of the operations should be
more or less the same, because up to now, the "middle" pointer stays
untouched in all situations (that's why it is called "permanent_null")
and in the 16 byte version, there are two distinct pointers which are
NULL anyway.

This was rather lengthy, I hope I could make my thoughts rather clear
(although it is getting late here and my head get empty :-)

wkr,
Thomas.
-- 
--------------------------------------------
IMD Ingenieurbuero fuer Microcomputertechnik
Thomas Doerfler           Herbststrasse 8
D-82178 Puchheim          Germany
email:    Thomas.Doerfler at imd-systems.de
PGP public key available at:
     http://www.imd-systems.de/pgpkey_en.html