Memory Barrier (was RE: rtems_semaphore_obtain problems identified)

Mon Sep 10 19:42:15 UTC 2007

Kate Feng wrote:
> Hello Pavel and everyone,
>
> I agree that the compiler memory barrier is asm volatile("" ::: "memory").
> However, I was talking about the run-time memory barrier to
> prevent aggressive out-of-order and speculative execution in the
> processor.
As Pavel has pointed out, that is not necessary because
every CPU has to follow a sequential execution model,
i.e., as seen from the CPU, instructions appear to execute
exactly in the sequence in which they are coded. (This doesn't
necessarily apply to transactions on an external bus;
hence the need for (PPC) guarded memory attributes,
synchronization instructions, etc.)

Assume you have two ('pseudo') instructions

disable_irq
load_register_X

and the CPU loads register X out of order, effectively
before interrupts are disabled.

Then, assume an interrupt happens and alters register X
after it has been loaded.
(something you want to prevent from happening by
programming the 'disable_irq' instruction).

The CPU must (and does) recognize this situation
and will re-load reg. X after the ISR returns. I.e.,
the effective sequence of events is roughly:

load X out-of-order
IRQ happens; execute ISR (alters X)
disable interrupt
load X again

Note that this works fine if X is loaded from ordinary
memory but could have unwanted side-effects
if the address in question was mapped to a device
(e.g., a FIFO).
However, device-addresses usually are
marked 'guarded + cache-inhibited' (PPC) so that
out-of-order accesses do not happen (a bit simplified).

T.

>
> I understand that "sync" is expensive, so it is better used at
> the application level, only when necessary according to the flow
> of the application.  However, what is important as well
> is the effective location where 'sync' should be applied.
> Actually, POWER4 and up (e.g. POWER5) processors
> have 'lwsync', which one can consider using at the
> O.S. level as a memory barrier that provides the same
> ordering function as the sync instruction, except that a load
> caused by an instruction following the lwsync may be performed
> before a store caused by an instruction that precedes the lwsync,
> and the ordering does not apply to accesses to I/O memory
> (memory-mapped I/O).
>
> Thus, I proposed lwsync for the above processors as the
> memory barrier at the OS level.  Users can then decide
> the location of the I/O memory barrier (e.g. eieio for PPC) according
> to their own application.  For processors which do not support lwsync,
> lwsync is treated as sync.
>
> Back to the principle: where is the effective location for lwsync or
> sync?
> More below.
>
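The lwsync-with-fallback proposal quoted above could be sketched roughly
as follows. This is an illustration only: SKETCH_HAVE_LWSYNC,
SKETCH_MEMORY_BARRIER, SKETCH_IO_BARRIER and sketch_publish are
hypothetical names, not RTEMS API, and the non-PowerPC branch (the
GCC/Clang __sync_synchronize() builtin) is an assumption added only so
that the example builds on other hosts.

```c
#include <stdint.h>

#if defined(__powerpc__) || defined(__PPC__)
#  if defined(SKETCH_HAVE_LWSYNC)
/* Lightweight sync: a later load may still pass an earlier store, and
 * accesses to memory-mapped I/O are not ordered (as described above). */
#    define SKETCH_MEMORY_BARRIER() __asm__ volatile("lwsync" ::: "memory")
#  else
/* CPUs without lwsync: treat it as the heavier full sync. */
#    define SKETCH_MEMORY_BARRIER() __asm__ volatile("sync" ::: "memory")
#  endif
/* I/O ordering is left to the user, e.g. via eieio. */
#  define SKETCH_IO_BARRIER() __asm__ volatile("eieio" ::: "memory")
#else
/* Assumption: non-PowerPC host, use the compiler's full-barrier builtin. */
#  define SKETCH_MEMORY_BARRIER() __sync_synchronize()
#  define SKETCH_IO_BARRIER()     __sync_synchronize()
#endif

static uint32_t          sketch_payload; /* data handed to another agent */
static volatile uint32_t sketch_ready;   /* flag telling it the data is valid */

/* Store the payload, then the flag; the barrier keeps the two stores
 * in that order as seen from outside this CPU. */
void sketch_publish(uint32_t value)
{
    sketch_payload = value;
    SKETCH_MEMORY_BARRIER();
    sketch_ready = 1;
}
```

Store-to-store ordering like this is within what lwsync provides; a
consumer would need a matching barrier between reading the flag and
reading the payload.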
> Pavel Pisa wrote:
>> On Thursday 06 September 2007 11:43, Feng, Kate wrote:
>>   
>>> Joel Sherrill wrote :
>>>     
>>>> The memory barrier patch was PR 866 and was merged in March 2006.  It is
>>>> in all 4.7 versions.  It first appeared in 4.6.6.
>>>>       
>>> It looks like it, but RTEMS4.7.x still needs patches.
>>> This is not even fixed in 4.7.99.2.
>>> The memory barrier definitely should be fixed in RTEMS4.7.x
>>> before jumping to RTEMS4.8.
>>>
>>> Suggestions follow; I hope I am not missing anything,
>>> since I came up with this a while ago.
>>>
>>> 1) In cpukit/score/include/rtems/system.h:
>>>
>>> #define RTEMS_COMPILER_MEMORY_BARRIER() asm volatile("" ::: "memory")
>>>
>>> seems to be wrong and misplaced.
>>>
>>> The memory barrier is processor dependent.
>>> For example, the memory barrier for PowerPC is "sync".
>>>
>>> Thus, for PPC,  it would seem more functional to place
>>> #define RTEMS_COMPILER_MEMORY_BARRIER() asm volatile("sync"::: "memory")
>>>
>>> in cpukit/score/cpu/powerpc/system.h
>>> or somewhere in the processor branch.
>>>     
>>
>> Hello Kate and others,
>>
>> I would like to react here, because I think that the proposed
>> addition of "sync" is a move in a really bad direction.
>>
>> RTEMS_COMPILER_MEMORY_BARRIER is and should remain what it
>> is, I believe. It is a barrier against reordering of instructions
>> across the barrier by the compiler optimizer.
>> It does not try to declare or cause any globally visible
>> ordering guarantee, by name or otherwise.
>>
>> Each CPU conforming to architecture 'X' has to guarantee
>> that, even after complex CPU-level instruction reordering,
>> register renaming and delaying of transfers, a sequence
>> of instructions results in the same state (all viewed
>> from the CPU's POV) as if the instructions had been processed
>> in sequential order, one by one.
>> This does not say anything about the order of external memory
>> transfers at all (at least for PPC; there are some special rules
>> for x86 caches for compatibility with old programs).
>>
>> The macro ensures only the ordering of memory transfers
>> from the current CPU's POV/perspective. But this is enough
>> even for normal-mode code and a subsequently invoked
>> exception handler working with the same data.
>> Even if the exception handler starts before the CPU has finished
>> the transfers caused by previously initiated operations, reads
>> from the exception handler on the same!!! CPU would read back data
>> from the write buffer if the address corresponds to previously
>> written data. So preemption or CPU IRQ flag manipulation within the
>> scope of the current CPU does not need ordering of real
>> memory enforced by the very expensive "sync" instruction. It only
>> needs to be sure that the CPU accounts for/is aware of the written
>> value at the correct point in the instruction sequence.
>>
>> On the other hand, there could be other reasons and situations
>> requiring correct ordering of externally visible transfers.
>> For example, suppose the IRQ controller is mapped as a peripheral
>> into external memory/IO space, the CPU IRQ is disabled, then some
>> mask is changed in the controller to disable one of the external
>> sources, and it is expected that after IRQs are enabled at the CPU
>> level no event can arrive from that source. Then the ordering
>> of reads and writes to the controller has to be synchronized
>> with the CPU ("eieio" has to be used in the PPC case). But that
>> is not a task for CPU-level IRQ state manipulation. The ordering
>> should be, and in the RTEMS case is, ensured by the IO access
>> routines, which include the "eieio" instruction. On the other hand,
>> if some external device is accessed through overlay structures
>> (even volatile ones), then ordering could be broken without
>> an explicitly inserted "eieio".
>> Another legitimate requirement for a strict ordering/barrier for
>> external accesses is the case where an external device/DMA/coprocessor
>> accesses/shares data in system/main memory with the CPU.
> Data shared among multiple threads needs a memory barrier
> as well. Thus, a semaphore used for synchronization between two
> different threads needs it as well.  At cpukit/rtems/src/semrelease.c,
> the 4.7.x OS did not want the compiler to reorder at that point,
> before _Thread_Enable_dispatch().
> However, does it make sense to allow run-time memory accesses to
> stay out of order until the code reaches the 'sync' or 'lwsync' at
> the user level?  Perhaps those who understand all levels of the OS
> will know the answer to this better.
> Logically, I am a little bit confused.
>> The "sync"/cache range invalidation/flushing is required before
>> and after external memory accesses (the exact details depend on
>> the transfer directions and other parameters).
>>
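The I/O access routines with "eieio" mentioned above might look roughly
like the sketch below, applied to the IRQ-controller mask example. All
names here (SKETCH_EIEIO, sketch_in_32, sketch_out_32,
sketch_mask_irq_source) are hypothetical and are not the RTEMS
routines themselves; on non-PowerPC hosts the macro degrades to a
compiler-only barrier so that the example still builds.

```c
#include <stdint.h>

#if defined(__powerpc__) || defined(__PPC__)
/* Enforce in-order execution of I/O accesses on PowerPC. */
#define SKETCH_EIEIO() __asm__ volatile("eieio" ::: "memory")
#else
/* Assumption: other hosts, compiler-only barrier for illustration. */
#define SKETCH_EIEIO() __asm__ volatile("" ::: "memory")
#endif

/* Read a 32-bit device register, then order it against later accesses. */
static inline uint32_t sketch_in_32(const volatile uint32_t *reg)
{
    uint32_t value = *reg;
    SKETCH_EIEIO();
    return value;
}

/* Write a 32-bit device register, then order it against later accesses. */
static inline void sketch_out_32(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
    SKETCH_EIEIO();
}

/* Set the mask bit of one external source in a (hypothetical)
 * memory-mapped interrupt controller mask register. */
void sketch_mask_irq_source(volatile uint32_t *mask_reg, unsigned source)
{
    uint32_t mask = sketch_in_32(mask_reg);
    sketch_out_32(mask_reg, mask | (1u << source));
}
```

Accessing the same register through a plain (even volatile) struct
overlay would skip SKETCH_EIEIO(), which is the case where ordering
could be broken.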
>>   
>>> 3) In PPC shared/irq/irq.c and other PPC files,
>>> _ISR_Disable( _level ) and _ISR_Enable( _level )
>>> should be used instead of _CPU_ISR_Disable(level) and
>>> _CPU_ISR_Enable( _level )
>>>     
>>
>>
>> But I fully agree with you that sequences like the following
>> one are fatally broken
>>
>>      _CPU_ISR_Disable(level);
>>      *irq = rtems_hdl_tbl[irq->name];
>>      _CPU_ISR_Enable(level);
>>
>> There is no guarantee that the operation which was expected
>> to be protected will not be moved outside of the protecting sequence.
>> An explicit or implicit RTEMS_COMPILER_MEMORY_BARRIER is missing there.
>>
>>   
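A minimal sketch of how the sequence above could be protected, assuming
that the fix is simply a compiler barrier on the inside edge of each
raw disable/enable. Every SKETCH_/sketch_ name is hypothetical; the
real _CPU_ISR_Disable()/_CPU_ISR_Enable() change the CPU's interrupt
mask, which is modelled here by an ordinary variable.

```c
/* Compiler-only barrier (GCC-style inline asm); emits no instruction. */
#define SKETCH_COMPILER_BARRIER() __asm__ volatile("" ::: "memory")

/* Stand-in for the CPU interrupt-enable state (1 = enabled). */
static unsigned sketch_irq_enabled = 1;

/* Raw primitives without any barrier, like the broken sequence above. */
#define SKETCH_CPU_ISR_DISABLE(level) \
    do { (level) = sketch_irq_enabled; sketch_irq_enabled = 0; } while (0)
#define SKETCH_CPU_ISR_ENABLE(level) \
    do { sketch_irq_enabled = (level); } while (0)

/* Wrapped variants: the barrier keeps protected accesses inside. */
#define SKETCH_ISR_DISABLE(level) \
    do { SKETCH_CPU_ISR_DISABLE(level); SKETCH_COMPILER_BARRIER(); } while (0)
#define SKETCH_ISR_ENABLE(level) \
    do { SKETCH_COMPILER_BARRIER(); SKETCH_CPU_ISR_ENABLE(level); } while (0)

typedef struct {
    int  name;           /* index into the handler table */
    void (*hdl)(void);   /* handler routine */
} sketch_irq_entry;

static sketch_irq_entry sketch_hdl_tbl[4];

void sketch_default_handler(void) { }

/* Copy one table entry while interrupts are (notionally) disabled. */
void sketch_get_handler(sketch_irq_entry *irq)
{
    unsigned level;

    SKETCH_ISR_DISABLE(level);
    *irq = sketch_hdl_tbl[irq->name];
    SKETCH_ISR_ENABLE(level);
}
```

With the barriers in place the compiler may not move the table copy
before the disable or after the enable; no "sync" is involved.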
>>> Actually, I think a better one should be rtems_interrupt_disable(level)
>>> and rtems_interrupt_enable(level).
>>>     
>>
>> The code should be changed according to one of your proposals.
>>
>>   
>>> 2) In order for the inline to work, the
>>> CPU_INLINE_ENABLE_DISPATCH should be defined to be TRUE.
>>>
>>> Thus,
>>> in cpukit/score/cpu/powerpc/rtems/score/cpu.h:
>>>
>>> -#define CPU_INLINE_ENABLE_DISPATCH       FALSE
>>> +#define CPU_INLINE_ENABLE_DISPATCH       TRUE
>>>     
>>
>> As for the non-inlined version of _Thread_Enable_dispatch,
>> there should not be a problem. A call to a function without static
>> or inline attributes is treated as a full compiler memory ordering
>> barrier point. So no explicit compiler barrier should be needed
>> there.
> I agree with Sergei's vote.
>
> PS. Does anyone know how to find the number of OPcodes for all the
> PPC assembly code ?
>
> Regards,
> Kate
>> All this is based upon my understanding of the code and of computer
>> systems principles. There is no doubt that there could be
>> many other problems and errors. But if there are problems
>> with IRQ behavior on PPC, then it should be checked that sequences
>> like the one above do not exist. The _ISR_Disable()/_ISR_Enable()
>> or higher-level rtems_ variants should be used in the source
>> file noted above. Otherwise bad things could happen.
>>
>> Excuse me for the long answer, but I wanted to clarify things
>> as well as I could.
>>
>> Best wishes
>>
>>             Pavel
>>   