PowerPC exceptions and critical interrupts

Tue Jul 8 08:35:11 UTC 2008

Till Straumann wrote:
> Sorry for my late answer. I just didn't have time :-(
>
> Sebastian Huber wrote:
>   
>> Hi,
>> sorry for the long silence but the critical exceptions troubled me a 
>> lot in the last days.
>>
>> Till Straumann wrote:
>>     
>>> Sebastian Huber wrote:
>>>  
>>>       
>>>> [...]
>>>> Yes, but this is yet another magic place that the BSP developer has 
>>>> to be aware of.
>>>>     
>>>>         
>>> No. The cpu support takes care of this. If the CPU is one of the 
>>> 'BookE'-style CPUs then
>>> the mask is changed to include CE and DE.
>>>   
>>>       
>> My experience covers only two BSPs and for them were CPU ==  BSP ;-)
>>     
> ?? I don't understand this comment. What I tried to say was that 
> (provided that your BSP
> uses a CPU flavor that is already supported) the correct IRQ mask bits 
> should be selected
> by the CPU support. If you want to support a new flavor then you have to 
> add to the different
> bits and pieces to CPU support.
>   

This comment was intended as a joke.

>>>> Here are some constrains that I assume:
>>>> C1: Three asynchronous exception types ME, CE and EE with priorities 
>>>> ME > CE > EE. (Debug exceptions DE?)
>>>>     
>>>>         
>>> Yes.
>>>  
>>>       
>>>> C2: The exception handler code must not be interrupted by a lower 
>>>> priority exception (mostly due to the thread dispatch disable level).
>>>>     
>>>>         
>>> I don't think the thread-dispatch disable level is an issue here.
>>> IIRC what can get messed up are the SRRs, i.e., it could be that
>>>
>>> 1)  EE saves IR and MSR into SRRs
>>> 2)  CE happens, saves IR and MSR into critical SRRs
>>> 3)  handler is dispatched; if the handler re-enables EE then
>>>      the values saved in SRRs at 1) could be overwritten by a
>>>      second EE.
>>>  
>>>       
>>>> C3: To support RTEMS functions and the dispatcher the critical 
>>>> sections in the system core must be protected by 
>>>> rtems_interrupt_{enable, disable}. Thus if we want to use RTEMS 
>>>> within machine check exceptions we have to add MSR_ME to the disable 
>>>> mask. This may result in check stops. This is an inacceptable tradeoff.
>>>>     
>>>>         
>>> I agree. MSR_ME should always be enabled. This severely limits what a 
>>> ME handler
>>> is allowed to do but IMO in many cases we don't have to recover from 
>>> a ME which
>>> can be considered fatal but we do want print diagnostic output. A 
>>> check-stop is most
>>> undesirable.
>>>  
>>>       
>>>> C4: The asynchronous high level handler share a common stack.
>>>>     
>>>>         
>>> Just so we can leverage the existing support code (and the whole 
>>> configuration-table
>>> business etc).
>>>  
>>>       
>>>> The user of critical interrupts has to be very careful not to enable 
>>>> external exceptions since a
>>>>
>>>> uint32_t level = _ISR_Get_level();
>>>> _ISR_Set_level( 0);
>>>> _ISR_Set_level( level);
>>>>
>>>> will not work.
>>>>     
>>>>         
>>> If we support critical interrupts then the inline routine in irq_supp.h
>>> could be modified accordingly. All BSPs using the inline routine would
>>> not have to care.
>>>   
>>>       
>> I don't know how to handle this proper. For my tests I changed it to:
>>
>> static inline uint32_t   _CPU_ISR_Get_level( void )
>> {
>>  register unsigned int msr;
>>  _CPU_MSR_GET(msr);
>>  if (msr & MSR_EE) {
>>    return 0;
>>  } else {
>>    return 1;
>>  }
>> }
>>     
> Not sure that's the right thing. We now have effectively three levels
>  0 all exceptions enabled
>  1 only critical exceptions enabled
>  2 all exceptions disabled
> (assuming - as we had agreed earlier -  that machine-check is never 
> disabled,
> not even in level 2)
>
> Get_level and Set_level and probably MODES_INTERRUPT_MASK etc.
> would have to be adapted accordingly.
>
> The ISR wrapper code would only re-enable exceptions at the level
> of the currently handled exception.
>   

Ok.

>> static inline void _CPU_ISR_Set_level( uint32_t   level )
>> {
>>  register unsigned int msr;
>>  _CPU_MSR_GET(msr);
>>  if (!(level & CPU_MODES_INTERRUPT_MASK)) {
>>    msr |= MSR_EE | MSR_E300_CE | MSR_ME;
>>     
> This is BAD because it re-introduces PPC flavor-specific bits into
> cpukit. The exact bit combination should be stored in a variable which
> is set up by CPU support code (libcpu). The most critical value
> (i.e., the mask that is used to totally disable interrupts) could be cached
> in a SPRG.
>   

Yes, this was only a temporary hack.

>>  }  else {
>>    msr &= ~MSR_EE;
>>  }
>>  _CPU_MSR_SET(msr);
>> }
>>
>> I added MSR_E300_CE and MSR_ME only because _CPU_ISR_Set_level() will 
>> be used to set the initial task MSR. Is there another way to set such 
>> bits?
>>
>>     
>>>> The ppc_exc_msr_irq_mask is currently used to protect the stack 
>>>> switch and ISR nest level.
>>>>     
>>>>         
>>> But the ultimate purpose was using that same mask for 
>>> rtems_interrupt_enable/disable.
>>> For efficiency reasons I'd suggest to cache the mask in a SPRG.
>>>  
>>>       
>>>> Machine checks are not included in the disable mask (MSR_ME) that is 
>>>> an inconsistency the current implementation for C4.     
>>>>         
>>> Yes.
>>>  
>>>       
>>>> After a review of the code I think that if C2 is true the disabling 
>>>> of all asynchronous exceptions is superfluous. A stack switch would 
>>>> be problematic only if we increment the ISR nest level and perform 
>>>> the stack switch afterwards, because a higher priority exception may 
>>>> assume that a stack switch is not necessary but actually has not 
>>>> occured.
>>>>     
>>>>         
>>> Not sure I understand how you would make the operation of testing 
>>> whether a switch is
>>> necessary and doing the switch an atomic operation (you also have to 
>>> take in consideration
>>> the case where CE tries to switch just while EE switches back).
>>>
>>> I believe I thought about it carefully and came to the conclusion 
>>> that disabling all
>>> interrupts was necessary (but I could, of course be proven wrong). OTOH,
>>> I don't see any harm because the critical section is only 7 
>>> instructions long.
>>>   
>>>       
>> Yes you are right. If we don't disable the exceptions then a high 
>> priority exception may overwrite its own stack if the stack is 
>> switched but the lower priority exception was interrupted before it 
>> could write the new nest level.
>>
>> I store now the interrupt stack end in SPR2 and check if the exception 
>> stack is in the interrupt stack area. If not then the stack will be 
>> switched:
>>
>>    /* Switch stack if necessary */
>>    mfspr    SCRATCH_REGISTER_0, SPRG1
>>    cmpw    SCRATCH_REGISTER_0, r1
>>    blt    wrap_stack_switch_\_FLVR
>>    mfspr    SCRATCH_REGISTER_1, SPRG2
>>    cmpw    SCRATCH_REGISTER_1, r1
>>    blt    wrap_stack_switch_done_\_FLVR
>>
>> wrap_stack_switch_\_FLVR:
>>
>>    mr    r1, SCRATCH_REGISTER_0
>>
>> wrap_stack_switch_done_\_FLVR:
>>
>> The switch back is only (FRAME_REGISTER is the r1 of the exception 
>> prologue which points to the exception frame):
>>
>>    mr    r1, FRAME_REGISTER
>>     
>
> Here I'm lost - why do you want to change the current algorithm? It 
> already takes care of switching
> the stack if necessary in a safe way. At least that is what it should do 
> -- if you think there is an error
> then please try to explain.
>   

The current algorithm cannot cope with machine checks (they are not in 
the interrupt disable mask) and has to disable interrupts.

>> But if old or new the critical exceptions don't work for me. One bug 
>> was in the epilogue. You have to disable all exceptions which may 
>> cause a context switch between the restore of the SRRs and the RFI.
>>     
> Not sure there is a bug.During the epilogue the MSR setting should be 
> the same as when the
> exception was taken (otherwise there is a bug).
> Therefore, during the epilogue of a non-critical exception, EE should 
> already be
> disabled, during the epilogue of a critical interrupt CE and EE should 
> be disabled.
>
> Because there are two sets of SRRs it is OK if a critical exception 
> happens during the execution
> of the epilogue of a non-critical one.
>
> If you think there is a bug please describe a detailed scenario of a 
> race condition.
>   

Suppose we are in the epilogue code of an EE between the move to SRRs 
and the RFI. Here EE is disabled but CE is enabled. Now a CE happens. 
The handler decides that a thread dispatch is necessary. The CE checks 
if this is possible:
   o The thread dispatch disable level is 0, because the EE has already 
decremented it.
   o The EE lock variable is cleared.
   o The EE executes not the first instruction.
Hence a thread dispatch is allowed. The CE issues a context switch to a 
task with EE enabled (for example a task waiting for a semaphore). Now a 
EE happens and the current content of the SRRs is lost.

> [...]

I tried to get the critical interrupts working for a couple of days on 
the MPC8313ERDB but there is still an error. I added much debug code:
   o The interrupts (= EE and CE) are disabled completely within the 
thread dispatch function.
   o The interrupts will be disabled in the prologue before the 
allocation of the exception stack frame.
   o The task stacks will be checked with a MD5 checksum.
   o A monitor task checks any registers except three scratch registers.
   o Any EE, CE and thread dispatch events are stored in a ring buffer 
with various information.
   o All synchronous exceptions are disabled through an infinite loop.
The system still crashes sometimes in such a way that the CPU is no more 
fully accessible by the Lauterbach debugger. Sometimes the monitor task 
detects an inconsistent comparison register CR. If I don't use critical 
interrupts everything seems to be fine. Maybe this is due to the 
exception code that I use (it is a heavily modified CVS version). I 
switched to the CVS version, but encountered similar problems. Do we 
have a system that works with critical interrupts?

In order to enable the operating system support for critical interrupts 
you have to disable critical interrupts around critical sections. But 
what is now the benefit of critical interrupts? It may be better to drop 
the direct operating system support for critical interrupts. This would 
simplify the exception code greatly. We wouldn't have to add the 
critical interrupts to the interrupt disable mask so they can happen 
anytime (except within critical interrupt exception handler code). If 
someone wants operating system support he can trigger an external 
exception within the critical interrupt handler code (two step handler). 
With the current approach the critical interrupts can only interrupt 
small parts of the external exception handler code and are everywhere 
else like normal external exceptions.

-- 
Sebastian Huber, Embedded Brains GmbH

Address : Obere Lagerstr. 30, D-82178 Puchheim, Germany
Phone   : +49 89 18 90 80 79-6
Fax     : +49 89 18 90 80 79-9
E-Mail  : sebastian.huber at embedded-brains.de
PGP     : Public key available on request

Diese Nachricht ist keine geschäftliche Mitteilung im Sinne des EHUG.