PowerPC exceptions and critical interrupts

Tue Jul 8 18:47:24 UTC 2008

Sebastian Huber wrote:
> Till Straumann wrote:
>> Sorry for my late answer. I just didn't have time :-(
>>
>> Sebastian Huber wrote:
>>  
>>> Hi,
>>> sorry for the long silence but the critical exceptions troubled me a 
>>> lot in the last days.
>>>
>>> Till Straumann wrote:
>>>    
>>>> Sebastian Huber wrote:
>>>>  
>>>>      
>>>>> [...]
>>>>> Yes, but this is yet another magic place that the BSP developer 
>>>>> has to be aware of.
>>>>>             
>>>> No. The cpu support takes care of this. If the CPU is one of the 
>>>> 'BookE'-style CPUs then
>>>> the mask is changed to include CE and DE.
>>>>         
>>> My experience covers only two BSPs and for them were CPU ==  BSP ;-)
>>>     
>> ?? I don't understand this comment. What I tried to say was that 
>> (provided that your BSP
>> uses a CPU flavor that is already supported) the correct IRQ mask 
>> bits should be selected
>> by the CPU support. If you want to support a new flavor then you have 
>> to add to the different
>> bits and pieces to CPU support.
>>   
>
> This comment was intended as a joke.
>
>>>>> Here are some constrains that I assume:
>>>>> C1: Three asynchronous exception types ME, CE and EE with 
>>>>> priorities ME > CE > EE. (Debug exceptions DE?)
>>>>>             
>>>> Yes.
>>>>  
>>>>      
>>>>> C2: The exception handler code must not be interrupted by a lower 
>>>>> priority exception (mostly due to the thread dispatch disable level).
>>>>>             
>>>> I don't think the thread-dispatch disable level is an issue here.
>>>> IIRC what can get messed up are the SRRs, i.e., it could be that
>>>>
>>>> 1)  EE saves IR and MSR into SRRs
>>>> 2)  CE happens, saves IR and MSR into critical SRRs
>>>> 3)  handler is dispatched; if the handler re-enables EE then
>>>>      the values saved in SRRs at 1) could be overwritten by a
>>>>      second EE.
>>>>  
>>>>      
>>>>> C3: To support RTEMS functions and the dispatcher the critical 
>>>>> sections in the system core must be protected by 
>>>>> rtems_interrupt_{enable, disable}. Thus if we want to use RTEMS 
>>>>> within machine check exceptions we have to add MSR_ME to the 
>>>>> disable mask. This may result in check stops. This is an 
>>>>> inacceptable tradeoff.
>>>>>             
>>>> I agree. MSR_ME should always be enabled. This severely limits what 
>>>> a ME handler
>>>> is allowed to do but IMO in many cases we don't have to recover 
>>>> from a ME which
>>>> can be considered fatal but we do want print diagnostic output. A 
>>>> check-stop is most
>>>> undesirable.
>>>>  
>>>>      
>>>>> C4: The asynchronous high level handler share a common stack.
>>>>>             
>>>> Just so we can leverage the existing support code (and the whole 
>>>> configuration-table
>>>> business etc).
>>>>  
>>>>      
>>>>> The user of critical interrupts has to be very careful not to 
>>>>> enable external exceptions since a
>>>>>
>>>>> uint32_t level = _ISR_Get_level();
>>>>> _ISR_Set_level( 0);
>>>>> _ISR_Set_level( level);
>>>>>
>>>>> will not work.
>>>>>             
>>>> If we support critical interrupts then the inline routine in 
>>>> irq_supp.h
>>>> could be modified accordingly. All BSPs using the inline routine would
>>>> not have to care.
>>>>         
>>> I don't know how to handle this proper. For my tests I changed it to:
>>>
>>> static inline uint32_t   _CPU_ISR_Get_level( void )
>>> {
>>>  register unsigned int msr;
>>>  _CPU_MSR_GET(msr);
>>>  if (msr & MSR_EE) {
>>>    return 0;
>>>  } else {
>>>    return 1;
>>>  }
>>> }
>>>     
>> Not sure that's the right thing. We now have effectively three levels
>>  0 all exceptions enabled
>>  1 only critical exceptions enabled
>>  2 all exceptions disabled
>> (assuming - as we had agreed earlier -  that machine-check is never 
>> disabled,
>> not even in level 2)
>>
>> Get_level and Set_level and probably MODES_INTERRUPT_MASK etc.
>> would have to be adapted accordingly.
>>
>> The ISR wrapper code would only re-enable exceptions at the level
>> of the currently handled exception.
>>   
>
> Ok.
>
>>> static inline void _CPU_ISR_Set_level( uint32_t   level )
>>> {
>>>  register unsigned int msr;
>>>  _CPU_MSR_GET(msr);
>>>  if (!(level & CPU_MODES_INTERRUPT_MASK)) {
>>>    msr |= MSR_EE | MSR_E300_CE | MSR_ME;
>>>     
>> This is BAD because it re-introduces PPC flavor-specific bits into
>> cpukit. The exact bit combination should be stored in a variable which
>> is set up by CPU support code (libcpu). The most critical value
>> (i.e., the mask that is used to totally disable interrupts) could be 
>> cached
>> in a SPRG.
>>   
>
> Yes, this was only a temporary hack.
>
>>>  }  else {
>>>    msr &= ~MSR_EE;
>>>  }
>>>  _CPU_MSR_SET(msr);
>>> }
>>>
>>> I added MSR_E300_CE and MSR_ME only because _CPU_ISR_Set_level() 
>>> will be used to set the initial task MSR. Is there another way to 
>>> set such bits?
>>>
>>>    
>>>>> The ppc_exc_msr_irq_mask is currently used to protect the stack 
>>>>> switch and ISR nest level.
>>>>>             
>>>> But the ultimate purpose was using that same mask for 
>>>> rtems_interrupt_enable/disable.
>>>> For efficiency reasons I'd suggest to cache the mask in a SPRG.
>>>>  
>>>>      
>>>>> Machine checks are not included in the disable mask (MSR_ME) that 
>>>>> is an inconsistency the current implementation for C4.             
>>>> Yes.
>>>>  
>>>>      
>>>>> After a review of the code I think that if C2 is true the 
>>>>> disabling of all asynchronous exceptions is superfluous. A stack 
>>>>> switch would be problematic only if we increment the ISR nest 
>>>>> level and perform the stack switch afterwards, because a higher 
>>>>> priority exception may assume that a stack switch is not necessary 
>>>>> but actually has not occured.
>>>>>             
>>>> Not sure I understand how you would make the operation of testing 
>>>> whether a switch is
>>>> necessary and doing the switch an atomic operation (you also have 
>>>> to take in consideration
>>>> the case where CE tries to switch just while EE switches back).
>>>>
>>>> I believe I thought about it carefully and came to the conclusion 
>>>> that disabling all
>>>> interrupts was necessary (but I could, of course be proven wrong). 
>>>> OTOH,
>>>> I don't see any harm because the critical section is only 7 
>>>> instructions long.
>>>>         
>>> Yes you are right. If we don't disable the exceptions then a high 
>>> priority exception may overwrite its own stack if the stack is 
>>> switched but the lower priority exception was interrupted before it 
>>> could write the new nest level.
>>>
>>> I store now the interrupt stack end in SPR2 and check if the 
>>> exception stack is in the interrupt stack area. If not then the 
>>> stack will be switched:
>>>
>>>    /* Switch stack if necessary */
>>>    mfspr    SCRATCH_REGISTER_0, SPRG1
>>>    cmpw    SCRATCH_REGISTER_0, r1
>>>    blt    wrap_stack_switch_\_FLVR
>>>    mfspr    SCRATCH_REGISTER_1, SPRG2
>>>    cmpw    SCRATCH_REGISTER_1, r1
>>>    blt    wrap_stack_switch_done_\_FLVR
>>>
I also thought about this approach at one time but the IRQ stack size
wasn't readily available (it was hardcoded into bspstart of different BSPs,
IIRC). That has changed since (again IIRC) so this implementation could
be an alternative.
>>> wrap_stack_switch_\_FLVR:
>>>
>>>    mr    r1, SCRATCH_REGISTER_0
>>>
>>> wrap_stack_switch_done_\_FLVR:
>>>
>>> The switch back is only (FRAME_REGISTER is the r1 of the exception 
>>> prologue which points to the exception frame):
>>>
>>>    mr    r1, FRAME_REGISTER
>>>     
>>
>> Here I'm lost - why do you want to change the current algorithm? It 
>> already takes care of switching
>> the stack if necessary in a safe way. At least that is what it should 
>> do -- if you think there is an error
>> then please try to explain.
>>   
>
> The current algorithm cannot cope with machine checks (they are not in 
> the interrupt disable mask) and has to disable interrupts.
I thought we agreed that machine checks must never be disabled and that 
there
will be no OS support for machine-check handlers (meaning: they must not
call OS primitives such as rtems_semaphore_release() etc.). In particular,
a machine-check handler must never cause thread dispatching.

Is there - under this assumption - still a problem with machine-checks which
the alternative algorithm (switch stack if SP not yet pointing into IRQ 
stack)
would solve?

I don't perceive disabling interrupts during the stack switch as a problem:
the critical section is extremely short.
>
>>> But if old or new the critical exceptions don't work for me. One bug 
>>> was in the epilogue. You have to disable all exceptions which may 
>>> cause a context switch between the restore of the SRRs and the RFI.
>>>     
>> Not sure there is a bug.During the epilogue the MSR setting should be 
>> the same as when the
>> exception was taken (otherwise there is a bug).
>> Therefore, during the epilogue of a non-critical exception, EE should 
>> already be
>> disabled, during the epilogue of a critical interrupt CE and EE 
>> should be disabled.
>>
>> Because there are two sets of SRRs it is OK if a critical exception 
>> happens during the execution
>> of the epilogue of a non-critical one.
>>
>> If you think there is a bug please describe a detailed scenario of a 
>> race condition.
>>   
>
> Suppose we are in the epilogue code of an EE between the move to SRRs 
> and the RFI. Here EE is disabled but CE is enabled. Now a CE happens. 
> The handler decides that a thread dispatch is necessary. The CE checks 
> if this is possible:
>   o The thread dispatch disable level is 0, because the EE has already 
> decremented it.
>   o The EE lock variable is cleared.
>   o The EE executes not the first instruction.
> Hence a thread dispatch is allowed. The CE issues a context switch to 
> a task with EE enabled (for example a task waiting for a semaphore). 
> Now a EE happens and the current content of the SRRs is lost.
Good catch. Will fix.
>
>> [...]
>
> I tried to get the critical interrupts working for a couple of days on 
> the MPC8313ERDB but there is still an error. I added much debug code:
>   o The interrupts (= EE and CE) are disabled completely within the 
> thread dispatch function.
>   o The interrupts will be disabled in the prologue before the 
> allocation of the exception stack frame.
>   o The task stacks will be checked with a MD5 checksum.
>   o A monitor task checks any registers except three scratch registers.
>   o Any EE, CE and thread dispatch events are stored in a ring buffer 
> with various information.
>   o All synchronous exceptions are disabled through an infinite loop.
> The system still crashes sometimes in such a way that the CPU is no 
> more fully accessible by the Lauterbach debugger.
I suppose by 'crash' you mean 'freeze'. You don't have any register dump 
etc.

Can you find out if the CPU is in checkstop? Can you still access memory 
from the debugger?
> Sometimes the monitor task detects an inconsistent comparison register 
> CR. If I don't use critical interrupts everything seems to be fine. 
> Maybe this is due to the exception code that I use (it is a heavily 
> modified CVS version). I switched to the CVS version, but encountered 
> similar problems. Do we have a system that works with critical 
> interrupts?
Hmm - I played a bit with CE but w/o calling OS primitives. I didn't 
disable/enable CE
from rtems_interrupt_disable/rtems_interrupt_enable but left them on all 
the time.
I didn't see a problem but that doesn't mean anything, of course.

I'll try to do some more testing -- what does your CE handler do ?
>
> In order to enable the operating system support for critical 
> interrupts you have to disable critical interrupts around critical 
> sections. But what is now the benefit of critical interrupts? It may 
> be better to drop the direct operating system support for critical 
> interrupts. This would simplify the exception code greatly.
That's true. But IMO the current implementation of the exception 
handling code
(with all bugs fixed, that is) gives the user both options:

 a) use CE w/o what you call "OS support". In this case, CE is not 
included in the
     mask used by rtems_interrupt_disable. CEs may happen anytime but 
handlers
     cannot use OS primitives.
 b) use CE with "OS support" (behaving essentially like non-critical 
interrupts). In
     this case, CE has to be included in the mask used by 
rtems_interrupt_disable.
     CEs are masked during critical sections in the OS but CE handlers 
may use
     OS primitives.

The exact semantics could be set by a configuration variable and defined by
the application (implementation simply checks for config var at startup and
adds/removes MSR_CE from the mask that is cached in SPRGx).

Note: ME is always on; machine-checks can happen anytime and must never use
          OS services (probably except for printk and the like).

-- T.
> We wouldn't have to add the critical interrupts to the interrupt 
> disable mask so they can happen anytime (except within critical 
> interrupt exception handler code). If someone wants operating system 
> support he can trigger an external exception within the critical 
> interrupt handler code (two step handler). With the current approach 
> the critical interrupts can only interrupt small parts of the external 
> exception handler code and are everywhere else like normal external 
> exceptions.