PowerPC exceptions and critical interrupts

Thu Jul 10 09:26:00 UTC 2008

Sebastian Huber wrote:
> Till Straumann wrote:
>> [...]
>>>>>
>>>>>
>>>>> I now store the interrupt stack end in SPRG2 and check whether the
>>>>> exception stack is in the interrupt stack area. If not, the
>>>>> stack is switched:
>>>>>
>>>>>    /* Switch stack if necessary */
>>>>>    mfspr    SCRATCH_REGISTER_0, SPRG1
>>>>>    cmpw    SCRATCH_REGISTER_0, r1
>>>>>    blt    wrap_stack_switch_\_FLVR
>>>>>    mfspr    SCRATCH_REGISTER_1, SPRG2
>>>>>    cmpw    SCRATCH_REGISTER_1, r1
>>>>>    blt    wrap_stack_switch_done_\_FLVR
>>>>>
>> I also thought about this approach at one time but the IRQ stack size
>> wasn't readily available (it was hardcoded into bspstart of different
>> BSPs, IIRC). That has changed since (again IIRC) so this implementation
>> could be an alternative.
>>>>> wrap_stack_switch_\_FLVR:
>>>>>
>>>>>    mr    r1, SCRATCH_REGISTER_0
>>>>>
>>>>> wrap_stack_switch_done_\_FLVR:
>>>>>
>>>>> Switching back is then just (FRAME_REGISTER is the r1 of the
>>>>> exception prologue, which points to the exception frame):
>>>>>
>>>>>    mr    r1, FRAME_REGISTER
>>>>>     
>>>>
>>>> Here I'm lost - why do you want to change the current algorithm? It
>>>> already takes care of switching the stack if necessary in a safe way.
>>>> At least that is what it should do -- if you think there is an error
>>>> then please try to explain.
>>>>   
>>>
>>> The current algorithm cannot cope with machine checks (they are not 
>>> in the interrupt disable mask) and has to disable interrupts.
>> I thought we agreed that machine checks must never be disabled and
>> that there will be no OS support for machine-check handlers (meaning:
>> they must not call OS primitives such as rtems_semaphore_release()
>> etc.). In particular, a machine-check handler must never cause thread
>> dispatching.
> We agree on this.
>>
>> Is there - under this assumption - still a problem with machine-checks
>> which the alternative algorithm (switch stack if SP not yet pointing
>> into IRQ stack) would solve?
>
> Yes. The current algorithm protects its critical section (used by EE, 
> CE and ME) via disabling of interrupts (EE and CE). So any machine 
> check may be nonrecoverable due to a corrupted ME exception frame:
>
>     mfspr    r1, SPRG1
>
>     /* ME happens here -> the exception and ISR frame pointers later have
>        equal values and the ME exception frame will be moved and
>        overwritten */
Sure - but that requires an ME to happen during this section of code. If
exception handling is correct (recoverable) machine-checks should not
happen here. Also, currently there are no asynchronous machine-checks.

Anyways, it seems that the alternative implementation has already been
there (#ifdef clause) so it can be reactivated easily [as I said: when I
first implemented it the size of the ISR stack wasn't readily available
but it is now...]
>
> no_r1_reload_\FLVR:
>     addi    \RA, \RA, 1
>     stw        \RA, _ISR_Nest_level@sdarel(r13)
>
>
>>
>> I don't perceive disabling interrupts during the stack switch as a
>> problem: the critical section is extremely short.
>
> Ok, not a problem but a shortcoming. You may consider the usage of an 
> additional SPRG a shortcoming too, but the SPRGs are dedicated to 
> exception handler usage.
Doesn't have to be a SPRG - a variable will do, too.
>
>>>
>>>>> But whether old or new, the critical exceptions don't work for me.
>>>>> One bug was in the epilogue. You have to disable all exceptions
>>>>> which may cause a context switch between the restore of the SRRs
>>>>> and the RFI.
>>>>>     
>>>> Not sure there is a bug. During the epilogue the MSR setting should
>>>> be the same as when the exception was taken (otherwise there is a
>>>> bug). Therefore, during the epilogue of a non-critical exception, EE
>>>> should already be disabled; during the epilogue of a critical
>>>> interrupt, CE and EE should be disabled.
>>>>
>>>> Because there are two sets of SRRs it is OK if a critical exception
>>>> happens during the execution of the epilogue of a non-critical one.
>>>>
>>>> If you think there is a bug please describe a detailed scenario of 
>>>> a race condition.
>>>>   
>>>
>>> Suppose we are in the epilogue code of an EE between the move to 
>>> SRRs and the RFI. Here EE is disabled but CE is enabled. Now a CE 
>>> happens. The handler decides that a thread dispatch is necessary. 
>>> The CE checks if this is possible:
>>>   o The thread dispatch disable level is 0, because the EE has 
>>> already decremented it.
>>>   o The EE lock variable is cleared.
>>>   o The EE is not executing its first instruction.
>>> Hence a thread dispatch is allowed. The CE issues a context switch
>>> to a task with EE enabled (for example a task waiting for a
>>> semaphore). Now an EE happens and the current content of the SRRs is
>>> lost.
>> Good catch. Will fix.
>>>
>>>> [...]
>>>
>>> I tried to get the critical interrupts working for a couple of days
>>> on the MPC8313ERDB but there is still an error. I added a lot of
>>> debug code:
>>>   o The interrupts (= EE and CE) are disabled completely within the
>>> thread dispatch function.
>>>   o The interrupts are disabled in the prologue before the
>>> allocation of the exception stack frame.
>>>   o The task stacks are checked with an MD5 checksum.
>>>   o A monitor task checks all registers except three scratch registers.
>>>   o All EE, CE and thread dispatch events are stored in a ring
>>> buffer with various information.
>>>   o All synchronous exceptions are disabled through an infinite loop.
>>> The system still crashes sometimes in such a way that the CPU is no
>>> longer fully accessible by the Lauterbach debugger.
>> I suppose by 'crash' you mean 'freeze'. You don't have any register 
>> dump etc.
>>
>> Can you find out if the CPU is in checkstop? Can you still access 
>> memory from the debugger?
>
> The debugger doesn't indicate that the CPU is in checkstop state, so I
> don't know. The debugger can access the GPRs and memory but not the
> SPRs (including MSR etc.).
>
>>> Sometimes the monitor task detects an inconsistent condition
>>> register CR. If I don't use critical interrupts everything seems to
>>> be fine. Maybe this is due to the exception code that I use (it is a
>>> heavily modified CVS version). I switched to the CVS version, but
>>> encountered similar problems. Do we have a system that works with
>>> critical interrupts?
>> Hmm - I played a bit with CE but w/o calling OS primitives. I didn't
>> disable/enable CE from rtems_interrupt_disable/rtems_interrupt_enable
>> but left them on all the time.
>> I didn't see a problem but that doesn't mean anything, of course.
>>
>> I'll try to do some more testing -- what does your CE handler do ?
>
> From the OS side only a rtems_semaphore_release(), but I have the 
> crashes even if I don't call any OS routines in the critical handler.
>
>>>
>>> In order to enable the operating system support for critical
>>> interrupts you have to disable critical interrupts around critical
>>> sections. But what then is the benefit of critical interrupts? It
>>> may be better to drop the direct operating system support for
>>> critical interrupts. This would simplify the exception code greatly.
>> That's true. But IMO the current implementation of the exception
>> handling code (with all bugs fixed, that is) gives the user both
>> options:
>>
>> a) use CE w/o what you call "OS support". In this case, CE is not
>>     included in the mask used by rtems_interrupt_disable. CEs may
>>     happen anytime but handlers cannot use OS primitives.
In this scenario, any calls to _Thread_Dispatch() after handling a CE
must be prevented (see below).
>> b) use CE with "OS support" (behaving essentially like non-critical
>>     interrupts). In this case, CE has to be included in the mask used
>>     by rtems_interrupt_disable. CEs are masked during critical
>>     sections in the OS but CE handlers may use OS primitives.
>>
>>
>> The exact semantics could be set by a configuration variable and
>> defined by the application (implementation simply checks for config
>> var at startup and adds/removes MSR_CE from the mask that is cached
>> in SPRGx).
>>
>> Note: ME is always on; machine-checks can happen anytime and must
>>          never use OS services (probably except for printk and the
>>          like).
>>
>> -- T.
>>> We wouldn't have to add the critical interrupts to the interrupt 
>>> disable mask so they can happen anytime (except within critical 
>>> interrupt exception handler code). If someone wants operating system 
>>> support he can trigger an external exception within the critical 
>>> interrupt handler code (two step handler). With the current approach 
>>> the critical interrupts can only interrupt small parts of the 
>>> external exception handler code and are everywhere else like normal 
>>> external exceptions.
>>
>>
>
> We plan to add the modified code to the CVS during the week. This code 
> is not working with critical interrupts at least on the MPC8313ERDB 
> (this is also true for the current CVS version). It would be nice if 
> you find the time to look at it and test it with your BSPs.
>

I ran more tests today (mvme3100 / mpc8540) using CEs w/o calling OS
services:

  1) I enabled CEs early during bspstart so all tasks have CE enabled.
  2) CEs have NOT been added to the general IRQ mask, i.e., they can
     happen anytime and are always enabled (except while servicing a CE).
  3) Under these premises a modification was necessary so that handling
     a CE never calls _Thread_Dispatch() (from ppc_exc_wrapup()).
     NOTE: this is contrary to what I wrote yesterday -- if CEs are to
     remain always enabled then thread-dispatching after returning from
     a CE MUST be prevented, i.e., a further test needs to be added to
     the code.
  4) I use the watchdog timer to generate a CE interrupt every 1024
     timebase ticks (~40 kHz).
  5) I use an openpic timer to generate a normal/classic interrupt every
     1023 timebase ticks (so that it beats against the CE rate; this way,
     a CE hits the 'normal' exception at different places).
  6) I flood-ping the board from a Linux host.
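
To illustrate why the 1024/1023 choice makes the CE land at different places:
the sketch below computes where the k-th CE falls within the period of the
normal interrupt. It assumes both timers count the same timebase, start
together, and that latencies are ignored; ce_phase() and distinct_phases()
are names made up for this example.

```c
#include <stdint.h>
#include <string.h>

#define MAX_PERIOD 2048u

/*
 * Offset (in timebase ticks) of the k-th CE within the period of the
 * normal interrupt. Precondition: irq_period != 0.
 */
static uint32_t ce_phase(uint32_t k, uint32_t ce_period, uint32_t irq_period)
{
    return (uint32_t)(((uint64_t)k * ce_period) % irq_period);
}

/* Number of distinct offsets hit by the first irq_period CEs. */
static uint32_t distinct_phases(uint32_t ce_period, uint32_t irq_period)
{
    static uint8_t seen[MAX_PERIOD];
    uint32_t count = 0;

    if (irq_period == 0 || irq_period > MAX_PERIOD) {
        return 0;  /* outside the range this sketch supports */
    }
    memset(seen, 0, sizeof seen);
    for (uint32_t k = 0; k < irq_period; ++k) {
        uint32_t p = ce_phase(k, ce_period, irq_period);
        if (!seen[p]) {
            seen[p] = 1;
            ++count;
        }
    }
    return count;
}
```

With periods 1024 and 1023 the offset advances by one tick per CE, so all
1023 offsets are visited; with two equal periods the CE would always hit the
same spot.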
  6) I flood-ping the board from a linux host

The system under test has so far handled more than a billion CE IRQs 
(~7h) and is still running fine.
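
The additional test mentioned in 3) could look roughly like the following.
This is a sketch only, not the code in ppc_exc_wrapup(): the exc_flavor enum,
the function name and its parameters are all hypothetical, and it does not
model the epilogue window between the SRR restore and the RFI.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical exception flavors; not the identifiers used in RTEMS. */
typedef enum { FLAVOR_STD, FLAVOR_CRIT, FLAVOR_MCHK } exc_flavor;

/*
 * Decide whether a thread dispatch may follow the exception handler.
 * ce_in_disable_mask: true if MSR_CE is part of the mask used by
 * rtems_interrupt_disable() (CE "with OS support").
 */
static bool may_dispatch_after(exc_flavor flavor,
                               bool ce_in_disable_mask,
                               uint32_t dispatch_disable_level,
                               bool dispatch_needed)
{
    if (!dispatch_needed || dispatch_disable_level != 0) {
        return false;
    }
    if (flavor == FLAVOR_MCHK) {
        return false;  /* machine checks never dispatch */
    }
    if (flavor == FLAVOR_CRIT && !ce_in_disable_mask) {
        return false;  /* always-enabled CEs never dispatch */
    }
    return true;
}
```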

In your case - are you sure that *all* critical sections disable MSR_CE?
I'll try to do some more testing tomorrow where I include MSR_CE in the
general mask (used by rtems_interrupt_disable) but it is hard to identify
sections of code which do not use rtems_interrupt_disable()/enable() but
flip MSR_EE directly...
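
A minimal sketch of how the general mask could be selected at startup,
depending on whether CEs get "OS support". The function names are
hypothetical and the MSR bit values are the Book E ones, used here as an
assumption; bit positions differ between cores, so check the core manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed Book E MSR bit values; other cores place CE elsewhere. */
#define MSR_ME 0x00001000u
#define MSR_EE 0x00008000u
#define MSR_CE 0x00020000u

/* Mask of MSR bits cleared by the interrupt-disable primitive. */
static uint32_t interrupt_disable_mask(bool ce_has_os_support)
{
    uint32_t mask = MSR_EE;

    if (ce_has_os_support) {
        mask |= MSR_CE;  /* CE behaves like a normal interrupt */
    }
    return mask;         /* MSR_ME is never part of the mask */
}

/* MSR value inside a critical section protected by the given mask. */
static uint32_t msr_with_interrupts_disabled(uint32_t msr, uint32_t mask)
{
    return msr & ~mask;
}
```

Code that clears MSR_EE directly instead of going through the mask stays
interruptible by CEs even when MSR_CE is part of the mask, which is why such
sections matter.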

WKR
-- Till