PowerPC exceptions and critical interrupts

Wed Jul 9 14:06:59 UTC 2008

Till Straumann wrote:
> [...]
>>>>
>>>>
>>>> I now store the interrupt stack end in SPRG2 and check if the 
>>>> exception stack is in the interrupt stack area. If not then the 
>>>> stack will be switched:
>>>>
>>>>    /* Switch stack if necessary */
>>>>    mfspr    SCRATCH_REGISTER_0, SPRG1
>>>>    cmpw    SCRATCH_REGISTER_0, r1
>>>>    blt    wrap_stack_switch_\_FLVR
>>>>    mfspr    SCRATCH_REGISTER_1, SPRG2
>>>>    cmpw    SCRATCH_REGISTER_1, r1
>>>>    blt    wrap_stack_switch_done_\_FLVR
>>>>
> I also thought about this approach at one time but the IRQ stack size
> wasn't readily available (it was hardcoded into bspstart of different
> BSPs, IIRC). That has changed since (again IIRC) so this implementation
> could be an alternative.
>>>> wrap_stack_switch_\_FLVR:
>>>>
>>>>    mr    r1, SCRATCH_REGISTER_0
>>>>
>>>> wrap_stack_switch_done_\_FLVR:
>>>>
>>>> The switch back is just (FRAME_REGISTER is the r1 of the exception 
>>>> prologue, which points to the exception frame):
>>>>
>>>>    mr    r1, FRAME_REGISTER
>>>>     
>>>
>>> Here I'm lost - why do you want to change the current algorithm? It 
>>> already takes care of switching
>>> the stack if necessary in a safe way. At least that is what it 
>>> should do -- if you think there is an error
>>> then please try to explain.
>>>   
>>
>> The current algorithm cannot cope with machine checks (they are not 
>> in the interrupt disable mask) and has to disable interrupts.
> I thought we agreed that machine checks must never be disabled and that
> there will be no OS support for machine-check handlers (meaning: they
> must not call OS primitives such as rtems_semaphore_release() etc.). In
> particular, a machine-check handler must never cause thread dispatching.
We agree on this.
>
> Is there - under this assumption - still a problem with machine-checks
> which the alternative algorithm (switch stack if SP not yet pointing
> into IRQ stack) would solve?

Yes. The current algorithm protects its critical section (used by EE, CE 
and ME) by disabling interrupts (EE and CE). Machine checks are not 
masked, so an ME taken inside the critical section may be unrecoverable 
because its exception frame gets corrupted:

	mfspr	r1, SPRG1

	/* ME happens here -> the exception and ISR frame pointers later have equal values, and the ME exception frame will be moved and overwritten */

no_r1_reload_\FLVR:
	addi	\RA, \RA, 1
	stw		\RA, _ISR_Nest_level@sdarel(r13)
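
For comparison, the range check of the alternative algorithm can be 
modelled in C. This is only a sketch under my assumptions: the function 
names are made up, the stack grows downwards, SPRG1 holds the interrupt 
stack begin (highest address) and SPRG2 the end (lowest address). The 
model compares the addresses as unsigned values.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of the range check. irq_stack_begin is the highest
 * address (initial SP, kept in SPRG1), irq_stack_end the lowest address
 * (kept in SPRG2).
 */
static bool sp_is_on_irq_stack(uintptr_t sp, uintptr_t irq_stack_begin,
                               uintptr_t irq_stack_end)
{
  return sp > irq_stack_end && sp <= irq_stack_begin;
}

/* Returns the stack pointer the exception prologue should continue with. */
static uintptr_t select_exception_sp(uintptr_t sp, uintptr_t irq_stack_begin,
                                     uintptr_t irq_stack_end)
{
  if (sp_is_on_irq_stack(sp, irq_stack_begin, irq_stack_end)) {
    return sp; /* nested exception: keep the current stack */
  }

  return irq_stack_begin; /* first level: switch to the interrupt stack */
}
```

The decision depends only on the current r1 and two constants, so as far 
as I can see there is no nest level counter that an ME could observe in 
a half-updated state.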

>
> I don't perceive disabling interrupts during the stack switch as a
> problem: the critical section is extremely short.

Ok, not a problem but a shortcoming. You may consider the use of an 
additional SPRG a shortcoming too, but the SPRGs are dedicated to 
exception handler use.

>>
>>>> But if old or new the critical exceptions don't work for me. One 
>>>> bug was in the epilogue. You have to disable all exceptions which 
>>>> may cause a context switch between the restore of the SRRs and the 
>>>> RFI.
>>>>     
>>> Not sure there is a bug. During the epilogue the MSR setting should 
>>> be the same as when the
>>> exception was taken (otherwise there is a bug).
>>> Therefore, during the epilogue of a non-critical exception, EE 
>>> should already be
>>> disabled, during the epilogue of a critical interrupt CE and EE 
>>> should be disabled.
>>>
>>> Because there are two sets of SRRs it is OK if a critical exception 
>>> happens during the execution
>>> of the epilogue of a non-critical one.
>>>
>>> If you think there is a bug please describe a detailed scenario of a 
>>> race condition.
>>>   
>>
>> Suppose we are in the epilogue code of an EE between the move to SRRs 
>> and the RFI. Here EE is disabled but CE is enabled. Now a CE happens. 
>> The handler decides that a thread dispatch is necessary. The CE 
>> checks if this is possible:
>>   o The thread dispatch disable level is 0, because the EE has 
>> already decremented it.
>>   o The EE lock variable is cleared.
>>   o The EE is not executing its first instruction.
>> Hence a thread dispatch is allowed. The CE issues a context switch to 
>> a task with EE enabled (for example a task waiting for a semaphore). 
>> Now an EE happens and the current content of the SRRs is lost.
> Good catch. Will fix.
>>
>>> [...]
>>
>> I tried to get the critical interrupts working for a couple of days 
>> on the MPC8313ERDB but there is still an error. I added much debug code:
>>   o The interrupts (= EE and CE) are disabled completely within the 
>> thread dispatch function.
>>   o The interrupts will be disabled in the prologue before the 
>> allocation of the exception stack frame.
>>   o The task stacks will be checked with an MD5 checksum.
>>   o A monitor task checks all registers except three scratch registers.
>>   o All EE, CE and thread dispatch events are stored in a ring buffer 
>> with various information.
>>   o All synchronous exceptions are disabled through an infinite loop.
>> The system still crashes sometimes in such a way that the CPU is no 
>> longer fully accessible by the Lauterbach debugger.
> I suppose by 'crash' you mean 'freeze'. You don't have any register 
> dump etc.
>
> Can you find out if the CPU is in checkstop? Can you still access 
> memory from the debugger?

The debugger doesn't display that the CPU is in checkstop state, so I 
don't know. The debugger can access the GPRs and memory but not the 
SPRs (including the MSR etc.).
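
For reference, the event trace mentioned in the list above needs very 
little code. The following is a minimal sketch, not the code I actually 
use: the names, the buffer size and the recorded fields (MSR and stack 
pointer) are assumptions for illustration.

```c
#include <stdint.h>

#define TRACE_SIZE 64u /* hypothetical size; must be a power of two */

typedef enum { TRACE_EE, TRACE_CE, TRACE_DISPATCH } trace_kind;

typedef struct {
  trace_kind kind;
  uint32_t   msr; /* hypothetical payload: MSR at the time of the event */
  uint32_t   sp;  /* hypothetical payload: stack pointer (r1) */
} trace_entry;

static trace_entry trace_buf[TRACE_SIZE];
static uint32_t    trace_next; /* total number of records written so far */

/*
 * Stores one event and overwrites the oldest entry once the buffer is
 * full. The caller must have EE and CE disabled, otherwise two handlers
 * could write to the same slot.
 */
static void trace_record(trace_kind kind, uint32_t msr, uint32_t sp)
{
  trace_entry *e = &trace_buf[trace_next & (TRACE_SIZE - 1u)];

  e->kind = kind;
  e->msr = msr;
  e->sp = sp;
  ++trace_next;
}
```

Since the debugger can still read memory after the freeze, the buffer 
contents remain available even if the SPRs are not.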

>> Sometimes the monitor task detects an inconsistent condition 
>> register CR. If I don't use critical interrupts everything seems to 
>> be fine. Maybe this is due to the exception code that I use (it is a 
>> heavily modified CVS version). I switched to the CVS version, but 
>> encountered similar problems. Do we have a system that works with 
>> critical interrupts?
> Hmm - I played a bit with CE but w/o calling OS primitives. I didn't
> disable/enable CE from rtems_interrupt_disable/rtems_interrupt_enable
> but left them on all the time.
> I didn't see a problem but that doesn't mean anything, of course.
>
> I'll try to do some more testing -- what does your CE handler do ?

From the OS side only an rtems_semaphore_release(), but I have the 
crashes even if I don't call any OS routines in the critical handler.

>>
>> In order to enable the operating system support for critical 
>> interrupts you have to disable critical interrupts around critical 
>> sections. But what is now the benefit of critical interrupts? It may 
>> be better to drop the direct operating system support for critical 
>> interrupts. This would simplify the exception code greatly.
> That's true. But IMO the current implementation of the exception
> handling code (with all bugs fixed, that is) gives the user both options:
>
> a) use CE w/o what you call "OS support". In this case, CE is not
>    included in the mask used by rtems_interrupt_disable. CEs may happen
>    anytime but handlers cannot use OS primitives.
> b) use CE with "OS support" (behaving essentially like non-critical
>    interrupts). In this case, CE has to be included in the mask used by
>    rtems_interrupt_disable. CEs are masked during critical sections in
>    the OS but CE handlers may use OS primitives.
>
> The exact semantics could be set by a configuration variable and
> defined by the application (implementation simply checks for config var
> at startup and adds/removes MSR_CE from the mask that is cached in SPRGx).
>
> Note: ME is always on; machine-checks can happen anytime and must never
>       use OS services (probably except for printk and the like).
>
> -- T.
>> We wouldn't have to add the critical interrupts to the interrupt 
>> disable mask so they can happen anytime (except within critical 
>> interrupt exception handler code). If someone wants operating system 
>> support he can trigger an external exception within the critical 
>> interrupt handler code (two step handler). With the current approach 
>> the critical interrupts can only interrupt small parts of the 
>> external exception handler code and are everywhere else like normal 
>> external exceptions.
>
>
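
To make the two options a) and b) concrete, the startup selection of the 
interrupt disable mask could look roughly like this. It is a sketch only: 
the helper names and the boolean configuration parameter are 
hypothetical, and the MSR bit values are the Book E ones, used here for 
illustration (other cores such as the e300 place CE elsewhere, so check 
the core manual).

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values as on Book E cores; they differ on other cores. */
#define MSR_CE UINT32_C(0x00020000)
#define MSR_EE UINT32_C(0x00008000)
#define MSR_ME UINT32_C(0x00001000)

/*
 * Hypothetical helper: returns the MSR bits that rtems_interrupt_disable
 * would clear. With OS support for CE, the CE bit is part of the mask
 * (option b); without it only EE is masked (option a). MSR_ME is never
 * part of the mask.
 */
static uint32_t interrupt_disable_mask(bool ce_has_os_support)
{
  uint32_t mask = MSR_EE;

  if (ce_has_os_support) {
    mask |= MSR_CE;
  }

  return mask;
}

/* Hypothetical helper: MSR value inside an OS critical section. */
static uint32_t msr_in_critical_section(uint32_t msr, uint32_t mask)
{
  return msr & ~mask;
}
```

With option a) a CE can still arrive inside an OS critical section, which 
is why its handler must not call OS primitives.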

We plan to commit the modified code to CVS during the week. This code 
does not work with critical interrupts, at least on the MPC8313ERDB 
(this is also true of the current CVS version). It would be nice if you 
could find the time to look at it and test it with your BSPs.

-- 
Sebastian Huber, Embedded Brains GmbH

Address : Obere Lagerstr. 30, D-82178 Puchheim, Germany
Phone   : +49 89 18 90 80 79-6
Fax     : +49 89 18 90 80 79-9
E-Mail  : sebastian.huber at embedded-brains.de
PGP     : Public key available on request

This message is not a business communication within the meaning of the EHUG.