SEVERE Bug in mc68360 _ISR_Handler???

Tue Jul 17 12:16:53 UTC 2001

Quick answer to a long, detailed analysis.  The short answer
is that on m68k architectures where there are separate
stacks, we check the ISF (F/VO in m68k parlance) as a hardware
means to determine if we are nested or not.

#if ( M68K_HAS_SEPARATE_STACKS == 1 )
        movew   #0xf000,d0               | isolate format nibble
        andw    a7@(SAVED+FVO_OFFSET),d0 | get F/VO
        cmpiw   #0x1000,d0               | is it a throwaway isf?
        bne     exit                     | NOT outer level, so branch
#endif
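
As a rough C rendering of the check above (a sketch only, not RTEMS
code): it masks the top nibble of the saved format/vector offset word
and compares it with 0x1000, which the assembly comment describes as a
throwaway ISF. The function and macro names are hypothetical, and the
vector offsets in the usage note are arbitrary illustrative values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the constants are taken from the assembly above. */
#define FVO_FORMAT_MASK      0xf000u  /* top nibble holds the frame format */
#define FVO_FORMAT_THROWAWAY 0x1000u  /* value the assembly compares against */

/* Returns true when the format nibble of the saved F/VO word matches
 * the throwaway format, i.e. the case where the assembly does NOT
 * take the "bne exit" branch. */
static bool isf_is_throwaway(uint16_t fvo)
{
    return (fvo & FVO_FORMAT_MASK) == FVO_FORMAT_THROWAWAY;
}
```

For example, isf_is_throwaway(0x1070) is true, while
isf_is_throwaway(0x0070) is false, so the second case would take the
branch to exit.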

Is there any indication in HARDWARE that this is a nested interrupt?

If not, are you assured that the first instruction of the outer
ISR is or is not executed?  The m68k _ISR_Handler code increments
_Thread_Dispatch_disable_level as the first instruction on the CPU32.

SYM (_ISR_Handler):
        addql   #1,SYM (_Thread_Dispatch_disable_level) | disable multitasking

If the architecture guarantees the 1st instruction of an ISR is
executed, then this would be sufficient to prevent this scenario.

I am not trying to argue you out of what is happening, only that
we need to be 100% sure that the 360 does not guarantee the execution
of the 1st instruction of an ISR in the case of nested interrupts of
a particular priority sequence.   This is basically arguing over
precisely where the transition between (2) and (3) below occurs.

--joel

Thomas Doerfler wrote:
> 
> Hello,
> 
> I address this list to get some help concerning the behaviour of the
> _ISR_Handler used for the MC68360 in rtems-4.5.0. I think there is a
> very small chance that lower-level interrupts get lost (or delayed
> forever) when a higher-level interrupt comes up at a critical point.
> 
> This mail is going to be a bit long, but the issue is rather
> complicated as well.
> 
> SYSTEM BACKGROUND
> ===================
> I have designed a system based on the MC68360 (and the gen68360 BSP),
> which is heavily working with Ethernet and TCP/IP. Ethernet works
> with built-in SCC1, all CPM interrupt sources are handled on IRQ
> Level 4. I use the PIT as system clock timer, working on IRQ Level 6
> (so it is higher than the CPM IRQ level).
> 
> All in all the system works fine, but on very rare occasions the
> system communication interfaces got stuck. Last week I succeeded in
> finding out why. I built a test environment and sent UDP packets to the
> system with almost all the Ethernet bandwidth, adding a flood ping to
> the network load. In that environment it took between 1 and 4 hours
> until the system got stuck, and then I found that the "In-Service
> Bit" of SCC1 in the CPM Interrupt Controller was set although the
> core did not execute the corresponding interrupt function.
> 
> This bit gets set whenever the CPM Interrupt Controller sends the
> SCC1 vector number to the CPU and must be cleared in software. As
> long as this bit is set, no other CPM interrupts will be issued.
> NOTE: Even the SCC1 interrupt request will no longer be asserted
> until this bit gets cleared.
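
The blocking behaviour described above can be sketched as a tiny C
model. This is a deliberate simplification that follows only the
description in this mail (one in-service bit that gates every CPM
request); it is not the real CPM register interface, and all names are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the CPM interrupt controller. */
struct cpm_model {
    bool scc1_request;     /* SCC1 wants service */
    bool other_request;    /* any other CPM source wants service */
    bool scc1_in_service;  /* set on vector fetch, cleared by software */
};

/* Per the description: while the SCC1 in-service bit is set, no CPM
 * interrupt request (including SCC1's own) reaches the CPU. */
static bool cpm_irq_asserted(const struct cpm_model *c)
{
    if (c->scc1_in_service)
        return false;
    return c->scc1_request || c->other_request;
}

/* The controller supplies the SCC1 vector and marks it in service. */
static void cpm_vector_fetch_scc1(struct cpm_model *c)
{
    c->scc1_in_service = true;
}

/* What the SCC1 handler is expected to do before it returns. */
static void scc1_handler_clears_in_service(struct cpm_model *c)
{
    c->scc1_in_service = false;
}
```

In this model, if the vector fetch happens but the handler never runs,
cpm_irq_asserted() stays false no matter how many requests are pending.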
> 
> The code of the SCC1 interrupt handler
> 
> "m360Enet_interrupt_handler (rtems_vector_number v)"
> 
> is correct: whenever this handler gets called, the In-Service bit is
> definitely cleared. So my assumption is that:
> 
> 1) a SCC1 interrupt gets asserted,
> 
> 2) then the CPU performs the corresponding vector fetch
> 
> 3) but in rare conditions the corresponding handler will not get
> called
> 
> By the way: I lowered the PIT IRQ request level to 3, then the system
> worked fine....
> 
> STRUCTURE OF _ISR_HANDLER
> =========================
> For the MC68360 target, the following Preprocessor options are
> defined:
> 
> M68K_COLDFIRE_ARCH=0
> CPU_HAS_SOFTWARE_INTERRUPT_STACK=1
> M68K_HAS_PREINDEXING=1
> M68K_HAS_SEPARATE_STACKS=0
> M68K_HAS_VBR=1
> 
> The function "_ISR_Handler" in exec/score/cpu/m68k/cpu_asm.S performs
> the following basic steps:
> 
> A) Increment _Thread_Dispatch_disable_level
> 
> B) disable all interrupts
> 
> C) If _ISR_Nest_level==0: switch from task stack to interrupt stack
> 
> D) Increment _ISR_Nest_level
> 
> E) reenable higher interrupts
> 
> F) call user interrupt handler
> 
> G) disable all interrupts
> 
> H) Decrement _ISR_Nest_level
> 
> I) If _ISR_Nest_level==0: switch back from int stack to task stack
> 
> J) reenable higher interrupts
> 
> K) Decrement _Thread_Dispatch_disable_level
> 
> L) If _Thread_Dispatch_disable_level==0 and Context switch needed:
> switch to new context (using _Thread_Dispatch)
> 
> M) return to interrupted code
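
Steps A) to M) can be sketched in C for a single, non-nested interrupt.
This is only a bookkeeping model under stated assumptions: interrupt
masking (steps B, E, G, J) is not modeled, the stack switch is reduced
to a flag, and the variables are hypothetical stand-ins for the RTEMS
globals, not the real ones.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the globals named in the list above. */
static int  thread_dispatch_disable_level;
static int  isr_nest_level;
static bool on_interrupt_stack;
static bool context_switch_needed;
static int  dispatch_count;   /* how often step L switched context */

typedef void (*user_isr_t)(void);

static void isr_handler_model(user_isr_t user_isr)
{
    thread_dispatch_disable_level++;      /* A */
                                          /* B: mask interrupts (not modeled) */
    if (isr_nest_level == 0)              /* C */
        on_interrupt_stack = true;
    isr_nest_level++;                     /* D */
                                          /* E: unmask higher levels (not modeled) */
    user_isr();                           /* F */
                                          /* G: mask interrupts (not modeled) */
    isr_nest_level--;                     /* H */
    if (isr_nest_level == 0)              /* I */
        on_interrupt_stack = false;
                                          /* J: unmask higher levels (not modeled) */
    thread_dispatch_disable_level--;      /* K */
    if (thread_dispatch_disable_level == 0 && context_switch_needed) {
        context_switch_needed = false;    /* L */
        dispatch_count++;
    }
                                          /* M: return to interrupted code */
}

/* Two example user handlers: one wakes a task, one does not. */
static void wake_task_isr(void) { context_switch_needed = true; }
static void quiet_isr(void)     { }
```

Calling isr_handler_model(quiet_isr) leaves dispatch_count unchanged,
while isr_handler_model(wake_task_isr) raises it by one. Both leave the
two counters at zero afterwards.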
> 
> ASSUMED BUG SEQUENCE
> ====================
> 
> I assume that the following events may lose the Level 4 SCC1
> interrupt:
> 
> 1) An SCC1 IRQ4 occurs, the CPU performs a vector fetch, the CPM
> Interrupt controller supplies the corresponding vector and sets the
> SCC1-In-Service-Bit
> 
> 2) The CPU enters _ISR_Handler for Level 4/SCC1 Interrupt.
> 
> 3) Before any real code gets executed, the PIT times out, issuing a
> Level 6 Interrupt, so the CPU stores its basic context on the current
> (task) stack and reenters _ISR_Handler for Level 6/PIT. Please note
> that _ISR_Nest_level and _Thread_Dispatch_disable_level have not yet
> been incremented for the SCC1 Interrupt.
> 
> 4) The PIT Interrupt Handler executes and requests a context switch
> (wakes up some task or so).
> 
> 5) the general _ISR_Handler for Level 6/PIT then finds out, that it
> was the only instance of _ISR_Handler running (because
> _Thread_Dispatch_disable_level was 0) and therefore it performs a
> context switch according to step L). This will make the corresponding
> "woken" task to be executed, not the SCC1 interrupt handler.
> 
> So what do we have now:
> 
> - the SCC1 driver's interrupt handler has not yet been executed
> 
> - the physical SCC1 interrupt request signal is not applied to the
> CPU, because it is locked out due to the still-set "SCC1 In-Service"
> bit
> 
> - Any further CPM interrupts are blocked
> 
> - the CPU executes the woken task, not knowing that it should resume
> executing the SCC1 interrupt function
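
The suspected sequence, and the question of whether the first
instruction of the outer ISR is guaranteed to execute, can be compared
in a small C simulation. It is a sketch under strong assumptions: only
_Thread_Dispatch_disable_level and the in-service bit are modeled, the
PIT handler always requests a context switch, and every name is
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; not RTEMS code. */
struct model {
    int  dispatch_disable_level;
    bool context_switch_needed;
    bool scc1_in_service;               /* set by the vector fetch */
    bool scc1_handler_ran;
    bool dispatched_with_scc1_pending;  /* the suspected failure */
};

/* Steps K and L: returns true if a context switch happens here. */
static bool isr_exit_dispatches(struct model *m)
{
    m->dispatch_disable_level--;
    if (m->dispatch_disable_level == 0 && m->context_switch_needed) {
        m->context_switch_needed = false;
        return true;
    }
    return false;
}

/* Level 6 PIT interrupt, run to completion. */
static void pit_isr(struct model *m)
{
    m->dispatch_disable_level++;        /* step A */
    m->context_switch_needed = true;    /* step F: handler wakes a task */
    if (isr_exit_dispatches(m) && m->scc1_in_service)
        m->dispatched_with_scc1_pending = true;
}

/* Level 4 SCC1 interrupt. The flag selects whether the PIT is
 * accepted before or after step A of the SCC1 pass. */
static struct model run_scc1(bool pit_before_step_a)
{
    struct model m = {0};
    m.scc1_in_service = true;           /* event 1: vector fetch */
    if (pit_before_step_a) {
        pit_isr(&m);                    /* event 3 */
        if (m.dispatched_with_scc1_pending)
            return m;                   /* event 5: woken task runs instead */
    }
    m.dispatch_disable_level++;         /* step A for SCC1 */
    if (!pit_before_step_a)
        pit_isr(&m);                    /* PIT nests, sees level 1, no switch */
    m.scc1_handler_ran = true;          /* step F */
    m.scc1_in_service = false;          /* handler clears the bit */
    (void)isr_exit_dispatches(&m);      /* steps K and L for SCC1 */
    return m;
}
```

In this model, run_scc1(true) ends with the in-service bit still set and
the handler not run, while run_scc1(false) runs the handler and clears
the bit. That is the difference a guaranteed first instruction would
make.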
> 
> The SCC1 interrupt function might resume when RTEMS switches back to
> the suspended task, but this does not seem to happen.
> 
> NOTE:
> =====
> At the head of the _ISR_Handler code, a comment states:
> 
> /*
>  *  With this approach, lower priority interrupts may
>  *  execute twice if a higher priority interrupt is
>  *  acknowledged before _Thread_Dispatch_disable is
>  *  incremented and the higher priority interrupt
>  *  performs a context switch after executing. The lower
>  *  priority interrupt will execute (1) at the end of the
>  *  higher priority interrupt in the new context if
>  *  permitted by the new interrupt level mask, and (2) when
>  *  the original context regains the cpu.
>  */
> 
> The statement itself was very surprising to me. And from my point of
> view, case (1) is not true for hardware that negates the interrupt
> request as soon as the CPU has performed the vector fetch (which is
> absolutely legal according to the M68K architecture).
> 
> It may take a LONG time until case (2) occurs. In my situation I
> assume that this doesn't occur at all :-((
> 
> SOME QUESTIONS
> ==============
> 1) I don't understand, why the suspended context does not get
> executed again.
> 
> 2) I don't have a better solution for _ISR_Handler. Any ideas?
> 
> 3) I can't believe that I would be the first one to find that problem?
> 
> 4) I don't know, whether I am on the right track at all...
> 
> So here we are. I hope I could make my ideas clear in this mail. Any
> hints welcome....
> 
> Bye
>         Thomas.
> 
> --------------------------------------------
> IMD Ingenieurbuero fuer Microcomputertechnik
> Thomas Doerfler           Herbststrasse 8
> D-82178 Puchheim          Germany
> email:    Thomas.Doerfler at imd-systems.de
> PGP public key available at: http://www.imd-systems.de/pgp_key.htm

-- 
Joel Sherrill, Ph.D.             Director of Research & Development
joel at OARcorp.com                 On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
   Support Available             (256) 722-9985