SEVERE Bug in mc68360 _ISR_Handler???

Tue Jul 17 13:29:12 UTC 2001

Greetings from the UK.
I remember there was an interrupt problem with the '360 that we came across
when we were trying to test the TCP stack about 9 months ago. All I can
remember is that to fix it, the "tick" timer had to be at the same (or
lower?) interrupt priority as the network to prevent the very occasional
SCC lockup. At the time Joel and Eric couldn't see what was going wrong -
and as far as I know it was never tracked down - it was well beyond my
capability! I believe there is a comment in a "readme" file enclosed with
the current 68360 distribution explaining the workaround.
Am I off track here or could this be the same problem resurfacing?
Bob Wisdom

----- Original Message -----
From: "Thomas Doerfler" <Thomas.Doerfler at imd-systems.de>
To: <rtems-users at oarcorp.com>
Sent: Tuesday, July 17, 2001 12:47 PM
Subject: SEVERE Bug in mc68360 _ISR_Handler???

> Hello,
>
> I am writing to this list for help with the behaviour of the
> _ISR_Handler used for the MC68360 in rtems-4.5.0. I think there is a
> very small chance that lower-level interrupts get lost (or delayed
> forever) when a higher-level interrupt arrives at a critical point.
>
> This mail is going to be a bit long, but the issue is rather
> complicated as well.
>
> SYSTEM BACKGROUND
> ===================
> I have designed a system based on the MC68360 (and the gen68360 BSP),
> which makes heavy use of Ethernet and TCP/IP. Ethernet runs on the
> built-in SCC1; all CPM interrupt sources are handled on IRQ
> level 4. I use the PIT as the system clock timer, running on IRQ level 6
> (so it is higher than the CPM IRQ level).
>
> All in all the system works fine, but on very rare occasions the
> system's communication interfaces got stuck. Last week I succeeded in
> finding out why. I built a test environment and sent UDP packets to the
> system with almost all the ethernet bandwidth, adding a flood ping to
> the network load. In that environment it took between 1 and 4 hours
> until the system got stuck, and then I found that the "In-Service-
> Bit" of SCC1 in the CPM Interrupt Controller was set although the
> core did not execute the corresponding interrupt function.
>
> This bit gets set whenever the CPM Interrupt Controller sends the
> SCC1 vector number to the CPU and must be cleared in software. As
> long as this bit is set, no other CPM interrupts will be issued.
> NOTE: Even the SCC1 interrupt request will no longer be asserted
> until this bit gets cleared.
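A minimal sketch of the masking behaviour described above, written as a plain C model rather than real register access. The struct, the function names and the bit position used for SCC1 are assumptions for illustration, and the priority rule is simplified to "a source is only requested if it ranks above everything currently in service".

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit position of SCC1 in the model; check the MC68360 manual. */
#define SCC1_BIT (UINT32_C(1) << 30)

/* Hypothetical model of the CPM interrupt controller state. */
typedef struct {
    uint32_t pending;     /* sources currently requesting service       */
    uint32_t in_service;  /* sources whose vector the CPU has fetched   */
} cpm_model;

/* Index of the highest set bit, or -1 if the word is zero. */
static int highest_bit(uint32_t w)
{
    for (int i = 31; i >= 0; i--) {
        if (w & (UINT32_C(1) << i)) {
            return i;
        }
    }
    return -1;
}

/* The request line is driven only if a pending source outranks
 * every source that is still marked in service. */
bool cpm_request_asserted(const cpm_model *c)
{
    return highest_bit(c->pending) > highest_bit(c->in_service);
}

/* Vector fetch: the highest pending source moves to "in service". */
void cpm_vector_fetch(cpm_model *c)
{
    int b = highest_bit(c->pending);
    if (b < 0) {
        return;
    }
    c->pending &= ~(UINT32_C(1) << b);
    c->in_service |= UINT32_C(1) << b;
}

/* What the driver's handler must do in software before returning. */
void cpm_clear_in_service(cpm_model *c, uint32_t bit)
{
    c->in_service &= ~bit;
}
```

In this model, a handler that never runs never calls cpm_clear_in_service(), so cpm_request_asserted() stays false for SCC1 and everything ranked below it.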
>
> The code of the SCC1 interrupt handler
>
> "m360Enet_interrupt_handler (rtems_vector_number v)"
>
> is correct: whenever this handler gets called, the in-service bit is
> definitely cleared. So my assumption is that:
>
> 1) an SCC1 interrupt gets asserted,
>
> 2) then the CPU performs the corresponding vector fetch
>
> 3) but in rare conditions the corresponding handler will not get
> called
>
> By the way: when I lowered the PIT IRQ request level to 3, the system
> worked fine.
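One way to see why the lower PIT level helps: on the m68k a request is only accepted if its level is above the interrupt mask in the status register (level 7 excepted), and accepting an interrupt raises the mask to that interrupt's level. The helper names below are hypothetical; this is a sketch of that rule, not BSP code.

```c
#include <stdbool.h>

/* m68k acceptance rule: a request is taken only if its level is above
 * the current mask in the status register; level 7 is non-maskable. */
bool m68k_accepts(unsigned level, unsigned sr_mask)
{
    return level == 7 || level > sr_mask;
}

/* Right after the CPU accepts a CPM interrupt, the mask equals the CPM
 * level. The PIT can only cut in at that point if it still outranks it. */
bool pit_can_preempt_cpm_entry(unsigned pit_level, unsigned cpm_level)
{
    return m68k_accepts(pit_level, cpm_level);
}
```

With the levels from this report, pit_can_preempt_cpm_entry(6, 4) is true and pit_can_preempt_cpm_entry(3, 4) is false, so a level 3 PIT cannot interrupt the first instructions of the level 4 entry.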
>
> STRUCTURE OF _ISR_HANDLER
> =========================
> For the MC68360 target, the following Preprocessor options are
> defined:
>
> M68K_COLDFIRE_ARCH=0
> CPU_HAS_SOFTWARE_INTERRUPT_STACK=1
> M68K_HAS_PREINDEXING=1
> M68K_HAS_SEPARATE_STACKS=0
> M68K_HAS_VBR=1
>
> The function "_ISR_Handler" in exec/score/cpu/m68k/cpu_asm.S performs
> the following basic steps:
>
> A) Increment _Thread_Dispatch_disable_level
>
> B) disable all interrupts
>
> C) If _ISR_Nest_level==0: switch from task stack to interrupt stack
>
> D) Increment _ISR_Nest_level
>
> E) reenable higher interrupts
>
> F) call user interrupt handler
>
> G) disable all interrupts
>
> H) Decrement _ISR_Nest_level
>
> I) If _ISR_Nest_level==0: switch back from int stack to task stack
>
> J) reenable higher interrupts
>
> K) Decrement _Thread_Dispatch_disable_level
>
> L) If _Thread_Dispatch_disable_level==0 and Context switch needed:
> switch to new context (using _Thread_Dispatch)
>
> M) return to interrupted code
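Steps A) to M) can be sketched in C as follows. This is a model for reasoning only, not a translation of cpu_asm.S: the lower-case variables stand in for the RTEMS globals named above, interrupt masking is reduced to comments, and wake_task_isr and nesting_isr are hypothetical user handlers.

```c
#include <stdbool.h>

/* Stand-ins for the RTEMS globals named in the steps above. */
static unsigned thread_dispatch_disable_level;
static unsigned isr_nest_level;
static bool     context_switch_necessary;
static bool     on_interrupt_stack;
static unsigned dispatch_count;   /* how often step L) dispatched */

typedef void (*user_isr)(void);

/* Placeholder for _Thread_Dispatch: only records that it happened. */
static void thread_dispatch(void)
{
    context_switch_necessary = false;
    dispatch_count++;
}

void isr_handler_model(user_isr handler)
{
    thread_dispatch_disable_level++;                  /* A */
    /* B: disable all interrupts (not modelled) */
    if (isr_nest_level == 0) {
        on_interrupt_stack = true;                    /* C */
    }
    isr_nest_level++;                                 /* D */
    /* E: re-enable higher interrupts (not modelled) */
    handler();                                        /* F */
    /* G: disable all interrupts (not modelled) */
    isr_nest_level--;                                 /* H */
    if (isr_nest_level == 0) {
        on_interrupt_stack = false;                   /* I */
    }
    /* J: re-enable higher interrupts (not modelled) */
    thread_dispatch_disable_level--;                  /* K */
    if (thread_dispatch_disable_level == 0 && context_switch_necessary) {
        thread_dispatch();                            /* L */
    }
    /* M: return to interrupted code */
}

/* Hypothetical handler that wakes a task. */
void wake_task_isr(void)
{
    context_switch_necessary = true;
}

/* Hypothetical handler that is itself interrupted during step F). */
void nesting_isr(void)
{
    isr_handler_model(wake_task_isr);
}
```

When the nesting happens after step A), the inner instance sees a non-zero disable level at step L) and leaves the dispatch to the outermost instance.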
>
> ASSUMED BUG SEQUENCE
> ====================
>
> I assume that the following sequence of events may lose the Level 4
> SCC1 Interrupt:
>
> 1) An SCC1 IRQ4 occurs, the CPU performs a vector fetch, the CPM
> Interrupt controller supplies the corresponding vector and sets the
> SCC1-In-Service-Bit
>
> 2) The CPU enters _ISR_Handler for Level 4/SCC1 Interrupt.
>
> 3) Before any real code gets executed, the PIT times out, issuing a
> Level 6 Interrupt, so the CPU stores its basic context on the current
> (task) stack and re-enters _ISR_Handler for Level 6/PIT. Please note
> that _ISR_Nest_level and _Thread_Dispatch_disable_level have not yet
> been incremented for the SCC1 Interrupt.
>
> 4) The PIT Interrupt Handler executes and requests a context switch
> (wakes up some task or so).
>
> 5) The general _ISR_Handler for Level 6/PIT then finds that it
> was the only instance of _ISR_Handler running (because
> _Thread_Dispatch_disable_level was 0) and therefore performs a
> context switch according to step L). This causes the "woken" task
> to be executed, not the SCC1 interrupt handler.
>
> So what do we have now:
>
> - the SCC1 driver's interrupt handler has not yet been executed
>
> - the physical SCC1 interrupt request signal is not applied to the
> CPU, because it is locked out due to the still-set "SCC1 In-Service"
> bit
>
> - Any further CPM interrupts are blocked
>
> - the CPU executes the woken task, not knowing that it should resume
> executing the SCC1 interrupt function
>
> The SCC1 interrupt function might resume when RTEMS switches back to
> the suspended task, but this does not seem to happen.
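The suspected sequence can be replayed in a small C model. It encodes the hypothesis from this mail, not verified behaviour of the real handler; sim_state, the pit_timing values and the task names 'A' and 'B' are all made up for the illustration.

```c
#include <stdbool.h>

/* When the level 6 PIT interrupt arrives relative to step A) of the
 * SCC1 instance of _ISR_Handler. */
typedef enum { PIT_BEFORE_STEP_A, PIT_AFTER_STEP_A } pit_timing;

typedef struct {
    unsigned dispatch_disable_level;
    bool     switch_needed;
    bool     scc1_in_service;    /* set by the vector fetch             */
    bool     scc1_entry_parked;  /* SCC1 entry left on task A's stack   */
    char     running_task;       /* 'A' was interrupted, 'B' is woken   */
} sim_state;

/* Step L): dispatch only if no other handler instance is counted. */
static void maybe_dispatch(sim_state *s)
{
    if (s->dispatch_disable_level == 0 && s->switch_needed) {
        s->running_task = 'B';
        s->switch_needed = false;
    }
}

static void pit_isr_handler(sim_state *s)
{
    s->dispatch_disable_level++;   /* A */
    s->switch_needed = true;       /* F: the tick wakes task B */
    s->dispatch_disable_level--;   /* K */
    maybe_dispatch(s);             /* L */
}

static void scc1_isr_handler(sim_state *s, bool pit_arrives_inside)
{
    s->dispatch_disable_level++;   /* A */
    if (pit_arrives_inside) {
        pit_isr_handler(s);        /* nested: level stays above 0 */
    }
    s->scc1_in_service = false;    /* F: driver clears the bit */
    s->dispatch_disable_level--;   /* K */
    maybe_dispatch(s);             /* L */
}

void simulate_scc1_interrupt(sim_state *s, pit_timing t)
{
    s->scc1_in_service = true;     /* vector fetch by the CPU */
    if (t == PIT_BEFORE_STEP_A) {
        pit_isr_handler(s);
        if (s->running_task == 'B') {
            s->scc1_entry_parked = true;   /* handler not run yet */
            return;
        }
    }
    scc1_isr_handler(s, t == PIT_AFTER_STEP_A);
}

/* The parked entry only continues once task A is scheduled again. */
void resume_task_a(sim_state *s)
{
    if (!s->scc1_entry_parked) {
        return;
    }
    s->running_task = 'A';
    s->scc1_entry_parked = false;
    scc1_isr_handler(s, false);
}
```

In the PIT_BEFORE_STEP_A case the model ends with the in-service bit still set and task B running; the bit is only cleared by a later resume_task_a(), which matches the lockup if task A is never scheduled again.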
>
> NOTE:
> =====
> At the head of the _ISR_Handler code, a comment states:
>
> /*
>  *  With this approach, lower priority interrupts may
>  *  execute twice if a higher priority interrupt is
>  *  acknowledged before _Thread_Dispatch_disable is
>  *  incremented and the higher priority interrupt
>  *  performs a context switch after executing. The lower
>  *  priority interrupt will execute (1) at the end of the
>  *  higher priority interrupt in the new context if
>  *  permitted by the new interrupt level mask, and (2) when
>  *  the original context regains the cpu.
>  */
>
> The statement itself was very surprising to me. And from my point of
> view, case (1) is not true for hardware that negates the interrupt
> request as soon as the CPU has performed the vector fetch (which is
> absolutely legal according to the M68K architecture).
>
> It may take a LONG time until case (2) occurs. In my situation I
> assume that this doesn't occur at all :-((
>
> SOME QUESTIONS
> ==============
> 1) I don't understand why the suspended context does not get
> executed again.
>
> 2) I don't have a better solution for _ISR_Handler. Any ideas?
>
> 3) I can't believe that I would be the first one to find that problem?
>
> 4) I don't know whether I am on the right track at all...
>
>
> So here we are. I hope I could make my ideas clear in this mail. Any
> hints welcome....
>
> Bye
> Thomas.
>
> --------------------------------------------
> IMD Ingenieurbuero fuer Microcomputertechnik
> Thomas Doerfler           Herbststrasse 8
> D-82178 Puchheim          Germany
> email:    Thomas.Doerfler at imd-systems.de
> PGP public key available at: http://www.imd-systems.de/pgp_key.htm
>