[GSoC 2014] Paravirtualization layer in RTEMS

Mon Mar 10 22:48:32 UTC 2014

On 03/10/2014 04:24 PM, Youren Shen wrote:
> What confuses me is the relation
> between pok_arch_event_register and pok_meta_handler_init. It seems you
> divided the IRQ vectors into two parts in pok_arch_event_register: less
> than 32 or more than 32. It looks like you have already designed some
> hypercall interface (just like pok_irq_prologue_0 for the clock?). But
> what is the meaning of pok_meta_handler_init? I still can't understand
> it very clearly. Could you give me an outline of the IRQ handling in POK
> that invokes these two functions?
> 
> If you can provide me with a brief overview of how you approach these
> issues and a brief description of your design, it would be really
> helpful to me.

There are 16 interrupt lines (0-15) for hardware interrupts on x86.
If a line is triggered, the PIC sends an interrupt to the CPU.
If interrupts are enabled, the CPU asks for the interrupt number and
looks this number up in the Interrupt Descriptor Table (IDT).
The IDT for HW interrupts looks like this:
32 | clock  ISR (Interrupt Service Routine)
33 | keyboard ISR
34 | ...
...
47 | ...

Intel reserved the first 32 vectors (0-31) for CPU exceptions, so the
hardware interrupts start at 32 and go to 47. Vector 32 corresponds to
IRQ line 0, which is the clock interrupt. Vector 33 is IRQ line 1, the
keyboard (if I can trust my memory).
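
As a rough sketch of this numbering (the macro and function names are
made up for illustration and are not taken from the POK sources):

```c
#include <stdbool.h>

/* Hypothetical names; only the numbers 32, 16 and 47 come from the text. */
#define HW_IRQ_BASE_VECTOR 32u  /* first vector after the 32 reserved ones */
#define HW_IRQ_LINES       16u  /* IRQ lines 0-15 */

/* True if the vector lies in the hardware interrupt range [32, 47]. */
static bool is_hw_irq_vector(unsigned vector)
{
    return vector >= HW_IRQ_BASE_VECTOR &&
           vector < HW_IRQ_BASE_VECTOR + HW_IRQ_LINES;
}

/* IRQ line 0 (clock) maps to vector 32, line 1 (keyboard) to 33, ... */
static unsigned irq_line_to_vector(unsigned line)
{
    return HW_IRQ_BASE_VECTOR + line;
}

/* Inverse mapping; only meaningful if is_hw_irq_vector(vector) holds. */
static unsigned vector_to_irq_line(unsigned vector)
{
    return vector - HW_IRQ_BASE_VECTOR;
}
```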

Now the CPU never tells you which IRQ line fired. Therefore, we register
one prologue function per line with the IDT. Each prologue knows its
line number, pushes it on the stack and calls a general ISR handler.
This general ISR handler checks the line number and calls the handler
registered for this line. Therefore the general ISR handler maintains
its own IDT, a software IDT.
This enables us to register more than one ISR handler function for one
interrupt line. For example, one handler deals with the clock tick in
the kernel and a second one tells the guest system (RTEMS) running in a
partition that a clock tick occurred.
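
The idea can be sketched in plain C as follows. This is only an
illustration under assumptions: the real prologues are assembly stubs
that push the line number on the stack, whereas here it is passed as a
function argument, and all names are hypothetical.

```c
#include <stddef.h>

#define HW_IRQ_LINES 16
#define MAX_HANDLERS 2   /* assumed: e.g. kernel + one guest */

typedef void (*irq_handler_t)(unsigned line);

/* Software IDT: several handler slots per hardware IRQ line. */
static irq_handler_t soft_idt[HW_IRQ_LINES][MAX_HANDLERS];

/* Register a handler in a slot; returns 0 on success, -1 on bad input. */
static int register_handler(unsigned line, unsigned slot, irq_handler_t h)
{
    if (line >= HW_IRQ_LINES || slot >= MAX_HANDLERS || h == NULL)
        return -1;
    soft_idt[line][slot] = h;
    return 0;
}

/* General handler: looks up the line and calls every registered handler. */
static void general_isr_handler(unsigned line)
{
    if (line >= HW_IRQ_LINES)
        return;
    for (unsigned slot = 0; slot < MAX_HANDLERS; slot++) {
        if (soft_idt[line][slot] != NULL)
            soft_idt[line][slot](line);
    }
}

/* One prologue per line; each one "knows" its own line number. */
static void irq_prologue_0(void) { general_isr_handler(0); }
static void irq_prologue_1(void) { general_isr_handler(1); }

/* Two example handlers for the clock line. */
static int kernel_ticks, guest_ticks;
static void kernel_clock_handler(unsigned line) { (void)line; kernel_ticks++; }
static void guest_clock_handler(unsigned line)  { (void)line; guest_ticks++; }
```

With both clock handlers registered on line 0, calling irq_prologue_0()
runs both of them, while irq_prologue_1() runs neither.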

But we don't want the POK kernel to wait until the partition has handled
the interrupt. So we acknowledge the interrupt with the PIC and then
send the partition the soft-interrupt. Here we go from kernel space to
user space, and this is the point where I left off.

To be more specific in terms of source code:
'pok_arch_event_register' is called if you want to register any kind of
interrupt with the IDT. If the vector happens to be in the hardware
interrupt range [32-47], it registers a prologue handler with the IDT.

All pok_irq_prologue functions call _ISR_Handler, which in turn calls
_C_isr_handler. Together they form the general handler: first the asm
part, then the C part.
The _C_isr_handler checks if the kernel has registered a handler for
this IRQ number and calls it.
Then it checks three things for the current partition: that it has
interrupts enabled, that it has a handler registered and that it isn't
already servicing an earlier interrupt.
If all three hold, the registered handler is invoked.
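
A simplified sketch of that decision logic, using flat tables and
hypothetical names rather than the actual POK data structures. Two
things are my assumptions here: that the kernel occupies the last slot,
and that the "waiting" flag is cleared on delivery and set again when
the partition acknowledges.

```c
#include <stdbool.h>
#include <stddef.h>

#define HW_IRQ_LINES   16
#define NUM_PARTITIONS 2               /* assumed configuration */
#define KERNEL_SLOT    NUM_PARTITIONS  /* assumed: kernel uses the last slot */

typedef void (*irq_handler_t)(unsigned line);

static irq_handler_t handlers[HW_IRQ_LINES][NUM_PARTITIONS + 1];
static bool waiting[HW_IRQ_LINES][NUM_PARTITIONS + 1]; /* true = ready */
static bool partition_irq_enabled[NUM_PARTITIONS];

static void c_isr_handler_sketch(unsigned line, unsigned current_partition)
{
    if (line >= HW_IRQ_LINES || current_partition >= NUM_PARTITIONS)
        return;

    /* 1. The kernel's own handler, if any, always runs. */
    if (handlers[line][KERNEL_SLOT] != NULL)
        handlers[line][KERNEL_SLOT](line);

    /* 2. The partition handler runs only if all three checks pass. */
    if (partition_irq_enabled[current_partition] &&
        handlers[line][current_partition] != NULL &&
        waiting[line][current_partition]) {
        waiting[line][current_partition] = false; /* now servicing */
        handlers[line][current_partition](line);
    }
}

/* Assumed acknowledge step: the partition is ready for the next IRQ. */
static void partition_ack(unsigned line, unsigned partition)
{
    if (line < HW_IRQ_LINES && partition < NUM_PARTITIONS)
        waiting[line][partition] = true;
}

/* Counting handlers, used only to observe the behaviour. */
static int kernel_calls, guest_calls;
static void count_kernel(unsigned line) { (void)line; kernel_calls++; }
static void count_guest(unsigned line)  { (void)line; guest_calls++; }
```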

When I talk about a 'registered handler', I mean an entry in the
software IDT the kernel is maintaining.
The software IDT for hardware interrupts is a static table consisting of
16 entries of the type 'meta_handler'.
'meta_handler' is a struct consisting of a vector number and two tables
of the size "kernel + configured number of partitions".
The first table holds function pointers to the partition's/kernel's
handler function, the what-to-do-if-IRQ-occurs function.
The second table flags if the partition is ready for an interrupt.

So for each interrupt entry in our software-IDT, we get a 'meta_handler'
encapsulating a line number, a table with up to one handler per
partition and a table that tells if the partition is ready for
interrupts.

Next to this software IDT, there is a table 'partition_irq_enabled',
which has one flag per partition and is the software replacement for
CLI/STI.

'pok_meta_handler_init' sets up the software-IDT and fills all fields
with start values (magic unused vector number, no handler present, but
waiting).
'pok_partition_irq_init' sets up the partition_irq_enabled table with
the value for disabled (0), so initially no partition gets interrupts
until it asks for them.
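
To make the layout concrete, here is a minimal sketch of such a table
and its initialisation. Field names, the partition count and the magic
value 0xFF are assumptions for illustration, so check the POK sources
for the real definitions.

```c
#include <stdbool.h>
#include <stddef.h>

#define HW_IRQ_LINES   16
#define NUM_PARTITIONS 2                     /* assumed configuration */
#define NUM_SLOTS      (NUM_PARTITIONS + 1)  /* kernel + partitions */
#define UNUSED_VECTOR  0xFFu                 /* assumed magic value */

typedef void (*irq_handler_t)(unsigned vector);

/* One entry of the software IDT (hypothetical field names). */
struct meta_handler_sketch {
    unsigned      vector;              /* vector number of this entry */
    irq_handler_t handler[NUM_SLOTS];  /* one handler per kernel/partition */
    bool          waiting[NUM_SLOTS];  /* ready for the next interrupt? */
};

static struct meta_handler_sketch soft_idt[HW_IRQ_LINES];

/* Software replacement for CLI/STI: one flag per partition. */
static bool partition_irq_enabled[NUM_PARTITIONS];

/* Start values: unused vector, no handler present, but waiting. */
static void meta_handler_init_sketch(void)
{
    for (size_t i = 0; i < HW_IRQ_LINES; i++) {
        soft_idt[i].vector = UNUSED_VECTOR;
        for (size_t s = 0; s < NUM_SLOTS; s++) {
            soft_idt[i].handler[s] = NULL;
            soft_idt[i].waiting[s] = true;
        }
    }
}

/* Initially every partition has interrupts disabled. */
static void partition_irq_init_sketch(void)
{
    for (size_t p = 0; p < NUM_PARTITIONS; p++)
        partition_irq_enabled[p] = false;
}
```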

How can partitions talk to the software-IDT?
POK consists of kernel and partitions. Each partition has a libpok part.
Libpok is the library that enables the partition to talk to other
partitions and the kernel.
An RTEMS guest has a POK partition part (libpart) and the RTEMS part.
Libpart implements the communication with the POK kernel. So when RTEMS
calls some virtualization layer function, the implementation present in
libpart will emit a syscall to the POK kernel. The syscall either passes
along the IRQ callback function, or it simply asks the kernel to
unregister a handler or to enable, disable or acknowledge interrupts.
Have a look at the virtualization layer functions in RTEMS's virtualpok
BSP and at examples/rtems-guest/ in POK.
The syscall handling then forwards the request to, e.g.,
'pok_bsp_irq_register_hw'.
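
The kernel side of such a request could be sketched like this. The
request codes, the function name and its signature are purely
hypothetical and do not reflect the actual POK syscall interface; the
sketch only shows the kinds of requests named above.

```c
#include <stdbool.h>
#include <stddef.h>

#define HW_IRQ_LINES   16
#define NUM_PARTITIONS 2   /* assumed configuration */

typedef void (*irq_handler_t)(unsigned line);

/* Hypothetical request codes a partition could pass with the syscall. */
enum irq_request {
    IRQ_REQ_REGISTER,
    IRQ_REQ_UNREGISTER,
    IRQ_REQ_ENABLE,
    IRQ_REQ_DISABLE,
    IRQ_REQ_ACK
};

static irq_handler_t part_handler[HW_IRQ_LINES][NUM_PARTITIONS];
static bool part_waiting[HW_IRQ_LINES][NUM_PARTITIONS];
static bool partition_irq_enabled[NUM_PARTITIONS];

/* Returns 0 on success, -1 on an invalid request. */
static int irq_syscall_sketch(unsigned partition, enum irq_request req,
                              unsigned line, irq_handler_t handler)
{
    if (partition >= NUM_PARTITIONS || line >= HW_IRQ_LINES)
        return -1;

    switch (req) {
    case IRQ_REQ_REGISTER:      /* pass along the IRQ callback */
        if (handler == NULL)
            return -1;
        part_handler[line][partition] = handler;
        part_waiting[line][partition] = true;
        return 0;
    case IRQ_REQ_UNREGISTER:
        part_handler[line][partition] = NULL;
        return 0;
    case IRQ_REQ_ENABLE:        /* software STI */
        partition_irq_enabled[partition] = true;
        return 0;
    case IRQ_REQ_DISABLE:       /* software CLI */
        partition_irq_enabled[partition] = false;
        return 0;
    case IRQ_REQ_ACK:           /* ready for the next interrupt */
        part_waiting[line][partition] = true;
        return 0;
    }
    return -1;
}

/* Placeholder guest callback for the usage example. */
static void dummy_guest_handler(unsigned line) { (void)line; }
```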

I hope that fits into your definition of 'briefly explain'. But it
should give you enough background and explanation to follow the code and
understand the design.

The really nasty bit happens in the '_C_isr_handler' function in
x86-qemu/bsp.c.
This is explained in my RTLWS'13 paper.
In short: Each IRQ entry builds a stack frame, which saves the register
values on the stack when the interrupt occurs, so we can continue
execution at the same point afterwards.
To handle the IRQ in user space and to return to the point of
interruption, the user space handler needs this data. So the interrupt
frame is copied from the kernel stack to the user stack. Then 'iret'
makes the kernel-space to user-space transition. And that's where we get
a GeneralProtectionFault.
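
The copy step alone can be illustrated as below. The frame layout and
all names are assumptions for illustration, not the POK definitions,
and both stacks are modelled as ordinary memory. It does not perform
the 'iret' transition and does not reproduce the fault.

```c
#include <stdint.h>
#include <string.h>

/* Assumed frame layout: saved general registers, then the values the
 * CPU pushes on interrupt entry. The real POK frame may differ. */
struct interrupt_frame_sketch {
    uint32_t edi, esi, ebp, ebx, edx, ecx, eax;
    uint32_t eip, cs, eflags;
};

/* Copy the saved frame below the current user stack pointer and return
 * the new stack pointer. x86 stacks grow towards lower addresses. */
static uint32_t *push_frame_on_user_stack(
    uint32_t *user_sp, const struct interrupt_frame_sketch *kframe)
{
    size_t words = sizeof *kframe / sizeof(uint32_t);

    user_sp -= words;                        /* make room for the frame */
    memcpy(user_sp, kframe, sizeof *kframe); /* kernel stack -> user stack */
    return user_sp;
}
```

The user space handler can then read the interrupted eip and the saved
registers from its own stack when it wants to resume.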

Also have a look at the interrupt_middleman function in
rtems-guest/hello.c. This is the user space recovery code for the stack
frame.

Cheers,
Philipp

p.s.
This page has a couple of good tutorials for low level OS programming:
http://www.brokenthorn.com/Resources/OSDev15.html