[PATCH 2/4] score: SMP initialization and shutdown changes

Wed Feb 26 08:43:33 UTC 2014

First some words to this patch set.  It is a gradual improvement of the 
existing SMP low-level initialization and shutdown procedure.

The existing solution was able to start an SMP system.  Only atomic 
reads/writes were used.  It is impossible to implement a controlled shutdown 
only with atomic reads/writes.  See also "The Art of Multiprocessor 
Programming", chapter 5, "The Relative Power of Primitive Synchronization 
Operations".

This patch introduces a statically initialized global SMP lock to manage the 
per-CPU state changes (_Per_CPU_State_lock).  It changes also the states and 
documents the state machine.

It is now also possible to use fatal error handlers to control the shutdown on 
the lowest-level.

Without this patch on NGMP the situation is like this:

***  SMP03 TEST ***
   CPU 3 running task Init
   CPU 2 running task TA1
   CPU 1 running task TA2
   CPU 0 running task TA3
   CPU 0 running task TA4
*** END OF TEST SMP03 ***
CPU 0:  Unknown watchpoint hit
         0x0000b1c4: 30800000  ba,a  0x0000B1C4  <rtems_smp_process_interrupt+172>
CPU 1:  Unknown watchpoint hit
         0x0000b1c4: 30800000  ba,a  0x0000B1C4  <rtems_smp_process_interrupt+172>
CPU 2:  Unknown watchpoint hit
         0x0000b1c4: 30800000  ba,a  0x0000B1C4  <rtems_smp_process_interrupt+172>
CPU 3:  Program exited normally.

Now I have this:

***  SMP03 TEST ***
   CPU 3 running task Init
   CPU 2 running task TA1
   CPU 1 running task TA2
   CPU 0 running task TA3
   CPU 0 running task TA4
*** END OF TEST SMP03 ***

   CPU 0:  Program exited normally.
   CPU 1:  Power down mode
   CPU 2:  Power down mode
   CPU 3:  Power down mode

I think we should try to discuss problems with a specific patch in one thread 
and general problems in a separate thread in the future.

On 2014-02-25 00:02, Chris Johns wrote:
> On 21/02/2014 6:48 pm, Sebastian Huber wrote:
>> On 2014-02-21 01:31, Chris Johns wrote:
>>> On 21/02/2014 3:31 am, Sebastian Huber wrote:
>>>> Hello Chris,
>>>>
>>>> On 2014-02-20 03:50, Chris Johns wrote:
>>>>> On 20/02/2014 12:42 am, Sebastian Huber wrote:
>>>>>> +/**
>>>>>> + * @brief State of a processor.
>>>>>> + *
>>>>>> + * @dot
>>>>>> + * digraph states {
>>>>>> + *   bi [label="PER_CPU_STATE_BEFORE_INITIALIZATION"];
>>>>>> + *   rsm [label="PER_CPU_STATE_READY_TO_START_MULTITASKING"];
>>>>>> + *   sm [label="PER_CPU_STATE_START_MULTITASKING"];
>>>>>> + *   ds [label="PER_CPU_STATE_DO_SHUTDOWN"];
>>>>>> + *   u [label="PER_CPU_STATE_UP"];
>>>>>> + *   s [label="PER_CPU_STATE_SHUTDOWN"];
>>>>>> + *   bi -> rsm [label="secondary processor\ncompleted
>>>>>> initialization"];
>>>>>> + *   bi -> u [label="main processor\nstarts multitasking"];
>>>>>> + *   rsm -> sm [label="main processor\ncompleted initialization"];
>>>>>> + *   rsm -> ds [label="a fatal error occurred"];
>>>>>> + *   ds -> s [label="do shutdown\nstate observed"];
>>>>>> + *   sm -> u [label="secondary processor\nstarts multitasking"];
>>>>>> + *   u -> s [label="shutdown initiated"];
>>>>>> + * }
>>>>>> + * @enddot
>>>>>
>>>>> I do not see sm to ds if the main core goes to s after going from bi
>>>>> to u ?
>>>>>
>>>>> I also see u to s and not to ds; why as the state is called "do
>>>>> shutdown" ?
>>>>> Should this state be PER_CPU_STATE_SHUTTING_DOWN ?
>>>>
>>>> I had to change the procedure considerably since I noticed unsolvable
>>>> problems with the previous approach based only on atomic read/write
>>>> operations. Attached is the new state diagram.
>>>>
>>>
>>> I do not follow this picture. I am assume there is one state machine
>>> per CPU
>>> and so I do not see where the state "starting multitasking" is. I see
>>> a state
>>> called "request start multitasking" then "up" and I am wondering what
>>> the state
>>> is after the request has happened and before up.
>>
>> The application defines a maximum count of processors (NP).  The
>> configuration will then define the _Per_CPU_Information[NP] table.  Each
>> table entry has one per-CPU state entry.
>
> Does the BSP define the max number of possible cores the hardware can support ?

Yes, via the return value of _CPU_SMP_Initialize().  This patch changes nothing 
in this area.

> I see later in this email you talk about AP being available processors and
> defined by the BSP. I think I am getting confused with the term "maximum count
> of processors". Does it mean the max count the application may require or the
> max count supported by the hardware ?

It is a mix of both.  The application configures a desired maximum count and 
the BSP may reduce it to match the actual hardware.

>
> There is another issue here I am struggling with and while specifically not on
> topic for this patch it relates to the configuration of processor count so
> please indulge me. The application having to define the number of cores means
> there is no default position and that means we have this ...
>
> http://git.rtems.org/rtems/tree/testsuites/smptests/smp01/system.h#n28
>
> which is based on this ...
>
> http://git.rtems.org/rtems/tree/cpukit/sapi/include/confdefs.h#n205
>
> A user builds RTEMS with --enable-smp and then has another gate in
> CONFIGURE_SMP_APPLICATION. This does not make sense. Why build for SMP and then
> never use it yet incur lots of overhead ? If I build RTEMS with SMP enabled for
> a BSP that supports SMP I would expect it to work with SMP and not default to
> some half way point. For example if you build the Zync A9 qemu BSP with SMP
> enabled and build all the tests then run them on a patched qemu with SMP
> support you happily see about 10 maybe 15 failures for over 480 tests however
> these result are meaningless because CONFIGURE_SMP_APPLICATION is only defined
> for a few SMP specific tests and incorrectly for the Zync because it only has 2
> cores and not 4. I feel we should not go to a release with the tests in this
> state even with clear documentation. If the tests fail with SMP we need to have
> the test results show this and document what the failures are so users are left
> with a clear picture of the state of SMP support. For example with the Zynq A9
> qemu example none of the block (libbd) tests should pass as libbd uses disable
> pre-emption and this should generate an error if used.
>
> The CONFIGURE_SMP_APPLICATION needs to go however there is no default core
> count defined by the BSP that can be used to allow this. IMO if the application
> does not define the core count it needs, ie less than the max supported by the
> hardware, the max possible should be used.

I didn't invent this CONFIGURE_SMP_APPLICATION.  At the moment it is useful 
since we lack important features like thread deletion.  With the introduction 
of clustered/partitioned scheduling the configuration will completely change. 
So I would like to postpone this discussion a couple of weeks.

>
>>  Since this table is in the BSS
>> section, all states start with PER_CPU_STATE_INITIAL.
>
> Who is clearing the BSS and how is this managed ? There seems to be some
> implicit requirements here. The requirement being "RTEMS requires all
> processors to run in an SMP environment must be started externally to RTEMS".
> Is this requirement what we really want ?

This is not the case.  The low-level start-up is highly BSP specific.  The 
leon3 BSP for example starts secondary processors on its own.  The QorIQ relies 
on U-Boot.  Our Altera Cyclone-V (ARM Cortex-A9 MPCore based) will also start 
secondary processors on its own.

>
> On the ARM (well the zynq I have used) starting processors is easy and the code
> needed is tiny and could be made available in the BSP shared area, while I
> understand on the PPC it is complex. I suppose the question is adding this
> support to RTEMS what we want to have and worth it verses adding this code
> gives us complete control over initialisation and makes the boot monitor simpler.
>
> Given I currently only use the ARM which is easy and I will not ship uboot due
> to licensing reason I favour adding support to RTEMS. :)
>
>>
>> The boot processor calls boot_card() and all other processors do what
>> they want, but they must wait until the boot processor gives the go
>> (explained later).
>>
>> The boot processor calls eventually:
>>
>>    /**
>>     * @brief Performs CPU specific SMP initialization in the context of
>> the boot
>>     * processor.
>>     *
>>     * This function is invoked on the boot processor by RTEMS during
>>     * initialization.  All interrupt stacks are allocated at this point
>> in case
>>     * the CPU port allocates the interrupt stacks.
>>     *
>>     * The CPU port should start secondary processors now.
>
> What does this mean ? I am not sure if you mean the boot monitor does this or
> this is handled in RTEMS in the BSP or related CPU code.
>
>>     *
>>     * @param[in] configured_cpu_count The count of processors requested
>> by the
>>     * application configuration.
>>     *
>>     * @return The count of processors available for the application in
>> the system.
>>     * This value is less than or equal to the configured count of
>> processors.
>>     */
>>    uint32_t _CPU_SMP_Initialize( uint32_t configured_cpu_count );
>>
>> So here has the CPU port (or BSP in most cases) the chance to reduce the
>> configured count of processors to the actually available (AP).  We have
>> AP <= NP.  In case this function returns "you have three processors",
>> then these three processors MUST start properly or terminate the
>> system.  If some processors are not always available or otherwise
>> unreliable this must be dealt with in _CPU_SMP_Initialize().
>
> This is confusing me. The application controls the max number of cores (NP) a
> system can have and the BSP (or CPU port which knows the max that can exist)
> can configure that down to the actual number. Does this mean I could define 32
> cores at the application and Zynq BSP would say, sorry only 2 here so this is
> what you get ?

Yes.  Its done in some tests, e.g.

http://git.rtems.org/rtems/tree/testsuites/smptests/smplock01/init.c#n28

Does this waste RAM ?

Yes, but its not much, e.g. the _Per_CPU_Information table is a bit bigger than 
necessary.

>
> What about an application defining 2 on a 4 core system ?

It gets two.

>
>>
>> System termination can happen at anytime on any processor.  So you can
>> change from every state into PER_CPU_STATE_SHUTDOWN.  This ability is
>> the core of this change set.  In the previous implementation this was
>> not guaranteed.
>>
>
> Did the state diagram show this ? I am sorry if I missed this.

Yes, you have arrows from all other states to PER_CPU_STATE_SHUTDOWN.

>
>> Now lets look at the normal start-up (no shutdown).  The next state
>> after PER_CPU_STATE_INITIAL is PER_CPU_STATE_READY_TO_START_MULTITASKING.
>>
>>    /**
>>     * @brief Processor is ready to start multitasking.
>>     *
>>     * The secondary processor performed its basic initialization and is
>> ready to
>>     * receive inter-processor interrupts.  Interrupt delivery must be
>> disabled
>>     * in this state, but requested inter-processor interrupts must be
>> recorded
>>     * and must be delivered once the secondary processor enables
>> interrupts for
>>     * the first time.  The boot processor will wait for all secondary
>> processors
>>     * to change into this state.  In case a secondary processor does not
>> reach
>>     * this state the system will not start.  The secondary processors
>> wait now
>>     * for a change into the PER_CPU_STATE_REQUEST_START_MULTITASKING
>> state set
>>     * by the boot processor once all secondary processors reached the
>>     * PER_CPU_STATE_READY_TO_START_MULTITASKING state.
>>     */
>>    PER_CPU_STATE_READY_TO_START_MULTITASKING,
>>
>> The key point for a per-CPU state and not a global state is that every
>> processor must perform some initialization steps to set up the "I can
>> receive inter-processor interrupts (IPI)" state.  Before IPIs are
>> possible the only why are spin variables (e.g. this per-CPU state
>> variables).
>>
>> This PER_CPU_STATE_READY_TO_START_MULTITASKING is a synchronization
>> barrier.
>>
>> The boot processor will then set the state to
>> PER_CPU_STATE_REQUEST_START_MULTITASKING on all processors configured
>> and available (AP).  Once the secondary processor observes this state
>> change (it spins on its state variable), it will go into the
>> PER_CPU_STATE_UP state and perform a context switch to the first thread.
>>
>
> If this is about getting to a suitable IPI state on each core then why not have
> this in the states and functions rather then the global state type names of
> "start multitasking" etc.

I don't think we should be too specific about the IPI here.  The basic 
statement is that a processor is ready to start multitasking.

>
> I see the issue of needing a clear path from no IPI to having IPI however I am
> still wondering why the need to have the synch point and all cores to be there
> before moving on. A new core in the system should be able to see the system
> state and IPIs are active and then enable its support then add itself to the
> system.

Then you have to check each time which method you need to send a message to 
another processor.  I don't see a benefit here.

>
> If there is a global state variable with suitable spinning locks and atomics
> that direct the per cpu paths taken I fail to see why we need this sync
> barrier. The first core to a given global state makes the transition and the
> other cores need to wait or move to a different state. This assumes the BSS
> init issue is resolved.

I fail to see why this barrier is a problem.  The processors in an SMP system 
MUST work reliable if now or later it doesn't matter.  Do you really want to 
work an a system that works under this condition: "Sometimes scheduling actions 
take place, but you cannot assume that a thread will execute as planned.".

>
>>>
>>>>>
>>>>> I do not see why we have main and secondary processors ?
>>>>
>>>> I changed "main processor" into "boot processor" to highlight that this
>>>> is only the case during system boot.
>>>>
>>>
>>> Is there any code checking the processor number ? I see it in the
>>>
>>>>> This is symmetric
>>>>> multiprocessing which means each core is the same therefore capable of
>>>>> completing any required task. I understand there are paths which
>>>>> need to
>>>>> complete so if we have states for these phases as gates then any
>>>>> processor that
>>>>> arrives should be able to enter the gate (spinning lock for those that
>>>>> need to
>>>>> wait) and complete the work. This means a degraded state can exist and
>>>>> things
>>>>> at least start. The application would need to detect and manage the
>>>>> degraded
>>>>> state and so RTEMS should not be concerned with this condition other
>>>>> than doing
>>>>> its best to run where possible.
>>>>
>>>> In case the boot procedure of the system is unstable then it makes no
>>>> sense to run an application.
>>>>
>>>
>>> This depends on the application and its system requirements and how the
>>> hardware is constructed. A boot monitor can see cores are not present and
>>> decide not to start RTEMS or it can decide to start a single core.
>>> This relates
>>> to the system's requirement and can vary and has little to do with
>>> RTEMS. If a
>>> boot monitor passes control to RTEMS then it should run and only fail
>>> if it's
>>> integrity it not correct. RTEMS does should not move into the area of
>>> system
>>> requirements or constraints. Adding constraints like this in the
>>> operating
>>> system does not seem a good idea to me.
>>
>> I see no problem here.  See _CPU_SMP_Initialize() above.
>>
>
> I do not understand. Are you saying the design will run with a core count that
> does not match NP or AP ?

It is min { NP, AP }.

>
>>>
>>>>>
>>>>> I assume if we have n cores where n is 1..cpus available we enter the
>>>>> static
>>>>> constructors once and 'main' [1] once and this independent of the
>>>>> number of
>>>>> defined and/or operating cores. If an application's static constructor
>>>>> starts
>>>>> further threads it needs to manage the concurrency issues and main is
>>>>> only
>>>>> entered once.
>>>>
>>>> This change is about the low-level boot procedure.  High level concepts
>>>> like threads are not an issue here.
>>>>
>>>
>>> They are if you remove the all cores needing to sync plus using cpu
>>> numbers to
>>> control which cores go through to the static constructors and main and
>>> which
>>> cores go direct to the score thread code.
>>
>> If you remove this synchronization barrier via
>> PER_CPU_STATE_READY_TO_START_MULTITASKING, then you mandate that there
>> is boot loader that does these steps.  With the current set-up this boot
>> loader is optional.
>>
>
> Or the boot loading is not involved.
>
>>> It is this sort of code that concerns
>>> me ...
>>>
>>> http://git.rtems.org/rtems/tree/c/src/lib/libbsp/arm/shared/include/arm-a9mpcore-start.h#n88
>>>
>>>
>>
>> Yes, this is a dirty hack.  It works at the moment due to our immature
>> SMP state, but this must definitely change.
>>
>>>
>>>
>>> http://git.rtems.org/rtems/tree/c/src/lib/libbsp/sparc/shared/start/start.S#n226
>>>
>>>
>>
>> Yes, exactly the same issue.  This processor index 0 == boot processor
>> is a bad assumption.
>>
>
> I feel the single major issue here is a lack of defined requirement for SMP.
> The RTEMS Project and its community need to define the top level requirements
> and along with that tests that check those requirements. For me reviewing
> patches it is not clear what is a hack or temporary and what is long term and
> here for good. Questioning each patch or change to find this out is time
> consuming and not very productive.
[...]

Yes, we have the wiki page for this:

http://www.rtems.org/wiki/index.php?title=SMP

-- 
Sebastian Huber, embedded brains GmbH

Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone   : +49 89 189 47 41-16
Fax     : +49 89 189 47 41-09
E-Mail  : sebastian.huber at embedded-brains.de
PGP     : Public key available on request.

Diese Nachricht ist keine geschäftliche Mitteilung im Sinne des EHUG.