[PATCH 2/4] score: SMP initialization and shutdown changes

Mon Feb 24 23:02:08 UTC 2014

On 21/02/2014 6:48 pm, Sebastian Huber wrote:
> On 2014-02-21 01:31, Chris Johns wrote:
>> On 21/02/2014 3:31 am, Sebastian Huber wrote:
>>> Hello Chris,
>>>
>>> On 2014-02-20 03:50, Chris Johns wrote:
>>>> On 20/02/2014 12:42 am, Sebastian Huber wrote:
>>>>> +/**
>>>>> + * @brief State of a processor.
>>>>> + *
>>>>> + * @dot
>>>>> + * digraph states {
>>>>> + *   bi [label="PER_CPU_STATE_BEFORE_INITIALIZATION"];
>>>>> + *   rsm [label="PER_CPU_STATE_READY_TO_START_MULTITASKING"];
>>>>> + *   sm [label="PER_CPU_STATE_START_MULTITASKING"];
>>>>> + *   ds [label="PER_CPU_STATE_DO_SHUTDOWN"];
>>>>> + *   u [label="PER_CPU_STATE_UP"];
>>>>> + *   s [label="PER_CPU_STATE_SHUTDOWN"];
>>>>> + *   bi -> rsm [label="secondary processor\ncompleted
>>>>> initialization"];
>>>>> + *   bi -> u [label="main processor\nstarts multitasking"];
>>>>> + *   rsm -> sm [label="main processor\ncompleted initialization"];
>>>>> + *   rsm -> ds [label="a fatal error occurred"];
>>>>> + *   ds -> s [label="do shutdown\nstate observed"];
>>>>> + *   sm -> u [label="secondary processor\nstarts multitasking"];
>>>>> + *   u -> s [label="shutdown initiated"];
>>>>> + * }
>>>>> + * @enddot
>>>>
>>>> I do not see sm to ds if the main core goes to s after going from bi
>>>> to u ?
>>>>
>>>> I also see u to s and not to ds; why as the state is called "do
>>>> shutdown" ?
>>>> Should this state be PER_CPU_STATE_SHUTTING_DOWN ?
>>>
>>> I had to change the procedure considerably since I noticed unsolvable
>>> problems with the previous approach based only on atomic read/write
>>> operations. Attached is the new state diagram.
>>>
>>
>> I do not follow this picture. I am assume there is one state machine
>> per CPU
>> and so I do not see where the state "starting multitasking" is. I see
>> a state
>> called "request start multitasking" then "up" and I am wondering what
>> the state
>> is after the request has happened and before up.
>
> The application defines a maximum count of processors (NP).  The
> configuration will then define the _Per_CPU_Information[NP] table.  Each
> table entry has one per-CPU state entry.

Does the BSP define the max number of possible cores the hardware can 
support ? I see later in this email you talk about AP being available 
processors and defined by the BSP. I think I am getting confused with 
the term "maximum count of processors". Does it mean the max count the 
application may require or the max count supported by the hardware ?

There is another issue here I am struggling with and while specifically 
not on topic for this patch it relates to the configuration of processor 
count so please indulge me. The application having to define the number 
of cores means there is no default position and that means we have this ...

http://git.rtems.org/rtems/tree/testsuites/smptests/smp01/system.h#n28

which is based on this ...

http://git.rtems.org/rtems/tree/cpukit/sapi/include/confdefs.h#n205

A user builds RTEMS with --enable-smp and then has another gate in 
CONFIGURE_SMP_APPLICATION. This does not make sense. Why build for SMP 
and then never use it yet incur lots of overhead ? If I build RTEMS with 
SMP enabled for a BSP that supports SMP I would expect it to work with 
SMP and not default to some half way point. For example if you build the 
Zync A9 qemu BSP with SMP enabled and build all the tests then run them 
on a patched qemu with SMP support you happily see about 10 maybe 15 
failures for over 480 tests however these result are meaningless because 
CONFIGURE_SMP_APPLICATION is only defined for a few SMP specific tests 
and incorrectly for the Zync because it only has 2 cores and not 4. I 
feel we should not go to a release with the tests in this state even 
with clear documentation. If the tests fail with SMP we need to have the 
test results show this and document what the failures are so users are 
left with a clear picture of the state of SMP support. For example with 
the Zynq A9 qemu example none of the block (libbd) tests should pass as 
libbd uses disable pre-emption and this should generate an error if used.

The CONFIGURE_SMP_APPLICATION needs to go however there is no default 
core count defined by the BSP that can be used to allow this. IMO if the 
application does not define the core count it needs, ie less than the 
max supported by the hardware, the max possible should be used.

>  Since this table is in the BSS
> section, all states start with PER_CPU_STATE_INITIAL.

Who is clearing the BSS and how is this managed ? There seems to be some 
implicit requirements here. The requirement being "RTEMS requires all 
processors to run in an SMP environment must be started externally to 
RTEMS". Is this requirement what we really want ?

On the ARM (well the zynq I have used) starting processors is easy and 
the code needed is tiny and could be made available in the BSP shared 
area, while I understand on the PPC it is complex. I suppose the 
question is adding this support to RTEMS what we want to have and worth 
it verses adding this code gives us complete control over initialisation 
and makes the boot monitor simpler.

Given I currently only use the ARM which is easy and I will not ship 
uboot due to licensing reason I favour adding support to RTEMS. :)

>
> The boot processor calls boot_card() and all other processors do what
> they want, but they must wait until the boot processor gives the go
> (explained later).
>
> The boot processor calls eventually:
>
>    /**
>     * @brief Performs CPU specific SMP initialization in the context of
> the boot
>     * processor.
>     *
>     * This function is invoked on the boot processor by RTEMS during
>     * initialization.  All interrupt stacks are allocated at this point
> in case
>     * the CPU port allocates the interrupt stacks.
>     *
>     * The CPU port should start secondary processors now.

What does this mean ? I am not sure if you mean the boot monitor does 
this or this is handled in RTEMS in the BSP or related CPU code.

>     *
>     * @param[in] configured_cpu_count The count of processors requested
> by the
>     * application configuration.
>     *
>     * @return The count of processors available for the application in
> the system.
>     * This value is less than or equal to the configured count of
> processors.
>     */
>    uint32_t _CPU_SMP_Initialize( uint32_t configured_cpu_count );
>
> So here has the CPU port (or BSP in most cases) the chance to reduce the
> configured count of processors to the actually available (AP).  We have
> AP <= NP.  In case this function returns "you have three processors",
> then these three processors MUST start properly or terminate the
> system.  If some processors are not always available or otherwise
> unreliable this must be dealt with in _CPU_SMP_Initialize().

This is confusing me. The application controls the max number of cores 
(NP) a system can have and the BSP (or CPU port which knows the max that 
can exist) can configure that down to the actual number. Does this mean 
I could define 32 cores at the application and Zynq BSP would say, sorry 
only 2 here so this is what you get ? Does this waste RAM ?

What about an application defining 2 on a 4 core system ?

>
> System termination can happen at anytime on any processor.  So you can
> change from every state into PER_CPU_STATE_SHUTDOWN.  This ability is
> the core of this change set.  In the previous implementation this was
> not guaranteed.
>

Did the state diagram show this ? I am sorry if I missed this.

> Now lets look at the normal start-up (no shutdown).  The next state
> after PER_CPU_STATE_INITIAL is PER_CPU_STATE_READY_TO_START_MULTITASKING.
>
>    /**
>     * @brief Processor is ready to start multitasking.
>     *
>     * The secondary processor performed its basic initialization and is
> ready to
>     * receive inter-processor interrupts.  Interrupt delivery must be
> disabled
>     * in this state, but requested inter-processor interrupts must be
> recorded
>     * and must be delivered once the secondary processor enables
> interrupts for
>     * the first time.  The boot processor will wait for all secondary
> processors
>     * to change into this state.  In case a secondary processor does not
> reach
>     * this state the system will not start.  The secondary processors
> wait now
>     * for a change into the PER_CPU_STATE_REQUEST_START_MULTITASKING
> state set
>     * by the boot processor once all secondary processors reached the
>     * PER_CPU_STATE_READY_TO_START_MULTITASKING state.
>     */
>    PER_CPU_STATE_READY_TO_START_MULTITASKING,
>
> The key point for a per-CPU state and not a global state is that every
> processor must perform some initialization steps to set up the "I can
> receive inter-processor interrupts (IPI)" state.  Before IPIs are
> possible the only why are spin variables (e.g. this per-CPU state
> variables).
>
> This PER_CPU_STATE_READY_TO_START_MULTITASKING is a synchronization
> barrier.
>
> The boot processor will then set the state to
> PER_CPU_STATE_REQUEST_START_MULTITASKING on all processors configured
> and available (AP).  Once the secondary processor observes this state
> change (it spins on its state variable), it will go into the
> PER_CPU_STATE_UP state and perform a context switch to the first thread.
>

If this is about getting to a suitable IPI state on each core then why 
not have this in the states and functions rather then the global state 
type names of "start multitasking" etc.

I see the issue of needing a clear path from no IPI to having IPI 
however I am still wondering why the need to have the synch point and 
all cores to be there before moving on. A new core in the system should 
be able to see the system state and IPIs are active and then enable its 
support then add itself to the system.

If there is a global state variable with suitable spinning locks and 
atomics that direct the per cpu paths taken I fail to see why we need 
this sync barrier. The first core to a given global state makes the 
transition and the other cores need to wait or move to a different 
state. This assumes the BSS init issue is resolved.

>>
>>>>
>>>> I do not see why we have main and secondary processors ?
>>>
>>> I changed "main processor" into "boot processor" to highlight that this
>>> is only the case during system boot.
>>>
>>
>> Is there any code checking the processor number ? I see it in the
>>
>>>> This is symmetric
>>>> multiprocessing which means each core is the same therefore capable of
>>>> completing any required task. I understand there are paths which
>>>> need to
>>>> complete so if we have states for these phases as gates then any
>>>> processor that
>>>> arrives should be able to enter the gate (spinning lock for those that
>>>> need to
>>>> wait) and complete the work. This means a degraded state can exist and
>>>> things
>>>> at least start. The application would need to detect and manage the
>>>> degraded
>>>> state and so RTEMS should not be concerned with this condition other
>>>> than doing
>>>> its best to run where possible.
>>>
>>> In case the boot procedure of the system is unstable then it makes no
>>> sense to run an application.
>>>
>>
>> This depends on the application and its system requirements and how the
>> hardware is constructed. A boot monitor can see cores are not present and
>> decide not to start RTEMS or it can decide to start a single core.
>> This relates
>> to the system's requirement and can vary and has little to do with
>> RTEMS. If a
>> boot monitor passes control to RTEMS then it should run and only fail
>> if it's
>> integrity it not correct. RTEMS does should not move into the area of
>> system
>> requirements or constraints. Adding constraints like this in the
>> operating
>> system does not seem a good idea to me.
>
> I see no problem here.  See _CPU_SMP_Initialize() above.
>

I do not understand. Are you saying the design will run with a core 
count that does not match NP or AP ?

>>
>>>>
>>>> I assume if we have n cores where n is 1..cpus available we enter the
>>>> static
>>>> constructors once and 'main' [1] once and this independent of the
>>>> number of
>>>> defined and/or operating cores. If an application's static constructor
>>>> starts
>>>> further threads it needs to manage the concurrency issues and main is
>>>> only
>>>> entered once.
>>>
>>> This change is about the low-level boot procedure.  High level concepts
>>> like threads are not an issue here.
>>>
>>
>> They are if you remove the all cores needing to sync plus using cpu
>> numbers to
>> control which cores go through to the static constructors and main and
>> which
>> cores go direct to the score thread code.
>
> If you remove this synchronization barrier via
> PER_CPU_STATE_READY_TO_START_MULTITASKING, then you mandate that there
> is boot loader that does these steps.  With the current set-up this boot
> loader is optional.
>

Or the boot loading is not involved.

>> It is this sort of code that concerns
>> me ...
>>
>> http://git.rtems.org/rtems/tree/c/src/lib/libbsp/arm/shared/include/arm-a9mpcore-start.h#n88
>>
>
> Yes, this is a dirty hack.  It works at the moment due to our immature
> SMP state, but this must definitely change.
>
>>
>>
>> http://git.rtems.org/rtems/tree/c/src/lib/libbsp/sparc/shared/start/start.S#n226
>>
>
> Yes, exactly the same issue.  This processor index 0 == boot processor
> is a bad assumption.
>

I feel the single major issue here is a lack of defined requirement for 
SMP. The RTEMS Project and its community need to define the top level 
requirements and along with that tests that check those requirements. 
For me reviewing patches it is not clear what is a hack or temporary and 
what is long term and here for good. Questioning each patch or change to 
find this out is time consuming and not very productive.

>>
>> I see no need for this code if the per cpu and the system level state are
>> handled symmetrically. That is all cores enter the same entry point
>> and the
>> state manages where and what they need to do.
>
> Yes, I can imagine that all processors may call boot_card() and then
> something like this
>
> SMP_lock boot_lock
>
> boot_card:
>      boot_lock.acquire()
>      make first processor reaching this the boot processor
>      boot_lock.release()
>
>      bsp_start(is_boot_processor)
>
>      if is_boot_processor:
>          start sequential boot code
>      else:
>          wait for start multitasking request
>

Yeah I see us wanting something like this.

>>
>> By the way is there a LEON4 or LEON5 in the works so is this
>> dependence on
>> LEON3 because it is the first to support SMP or because it is specific
>> to this
>> variant of the cpu ?
>
> I didn't notice any differences between the LEON3 and LEON4 so far.
>

The code has ....

#if defined(START_LEON3_ENABLE_SMP)

Does a LEON3 define START_LEON3_ENABLE_SMP ?

>>
>>>>
>>>> [1] I am not concerned with the Classic API init task tables with more
>>>> than one
>>>> init task and what happens with static constructors as this is not
>>>> defined by
>>>> any standard. I will never directly use init task tables because it is
>>>> not
>>>> portable. All we can do is make sure static constructors and
>>>> destructors are
>>>> only called once.
>>>
>>> Yes, the constructor calls are an open issue.
>>>
>>
>> Well lets get the issue on the table and discuss it. I see it is very
>> much a
>> part of this discussion and the overall design.
>
> Ok, but please open another thread for this.  This is not related to
> this change set.
>

I am only referring to the place the entry points are called and not 
what happens within the constructor calls etc nor am I talking about the 
general C++ threading support.

Chris