SMP Initialization

Fri Mar 7 02:25:06 UTC 2014

On 6/03/2014 9:04 pm, Sebastian Huber wrote:
> Hello,
>
> there is a potential problem in the SMP initialization procedure.
>
> One processor in the system has a special role, the so called boot
> processor. Currently this is the processor with index zero.  The way to
> select the boot processor may change in the future, but what will not
> change is that we have a boot processor.
>
> The boot processor initializes the data and BSS sections.  It also
> performs the sequential part of the RTEMS initialization.
>
> During the sequential initialization the function
>
> /**
>   * @brief Performs CPU specific SMP initialization in the context of the
>   * boot processor.
>   *
>   * This function is invoked on the boot processor by RTEMS during
>   * initialization.  All interrupt stacks are allocated at this point in
>   * case the CPU port allocates the interrupt stacks.
>   *
>   * The CPU port should start secondary processors now.
>   *
>   * @param[in] configured_cpu_count The count of processors requested by
>   * the application configuration.
>   *
>   * @return The count of processors available for the application in the
>   * system.  This value is less than or equal to the configured count of
>   * processors.
>   */
> uint32_t _CPU_SMP_Initialize( uint32_t configured_cpu_count );
>
> is called.  This function is currently implemented by the BSPs.  An example
> which starts the processor on its own:
>
> http://git.rtems.org/rtems/tree/c/src/lib/libbsp/sparc/leon3/smp/smp_leon3.c#n38
>
>
> An example which uses U-Boot to start the second processor:
>
> http://git.rtems.org/rtems/tree/c/src/lib/libbsp/powerpc/qoriq/startup/smp.c#n144
>
>
> The return value of _CPU_SMP_Initialize() will tell the RTEMS system how
> many processors are present.
>
> void _SMP_Handler_initialize( void )
> {
>    uint32_t max_cpus = rtems_configuration_get_maximum_processors();
>    uint32_t cpu;
>
> [...]
>
>    /*
>     * Discover and initialize the secondary cores in an SMP system.
>     */
>    max_cpus = _CPU_SMP_Initialize( max_cpus );
>
>    _SMP_Processor_count = max_cpus;
> }
>
> If the BSP says "you have three processors", and one of them is actually
> not available, then we have a problem later.
>
> Before the system starts multitasking there is a synchronization
> barrier.  This synchronization barrier is necessary to have a defined
> starting point for the scheduler.

Just to be clear, I assume this means that when the scheduler starts, the
defined resources are present and RTEMS does not have to support a
degraded mode. Getting the processors to the same point means the
application can assume a specific runtime scheduling profile from the
start. What happens after this is the application's problem and RTEMS
makes no further checks. It also means RTEMS cannot be started with
cores in a powered-down state that are then enabled on demand at the
application level.

>
> void _SMP_Request_start_multitasking( void )
> {
>    Per_CPU_Control *self_cpu = _Per_CPU_Get();
>    uint32_t ncpus = _SMP_Get_processor_count();
>    uint32_t cpu;
>
>    _Per_CPU_State_change(
>      self_cpu,
>      PER_CPU_STATE_READY_TO_START_MULTITASKING
>    );
>
>    for ( cpu = 0 ; cpu < ncpus ; ++cpu ) {
>      Per_CPU_Control *per_cpu = _Per_CPU_Get_by_index( cpu );
>
>      _Per_CPU_State_change(
>        per_cpu,
>        PER_CPU_STATE_REQUEST_START_MULTITASKING
>      );
>    }
> }
>
> So before this function returns ALL (!) processors must have changed
> into the PER_CPU_STATE_REQUEST_START_MULTITASKING (or into
> PER_CPU_STATE_SHUTDOWN which will terminate the system right now).
>
> In case one of the processors doesn't start, then we will wait here
> FOREVER (unless a watchdog kills us).

Having RTEMS enter this state is not good. I have seen it happen, and
debugging the code only to find out that a boot monitor was not
configured correctly was no fun.

Watchdog management is normally a system level requirement and RTEMS has 
kept clear of it for years.

>
> There are now several ways to deal with this.
>
> 1. You can consider this a BSP bug.  The BSP told the system via
> _CPU_SMP_Initialize() that so many processors are available.  If this is
> not the case then the BSP lied and you should fix the BSP.

The BSP would need to start the cores, check them, and make them wait at
another barrier before the CPU count can be returned, or generate some
sort of error.
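
As a rough sketch of that start/check/count sequence, and nothing more than
that: every sketch_* name below is hypothetical and not an existing RTEMS or
BSP interface, and the hardware is simulated with plain arrays so the
fragment stands alone. It also assumes processor indices must stay
contiguous, so counting stops at the first core that does not check in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_CPUS 4u

/* Hypothetical: set once a secondary has reached the extra barrier. */
static volatile bool sketch_checked_in[SKETCH_MAX_CPUS];

/* Hypothetical simulated hardware: core 2 never comes up. */
static const bool sketch_core_works[SKETCH_MAX_CPUS] = {
  true, true, false, true
};

/* Hypothetical stand-in for the BSP specific start of one secondary. */
static void sketch_start_secondary(uint32_t cpu)
{
  if (sketch_core_works[cpu]) {
    /* A real secondary would set this flag itself from its start code. */
    sketch_checked_in[cpu] = true;
  }
}

/*
 * Start each secondary, poll it a bounded number of times and return the
 * number of processors that are usable.  The boot processor (index zero)
 * is always counted.  Counting stops at the first core that fails to
 * check in, so the returned indices 0..count-1 are all alive.
 */
static uint32_t sketch_cpu_smp_initialize(
  uint32_t configured_cpu_count,
  uint32_t max_polls
)
{
  uint32_t cpu;

  if (configured_cpu_count == 0) {
    return 0;
  }

  if (configured_cpu_count > SKETCH_MAX_CPUS) {
    configured_cpu_count = SKETCH_MAX_CPUS;
  }

  for (cpu = 1; cpu < configured_cpu_count; ++cpu) {
    uint32_t polls = 0;

    sketch_start_secondary(cpu);

    while (!sketch_checked_in[cpu] && polls < max_polls) {
      ++polls;
    }

    if (!sketch_checked_in[cpu]) {
      break; /* this core did not start; a BSP could also go fatal here */
    }
  }

  return cpu;
}
```

Whether the failure branch should reduce the count, as here, or raise a
fatal error is exactly the policy question under discussion; the sketch
only shows where that decision would sit.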

>
> 2. You can consider this a feature of the BSP that it tells you wrong
> numbers.  So now what to do?
>

If degraded mode is not supported in any form, this is not a valid
solution, so we can rule it out.

> 2.1. You can install a watchdog driver that kills you no matter what
> corrupt system state you have.  If you analyze the per-CPU states in
> this case you will notice that some of the processors didn't start.
>
> 2.2. You can limit the time spent waiting.  If a timeout occurs then we
> can issue a fatal error that indicates exactly the problem area.

We should not allow forever loops without some form of exit.
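
A minimal sketch of such a bounded loop, assuming some free-running counter
is available. sketch_counter_read() is a hypothetical stand-in for whatever
counter the CPU port provides (simulated here so it advances on every read),
and the two flags only exist to make the fragment stand alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical free-running counter, simulated: advances on each read. */
static uint32_t sketch_counter_value;

static uint32_t sketch_counter_read(void)
{
  return sketch_counter_value++;
}

/* Simulated check-in flags: one core that started, one that did not. */
static volatile bool sketch_core_started = true;
static volatile bool sketch_core_stuck = false;

/*
 * Spin until *flag becomes true or timeout_ticks counter ticks have
 * elapsed.  The unsigned subtraction keeps the elapsed time correct
 * across a counter wrap-around.
 */
static bool sketch_wait_with_timeout(
  const volatile bool *flag,
  uint32_t timeout_ticks
)
{
  uint32_t start = sketch_counter_read();

  while (!*flag) {
    if (sketch_counter_read() - start >= timeout_ticks) {
      return false; /* caller decides what to do about the timeout */
    }
  }

  return true;
}
```

The point is only that the loop has an exit; what the caller does with a
false return is a separate decision.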

>
> 2.2.1 Now we need a facility to measure time (e.g. the CPU counter
> introduced recently).
>
> 2.2.2 Now we need a timeout.
>
> 2.2.2.1 The RTEMS kernel cannot know a proper timeout value.
>

Correct.

> 2.2.2.2 The CPU/BSP may know the timeout value.  How can the CPU/BSP
> tell the RTEMS kernel the timeout value?
>
> 2.2.2.3 We can add an application configuration item that specifies the
> timeout value and move the responsibility to the application developer.
>

-1

> I am in favor of 1. in combination with 2.1 and 2.2.2.2.  For BSPs with
> an unreliable start of secondary processors we should add a support
> function, e.g.
>
> /**
>   * @brief Waits for all other processors to enter the ready to start
>   * multitasking state with a timeout in microseconds.
>   *
>   * In case one processor enters the shutdown state, this function does not
>   * return.
>   *
>   * This function should be called only in _CPU_SMP_Initialize() if
>   * required by the CPU port or BSP.
>   *
>   * @param[in] processor_count The processor count which will later be
>   * returned by _CPU_SMP_Initialize().
>   * @param[in] timeout_in_us The timeout in microseconds.
>   *
>   * @retval true All other processors entered the ready to start
>   * multitasking state.
>   * @retval false Not all the other processors entered the ready to start
>   * multitasking state and the timeout expired.
>   */
> bool _Per_CPU_State_wait_for_ready_to_start_multitasking(
>    uint32_t processor_count,
>    uint32_t timeout_in_us
> );
>
> This avoids the burden for the application developer to know about the
> timeout configuration option and to select a proper value.

+1

> It moves the
> responsibility to deal with the issue to the BSP which knows best what to
> do.  In case false is returned it can either issue a fatal error or
> reduce the processor count.
>

Can I assume the BSP can use a call to see which cores are present and
which are not, e.g. the state of a CPU? Peeking into the per-CPU struct
is not a good idea.
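
To illustrate the kind of call meant here, a sketch under assumptions: the
sketch_* types and functions are hypothetical, not existing RTEMS
interfaces, and the state table is filled in statically so the fragment
stands alone. The idea is that the BSP asks through a small accessor and
never touches the per-CPU control itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_CPU_COUNT 4u

typedef enum {
  SKETCH_CPU_STATE_INITIAL,
  SKETCH_CPU_STATE_READY_TO_START_MULTITASKING,
  SKETCH_CPU_STATE_SHUTDOWN
} sketch_cpu_state;

/* Hypothetical stand-in for the per-CPU control, private to the kernel. */
typedef struct {
  volatile sketch_cpu_state state;
} sketch_per_cpu;

/* Simulated: cores 0 and 1 are ready, core 2 never started, core 3 shut down. */
static sketch_per_cpu sketch_per_cpu_table[SKETCH_CPU_COUNT] = {
  { SKETCH_CPU_STATE_READY_TO_START_MULTITASKING },
  { SKETCH_CPU_STATE_READY_TO_START_MULTITASKING },
  { SKETCH_CPU_STATE_INITIAL },
  { SKETCH_CPU_STATE_SHUTDOWN }
};

/* Hypothetical query: has this processor reached the ready state? */
static bool sketch_cpu_is_ready(uint32_t cpu_index)
{
  if (cpu_index >= SKETCH_CPU_COUNT) {
    return false; /* out of range counts as not present */
  }

  return sketch_per_cpu_table[cpu_index].state
    == SKETCH_CPU_STATE_READY_TO_START_MULTITASKING;
}

/* Count how many of the first cpu_count processors reported ready. */
static uint32_t sketch_count_ready(uint32_t cpu_count)
{
  uint32_t ready = 0;
  uint32_t cpu;

  for (cpu = 0; cpu < cpu_count; ++cpu) {
    if (sketch_cpu_is_ready(cpu)) {
      ++ready;
    }
  }

  return ready;
}
```

With something along these lines a BSP could tell after a timeout which
cores are missing, without depending on the layout of the per-CPU struct.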

The approach seems reasonable.

Chris