Affinity and Scheduler Instance Interaction

Sebastian Huber sebastian.huber at embedded-brains.de
Thu May 22 06:48:33 UTC 2014


Hello Joel,

It's good that you finally have the time to discuss high-level API parts more 
than one year after project start.  Up to now I had to guess what you actually 
wanted.

Thanks for the Linux API summary.  I think this makes it easier to find the 
right choice for RTEMS.

On 2014-05-21 19:06, Joel Sherrill wrote:
>
> On 5/21/2014 9:13 AM, Sebastian Huber wrote:
>> On 2014-05-21 16:00, Joel Sherrill wrote:
>>> Hi
>>>
>>> We have an SMP behavioral decision to make and it isn't getting
>>> enough discussion.
>>>
>>> With cluster scheduling, there are potentially multiple scheduler
>>> instances associated with non-overlapping subsets of cores.
>>>
>>> With affinity, a thread can be restricted to execute on a subset
>>> of the cores associated with a scheduler instance.
>>>
>>> There are operations to change the scheduler associated with
>>> a thread and the affinity of a thread.
>>>
>>> The question is whether changing affinity should be able to
>>> implicitly change the scheduler instance?
>>>
>>> I lean to no, because affinity and scheduler changes should
>>> be so rare in practical use that nothing should be
>>> implicit.
>> pthread_setaffinity_np() is the only way to set the scheduler via the
>> non-portable POSIX API, so I would keep the ability to change the
>> scheduler with it.
>
> Changing the scheduler via the affinity API violates what it does on Linux
> and I assume *BSD. Linux has the concept of cpuset(7) and the taskset(1)
> service. Those are roughly comparable to scheduler instances in that they
> provide a higher level restriction on what the affinity can do. You can't
> change the taskset with affinity operations. The taskset is an additional
> filter.
>
> I write a lot below and give options, but the bottom line is that when you
> implicitly change the scheduler via affinity, you end up with edge cases
> and arbitrary decision making inside RTEMS, and you violate the rule of
> least surprise.
>
> IMO we should not overload the affinity APIs to move scheduler instances.
> Ignoring any other reason I think this is bad, the user should be 100%
> aware it is happening.
>> How would you change the scheduler with the non-portable POSIX API?
>
> Since we started with an evaluation of the Linux and *BSD APIs, I think we
> should return to them.
>
> It is an error on Linux for pthread_setaffinity_np() when "The affinity
> bit mask contains no processors that are currently physically on the
> system and permitted to the thread according to any restrictions that may
> be imposed by the "cpuset" mechanism described in cpuset(7)".
>
> I read this as an error for specifying a set that contains no processors
> available to this thread, filtered by physical availability and cpuset(7)
> restrictions. It is not an error to specify processors that are physically
> present but outside the cpuset(7).
>
> But according to cpuset(7):
>
> Cpusets are integrated with the sched_setaffinity(2) scheduling affinity
> mechanism and the mbind(2) and set_mempolicy(2) memory-placement
> mechanisms in the kernel. Neither of these mechanisms let a process
> make use of a CPU or memory node that is not allowed by that process’s
> cpuset. If changes to a process’s cpuset placement conflict with these
> other mechanisms, then cpuset placement is enforced even if it means
> overriding these other mechanisms.
>
> If you treat cpuset(7) similar to a scheduler instance, then it is
> an error to specify 0 cores within your scheduler instance.
> Including a processor in the affinity mask outside those owned
> by the taskset (e.g. scheduler instance) is not an error based on
> the Linux man pages. Those extra bits in the affinity mask would
> just be ignored.
>
> On Linux, changing affinity explicitly can't move you outside that
> cpuset(7). It is documented as a restriction on the affinity.
>
> This allows you to have affinity for all processors and use the
> scheduler instance or taskset(1) on Linux to balance.
> Seems reasonable and the normal use case.
>
> ==Options==
>
> As best I can tell, we have a few options.
>
> Option 1. You can't do it via the POSIX API and must use the
> Classic API. Linux certainly has its own unique tool with taskset(1).

taskset(1) is a simple wrapper around sched_setaffinity():

https://gitorious.org/util-linux-ng/util-linux-ng/source/de878776623b120fc1e96568f4cd69c349ec2677:schedutils/taskset.c
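
For illustration, the core of such a wrapper is little more than building a 
CPU mask and handing it to sched_setaffinity().  A minimal sketch (Linux, 
glibc CPU_* macros; option parsing and the real taskset features omitted):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Usage: ./a.out PID CPU...  Pins the given process to the listed CPUs. */
int main(int argc, char **argv)
{
  cpu_set_t set;
  int i;

  if (argc < 3) {
    fprintf(stderr, "usage: %s PID CPU...\n", argv[0]);
    return 1;
  }

  /* Build the affinity mask from the CPU numbers on the command line. */
  CPU_ZERO(&set);
  for (i = 2; i < argc; ++i) {
    CPU_SET(atoi(argv[i]), &set);
  }

  /* Hand the mask to the kernel for the given process. */
  if (sched_setaffinity((pid_t) atoi(argv[1]), sizeof(set), &set) != 0) {
    perror("sched_setaffinity");
    return 1;
  }

  return 0;
}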

What corresponds more closely to our clustered scheduling concept is 
cpuset(7), which uses a pseudo-file-system interface.

>
> Option 2. Add other POSIX APIs that are _np.
>
> I don't really care if we add POSIX _np methods or say that
> to change schedulers you have to use the Classic API. Since this
> is so far beyond standardization, I lean toward saying it is comparable
> to Linux taskset(1) and you must use the Classic API scheduler
> methods. It is tightly tied to OS implementation and system
> configuration. POSIX will never address it.
>
> In both of those options, there currently is some behavior that
> does not match Linux. Since affinity is not stored by schedulers
> without affinity support, there is no way to maintain affinity
> information.

If we want to mimic this cpuset(7) behaviour, then we should store the affinity 
information for all schedulers.
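
A hypothetical sketch of what I mean, assuming we keep the set in the 
scheduler-independent per-thread node; the member name and layout are my 
invention, not the current RTEMS declaration:

#define _GNU_SOURCE
#include <sched.h>

/* Store the requested affinity independent of the scheduler instance, so
 * that it survives a move between instances.  Every scheduler keeps the
 * set; only affinity-aware schedulers consult it when placing the thread. */
typedef struct {
  int priority;        /* stand-in for the real scheduler-independent members */
  cpu_set_t Affinity;  /* requested affinity, preserved across instance moves */
} Scheduler_Node_sketch;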

>
> Since the per-scheduler Node information is created when a
> thread moves schedulers, affinity aware schedulers that maintain
> this information only have a couple of options. The thread's
> affinity mask can be implicitly set to all cores or all those owned
> by this instance when it is moved to a scheduler with affinity support.
>
> Any attempt by the application to set an affinity that would be
> meaningful to another scheduler is ignored. This does not
> follow the Linux/taskset behavior.

Yes, to follow the Linux behaviour we should store the affinity map 
independent of the particular scheduler (e.g. in Scheduler_Node).  The 
cpuset(7) man page says this:

http://man7.org/linux/man-pages/man7/cpuset.7.html

"Every  process  in the system belongs to exactly one cpuset."

-> In RTEMS we can interpret this as "Every task in the system belongs to 
exactly one scheduler instance".

"Cpusets  are  integrated  with  the  sched_setaffinity(2)  scheduling affinity 
mechanism and the mbind(2) and set_mempolicy(2) memory-placement mechanisms in 
the kernel.  Neither of these mechanisms let a process make use of a CPU or 
memory node that is not allowed by that process's cpuset.  If changes to a 
process's cpuset placement conflict with these other mechanisms, then cpuset 
placement is enforced even if it means overriding these other mechanisms.  The 
kernel accomplishes this  overriding  by  silently restricting  the  CPUs  and 
  memory nodes requested by these other mechanisms to those allowed by the 
invoking process's cpuset.  This can result in these other calls returning an 
error, if for example, such a call ends up requesting an empty set of CPUs or 
memory nodes, after that request is restricted to the invoking process's cpuset."

-> In RTEMS we should not change the scheduler with the set affinity 
operation.  We just have to make sure that the affinity map includes at least 
one processor of the current scheduler instance.  If we change the scheduler 
and the affinity map has no processor in the target scheduler, then this 
should be an error.
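
Roughly like this, a sketch with invented names, using the glibc/newlib 
CPU_* operations:

#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

/* A scheduler change is valid only if the thread's affinity set and the
 * processor set owned by the target scheduler instance intersect. */
bool scheduler_change_is_valid(
  const cpu_set_t *thread_affinity,
  const cpu_set_t *target_instance_cpus
)
{
  cpu_set_t intersection;

  CPU_AND(&intersection, thread_affinity, target_instance_cpus);

  /* An empty intersection means the thread could never execute in the
   * target instance, so the set scheduler operation must return an error. */
  return CPU_COUNT(&intersection) > 0;
}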

> Consider this sequence:
>
> Four core system, scheduler A on 0/1, scheduler B on 2/3.
> Application wants thread to have affinity for 0 and 2.
> Thread starts on Scheduler A.
>
> When it changes to B, the process results in the thread having
> affinity for both 2 and 3, and it could run on 3, violating the
> explicit application-requested affinity.
>
> I think the user explicitly selecting the scheduler and thread
> affinity being part of the SMP node information is the best option.

Yes, I agree now.

>
> I only see one error condition to consider when explicitly setting
> the scheduler. That is when the thread's affinity mask does not
> include the new scheduler instance's processors. This could easily
> be checked at the "scheduler independent" level if we add a
> cpuset to Scheduler_SMP_Context to indicate which cores the scheduler
> instance owns. Then we could easily error check this with
> the cpuset.h boolean operations.

Ok.
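
Along these lines, again with invented names: give each SMP scheduler 
instance its owned processor set, so the check above can be done at the 
"scheduler independent" level with the cpuset boolean operations:

#define _GNU_SOURCE
#include <sched.h>

typedef struct {
  unsigned processor_count;    /* stand-in for the existing context members */
  /* processors owned by this scheduler instance; lets the scheduler
   * independent level validate a scheduler change against a thread's
   * affinity with a simple CPU_AND */
  cpu_set_t Owned_processors;
} Scheduler_SMP_Context_sketch;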

>
> I also am now convinced that the SMP schedulers have
> the option of ignoring affinity but they should honor the
> data values. This was done by our original code but changed
> upon request to not have the thread Affinity as part of
> the basic Scheduler_SMP_Node.

Yes, sorry, I now see it differently.

>
> Hmmm.. you could have an instance of a single CPU
> scheduler in an SMP clustered configuration. This
> means the affinity data really should be in
> Scheduler_Node when SMP is enabled.
>
> Options 3-5 are implicit changes via set affinity but they
> are not desirable to me. I think we could make them work.
> But I don't like the implications, implicit actions, change from
> Linux behavior, etc.
>
> Option 3 (implicit). Allow setting affinity to no CPUs to remove a
> thread from all schedulers. This would leave it in limbo
> but it would be the caller's obligation to follow up with
> a set affinity. I don't like this one because it trips the error
> for Linux pthread_setaffinity_np() above when there are
> no cores specified.
>
> Option 3 is a no-go to me.
>
> Option 4 (implicit): _Scheduler_Set_affinity should validate a 1->1
> scheduler instance change and set the new affinity in
> the new instance.
>
> Option 5 is an easy optimization of 4. Add a cpuset to
> Scheduler_SMP_Context indicating which cores are
> associated with this scheduler instance.
>
> Implicit scheduler changes break what I think is the
> most useful case of clustered scheduling. Affinity for
> cores in multiple schedulers, move threads dynamically
> to different scheduler instances to perform load balancing.
>
>>> Consider this scenario:
>>>
>>> Scheduler A: cores 0-1
>>> Scheduler B: cores 2-3
>>>
>>> Thread 1 is associated with Scheduler B and with affinity 2-3
>>> can run on either processor scheduled by B.
>>>
>>> Thread 1 changes affinity 1-3. Should this change the scheduler,
>>> be an error, or just have an affinity for a core in the system
>>> that is not scheduled by this scheduler instance?
>> This is currently an error:
>>
>> http://git.rtems.org/rtems/tree/testsuites/smptests/smpscheduler02/init.c#n141
>
> OK. If I am reading this correctly, the destination affinity must be
> within a single scheduler instance?
>
> If so, then that is not compatible with the Linux use of affinity and
> task sets. But I agree that it is the only safe use of implicit scheduler
> changes.
>
> It is only a problem when the scheduler logic moved it to another
> scheduler instance.
>
>>> If you look at the current code for _Scheduler_Set_affinity(),
>>> it looks like the current behavior is none of the above and
>>> appears to just be broken. Scheduler A's set affinity operation
>>> is invoked and Scheduler B is not informed that it no longer
>>> has control of Thread 1.
>>>
>> It is informed in case _Scheduler_default_Set_affinity_body() is used.  What is
>> broken is the _Scheduler_priority_affinity_SMP_Set_affinity() function.
> What is broken is the concept of implicitly changing schedulers via
> affinity.

I don't think it's broken, but there are alternatives.  I think that the 
alternative you proposed here is better.  So we should remove the implicit 
scheduler changes from the set affinity operation.

>
> By not discussing the options at the API level, you boxed yourself into an
> implementation. Since user use-cases were also not discussed before you
> started coding, there was no feedback on what would be considered
> desirable user visible behavior.

If I had waited until now to start coding, then we would be in deeper 
trouble.  The support for affinity maps was on your list, and you didn't 
discuss anything in detail until now.  It is trivial to change the 
implementation to reflect the new requirements.  I have implemented and tested 
all the low-level parts, which you can use now.

>
>> Please have a look at the attached patch which I already sent to a similar thread.
>>
> Which patch?
>

It was attached to the last mail.

-- 
Sebastian Huber, embedded brains GmbH

Address : Dornierstr. 4, D-82178 Puchheim, Germany
Phone   : +49 89 189 47 41-16
Fax     : +49 89 189 47 41-09
E-Mail  : sebastian.huber at embedded-brains.de
PGP     : Public key available on request.

This message is not a business communication within the meaning of the EHUG.


