Affinity and Scheduler Instance Interaction

Joel Sherrill joel.sherrill at OARcorp.com
Wed May 21 17:06:25 UTC 2014


On 5/21/2014 9:13 AM, Sebastian Huber wrote:
> On 2014-05-21 16:00, Joel Sherrill wrote:
>> Hi
>>
>> We have an SMP behavioral decision to make and it isn't getting
>> enough discussion.
>>
>> With cluster scheduling, there are potentially multiple scheduler
>> instances associated with non-overlapping subsets of cores.
>>
>> With affinity, a thread can be restricted to execute on a subset
>> of the cores associated with a scheduler instance.
>>
>> There are operations to change the scheduler associated with
>> a thread and the affinity of a thread.
>>
>> The question is whether changing affinity should be able to
>> implicitly change the scheduler instance?
>>
>> I lean to no because affinity and scheduler changes should
>> be so rare in practical use that nothing should be
>> implicit.
> The pthread_setaffinity_np() is the only way to set the scheduler via the 
> non-portable POSIX API.  So I would keep the ability to change the scheduler 
> with it.

Changing the scheduler via the affinity API violates what it does on Linux
and I assume *BSD. Linux has the concept of cpuset(7) and the taskset(1)
service. Those are roughly comparable to scheduler instances in that they
provide a higher level restriction on what the affinity can do. You can't
change taskset with affinity operations. The taskset is an additional filter.

I write a lot below and give options but the bottom line is that when you
implicitly change scheduler via affinity, you end up with edge cases,
arbitrary decision making inside RTEMS, and violate the rule of least
surprise.

IMO we should not overload the affinity APIs to move scheduler instances.
Even ignoring every other reason I think this is bad, the user should be
100% aware when it is happening.
> How would you change the scheduler with the non-portable POSIX API?

Since we started with an evaluation of the Linux and *BSD APIs, I think we
should return to them.

It is an error on Linux for pthread_setaffinity_np() when
"The affinity bit mask contains
no processors that are currently physically on the system and
permitted to the thread according to any restrictions that may
be imposed by the "cpuset" mechanism described in cpuset(7)."

I read this as an error for specifying a set in which none of the
processors are available to this thread, after filtering by physical
availability and cpuset(7) restrictions. It is not an
error to specify processors that are physically present but
outside the cpuset(7).

But according to cpuset(7):

Cpusets are integrated with the sched_setaffinity(2) scheduling affinity
mechanism and the mbind(2) and set_mempolicy(2) memory-placement
mechanisms in the kernel. Neither of these mechanisms let a process
make use of a CPU or memory node that is not allowed by that process’s
cpuset. If changes to a process’s cpuset placement conflict with these
other mechanisms, then cpuset placement is enforced even if it means
overriding these other mechanisms.

If you treat cpuset(7) as similar to a scheduler instance, then it is
an error to specify 0 cores within your scheduler instance.
Including a processor in the affinity mask outside those owned
by the taskset (i.e. the scheduler instance) is not an error based on
the Linux man pages. Those extra bits in the affinity mask would
just be ignored.
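
As a rough sketch of that reading (not RTEMS or Linux code), the rule can
be written as a filter over bit masks. The names core_mask and
filter_affinity() are hypothetical, and a plain 32-bit mask stands in for
a real cpu_set_t:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical representation: bit n set means core n is in the set. */
typedef uint32_t core_mask;

/*
 * Filter the requested affinity by the cores the scheduler instance owns.
 * Returns 0 and stores the filtered mask, or EINVAL when none of the
 * requested cores is owned by the instance. Requested cores outside the
 * instance are silently dropped, as the man page reading above suggests.
 */
static int filter_affinity(
  core_mask  requested,
  core_mask  owned,
  core_mask *effective
)
{
  core_mask usable = requested & owned;

  assert(effective != NULL);

  if (usable == 0) {
    return EINVAL;
  }

  *effective = usable;
  return 0;
}
```

With an instance owning cores 0 and 1 (mask 0x3), a request for cores 0
and 2 (mask 0x5) succeeds with only core 0 usable, while a request for
cores 2 and 3 (mask 0xC) fails.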

On Linux, changing affinity explicitly can't move you outside that
cpuset(7). It is documented as a restriction on the affinity.

This allows you to have affinity for all processors and use the
scheduler instance or taskset(1) on Linux to balance.
Seems reasonable and the normal use case.

==Options==

As best I can tell, we have a few options.

Option 1. You can't do it via the POSIX API and must use the
Classic API. Linux certainly has its own unique tool with taskset(1).

Option 2. Add other POSIX APIs that are _np.

I don't really care if we add POSIX _np methods or say that
to change schedulers you have to use the Classic API. Since this
is so far beyond standardization, I lean to saying it is comparable
to Linux taskset(1) and you must use the Classic API scheduler
methods. It is tightly tied to OS implementation and system
configuration. POSIX will never address it.

In both of those options, there currently is some behavior that
does not match Linux. Since affinity is not stored by schedulers
without affinity support, there is no way to maintain affinity
information.

Since the per-scheduler Node information is created when a
thread moves schedulers, affinity aware schedulers that maintain
this information only have a couple of options. The thread's
affinity mask can be implicitly set to all cores or all those owned
by this instance when it is moved to a scheduler with affinity support.

Any attempt by the application to set an affinity that would be
meaningful to another scheduler is ignored. This does not
follow the Linux/taskset behavior. Consider this sequence:

Four core system, scheduler A on 0/1, scheduler B on 2/3.
Application wants thread to have affinity for 0 and 2.
Thread starts on Scheduler A.

When it changes to B, the move results in the thread having
affinity for both 2 and 3, so it could run on 3, violating the
explicit application requested affinity.
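
The difference can be sketched with two hypothetical policies for what
happens to the mask on a scheduler move. This is only an illustration
with made-up names (core_mask, CORE, move_keeping_affinity,
move_resetting_affinity), not the RTEMS implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical representation: bit n set means core n is in the set. */
typedef uint32_t core_mask;

#define CORE(n) ((core_mask) 1u << (n))

/*
 * Policy 1: the requested mask stays with the thread across the move and
 * is only filtered by the cores the new instance owns.
 */
static core_mask move_keeping_affinity(core_mask requested, core_mask new_owned)
{
  return requested & new_owned;
}

/*
 * Policy 2: the requested mask was never stored, so the thread ends up
 * with affinity for every core the new instance owns.
 */
static core_mask move_resetting_affinity(core_mask new_owned)
{
  assert(new_owned != 0); /* an instance owns at least one core */
  return new_owned;
}
```

For the sequence above, with scheduler B owning cores 2 and 3 and a
requested affinity of cores 0 and 2, policy 1 leaves the thread on core 2
only, while policy 2 lets it run on cores 2 and 3.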

I think the user explicitly selecting the scheduler and thread
affinity being part of the SMP node information is the best option.

I only see one error condition to consider when explicitly setting
the scheduler. That is when the thread's affinity mask does not
include any of the new scheduler instance's processors. This could easily
be checked at the "scheduler independent" level if we add a
cpuset to Scheduler_SMP_Context to indicate which cores the scheduler
instance owns. Then we could easily error check this with
the cpuset.h boolean operations.
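
A minimal sketch of such a check, assuming a per-instance record of owned
cores. The type scheduler_context_sketch and the function
can_move_to_scheduler() are hypothetical stand-ins, and a 32-bit mask
replaces the cpu_set_t that real code would combine with the cpuset.h
operations:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical representation: bit n set means core n is in the set. */
typedef uint32_t core_mask;

/* Hypothetical stand-in for a scheduler context that records its cores. */
typedef struct {
  core_mask owned_cores;
} scheduler_context_sketch;

/*
 * An explicit scheduler change is acceptable only if the thread's
 * affinity mask shares at least one core with the destination instance.
 */
static bool can_move_to_scheduler(
  core_mask                       thread_affinity,
  const scheduler_context_sketch *destination
)
{
  assert(destination != NULL);
  return (thread_affinity & destination->owned_cores) != 0;
}
```

The check depends only on the two masks, so it could sit above the
individual scheduler implementations.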

I also am now convinced that the SMP schedulers have
the option of ignoring affinity but they should honor the
data values. This was done by our original code but changed
upon request to not have the thread Affinity as part of
the basic Scheduler_SMP_Node.

Hmmm.. you could have an instance of a single CPU
scheduler in an SMP clustered configuration. This
means the affinity data really should be in
Scheduler_Node when SMP is enabled.

Options 3-5 are implicit changes via set affinity but they
are not desirable to me. I think we could make them work.
But I don't like the implications, implicit actions, change from
Linux behavior, etc.

Option 3 (implicit). Allow setting affinity to no CPUs to remove a
thread from all schedulers. This would leave it in limbo
but it would be the caller's obligation to follow up with
a set affinity. I don't like this one because it trips the error
for Linux pthread_setaffinity_np() above when there are
no cores specified.

Option 3 is a no-go to me.

Option 4 (implicit): _Scheduler_Set_affinity should validate a 1->1
scheduler instance change and set the new affinity in
the new instance.

Option 5 is an easy optimization of 4. Add a cpuset to
Scheduler_SMP_Context indicating which cores are
associated with this scheduler instance.
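
To illustrate what the validation in Options 4 and 5 might look like, here
is a sketch that searches for the single instance whose owned cores
contain the whole new affinity mask. The function find_owning_instance()
and the array of per-instance masks are assumptions made for this
example, not existing RTEMS interfaces:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical representation: bit n set means core n is in the set. */
typedef uint32_t core_mask;

/*
 * Return the index of the instance whose owned cores contain every core
 * in the affinity mask, or -1 if the mask is empty or spans more than one
 * instance. Instances own non-overlapping cores, so at most one matches.
 */
static int find_owning_instance(
  core_mask        affinity,
  const core_mask *owned,
  size_t           count
)
{
  assert(owned != NULL || count == 0);

  if (affinity == 0) {
    return -1;
  }

  for (size_t i = 0; i < count; ++i) {
    if ((affinity & ~owned[i]) == 0) {
      return (int) i;
    }
  }

  return -1;
}
```

With instances owning cores 0-1 and 2-3, a mask of cores 2-3 maps to the
second instance, while a mask of cores 1-3 spans both and is rejected.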

Implicit scheduler changes break what I think is the
most useful case of clustered scheduling: give threads affinity for
cores in multiple schedulers, then move threads dynamically
to different scheduler instances to perform load balancing.

>> Consider this scenario:
>>
>> Scheduler A: cores 0-1
>> Scheduler B: cores 2-3
>>
>> Thread 1 is associated with Scheduler B and with affinity 2-3
>> can run on either processor scheduled by B.
>>
>> Thread 1 changes affinity to 1-3. Should this change the scheduler,
>> be an error, or just have an affinity for a core in the system
>> that is not scheduled by this scheduler instance?
> This is currently an error:
>
> http://git.rtems.org/rtems/tree/testsuites/smptests/smpscheduler02/init.c#n141

OK. If I am reading this correctly, the destination affinity must be
within a single scheduler instance?

If so, then that is not compatible with the Linux use of affinity and
task sets. But I agree that it is the only safe use of implicit scheduler
changes.

It is only a problem when the scheduler logic moved it to another
scheduler instance.

>> If you look at the current code for _Scheduler_Set_affinity(),
>> it looks like the current behavior is none of the above and
>> appears to just be broken. Scheduler A's set affinity operation
>> is invoked and Scheduler B is not informed that it no longer
>> has control of Thread 1.
>>
> It is informed in case _Scheduler_default_Set_affinity_body() is used.  What is 
> broken is the _Scheduler_priority_affinity_SMP_Set_affinity() function.
What is broken is the concept of implicitly changing schedulers via
affinity.

By not discussing the options at the API level, you boxed yourself into an
implementation. Since user use-cases were also not discussed before you
started coding, there was no feedback on what would be considered
desirable user visible behavior.

> Please have a look at the attached patch which I already sent to a similar thread.
>
Which patch?

-- 
Joel Sherrill, Ph.D.             Director of Research & Development
joel.sherrill at OARcorp.com        On-Line Applications Research
Ask me about RTEMS: a free RTOS  Huntsville AL 35805
Support Available                (256) 722-9985



