Scheduler-Related System Calls

Linux provides a family of system calls for the management of scheduler parameters. These system calls allow manipulation of process priority, scheduling policy, and processor affinity, as well as provide an explicit mechanism to yield the processor to other tasks.

Various booksand your friendly system man pagesprovide reference to these system calls (which are all implemented in the C library without much wrapperthey just invoke the system call). Table 4.3 lists the system calls and provides a brief description. How system calls are implemented in the kernel is discussed in Chapter 5, "System Calls."

Table 4.3. Scheduler-Related System Calls
System Call
Description
nice()
Sets a process's nice value
sched_setscheduler()
Sets a process's scheduling policy
sched_getscheduler()
Gets a process's scheduling policy
sched_setparam()
Sets a process's real-time priority
sched_getparam()
Gets a process's real-time priority
sched_get_priority_max()
Gets the maximum real-time priority
sched_get_priority_min()
Gets the minimum real-time priority
sched_rr_get_interval()
Gets a process's timeslice value
sched_setaffinity()
Sets a process's processor affinity
sched_getaffinity()
Gets a process's processor affinity
sched_yield()
Temporarily yields the processor

Scheduling Policy and Priority-Related System Calls

The sched_setscheduler() and sched_getscheduler() system calls set and get a given process's scheduling policy and real-time priority, respectively. Their implementation, like most system calls, involves a lot of argument checking, setup, and cleanup. The important work, however, is merely to read or write the policy and rt_priority values in the process's task_struct.

The sched_setparam() and sched_getparam() system calls set and get a process's real-time priority. These calls merely encode rt_priority in a special sched_param structure. The calls sched_get_priority_max() and sched_get_priority_min() return the maximum and minimum priorities, respectively, for a given scheduling policy. The maximum priority for the real-time policies is MAX_USER_RT_PRIO minus one; the minimum is one.

For normal tasks, the nice() function increments the given process's static priority by the given amount. Only root can provide a negative value, thereby lowering the nice value and increasing the priority. The nice() function calls the kernel's set_user_nice() function, which sets the static_prio and prio values in the task's task_struct as appropriate.

Processor Affinity System Calls

The Linux scheduler enforces hard processor affinity. That is, although it tries to provide soft or natural affinity by attempting to keep processes on the same processor, the scheduler also enables a user to say, "This task must remain on this subset of the available processors no matter what." This hard affinity is stored as a bitmask in the task's task_struct as cpus_allowed. The bitmask contains one bit per possible processor on the system. By default, all bits are set and, therefore, a process is potentially runnable on any processor. The user, however, via sched_setaffinity(), can provide a different bitmask of any combination of one or more bits. Likewise, the call sched_getaffinity() returns the current cpus_allowed bitmask.

The kernel enforces hard affinity in a very simple manner. First, when a process is initially created, it inherits its parent's affinity mask. Because the parent is running on an allowed processor, the child thus runs on an allowed processor. Second, when a processor's affinity is changed, the kernel uses the migration threads to push the task onto a legal processor. Finally, the load balancer pulls tasks to only an allowed processor. Therefore, a process only ever runs on a processor whose bit is set in the cpus_allowed field of its process descriptor.

Yielding Processor Time

Linux provides the sched_yield() system call as a mechanism for a process to explicitly yield the processor to other waiting processes. It works by removing the process from the active array (where it currently is, because it is running) and inserting it into the expired array. This has the effect of not only preempting the process and putting it at the end of its priority list, but also putting it on the expired listguaranteeing it will not run for a while. Because real-time tasks never expire, they are a special case. Therefore, they are merely moved to the end of their priority list (and not inserted into the expired array). In earlier versions of Linux, the semantics of the sched_yield() call were quite different; at best, the task was moved only to the end of its priority list. The yielding was often not for a very long time. Nowadays, applications and even kernel code should be certain they truly want to give up the processor before calling sched_yield().

Kernel code, as a convenience, can call yield(), which ensures that the task's state is TASK_RUNNING, and then call sched_yield(). User-space applications use the sched_yield()system call.