System Call Context

As discussed in Chapter 3, "Process Management," the kernel is in process context during the execution of a system call. The current pointer points to the current task, which is the process that issued the syscall.

In process context, the kernel is capable of sleeping (for example, if the system call blocks on a call or explicitly calls schedule()) and is fully preemptible. These two points are important. First, the capability to sleep means that system calls can make use of the majority of the kernel's functionality. As we will see in Chapter 6, "Interrupts and Interrupt Handlers," the capability to sleep greatly simplifies kernel programming^[7]. The fact that process context is preemptible implies that, like user-space, the current task may be preempted by another task. Because the new task may then execute the same system call, care must be exercised to ensure that system calls are reentrant. Of course, this is the same concern that symmetrical multiprocessing introduces. Protecting against reentrancy is covered in Chapter 8, "Kernel Synchronization Introduction," and Chapter 9, "Kernel Synchronization Methods."

^[7] Interrupt handlers cannot sleep, and thus are much more limited in what they can do than system calls running in process context.

When the system call returns, control continues in system_call(), which ultimately switches to user-space and continues the execution of the user process.

Final Steps in Binding a System Call

After the system call is written, it is trivial to register it as an official system call:

First, add an entry to the end of the system call table. This needs to be done for each architecture that supports the system call (which, for most calls, is all the architectures). The position of the syscall in the table, starting at zero, is its system call number. For example, the tenth entry in the list is assigned syscall number nine.
For each architecture supported, the syscall number needs to be defined in <asm/unistd.h>.
The syscall needs to be compiled into the kernel image (as opposed to compiled as a module). This can be as simple as putting the system call in a relevant file in kernel/, such as sys.c, which is home to miscellaneous system calls.

Let us look at these steps in more detail with a fictional system call, foo(). First, we want to add sys_foo() to the system call table. For most architectures, the table is located in entry.S and it looks like this:

ENTRY(sys_call_table)
        .long sys_restart_syscall    /* 0 */
        .long sys_exit
        .long sys_fork
        .long sys_read
        .long sys_write
        .long sys_open               /* 5 */

    ...

        .long sys_mq_unlink
        .long sys_mq_timedsend
        .long sys_mq_timedreceive       /* 280 */
        .long sys_mq_notify
        .long sys_mq_getsetattr

The new system call is then appended to the tail of this list:

        .long sys_foo

Although it is not explicitly specified, the system call is then given the next syscall number, in this case 283. For each architecture you wish to support, the system call must be added to the architecture's system call table. The system call need not receive the same syscall number under each architecture, because the system call number is part of the architecture's unique ABI. Usually, you would want to make the system call available to each architecture. Note the convention of placing the number in a comment every five entries; this makes it easy to find out which syscall is assigned which number.

Next, the system call number is added to <asm/unistd.h>, which currently looks somewhat like this:

/*
 * This file contains the system call numbers.
 */

#define __NR_restart_syscall  0
#define __NR_exit             1
#define __NR_fork             2
#define __NR_read             3
#define __NR_write            4
#define __NR_open             5

...
#define __NR_mq_unlink        278
#define __NR_mq_timedsend     279
#define __NR_mq_timedreceive  280
#define __NR_mq_notify        281
#define __NR_mq_getsetattr    282

The following is then added to the end of the list:

#define __NR_foo              283

Finally, the actual foo() system call is implemented. Because the system call must be compiled into the core kernel image in all configurations, it is put in kernel/sys.c. You should put it wherever the function is most relevant; for example, if the function is related to scheduling, you could put it in kernel/sched.c.

#include <asm/thread_info.h>

/*
 * sys_foo - everyone's favorite system call.
 *
 * Returns the size of the per-process kernel stack.
 */
asmlinkage long sys_foo(void)
{
        return THREAD_SIZE;
}

That is it! Seriously. Boot this kernel and user-space can invoke the foo() system call.

Accessing the System Call from User-Space

Generally, the C library provides support for system calls. User applications can pull in function prototypes from the standard headers and link with the C library to use your system call (or the library routine that in turn uses your syscall). If you just wrote the system call, however, it is doubtful that glibc already supports it!

Thankfully, Linux provides a set of macros for wrapping access to system calls. They set up the register contents and issue the trap instruction. These macros are named _syscalln(), where n is between zero and six. The number corresponds to the number of parameters passed into the syscall because the macro needs to know how many parameters to expect and, consequently, push into registers. For example, consider the system call open(), defined as

long open(const char *filename, int flags, int mode)

The syscall macro to use this system call without explicit library support would be

#define __NR_open 5
_syscall3(long, open, const char *, filename, int, flags, int, mode)

Then, the application can simply call open().

For each macro, there are 2 + 2 × n parameters; _syscall3, for example, takes eight. The first parameter corresponds to the return type of the syscall. The second is the name of the system call. Next follows the type and name for each parameter in order of the system call. The __NR_open define is in <asm/unistd.h>; it is the system call number. The _syscall3 macro expands into a C function with inline assembly; the assembly performs the steps discussed in the previous section to push the system call number and parameters into the correct registers and issue the software interrupt to trap into the kernel. Placing this macro in an application is all that is required to use the open() system call.

Let's write the macro to use our splendid new foo() system call and then write some test code to show off our efforts.

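#include <stdio.h>
#include <errno.h>              /* the _syscalln() wrappers set errno */
#include <linux/unistd.h>       /* provides the _syscalln() macros */

#define __NR_foo 283
_syscall0(long, foo)

int main(void)
{
        long stack_size;

        stack_size = foo();
        printf("The kernel stack size is %ld\n", stack_size);

        return 0;
}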

Why Not to Implement a System Call

With luck, the previous sections have shown that it is easy to implement a new system call, but that in no way should encourage you to do so. Indeed, after my sterling effort to describe how system calls work and how to add new ones, I now suggest caution and unparalleled restraint in adding new syscalls. Often, much more viable alternatives to providing a new system call are available. Let's look at the pros, the cons, and the alternatives.

The pros of implementing a new interface as a syscall are as follows:

System calls are simple to implement and easy to use.
System call performance on Linux is blindingly fast.

The cons:

You need a syscall number, which needs to be officially assigned to you during a developmental kernel series.
After the system call is in a stable series kernel, it is written in stone. The interface cannot change without breaking user-space applications.
Each architecture needs to separately register the system call and support it.
System calls are not easily used from scripts and cannot be accessed directly from the filesystem.
For simple exchanges of information, a system call is overkill.

The alternatives:

Implement a device node and read() and write() to it. Use ioctl() to manipulate specific settings or retrieve specific information.
Certain interfaces, such as semaphores, can be represented as file descriptors and manipulated as such.
Add the information as a file to the appropriate location in sysfs.

For many interfaces, system calls are the correct answer. Linux, however, has tried to avoid simply adding a system call to support each new abstraction that comes along. The result has been an incredibly clean system call layer with very few regrets or deprecations (interfaces no longer used or supported). The slow rate of addition of new system calls is a sign that Linux is a relatively stable and feature-complete operating system.