[ Team LiB ] |
5.9 Handling SIGCHLD SignalsThe purpose of the zombie state is to maintain information about the child for the parent to fetch at some later time. This information includes the process ID of the child, its termination status, and information on the resource utilization of the child (CPU time, memory, etc.). If a process terminates, and that process has children in the zombie state, the parent process ID of all the zombie children is set to 1 (the init process), which will inherit the children and clean them up (i.e., init will wait for them, which removes the zombie). Some Unix systems show the COMMAND column for a zombie process as <defunct>. Handling ZombiesObviously we do not want to leave zombies around. They take up space in the kernel and eventually we can run out of processes. Whenever we fork children, we must wait for them to prevent them from becoming zombies. To do this, we establish a signal handler to catch SIGCHLD, and within the handler, we call wait. (We will describe the wait and waitpid functions in Section 5.10.) We establish the signal handler by adding the function call Signal (SIGCHLD, sig_chld); in Figure 5.2, after the call to listen. (It must be done sometime before we fork the first child and needs to be done only once.) We then define the signal handler, the function sig_chld, which we show in Figure 5.7. Figure 5.7 Version of SIGCHLD signal handler that calls wait (improved in Figure 5.11).tcpcliserv/sigchldwait.c 1 #include "unp.h" 2 void 3 sig_chld(int signo) 4 { 5 pid_t pid; 6 int stat; 7 pid = wait(&stat); 8 printf("child %d terminated\", pid); 9 return; 10 }
If we compile this program—Figure 5.2, with the call to Signal, with our sig_chld handler—under Solaris 9 and use the signal function from the system library (not our version from Figure 5.6), we have the following:
The sequence of steps is as follows:
The purpose of this example is to show that when writing network programs that catch signals, we must be cognizant of interrupted system calls, and we must handle them. In this specific example, running under Solaris 9, the signal function provided in the standard C library does not cause an interrupted system call to be automatically restarted by the kernel. That is, the SA_RESTART flag that we set in Figure 5.6 is not set by the signal function in the system library. Some other systems automatically restart the interrupted system call. If we run the same example under 4.4BSD, using its library version of the signal function, the kernel restarts the interrupted system call and accept does not return an error. To handle this potential problem between different operating systems is one reason we define our own version of the signal function that we use throughout the text (Figure 5.6). As part of the coding conventions used in this text, we always code an explicit return in our signal handlers (Figure 5.7), even though falling off the end of the function does the same thing for a function returning void. When reading the code, the unnecessary return statement acts as a reminder that the return may interrupt a system call. Handling Interrupted System CallsWe used the term "slow system call" to describe accept, and we use this term for any system call that can block forever. That is, the system call need never return. Most networking functions fall into this category. For example, there is no guarantee that a server's call to accept will ever return, if there are no clients that will connect to the server. Similarly, our server's call to read in Figure 5.3 will never return if the client never sends a line for the server to echo. Other examples of slow system calls are reads and writes of pipes and terminal devices. A notable exception is disk I/O, which usually returns to the caller (assuming no catastrophic hardware failure). The basic rule that applies here is that when a process is blocked in a slow system call and the process catches a signal and the signal handler returns, the system call can return an error of EINTR. Some kernels automatically restart some interrupted system calls. For portability, when we write a program that catches signals (most concurrent servers catch SIGCHLD), we must be prepared for slow system calls to return EINTR. Portability problems are caused by the qualifiers "can" and "some," which were used earlier, and the fact that support for the POSIX SA_RESTART flag is optional. Even if an implementation supports the SA_RESTART flag, not all interrupted system calls may automatically be restarted. Most Berkeley-derived implementations, for example, never automatically restart select, and some of these implementations never restart accept or recvfrom. To handle an interrupted accept, we change the call to accept in Figure 5.2, the beginning of the for loop, to the following: for ( ; ; ) { clilen = sizeof (cliaddr); if ( (connfd = accept (listenfd, (SA *) &cliaddr, &clilen)) < 0) { if (errno == EINTR) continue; /* back to for () */ else err_sys ("accept error"); } Notice that we call accept and not our wrapper function Accept, since we must handle the failure of the function ourselves. What we are doing in this piece of code is restarting the interrupted system call. This is fine for accept, along with functions such as read, write, select, and open. But there is one function that we cannot restart: connect. If this function returns EINTR, we cannot call it again, as doing so will return an immediate error. When connect is interrupted by a caught signal and is not automatically restarted, we must call select to wait for the connection to complete, as we will describe in Section 16.3. |
[ Team LiB ] |