Page Cache

The page cache, as its name suggests, is a cache of pages. The pages originate from reads and writes of regular filesystem files, block device files, and memory-mapped files. In this manner, the page cache contains entire pages from recently accessed files. During a page I/O operation, such as read()^[1], the kernel checks whether the data resides in the page cache. If the data is in the page cache, the kernel can quickly return the requested page rather than read the data off the disk.

^[1] As you saw in Chapter 12, "The Virtual Filesystem," it is not the read() and write() system calls that perform the actual page I/O operation, but the filesystem-specific methods specified by file->f_op->read() and file->f_op->write().

The `address_space` Object

A physical page might comprise multiple noncontiguous physical blocks^[2]. Checking the page cache to see whether certain data has been cached is rendered more difficult because of the noncontiguous nature of the blocks that constitute each page. Therefore, it is not possible to index the data in the page cache using only a device name and block number, which would otherwise be the simplest solution.

^[2] For example, a physical page is 4KB in size on the x86 architecture, whereas a disk block on most filesystems can be as small as 512 bytes. Therefore, 8 blocks might fit in a single page. The blocks need not be contiguous because the files themselves might be laid out all over the disk.

Furthermore, the Linux page cache is quite general in what pages it can cache. Indeed, the original page cache introduced in System V Release 4 cached only filesystem data. Consequently, the SVR4 page cache used its equivalent of the file object (called struct vnode) to manage the page cache. The Linux page cache aims to cache any page-based object, which includes many forms of files and memory mappings.

To remain generic, the Linux page cache uses the address_space structure to identify pages in the page cache. This structure is defined in <linux/fs.h>:

struct address_space {
        struct inode            *host;              /* owning inode */
        struct radix_tree_root  page_tree;          /* radix tree of all pages */
        spinlock_t              tree_lock;          /* page_tree lock */
        unsigned int            i_mmap_writable;    /* VM_SHARED ma count */
        struct prio_tree_root   i_mmap;             /* list of all mappings */
        struct list_head        i_mmap_nonlinear;   /* VM_NONLINEAR ma list */
        spinlock_t              i_mmap_lock;        /* i_mmap lock */
        atomic_t                truncate_count;     /* truncate re count */
        unsigned long           nrpages;            /* total number of pages */
        pgoff_t                 writeback_index;    /* writeback start offset */
        struct address_space_operations   *a_ops;   /* operations table */
        unsigned long           flags;              /* gfp_mask and error flags */
        struct backing_dev_info *backing_dev_info;  /* read-ahead information */
        spinlock_t              private_lock;       /* private lock */
        struct list_head        private_list;       /* private list */
        struct address_space    *assoc_mapping;     /* associated buffers */
};

The i_mmap field is a priority search tree of all shared and private mappings in this address space. A priority search tree is a clever mix of heaps and radix trees^[3].

^[3] The kernel implementation is based on the radix priority search tree proposed by Edward M. McCreight in SIAM Journal of Computing, volume 14, number 2, pages 257276, May 1985.

There are a total of nrpages in the address space.

The address_space is associated with some kernel object. Normally, this is an inode. If so, the host field points to the associated inode. The host field is NULL if the associated object is not an inode; for example, if the address_space is associated with the swapper.

The a_ops field points to the address space operations table, in the same manner as the VFS objects and their operations tables. The operations table is represented by struct address_space_operations and is also defined in <linux/fs.h>:

struct address_space_operations {
        int (*writepage)(struct page *, struct writeback_control *);
        int (*readpage) (struct file *, struct page *);
        int (*sync_page) (struct page *);
        int (*writepages) (struct address_space *, struct writeback_control *);
        int (*set_page_dirty) (struct page *);
        int (*readpages) (struct file *, struct address_space *,
                          struct list_head *, unsigned);
        int (*prepare_write) (struct file *, struct page *, unsigned, unsigned);
        int (*commit_write) (struct file *, struct page *, unsigned, unsigned);
        sector_t (*bmap)(struct address_space *, sector_t);
        int (*invalidatepage) (struct page *, unsigned long);
        int (*releasepage) (struct page *, int);
        int (*direct_IO) (int, struct kiocb *, const struct iovec *,
                          loff_t, unsigned long);
};

The readpage() and writepage() methods are most important. Let's look at the steps involved in a page read operation.

First, the readpage() method is passed an address_space plus offset pair. These values are used to search the page cache for the desired data:

page = find_get_page(mapping, index);

Here, mapping is the given address space and index is the desired position in the file.

If the page does not exist in the cache, a new page is allocated and added to the page cache:

struct page *cached_page;
int error;

cached_page = page_cache_alloc_cold(mapping);
if (!cached_page)
        /* error allocating memory */
error = add_to_page_cache_lru(cached_page, mapping, index, GFP_KERNEL);
if (error)
        /* error adding page to page cache */

Finally, the requested data can be read from disk, added to the page cache, and returned to the user:

error = mapping->a_ops->readpage(file, page);

Write operations are a bit different. For file mappings, whenever a page is modified, the VM simply calls

SetPageDirty(page);

The kernel later writes the page out via the writepage() method. Write operations on specific files are more complicated. Basically, the generic write path in mm/filemap.c performs the following steps:

page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
status = a_ops->prepare_write(file, page, offset, offset+bytes);
page_fault = filemap_copy_from_user(page, offset, buf, bytes);
status = a_ops->commit_write(file, page, offset, offset+bytes);

First, the page cache is searched for the desired page. If it is not in the cache, an entry is allocated and added. Next, the prepare_write()method is called to set up the write request. The data is then copied from user-space into a kernel buffer. Finally, the data is written to disk via the commit_write() function.

Because the previous steps are performed during all page I/O operations, all page I/O is guaranteed to go through the page cache. Consequently, the kernel attempts to satisfy all read requests from the page cache. If this fails, the page is read in from disk and added to the page cache. For write operations, the page cache acts as a staging ground for the writes. Therefore, all written pages are also added to the page cache.

Page Cache

The address_space Object

The `address_space` Object