9.1. Introduction

Understanding filesystem fundamentals is key to understanding how Linux works. Everything is a file—data files, partitions, pipes, sockets, and hardware devices. Directories are simply files that list other files.

The Filesystem Hierarchy Standard (FHS) was developed as a voluntary standard. Most Linux distributions follow it. These are the required elements of the Linux root filesystem:

/: Root directory, even though it is always represented at the top
/bin: Essential system commands
/boot: Static boot loader files
/dev: Device files
/etc: Host-specific system configuration files
/lib: Shared libraries needed to run the local system
/mnt: Temporary mount points
/opt: Add-on software packages (not used much in Linux)
/proc: Live kernel snapshot and configuration
/sbin: System administration commands
/tmp: Temporary files—a well-behaved system flushes them between startups
/usr: Shareable, read-only data and binaries
/var: Variably sized files, such as mail spools and logs

These are considered optional because they can be located anywhere on a network, whereas the required directories must be present to run the machine:

/home: User's personal files
/root: Superuser's personal files
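To see how closely a given machine follows the two lists above, a short loop is enough. This is only a sketch: check_fhs_dirs is a made-up helper name, and a missing directory (say, /boot inside a container) does not by itself mean the system is broken.

```shell
#!/bin/sh
# Report which of the FHS directories listed above exist on this machine.
# check_fhs_dirs is a hypothetical helper name, not a standard command.
check_fhs_dirs() {
    for dir in / /bin /boot /dev /etc /lib /mnt /opt /proc /sbin /tmp /usr /var /home /root; do
        if [ -d "$dir" ]; then
            printf 'present  %s\n' "$dir"
        else
            printf 'missing  %s\n' "$dir"
        fi
    done
}

check_fhs_dirs
```

The -d test follows symbolic links, so a /bin that is really a link into /usr still counts as present.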

The FHS goes into great detail on each directory, for those who are interested. Here are some things for the Linux user to keep in mind:

/tmp and /var can go in their own individual partitions, as a security measure. If something goes awry and causes them to fill up uncontrollably, they will be isolated from the rest of the system.
/home can go in its own partition, or on its own dedicated server, for easier backups and to protect it from system upgrades. You can then completely wipe out and re-install a Linux system, or even install a different distribution, while leaving /home untouched.
Because all configuration files are in /etc and /home, backups are simplified. It is possible to get away with backing up only /etc and /home and to rely on your installation disks to take care of the rest. However, this means that program updates will not be preserved—be sure to consider this when plotting a disaster-recovery plan.
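The "back up only /etc and /home" idea can be sketched with tar. The helper name backup_dirs and every path below are assumptions made for illustration; the demo runs against a scratch tree so it is safe to try without root.

```shell
#!/bin/sh
# backup_dirs ARCHIVE [TAR-ARGS...] DIR...
# Pack the named directories into one gzip-compressed tar archive.
# backup_dirs is a hypothetical helper name.
backup_dirs() {
    archive=$1
    shift
    tar -czf "$archive" "$@"
}

# Build a scratch tree that stands in for a real /etc and /home.
scratch=$(mktemp -d)
mkdir -p "$scratch/etc" "$scratch/home/carla"
echo 'hostname=example' > "$scratch/etc/demo.conf"
echo 'notes' > "$scratch/home/carla/notes.txt"

# -C changes into the scratch tree first, so the archive stores relative paths.
backup_dirs "$scratch/backup.tar.gz" -C "$scratch" etc home
tar -tzf "$scratch/backup.tar.gz"    # list what was saved

rm -rf "$scratch"
```

On a real system the call would look something like backup_dirs /mnt/backup/etc-home.tar.gz /etc /home, run as root, where /mnt/backup is a hypothetical destination on a different disk.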

9.1.1 Linux File Types

Remember that "everything is a file." There are seven file types in Linux; everything that goes in the file tree must be one of the types in Table 9-1.

Table 9-1. File types

Type indicator    Type of file
-                 Regular file
d                 Directory
l                 Link
c                 Character device
s                 Socket
p                 Named pipe
b                 Block device

The type indicators show up at the very front of the file listings:

# ls -l /dev/initctl

prw-------  1 root   root     0 Jan 12 00:00 /dev/initctl

# ls -l /tmp/.ICE-unix/551

srwx------  1 carla  carla    0 Jan 12 09:09 /tmp/.ICE-unix/551

You can specify which file types to look at with the find command:

# find / -type p

# find / -type s

Ctrl-C interrupts find, if it goes on for too long.
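Searching all of / can take a while, so here is a small sketch that builds a few file types in a scratch directory and then inspects them. The helper name type_indicator and the file names are made up for the example.

```shell
#!/bin/sh
# type_indicator FILE: print the first character of the ls -l mode string.
# type_indicator is a hypothetical helper name.
type_indicator() {
    ls -ld "$1" | cut -c1
}

scratch=$(mktemp -d)
touch "$scratch/regular"
mkdir "$scratch/directory"
ln -s regular "$scratch/link"
mkfifo "$scratch/pipe"

# One line per file: the type indicator, then the name.
for f in regular directory link pipe; do
    printf '%s  %s\n' "$(type_indicator "$scratch/$f")" "$f"
done

# find can then select by type; here, only the named pipe.
find "$scratch" -type p

rm -rf "$scratch"
```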

9.1.2 File Attributes

Take a look at the attributes of a file, such as this shell script, sortusers:

$ ls -l sortusers

-rwxr-xr-x 1 meredydd  programmers   3783 Jan  7 13:29 sortusers

-rwxr-xr-x 1 meredydd programmers tells us a lot of things:

The - means that this is a regular file. This attribute is not changeable by the user. This is the bit that tells Linux what the file type is, so it does not need file extensions. File extensions are for humans and applications.
rwx are the file owner's permissions.
The first r-x is the group owner's permissions.
The second r-x applies to anyone with access to the file, or "the world."
1 is the number of hard links to the file. All files have at least one, the link from the parent directory.
meredydd programmers names the file owner and the group owner of the file. "Owner" and "user" are the same; remember this when using chmod's symbolic notation u = user = owner.

Permissions and ownership are attributes that are configurable, with the chmod, chgrp, and chown commands; chmod changes the permissions, chown and chgrp change ownership.

All those rwx things look weird, but they are actually mnemonics: rwx = read, write, execute. These permissions are applied in order to user, group, and other.

So, in the sortusers example, meredydd can read, write, and execute the file. Group members and others may only read and execute. Even though only meredydd may edit the file itself, nothing is stopping group and other users from copying it.

Since this is a shell script, both read and execute permissions must be set, because the interpreter needs to read the file. Binary files are read by the kernel directly, without an interpreter, so they don't need read permissions.
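The permission string is easy to experiment with on a throwaway file. In this sketch, perms and sortusers-demo are hypothetical names; the chmod invocations themselves are standard.

```shell
#!/bin/sh
# perms FILE: print the ten-character type-and-permission string from ls -l.
# perms and the file name sortusers-demo are hypothetical.
perms() {
    ls -ld "$1" | cut -c1-10
}

scratch=$(mktemp -d)
script="$scratch/sortusers-demo"
printf '#!/bin/sh\necho sorted\n' > "$script"

chmod u=rwx,go=rx "$script"    # symbolic form: user, then group and other
perms "$script"                # -rwxr-xr-x
chmod 755 "$script"            # the same permissions in octal
perms "$script"                # -rwxr-xr-x

chmod go-rwx "$script"         # take everything away from group and other
perms "$script"                # -rwx------

rm -rf "$scratch"
```

The symbolic form names who is affected (u, g, o), while the octal form sets all three triplets at once: 7 is rwx and 5 is r-x.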

9.1.3 File Type Definitions

Let's take a closer look at what the file types in Linux really are:

Regular files: Plain ole text and data files, or binary executables.
Directories: Lists of files.
Character and block devices: Files that could be considered as meeting points between the kernel and device drivers, for example /dev/hda (IDE hard drive), /dev/ttyS1 (serial modem), and so forth. These allow the kernel to correctly route requests for the various hardware devices on your system.
Local domain sockets: Communications between local processes. They are visible as files but cannot be read from or written to, except by the processes directly involved.
Named pipes: Also for local interprocess communications. It is highly unlikely that a Linux user will ever need to do anything with either sockets or pipes; they are strictly system functions. Programmers, however, need to know everything about them.
Links: Links are of great interest to Linux users. There are two types: hard links and soft links. Links are pointers to files. A hard link is really just another name for a file, as it points to a specific inode. All the hard links that point to a file retain all of the file's attributes—permissions, ownership, and so on. rm will happily delete a hard link, but the file will remain on disk until all hard links are gone and all processes have released it. Hard links cannot cross filesystems, so you can't make hard links over a network share. Soft links point to a filename; they can point to any file, anywhere. You can even create "dead" soft links by deleting the files they point to, or changing the names of the files.
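The difference between hard and soft links is easiest to see by making one of each and then deleting the original name. This is a sketch in a scratch directory; inode_of, link_count, and the file names are all hypothetical.

```shell
#!/bin/sh
# Compare a hard link and a soft link to the same file.
# Helper and file names here are hypothetical scratch names.
inode_of()   { ls -id "$1" | awk '{print $1}'; }
link_count() { ls -ld "$1" | awk '{print $2}'; }

scratch=$(mktemp -d)

echo 'hello' > "$scratch/original"
ln "$scratch/original" "$scratch/hardlink"    # second name for the same inode
ln -s original "$scratch/softlink"            # pointer to the name "original"

# Both names report the same inode number, and the link count is now 2.
echo "inodes: $(inode_of "$scratch/original") $(inode_of "$scratch/hardlink")"
echo "hard links: $(link_count "$scratch/original")"

rm "$scratch/original"
cat "$scratch/hardlink"                       # data is still reachable: prints hello
[ -e "$scratch/softlink" ] || echo 'softlink is now dead'

rm -rf "$scratch"
```

The soft link stores only the name it points to, so it breaks as soon as that name disappears, while the hard link keeps the data alive.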

9.1.4 Filesystem Internals

Here are some more useful definitions relating to filesystems:

Logical block: The smallest unit of storage, measured in bytes, that can be allocated by the filesystem. A single file may consume several blocks.
Logical volume: A disk partition, a disk, or a volume that spans several disks or partitions—any unit of storage that is perceived as a single, discrete allocation of space.
Internal fragmentation: Empty space that occurs when a file, or a portion of a file, does not fill a block completely. For example, if the block is 4K, and the file is 1K, 3K are wasted space.
External fragmentation: Occurs when the blocks that belong to a single file are not stored contiguously, but are scattered all over the disk.
Extent: A number of contiguous blocks that belong to a single file. The filesystem sees an extent as a single unit, which is more efficient for tracking large files.
B+trees: First there were btrees (balanced trees), which were improved and became b+trees. These are nifty concepts borrowed from indexed databases, which make searching and traversing a given data structure much faster. Filesystems that use this concept are able to quickly scan the directory tree, first selecting the appropriate directory, then scanning the contents. The Ext2 filesystem does a sequential scan, which is slower.
Metadata: Everything that describes or controls the internal data structures is lumped under metadata. This includes everything except the data itself: date and time stamps, owner, group, permissions, size, links, change time, access time, the location on disk, extended attributes, and so on.
Inode: Much of a file's metadata is contained in an inode, or index node. Every file has an inode number, which is unique within its filesystem; hard links to the same file share that number.
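The internal fragmentation example above (a 1K file in a 4K block) generalizes to a little arithmetic: round the file size up to whole blocks, then subtract the file size. A minimal sketch, with wasted_bytes as a hypothetical helper name:

```shell
#!/bin/sh
# wasted_bytes FILE_SIZE BLOCK_SIZE
# Bytes lost to internal fragmentation when a file of FILE_SIZE bytes is
# stored in whole blocks of BLOCK_SIZE bytes. Hypothetical helper name.
wasted_bytes() {
    size=$1
    block=$2
    blocks=$(( (size + block - 1) / block ))    # round up to whole blocks
    echo $(( blocks * block - size ))
}

wasted_bytes 1024 4096    # 1K file, 4K blocks: prints 3072 (3K wasted)
wasted_bytes 8192 4096    # exact multiple of the block size: prints 0
wasted_bytes 4097 4096    # one byte over a block: prints 4095
```

This ignores tricks such as tail packing, which exist precisely to reclaim that leftover space.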

9.1.5 Journaling Filesystems

Our faithful old Ext2 filesystem is showing its age. It can't keep up with users who need terabytes to play with and who need fast recovery from service interruptions. For the majority of users, who still measure their storage needs in gigabytes or less, fast recovery and data integrity are the most important reasons to use a journaling filesystem.

Linux filesystems are asynchronous. They do not instantly write metadata to disk, but rather use a write cache in memory and then write to disk periodically, during slack CPU moments. This speeds up overall system performance, but if there is a power failure or system crash, there can be metadata loss. In this event, when the filesystem driver kicks in at restart and fsck (filesystem consistency check) runs, it finds inconsistencies. Because Ext2 stores multiple copies of metadata, it is usually able to return the system to health.

The downside to this is recovery time. fsck checks each and every bit of metadata. This can take from a few minutes to 30 minutes or more on a large filesystem. Journaling filesystems do not need to perform this minute, painstaking inspection, because they keep a journal of changes. They check only files that have changed, rather than the entire filesystem.

Linux users have a number of great choices for journaling filesystems, including Ext3, ReiserFS, XFS, and JFS. Ext3 is a journaling system added to Ext2. ReiserFS, XFS, and JFS are all capable of handling filesystems that measure in exabytes on 64-bit platforms. ia32 users are limited to mere terabytes, I'm afraid.

Which one should you use? There's no definitive "best" one; they're all great. Here's a rundown on the high points:

Ext3: This one is easy and comfortable. That's what it's designed to be. It fits right on top of Ext2, so you don't need to rebuild the system from scratch. All the other filesystems discussed here must be selected at system installation, or when you format a partition. You can even have "buyer's remorse"—you can remove Ext3 just as easily. Because it's an extension of Ext2, it uses the same file utilities package, e2fsprogs. One major difference between Ext3 and the others is that it uses a fixed number of inodes, while the others allocate them dynamically. Another difference is that Ext3 can do data journaling, not just metadata journaling. This comes at a cost, though, of slower performance and more disk space consumed. Ext3 runs on any Linux-supported architecture.
ReiserFS: ReiserFS is especially suited for systems with lots of small files, such as a mail server using the maildir format, or a news server. It's very efficient at file storage; it stuffs leftover file bits into btree leaf nodes, instead of wasting block space. This is called "tail packing." It scales up nicely, and it handles large files just fine. ReiserFS runs on any Linux-supported architecture.
JFS: This is IBM's entry in the Way Big Linux Filesystems contest, ported from AIX and OS/2 Warp. It supports multiple processors, access control lists (ACLs), and—get this—native resizing. That's right, simply remount a JFS filesystem with the new size you desire, and it's done. Note that you may only increase the volume size, not decrease it.
XFS: This is SGI's brainchild, ported from IRIX. XFS thinks big—it claims it can handle filesystems of up to nine exabytes. Its strength is handling very large files, such as giant database files. There is one excellent feature unique to XFS, called delayed allocation. It procrastinates. It puts off actually writing to disk, delaying the decision on which blocks to write to, so that it can use the largest possible number of contiguous blocks. When there are a lot of short-term temp files in use, XFS might never allocate blocks to these at all, in effect ignoring them until they go away. XFS has its own native support for quotas, ACLs, and backups and restores.
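The claim that Ext3 "fits right on top of Ext2" can be tried out safely on an image file instead of a real partition. This sketch assumes the e2fsprogs tools (mke2fs, tune2fs, dumpe2fs) are installed; has_journal is a hypothetical helper, and the 16 MB image size is arbitrary.

```shell
#!/bin/sh
# Add an Ext3 journal to an existing Ext2 filesystem, using a scratch image
# file rather than a real partition so nothing on the system is touched.
# Assumes the e2fsprogs tools (mke2fs, tune2fs, dumpe2fs) are installed.
PATH=$PATH:/sbin:/usr/sbin

# has_journal IMAGE: succeed if the filesystem lists the has_journal feature.
has_journal() {
    dumpe2fs -h "$1" 2>/dev/null | grep -q 'has_journal'
}

if command -v mke2fs >/dev/null && command -v tune2fs >/dev/null \
   && command -v dumpe2fs >/dev/null; then
    img=$(mktemp)
    dd if=/dev/zero of="$img" bs=1024 count=16384 2>/dev/null   # 16 MB image
    mke2fs -q -F -t ext2 "$img"           # plain Ext2, no journal

    has_journal "$img" || echo 'before: no journal (Ext2)'
    tune2fs -j "$img" >/dev/null          # -j adds the journal
    has_journal "$img" && echo 'after: journal present (Ext3)'

    rm -f "$img"
else
    echo 'e2fsprogs not found; skipping demo'
fi
```

On a real, unmounted partition the equivalent step would be tune2fs -j followed by the device name, after which the partition is mounted as ext3. The "buyer's remorse" direction is tune2fs -O ^has_journal on the unmounted filesystem.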

On a 32-bit system, there's only so much addressing space available, so the theoretical upper filesystem size limit is 16 terabytes (as of the 2.5 kernel). Calculating the maximum possible filesystem size depends on hardware, operating system, and block sizes, so I shall leave that as an exercise to those who really need to figure out that sort of thing.
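For a rough feel for where a figure like 16 terabytes can come from, multiply the number of values a 32-bit index can hold by the size of the unit each value names. That this is the calculation behind the limit quoted above is an assumption, not something stated here, and max_fs_tib is a hypothetical helper name.

```shell
#!/bin/sh
# max_fs_tib INDEX_BITS UNIT_SIZE
# Largest size, in TiB, addressable by an index of INDEX_BITS bits when each
# index value names one unit of UNIT_SIZE bytes. Hypothetical helper name;
# needs a shell with 64-bit arithmetic.
max_fs_tib() {
    bits=$1
    unit=$2
    echo $(( ((1 << bits) * unit) >> 40 ))    # bytes, shifted down to TiB
}

max_fs_tib 32 4096    # 32-bit index, 4K units: prints 16
max_fs_tib 32 1024    # 32-bit index, 1K units: prints 4
```

The second call shows why block size matters: a smaller unit shrinks the ceiling in proportion.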

Another way to while away the hours is to compare performance benchmarks, or run your own. About all they agree on is that Ext3 really isn't suited for high-performance, high-demand applications. It's fine for workstations and light-to-medium-duty servers, but the others are better choices for high-demand servers.

9.1.6 When Not to Use a Journaling Filesystem

Stick with plain ole Ext2 when you have a /boot partition and are running LILO. LILO cannot read any filesystem but Ext2 or Ext3. The /boot partition is so small, and so easily backed up and restored, that there's no advantage to be gained from journaling in any case. You can put a journaling filesystem on your other partitions; in fact, you can mix and match all you like, as long as your kernel supports them.

On small partitions or small disks, such as 100-MB Zip disks, the journal itself consumes a significant amount of disk space. The ReiserFS journal can take up to 32 MB. Ext3, JFS, and XFS use about 4 MB, but if data journaling is enabled in Ext3, it will eat up a lot more space.

9.1.7 See Also

JFS (http://www-124.ibm.com/jfs/)
XFS (http://oss.sgi.com/projects/xfs/)
ReiserFS (http://www.namesys.com)
Ext2/3 (http://e2fsprogs.sourceforge.net/ext2.html)
Filesystem Hierarchy Standard (http://www.pathname.com/fhs/)
