
Word Size and Data Types

A word is the amount of data that a machine can process at one time. This fits into the document analogy that includes characters (usually eight bits) and pages (many words, often 4KB or 8KB worth) as other measurements of data. A word is an integer number of bytes, for example one, two, four, or eight. When someone talks about the "n bits" of a machine, they are generally referring to the machine's word size. For example, when people say the Pentium is a 32-bit chip, they are referring to its word size, which is 32 bits, or four bytes.

The size of a processor's general-purpose registers (GPRs) is equal to its word size. The widths of the components in a given architecture, the memory bus for example, are usually at least as wide as the word size. Typically, at least in the architectures that Linux supports, the memory address space is equal to the word size[2]. Consequently, the size of a pointer is equal to the word size. Additionally, the size of the C type long is equal to the word size, whereas the size of the int type is sometimes less than the word size. For example, the Alpha has a 64-bit word size. Consequently, registers, pointers, and the long type are 64 bits in length. The int type, however, is 32 bits long. The Alpha can access and manipulate 64 bits, one word, at a time.
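If you want to see this relationship for yourself, the following sketch asserts it at compile time. The function name is made up for illustration; BUILD_BUG_ON() is the kernel's compile-time assertion macro, which breaks the build if its condition is true:

#include <linux/kernel.h>	/* BUILD_BUG_ON() */

/*
 * A sketch only: on the architectures Linux supports, a long and a
 * pointer are each one word wide, so this compile-time check holds.
 */
static inline void word_size_sanity_check(void)
{
	BUILD_BUG_ON(sizeof(long) != sizeof(void *));	/* both are word-sized */
}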

Words, Doublewords, and Confusion

Some operating systems and processors do not call their standard data size a word. Instead, a word is some fixed size chosen for historical or arbitrary naming reasons. For example, some systems partition data sizes into bytes (8 bits), words (16 bits), double words (32 bits), and quad words (64 bits), despite the fact that the system in question may have a 32-bit word size. In this book, and Linux in general, a word is the standard data size of the processor, as discussed previously.


Each supported architecture under Linux defines BITS_PER_LONG in <asm/types.h> to the length, in bits, of the C long type, which is the system word size. A full listing of all supported architectures and their word sizes is in Table 19.1.
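Consequently, code can test BITS_PER_LONG to adapt to the word size at compile time. A minimal sketch, with a hypothetical constant name:

#include <asm/types.h>	/* BITS_PER_LONG */

/* EXAMPLE_ALIGN_MASK is hypothetical: a word-alignment mask chosen by word size */
#if BITS_PER_LONG == 64
# define EXAMPLE_ALIGN_MASK	7UL	/* eight-byte words */
#else
# define EXAMPLE_ALIGN_MASK	3UL	/* four-byte words */
#endif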

The C standard explicitly leaves the size of the standard types up to implementations[3], although it does dictate a minimum size. The uncertainty in the standard C types across architectures is both a pro and a con. On the plus side, the standard types can take advantage of the word size of various architectures, and types need not explicitly specify a size. The size of the C long type is guaranteed to be the machine's word size. On the downside, however, code cannot assume that the standard C types have any specific size. Furthermore, there is no guarantee that an int is the same size as a long[4].

[3] With the exception of char, which is always 8 bits.

[4] On the 64-bit architectures supported in Linux, in fact, an int and a long are not the same size; an int is 32 bits and a long is 64 bits. The 32-bit architectures we are all familiar with have both types equal to 32 bits.

The situation grows even more confusing because there need not be any relation between the types in user-space and kernel-space. The sparc64 architecture provides a 32-bit user-space; there, pointers and both the int and long types are 32 bits. In kernel-space, sparc64 has a 32-bit int type and 64-bit pointers and long types. This is not the norm, however.

Some rules to keep in mind:

  • A char is always eight bits.

  • Although there is no rule that the int type be 32 bits, it is in Linux on all currently supported architectures.

  • The same goes for the short type, which is 16 bits on all current architectures, although no rule explicitly decrees that.

  • Never assume the size of a pointer or a long, which can be either 32 or 64 bits on the currently supported machines in Linux.

  • Because the size of a long varies on different architectures, never assume that sizeof(int) is equal to sizeof(long).

  • Likewise, do not assume that a pointer and an int are the same size; see the sketch after this list.
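Here is the pointer pitfall from the last rule, as a sketch. The names are made up, and on a 64-bit machine the cast to int silently discards the upper half of the address:

static void example(void *p)
{
	int bad;
	unsigned long good;

	bad = (int) p;			/* WRONG: truncated on 64-bit machines */
	good = (unsigned long) p;	/* fine: a long is always word-sized */
}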

Opaque Types

Opaque data types do not reveal their internal format or structure. They are about as "black box" as you can get in C. There is not a lot of language support for them. Instead, developers declare a typedef, call it an opaque type, and hope no one typecasts it back to a standard C type. All use is generally through a special set of interfaces that the developer creates. An example is the pid_t type, which stores a process identification number. The actual size of this type is not revealed, although anyone can cheat, take a peek, and see that it is an int. If no code makes explicit use of this type's size, it can be changed without too much hassle. Indeed, this was once the case: In older Unix systems, pid_t was declared as a short.

Another example of an opaque type is atomic_t. As discussed in Chapter 9, "Kernel Synchronization Methods," this type holds an integer value that can be manipulated atomically. Although this type is an int, using the opaque type helps ensure that the data is used only in the special atomic operation functions. The opaque type also helps hide the size of the type, which was not always the full 32 bits because of architectural limitations on 32-bit SPARC.
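In practice, that means all accesses go through the atomic interfaces. A brief sketch, with a hypothetical counter (see Chapter 9 for the full interface):

#include <asm/atomic.h>

static atomic_t refcount = ATOMIC_INIT(0);	/* hypothetical reference count */

static void example(void)
{
	atomic_inc(&refcount);			/* atomically add one */
	if (atomic_read(&refcount) > 1)		/* atomically read the value */
		atomic_dec(&refcount);		/* atomically subtract one */
}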

Other examples of opaque types in the kernel include dev_t, gid_t, and uid_t.
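Creating your own opaque type follows the same pattern: a typedef plus a small interface, so callers never touch the representation directly. The following is purely a hypothetical sketch, not a real kernel type:

/* cookie_t is hypothetical; users must not rely on its internal layout */
typedef struct {
	unsigned long id;	/* internal detail, free to change later */
} cookie_t;

static inline cookie_t cookie_make(unsigned long id)
{
	cookie_t c = { .id = id };	/* construct through the interface only */
	return c;
}

static inline int cookie_equal(cookie_t a, cookie_t b)
{
	return a.id == b.id;		/* compare through the interface only */
}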

Keep the following rules in mind when dealing with opaque types:

  • Do not assume the size of the type.

  • Do not convert the type back to a standard C type.

  • Write your code so that the actual storage and format of the type can change.

Special Types

Some data in the kernel, despite not being represented by an opaque type, requires a specific data type. Two examples are jiffy counts and the flags parameter used in interrupt control, both of which should always be stored in an unsigned long.

When storing and manipulating such data, always pay careful attention to the data type that is supposed to hold it and use that type. It is a common mistake to store one of these values in another type, such as unsigned int. Although this does not cause a problem on 32-bit architectures, where the two types are the same size, 64-bit machines will have trouble because the upper 32 bits are silently lost.
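For example, both the interrupt flags and a jiffies-based timeout belong in an unsigned long. A sketch follows; the function is hypothetical, and the exact header locations vary between kernel versions:

#include <linux/kernel.h>	/* printk() */
#include <linux/jiffies.h>	/* jiffies, HZ, time_after() */
#include <asm/system.h>		/* local_irq_save() in this era of the kernel */

static void example(void)
{
	unsigned long flags;				/* never unsigned int */
	unsigned long timeout = jiffies + 5 * HZ;	/* five seconds from now */

	local_irq_save(flags);		/* disable interrupts, saving state in flags */
	/* ... critical region ... */
	local_irq_restore(flags);	/* restore the previous interrupt state */

	if (time_after(jiffies, timeout))
		printk("example: timed out\n");	/* hypothetical message */
}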

Explicitly Sized Types

Often, as a programmer, you need explicitly sized data in your code. This is usually to match an external requirement, such as with hardware, networking, or binary files. For example, a sound card might have a 32-bit register, a networking packet might have a 16-bit field, or an executable file might have an 8-bit cookie. In these cases, the data type that represents the data needs to be exactly the right size.

The kernel defines these explicitly sized data types in <asm/types.h>, which is included by <linux/types.h>. Table 19.2 is a complete listing.

Table 19.2. Explicitly-Sized Data Types

Type    Description
s8      Signed byte
u8      Unsigned byte
s16     Signed 16-bit integer
u16     Unsigned 16-bit integer
s32     Signed 32-bit integer
u32     Unsigned 32-bit integer
s64     Signed 64-bit integer
u64     Unsigned 64-bit integer


The signed variants are rarely used.

These explicit types are merely typedefs to standard C types. On a 64-bit machine, they may look like this:

typedef signed char s8;
typedef unsigned char u8;
typedef signed short s16;
typedef unsigned short u16;
typedef signed int s32;
typedef unsigned int u32;
typedef signed long s64;
typedef unsigned long u64;

On a 32-bit machine, however, they are probably defined as follows:

typedef signed char s8;
typedef unsigned char u8;
typedef signed short s16;
typedef unsigned short u16;
typedef signed int s32;
typedef unsigned int u32;
typedef signed long long s64;
typedef unsigned long long u64;
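These types see the most use where an external layout dictates exact field widths, on the wire or in hardware. A hypothetical sketch of such a structure:

#include <linux/types.h>

/*
 * A hypothetical on-wire packet header: each field must be exactly as
 * wide as the protocol says, so the explicitly sized types are used.
 */
struct example_pkt_hdr {
	u8	version;	/* 8-bit field */
	u8	type;		/* 8-bit field */
	u16	length;		/* 16-bit field */
	u32	sequence;	/* 32-bit field */
} __attribute__((packed));	/* no padding between the fields */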

These types can be used only inside the kernel, in code that is never exposed to user-space; they must not appear, say, inside a user-visible structure in an exported header file. This restriction exists for reasons of namespace. The kernel also defines user-visible variants of these types, which are simply the same types prefixed by two underscores. For example, the unsigned 32-bit integer type that is safe to export to user-space is __u32. This type is the same as u32; the only difference is the name. You can use either name inside the kernel, but if the type is user-visible you must use the underscore-prefixed version to prevent polluting user-space's namespace.
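For example, a structure defined in a header that user-space also includes, say the argument block for a hypothetical ioctl, must use the underscored names:

#include <linux/types.h>

/* a hypothetical ioctl argument block, visible to user-space */
struct example_ioctl_args {
	__u32	flags;		/* safe to export: the name cannot collide */
	__u64	offset;
	__u64	length;
};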

Signedness of Chars

The C standard says that the char data type can be either signed or unsigned. It is the responsibility of the compiler, the processor, or both to decide what the suitable default for the char type is.

On most architectures, char is signed by default and thus has a range from -128 to 127. On a few other architectures, such as ARM, char is unsigned by default and has a range from 0 to 255.

For example, on systems where a char is by default unsigned, this code ends up storing 255 instead of -1 in i:

char i = -1;

On other machines, where char is by default signed, this code correctly stores -1 in i. If the programmer's intention is to store -1, the previous code should be

signed char i = -1;

And if the programmer really intends to store 255, then the code should read

unsigned char i = 255;

If you use char in your code, assume it can be either a signed char or an unsigned char. If you need it to be explicitly one or the other, declare it as such.
