Ordering and Barriers

When dealing with synchronization between multiple processors or with hardware devices, it is sometimes a requirement that memory-reads (loads) and memory-writes (stores) issue in the order specified in your program code. When talking with hardware, you often need to ensure that a given read occurs before another read or write. Additionally, on symmetrical multiprocessing systems, it may be important for writes to appear in the order that your code issues them (usually to ensure subsequent reads see the data in the same order). Complicating these issues is the fact that both the compiler and the processor can reorder reads and writes^[6] for performance reasons. Thankfully, all processors that do reorder reads or writes provide machine instructions to enforce ordering requirements. It is also possible to instruct the compiler not to reorder instructions around a given point. These instructions are called barriers.

^[6] Intel x86 processors do not ever reorder writes. That is, they do not do out-of-order stores. But other processors do.

Essentially, on some processors the code

a = 1;
b = 2;

may allow the processor to store the new value in b before it stores the new value in a. Both the compiler and processor see no relation between a and b. The compiler would perform this reordering at compile time; the reordering would be static, and the resulting object code would simply set b before a. The processor, however, could perform the reordering dynamically during execution by fetching and dispatching seemingly un-related instructions in whatever order it feels is best. The vast majority of the time, such reordering is optimal because there is no apparent relation between a and b. Sometimes the programmer knows best, though.

Although the previous example might be reordered, the processor would never reorder writes such as

a = 1;
b = a;

where a and b are global, because there is clearly a data dependency between a and b. Neither the compiler nor the processor, however, know about code in other contexts. Occasionally, it is important that writes are seen by other code and the outside world in the specific order you intend. This is often the case with hardware devices, but is also common on multiprocessing machines.

The rmb() method provides a read memory barrier. It ensures that no loads are reordered across the rmb() call. That is, no loads prior to the call will be reordered to after the call and no loads after the call will be reordered to before the call.

The wmb() method provides a write barrier. It functions in the same manner as rmb(), but with respect to stores instead of loadsit ensures no stores are reordered across the barrier.

The mb() call provides both a read barrier and a write barrier. No loads or stores will be reordered across a call to mb(). It is provided because a single instruction (often the same instruction used by rmb()) can provide both the load and store barrier.

A variant of rmb(), read_barrier_depends(), provides a read barrier, but only for loads on which subsequent loads depend. All reads prior to the barrier are guaranteed to complete before any reads after the barrier that depend on the reads prior to the barrier. Got it? Basically, it enforces a read barrier, like rmb(), but only for certain readsthose that depend on each other. On some architectures, read_barrier_depends() is much quicker than rmb() because it is not needed and is, thus, a noop.

Let's consider an example using mb() and rmb(). The initial value of a is one and the initial value of b is two.

Thread 1
Thread 2
a = 3;
-
mb();
-
b = 4;
c = b;
-
rmb();
-
d = a;

Without using the memory barriers, on some processors it is possible that c receives the new value of b, whereas d receives the old value of a. For example, c could equal four (what you'd expect), yet d could equal one (not what you'd expect). Using the mb() ensured that a and b were written in the intended order, whereas the rmb() insured c and d were read in the intended order.

This sort of reordering occurs because modern processors dispatch and commit instructions out of order, to optimize use of their pipelines. What can end up happening in the previous example is that the instructions associated with the loads of b and a occur out of order. The rmb()and wmb() functions correspond to instructions that tell the processor to commit any pending load or store instructions, respectively, before continuing.

Let's look at a similar example, but one that uses read_barrier_depends() instead of rmb(). In this example, initially a is one, b is two, and p is &b.

Thread 1
THRead 2
a = 3;
-
mb();
-
p = &a;
pp = p;
-
read_barrier_depends();
-
b = *pp;

Again, without memory barriers, it would be possible for b to be set to pp before pp was set to p. The read_barrier_depends(), however, provides a sufficient barrier because the load of *pp depends on the load of p. It would also be sufficient to use rmb() here, but because the reads are data dependent, we can use the potentially faster read_barrier_depends(). Note that in either case, the mb() is required to enforce the intended load/store ordering in the left thread.

The macros smp_rmb(), smp_wmb(), smp_mb(), and smp_read_barrier_depends() provide a useful optimization. On SMP kernels they are defined as the usual memory barriers, whereas on UP kernels they are defined only as a compiler barrier. You can use these SMP variants when the ordering constraints are specific to SMP systems.

The barrier() method prevents the compiler from optimizing loads or stores across the call. The compiler knows not to rearrange stores and loads in ways that would change the effect of the C code and existing data dependencies. It does not have knowledge, however, of events that can occur outside the current context. For example, the compiler cannot know about interrupts that might read the same data you are writing. For this reason, you might want to ensure a store is issued before a load, for example. The previous memory barriers also function as compiler barriers, but a compiler barrier is much lighter in weight than a memory barrier. Indeed, a compiler barrier is practically free, because it simply prevents the compiler from possibly rearranging things.

Table 9.10 has a full listing of the memory and compiler barrier methods provided by all architectures in the Linux kernel.

Table 9.10. Memory and Compiler Barrier Methods
Barrier
Description
rmb()
Prevents loads from being reordered across the barrier
read_barrier_depends()
Prevents data-dependent loads from being reordered across the barrier
wmb()
Prevents stores from being reordered across the barrier
mb()
Prevents load or stores from being reordered across the barrier
smp_rmb()
Provides an rmb() on SMP, and on UP provides a barrier()
smp_read_barrier_depends()
Provides a read_barrier_depends() on SMP, and provides a barrier() on UP
smp_wmb()
Provides a wmb() on SMP, and provides a barrier() on UP
smp_mb()
Provides an mb() on SMP, and provides a barrier() on UP
barrier()
Prevents the compiler from optimizing stores or loads across the barrier

Note that the actual effects of the barriers vary for each architecture. For example, if a machine does not perform out-of-order stores (for example, Intel x86 chips do not) then wmb() does nothing. You can use the appropriate memory barrier for the worst case (that is, the weakest ordering processor) and your code will compile optimally for your architecture.