Modern-Day Cipher Structure and Operation Modes

The cipher strength depends on the key length, key secrecy (including appropriate key management and distribution), and the design of the algorithm itself. Claude Shannon, who is considered to be a father of modern cryptography by many researchers, proposed two essential elements of cryptographic systems. He designated these elements as diffusion and confusion.

Diffusion refers to eliminating the relation of ciphertext statistical composition to that of the plaintext. The confusion element states that the relationship of the statistical composition between the ciphertext and the value of a key must be as complex as possible.

The Lucifer cipher, published by Horst Feistel from the IBM cryptography team in 1973 (56 years after Shannon), was the first cryptosystem to employ Shannon's principles. The way these principles were implemented involved combining multiple permutations and substitutions in a sequence. Each performed substitution or permutation of data was designated as round, with the whole sequence known as iteration. Instead of using the key as it is, a key schedule algorithm was employed to generate the subkeys from the original key, further XORing different subkeys with blocks of data as it goes through the iteration process. Thus, a sufficient level of confusion was achieved.

A Classical Example: Dissecting DES

It was on the basis of Lucifer that the Data Encryption Standard (DES) for the next 20 years was developed. Because of the importance of DES and the elegance of its design, we focus more on its mathematical description, clearly demonstrating Shannon's elements at work. Once you understand how DES works, you can easily grasp the inner workings of post-DES ciphers by comparing their machinery to that of DES. In our experience this tremendous simplification is very helpful.

In addition to all previously described simple algorithms, DES is a symmetric block cipher. Symmetric refers to the fact that all users of such cryptosystems must share secret keys (usually one key per user) to encrypt and decrypt data. The secrecy of these keys should never be compromised; that is why shadowing of /etc/passwd was introduced (remember the times of misconfigured anonymous FTP servers and cat /etc/passwd followed by paste and John or Crack?). Block refers to the fact that the plaintext is encrypted by equal blocks of data rather than bit by bit. In the case of DES and many of the modern production ciphers, the block size is still 64 bits. Ciphers that encrypt data bit by bit are referred to as streaming ciphers and are discussed later. Streaming ciphers are crucial for encrypting and decrypting data on the fly when it is sent or received over the network connection. Because wireless networks have little Layer 1 security, using streaming ciphers or block ciphers tweaked to operate in a manner similar to the streaming ones is the only way to ensure data confidentiality.

One would expect that if DES uses a 64-bit plaintext block size, the key size is also 64 bits. However, not everything is that simple. DES key bits in positions 8, 16, 24, 32, 40, 48, 56, and 64 are used for error control (which is basically assurance of bytes' odd parity; each byte has an odd number of 1s). Even more, when the data goes through DES rounds, only 48 bits of the remaining 56 bits of key space are selected for use.

DES iteration starts and ends with the initial and final permutation functions that exist for data input–output convenience and does not affect overall DES security.

If a 64-bit plaintext block = x, initial permutation can be described as x₀ = IP(x) = L₀R₀ where L₀ = R₀ = 32 bit => the data block is split on the "right" and "left" sides, 32 bits each.

The initial permutation is followed by 16 rounds of substitution and transposition. If we define the cipher function as f, key scheduling function as K_s, and subkeys generated by K_s as K_i, the 16 rounds can be described as follows:


x_i = L_i R_i ;  L_i = R_i _- ₁ ;  R_i = L_i ^= f (R_i-1, K_i) while i =< 16

This means that as the data goes through these rounds, the right side of data passes the cipher function and the left side is XORed with the function output on the right to give new function input on the right. The process then repeats itself until i reaches 16.

What happens inside of the f (R_i-1, K_i) box?

First of all, we mentioned that only 48 bits of key space is eventually used. In practical terms it means K_i = 48 bits. However, the data block size after the IP(x) is 32 bits. Thus, the first step in a round is an expansion permutation E(R_i), which expands the right side of the data from 32 to 48 bits (apparently, it is the reason why the right side of data passes the cipher function). How does one magically turn 32 bits into 48?

Split 32 bits of data into 4-bit input blocks. Then shift the data in input blocks so that the last bit of each 4-bit block becomes the first bit of the next output block and the first bit of each input block becomes the last bit of the last output block. Thus, using this shift, each input block donates 2 additional bits to each output block. Providing that the original 4 bits in every input block is passed into a corresponding output block, we get 8 x 6-bit output blocks out of 8 x 4-bit input blocks. Problem solved! Even more, the expansion permutation exhibits a very effective cryptographic property called the avalanche effect. Because the expansion permutation presence efficiently generates more ciphertext output from less plaintext input, small differences in plaintext produce vastly different blocks of ciphertext.

Once we get our 48 blocks, we can XOR them with the subkeys: E(R_i) ^= K_i.

However, the round does not end with this XORing. Remember that we have to get 64 bits of encrypted data from 64 bits of plaintext at the end, so we need to get back to two 32-bit blocks of data, essentially reversing the expansion permutation function after it did its job. This is accomplished by splitting 48-bit blocks of data into 8 x 6-bit blocks and feeding them into so-called S-boxes, which produce 4-bit output from each 6-bit data block.

In the S-box, the first and last bits of the 6 bits supplied form a binary number to select a row. The inner 4 bits are used to select a column. Thus, an S-box is simply a table with 4 rows and 16 columns. Four inner bits form a number positioned in the table and selected via row determined by 2 "external" bits and column determined by 4 "internal" bits. Thus, 2 outer bits are cut away (but participate in the 4 inner bits selection) and the remaining 4-bit numbers are reshuffled in a predetermined matter. The function of the S-box is the most nonlinear event in the whole process of DES iteration and is responsible for the lion's share of DES security. Total memory requirement for all eight S-boxes used is 256 bytes. The 4-bit output from each S-box is concatenated back into 32 bits of data before putting these 32-bit blocks through another permutation.

This time the permutation is defined as "straight"; no bits are "reused" to expand the data and no bits are ignored. Basically, 32 bits of data is fed into a P-box with 2 rows and 32 columns, and numbers between rows exchange places.

Did you already forget that everything we just discussed applies only to the right side of the initial 64-bit block?

After the P-box spits out 32 bits of data, they are XORed with the left 32-bit half of the initial input. Then both halves switch places and the next round can begin. Following Round 16, the left and right halves are not exchanged. Instead they are concatenated and fed into the final permutation, which exchanges halves and concatenates them together in a way opposite to the initial permutation IP.

The last thing to consider inside of the function box is K => K_i—in other words, where the subkeys come from.

First, the 8 parity bits are subtracted and the remaining 56 bits are split in half. This split is similar to IP(x) and is referred to as fixed permutation PC1(K) = C₀D₀. Afterward, the halves are circularly (modulo 28) shifted by either 1 or 2 bits depending on a round number: C_i = LS_i (C_i-1) ; D_i = LS_i (D_i-1).

Between Rounds 3 and 8 and 10 and 15, 2-bit left shifts are done, otherwise it is a 1-bit shift. After the shift, C_iand D_i are concatenated, and we are left with 56 bits of key data that must be XORed with 48 bits of data produced by the expansion permutation on the right side. Thus, we need a compression permutation function PC-2 for the keying material to match the corresponding (permutated) "plaintext": K_i = PC-2 (C_iD_i).

PC-2 does not use any specific algorithm to shift the positions of bits or drop some bits to get 48-bit output. Instead it uses a predefined table with numbers of bit positions. Bits are assigned their new positions in accordance with their position in the table. For example, position one from the beginning of the PC-2 table is 14, so the 14th bit is assigned the first position in PC-2 output. The table for PC-2 is widely published in the literature (see p. 274 in Schneier's Applied Cryptography). As a result of PC-2 we get a 48-bit K_i ready for XORing. Because of the round-dependent left shifts, different parts of the initial key material are used to create each subkey. This is the element of confusion at work.

The images of cipher structures from John Savard's home page cryptography section (http://home.ecn.ab.ca/~jsavard/crypto/entry.htm) are so wonderful that we could not resist borrowing them for this chapter. Sometimes, a picture can be worth a thousand lines of code! Figure 11-1 summarizes all we went through, with the left scheme representing the whole iteration of DES and the right scheme representing a single round.

Figure 11.1. DES structure and operation.

graphics/11fig01.gif

[View full size image]

Kerckhoff's Rule and Cipher Secrecy

You might have wondered, if DES is a U.S. government standard, why were its inner workings given away to the general public? The answer is this: Keeping the cipher closed from scrutiny does no good for the cipher, its developers, and its users. As early as 1883, Jean Guillaumen Hubert Victor Fransois Alexandre Auguste Kerckhoff von Nieuwenhof (yes, some people have rather long names) wrote that the key used and the cipher's function must be two separate entities and cryptosystems should rely on secrecy of keys, but not algorithms. Some 111 years later, a proprietary secret algorithm, RC4, was published on the Internet by an unknown hacker who posted its source code to the Cypherpunks mailing list. Opening the RC4 structure quickly led to the development of several attacks on the cipher (however, these attacks aren't related to the weakness of WEP, which uses RC4). Whereas the developers of RC4, RSA Data Security, Inc., are well-reputed cryptographers who created a variety of strong product ciphers, many small companies that claim to develop highly efficient and secure secret encryption algorithms often do not offer anything more than a variety of ROT13 with a single "round" of XORing. It appears that open sourcing ciphers, just like open sourcing software, has advantages when it comes to security, public scrutiny being one of them.

Another advantage of DES openness is that now you have learned about S-boxes, subkeys, expansion, and compression, and straight permutations, using a classical and still practical (considering the use of 3DES and still running legacy cryptographics software and hardware) example. Now it is much easier to explain how post-DES ciphers work, not to mention saving a lot of space and our time.

The 802.11i Primer: A Cipher to Help Another Cipher

As a very relevant example, the per-packet key mixing function in TKIP is a small Feistel cipher on its own, developed by Doug Whiting and Ronald Rivest. The purpose of this function is producing a per-packet key from a temporal key and the IV or TKIP sequence counter (TSC). This per-packet key then serves as a secure WEP seed, eliminating the risk of an FMS-style attack. Such a WEP seed can be computed before it is used, which positively affects the network performance.

Let's walk through the per-packet key mixing function in detail, as defined by the 802.11i standard drafts available at the time of writing. As mentioned in the previous chapter, there are two phases of per-packet key mixing function operation. However, both Phase 1 and Phase 2 of per-packet key generation rely on the S-box that substitutes one 16-bit value with another 16-bit value. The substitution function is nonlinear and is implemented as a table lookup. The table lookup can be implemented as a single large table with 65,536 entries and a 16-bit index (128 KB of table) or two different tables with 256 entries and an 8-bit index (1024 bytes for both tables). When the two smaller tables are chosen, the high-order byte is used to obtain a 16-bit value from one table and the low-order byte is used to obtain a 16-bit value using the other table. The S-box output in this case would be the XOR of the two 16-bit values selected.

The inputs taken by the first phase are 80 bits of the 128-bit temporal session key (TK), the transmitter MAC address (TA), and the 32 bits of the IV = TSC. Its output (TTAK) is also 80 bits in length and constitutes an array of five 16-bit TTAK₀, TTAK₁, TTAK₂, TTAK_3, and TTAK₄ values. The description of the Phase 1 algorithm treats these values as 8-bit arrays: TA₀..TA₅ and TK₆..TK₁₂.

XOR, ADD, and bitwise AND operations are used in the Phase 1 computation. A loop counter i and an array index temporary variable j are also employed and a single function called Mk16 is applied in the process. The function Mk16 produces a 16-bit value from two given 8-bit inputs: Mk16(X,Y) = 256*X+Y.

The Phase 1 algorithm consists of two steps. The first step does an initialization of TTAK from both IV and MAC address, but without the temporary key. The second step employs the S-box we outlined earlier to mix the keying material into the 80-bit TTAK and sets the PHASE1_LOOP_COUNT value to 8:


Input: transmit address TA0...TA5, temporal key TK0..TK12, and TSC0..TSC2

Output: intermediate key TTAK0..TTAK4



PHASE1-KEY-MIXING(TA0...TA5, TK0..TK12, TSC0..TSC2)



PHASE1_STEP1:

          TTAK0 <= TSC0

          TTAK1 <= TSC1

          TTAK2 <= Mk16(TA1,TA0)

          TTAK3 <= Mk16(TA3,TA2)

          TTAK4 <= Mk16(TA5,TA4)



PHASE1_STEP2:

    for i = 0 to PHASE1_LOOP_COUNT -1

    j = 2(i & 1)

          TTAK0 <= TTAK0 + S[TTAK4 ^= Mk16(TK1+j,TK0+j)]

          TTAK1 <= TTAK1 + S[TTAK0 ^= Mk16(TK5+j,TK4+j)]

          TTAK2 <= TTAK2 + S[TTAK1 ^= Mk16(TK9+j,TK8+j)]

          TTAK3 <= TTAK3 + S[TTAK2 ^= Mk16(TK13+j,TK12+j)]

          TTAK4 <= TTAK4 + S[TTAK3 ^= Mk16(TK1+j,TK0+j)]+i

          end

The inputs to the second phase of the temporal key mixing function include the output of the first phase (TTAK), the TK, and the lower 16 bits of the TSC. The created WEP seed possesses an internal structure that conforms to the original WEP specification. The first 24 bits of the seed are transmitted in plaintext in the same way as with the old WEP IVs. As mentioned in the previous chapter, these 24 bits are used to convey the lower 16 bits of the TSC from transmitter to receiver. The remaining 32 bits are conveyed in the Extended IV (EIV) field, in Big-Endian order.

In Phase 2, both TK and TTAK values are represented as in Phase 1. The WEP seed produced is an array of 8-bit values ranging from Seed0 to Seed15. The TSC is viewed as another array, this time consisting of 16-bit values TSC0–TSC2. Finally, the pseudocode used by the Phase 2 mixing function employs a loop counter i and a single variable: PPK. This variable is 128 bits long and consists of an array of 16-bit values ranging from PPK₀ to PPK₇. The mapping from the 16-bit PPK values to the 8-bit WEP seed values generated is explicitly Little-Endian. This is done to match the Endian architecture of the most common processors used for TKIP computation.

XOR, ADD, AND, OR, and the right bit shift operations (>>) are employed in the process of Phase 2 computation that relies on four functions:

Lo8 references the least significant 8 bits of the 16-bit input value.
Hi8 references the most significant 8 bits of the 16-bit value.
RotR1 rotates its 16-bit argument 1 bit to the right.
Mk16 was already described when outlining Phase 1.

Phase 2 consists of three steps:

STEP1 copies the TTAK and brings in the 16 bits of TSC.
STEP2 is the S-box.
STEP3 brings in the remaining TK bits and defines the 24-bit WEP IV values transmitted.



Input: intermediate key TTAK0...TTAK4, TK, and TKIP sequence counter TSC. Output: WEPSeed0

...WEPSeed15



PHASE2-KEY-MIXING(TTAK0...TTAK4, TK, TSC)



PHASE2_STEP1:

    PPK0 <= TTAK0

    PPK1 <= TTAK1

    PPK2 <= TTAK2

    PPK3 <= TTAK3

    PPK4 <= TTAK4

    PPK5 <= TTAK4 + TSC



PHASE2_STEP2:

    PPK0 <= PPK0 + S[PPK5 ^= Mk16(TK1,TK0)]

    PPK1 <= PPK1 + S[PPK0 ^= Mk16(TK3,TK2)]

    PPK2 <= PPK2 + S[PPK1 ^= Mk16(TK5,TK4)]

    PPK3 <= PPK3 + S[PPK2 ^= Mk16(TK7,TK6)]

    PPK4 <= PPK4 + S[PPK3 ^= Mk16(TK9,TK8)]

    PPK5 <= PPK5 + S[PPK4 ^= Mk16(TK11,TK10)]

    PPK0 <= PPK0 + RotR1(PPK5 ^= Mk16(TK13,TK12))

    PPK1 <= PPK1 + RotR1(PPK0 ^= Mk16(TK15,TK14))

    PPK2 <= PPK2 + RotR1(PPK1)

    PPK3 <= PPK3 + RotR1(PPK2)

    PPK4 <= PPK4 + RotR1(PPK3)

    PPK5 <= PPK5 + RotR1(PPK4)



PHASE2_STEP3:

    WEPSeed0 <= Hi8(TSC)

    WEPSeed1 <= (Hi8(TSC) || 0x20) && 0x7F

    WEPSeed2 <= Lo8(TSC)

    WEPSeed3 <= Lo8((PPK5 ^= Mk16(TK1,TK0)) >> 1)

    for i = 0 to 5

    WEPSeed_4+(2.i) <= Lo8(PPKi)

    WEPSeed_5+(2.i) <= Hi8(PPKi)

 end



return WEPSeed0...WEPSeed15

Step 3 of Phase 2 determines the values of all three WEP IV octets. Its structure was designed to eliminate the use of known weak keys. The receiving device can easily reconstruct the least significant 16 bits of the TSC used by the sender by concatenating the first and third IV octets and ignoring the second one. The remaining 32 bits of the TSC are obtained from the EIV. Thus, you have been presented with an interesting case when a specific cipher is designed and implemented to correct a flaw in another cipher's implementation.

There Is More to a Cipher Than the Cipher: Understanding Cipher Operation Modes

Understanding how DES (or any other symmetric block cipher including the TKIP function we just described and the AES used by the final 802.11i standard) works is not sufficient per se. An extremely important detail is the unique mode of the block cipher operation. With DES you can encrypt 64 bits of plaintext into an equivalent amount of ciphertext, but what if you want to encrypt only 50 bits? 128 bits? 200 bits? Or 31337 bits of data?

An obvious solution is to split the long string of data into 64-bit blocks and pad blocks with less than 64 bits of data with some regular pattern of 0s and 1s. This is the most basic and simplest mode of block cipher operation, the Electronic Codebook Mode (ECB). In a nutshell, if the plaintext string x = x₁ x₂ x₃ .... c_i = e_k (x_i). Advantages of the ECB mode include the possibility of parallelized encryption and decryption on multiprocessing systems and the fact that an error in ciphertext would affect only one block of data. Then the advantages of the ECB end. In a long enough string of data, the patterns tend to repeat, and splitting the data into 64-bit blocks would not conceal such repetition. If we can deduce from these patterns which piece of plaintext corresponds to a particular piece of ciphertext, we can mount a replay attack using such knowledge. For example, we can determine that the encrypted messages are e-mails and be confident that the repeating pattern we see is mail headers. Then we can replay a header to send a message to a receiver of interest. This time instead of the usual encrypted love message it might contain "I Love You" in a slightly different form, perhaps in the form of a Visual Basic script.

Another problem we have encountered with the ECB is a short block length. We have discovered that on a machine running the U.S. version of Debian (which still uses ECB-mode DES for password security due to the export regulations) maximum password length cannot exceed 8 characters, and any symbol in a password longer than that does not make any difference at all (e.g., you can log in with "password" if the real password is "password%^*&))@!#0x69"). Thus, the possibility of a successful dictionary attack or password brute-forcing is increased. If you recall that the block size is 64 bits, and one ASCII character takes 1 byte, the reason for such an event is obvious. Yet, the majority of system administrators we asked about this discovery looked rather puzzled. Sometimes a theory might be more practical than it seems.

To summarize, ECB is reasonable for encryption of short data strings, like PIN numbers or database entry values (parallel encryption and decryption of large databases does have its advantages). It is not a suitable mode to rely on for encrypting strong passwords or large data volumes.

The way to avoid the replay problem is to chain 64-bit blocks of data together so that they become interdependent. Thus, a next mode of block cipher operation we are looking at is a Cipher Block Chaining mode (CBC). It is based on XORing the plaintext with a previously obtained block of encrypted data before the plaintext is encrypted. Because when the encryption happens the first time there is no ciphertext to XOR with, a new parameter known as the initialization vector (IV; sometimes also called an injection vector, initializing value, or initial chaining value) is introduced. IV is nothing more than a block of random data in the size of a block the cipher uses (64 bits with DES and many other symmetric block ciphers out there). It could be a timestamp, /dev/urandom output, or anything else. IV doesn't have to be secret and can be transmitted in the clear; view it as a dummy cipher block. This is the case with WEP IVs broadcast plaintext across WLANs. The decrypting machine pushes the IV into a feedback register when decryption starts, from which it goes no further than /dev/null. To summarize how CBC works, if x = x₁ x₂ x₃ ... x_n:


x₁ ^= IV

c₁ = e_k (x₁ ^= IV)

c₂ = e_k (x₂ ^= c₁)

c_i = e_k (x_i ^= c_i _-1)

The use of IVs ensures that the ciphertexts resulting from the plaintexts that are similar in the first few bytes (e.g., LLC SNAP or IP headers) are different. There are two main disadvantages of CBC:

Because chaining is applied, parallel encryption is not possible (although parallel decryption is still an option).
An error in one block will propagate to other (chained) blocks, which is likely to result in data retransmission.

Still, we encrypt data in fixed-size blocks, chained or not. What if we want to encrypt it in smaller blocks, or bit-by-bit, starting the encryption before the whole block is received? For some applications (e.g., remote shells), data should be encrypted immediately, character by character (8-bit blocks). There are two solutions for this problem, namely Cipher-Feedback (CFB) and Output Feedback (OFB) modes. In CFB we start feeding the blocks of generated ciphertext into the encrypted IV value rather than into plaintext:


z₁ = e_k (IV)

c₁ = x₁ ^= z₁

z₂ = e_k (c₁)

c₂ = x₂ ^= z₂

z_i = e_k (c_i-1)

c_i = x_i^= z_i

Unlike CBC, we can start sending enciphered data as the plaintext and z are getting XORed. In CBC, c_i is generated by enciphering, not XORing with previously encrypted data, which means only blocks of cipher-dependent block size can be sent. Of course, generating z_i before each round of XORing is a speed- and throughput-limiting factor here. It was estimated that the throughput of CFB mode encryption is reduced by a factor of m/n, where m is a block cipher size and n is a number of bits encrypted at a time. For example, 64-bit block cipher encrypting ASCII characters in CFB mode will work 64/8 = 8 times slower compared to the same cipher operating in ECB or CBC with 64-bit blocks. The CBC statement on parallel processing applies to the CFB as well. However, when it comes to error propagation, a ciphertext error would affect only the corresponding plaintext and the next full block.

Finally, Output Feedback (OFB) mode removes chaining, and as such, removes error propagation. On the other hand, the data blocks are not interdependent anymore, so some external form of synchronization (e.g., similar to the CSMA/CA algorithm on 802.11 LANs) is needed. To remove chaining, OFB generates a constant stream of z_i from IV, with which the plaintext is XORed to encrypt data; z_i does not depend on either plaintext or ciphertext:


z₁ = e_k (IV)

c₁ = x₁ ^= z₁

z_i = e_k (z_i-1)

c_i = x_i ^= z_i

Because there is only one stream of z_i per encryption and decryption process, no parallel encrypting and decrypting is possible. Thus, the number of processors on sending and receiving hosts doesn't have any effect on OFB mode cipher speed and throughput. Besides, the m/n rule applies to OFB as well as CFB.

How about the counter mode (CCM) used by the 802.11i standard for Advanced Encryption Standard (AES) operation? It is quite similar to the OFB mode just reviewed. In the OFB mode, each z_i value is linked to the previous one via the z_i = e_k (z_i-1) procedure. CCM is simpler: The IV values taken from the incrementing counter are encrypted and XORed with blocks of plaintext to produce the ciphertext, or:


z₁ = e_k (IV)

c₁ = x₁ ^= z₁

z_i = e_k (IV + n)

c_i = x_i ^= z_i

The n value signifies the fact that the counter can start at any arbitrary value and increment by a chosen value or pattern. In reality, the IV is supposed to be initialized from a nonce that changes for each successive message. Thus, the repeating ECB blocks problem is eliminated. Of course, the receiver must be aware of both IV and n, which means the n pattern should be standardized and IV has to be transmitted (perhaps unencrypted) before the secure communication begins. When both transmitting and receiving systems are synchronized, XORing the arriving data on the receiving end is all that is needed to decrypt the data. Thus, there is no need for a specific AES decrypting scheme and the encryption and decryption processes can be done in a parallel operation. An important thing here is to avoid the reuse of IVs (think of the repeating ECB blocks mentioned and the FMS attack on WEP). However, 48-bit IV space should be sufficient to mitigate this problem. We have calculated that on a full-duplex 100BaseT link (packet size 1,500 bits) it would take approximately 127 years to exhaust the 48-bit space, and current WLAN links are still slower than 100BaseT and usually employ larger packets.

< Day Day Up >