
Perl Compatible Regular Expressions

Perl is a powerful scripting language. It was originally designed as a replacement for more limited Unix shell tools, and one of its core features is an extended regular expression engine. PHP provides support for the Perl regular expression syntax, giving you a suite of flexible tools for managing and transforming text.

A regular expression is a combination of symbols that match a pattern in text. Learning how to use regular expressions, therefore, is much more than learning the arguments and return types of PHP's regular expression functions. We will begin with the functions and use them to introduce regular expression syntax.

Matching Patterns with preg_match()

preg_match() accepts four arguments: a regular expression string, a source string, an array variable (which stores matches), and an optional fourth flag argument. preg_match() returns 1 if a match is found and 0 otherwise. These numbers represent the number of matches the function can make in a string; because preg_match() stops searching after the first match, the count never exceeds 1. Your regular expression string should be enclosed by delimiters, conventionally forward slashes, although you can use any character that isn't alphanumeric, whitespace, or the backslash character.

Let's search the string "aardvark advocacy" for the letters "aa":


print "<pre>\n";
print preg_match("/aa/", "aardvark advocacy", $array) . "\n";
print_r( $array );
print "</pre>\n";

// output:
// 1
// Array
// (
//   [0] => aa
// )

The letters aa exist in aardvark, so preg_match() returns 1. The first element of the $array variable is also filled with the matched string, which we print to the browser. This might seem strange given that we already know the pattern we are looking for is "aa". We are not, however, limited to looking for predefined characters. We can use a single dot (.) to match any character:


print "<pre>\n";
print preg_match("/d./", "aardvark advocacy", $array) . "\n";
print_r( $array );
print "</pre>\n";

// output:
// 1
// Array
// (
//   [0] => dv
// )

d. matches "d" followed by any character. We don't know in advance what the second character will be, so the value in $array[0] becomes useful.

If you pass an integer constant flag, PREG_OFFSET_CAPTURE, to preg_match() as the fourth argument, each match in the $array variable is returned as a two-element array. The first element contains the match, and the second contains the number of characters from the start of the search string to the point where the match was found. Suppose we amend our previous call to preg_match():


preg_match("/d./", "aardvark advocacy", $array, PREG_OFFSET_CAPTURE );

$array will contain a subarray, containing the matched string "dv" and the number 3, representing the number of characters before the match:


// Array
// (
//   [0] => Array
//     (
//       [0] => dv
//       [1] => 3
//     )
//
// )
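
As a rough sketch of how the offset might be put to work, the following fragment uses it with substr() to split the source string around the match. The variable names $source, $before, and $after are illustrative only.

```php
<?php
$source = "aardvark advocacy";

// Ask preg_match() to record the offset of the match as well as its text
if ( preg_match( '/d./', $source, $array, PREG_OFFSET_CAPTURE ) ) {
  $match  = $array[0][0];   // the matched text
  $offset = $array[0][1];   // number of characters before the match

  // Use the offset to split the source string around the match
  $before = substr( $source, 0, $offset );
  $after  = substr( $source, $offset + strlen( $match ) );

  print "before: '$before', match: '$match', after: '$after'\n";
}
```

Here $before holds "aar" and $after holds "ark advocacy".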


Using Quantifiers to Match a Character More Than Once

When you search for a character in a string, you can use a quantifier to determine the number of times this character should repeat for a match to be made. The pattern a+, for example, will match one "a" followed by zero or more additional "a" characters. Let's put this to the test:


if ( preg_match("/a+/","aaaa", $array) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}

// output:
// Array
// (
//   [0] => aaaa
// )

Notice that this regular expression greedily matches as many characters as it can. Table 18.1 lists the quantifiers you can use to test for a recurring character.

Table 18.1. Quantifiers for Matching a Recurring Character

Symbol     Description                                        Example
*          Zero or more instances                             a*
+          One or more instances                              a+
?          Zero or one instance                               a?
{n}        n instances                                        a{3}
{n,}       At least n instances                               a{3,}
{0,n}      Up to n instances                                  a{0,2}
{n1,n2}    At least n1 instances, no more than n2 instances   a{1,2}

Always supply the lower bound when you set an upper bound. PCRE has traditionally treated the short form {,n} as literal text rather than as a quantifier.

The numbers between braces in Table 18.1 are called bounds. Bounds define the number of times a character or range of characters should be matched in a regular expression. You should place your upper and lower bounds between braces after the character you want to match:


a{4,5}

This line matches no fewer than four and no more than five instances of the character a.
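
To see the bounds at work, here is a minimal sketch that applies a{4,5} to strings of different lengths and records what was matched. The $candidates and $results arrays are names made up for this illustration.

```php
<?php
// Strings containing three, four, five, and six "a" characters
$candidates = array( "aaa", "aaaa", "aaaaa", "aaaaaa" );
$results = array();

foreach ( $candidates as $candidate ) {
  if ( preg_match( '/a{4,5}/', $candidate, $array ) ) {
    // Store the text that the bounded pattern actually matched
    $results[$candidate] = $array[0];
  } else {
    // Fewer than four "a" characters: no match at all
    $results[$candidate] = null;
  }
}

print_r( $results );
```

The three-character string fails to match, and the six-character string yields a match of only five characters, because the upper bound stops the pattern from consuming any more.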

PCREs and Greediness

By default, regular expressions attempt to match as many characters as possible. Notice the following line:


"/p.*t/"

It will find the first "p" in a string and match as many characters as possible until the last possible "t" character is reached. So this regular expression matches the entire test string in the following fragment:


$text = "pot post pat patent";
if (preg_match ( "/p.*t/", $text, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}

// output:
// Array
// (
//   [0] => pot post pat patent
// )

By placing a question mark (?) after any quantifier, you can force a PCRE to be more frugal. Notice the following line:


"p.*t"

It means "p followed by as many characters as possible followed by t." But now notice the next line:


"p.*?t"

It means "p followed by as few characters as possible followed by t."

The following fragment uses this technique to match the smallest number of characters starting with "p" and ending with "t":


$text = "pot post pat patent";
if ( preg_match( "/p.*?t/", $text, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}

// output:
// Array
// (
//   [0] => pot
// )
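
The difference matters most when a string contains several possible end points. The following sketch, which uses a made-up string of markup, runs a greedy and a frugal version of the same pattern side by side:

```php
<?php
$text = "<b>bold</b> and <i>italic</i>";

// Greedy: .* runs on to the last ">" in the string
preg_match( '/<.*>/', $text, $greedy );

// Frugal: .*? stops at the first ">" that allows a match
preg_match( '/<.*?>/', $text, $frugal );

print "greedy: " . $greedy[0] . "\n";
print "frugal: " . $frugal[0] . "\n";
```

The greedy pattern swallows the entire string, whereas the frugal pattern matches only the opening "<b>" tag.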



Matching Ranges of Characters with Character Classes

Until now, we have either matched specified characters or used . to match any character. Character classes enable you to match any one of a group of characters. To define a character class, you surround the characters you want to match in square brackets. [ab] will match "a" or "b." After you define a character class, you can treat it as if it were a character. So [ab]+ will match "aaa," "bbb," or "ababab."

You can also match ranges of characters with a character class: [a-z] will match any lowercase letter, [A-Z] will match any uppercase letter, and [0-9] will match any digit. You can combine ranges and individual characters into one character class, so [a-z5] will match any lowercase letter or the number 5.

In the following fragment, we are looking for any lowercase alphabetical character or the numbers 3, 4, and 7:


if ( preg_match("/[a-z347]+/", "AB dkfd773sxFF", $array) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}

// output:
// Array
// (
//   [0] => dkfd773sx
// )

You can also negate a character class by including a caret (^) character after the opening square bracket: [^A-Z] will match anything apart from an uppercase character.

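Let's negate the characters in the character class we defined in the previous example. Note that the match will include the space after "AB" because a space is not one of the excluded characters: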


if ( preg_match("/[^a-z347]+/","AB dkfd773sxFF", $array) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}

// output:
// Array
// (
//   [0] => AB
// )
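
Several ranges can share one character class. As a small sketch, the following fragment looks for a "#" followed by a run of hexadecimal digits in a sample string invented for the purpose:

```php
<?php
$text = "color: #1A2b3C;";

// "#" followed by one or more characters from three ranges:
// digits, lowercase a-f, and uppercase A-F
if ( preg_match( '/#[0-9a-fA-F]+/', $text, $array ) ) {
  print $array[0] . "\n";
}
```

The match ends at the semicolon, which falls outside all three ranges.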



PCREs and Backslashed Characters

You can escape certain characters with PCREs, just as you can within strings. \t, for example, represents a tab character, and \n represents a newline. PCREs also define some escape characters that will match entire character types. Table 18.2 lists these backslash characters.

Table 18.2. Escape Characters That Match Character Types

Character   Matches
\d          Any digit
\D          Anything other than a digit
\s          Any kind of whitespace
\S          Anything other than whitespace
\w          Alphanumeric characters (including the underscore character)
\W          Anything other than an alphanumeric character or an underscore

These escape characters can vastly simplify your regular expressions. Without them, you would be forced to use a character class to match ranges of characters. Compare the following valid methods for matching word characters:


preg_match( "/p[a-zA-Z0-9_]+t/", $text, $array );
preg_match( "/p\w+t/", $text, $array );

Both examples match "p" followed by one or more word characters (letters, digits, or underscores) followed by "t." The second example is easier to write and read, however.
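
The other escape characters in Table 18.2 work in the same way. Here is a brief sketch that applies three of them to a sample string of our own devising:

```php
<?php
$text = "Order 66 shipped";

// \d+ finds the first run of digits
preg_match( '/\d+/', $text, $digits );

// \D+ finds the first run of characters that are not digits
preg_match( '/\D+/', $text, $nonDigits );

// Word characters, then whitespace, then digits
preg_match( '/\w+\s+\d+/', $text, $pair );

print "digits: '$digits[0]'\n";
print "non-digits: '$nonDigits[0]'\n";
print "pair: '$pair[0]'\n";
```

Notice that the \D+ match includes the space after "Order" because a space is not a digit.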

PCREs also support a number of escape characters that act as anchors. Anchors match positions within a string, without matching any characters. They are listed in Table 18.3.

Table 18.3. Escape Characters That Act As Anchors

Character   Matches
\A          Beginning of string
\b          Word boundary
\B          Not a word boundary
\Z          End of string (matches before final newline or at end of string)
\z          End of string (matches only at very end of string)

Let's put the word boundary character to the test:


$text = "pot post pat patent";
if ( preg_match( "/\bp\w+t\b/", $text, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}

// output:
// Array
// (
//   [0] => pot
// )

The preg_match() call in the previous fragment will match the character "p" but only if it is at a word boundary, followed by one or more word characters, followed by "t," but only if it is at a word boundary. The word boundary escape character does not actually match a character; it merely confirms that a boundary exists for a match to take place.

You can also escape characters to turn off their meanings. To match a "." character, for example, you should add a backslash to the character in your regular expression string:


preg_match( "/\./", $string, $array );
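
When the text you want to match literally comes from a variable, escaping each special character by hand becomes error prone. PHP's built-in preg_quote() function, which is not covered elsewhere in this section, adds the backslashes for you. The following is only a sketch; the sample strings are invented:

```php
<?php
$literal = "example.com";

// preg_quote() escapes characters that are special in a regular expression.
// The second argument names our delimiter so that it is escaped too.
$pattern = "/" . preg_quote( $literal, "/" ) . "/";
print $pattern . "\n";

// The escaped dot matches only a real dot
$hitsLiteral = preg_match( $pattern, "www.example.com" );
$hitsOther   = preg_match( $pattern, "www.examplexcom.org" );

// An unescaped dot matches any character, including the "x"
$unescaped = preg_match( "/example.com/", "www.examplexcom.org" );

print "$hitsLiteral $hitsOther $unescaped\n";
```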



Working with Subpatterns

A subpattern is a pattern enclosed in parentheses (sometimes referred to as an atom). After you define a subpattern, you can treat it as if it were itself a character or character class. In other words, you can match the same pattern as many times as you want using the syntax described in Table 18.1.

Subpatterns are also used to change the way a regular expression is interpreted, usually by limiting the scope of a set of alternatives.

Finally, you can use subpatterns to save the results of a submatch within a regular expression for later use.

In the next fragment, we define a pattern and use parentheses to match individual elements within it:


$test = "Whatever you do, don't panic!";
if ( preg_match( "/(don't)\s+(panic)/", $test, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}

// output:
// Array
// (
//   [0] => don't panic
//   [1] => don't
//   [2] => panic
// )

The first element of the array variable that is passed to preg_match() contains the complete matched string. Subsequent elements contain each individual atom matched. This means that you can access the component parts of a matched pattern as well as the entire match.

In the following code fragment, we match an IP address and access not only the entire address, but also each of its component parts:


$test = "158.152.55.35";
if ( preg_match( "/(\d+)\.(\d+)\.(\d+)\.(\d+)/", $test, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}

// output:
// Array
// (
//   [0] => 158.152.55.35
//   [1] => 158
//   [2] => 152
//   [3] => 55
//   [4] => 35
// )

Notice that we used a backslash (\) to escape the dots in the regular expression. By doing so, we signal that we want to strip . of its special meaning and treat it as a specific character. You must do the same for any character that has a function in a regular expression if you want to refer to it.

Branches

You can combine patterns with the pipe (|) character to create branches in your regular expressions. A regular expression with two branches will match either the first pattern or the second. This process adds yet another layer of flexibility to regular expression syntax. In the next code fragment, we match either .com or .co.uk in a string:


$test = "www.example.com";
if ( preg_match( "/www\.example(\.com|\.co\.uk)/", $test, $array ) ) {
  print "it is a $array[1] domain<br/>";
}
// output:
// it is a .com domain

We illustrate two aspects of a subpattern in the preceding example. First, we capture the match of .com or .co.uk, making it available in $array[1], and second, we define the scope of the branch. Without the parentheses, we would match either www.example.com or .co.uk, which is not what we want at all.

Anchoring a Regular Expression

Not only can you determine the pattern you want to find in a string, you also can decide where in the string you want to find it. To test whether a pattern is at the beginning of a string, prepend a caret (^) symbol to your regular expression. ^a will match "apple," but not "banana."

To test that a pattern is at the end of a string, append a dollar ($) symbol to the end of your regular expression. a$ will match "flea" but not "dear."
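
The following sketch tests both anchors against the words mentioned above. The $starts and $ends arrays are names chosen for this illustration:

```php
<?php
$words  = array( "apple", "banana", "flea", "dear" );
$starts = array();
$ends   = array();

foreach ( $words as $word ) {
  // 1 if the word begins with "a", 0 otherwise
  $starts[$word] = preg_match( '/^a/', $word );
  // 1 if the word ends with "a", 0 otherwise
  $ends[$word] = preg_match( '/a$/', $word );
  print "$word: starts=" . $starts[$word] . " ends=" . $ends[$word] . "\n";
}
```

Only "apple" passes the first test, while both "banana" and "flea" pass the second.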

Finding Matches Globally with preg_match_all()

It is a feature of preg_match() that it only matches the first pattern it finds in a string. So searching for words beginning with "p" and ending with "s," we will match only the first found pattern. Let's try it out:


$text = "I sell pots, plants, pistachios, pianos and parrots";
if ( preg_match( "/\bp\w+s\b/", $text, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}

// output:
// Array
// (
//   [0] => pots
// )

As we would expect, the first match, "pots," is stored in the first element of the $array variable. None of the other words are matched.

We can use preg_match_all() to access every match in the test string in one call. preg_match_all() accepts a regular expression, a source string, and an array variable and will return the number of complete matches it finds (0 if there are none), so you can still test its return value in an if statement. The array variable is populated with a multidimensional array, the first element of which will contain every match to the complete pattern defined in the regular expression.

Listing 18.1 tests a string using preg_match_all() and uses the print_r() function to output the multidimensional array of results.

Listing 18.1 Using preg_match_all() to Match a Pattern Globally
 1: <!DOCTYPE html PUBLIC
 2:   "-//W3C//DTD XHTML 1.0 Strict//EN"
 3:   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
 4: <html>
 5: <head>
 6: <title>Using preg_match_all() to Match a Pattern Globally</title>
 7: </head>
 8: <body>
 9: <?php
10: $text = "I sell pots, plants, pistachios, pianos and parrots";
11: if ( preg_match_all( "/\bp\w+s\b/", $text, $array ) ) {
12:   print "<pre>\n";
13:   print_r( $array );
14:   print "</pre>\n";
15: }
16:
17: // output:
18: // Array
19: // (
20: //   [0] => Array
21: //     (
22: //       [0] => pots
23: //       [1] => plants
24: //       [2] => pistachios
25: //       [3] => pianos
26: //       [4] => parrots
27: //     )
28: //
29: // )
30:
31: ?>
32: </body>
33: </html>

The first and only element of the $array variable that we passed to preg_match_all() on line 11 has been populated with an array of strings. This array contains every word in the test string that begins with "p" and ends with "s."

preg_match_all() populates a multidimensional array to store matches to subpatterns. The first element of the array argument passed to preg_match_all() will contain every match of the complete regular expression. Each additional element will contain the matches that correspond to each atom (subpattern in parentheses). Notice the following call to preg_match_all():


$text = "01-05-99, 01-10-99, 01-03-00";
preg_match_all( "/(\d+)-(\d+)-(\d+)/", $text, $array );

$array[0] will store an array of complete matches:


$array[0][0]: 01-05-99
$array[0][1]: 01-10-99
$array[0][2]: 01-03-00

$array[1] will store an array of matches that corresponds to the first subpattern:


$array[1][0]: 01
$array[1][1]: 01
$array[1][2]: 01

$array[2] will store an array of matches that corresponds to the second subpattern:


$array[2][0]: 05
$array[2][1]: 10
$array[2][2]: 03

And so on. We can change this behavior by passing a constant integer flag, PREG_SET_ORDER, to preg_match_all() as its optional fourth argument:


$text = "01-05-99, 01-10-99, 01-03-00";
preg_match_all( "/(\d+)-(\d+)-(\d+)/", $text, $array, PREG_SET_ORDER );

This will change the structure of $array. Each element will still be an array, but each subarray will now represent a single match: its first element will be the complete match, and each subsequent element will be a submatch. So the first element of $array will contain all aspects of the first match:


$array[0][0]: 01-05-99
$array[0][1]: 01
$array[0][2]: 05
$array[0][3]: 99

The second array will contain all aspects of the second match:


$array[1][0]: 01-10-99
$array[1][1]: 01
$array[1][2]: 10
$array[1][3]: 99

And so on.
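
This structure is convenient when you want to loop through the matches one at a time. A minimal sketch follows, in which $set and $lines are illustrative names:

```php
<?php
$text = "01-05-99, 01-10-99, 01-03-00";
preg_match_all( '/(\d+)-(\d+)-(\d+)/', $text, $array, PREG_SET_ORDER );

$lines = array();
foreach ( $array as $set ) {
  // $set[0] is the complete match; $set[1] to $set[3] are the submatches
  $lines[] = "full: $set[0], first: $set[1], second: $set[2], third: $set[3]";
}

print implode( "\n", $lines ) . "\n";
```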

Using preg_replace() to Replace Patterns

Until now, we have searched for patterns in a string, leaving the search string untouched. preg_replace() enables you to find a pattern in a string and replace it with a new substring. preg_replace() requires three strings: a regular expression, the text with which to replace a found pattern, and the text to modify. It optionally accepts a fourth integer argument, which sets a limit to the number of replacements the function should perform. preg_replace() returns a string, including the modification if a match was found or an unchanged copy of the original source string otherwise. In the following fragment, we search for the name of a club official, replacing it with the name of her successor:


$test = "Our Secretary, Sarah Williams is pleased to welcome you.";
print preg_replace("/Sarah Williams/", "Rev. P.W. Goodchild", $test);
// output:
// Our Secretary, Rev. P.W. Goodchild is pleased to welcome you.

Note that although preg_match() will only match the first pattern it finds, preg_replace() will find and replace every instance of a pattern, unless you pass a limit integer as a fourth argument.
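
The effect of the limit argument can be sketched with a short string of our own invention:

```php
<?php
$text = "pat pat pat";

// No limit: every instance of the pattern is replaced
$all = preg_replace( '/pat/', 'pot', $text );

// A limit of 1: only the first instance is replaced
$firstOnly = preg_replace( '/pat/', 'pot', $text, 1 );

print "$all\n";
print "$firstOnly\n";
```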

Using Back References with preg_replace()

Back references make it possible for you to use part of a matched pattern in the replacement string. To use this feature, you should use parentheses to wrap any elements of your regular expression that you might want to use. The text matched by these subpatterns will be available to the replacement string if you refer to them with a dollar character ($) and the number of the subpattern ($1, for example). Subpatterns are numbered in order, outer to inner, left to right, starting at $1. $0 stores the entire match.

The following fragment converts dates in dd/mm/yyyy format to mm/dd/yyyy format:


$test = "25/12/2000";
print preg_replace("|(\d+)/(\d+)/(\d+)|", "$2/$1/$3", $test);
// output:
// 12/25/2000

Notice that we used a pipe (|) symbol as a delimiter. This is to save us from having to escape the forward slashes in the pattern we want to match.

Instead of a source string, you can pass an array of strings to preg_replace(), and it will transform each string in turn. In this case, the return value will be an array of transformed strings.
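
As a quick sketch of this behavior, the following fragment applies the date conversion to an array of sample dates. The $dates and $converted names are illustrative:

```php
<?php
$dates = array( "25/12/2000", "14/5/2001" );

// Each string in $dates is transformed in turn.
// Single quotes keep PHP from treating $2, $1, and $3 as variables.
$converted = preg_replace( '|(\d+)/(\d+)/(\d+)|', '$2/$1/$3', $dates );

print_r( $converted );
```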

You can also pass arrays of regular expressions and replacement strings to preg_replace(). Each regular expression will be applied to the source string in turn, and its matches will be replaced with the corresponding replacement string. The following fragment transforms date formats as before but also changes copyright information in the source string:


$text = "25/12/99, 14/5/00. Copyright 2003";
$regs = array( "|\b(\d+)/(\d+)/(\d+)\b|", "/([Cc]opyright) 2003/" );
$reps = array( "$2/$1/$3", "$1 2004" );
$text = preg_replace( $regs, $reps, $text );
print "$text<br />";
// output:
// 12/25/99, 5/14/00. Copyright 2004<br />

We create two arrays. The first, $regs, contains two regular expressions, and the second, $reps, contains replacement strings. The first element of the $regs array corresponds to the first element of the $reps array, and so on.

If the array of replacement strings contains fewer elements than the array of regular expressions, patterns matched by those regular expressions without corresponding replacement strings will be replaced with an empty string.

If you pass preg_replace() an array of regular expressions but only a string as replacement, the same replacement string will be applied to each pattern in the array of regular expressions.
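
Both of these cases can be sketched with a short sample string. The variable names here are made up for the illustration:

```php
<?php
$text = "red green blue";
$regs = array( '/red/', '/green/' );

// One replacement for two patterns: "green" has no partner
// and so is replaced with an empty string
$fewer = preg_replace( $regs, array( 'RED' ), $text );

// A single replacement string is applied to every pattern
$same = preg_replace( $regs, 'X', $text );

print "'$fewer'\n";
print "'$same'\n";
```

In the first result, the two spaces that surrounded "green" are left side by side.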

Modifiers

PCREs allow you to modify the way that a pattern is applied through the use of pattern modifiers.

A pattern modifier is a letter that should be placed after the final delimiter in your PCRE. It will refine the behavior of your regular expression.

Table 18.4 lists some PCRE pattern modifiers.

Table 18.4. PCRE Modifiers

Modifier   Description
/i         Case insensitive.
/e         Treats replacement string in preg_replace() as PHP code (removed in PHP 7.0).
/m         $ and ^ anchors match at newlines as well as the beginning and end of the string.
/s         . matches newlines (newlines are not normally matched by .).
/x         Whitespace outside character classes is ignored to aid readability. To match whitespace, use \s, \t, or a space escaped with a backslash.
/A         Matches pattern only at start of string (this modifier is not found in Perl).
/D         $ matches only at the very end of the string, not before a final newline (this modifier is not found in Perl).
/U         Makes the regular expression ungreedy; the minimum number of allowable matches is found (this modifier is not found in Perl).

Where they do not contradict one another, you can combine pattern modifiers. You might want to use the x modifier to make your regular expression easier to read, for example, and also the i modifier to make it match patterns regardless of case. Note the following line:


/ b \S* t /ix

It will match "bat" and "BAT" but not "B A T," for example. Unescaped spaces in a regular expression modified by x are there for aesthetic reasons only and will not match any patterns in the source string.
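
Here is a small sketch that runs this pattern against the three strings just mentioned, storing each result in an illustrative $results array:

```php
<?php
// x lets us space the pattern out; i makes it ignore case
$pattern = '/ b \S* t /ix';
$results = array();

foreach ( array( "bat", "BAT", "B A T" ) as $subject ) {
  $results[$subject] = preg_match( $pattern, $subject );
  print "$subject: " . $results[$subject] . "\n";
}
```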

The m modifier can be useful if you want to match an anchored pattern on multiple lines of text. The anchor patterns ^ and $ match the beginning and end of an entire string by default. The following fragment uses the m modifier to change the behavior of $:


$text = "name: matt\noccupation: coder\neyes: blue\n";
if ( preg_match_all( "/^\w+:\s+(.*)$/m", $text, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}

// output:
// Array
// (
//   [0] => Array
//     (
//       [0] => name: matt
//       [1] => occupation: coder
//       [2] => eyes: blue
//     )
//
//   [1] => Array
//     (
//      [0] => matt
//      [1] => coder
//      [2] => blue
//     )
//
// )

We create a regular expression that will match one or more word characters followed by a colon and one or more whitespace characters. We then match any number of characters followed by the end of string ($) anchor. Because we have used the m pattern modifier, $ matches the end of every line rather than the end of the string.

The s modifier is useful when you want to use . to match characters across multiple lines. The following fragment attempts to access the first and last words of a string:


$text = "start with this line\nand you will reach\na conclusion in the end\n";
if ( preg_match( "/^(\w+).*?(\w+)$/", $text, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}

This code will print nothing. Although the regular expression will find word characters at the beginning of the string, the . will not match the newline characters embedded in the text. The s modifier will change this:


$text = "start with this line\nand you will reach\na conclusion in the end\n";
if ( preg_match( "/^(\w+).*?(\w+)$/s", $text, $array ) ) {
  print "<pre>\n";
  print_r( $array );
  print "</pre>\n";
}

// output:
// Array
// (
//   [0] => start with this line
// and you will reach
// a conclusion in the end
//   [1] => start
//   [2] => end
// )

The e modifier can be particularly powerful. It allows you to treat the replacement string in preg_replace() as if it were PHP code. You can pass back references to functions as arguments, for example, or process lists of numbers. Be aware that this modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so the fragment that follows runs only on older versions of PHP. In the following example, we use the e modifier to pass matched numbers in dates to a function that returns the same date in a new format:


function convDate( $month, $day, $year ) {
  $year = ($year < 70 )?$year+2000:$year;
  $time = ( mktime( 0,0,0,$month,$day,$year) );
  return date("l d F Y", $time);
}

$dates = "3/18/03<br />\n7/22/04";
$dates = preg_replace( "/([0-9]+)\/([0-9]+)\/([0-9]+)/e",
      "convDate($1,$2,$3)", $dates);
print $dates;

// output:
// Tuesday 18 March 2003<br />
// Thursday 22 July 2004

We match any set of three numbers separated by slashes, using parentheses to capture the matched numbers. Because we are using the e modifier, we can call the user-defined function convDate() from the replacement string argument, passing the three back references to the function. convDate() simply takes the numerical input and produces a more verbose date, which replaces the original. Because in our example, we are matching numbers, we do not need to enclose the backreferences in quotes. If we were matching strings, quotes would be necessary around each string backreference.

Using preg_replace_callback() to Replace Patterns

preg_replace_callback() allows you to assign a callback function that will be called for every full match your regular expression finds. preg_replace_callback() requires a regular expression, a reference to a callback function, and the string to be analyzed. Like preg_replace(), it also optionally accepts a limit argument.

The callback function should be designed to accept a single array argument. It will contain the full match at index 0 and each submatch in subsequent positions in the array. Whatever the callback function returns will be incorporated into the string returned by preg_replace_callback().

We can use preg_replace_callback() to rewrite our date-replacement example:


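function convDate( $matches ) {
  // $matches[1] is the month, $matches[2] the day, and $matches[3] the year
  $year = ( $matches[3] < 70 ) ? $matches[3] + 2000 : $matches[3];
  $time = mktime( 0, 0, 0, (int) $matches[1], (int) $matches[2], (int) $year );
  return date("l d F Y", $time);
}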

$dates = "3/18/03<br />\n7/22/04";
$dates = preg_replace_callback( "/([0-9]+)\/([0-9]+)\/([0-9]+)/",
      "convDate", $dates);
print $dates;

// output:
// Tuesday 18 March 2003<br />
// Thursday 22 July 2004

This example calls the convDate() function twice, once for each time the regular expression matches. The day, month, and year figures are then easy to extract from the array that is passed to convDate() and stored in the $matches argument variable.
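
The callback does not have to be a named function. As a sketch, assuming PHP 5.3 or later, you can pass an anonymous function instead. The sample string here is our own:

```php
<?php
$text = "3 apples and 12 pears";

// The anonymous function is called once for each run of digits
$doubled = preg_replace_callback( '/\d+/', function ( $matches ) {
  // $matches[0] holds the full match; return its replacement
  return (string) ( (int) $matches[0] * 2 );
}, $text );

print $doubled . "\n";
```

This keeps the replacement logic next to the pattern it serves, which can be easier to follow when the callback is used in only one place.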

Using preg_split() to Break Up Strings

In Hour 8, you saw that you could split a string of tokens into an array using explode(). This is powerful but limits you to a single set of characters that can be used as a delimiter. PHP's preg_split() function enables you to use the power of regular expressions to define a flexible delimiter. preg_split() requires a string representing a pattern to use as a delimiter and a source string. It also accepts an optional third argument representing a limit to the number of elements you want returned and an optional flag argument. preg_split() returns an array.

The following fragment uses a regular expression with two branches to split a string on a comma followed by a space or the word and surrounded by two spaces:


$text = "apples, oranges, peaches and grapefruit";
$fruitarray = preg_split( "/, | and /", $text );
print "<pre>\n";
print_r( $fruitarray );
print "</pre>\n";

// output:
// Array
// (
//   [0] => apples
//   [1] => oranges
//   [2] => peaches
//   [3] => grapefruit
// )
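
The optional limit and flag arguments can be sketched as follows. PREG_SPLIT_NO_EMPTY is a standard PHP constant that discards empty elements; the sample string and variable names are invented for the illustration:

```php
<?php
$text = "apples,oranges,,peaches";

// By default, the empty string between the two commas is kept
$parts = preg_split( '/,/', $text );

// A limit of 2 returns at most two elements;
// the second holds the remainder of the string
$limited = preg_split( '/,/', $text, 2 );

// A limit of -1 means "no limit"; the flag drops empty elements
$noEmpty = preg_split( '/,/', $text, -1, PREG_SPLIT_NO_EMPTY );

print_r( $parts );
print_r( $limited );
print_r( $noEmpty );
```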


