
Perl-Compatible Regular Expressions (PCRE)

Perl-Compatible Regular Expressions (PCRE) are much more powerful than their POSIX counterparts and, consequently, also more complex and difficult to use.

PCRE adds its own character classes and escape sequences to the extended regular expression rules that we saw earlier:

  • \w represents a "word" character and is equivalent to the expression [A-Za-z0-9_] (letters, digits, and the underscore).

  • \W represents the opposite of \w and is equivalent to [^A-Za-z0-9_].

  • \s represents a whitespace character.

  • \S represents a nonwhitespace character.

  • \d represents a digit and is equivalent to [0-9].

  • \D represents a nondigit character and is equivalent to [^0-9].

  • \n represents a newline character.

  • \r represents a carriage return character.

  • \t represents a tab character.
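
As a minimal sketch of how these classes behave, the following script tests a few illustrative strings with preg_match(), the PCRE matching function covered later in this section. The sample strings and the $checks and $results variable names are assumptions made for this example only.

```php
<?php

    // Each entry pairs a pattern with an illustrative subject string
    $checks = array(
        array('/^\w+$/', 'user42'),         // only word characters
        array('/^\d+$/', '2024'),           // only digits
        array('/^\S+\s\S+$/', 'two words'), // two words split by one whitespace
        array('/^\D+$/', 'abc1'),           // fails: the string contains a digit
    );

    $results = array();
    foreach ($checks as $check) {
        list($pattern, $subject) = $check;
        // preg_match() returns 1 on a match and 0 otherwise
        $results[] = preg_match($pattern, $subject);
        echo $pattern, ' against "', $subject, '": ', end($results), "\n";
    }

?>
```

The ^ and $ anchors force each pattern to cover the whole string, which makes it easier to see exactly which characters a class accepts.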

As you can see, PCRE are significantly more concise than their POSIX counterparts. In fact, our simple email validation regex can now be written as

/\w+@\w+\.\w{2,4}/

But wait a minute: what are those slash characters at the beginning and at the end of the regex string? PCRE requires that the actual regular expression be enclosed between two delimiter characters. By convention, two forward slashes are used, although any character that is not alphanumeric, a backslash, or whitespace would do just as well.

Naturally, regardless of which character you choose, you will be required to escape the delimiter whenever you use it as part of the regex itself. For example:

/face\/off/

is the equivalent of the regular expression face/off.
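
The choice of delimiter can be sketched as follows. The $path string is a made-up input, and the hash character is just one possible alternative delimiter; both calls are expected to match.

```php
<?php

    // Hypothetical subject string containing literal slashes
    $path = 'movies/face/off';

    // Slash delimiters: every literal slash in the pattern must be escaped
    $withSlashes = preg_match('/face\/off/', $path);

    // Hash delimiters: the slash no longer needs escaping
    $withHashes = preg_match('#face/off#', $path);

    var_dump($withSlashes, $withHashes);

?>
```

Picking a delimiter that does not occur in the pattern tends to keep expressions that deal with paths or URLs more readable.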

PCRE also expands on the concept of references, making them useful not only as a byproduct of the regex operation, but as part of the operation itself.

In PCRE, it is possible to use a reference that was defined previously in a regular expression as part of the expression itself. Consider an example. Suppose that you find yourself in a situation in which you have to verify that in a string such as the following:

Marco is a programmer. Marco's specialty is programming.
John is a programmer. John's specialty is programming.

The goal is to verify that the name of the person to whom each sentence refers is the same in both positions (that is, "Marco" or "John"). A normal search-and-replace operation would take significant effort, and so would a POSIX regex, because you do not know the person's name in advance.

With a PCRE, however, this operation is trivial. You start by matching the first portion of the string. The name is the first word:

/^(\w+) is a programmer\.

Next, you specify the name again. As you can see, we included it in parentheses in the preceding expression, which means that we create a reference to it. We can now recall that reference inside the regex itself and use it to our advantage:

/^(\w+) is a programmer\. \1's specialty is programming\.$/

If you try to match the preceding regex against the following sentence:

Marco is a programmer. Marco's specialty is programming.

everything will work fine. However, if you try it against this sentence:

Marco is a programmer. John's specialty is programming.

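the regex engine will not return a match, because the text captured by the first group ("Marco") differs from the name in the second position ("John"), so the \1 reference fails.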
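
The two cases above can be checked with a short script. This is only a sketch: it uses preg_match(), which is introduced later in this section, and it escapes the periods so that they match only a literal dot. The $pattern, $sentences, and $results names are chosen for illustration.

```php
<?php

    // \1 must repeat exactly what the first group (\w+) captured
    $pattern = '/^(\w+) is a programmer\. \1\'s specialty is programming\.$/';

    $sentences = array(
        'Marco is a programmer. Marco\'s specialty is programming.',
        'Marco is a programmer. John\'s specialty is programming.',
    );

    $results = array();
    foreach ($sentences as $sentence) {
        // preg_match() returns 1 on a match and 0 otherwise
        $results[] = preg_match($pattern, $sentence);
        echo end($results) ? "MATCH\n" : "NO MATCH\n";
    }

?>
```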

To give you an idea of how powerful PCREs are and why it's worth trying to learn them, let me give you an alternative to the simple one-line expression using POSIX:

<?php

    $s = 'Marco is a programmer. Marco\'s specialty is programming.';

    // Note: ereg() was deprecated in PHP 5.3 and removed in PHP 7.0.
    if (ereg ('^([[:alpha:]]+) is a programmer', $s, $matches)) {
      if (ereg ('([[:alpha:]]+)\'s specialty is programming\.$', $s, $matches2)) {
        // Compare the name captured by each of the two expressions
        if ($matches[1] === $matches2[1]) {
          echo "MATCH\n";
        } else {
          echo "NO MATCH\n";
        }
      } else {
        echo "NO MATCH\n";
      }
    } else {
      echo "NO MATCH\n";
    }

?>

Now, this is a simple example, and the POSIX solution is definitely not as elegant as it could be, but you can see here that it takes three separate operations to approximate the power of just one PCRE.

I should note that the inability to use references within the regex itself is actually a limitation of PHP, rather than of the POSIX standard. Unfortunately, this means that the PHP implementation of regex is not POSIX compliant.

The main PCRE function in PHP is preg_match():

preg_match (pattern, string[, matches[, flags]]);

As in the case of ereg(), this function matches the regular expression stored in pattern against string and stores any matched references in matches. The optional flags parameter can contain the value PREG_OFFSET_CAPTURE (PHP 7.2 and later also accept PREG_UNMATCHED_AS_NULL). If PREG_OFFSET_CAPTURE is specified, preg_match() changes the format of matches so that it contains both the text and the position of each reference inside string. Here's an example:

<?php

    $s = 'Another beautiful day';

    preg_match ('/beautiful/', $s, $matches, PREG_OFFSET_CAPTURE);

    var_dump ($matches);

?>

If you execute this script, you should receive the following output:

array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(9) "beautiful"
    [1]=>
    int(8)
  }
}

As you can see, the $matches array now contains another array for each reference. The latter, in turn, contains both the string matched and its position within $s.

Another function of the PCRE family is preg_match_all(), which has a syntax similar to that of preg_match(), but searches a string for all the occurrences of a regular expression, rather than stopping at the first one. Here's an example:

<?php

    $s = 'A beautiful day and a beauty of a lake';

    preg_match_all ('/beaut[^ ]+/', $s, $matches);

    var_dump ($matches);

?>

If you execute this script, it will output the following:

array(1) {
  [0]=>
  array(2) {
    [0]=>
    string(9) "beautiful"
    [1]=>
    string(6) "beauty"
  }
}

As you can see, the $matches array contains an array whose elements are arrays that correspond to the matches found for each of the references. In this case, because no reference was specified, only the 0th element of the array is present, but it contains both the string "beautiful" and "beauty". By contrast, if you had executed this regex using preg_match(), only the word "beautiful" would have been returned.
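
To see what happens when a reference is present, here is a hedged variation of the same script. The parentheses around [^ ]+ are an addition made for this sketch and are not part of the original example.

```php
<?php

    $s = 'A beautiful day and a beauty of a lake';

    // One capturing group: the text that follows "beaut" in each word
    preg_match_all('/beaut([^ ]+)/', $s, $matches);

    // $matches[0] holds the full matches, $matches[1] the first reference
    print_r($matches);

?>
```

With this pattern, $matches[0] should contain "beautiful" and "beauty", while $matches[1] should contain "iful" and "y", one entry per occurrence in each sub-array.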

Search-and-replace operations in the world of PCRE are handled by the preg_replace function:

preg_replace (pattern, replacement, string[, limit]);

Much like ereg_replace(), this function applies the regex pattern to string and then substitutes the placeholders in replacement with the references defined in it. The limit parameter can be used to limit the number of replacements to a maximum number. Here's an example, which will output marcot at tabini dot ca:

<?php

    $s = 'marcot@tabini.ca';

    echo preg_replace ('/^(\w+)@(\w+)\.(\w{2,4})/', '\1 at \2 dot \3', $s);

?>

Keep in mind that this is only one way of using preg_replace(), in which the entire input string is substituted by the replacement string. In fact, you can use this function to replace only small portions of text:

<?php

    $s = 'The pen is on the table';

    echo preg_replace ('/on/', 'over', $s);

?>

If you execute this script, preg_replace() will replace the word "on" with the word "over" in $s, resulting in the output The pen is over the table.
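
The optional limit parameter can be sketched in the same way. The input string and the $all and $firstTwo variable names are assumptions made for this example.

```php
<?php

    $s = 'one, two, three, four';

    // Without a limit, every comma is replaced
    $all = preg_replace('/,/', ';', $s);

    // With a limit of 2, only the first two commas are replaced
    $firstTwo = preg_replace('/,/', ';', $s, 2);

    echo $all, "\n", $firstTwo, "\n";

?>
```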

The last function that I want to bring to your attention is preg_split(), which is somewhat equivalent to the explode() function that we discussed earlier, with the difference that it takes a regular expression as a delimiter, rather than a straight string, and that it includes a few additional features:

preg_split (pattern, string[, limit[, flags]]);

The preg_split() function works by breaking string into substrings separated by sequences of characters that match pattern. The optional limit parameter can be used to specify the maximum number of substrings to return, with any remaining text placed in the last substring. The flags parameter, on the other hand, can be used to modify the behavior of the function as described in Table 3.2.

Table 3.2. preg_split() Flags

Flag                         Description

PREG_SPLIT_NO_EMPTY          Causes empty substrings to be discarded.

PREG_SPLIT_DELIM_CAPTURE     Causes the text matched by any references (parenthesized subpatterns) inside pattern to be captured and returned as part of the function's output.

PREG_SPLIT_OFFSET_CAPTURE    Causes the position of each substring to be returned as part of the function's output (similar to PREG_OFFSET_CAPTURE in preg_match()).


Here's an example of how preg_split() can be used:

<?php

    $s = 'Ten times he called, and ten times nobody answered';

    var_dump (preg_split ('/[ ,]/', $s));

?>

This script causes the string $s to be split whenever either a space or a comma is found, resulting in the following output:

array(10) {
  [0]=>
  string(3) "Ten"
  [1]=>
  string(5) "times"
  [2]=>
  string(2) "he"
  [3]=>
  string(6) "called"
  [4]=>
  string(0) ""
  [5]=>
  string(3) "and"
  [6]=>
  string(3) "ten"
  [7]=>
  string(5) "times"
  [8]=>
  string(6) "nobody"
  [9]=>
  string(8) "answered"
}

As you can imagine, the explode() function by itself would have been inadequate in this case, because it can split a string only on a single fixed delimiter string, not on a set of alternative characters.
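
Two of the flags from Table 3.2 can be sketched as follows. A limit of -1 means "no limit", and the second input string ('a,b') is a made-up value chosen only to keep the result short.

```php
<?php

    $s = 'Ten times he called, and ten times nobody answered';

    // Same pattern as before, but empty substrings are discarded
    $words = preg_split('/[ ,]/', $s, -1, PREG_SPLIT_NO_EMPTY);

    // DELIM_CAPTURE also returns what the parenthesized group matched
    $parts = preg_split('/(,)/', 'a,b', -1, PREG_SPLIT_DELIM_CAPTURE);

    var_dump(count($words), $parts);

?>
```

With PREG_SPLIT_NO_EMPTY, the empty element between the comma and the following space disappears, so $words should hold nine elements instead of ten. The $parts array should contain "a", ",", and "b".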

Named Patterns

An excellent and very useful addition to PCRE is the concept of named capturing groups (which everybody always refers to as named patterns). A named capturing group lets you refer to a subpattern of your expression by an arbitrary name, rather than by its position inside the regular expression. For example, consider the following regex:

/^Name=(.+)$/

Now, you would normally address the text captured by the (.+) subpattern as element 1 of the match array returned by preg_match() (or as $1 or \1 in a substitution performed through a call to preg_replace()).

That's all well and good, at least as long as you have only a limited number of subpatterns whose position never changes. Heaven forbid, however, that you should ever find yourself in a position to have to add a capturing subpattern at the beginning of a regex that already has six of them!

Luckily, this problem can be solved once and for all by assigning a "name" to each of your subpatterns. Take a look at the following:

/^Name=(?P<thename>.+)$/

This will create a named group inside your expression that can be explicitly retrieved by using the name thename. If you run this regex through preg_match(), the captured text will be inserted in the match array both by number (using the normal numbering rules) and by name. Inside the pattern itself, you can backreference a named group by enclosing its name in parentheses and prefixing it with ?P=, as in (?P=thename). The replacement string of preg_replace(), on the other hand, accepts only numbered references, so there you still refer to the group by its position. For example:

preg_replace ('/^Name=(?P<thename>.+)$/', 'My name is \1', $value);
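
As a minimal sketch of named groups in practice, the following script reads a group by name with preg_match() and then uses a named backreference inside a pattern. The input strings and the group name word are assumptions made for illustration.

```php
<?php

    // Hypothetical input line, chosen only for illustration
    $value = 'Name=Marco';

    preg_match('/^Name=(?P<thename>.+)$/', $value, $matches);

    // The group is available both by name and by position
    echo $matches['thename'], "\n";   // Marco
    echo $matches[1], "\n";           // Marco

    // (?P=word) refers back, inside the pattern, to the group named "word"
    $repeated = preg_match('/^(?P<word>\w+) (?P=word)$/', 'hello hello');
    $different = preg_match('/^(?P<word>\w+) (?P=word)$/', 'hello world');

    var_dump($repeated, $different);

?>
```

Because the name stays attached to the group, adding another capturing subpattern in front of it changes the group's number but not the key used to read it from $matches.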
