Team LiB

The Basics of Regular Expressions

Regex is, essentially, a whole new language, with its own rules, structures, and quirks. You'll also find that your knowledge of most other programming languages has practically no bearing on learning regex, for the simple reason that regular expressions are highly specialized and follow their own rules.

As defined by Kleene, the basic regex axioms are the following:

  • A single character is a regular expression denoting itself.

  • A sequence of regular expressions is a regular expression.

  • Any regular expression followed by a * character (also known as the "Kleene star") is a regular expression that matches zero or more consecutive instances of that regular expression.

  • Any pair of regular expressions separated by a pipe character (|) is a regular expression that matches either the left or the right regular expression.

  • Parentheses can be used to group regular expressions.
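The five rules above can be tried out one at a time. The following is a minimal sketch that assumes Python's re module as the engine (the text does not name one); the variable names are only illustrative. It uses re.fullmatch, which succeeds only when the whole string matches the pattern.

```python
import re

# One pattern per rule; fullmatch requires the entire string to match.
single = re.fullmatch(r"a", "a")             # a character denotes itself
sequence = re.fullmatch(r"ab", "ab")         # "a" followed by "b"
star_zero = re.fullmatch(r"a*", "")          # zero instances of "a"
star_many = re.fullmatch(r"a*", "aaa")       # three instances of "a"
either = re.fullmatch(r"a|b", "b")           # left or right alternative
grouped = re.fullmatch(r"(ab)*", "ababab")   # the star applies to the group

results = [single, sequence, star_zero, star_many, either, grouped]
print(all(m is not None for m in results))   # True

# Without parentheses the star binds only to "b", so "abab" is rejected.
ungrouped = re.fullmatch(r"ab*", "abab")
print(ungrouped)                             # None
```

The last two patterns differ only in the parentheses, which is what determines how much of the expression the star repeats.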

This may sound complicated to you, and I'm fairly sure that it scared me the first time I read through it. However, the basics are easy to understand. First, the simplest regular expression is a single character. For example, the regex a will match the character "a" in the word Marco. Notice that, under normal circumstances, regular expressions are case-sensitive (characters are compared exactly as written), so that "a" is not equivalent to "A". Therefore, the regex a will not match the "A" in MARCO.
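The case-sensitivity described here is easy to check. This is a small sketch that assumes Python's re module; the re.IGNORECASE flag shown at the end is specific to that module and is included only to illustrate what "under normal circumstances" leaves room for.

```python
import re

lower = re.search(r"a", "Marco")   # finds the "a" at index 1
upper = re.search(r"a", "MARCO")   # no lowercase "a" in the string

print(lower.start())               # 1
print(upper)                       # None

# Python-specific: a flag can switch off case-sensitivity explicitly.
relaxed = re.search(r"a", "MARCO", re.IGNORECASE)
print(relaxed.group(0))            # A
```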

Next, single-character regular expressions can be combined into a sequence by placing them next to each other. Thus, the regex wonderful will match the word "wonderful" in "Today is a wonderful day."
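For a purely literal pattern like this one, a regex search behaves much like an ordinary substring search. A brief sketch, again assuming Python's re module, shows the two side by side:

```python
import re

text = "Today is a wonderful day."

match = re.search(r"wonderful", text)
print(match.group(0))            # wonderful
print(match.span())              # (11, 20)

# A plain substring search reports the same starting position.
print(text.find("wonderful"))    # 11
```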

So far, regular expressions are not very different from normal search operations. However, this is where the similarities end. As I mentioned earlier, you can use Kleene's Star to create a regular expression that can be repeated any number of times (including none). For example, consider the following string:

seeking the treasures of the sea

The regex se* will be interpreted as "the letter s followed by zero or more instances of the letter e" and match the following:

  • The letters "see" of the word "seeking," where the regex e is repeated twice.

  • Both instances of the letter s in "treasures," where s is followed by zero instances of e.

  • The letters "se" of the word "sea," where the e is present once.
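The matches listed above can be reproduced with a short sketch. It assumes Python's re module, whose findall function returns every non-overlapping match from left to right; the four results correspond to "seeking," the two s characters in "treasures," and "sea."

```python
import re

text = "seeking the treasures of the sea"

matches = re.findall(r"se*", text)
print(matches)   # ['see', 's', 's', 'se']

# The spans show where each match sits in the string.
spans = [m.span() for m in re.finditer(r"se*", text)]
print(spans)     # [(0, 3), (16, 17), (20, 21), (29, 31)]
```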

It's important to understand that, in the preceding expression, the star applies only to the expression e that immediately precedes it. Although it's possible to use parentheses to group regular expressions, you should not be tempted to think that (se)* is an equivalent way of writing it, because the regex compiler will interpret it as meaning "zero or more occurrences" of "se."

If you apply this regex to the preceding string, you will encounter roughly 30 matches (the exact total depends on how the engine reports empty matches), nearly all of them empty, because zero occurrences of "se" match the empty string at every position in the string. (Remember? Zero or more occurrences!)
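To see why (se)* matches almost everywhere, it helps to separate the empty matches from the non-empty ones. The sketch below assumes Python's re module; the total number of empty matches varies between engines and even between versions of the same engine, so the code prints it rather than assuming a fixed value.

```python
import re

text = "seeking the treasures of the sea"

# group(0) is the whole match; most of these will be empty strings.
found = [m.group(0) for m in re.finditer(r"(se)*", text)]

non_empty = [s for s in found if s]
empty_count = len(found) - len(non_empty)

print(non_empty)      # ['se', 'se']
print(empty_count)    # engine-dependent; one empty match per unmatched position
```

Only the "se" in "seeking" and the "se" in "sea" produce real matches; everything else is the pattern succeeding with zero repetitions.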

You will find that parentheses are often useful in conjunction with the pipe operator to specify alternatives within a larger expression. For example, applying the expression gr(u|a)b to the following string:

grab the grub and pull

matches both "grab" and "grub."
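The alternation example can be sketched as follows, assuming Python's re module. One detail is specific to that module: when a pattern contains a group, re.findall returns the text of the group instead of the whole match, so the sketch uses re.finditer to collect the full words.

```python
import re

text = "grab the grub and pull"

# group(0) is the complete match, regardless of the parentheses.
words = [m.group(0) for m in re.finditer(r"gr(u|a)b", text)]
print(words)      # ['grab', 'grub']

# Python-specific: findall reports only what the group captured.
letters = re.findall(r"gr(u|a)b", text)
print(letters)    # ['a', 'u']
```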
