18.4 Regular Expression Matching

Splitting and joining strings is a simple way to process text, as long as it follows the format you expect. For more general text analysis tasks, Python provides regular expression matching utilities. Regular expressions (REs) are simply strings that define patterns to be matched against other strings. You supply a pattern and a string, and ask if the string matches your pattern. After a match, parts of the string matched by parts of the pattern are made available to your script. That is, matches not only give a yes/no answer, but they can pick out substrings as well.

Regular expression pattern strings can be complicated (let's be honest -- they can be downright gross to look at). But once you get the hang of them, they can replace larger hand-coded string search routines. In Python, regular expressions are not part of the syntax of the Python language itself, but are supported by extension modules that you must import to use. The modules define functions for compiling pattern strings into pattern objects, matching these objects against strings, and fetching matched substrings after a match.

Beyond those generalities, Python's regular expression story is complicated a little by history: there are two modules to consider, the original regex module and the newer re module, which supports Perl-like RE syntax.
Up until very recently, re was generally slower than regex, so you had to choose between speed and Perl-like RE syntax. Today, though, re has been optimized with the sre implementation, to the extent that regex no longer offers any clear advantages. Moreover, re in Python 2.0 now supports matching Unicode strings (strings with 16-bit wide characters for representing large character sets).

Because of this migration, I've recoded RE examples in this text to use the new re module instead of regex. The old regex-based versions are still available on the book's CD (see http://examples.oreilly.com/python2), in directory PP2E\lang\old-regex. If you find yourself having to migrate old regex code, you can also find a document describing the translation steps needed at http://www.python.org. Both modules' interfaces are similar, but re introduces a match object and changes pattern syntax in minor ways.

Having said that, I also want to warn you that REs are a complex topic that cannot be covered in depth here. If this area sparks your interest, the text Mastering Regular Expressions from O'Reilly is a good next step to take.

18.4.1 Using the re Module

The Python re module comes with functions that can search for patterns right away or make compiled pattern objects for running matches later. Pattern objects (and module search calls) in turn generate match objects, which contain information about successful matches and matched substrings. The next few sections describe the module's interfaces and some of the operators you can use to code patterns.

18.4.1.1 Module functions

The top level of the module provides functions for matching, substitution, pre-compiling, and so on (for instance, compile, match, search, split, and sub).
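As a rough illustration of the module-level calls, here is a minimal sketch written for Python 3 (an assumption on my part; the listings in this text use Python 2 print statements). The sample string and variable names are made up for the demonstration.

```python
import re

text = "spam=1, eggs=22, ham=333"          # made-up sample data

# re.compile: precompile a pattern string into a pattern object
pattobj = re.compile(r"[a-z]+=(\d+)")
found = pattobj.findall(text)              # all substrings saved by the group

# re.match: succeeds only if the pattern matches at the start of the string
first = re.match(r"([a-z]+)=(\d+)", text)

# re.search: scans ahead for the first place the pattern matches
later = re.search(r"eggs=(\d+)", text)

# re.split: splits the string wherever the pattern matches
fields = re.split(r",\s*", text)

# re.sub: replaces every match with the replacement string
masked = re.sub(r"\d+", "N", text)

print(found)                               # ['1', '22', '333']
print(first.group(1), first.group(2))      # spam 1
print(later.start(), later.group(1))       # 8 22
print(fields)                              # ['spam=1', 'eggs=22', 'ham=333']
print(masked)                              # spam=N, eggs=N, ham=N
```

Note that match and search return None when nothing matches, so real code should test the result before calling methods on it.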
18.4.1.2 Compiled pattern objects

At the next level, pattern objects provide similar attributes, but the pattern string is implied. The re.compile function in the previous section is useful to optimize patterns that may be matched more than once (compiled patterns match faster). Pattern objects returned by re.compile export methods such as match, search, split, and sub, which mirror the module functions of the same names.
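The following sketch shows the same kinds of calls made through a compiled pattern object, so the pattern string is passed only once. It is written for Python 3 (an assumption; this text's own listings are Python 2), and the data string is invented for the example.

```python
import re

pattobj = re.compile(r"\s*,\s*")           # a comma with optional whitespace
data = "a , b,c ,  d"                      # made-up sample data

print(pattobj.pattern)                     # the original pattern string

parts = pattobj.split(data)                # like re.split(pattern, data)
joined = pattobj.sub(";", data)            # like re.sub(pattern, ";", data)
hit = pattobj.search(data)                 # first match anywhere in data
miss = pattobj.match(data)                 # None: data does not start with a match

print(parts)                               # ['a', 'b', 'c', 'd']
print(joined)                              # a;b;c;d
print(hit.start(), miss)                   # 1 None
```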
18.4.1.3 Match objects

Finally, when a match or search function or method is successful, you get back a match object (None comes back on failed matches). Match objects export a set of attributes of their own, including the group, groups, start, end, and span methods.
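To make the match object interface a bit more concrete, here is a small sketch in Python 3 (an assumption; the rest of this text uses Python 2). The search string is invented; the methods called are standard match object methods.

```python
import re

line = "size: width=640 height=480"        # made-up sample data
mobj = re.search(r"(\w+)=(\d+)", line)     # first name=number pair

print(mobj.group(0))                       # whole match: width=640
print(mobj.group(1), mobj.group(2))        # numbered groups: width 640
print(mobj.groups())                       # all groups: ('width', '640')
print(mobj.start(), mobj.end())            # offsets of the whole match: 6 15
print(mobj.span(2))                        # offsets of group 2: (12, 15)

# A failed match returns None instead of a match object
print(re.search(r"depth=(\d+)", line))     # None
```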
18.4.1.4 Regular expression patterns

Regular expression strings are built up by concatenating single-character regular expression forms, shown in Table 18-1. The longest-matching string is usually matched by each form, except for the non-greedy operators. In the table, R means any regular expression form, C is a character, and N denotes a digit.
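The difference between longest-match (greedy) and non-greedy forms is easiest to see side by side. This is a minimal Python 3 sketch (an assumption; this text's listings are Python 2) using a made-up string; `*?` is the non-greedy counterpart of `*` in the re module.

```python
import re

line = "<b>bold</b> and <i>italic</i>"     # made-up sample data

# Greedy: '.*' grabs as much as it can, so the match runs to the last '>'
greedy = re.search(r"<.*>", line).group()

# Non-greedy: '.*?' grabs as little as it can, so the match stops at the first '>'
lazy = re.search(r"<.*?>", line).group()

print(greedy)                              # <b>bold</b> and <i>italic</i>
print(lazy)                                # <b>
```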
Within patterns, ranges and selections can be combined. For instance, [a-zA-Z0-9_]+ matches the longest possible string of one or more letters, digits, or underscores. Special characters can be escaped as usual in Python strings: [\t ]* matches zero or more tabs and spaces (i.e., it skips whitespace). The parenthesized grouping construct, (R), lets you extract matched substrings after a successful match. The portion of the string matched by the expression in parentheses is retained in a numbered register. It's available through the group method of a match object after a successful match. In addition to the entries in this table, special sequences in Table 18-2 can be used in patterns, too. Due to Python string rules, you sometimes must double up on backslashes (\\) or use Python raw strings (r'...') to retain backslashes in the pattern.
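The points above about sets, groups, and backslashes can be sketched as follows. This is Python 3 code (an assumption; the examples in this text are Python 2), and the sample strings are invented.

```python
import re

# [a-zA-Z0-9_]+ matches the longest run of letters, digits, or underscores
word = re.search("[a-zA-Z0-9_]+", "  spam_42 = eggs").group()

# [\t ]* skips leading tabs and spaces; the (R) group saves what it matched
mobj = re.match("[\t ]*([a-zA-Z0-9_]+)", "\t  spam_42 = eggs")
name = mobj.group(1)

# Doubled backslashes and raw strings spell the same pattern
same = ("\\bspam\\b" == r"\bspam\b")

# With a raw string, \b reaches re intact and means word boundary
found = re.search(r"\bspam\b", "a spam b").start()

# Without one, Python turns "\b" into a backspace character first
lost = re.search("\bspam\b", "a spam b")

print(word, name)                          # spam_42 spam_42
print(same, found, lost)                   # True 2 None
```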
The Python library manual gives additional details. But to demonstrate how the re interfaces are typically used, we'll turn to some short examples.

18.4.2 Basic Patterns

To illustrate how to combine RE operators, let's start with a few short test files that match simple pattern forms. Comments in Example 18-3 describe the operations exercised; check Table 18-1 to see which operators are used in these patterns.

Example 18-3. PP2E\lang\re-basics.py

# literals, sets, ranges (all print 2 = offset where pattern found)
import re                                  # the one to use today

pattern, string = "A.C.", "xxABCDxx"       # nonspecial chars match themselves
matchobj = re.search(pattern, string)      # '.' means any one char
if matchobj:                               # search returns match object or None
    print matchobj.start( )                # start is index where matched

pattobj = re.compile("A.*C.*")             # 'R*' means zero or more Rs
matchobj = pattobj.search("xxABCDxx")      # compile returns pattern obj
if matchobj:                               # patt.search returns match obj
    print matchobj.start( )

# selection sets
print re.search(" *A.C[DE][D-F][^G-ZE]G\t+ ?", "..ABCDEFG\t..").start( )

# alternatives
print re.search("A|XB|YC|ZD", "..AYCD..").start( )  # R1|R2 means R1 or R2

# word boundaries
print re.search(r"\bABCD", "..ABCD ").start( )      # \b means word boundary
print re.search(r"ABCD\b", "..ABCD ").start( )      # use r'...' to escape '\'
Notice that there are a variety of ways to kick off a match with re -- by calling module search functions and by making compiled pattern objects. In either event, you can hang on to the resulting match object or not. All the print statements in this script show a result of 2 -- the offset where the pattern was found in the string. In the first test, for example, "A.C." matches the "ABCD" at offset 2 in the search string (i.e., after the first "xx"):

C:\...\PP2E\Lang>python re-basics.py
2
2
2
2
2
2

In Example 18-4, parts of the pattern strings enclosed in parentheses delimit groups; the parts of the string they matched are available after the match.

Example 18-4. PP2E\lang\re-groups.py

# groups (extract substrings matched by REs in '( )' parts)
import re
patt = re.compile("A(.)B(.)C(.)") # saves 3 substrings
mobj = patt.match("A0B1C2") # each '( )' is a group, 1..n
print mobj.group(1), mobj.group(2), mobj.group(3) # group( ) gives substring
patt = re.compile("A(.*)B(.*)C(.*)") # saves 3 substrings
mobj = patt.match("A000B111C222") # groups( ) gives all groups
print mobj.groups( )
print re.search("(A|X)(B|Y)(C|Z)D", "..AYCD..").groups( )
patt = re.compile(r"[\t ]*#\s*define\s*([a-z0-9_]*)\s*(.*)")
mobj = patt.search(" # define spam 1 + 2 + 3") # parts of C #define
print mobj.groups( ) # \s is whitespace
In the first test here, for instance, the three (.) groups each match a single character, but retain the character matched; calling group pulls out the bits matched. The second test's (.*) groups match and retain any number of characters. The last test here matches C #define lines; more on this later.

C:\...\PP2E\Lang>python re-groups.py
0 1 2
('000', '111', '222')
('A', 'Y', 'C')
('spam', '1 + 2 + 3')
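The tests in Example 18-4 call groups directly on the result, which works only because every pattern there is known to match. Since a failed match returns None, a more defensive version might look like the following Python 3 sketch (an assumption; the example itself is Python 2). The parse_define helper is a hypothetical name introduced just for this illustration; the pattern is the one used in the example's last test.

```python
import re

# Same #define pattern as the last test in Example 18-4
patt = re.compile(r"[\t ]*#\s*define\s*([a-z0-9_]*)\s*(.*)")

def parse_define(line):
    """Return (name, body) for a #define line, or None if it is not one."""
    mobj = patt.search(line)
    if mobj is None:                       # no match: nothing to extract
        return None
    return mobj.groups()

good = parse_define(" # define spam 1 + 2 + 3")
bad = parse_define("#include <stdio.h>")

print(good)                                # ('spam', '1 + 2 + 3')
print(bad)                                 # None
```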
Finally, besides matches and substring extraction, re also includes tools for string replacement or substitution (see Example 18-5).

Example 18-5. PP2E\lang\re-subst.py

# substitutions (replace occurrences of patt with repl in string)
import re
print re.sub('[ABC]', '*', 'XAXAXBXBXCXC')
print re.sub('[ABC]_', '*', 'XA-XA_XB-XB_XC-XC_')
In the first test, all characters in the set are replaced; in the second, they must be followed by an underscore:

C:\...\PP2E\Lang>python re-subst.py
X*X*X*X*X*X*
XA-X*XB-X*XC-X*

18.4.3 Scanning C Header Files for Patterns

The script in Example 18-6 puts these pattern operators to more practical use. It uses regular expressions to find #define and #include lines in C header files and extract their components. The generality of the patterns makes them detect a variety of line formats; pattern groups (the parts in parentheses) are used to extract matched substrings from a line after a match.

Example 18-6. PP2E\Lang\cheader.py

#! /usr/local/bin/python
import sys, re
from string import strip

pattDefine = re.compile(                               # compile to pattobj
    '^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')   # "# define xxx yyy..."

pattInclude = re.compile(
    '^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/\.]+)')     # "# include <xxx>..."

def scan(file):
    count = 0
    while 1:                                     # scan line-by-line
        line = file.readline( )
        if not line: break
        count = count + 1
        matchobj = pattDefine.match(line)        # None if match fails
        if matchobj:
            name = matchobj.group(1)             # substrings for (...) parts
            body = matchobj.group(2)
            print count, 'defined', name, '=', strip(body)
            continue
        matchobj = pattInclude.match(line)
        if matchobj:
            start, stop = matchobj.span(1)       # start/stop indexes of (...)
            filename = line[start:stop]          # slice out of line
            print count, 'include', filename     # same as matchobj.group(1)

if len(sys.argv) == 1:
    scan(sys.stdin)                    # no args: read stdin
else:
    scan(open(sys.argv[1], 'r'))       # arg: input file name
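For readers on a current Python, the scanning logic of Example 18-6 can be sketched roughly as follows. This is a Python 3 adaptation, not the book's listing: the scan_lines function, the patt_define and patt_include names, and the sample lines are all hypothetical, and the function collects results in a list instead of printing so that it can be checked without a file on disk. The two patterns are the ones from the example, written as raw strings.

```python
import re

# Patterns from Example 18-6, as raw strings
patt_define = re.compile(r'^#[\t ]*define[\t ]+([a-zA-Z0-9_]+)[\t ]*(.*)')
patt_include = re.compile(r'^#[\t ]*include[\t ]+[<"]([a-zA-Z0-9_/\.]+)')

def scan_lines(lines):
    """Return (line number, kind, name, body) tuples for matching lines."""
    found = []
    for count, line in enumerate(lines, start=1):
        mobj = patt_define.match(line)             # None if match fails
        if mobj:
            found.append((count, 'defined', mobj.group(1), mobj.group(2).strip()))
            continue
        mobj = patt_include.match(line)
        if mobj:
            found.append((count, 'include', mobj.group(1), ''))
    return found

sample = [                                         # made-up header lines
    '#ifndef TEST_H\n',
    '#define TEST_H\n',
    '\n',
    '#include <stdio.h>\n',
    '#  define SPAM    1234\n',
    '#define ADDER(arg) 123 + arg\n',
]
results = scan_lines(sample)
for item in results:
    print(item)
```

Returning tuples rather than printing is a design choice made for this sketch only; it keeps the matching logic separate from the output format.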
To test, let's run the cheader.py script on the text file in Example 18-7.

Example 18-7. PP2E\Lang\test.h

#ifndef TEST_H
#define TEST_H

#include <stdio.h>
#include <lib/spam.h>
# include "Python.h"

#define DEBUG
#define HELLO 'hello regex world'
# define SPAM 1234

#define EGGS sunny + side + up
#define ADDER(arg) 123 + arg
#endif

Notice the spaces after # in some of these lines; regular expressions are flexible enough to account for such departures from the norm. Here is the script at work, picking out #include and #define lines and their parts; for each matched line, it prints the line number, the line type, and any matched substrings:

C:\...\PP2E\Lang>python cheader.py test.h
2 defined TEST_H =
4 include stdio.h
5 include lib/spam.h
6 include Python.h
8 defined DEBUG =
9 defined HELLO = 'hello regex world'
10 defined SPAM = 1234
12 defined EGGS = sunny + side + up
13 defined ADDER = (arg) 123 + arg

18.4.4 A File Pattern Search Utility

The next script searches for patterns in a set of files, much like the grep command-line program. We wrote file and directory searchers earlier, in Chapter 5. Here, the file searches look for patterns instead of simple strings (see Example 18-8). The patterns are typed interactively, separated by spaces, and the files to be searched are specified by an input pattern for Python's glob.glob filename expansion tool, which we also studied earlier.

Example 18-8. PP2E\Lang\pygrep1.py

#!/usr/local/bin/python
import sys, re, glob
from string import split

help_string = """
Usage options.
interactive: % pygrep1.py
"""

def getargs( ):
    if len(sys.argv) == 1:
        return split(raw_input("patterns? >")), raw_input("files? >")
    else:
        try:
            # split so patterns is a list, as in interactive mode
            return split(sys.argv[1]), sys.argv[2]
        except IndexError:                           # too few arguments
            print help_string
            sys.exit(1)

def compile_patterns(patterns):
    res = []
    for pattstr in patterns:
        try:
            res.append(re.compile(pattstr))          # make re patt object
        except re.error:                             # or use re.match
            print 'pattern ignored:', pattstr
    return res

def searcher(pattfile, srchfiles):
    patts = compile_patterns(pattfile)               # compile for speed
    for file in glob.glob(srchfiles):                # all matching files
        lineno = 1                                   # glob uses re too
        print '\n[%s]' % file
        for line in open(file, 'r').readlines( ):    # all lines in file
            for patt in patts:
                if patt.search(line):                # try all patterns
                    print '%04d)' % lineno, line,    # match if not None
                    break
            lineno = lineno+1

if __name__ == '__main__':
    apply(searcher, getargs( ))
Here's what a typical run of this script looks like; it searches all Python files in the current directory for two different patterns, compiled for speed. Notice that files are named by a pattern, too -- Python's glob module also uses re internally:

C:\...\PP2E\Lang>python pygrep1.py
patterns? >import.*string spam
files? >*.py
[cheader.py]
[finder2.py]
0002) import string, glob, os, sys
[patterns.py]
0048) mobj = patt.search(" # define spam 1 + 2 + 3")
[pygrep1.py]
[rules.py]
[summer.py]
0002) import string
[__init__.py]
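The core of this search, compiling several patterns and reporting each line matched by any of them, can be sketched in Python 3 as follows. This is an adaptation under stated assumptions, not the book's script: it searches a list of lines in memory rather than files named by glob, grep_lines and the sample lines are hypothetical, and compile_patterns mirrors the function of the same name in Example 18-8.

```python
import re

def compile_patterns(pattstrs):
    """Compile each pattern string, skipping any that are not valid REs."""
    compiled = []
    for pattstr in pattstrs:
        try:
            compiled.append(re.compile(pattstr))
        except re.error:                       # raised for malformed patterns
            print('pattern ignored:', pattstr)
    return compiled

def grep_lines(pattstrs, lines):
    """Return (line number, line) pairs for lines matched by any pattern."""
    patts = compile_patterns(pattstrs)
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if any(patt.search(line) for patt in patts):
            hits.append((lineno, line))
    return hits

source = [                                     # made-up file contents
    "import sys\n",
    "import string, glob\n",
    "x = 'spam'\n",
    "print(x)\n",
]
# "(" is deliberately malformed to show the 'pattern ignored' path
hits = grep_lines(["import.*string", "spam", "("], source)
for lineno, line in hits:
    print('%04d)' % lineno, line, end='')
```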