6.3 Regular Expressions and Character Decoding

Example 6-3 demonstrates the text-matching capabilities of the java.util.regex package. This BGrep class is a variant of the Unix "grep" command for searching files for text that matches a given regular expression. Unlike Unix grep, which is line-oriented, BGrep is block-oriented: the matched text can span multiple lines, and its location in the file is indicated by character number rather than line number. Invoke BGrep with the regular expression to search for and one or more filenames. Use -i to specify case-insensitive matching, and add -s if the text contains non-ASCII characters. If the files use an encoding other than UTF-8, use the -e option to specify the encoding. For example, you could use this command to search a set of Java source files for occurrences of "ByteBuffer", "CharBuffer", and the like:

java je3.nio.BGrep '[A-Z][a-z]*Buffer' *.java
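
To make the block-oriented behavior concrete, the following is a minimal sketch (not part of BGrep) of a match that spans a line break and is reported by character offset. The class name BlockMatchSketch and the helper firstMatch( ) are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the je3.nio package.
class BlockMatchSketch {
    /** Returns "offset: text" for the first match of regex in text, or null. */
    static String firstMatch(String regex, String text) {
        // MULTILINE mirrors BGrep's default flag: ^ and $ match at line breaks
        Matcher m = Pattern.compile(regex, Pattern.MULTILINE).matcher(text);
        return m.find() ? m.start() + ": " + m.group() : null;
    }

    public static void main(String[] args) {
        String text = "int x =\n    42;\n";

        // Searching the whole block: \s+ is free to cross the newline,
        // so the match starts at character 6 (the '=' sign).
        System.out.println(firstMatch("=\\s+42", text) != null);   // prints true

        // Searching line by line, as a line-oriented tool would, finds nothing.
        for (String line : text.split("\n")) {
            System.out.println(firstMatch("=\\s+42", line));       // prints null
        }
    }
}
```

The same pattern fails on each individual line because the text it describes is split across two of them, which is the case a line-oriented grep cannot handle.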

The java.util.regex package uses a regular expression syntax that is much like that of Perl 5. Look up java.util.regex.Pattern in Sun's javadocs or in Java in a Nutshell for a summary of this syntax, and look up the Matcher class in the same package for details on how to use Pattern objects to match character sequences. If you are not already familiar with regular expressions, you can find complete details in the book Mastering Regular Expressions, by Jeffrey Friedl (O'Reilly).
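
As a small, self-contained sketch of how Pattern and Matcher work together, the code below applies the pattern from the command above to an in-memory string. The class name FindAllSketch and the helper findAll( ) are assumptions made for this illustration; only Pattern.compile( ), matcher( ), find( ), start( ), and group( ) come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the je3.nio package.
class FindAllSketch {
    /** Collects "offset:text" for every match of regex in text. */
    static List<String> findAll(String regex, CharSequence text) {
        List<String> hits = new ArrayList<>();
        Pattern pattern = Pattern.compile(regex);   // compile once
        Matcher matcher = pattern.matcher(text);    // bind pattern to the text
        while (matcher.find()) {                    // advance to the next match
            hits.add(matcher.start() + ":" + matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "ByteBuffer b; CharBuffer c; StringBuffer s;";
        System.out.println(findAll("[A-Z][a-z]*Buffer", text));
        // prints [0:ByteBuffer, 14:CharBuffer, 28:StringBuffer]
    }
}
```

Because matcher( ) accepts any CharSequence, the same loop works unchanged on a String, a StringBuilder, or a CharBuffer.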

This program also demonstrates an easy way to read the contents of a file: simply use the memory-mapping capabilities of FileChannel to map the contents of the entire file into a ByteBuffer. In order to perform pattern matching on the characters in a file, the bytes of the file must be decoded into characters; this example uses a simple Charset method to decode a complete ByteBuffer into a newly allocated CharBuffer all at once. This CharBuffer is then used with a java.util.regex.Matcher object to look for pattern matches. Later examples in this chapter will illustrate lower-level character decoding techniques.
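
The map-then-decode steps just described can be sketched in isolation as follows. This is an illustration under stated assumptions rather than the listing's own code: it uses FileChannel.open( ), java.nio.file.Files, and try-with-resources, which require Java 7 or later, whereas Example 6-3 obtains its channel from a FileInputStream. The class name MapAndDecodeSketch and the helper readAll( ) are hypothetical.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the je3.nio package.
class MapAndDecodeSketch {
    /** Maps the whole file into memory and decodes it in a single step. */
    static CharBuffer readAll(Path file, Charset charset) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer bytes =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            // Charset.decode() substitutes a replacement character for
            // malformed input instead of throwing an exception.
            return charset.decode(bytes);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 file; \u00e9 (e with acute accent) takes two bytes.
        Path file = Files.createTempFile("bgrep-sketch", ".txt");
        file.toFile().deleteOnExit();
        Files.write(file, "caf\u00e9 ByteBuffer".getBytes(StandardCharsets.UTF_8));

        CharBuffer chars = readAll(file, StandardCharsets.UTF_8);
        Matcher m = Pattern.compile("[A-Z][a-z]*Buffer").matcher(chars);
        if (m.find()) {
            System.out.println(m.start() + ": " + m.group());
            // prints 5: ByteBuffer
        }
    }
}
```

The reported position is 5 even though the match begins at byte 6 of the file, because the matcher sees decoded characters rather than raw bytes.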

Example 6-3. BGrep.java

package je3.nio;
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
import java.nio.channels.*;
import java.util.regex.*;

/**
 * BGrep: a regular expression search utility, like Unix grep, but
 * block-oriented instead of line-oriented.  For any match found, the
 * filename and character position within the file (note: not the line
 * number) are printed along with the text that matched.
 *
 * Usage:
 *   java je3.nio.BGrep [options] <pattern> <files>...
 *
 * Options:
 *   -e <encoding> specifies an encoding. UTF-8 is the default
 *   -i enables case-insensitive matching.  Use -s also for non-ASCII text
 *   -s enables strict (but slower) processing of non-ASCII characters
 * 
 * This program requires that each file to be searched fits into main
 * memory, and so does not work with extremely large files.
 **/
public class BGrep {
    public static void main(String[  ] args) {
        String encodingName = "UTF-8";  // Default to UTF-8 encoding
        int flags = Pattern.MULTILINE;  // Default regexp flags

        try { // Fatal exceptions are handled after this try block
            // First, process any options
            int nextarg = 0;
            // startsWith( ) is safe for empty arguments, unlike charAt(0)
            while(args[nextarg].startsWith("-")) { 
                String option = args[nextarg++];
                if (option.equals("-e")) {
                    encodingName = args[nextarg++];
                }
                else if (option.equals("-i")) {  // case-insensitive matching
                    flags |= Pattern.CASE_INSENSITIVE;
                }
                else if (option.equals("-s")) { // Strict Unicode processing
                    flags |= Pattern.UNICODE_CASE; // case-insensitive Unicode
                    flags |= Pattern.CANON_EQ;     // canonicalize Unicode
                }
                else {
                    System.err.println("Unknown option: " + option);
                    usage( );
                }
            }
            
            // Get the Charset for converting bytes to chars
            Charset charset = Charset.forName(encodingName);

            // Next argument must be a regexp. Compile it to a Pattern object
            Pattern pattern = Pattern.compile(args[nextarg++], flags);

            // Require that at least one file is specified
            if (nextarg == args.length) usage( );  

            // Loop through each of the specified filenames
            while(nextarg < args.length) {
                String filename = args[nextarg++];
                CharBuffer chars;  // This will hold complete text of the file
                try {  // Handle per-file errors locally
                    // Open a FileChannel to the named file
                    FileInputStream stream = new FileInputStream(filename);
                    FileChannel f = stream.getChannel( );
                
                    // Memory-map the file into one big ByteBuffer.  This is
                    // easy but may be somewhat inefficient for short files.
                    ByteBuffer bytes = f.map(FileChannel.MapMode.READ_ONLY,
                                             0, f.size( ));
                
                    // We can close the file once it is mapped into memory.
                    // Closing the stream closes the channel, too.
                    stream.close( );

                    // Decode the entire ByteBuffer into one big CharBuffer
                    chars = charset.decode(bytes);
                }
                catch(IOException e) { // File not found or other problem
                    System.err.println(e);   // Print error message
                    continue;                // and move on to the next file
                }
                
                // This is the basic regexp loop for finding all matches in a
                // CharSequence. Note that CharBuffer implements CharSequence. 
                // A Matcher holds state for a given Pattern and text.
                Matcher matcher = pattern.matcher(chars);
                while(matcher.find( )) { // While there are more matches
                    // Print out details of the match
                    System.out.println(filename + ":" +       // file name
                                       matcher.start( )+": "+  // character pos
                                       matcher.group( ));      // matching text
                }
            }
        }
        // These are the things that can go wrong in the code above
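        catch(IllegalCharsetNameException e) {    // Malformed encoding name
            System.err.println("Illegal encoding name: " + encodingName);
        }
        catch(UnsupportedCharsetException e) {    // Unknown encoding name
            System.err.println("Unknown encoding: " + encodingName);
        }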
        catch(PatternSyntaxException e) {         // Bad pattern
            System.err.println("Syntax error in search pattern:\n" +
                               e.getMessage( ));
        }
        catch(ArrayIndexOutOfBoundsException e) { // Wrong number of arguments
            usage( );
        }
    }
    
    /** A utility method to display invocation syntax and exit. */
    public static void usage( ) { 
        System.err.println("Usage: java je3.nio.BGrep [-e <encoding>] [-i]" +
                           " [-s] <pattern> <filename>...");
        System.exit(1);
    }
}
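
The usage comment in the listing advises combining -i with -s for non-ASCII text. The sketch below suggests why: Pattern.CASE_INSENSITIVE on its own folds case only for US-ASCII characters, and Pattern.UNICODE_CASE extends the folding to the rest of Unicode. The class name CaseFlagsSketch and the helper found( ) are hypothetical names used only here, and the Pattern.CANON_EQ flag that -s also sets is not exercised.

```java
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of the je3.nio package.
class CaseFlagsSketch {
    /** Returns true if regex, compiled with flags, matches anywhere in text. */
    static boolean found(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String regex = "\u00e9cole";   // lowercase word starting with e-acute
        String text  = "\u00c9COLE";   // the same word in uppercase

        int insensitive = Pattern.CASE_INSENSITIVE;            // like -i
        int strict = insensitive | Pattern.UNICODE_CASE;       // like -i -s

        System.out.println(found("buffer", insensitive, "BUFFER")); // prints true
        System.out.println(found(regex, insensitive, text));        // prints false
        System.out.println(found(regex, strict, text));             // prints true
    }
}
```

The ASCII-only default is cheaper to evaluate, which is consistent with the listing describing -s as the strict but slower option.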
