
3.8 Tokenizing a Character Stream

Example 3-6 was a Reader implementation wrapped around another Reader. ReaderTokenizer (Example 3-7) is a Tokenizer implementation wrapped around a Reader. The Tokenizer interface was shown in Example 2-8, and the ReaderTokenizer class shown here is a subclass of the AbstractTokenizer class of Example 2-9.

As its name implies, ReaderTokenizer tokenizes the text it reads from a Reader stream. The class implements the abstract createBuffer( ) and fillBuffer( ) methods of its superclass, and you may want to reread Example 2-9 to refresh your memory about the interactions between ReaderTokenizer and AbstractTokenizer.
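To make that contract concrete before the full listing, here is a minimal sketch (not part of the book's example code) of a tokenizer whose "stream" is an in-memory CharSequence rather than a Reader. It relies only on the protected fields and methods that Example 3-7 itself uses (text, numChars, tokenStart, tokenEnd, p, and maximumTokenLength()); the class name StringSourceTokenizer is hypothetical.

import je3.classes.AbstractTokenizer;

/**
 * Hypothetical illustration only: the same createBuffer()/fillBuffer()
 * contract as ReaderTokenizer, but the text comes from a CharSequence.
 **/
public class StringSourceTokenizer extends AbstractTokenizer {
    CharSequence seq;   // the text to tokenize
    int nextChar = 0;   // index of the next character to copy into the buffer

    public StringSourceTokenizer(CharSequence seq, int bufferSize) {
        this.seq = seq;
        maximumTokenLength(bufferSize);  // superclass passes this to createBuffer()
    }

    // Called once by the superclass to allocate the shared buffer
    protected void createBuffer(int bufferSize) {
        assert text == null;               // must only be called once
        this.text = new char[bufferSize];
        this.numChars = 0;                 // the buffer starts out empty
    }

    // Discard already-tokenized characters, then copy in the next chunk
    protected boolean fillBuffer() {
        if (tokenStart > 0) {              // shift the pending token to the front
            System.arraycopy(text, tokenStart, text, 0, numChars-tokenStart);
            tokenEnd -= tokenStart;
            p -= tokenStart;
            numChars -= tokenStart;
            tokenStart = 0;
        }
        if (nextChar >= seq.length()) return false;   // nothing left to copy
        while (numChars < text.length && nextChar < seq.length())
            text[numChars++] = seq.charAt(nextChar++); // append more characters
        return true;
    }
}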

Example 3-7 includes an inner class named Test that reads and tokenizes characters from a FileReader, listing the tokens it reads on the standard output. It also writes the text of each token to a FileWriter, producing a copy of the input file and demonstrating that the tokenizer accounts for every character of the input (as long as it is not configured to skip spaces). Like its superclass, ReaderTokenizer uses the assert keyword and must therefore be compiled with the -source 1.4 option to javac.

Example 3-7. ReaderTokenizer.java
package je3.io;
import je3.classes.Tokenizer;
import je3.classes.AbstractTokenizer;
import java.io.*;

/**
 * This Tokenizer implementation extends AbstractTokenizer to tokenize a stream
 * of text read from a java.io.Reader.  It implements the createBuffer( ) and
 * fillBuffer( ) methods required by AbstractTokenizer.  See that class for
 * details on how these methods must behave.  Note that a buffer size may
 * be selected, and that this buffer size also determines the maximum token
 * length.  The Test class is a simple test that tokenizes a file and uses
 * the tokens to produce a copy of the file.
 **/
public class ReaderTokenizer extends AbstractTokenizer {
    Reader in;

    // Create a ReaderTokenizer with a default buffer size of 16K characters
    public ReaderTokenizer(Reader in) { this(in, 16*1024); }

    public ReaderTokenizer(Reader in, int bufferSize) {
        this.in = in;  // Remember the reader to read input from
        // Tell our superclass about the selected buffer size.
        // The superclass will pass this number to createBuffer( )
        maximumTokenLength(bufferSize);
    }

    // Create a buffer to tokenize.
    protected void createBuffer(int bufferSize) {
        // Make sure AbstractTokenizer only calls this method once
        assert text == null;
        this.text = new char[bufferSize];  // the new buffer
        this.numChars = 0;                 // how much text it contains
    }

    // Fill or refill the buffer.
    // See AbstractTokenizer.fillBuffer( ) for what this method must do.
    protected boolean fillBuffer( ) throws IOException {
        // Make sure AbstractTokenizer is upholding its end of the bargain
        assert text!=null && 0 <= tokenStart && tokenStart <= tokenEnd &&
            tokenEnd <= p && p <= numChars && numChars <= text.length;

        // First, shift already tokenized characters out of the buffer
        if (tokenStart > 0) {
            // Shift array contents
            System.arraycopy(text, tokenStart, text, 0, numChars-tokenStart);
            // And update buffer indexes
            tokenEnd -= tokenStart; 
            p -= tokenStart;
            numChars -= tokenStart;
            tokenStart = 0; 
        }

        // Now try to read more characters into the buffer
        int numread = in.read(text, numChars, text.length-numChars);
        // If there are no more characters, return false
        if (numread == -1) return false;
        // Otherwise, adjust the number of valid characters in the buffer
        numChars += numread;
        return true;  
    }

    // This test class tokenizes a file, reporting the tokens to standard out
    // and creating a copy of the file to demonstrate that every input
    // character is accounted for (since spaces are not skipped).
    public static class Test {
        public static void main(String[] args) throws java.io.IOException {
            Reader in = new FileReader(args[0]);
            PrintWriter out = new PrintWriter(new FileWriter(args[0]+".copy"));
            ReaderTokenizer t = new ReaderTokenizer(in);
            t.tokenizeWords(true).tokenizeNumbers(true).tokenizeSpaces(true);
            while(t.next( ) != Tokenizer.EOF) {
                switch(t.tokenType( )) {
                case Tokenizer.EOF:
                    System.out.println("EOF"); break;
                case Tokenizer.WORD:
                    System.out.println("WORD: " + t.tokenText( )); break;
                case Tokenizer.NUMBER:
                    System.out.println("NUMBER: " + t.tokenText( )); break;
                case Tokenizer.SPACE:
                    System.out.println("SPACE"); break;
                default:
                    System.out.println((char)t.tokenType( ));
                }
                out.print(t.tokenText( ));  // Copy token to the file
            }
            out.close( );
        }
    }
}
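
For a quick experiment that doesn't involve a file, the same API can be driven from an in-memory string via java.io.StringReader. The following fragment is a minimal sketch that assumes only the methods and constants the Test class above already uses; the class name ReaderTokenizerDemo and the sample input are hypothetical.

import java.io.StringReader;
import je3.classes.Tokenizer;
import je3.io.ReaderTokenizer;

public class ReaderTokenizerDemo {
    public static void main(String[] args) throws java.io.IOException {
        // Tokenize a short string instead of a file
        ReaderTokenizer t = new ReaderTokenizer(new StringReader("buffer = 16 * 1024;"));
        t.tokenizeWords(true).tokenizeNumbers(true).tokenizeSpaces(true);
        while (t.next() != Tokenizer.EOF) {
            switch (t.tokenType()) {
            case Tokenizer.WORD:
                System.out.println("WORD: " + t.tokenText()); break;
            case Tokenizer.NUMBER:
                System.out.println("NUMBER: " + t.tokenText()); break;
            case Tokenizer.SPACE:
                System.out.println("SPACE"); break;
            default:
                // Other characters report their own value as the token type
                System.out.println((char)t.tokenType());
            }
        }
    }
}

Run against the sample string, this reports a WORD token for "buffer", NUMBER tokens for 16 and 1024, SPACE tokens between them, and the punctuation characters individually, just as Test does for the contents of a file.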