
2.8 Tokenizing Text

We end this chapter with an extended (and more complex) example in three parts. Example 2-8 is a listing of Tokenizer.java. This Tokenizer interface defines an API for tokenizing text. Tokenizing simply means breaking text into chunks; tokenizers are also known as lexers or scanners and are commonly used when writing parsers. This Tokenizer interface is intended to provide an alternative to java.util.StringTokenizer, which is too simple for many uses, and java.io.StreamTokenizer, which is complex and poorly documented.
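To see why java.util.StringTokenizer is often too simple, note that it reports every token as a plain String, split on a fixed set of delimiter characters, with no notion of token types. A minimal demonstration (the class and method names here are our own, not from the book):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class StringTokenizerDemo {
    // Collect every token StringTokenizer produces for the given delimiters.
    public static List<String> tokens(String input, String delims) {
        List<String> result = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input, delims);
        while (st.hasMoreTokens()) result.add(st.nextToken());
        return result;
    }

    public static void main(String[] args) {
        // Every token is just a String: there is no way to ask whether
        // "count" was a word and "10" a number, and punctuation that is
        // not listed as a delimiter is glued onto adjacent tokens.
        System.out.println(tokens("count = 10", " ="));  // [count, 10]
    }
}
```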

As an interface, Tokenizer doesn't do anything itself. But Example 2-8 is followed by an implementation in Examples 2-9 and 2-10. Following a pattern that you'll also see frequently in Java platform APIs, the implementation is broken into two classes: AbstractTokenizer, an abstract class that implements the methods of Tokenizer in terms of a small number of abstract methods, and CharSequenceTokenizer, a concrete subclass for tokenizing String and StringBuffer (or any CharSequence) objects. To demonstrate the flexibility of this implementation scheme, we'll see other Tokenizer implementations based on AbstractTokenizer throughout this book: ReaderTokenizer (for tokenizing character streams) is defined in Example 3-7, ChannelTokenizer (for tokenizing text read from high-performance "channels" of the New I/O API) is defined in Example 6-8, and MappedFileTokenizer (for tokenizing memory-mapped files) is defined in Example 6-7.

In addition to demonstrating the use of interfaces, abstract implementation classes, and concrete subclasses, Examples 2-8 through 2-10 are interesting because their public and protected members are fully documented using javadoc comments and javadoc tags. Space limitations prevent the use of this verbose documentation style elsewhere in the book, but these three classes provide a fully fleshed-out example of proper javadoc documentation. You can produce HTML javadoc documentation for these classes with the javadoc tool, using commands like the following:

cd ~/Examples/je3/classes
javadoc -source 1.4 -d api Tokenizer.java AbstractTokenizer.java \
      CharSequenceTokenizer.java

2.8.1 The Tokenizer Interface

Example 2-8 is the file Tokenizer.java. Because it contains complete javadoc comments, it is self-documenting. The documentation is a little hard to read in source-code form because it contains unformatted javadoc and HTML tags, but with a careful reading, you'll be able to understand the Tokenizer API. (And you should make sure you do understand it before moving on to the implementations that follow.) To get started, note the following things about the API:

  • tokenType( ) returns the type of the current token, and tokenText( ) returns the characters that comprise the token.

  • Token types are integers. Negative values represent special tokens, such as words and numbers. Most positive values are character codes that represent single-character tokens, such as punctuation characters. Positive values that match an opening quote character represent quote tokens instead.

  • next( ) reads the next token, making it the current token, and returns its type.

  • When a Tokenizer is first created, it returns every input character as a separate token. You must call various configuration methods to tell the tokenizer what kind of tokens (words, numbers, spaces, keywords, quotes) you are interested in.
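The integer token-type convention in the second point can be sketched in isolation. The constants below copy the values defined by the Tokenizer interface; the describe( ) method and class name are hypothetical illustrations, not part of the API:

```java
public class TokenTypeDemo {
    // Same convention as the Tokenizer interface: special token types are
    // negative constants; a single-character token is reported as the
    // (non-negative) Unicode value of the character itself.
    public static final int EOF = -1, SPACE = -2, NUMBER = -3, WORD = -4;

    public static String describe(int tokenType) {
        switch (tokenType) {
            case EOF:    return "end of input";
            case SPACE:  return "run of whitespace";
            case NUMBER: return "run of digits";
            case WORD:   return "run of word characters";
            default:     return "single character '" + (char) tokenType + "'";
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(NUMBER));  // run of digits
        System.out.println(describe('('));     // single character '('
    }
}
```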

Example 2-8. Tokenizer.java
package je3.classes;
import java.io.IOException;

/**
 * This interface defines basic character sequence tokenizing capabilities.
 * It can serve as the underpinnings of simple parsers.
 * <p>
 * The methods of this class fall into three categories:
 * <ul>
 * <li>methods to configure the tokenizer, such as {@link #skipSpaces} and
 *     {@link #tokenizeWords}.
 * <li>methods to read a token: {@link #next}, {@link #nextChar}, and
 *     {@link #scan(char,boolean,boolean,boolean)}.
 * <li>methods to query the current token, such as {@link #tokenType},
 *     {@link #tokenText} and {@link #tokenKeyword}.
 * </ul>
 * <p>
 * In its default state, a Tokenizer performs no tokenization at all: 
 * {@link #next} returns each input character as an individual token.
 * You must call one or more configuration methods to specify the type of
 * tokenization to be performed.  Note that the configuration methods all
 * return the Tokenizer object so that repeated method calls can be chained.
 * For example:
 * <pre>
 * Tokenizer t;
 * t.skipSpaces( ).tokenizeNumbers( ).tokenizeWords( ).quotes("'#","'\n");
 * </pre>
 * <p>
 * One particularly important configuration method is
 * {@link #maximumTokenLength}
 * which is used to specify the maximum token length in the input.  A
 * Tokenizer implementation must ensure that it can handle tokens at least
 * this long, typically by allocating a buffer at least that long.
 * <p>
 * The constant fields of this interface are token type constants.
 * Note that their values are all negative. Non-negative token types
 * always represent Unicode characters.
 * <p>
 * A tokenizer may be in one of three states: <ol>
 * <li>Before any tokens have been read.  In this state, {@link #tokenType}
 * always returns {@link #BOF}, and {@link #tokenLine} always returns 0.
 * {@link #maximumTokenLength} and {@link #trackPosition} may only be called
 * in this state.
 * <li>During tokenization.  In this state, {@link #next}, {@link #nextChar},
 * and {@link #scan(char,boolean,boolean,boolean)} are being called to tokenize
 * input characters, but none of these methods has yet returned {@link #EOF}.
 * Configuration methods other than those listed above may be called from this
 * state to dynamically change tokenizing behavior.
 * <li>End-of-file.  Once one of the tokenizing methods has returned EOF,
 * the tokenizer has reached the end of its input.  Any subsequent calls to
 * the tokenizing methods or to {@link #tokenType} will return EOF. Most 
 * methods may still be called from this state, although it is not useful 
 * to do so.
 * </ol>
 * @author David Flanagan
 */
public interface Tokenizer {
    // The following are token type constants.
    /** End-of-file.  Returned when there are no more characters to tokenize */
    public static final int EOF = -1;
    /** The token is a run of whitespace. @see #tokenizeSpaces( ) */
    public static final int SPACE = -2;
    /** The token is a run of digits. @see #tokenizeNumbers( ) */
    public static final int NUMBER = -3;
    /** The token is a run of word characters. @see #tokenizeWords( ) */
    public static final int WORD = -4;
    /** The token is a keyword. @see #keywords( ) */
    public static final int KEYWORD = -5;
    /** 
     * The token is arbitrary text returned by
     * {@link #scan(char,boolean,boolean,boolean)}.
     */
    public static final int TEXT = -6;
    /**
     * Beginning-of-file. This is the value returned by {@link #tokenType}
     * when it is called before tokenization begins.
     */
    public static final int BOF = -7;
    /** Special return value for {@link #scan(char,boolean,boolean,boolean)}.*/
    public static final int OVERFLOW = -8; // internal buffer overflow

    /**
     * Specify whether to skip spaces or return them.
     * @param skip If false (the default), then return whitespace characters
     *             or tokens.  If true, then next( ) never returns whitespace.
     * @return this Tokenizer object for method chaining.
     * @see #tokenizeSpaces
     */
    public Tokenizer skipSpaces(boolean skip);

    /**
     * Specify whether adjacent whitespace characters should be coalesced
     * into a single SPACE token.  This has no effect if spaces are being
     * skipped.  The default is false.
     * @param tokenize whether {@link #next} should coalesce adjacent
     *    whitespace into a single {@link #SPACE} token.
     * @return this Tokenizer object for method chaining.
     * @see #skipSpaces
     */
    public Tokenizer tokenizeSpaces(boolean tokenize);

    /**
     * Specify whether adjacent digit characters should be coalesced into
     * a single token.  The default is false.
     * @param tokenize whether {@link #next} should coalesce adjacent digits
     *    into a single {@link #NUMBER} token.
     * @return this Tokenizer object for method chaining.
     */
    public Tokenizer tokenizeNumbers(boolean tokenize);
    
    /**
     * Specify whether adjacent word characters should be coalesced into
     * a single token.  The default is false. Word characters are defined by
     * a {@link WordRecognizer}.
     * @param tokenize whether {@link #next} should coalesce adjacent word
     *    characters into a single {@link #WORD} token.
     * @return this Tokenizer object for method chaining.
     * @see #wordRecognizer
     */
    public Tokenizer tokenizeWords(boolean tokenize);

    /**
     * Specify a {@link Tokenizer.WordRecognizer} to define what constitutes a
     * word. If set to null (the default), then words are defined by
     * {@link Character#isJavaIdentifierStart} and
     * {@link Character#isJavaIdentifierPart}.
     * This has no effect if word tokenizing has not been enabled.
     * @param wordRecognizer the {@link Tokenizer.WordRecognizer} to use.
     * @return this Tokenizer object for method chaining.
     * @see #tokenizeWords
     */
    public Tokenizer wordRecognizer(WordRecognizer wordRecognizer);

    /**
     * Specify keywords to receive special recognition.
     * If a {@link #WORD} token matches one of these keywords, then the token 
     * type will be set to {@link #KEYWORD}, and {@link #tokenKeyword} will
     * return the index of the keyword in the specified array.
     * @param keywords an array of words to be treated as keywords, or null
     *                 (the default) for no keywords.
     * @return this Tokenizer object for method chaining.
     * @see #tokenizeWords
     */
    public Tokenizer keywords(String[] keywords);

    /**
     * Specify whether the tokenizer should keep track of the line number
     * and column number for each returned token.  The default is false.
     * If set to true, then tokenLine( ) and tokenColumn( ) return the
     * line and column numbers of the current token.  
     * @param track whether to track the line and column numbers for each
     *         token.  
     * @return this Tokenizer object for method chaining.
     * @throws java.lang.IllegalStateException
     *         if invoked after tokenizing begins
     * @see #tokenizeWords
     */
    public Tokenizer trackPosition(boolean track);

    /**
     * Specify pairs of token delimiters.  If the tokenizer encounters
     * any character in <tt>openquotes</tt>, then it will scan until it
     * encounters the corresponding character in <tt>closequotes</tt>.
     * When such a token is tokenized, {@link #tokenType} returns the character
     * from <tt>openquotes</tt> that was recognized and {@link #tokenText}
     * returns the characters between, but not including, the delimiters.
     * Note that no escape characters are recognized. Quote tokenization occurs
     * after other types of tokenization so <tt>openquotes</tt> should not
     * include whitespace, number or word characters, if spaces, numbers, or
     * words are being tokenized.
     * <p>
     * Quote tokenization is useful for tokens other than quoted strings.
     * For example to recognize single-quoted strings and single-line
     * comments, you might call this method like this:
     * <code>quotes("'#", "'\n");</code>
     *
     * @param openquotes The string of characters that can begin a quote.
     * @param closequotes The string of characters that end a quote
     * @return this Tokenizer object for method chaining.
     * @throws java.lang.NullPointerException if either argument is null
     * @throws java.lang.IllegalArgumentException if <tt>openquotes</tt> and 
     *         <tt>closequotes</tt> have different lengths.
     * @see #scan(char,boolean,boolean,boolean)
     */
    public Tokenizer quotes(String openquotes, String closequotes);

    /**
     * Specify the maximum token length that the Tokenizer is required to
     * accommodate. If presented with an input token longer than the specified
     * size, the Tokenizer's behavior is undefined. Implementations must typically
     * allocate an internal buffer at least this large, but may use a smaller
     * buffer if they know that the total length of the input is smaller.
     * Implementations should document their default value, and are encouraged
     * to define constructors that take the token length as an argument.
     *
     * @param size maximum token length the tokenizer must handle. Must be > 0.
     * @return this Tokenizer object for method chaining.
     * @throws java.lang.IllegalArgumentException if <tt>size</tt> < 1.
     * @throws java.lang.IllegalStateException
     *         if invoked after tokenizing begins
     */
    public Tokenizer maximumTokenLength(int size);

    /**
     * This nested interface defines what a "word" is.
     * @see Tokenizer#tokenizeWords
     * @see Tokenizer#wordRecognizer
     */
    public static interface WordRecognizer {
        /**
         * Determine whether <tt>c</tt> is a valid word start character.
         * @param c the character to test
         * @return true if a word may begin with the character <tt>c</tt>.
         */
        public boolean isWordStart(char c);

        /**
         * Determine whether a word that begins with <tt>firstChar</tt> may
         * contain <tt>c</tt>.
         * @param c the character to test.
         * @param firstChar the character that started this word
         * @return true if a word that begins with <tt>firstChar</tt> may
         *         contain the character <tt>c</tt>
         */
        public boolean isWordPart(char c, char firstChar);
    }

    
    /**
     * Get the type of the current token. Valid token types are the token
     * type constants (all negative values) defined by this interface, and all
     * Unicode characters.  Positive return values typically represent 
     * punctuation characters or other single characters that were not 
     * tokenized.  But see {@link #quotes} for an exception.
     * @return the type of the current token, or {@link #BOF} if no tokens
     *     have been read yet, or {@link #EOF} if no more tokens are available.
     */
    public int tokenType( );

    /**
     * Get the text of the current token.
     * @return the text of the current token as a String, or null, when
     *   {@link #tokenType} returns {@link #BOF} or {@link #EOF}.
     *   Tokens delimited by quote characters (see {@link #quotes}) do not
     *   include the opening and closing delimiters, so this method may return
     *   the empty string when an empty quote is tokenized.  The same is
     *   possible after a call to {@link #scan(char,boolean,boolean,boolean)}.
     */
    public String tokenText( );

    /**
     * Get the index of the tokenized keyword.
     * @return the index into the keywords array of the tokenized word or 
     *   -1 if the current token type is not {@link #KEYWORD}.
     * @see #keywords
     */
    public int tokenKeyword( ); 

    /**
     * Get the line number of the current token.  
     * @return The line number of the start of the current token. Lines
     * are numbered from 1, not 0. This method returns 0 if the tokenizer is
     * not tracking token position or if tokenizing has not started yet, or if
     * the current token is {@link #EOF}.
     * @see #trackPosition
     */
    public int tokenLine( );

    /**
     * Get the column number of the current token.  
     * @return The column of the start of the current token. Columns
     * are numbered from 1, not 0. This method returns 0 if the tokenizer is
     * not tracking token position or if tokenizing has not started yet, or if
     * the current token is {@link #EOF}.
     * @see #trackPosition
     */
    public int tokenColumn( );

    /**
     * Make the next token of input the current token, and return its type.
     * Implementations must tokenize input using the following algorithm, and
     * must perform each step in the order listed. <ol>
     *
     * <li>If there are no more input characters, set the current token to
     * {@link #EOF} and return that value.
     * 
     * <li>If configured to skip or tokenize spaces, and the current character
     * is whitespace, coalesce any subsequent whitespace characters into a 
     * token.  If spaces are being skipped, start tokenizing a new token;
     * otherwise, make the spaces the current token and return {@link #SPACE}.
     * See {@link #skipSpaces}, {@link #tokenizeSpaces}, and
     * {@link Character#isWhitespace}.
     * 
     * <li>If configured to tokenize numbers and the current character is a 
     * digit, coalesce all adjacent digits into a single token, make it the
     * current token, and return {@link #NUMBER}. See {@link #tokenizeNumbers}
     * and {@link Character#isDigit}
     *
     * <li>If configured to tokenize words, and the current character is a
     * word character, coalesce all adjacent word characters into a single
     * token, and make it the current token. If the word matches a registered
     * keyword, determine the keyword index and return {@link #KEYWORD}.
     * Otherwise return {@link #WORD}. Determine whether a character is a 
     * word character using the registered {@link WordRecognizer}, if any, 
     * or with {@link Character#isJavaIdentifierStart} and
     * {@link Character#isJavaIdentifierPart}.  See also
     * {@link #tokenizeWords} and {@link #wordRecognizer}.
     * 
     * <li>If configured to tokenize quotes or other delimited tokens, and the
     * current character appears in the string of opening delimiters, then
     * scan until the character at the same position in the string of closing
     * delimiters is encountered or until there is no more input or the
     * maximum token size is reached.  Coalesce the characters between (but
     * not including) the delimiters into a single token, set the token type
     * to the opening delimiter, and return this character.
     * See {@link #quotes}.
     * 
     * <li>If none of the steps above has returned a token, then make the
     * current character the current token, and return the current character.
     * </ol>
     *
     * @return the type of the next token, or {@link #EOF} if there are 
     *         no more tokens to be read.
     * @see #nextChar
     * @see #scan(char,boolean,boolean,boolean)
     */
    public int next( ) throws IOException;

    /**
     * Make the next character of input the current token, and return it.
     * @return the next character or {@link #EOF} if there are no more.
     * @see #next
     * @see #scan(char,boolean,boolean,boolean)
     */
    public int nextChar( ) throws IOException;

    /** 
     * Scan until the first occurrence of the specified delimiter character.
     * Because a token scanned in this way may contain arbitrary characters,
     * the current token type is set to {@link #TEXT}.
     * @param delimiter the character to scan until.
     * @param extendCurrentToken if true, the scanned characters extend the
     *   current token.  Otherwise, they are a token of their own.
     * @param includeDelimiter if true, then the delimiter character is
     *   included in the token.  If false, then see skipDelimiter.
     * @param skipDelimiter if <tt>includeDelimiter</tt> is false, then this
     *     parameter specifies whether to skip the delimiter or return it in
     *     the next token.
     * @return the token type {@link #TEXT} if the delimiter character is
     *   successfully found.  If the delimiter is not found, the return value
     *   is {@link #EOF} if all input was read, or {@link #OVERFLOW} if the
     *   maximum token length was exceeded.  Note that even when this method
     *   does not return {@link #TEXT}, {@link #tokenType} does still return
     *   that value, and {@link #tokenText} returns as much of the token
     *   as could be read.
     * @see #scan(java.lang.String,boolean,boolean,boolean,boolean)
     * @see #next
     * @see #nextChar
     */
    public int scan(char delimiter, boolean extendCurrentToken,
                    boolean includeDelimiter, boolean skipDelimiter)
        throws IOException;

    /**
     * This method is just like {@link #scan(char,boolean,boolean,boolean)},
     * except that it uses a String delimiter, possibly containing more than
     * one character.
     * @param delimiter the string of characters that will terminate the scan.
     *     This argument must not be null, and must be of length 1 or greater.
     * @param matchAll true if all characters of the delimiter must be matched
     *     sequentially.  False if any one character in the string will do.
     * @param extendCurrentToken add scanned text to current token if true.
     * @param includeDelimiter include delimiter text in token if true.
     * @param skipDelimiter if <tt>includeDelimiter</tt> is false, then this
     *     parameter specifies whether to skip the delimiter or return it in
     *     the next token.
     * @return {@link #TEXT}, {@link #EOF}, or {@link #OVERFLOW}.  See
     *     {@link #scan(char,boolean,boolean,boolean)} for details.
     * @throws java.lang.NullPointerException if delimiter is null.
     * @throws java.lang.IllegalArgumentException if delimiter is empty.
     * @throws java.lang.IllegalArgumentException if matchAll is true and
     *    includeDelimiter and skipDelimiter are both false.
     * @see #scan(char,boolean,boolean,boolean)
     */
    public int scan(String delimiter, boolean matchAll,
                    boolean extendCurrentToken, boolean includeDelimiter,
                    boolean skipDelimiter)
        throws IOException;
}
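The tokenizing algorithm specified in the javadoc for next( ) can be sketched as a much-simplified standalone class. This is not the book's implementation (that is AbstractTokenizer, shown next); it hard-wires space skipping and number and word tokenizing, and omits quotes, keywords, buffering, and position tracking:

```java
public class MiniTokenizer {
    public static final int EOF = -1, NUMBER = -3, WORD = -4;
    private final String text;
    private int p = 0, tokenStart = 0, tokenEnd = 0;

    public MiniTokenizer(String text) { this.text = text; }

    // Simplified version of the next() algorithm: skip whitespace, coalesce
    // digits into a NUMBER token, coalesce identifier characters into a
    // WORD token, and otherwise return the character itself as its own type.
    public int next() {
        while (p < text.length() && Character.isWhitespace(text.charAt(p))) p++;
        if (p >= text.length()) return EOF;
        tokenStart = p;
        char c = text.charAt(p);
        if (Character.isDigit(c)) {
            while (p < text.length() && Character.isDigit(text.charAt(p))) p++;
            tokenEnd = p;
            return NUMBER;
        }
        if (Character.isJavaIdentifierStart(c)) {
            while (p < text.length()
                   && Character.isJavaIdentifierPart(text.charAt(p))) p++;
            tokenEnd = p;
            return WORD;
        }
        tokenEnd = ++p;        // single-character token: type is the char code
        return c;
    }

    public String tokenText() { return text.substring(tokenStart, tokenEnd); }

    public static void main(String[] args) {
        MiniTokenizer t = new MiniTokenizer("x = 10;");
        while (t.next() != EOF) System.out.println(t.tokenText());
    }
}
```

Running main( ) prints the tokens x, =, 10, and ; on separate lines.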

2.8.2 The AbstractTokenizer Implementation

Example 2-9 defines the AbstractTokenizer class. As its name implies, this class implements the Tokenizer interface but is abstract, so it cannot be instantiated. The class begins by declaring a number of protected fields that hold its state. It then declares two abstract methods that subclasses must implement. The javadoc comments for these methods describe exactly what their implementations must do and how those implementations should modify the protected fields of AbstractTokenizer. The rest of the class is a Tokenizer implementation that uses those protected fields and methods; this is the relatively complex code that does the actual tokenizing. Note that it tokenizes text stored in a character array named text, one of the protected fields defined by AbstractTokenizer. The array is allocated and filled by concrete subclasses of AbstractTokenizer.
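The division of labor between AbstractTokenizer and its concrete subclasses can be illustrated with a hypothetical miniature of the same pattern (the names below are ours, not the book's): the abstract class does generic work purely in terms of two abstract buffer-management methods, and a subclass supplies the characters.

```java
public class TemplateMethodDemo {

    public static abstract class AbstractSource {
        protected char[] text;   // buffer; created by the subclass
        protected int numChars;  // number of valid characters in the buffer

        protected abstract void createBuffer(int bufferSize);
        protected abstract boolean fillBuffer();

        // Generic logic written only in terms of the abstract methods:
        // count all available input, refilling the buffer until exhausted.
        public int count() {
            createBuffer(1024);
            int total = numChars;
            while (fillBuffer()) total += numChars;
            return total;
        }
    }

    public static class StringSource extends AbstractSource {
        private final String s;
        public StringSource(String s) { this.s = s; }

        // For an in-memory string, the whole input fits in one buffer,
        // so fillBuffer() immediately reports end of input.
        protected void createBuffer(int bufferSize) {
            text = s.toCharArray();
            numChars = text.length;
        }
        protected boolean fillBuffer() { return false; }
    }

    public static void main(String[] args) {
        System.out.println(new StringSource("hello").count());  // 5
    }
}
```

CharSequenceTokenizer plays the StringSource role for the real AbstractTokenizer, implementing createBuffer( ) and fillBuffer( ) over a CharSequence.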

Example 2-9. AbstractTokenizer.java
package je3.classes;
import java.util.*;
import java.io.IOException;

/**
 * This class implements all the methods of the Tokenizer interface, and
 * defines two new abstract methods, {@link #createBuffer} and
 * {@link #fillBuffer} which all concrete subclasses must implement.
 * By default, instances of this class can handle tokens of up to 16*1024
 * characters in length.
 * @author David Flanagan
 */
public abstract class AbstractTokenizer implements Tokenizer {
    boolean skipSpaces;
    boolean tokenizeSpaces;
    boolean tokenizeNumbers;
    boolean tokenizeWords;
    boolean testquotes;
    Tokenizer.WordRecognizer wordRecognizer;
    Map keywordMap;
    String openquotes, closequotes;
    boolean trackPosition;

    int maximumTokenLength = 16 * 1024;

    int tokenType = BOF;
    int tokenLine = 0; 
    int tokenColumn = 0; 
    int tokenKeyword = -1;

    int line=0, column=0;  // The line and column numbers of text[p]

    // The name of this field is a little misleading. If eof is true, it
    // means that no more characters are available. But tokenType and tokenText
    // may still be valid until the next call to next( ), nextChar( ), or scan( ).
    boolean eof;           // Set to the return value of fillBuffer( )

    // The following fields keep track of the tokenizer's state
    // Invariant:  tokenStart <= tokenEnd <= p <= numChars <= text.length

    /**
     * The start of the current token in {@link #text}.
     * Subclasses may need to update this field in {@link #fillBuffer}.
     */
    protected int tokenStart = 0;

    /**
     * The index in {@link #text} of the first character after the current
     * token. Subclasses may need to update this field in {@link #fillBuffer}.
     */
    protected int tokenEnd = 0;

    /**
     * The position of the first untokenized character in {@link #text}.
     * Subclasses may need to update this field in {@link #fillBuffer}.
     */
    protected int p = 0;

    /**
     * The number of valid characters of input text stored in {@link #text}.
     * Subclasses must implement {@link #createBuffer} and {@link #fillBuffer}
     * to set this value appropriately.
     */
    protected int numChars = 0;

    /**
     * A buffer holding the text we're parsing.  Subclasses must implement
     * {@link #createBuffer} to set this field to a character array, and
     * {@link #fillBuffer} to refill the array.
     */
    protected char[] text = null;

    /**
     * Create the {@link #text} buffer to use for parsing.  This method may
     * put text in the buffer, but it is not required to.  In either case, it
     * should set {@link #numChars} appropriately.  This method will be called
     * once, before tokenizing begins.
     * 
     * @param bufferSize the minimum size of the created array, unless the 
     * subclass knows in advance that the input text is smaller than this, in 
     * which case, the input text size may be used instead.
     * @see #fillBuffer
     */
    protected abstract void createBuffer(int bufferSize);

    /**
     * Fill or refill the {@link #text} buffer and adjust related fields.
     * This method will be called when the tokenizer needs more characters to
     * tokenize. Concrete subclasses must implement this method to put
     * characters into the {@link #text} buffer, blocking if necessary to wait
     * for characters to become available.  This method may make room in the
     * buffer by shifting the contents down to remove any characters before
     * tokenStart.  It must preserve any characters after {@link #tokenStart}
     * and before {@link #numChars}, however.  After such a shift, it must
     * adjust {@link #tokenStart}, {@link #tokenEnd} and {@link #p}
     * appropriately.  After the optional shift, the method should add as many
     * new characters as possible to {@link #text} (and always at least 1) and
     * adjust {@link #numChars} appropriately.
     * 
     * @return false if no more characters are available; true otherwise.
     * @see #createBuffer
     */
    protected abstract boolean fillBuffer( ) throws IOException;

    public Tokenizer skipSpaces(boolean skip) {
        skipSpaces = skip;
        return this;
    }

    public Tokenizer tokenizeSpaces(boolean tokenize) {
        tokenizeSpaces = tokenize;
        return this;
    }

    public Tokenizer tokenizeNumbers(boolean tokenize) {
        tokenizeNumbers = tokenize;
        return this;
    }
    
    public Tokenizer tokenizeWords(boolean tokenize) {
        tokenizeWords = tokenize;
        return this;
    }

    public Tokenizer wordRecognizer(Tokenizer.WordRecognizer wordRecognizer) {
        this.wordRecognizer = wordRecognizer;
        return this;
    }

    public Tokenizer quotes(String openquotes, String closequotes) {
        if (openquotes == null || closequotes == null) 
            throw new NullPointerException("arguments must be non-null");
        if (openquotes.length( ) != closequotes.length( )) 
            throw new IllegalArgumentException("argument lengths differ");
        this.openquotes = openquotes;
        this.closequotes = closequotes;
        this.testquotes = openquotes.length( ) > 0;
        return this;
    }

    public Tokenizer trackPosition(boolean track) {
        if (text != null) throw new IllegalStateException( );
        trackPosition = track;
        return this;
    }

    public Tokenizer keywords(String[] keywords) {
        if (keywords != null) {
            keywordMap = new HashMap(keywords.length);
            for(int i = 0; i < keywords.length; i++) 
                keywordMap.put(keywords[i], new Integer(i));
        }
        else keywordMap = null;
        return this;
    }

    public Tokenizer maximumTokenLength(int size) {
        if (size < 1) throw new IllegalArgumentException( );
        if (text != null) throw new IllegalStateException( );
        maximumTokenLength = size;
        return this;
    }

    public int tokenType( ) { return tokenType; }

    public String tokenText( ) {
        if (text == null || tokenStart >= numChars) return null;
        return new String(text, tokenStart, tokenEnd-tokenStart);
    }

    public int tokenLine( ) {
        if (trackPosition && tokenStart < numChars) return tokenLine;
        else return 0;
    }

    public int tokenColumn( ) {
        if (trackPosition && tokenStart < numChars) return tokenColumn;
        else return 0;
    }

    public int tokenKeyword( ) {
        if (tokenType == KEYWORD) return tokenKeyword;
        else return -1;
    }
                                 
    public int next( ) throws IOException {
        int quoteindex;
        beginNewToken( );
        if (eof) return tokenType = EOF;

        char c = text[p];

        if ((skipSpaces||tokenizeSpaces) && Character.isWhitespace(c)) {
            tokenType = SPACE;
            do {
                if (trackPosition) updatePosition(text[p]);
                p++;
                if (p >= numChars) eof = !fillBuffer( );
            } while(!eof && Character.isWhitespace(text[p]));

            // If we don't return space tokens, then recursively call 
            // this method to find another token. Note that the next character
            // is not space, so we will not get into infinite recursion
            if (skipSpaces) return next( );
            tokenEnd = p;
        }
        else if (tokenizeNumbers && Character.isDigit(c)) {
            tokenType = NUMBER;
            do {
                if (trackPosition) column++;
                p++;
                if (p >= numChars) eof = !fillBuffer( );
            } while(!eof && Character.isDigit(text[p]));
            tokenEnd = p;
        }
        else if (tokenizeWords && 
                 (wordRecognizer!=null
                      ?wordRecognizer.isWordStart(c)
                      :Character.isJavaIdentifierStart(c))) {
            tokenType = WORD;
            do {
                if (trackPosition) column++;
                p++;
                if (p >= numChars) eof = !fillBuffer( );
            } while(!eof &&
                    (wordRecognizer!=null
                         ?wordRecognizer.isWordPart(text[p], c)
                         :Character.isJavaIdentifierPart(text[p])));

            if (keywordMap != null) {
                String ident = new String(text,tokenStart,p-tokenStart);
                Integer index = (Integer) keywordMap.get(ident);
                if (index != null) {
                    tokenType = KEYWORD;
                    tokenKeyword = index.intValue( );
                }
            }
            tokenEnd = p;
        }
        else if (testquotes && (quoteindex = openquotes.indexOf(c)) != -1) {
            // This is a quoted token.  Note that we do not recognize any
            // escape characters, and we do not report an error if EOF or
            // OVERFLOW occurs before the closing quote.
            if (trackPosition) column++;
            p++;
            // Scan until the matching close quote, excluding both the
            // opening and closing quotes from the token.  The token type
            // is the opening delimiter itself.
            char closequote = closequotes.charAt(quoteindex);
            scan(closequote, false, false, true);
            tokenType = c;
            // the call to scan set tokenEnd, so we don't have to
        }
        else {
            // Otherwise, the character itself is the token
            if (trackPosition) updatePosition(text[p]);
            tokenType = text[p];
            p++;
            tokenEnd = p;
        }
            
        // Check the invariants before returning
        assert text != null && 0 <= tokenStart && tokenStart <= tokenEnd && 
            tokenEnd <= p && p <= numChars && numChars <= text.length;
        return tokenType;
    }

    public int nextChar( ) throws IOException {
        beginNewToken( );
        if (eof) return tokenType = EOF;
        tokenType = text[p];
        if (trackPosition) updatePosition(text[p]);
        tokenEnd = ++p;
        // Check the invariants before returning
        assert text != null && 0 <= tokenStart && tokenStart <= tokenEnd && 
            tokenEnd <= p && p <= numChars && numChars <= text.length;
        return tokenType;
    }

    public int scan(char delimiter, boolean extendCurrentToken,
                    boolean includeDelimiter, boolean skipDelimiter)
        throws IOException 
    {
        return scan(new char[] { delimiter }, false,
                    extendCurrentToken, includeDelimiter, skipDelimiter);
    }

    public int scan(String delimiter, boolean matchall,
                    boolean extendCurrentToken,
                    boolean includeDelimiter, boolean skipDelimiter)
        throws IOException 
    {
        return scan(delimiter.toCharArray( ), matchall,
                    extendCurrentToken, includeDelimiter, skipDelimiter);
    }

    protected int scan(char[] delimiter, 
                       boolean matchall, boolean extendCurrentToken,
                       boolean includeDelimiter, boolean skipDelimiter)
        throws IOException 
    {
        if (matchall && !includeDelimiter && !skipDelimiter) 
            throw new IllegalArgumentException("must include or skip " +
                                          "delimiter when matchall is true");

        if (extendCurrentToken) ensureChars( );
        else beginNewToken( );

        tokenType = TEXT; // Even if return value differs
        if (eof) return EOF;

        int delimiterMatchIndex = 0;
        String delimString = null;
        if (!matchall && delimiter.length > 0)
            delimString = new String(delimiter);

        while(!eof) {
            // See if we've found the delimiter.  There are 3 cases here:
            // 1) single-character delimiter
            // 2) multi-char delimiter, and all must be matched sequentially
            // 3) multi-char delimiter, must match any one of them.
            if (delimiter.length == 1) {
                if (text[p] == delimiter[0]) break;
            }
            else if (matchall) {
                if (text[p] == delimiter[delimiterMatchIndex]) {
                    delimiterMatchIndex++;
                    if (delimiterMatchIndex == delimiter.length) break;
                }
                else {
                    // Restart the match, but re-test the current character
                    // against the start of the delimiter so that, e.g., we
                    // still find "ab" in "aab".
                    delimiterMatchIndex = (text[p] == delimiter[0]) ? 1 : 0;
                }
            }
            else {
                if (delimString.indexOf(text[p]) != -1) break;
            }

            if (trackPosition) updatePosition(text[p]);
            p++;
            if (p >= numChars) {    // Do we need more text?
                if (tokenStart > 0)     // Do we have room for more?
                    eof = !fillBuffer( ); // Yes, so go get some
                else {                  // No room for more characters
                    tokenEnd = p;       // so report an overflow
                    return OVERFLOW;
                }
            }
        }

        if (eof) {
            tokenEnd = p;
            return EOF;
        }

        if (includeDelimiter) {
            if (trackPosition) updatePosition(text[p]);
            p++;
            tokenEnd = p;
        }
        else if (skipDelimiter) {
            if (trackPosition) updatePosition(text[p]);
            p++;
            if (matchall) tokenEnd = p - delimiter.length;
            else tokenEnd = p - 1;
        }
        else {
            // The delimiter is neither included nor skipped.  Since matchall
            // was disallowed above, only a single delimiter character was
            // matched, and it remains unconsumed at position p.
            tokenEnd = p;
        }

        // Check the invariants before returning
        assert text != null && 0 <= tokenStart && tokenStart <= tokenEnd && 
            tokenEnd <= p && p <= numChars && numChars <= text.length;
        return TEXT;
    }

    private void ensureChars( ) throws IOException {
        if (text == null) {
            createBuffer(maximumTokenLength);  // create text[], set numChars
            p = tokenStart = tokenEnd = 0;     // initialize other state
            if (trackPosition) line = column = 1;
        }
        if (!eof && p >= numChars) // Fill the text[] buffer if needed
            eof = !fillBuffer( );  

        // Make sure our class invariants hold true before we start a token
        assert text != null && 0 <= tokenStart && tokenStart <= tokenEnd && 
            tokenEnd <= p && (p < numChars || (p == numChars && eof)) &&
            numChars <= text.length;
    }

    private void beginNewToken( ) throws IOException {
        ensureChars( );
        if (!eof) {
            tokenStart = p;
            tokenColumn = column;
            tokenLine = line;
        }
    }

    private void updatePosition(char c) {
        if (c == '\n') {
            line++;
            column = 1;
        }
        else column++;
    }
}
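The matchall branch of scan( ) deserves a closer look. The standalone sketch below (the class and method names are ours, not part of the Tokenizer API) replicates its sequential delimiter matching: on a mismatch, the current character is re-tested against the start of the delimiter, so that a delimiter immediately preceded by its own first character is still found. Even so, this is not a full KMP-style matcher: it can miss a delimiter that overlaps a longer partial match (for example, it fails to find "-->" in "--->").

```java
// Standalone sketch of scan( )'s "matchall" case: advance through the text
// until every delimiter character has been matched in sequence.  Returns
// the index just past the final delimiter character, or -1 if the full
// delimiter never appears (the analog of scan( ) returning EOF).
class MatchAllSketch {
    static int scanPast(char[] text, char[] delimiter) {
        int match = 0;                          // like delimiterMatchIndex
        for (int p = 0; p < text.length; p++) {
            if (text[p] == delimiter[match]) {
                match++;
                if (match == delimiter.length) return p + 1;
            } else {
                // Restart, re-testing the current character against the
                // first delimiter character (finds "ab" in "aab").
                match = (text[p] == delimiter[0]) ? 1 : 0;
            }
        }
        return -1;                              // delimiter not found
    }
}
```

For example, scanPast("x-->y", "-->") stops just past the delimiter at index 4, while scanPast("--->", "-->") returns -1, illustrating the overlap limitation noted above.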

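The private updatePosition( ) helper maintains 1-based line and column numbers: a newline starts a new line and resets the column, and any other character advances the column. The following sketch (the name positionOf is ours, for illustration only) applies the same rule to compute the position of an arbitrary character index.

```java
// Sketch of the line/column bookkeeping performed by updatePosition( ).
// Returns {line, column} for the character at the given index, using the
// tokenizer's convention that line and column are both 1 at the start
// of the input.
class PositionSketch {
    static int[] positionOf(String text, int index) {
        int line = 1, column = 1;
        for (int i = 0; i < index; i++) {
            if (text.charAt(i) == '\n') { line++; column = 1; }
            else column++;
        }
        return new int[] { line, column };
    }
}
```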
2.8.3 A Concrete CharSequenceTokenizer

Example 2-10 is a concrete implementation of the Tokenizer interface that subclasses AbstractTokenizer to tokenize any CharSequence (i.e., any String, StringBuffer, or java.nio.CharBuffer object). The code is refreshingly simple, which shows the power of the three-way interface/abstract class/concrete class design. The class includes an inner class named Test that tokenizes its command-line arguments.

Example 2-10. CharSequenceTokenizer.java
package je3.classes;

/**
 * This trivial subclass of AbstractTokenizer is suitable for tokenizing input
 * stored in a String, StringBuffer, CharBuffer, or any other class that 
 * implements CharSequence.  Because CharSequence instances may be mutable,
 * the constructor makes an internal copy of the character sequence.  This means
 * that any subsequent changes to the character sequence will not be seen
 * during tokenizing.
 *
 * @author David Flanagan
 */
public class CharSequenceTokenizer extends AbstractTokenizer {
    char[] buffer;  // a copy of the characters in the sequence

    /** 
     * Construct a new CharSequenceTokenizer to tokenize <tt>sequence</tt>.
     * This constructor makes an internal copy of the characters in the
     * specified sequence.
     * @param sequence the character sequence to be tokenized.
     */
    public CharSequenceTokenizer(CharSequence sequence) {
        buffer = sequence.toString( ).toCharArray( );
    }

    /**
     * Set the inherited {@link #text} and {@link #numChars} fields.
     * This class knows the complete length of the input text, so it ignores
     * the <tt>bufferSize</tt> argument and uses the complete input sequence.
     * @param bufferSize ignored in this implementation
     */
    protected void createBuffer(int bufferSize) {
        assert text == null;       // verify that we're only called once
        text = buffer;
        numChars = buffer.length;
    }

    /**
     * Return false to indicate no more input is available.
     * {@link #createBuffer} fills the buffer with the complete input sequence,
     * so this method returns false to indicate that no more text is available.
     * @return always returns false.
     */
    protected boolean fillBuffer( ) { return false; }

    public static class Test {
        public static void main(String[] args) throws java.io.IOException {
            StringBuffer text = new StringBuffer( );
            for(int i = 0; i < args.length; i++) text.append(args[i]+" ");
            CharSequenceTokenizer t=new CharSequenceTokenizer(text.toString( ));
            t.tokenizeWords(true).quotes("'&","';").skipSpaces(true);
            while(t.next( ) != Tokenizer.EOF)
                System.out.println(t.tokenText( ));
        }
    }
}
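The constructor's defensive copy matters because CharSequence implementations such as StringBuffer and StringBuilder are mutable. The small sketch below (the class and method names are ours, not part of the example) isolates that idiom: once the characters are snapshotted, later edits to the sequence are invisible to the copy.

```java
// Sketch of the defensive-copy idiom used by the CharSequenceTokenizer
// constructor: snapshot the characters of a possibly mutable sequence.
class CopySketch {
    static char[] snapshot(CharSequence seq) {
        // Same idiom as the constructor: toString( ) then toCharArray( )
        return seq.toString().toCharArray();
    }
}
```

For example, mutating a StringBuilder after calling snapshot( ) leaves the copied characters unchanged, which is exactly why subsequent changes to the sequence are not seen during tokenizing.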