24.2. Parsing XML with SAX

In most cases, the best way to extract information from an XML document is to parse the document with an event-driven parser compliant with SAX, the Simple API for XML. SAX defines a standard API that can be implemented on top of many different underlying parsers. The SAX approach to parsing has similarities to most of the HTML parsers covered in Chapter 23. As the parser encounters XML elements, text contents, and other significant events in the input stream, the parser calls back to methods of your classes. Such event-driven parsing, based on callbacks to your methods as relevant events occur, also has similarities to the event-driven approach that is almost universal in GUIs and in some of the best, most scalable networking frameworks, such as Twisted, mentioned in Chapter 19. Event-driven approaches in various programming fields may not appear natural to beginners, but enable high performance and particularly high scalability, making them very suitable for high-workload cases.

To use SAX, you define a content handler class, subclassing a library class and overriding some methods. Then you build a parser object p, install an instance of your class as p's handler, and feed p the input stream to parse. p calls methods on your handler to reflect the document's structure and contents. Your handler's methods perform application-specific processing. The xml.sax package supplies a factory function to build p, and convenience functions for simpler operation in typical cases. xml.sax also supplies exception classes, raised in cases of invalid input and other errors.
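As a minimal sketch of the workflow just described, the following handler counts how often each element name occurs. The class name TagCounter, the helper count_tags, and the sample document are illustrative assumptions, not part of xml.sax; the code is written for Python 3, where an in-memory byte stream stands in for the input file.

```python
import io
import xml.sax

class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts occurrences of each element name."""
    def startDocument(self):
        self.counts = {}
    def startElement(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1

# Illustrative sample document
SAMPLE = b"<order><item/><item/><note>rush</note></order>"

def count_tags(stream):
    handler = TagCounter()
    p = xml.sax.make_parser()         # build a parser object p
    p.setContentHandler(handler)      # install the handler instance
    p.parse(stream)                   # p calls back into handler while parsing
    return handler.counts

counts = count_tags(io.BytesIO(SAMPLE))
```

After the call, counts maps each tag name to the number of times the parser reported it.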

Optionally, you can also register with parser p other kinds of handlers besides the content handler. You can supply a custom error handler to use an error diagnosis strategy different from normal exception raising, for example in order to diagnose several errors during a parse. You can supply a custom DTD handler to receive information about notation and unparsed entities from the XML document's Document Type Definition (DTD). You can supply a custom entity resolver to handle external entity references in advanced, customized ways. These advanced possibilities are rarely used, and I do not cover them further in this book.

24.2.1. The xml.sax Package

The xml.sax package supplies exception class SAXException and subclasses of it to support fine-grained exception handling. xml.sax also supplies three functions.

make_parser
make_parser(parsers_list=[])

parsers_list is a list of strings, which are the names of modules from which you would like to build your parser. make_parser tries each module in sequence until it finds one that defines a function create_parser. After the modules in parsers_list, if any, make_parser continues by trying a list of default modules. make_parser terminates as soon as it can generate a parser p, and returns p.
parse
parse(file,handler,error_handler=None)

file is either a filename string or a file-like object open for reading, that contains an XML document. handler is an instance of your own subclass of class ContentHandler, covered in "ContentHandler" on page 594. error_handler, if given, is an instance of your own subclass of class ErrorHandler. You don't necessarily have to subclass ContentHandler and/or ErrorHandler; you just need to provide the same interfaces as the classes do. Subclassing is a convenient means to this end.

Function parse is equivalent to the code:

p = make_parser( )
p.setContentHandler(handler)
if error_handler is not None:
    p.setErrorHandler(error_handler)
p.parse(file)

This idiom is quite frequent in SAX parsing, so having it in a single function is convenient. When error_handler is None, the parser reacts to errors by propagating an exception that is an instance of some subclass of SAXException.
parseString
parseString(string,handler,error_handler=None)

Like parse, except that string is the XML document in string form.
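One possible use of parseString together with the exception classes is a quick well-formedness check. The sketch below assumes Python 3 (the document is passed as bytes); the function name check_well_formed is hypothetical. It relies on SAXParseException, the subclass of SAXException that parsers raise for parse errors.

```python
import xml.sax

def check_well_formed(document):
    """Return None if document (bytes) parses cleanly, else an error string."""
    try:
        # A bare ContentHandler ignores every event; only errors matter here
        xml.sax.parseString(document, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as err:
        return "line %d: %s" % (err.getLineNumber(), err.getMessage())
    return None

ok = check_well_formed(b"<a><b/></a>")     # well-formed input
bad = check_well_formed(b"<a><b></a>")     # <b> is never closed
```

Catching the base class SAXException instead would also cover the package's other error classes.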

xml.sax also supplies a class, which you subclass to define your content handler.
ContentHandler
class ContentHandler( )

A subclass of ContentHandler (whose instance we name h in the following) may override several methods, of which the most frequently useful are the following:

h.characters( data)

Called when textual content data (a unicode string) is parsed. The parser may split each range of text in the document into any number of separate callbacks to h.characters. Therefore, your implementation of method characters usually buffers data, generally by appending it to a list attribute. When your class knows from some other event that all relevant data has arrived, your class calls ''.join on the list and processes the resulting string.

h.endDocument( )

Called once when the document finishes.

h.endElement( tag)

Called when the element named tag finishes.

h.endElementNS( name, qname)

Called when an element finishes and the parser is handling namespaces. name and qname are the same as for startElementNS, covered below.

h.startDocument( )

Called once when the document begins.

h.startElement( tag, attrs)

Called when the element named tag begins. attrs is a mapping of attribute names to values, as covered in "Attributes" on page 595.

h.startElementNS( name, qname, attrs)

Called when an element begins and the parser is handling namespaces. name is a pair (uri,localname), where uri is the namespace's URI or None, and localname is the name of the tag. qname (which stands for qualified name) is either None, if the parser does not supply the namespace prefixes feature, or the string prefix:name used in the document's text for this tag. attrs is a mapping of attribute names to values, as covered in "Attributes" on page 595.
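The buffering idiom described for method characters can be sketched as follows. The class name TitleCollector and the sample document are illustrative assumptions; the handler appends each piece of text to a list while inside a title element, and joins the pieces when endElement signals that all the text has arrived.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler: gathers the text of every <title> element."""
    def startDocument(self):
        self.titles = []
        self.buffer = None            # None means "not inside a <title>"
    def startElement(self, tag, attrs):
        if tag == 'title':
            self.buffer = []          # start buffering text pieces
    def characters(self, data):
        if self.buffer is not None:
            self.buffer.append(data)  # may be called several times per text run
    def endElement(self, tag):
        if tag == 'title' and self.buffer is not None:
            self.titles.append(''.join(self.buffer))
            self.buffer = None

handler = TitleCollector()
xml.sax.parseString(
    b"<books><book><title>Tom &amp; Jerry</title></book>"
    b"<book><title>SAX basics</title></book></books>",
    handler)
```

The first title contains an entity reference, a spot where a parser is likely to split the text into several characters callbacks; joining the buffered pieces makes the handler indifferent to how the text was split.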

24.2.1.1. Attributes

The last argument of methods startElement and startElementNS is an attributes object attr, a read-only mapping of attribute names to attribute values. For method startElement, names are identifier strings. For method startElementNS, names are pairs (uri,localname), where uri is the namespace's URI or None, and localname is the local name of the attribute. In addition to some mapping methods, attr also supports methods that let you work with the qname (qualified name) of each attribute.

getValueByQName
attr.getValueByQName(name)

Returns the attribute value for a qualified name name.
getNameByQName
attr.getNameByQName(name)

Returns the (namespace, localname) pair for a qualified name name.
getQNameByName
attr.getQNameByName(name)

Returns the qualified name for name, which is a (namespace, localname) pair.
getQNames
attr.getQNames( )

Returns the list of qualified names of all attributes.

For startElement, each qname is the same string as the corresponding name. For startElementNS, a qname is the corresponding local name for attributes not associated with a namespace (i.e., attributes whose uri is None); otherwise, the qname is the string prefix:name used in the document's text for this attribute.

The parser may reuse in later processing the attr object that it passes to methods startElement and startElementNS. If you need to keep a copy of the attributes of an element, call attr.copy( ) to get the copy.
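The following sketch shows a namespace-aware handler that keeps a copy of each element's attributes. The class name AttrKeeper and the sample document are hypothetical. Turning on namespace handling through p.setFeature with xml.sax.handler.feature_namespaces is not covered in the text above; it is the standard SAX feature flag, and the sketch assumes the parser in use honors it. The code is written for Python 3.

```python
import io
import xml.sax
import xml.sax.handler

class AttrKeeper(xml.sax.ContentHandler):
    """Hypothetical handler: remembers each element's name and attributes."""
    def startDocument(self):
        self.elements = []
    def startElementNS(self, name, qname, attrs):
        # name is a (uri, localname) pair; copy attrs before keeping them,
        # since the parser is allowed to reuse the object it passed in
        saved = attrs.copy()
        self.elements.append((name, dict(saved.items())))

DOC = (b'<doc xmlns:x="urn:example">'
       b'<x:item x:id="1" plain="2"/></doc>')

handler = AttrKeeper()
p = xml.sax.make_parser()
# Assumption: the parser supports the standard namespaces feature
p.setFeature(xml.sax.handler.feature_namespaces, True)
p.setContentHandler(handler)
p.parse(io.BytesIO(DOC))
```

Each saved dictionary is keyed by (uri, localname) pairs, with None as the uri of attributes that belong to no namespace.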

24.2.1.2. Incremental parsing

All parsers support a method parse, which you call with the XML document's source: either a string naming the document (a filename or URL) or a file-like object open for reading. parse does not return until the end of the XML document. Most SAX parsers, though not all, also support incremental parsing, which lets you feed the XML document to the parser a little at a time, as the document arrives from a network connection or other source. Good incremental parsers perform all possible callbacks to your handler class's methods as soon as possible, so you don't have to wait for the whole document to arrive before you start processing it. Processing can thus proceed as incrementally as the parsing itself does, which suits asynchronous networking approaches, covered in "Event-Driven Socket Programs" on page 533. A parser p that is capable of incremental parsing supplies three more methods.

close
p.close( )

Call when the XML document is finished.
feed
p.feed(data)

Passes to the parser a part of the document. The parser processes some prefix of the text and holds the rest in a buffer until the next call to p.feed or p.close.
reset
p.reset( )

Call after an XML document is finished or abandoned, before you start feeding another XML document to the parser.
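The three methods can be exercised together as in the sketch below, which feeds two documents to one parser in small pieces. The names EventLog and feed_in_chunks are hypothetical, the chunk sizes are arbitrary, and the sketch assumes that the parser returned by make_parser supports incremental parsing. It is written for Python 3, with the data supplied as bytes.

```python
import xml.sax

class EventLog(xml.sax.ContentHandler):
    """Hypothetical handler: records element names in arrival order."""
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.events = []
    def startElement(self, tag, attrs):
        self.events.append(tag)

def feed_in_chunks(p, document, size):
    """Feed document (bytes) to parser p, size bytes at a time."""
    for start in range(0, len(document), size):
        p.feed(document[start:start + size])
    p.close()                         # signal the end of the document

log = EventLog()
p = xml.sax.make_parser()             # assumed to support feed/close/reset
p.setContentHandler(log)

feed_in_chunks(p, b"<log><entry/><entry/></log>", 5)
first = list(log.events)

p.reset()                             # prepare p for another document
feed_in_chunks(p, b"<other/>", 3)
second = log.events[len(first):]
```

Chunk boundaries may fall anywhere, even in the middle of a tag; the parser holds incomplete text until a later call to feed or close supplies the rest.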

24.2.1.3. The xml.sax.saxutils module

The saxutils module of package xml.sax supplies two functions and a class that provide handy ways to generate XML output based on an input XML document.

escape
escape(data,entities={})

Returns a copy of string data with characters <, >, and & changed into entity references &lt;, &gt;, and &amp;. entities is a dictionary with strings as keys and values; each substring s of data that is a key in entities is changed in escape's result string into string entities[s]. For example, to escape single- and double-quote characters, in addition to angle brackets and ampersands, you can call:

xml.sax.saxutils.escape(data, {'"': '&quot;', "'": '&apos;'})

quoteattr
quoteattr(data,entities={})

Same as escape, but also quotes the result string to make it immediately usable as an attribute value and escapes any quote characters that have to be escaped.
XMLGenerator
class XMLGenerator(out=sys.stdout, encoding='iso-8859-1')

Subclasses xml.sax.ContentHandler and implements all that is needed to reproduce the input XML document on the given file-like object out with the specified encoding. When you must generate an XML document that is a small modification of the input one, you can subclass XMLGenerator, overriding methods and delegating most of the work to XMLGenerator's implementations of the methods. For example, if all you need to do is rename some tags according to a dictionary, XMLGenerator makes it extremely simple, as shown in the following example:

import xml.sax, xml.sax.saxutils

def tagrenamer(infile, outfile, renaming_dict):
    base = xml.sax.saxutils.XMLGenerator
    class Renamer(base):
        def rename(self, name):
            return renaming_dict.get(name, name)
        def startElement(self, name, attrs):
            base.startElement(self, self.rename(name), attrs)
        def endElement(self, name):
            base.endElement(self, self.rename(name))
    xml.sax.parse(infile, Renamer(outfile))
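To try the saxutils helpers without any files, a sketch along the following lines may help. It assumes Python 3, uses an in-memory io.StringIO as the output stream, and restructures the renaming idea as a standalone class that receives the dictionary in its constructor; that constructor signature is a choice made for this sketch, not part of XMLGenerator.

```python
import io
import xml.sax
import xml.sax.saxutils as saxutils

# escape and quoteattr applied to illustrative strings
text = saxutils.escape('1 < 2 & 3 > 2')
attr = saxutils.quoteattr('say "hi"')

class Renamer(saxutils.XMLGenerator):
    """Hypothetical subclass: renames tags according to a dictionary."""
    def __init__(self, out, renaming_dict):
        saxutils.XMLGenerator.__init__(self, out)
        self.renaming_dict = renaming_dict
    def rename(self, name):
        return self.renaming_dict.get(name, name)
    def startElement(self, name, attrs):
        saxutils.XMLGenerator.startElement(self, self.rename(name), attrs)
    def endElement(self, name):
        saxutils.XMLGenerator.endElement(self, self.rename(name))

out = io.StringIO()                   # in-memory text stream for the output
xml.sax.parseString(b'<doc><old kind="x">hi</old></doc>',
                    Renamer(out, {'old': 'new'}))
result = out.getvalue()
```

Here quoteattr wraps its result in single quotes, because the value itself contains double quotes, and result holds the regenerated document with every old element renamed to new.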

24.2.2. Parsing XHTML with xml.sax

The following example uses xml.sax to perform a typical XHTML-related task that is very similar to the tasks performed in the examples of Chapter 22. The example fetches an XHTML page from the Web with urllib, parses it, and outputs all unique links from the page to other sites. The example uses urlparse to examine the links for the given site and outputs only the links whose URLs have an explicit scheme of 'http'.

import xml.sax, urllib, urlparse

class LinksHandler(xml.sax.ContentHandler):
    def startDocument(self):
        self.seen = set( )
    def startElement(self, tag, attributes):
        if tag != 'a': return
        value = attributes.get('href')
        if value is not None and value not in self.seen:
            self.seen.add(value)
            pieces = urlparse.urlparse(value)
            if pieces[0] != 'http': return
            print urlparse.urlunparse(pieces)

p = xml.sax.make_parser( )
p.setContentHandler(LinksHandler( ))
f = urllib.urlopen('http://www.w3.org/MarkUp/')
BUFSIZE = 8192

while True:
    data = f.read(BUFSIZE)
    if not data: break
    p.feed(data)

p.close( )

This example is quite similar to the HTMLParser example in Chapter 22. With the xml.sax module, the parser and the handler are separate objects (while in the examples of Chapter 22 they coincided). Method names differ (startElement in this example versus handle_starttag in the HTMLParser example). The attributes argument is a mapping here, so its method get immediately gives us the attribute value we're interested in, while in the examples of Chapter 22, attributes were given as a sequence of (name,value) pairs, so we had to loop on the sequence until we found the right name. Despite these differences in detail, the overall structure is very close, and typical of simple event-driven parsing tasks.
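The same handler logic can be tried without network access. The sketch below is a variant under stated assumptions: it targets Python 3, where the URL-parsing functions live in urllib.parse; it collects links in a list instead of printing them; and it feeds the parser from a made-up in-memory page rather than from a fetched one. The class name LinkCollector and the sample page are hypothetical.

```python
import xml.sax
from urllib.parse import urlparse, urlunparse

class LinkCollector(xml.sax.ContentHandler):
    """Hypothetical variant of LinksHandler that stores links in a list."""
    def startDocument(self):
        self.seen = set()
        self.links = []
    def startElement(self, tag, attributes):
        if tag != 'a':
            return
        value = attributes.get('href')
        if value is not None and value not in self.seen:
            self.seen.add(value)
            pieces = urlparse(value)
            if pieces[0] != 'http':   # keep only explicit http links
                return
            self.links.append(urlunparse(pieces))

# Made-up page standing in for the fetched XHTML document
PAGE = b"""<html><body>
<a href="http://example.com/a">first</a>
<a href="/relative">local</a>
<a href="http://example.com/a">repeat</a>
<a href="http://example.org/">second</a>
</body></html>"""

collector = LinkCollector()
p = xml.sax.make_parser()
p.setContentHandler(collector)
BUFSIZE = 32                          # small on purpose, to force several feeds
for start in range(0, len(PAGE), BUFSIZE):
    p.feed(PAGE[start:start + BUFSIZE])
p.close()
```

The relative link and the repeated link are both skipped, so collector.links ends up with the two distinct http URLs in document order.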