XML Parser Functions

In this section, we will examine PHP's event-based XML parser functions. Prior to PHP 5, these were based on James Clark's Expat library (XML Parser Toolkit), which is available from http://www.jclark.com/xml/expat.html. As of PHP 5, all of PHP's XML functions use libxml2 (http://www.xmlsoft.org/). Event-based models for parsing XML are not the easiest to use, but they can be very efficient. Handler functions are invoked as XML elements are encountered, whereas alternatives such as DOM require that an entire document be modeled in memory before you can work with it.

Acquiring a Parser Resource

To begin parsing a document, you need a parser resource. You can acquire one of these with the xml_parser_create() function. xml_parser_create() does not require any arguments and returns a parser resource if all goes well; otherwise, it returns false. The function optionally accepts a string containing one of three character encodings: ISO-8859-1, which is the default; US-ASCII; and UTF-8. We will stick to the default:

$parser = xml_parser_create();
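If you would rather not rely on the default, a minimal sketch of requesting an encoding explicitly and checking the result might look like this. The variable names are ours, and reading the option back with xml_parser_get_option() is only a sanity check, not a required step:

```php
<?php
// Ask for a parser that works in UTF-8 instead of relying on the default.
$utf8_parser = xml_parser_create( "UTF-8" );

// Guard against failure before registering handlers or parsing.
if ( ! $utf8_parser ) {
  die( "Could not create an XML parser" );
}

// Optional check: read back the encoding that handlers will receive.
$target = xml_parser_get_option( $utf8_parser, XML_OPTION_TARGET_ENCODING );
print "Target encoding: $target\n";
```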

When you have finished working with the parser resource, you might want to free up the memory it is using to reduce your script's overhead. xml_parser_free() requires a valid parser resource and returns a boolean—true if the operation was successful, and false otherwise:

xml_parser_free( $parser );

Setting XML Handlers

Seven XML events can be associated with a handler; of these, we will cover the three you are most likely to use frequently: the start of an element, the end of an element, and character data.

To associate a function with element events, you should use the xml_set_element_handler() function. This requires three arguments: a valid parser resource, the name of a handler for start elements, and the name of a handler for end elements.

You should build the functions in question, designing the start element handler to accept three arguments. The first is a parser resource, the second is a string containing the element's name, and the third is an associative array of attributes. The end element handler should be designed to accept two arguments—the parser resource and the name of the element. Unless you have specified otherwise, all element and attribute names are converted to uppercase characters:

// ...
xml_set_element_handler( $parser, "start_handler", "end_handler" );
// ...
function start_handler( $parser, $el_name, $attribs ) {
  print "START: $el_name: <br />\n";
  foreach( $attribs as $at_name=>$at_val ) {
    print "\t$at_name=>\"$at_val\"<br />\n";
  }
}

function end_handler( $parser, $el_name ) {
  print "END: $el_name<br />\n";
}

The previous fragment illustrates two very simple element handlers. The start element handler prints the element name and a list of attribute names and values. This is called for the beginning of every element encountered in an XML document. The end handler merely prints the element name again.
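To see the order in which the handlers fire, here is a small self-contained sketch. It records events in an array instead of printing them, and it uses anonymous functions rather than named handlers; the sample document and the variable names are our own assumptions:

```php
<?php
// Collect events in an array so the call order is easy to inspect.
$events = array();

// Start handler: record the element name followed by its attributes.
$start = function ( $parser, $el_name, $attribs ) use ( &$events ) {
  $line = "START $el_name";
  foreach ( $attribs as $at_name => $at_val ) {
    $line .= " $at_name=\"$at_val\"";
  }
  $events[] = $line;
};

// End handler: record the element name only.
$end = function ( $parser, $el_name ) use ( &$events ) {
  $events[] = "END $el_name";
};

$parser = xml_parser_create();
xml_set_element_handler( $parser, $start, $end );
xml_parse( $parser, '<news><item id="1" /></news>', true );

print implode( "\n", $events ) . "\n";
```

Because case folding is on by default, the names arrive in uppercase: the start handler sees NEWS and then ITEM (with an ID attribute), and the end handler sees ITEM and then NEWS.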

Now that we know where elements begin and end, it would be nice to access any text they might contain. We can do this by setting up a character handler with the xml_set_character_data_handler() function, which requires a valid parser resource and the name of a handler function. The handler function should be designed to accept a parser resource and the found string, like so:

function char_data( $parser, $data ) {
  print "\tchar data:<i>".trim($data)."</i><br />\n";
}
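One thing to be aware of is that the parser may deliver the text of a single element in more than one call to the character handler, for example around an entity such as &amp;. A cautious handler therefore appends to a buffer rather than assuming it has the whole string. The following is a sketch under that assumption; the sample headline and the variable names are ours:

```php
<?php
$buffer = "";  // accumulated character data
$calls  = 0;   // how many times the handler was invoked

$parser = xml_parser_create();
xml_set_character_data_handler( $parser,
  function ( $parser, $data ) use ( &$buffer, &$calls ) {
    // Append each piece; do not assume one call per element.
    $buffer .= $data;
    $calls++;
  } );

xml_parse( $parser, "<headline>Bananas &amp; cream\nsell out</headline>", true );

print "handler calls: $calls\n";
print "text: " . trim( $buffer ) . "\n";
```

The number of calls depends on the underlying library, so the sketch reports it rather than relying on it.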

You can read about the other XML events supported by PHP at the appropriate PHP manual page (http://www.php.net/manual/en/ref.xml.php). You can also see the complete list in Table 22.1.

Table 22.1. The XML Handler Functions
Function                                    Trigger Event
xml_set_character_data_handler()            Character data
xml_set_default_handler()                   Events not covered by specific handlers
xml_set_element_handler()                   Element start and end
xml_set_external_entity_ref_handler()       External entities
xml_set_notation_decl_handler()             Notation declaration
xml_set_processing_instruction_handler()    Processing instructions
xml_set_unparsed_entity_decl_handler()      Unparsed entity (NDATA)
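The other handlers are registered in the same way as the ones we have covered. As one illustration, here is a minimal sketch of a processing instruction handler, which receives the parser, the instruction's target, and its data. The "render" instruction and the variable names are hypothetical:

```php
<?php
// Each processing instruction is stored as array( target, data ).
$instructions = array();

$parser = xml_parser_create();
xml_set_processing_instruction_handler( $parser,
  function ( $parser, $target, $data ) use ( &$instructions ) {
    $instructions[] = array( $target, $data );
  } );

// "render" is a made-up instruction used only for this illustration.
xml_parse( $parser, '<?xml version="1.0"?><doc><?render mode="fast"?></doc>', true );

foreach ( $instructions as $pi ) {
  print "target: $pi[0], data: $pi[1]\n";
}
```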

xml_parser_set_option()

I mentioned that element names are passed to handlers as uppercase strings by default. This is not advisable because XML element names are case sensitive, so two distinct elements could reach your handlers under the same name. You can turn off this feature using the xml_parser_set_option() function. This function requires a parser resource, an integer that determines which option is to be set, and the value for the option itself. To turn off the feature that renders element names uppercase (also called case folding), you can use the built-in constant XML_OPTION_CASE_FOLDING and pass 0 to the function:

xml_parser_set_option( $parser, XML_OPTION_CASE_FOLDING, 0 );
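To make the effect concrete, the following sketch parses the same string twice, once with case folding on and once with it off, and collects the element names the start handler receives. The helper collect_names() and the sample markup are our own inventions:

```php
<?php
// Hypothetical helper: return the element names seen by the start handler.
function collect_names( $xml, $fold ) {
  $names  = array();
  $parser = xml_parser_create();
  xml_parser_set_option( $parser, XML_OPTION_CASE_FOLDING, $fold );
  xml_set_element_handler( $parser,
    function ( $p, $name, $attribs ) use ( &$names ) { $names[] = $name; },
    function ( $p, $name ) { /* end events are ignored here */ } );
  xml_parse( $parser, $xml, true );
  return $names;
}

$xml = '<bananaNews><byLine /></bananaNews>';
print implode( ", ", collect_names( $xml, 1 ) ) . "\n"; // folding on
print implode( ", ", collect_names( $xml, 0 ) ) . "\n"; // folding off
```

With folding on, the handler sees BANANANEWS and BYLINE; with folding off, it sees the names exactly as they appear in the document.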

You can also change the target character encoding using this function. To do this, you call xml_parser_set_option() with a $parser resource, the constant XML_OPTION_TARGET_ENCODING, and a string value set to one of ISO-8859-1, US-ASCII, or UTF-8. This makes the parser convert character encoding before passing data to your handlers. By default, the target encoding is the same as that set for the source encoding (ISO-8859-1 by default, or whatever you set with the xml_parser_create() function).

Parsing the Document

So far, we've merely been setting the correct conditions for a parse. To actually begin the parse process, we need a function called xml_parse(). xml_parse() requires a valid parser resource and a string containing the XML to be parsed. You can call xml_parse() repeatedly, and it will treat additional data as part of the same document. To tell the parser that the data you are passing is the last piece of the document, pass a true value (such as 1) as the optional third argument. The parser can then report a document that ends prematurely:

$xml_data="<?xml version=\"1.0\"?><banana-news><test /></banana-news>";
xml_parse( $parser, $xml_data, 1 );

xml_parse() returns 1 if the parse was successful and 0 if an error was encountered, so you can test the return value as you would a boolean.
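Because xml_parse() can be called repeatedly, a document can be fed to the parser in pieces, with the third argument set only on the last piece. Here is a sketch of that pattern. The chunk boundaries are deliberately awkward and, like the variable names, are assumptions made for the illustration:

```php
<?php
// A document split at arbitrary points, as it might arrive from a stream.
$chunks = array(
  "<?xml version=\"1.0\"?><banana-news><te",
  "st /></banana",
  "-news>"
);

$seen   = array();
$parser = xml_parser_create();
xml_set_element_handler( $parser,
  function ( $p, $name, $attribs ) use ( &$seen ) { $seen[] = $name; },
  function ( $p, $name ) {} );

$ok   = true;
$last = count( $chunks ) - 1;
foreach ( $chunks as $i => $chunk ) {
  // Flag only the final chunk as the end of the document.
  if ( ! xml_parse( $parser, $chunk, $i === $last ) ) {
    $ok = false;
    break;
  }
}

print ( $ok ? "parsed: " . implode( ", ", $seen ) : "parse failed" ) . "\n";
```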

Reporting Errors

When parsing an XML document, you should make allowances for the possibility of errors in the document. If an error is encountered, the parser stops working with your document, but it does not output a message to the browser. It is up to you to generate an informative error message, including the nature of the error and line number at which it occurred.

The parser only reports errors in well-formedness—that is, errors in XML syntax. It is not capable of validating an XML document against a DTD.

We can detect whether an error has occurred by testing the return value of xml_parse(). If a failure has occurred, the parser stores an error number, which you can access with the xml_get_error_code() function. xml_get_error_code() requires a valid parser resource:

$code = xml_get_error_code( $parser );

The code is an integer that should match an error constant provided for you by PHP, such as XML_ERROR_TAG_MISMATCH. Rather than work our way through all the relevant constants to produce an error message, we can simply pass the code to another function, xml_error_string(). xml_error_string() requires only an XML error code and produces a clear error report:

$str = xml_error_string( $code );

Now all we need is to find the line number at which the error occurred. We can do this with xml_get_current_line_number(), which requires a parser resource and returns the current line number. Because the parser stops at any error it finds, the current line number is the line number at which the error is to be found:

$line = xml_get_current_line_number( $parser );

We can now create a function to report on errors:

function format_error( $p ) {
    $code = xml_get_error_code( $p );
    $str = xml_error_string( $code );
    $line = xml_get_current_line_number ( $p );
    return "XML ERROR ($code): $str at line $line";
}
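To try the function out, we can hand the parser a document that is deliberately not well formed. The sketch below repeats format_error() so that it runs on its own. The broken markup is invented, and the exact error number and message text depend on the underlying library, so treat any output as illustrative:

```php
<?php
function format_error( $p ) {
  $code = xml_get_error_code( $p );
  $str  = xml_error_string( $code );
  $line = xml_get_current_line_number( $p );
  return "XML ERROR ($code): $str at line $line";
}

// The closing tag on the second line does not match its opening tag.
$bad_xml = "<news>\n<headline>Oops</byline>\n</news>";

$parser  = xml_parser_create();
$ok      = xml_parse( $parser, $bad_xml, true );
$code    = xml_get_error_code( $parser );
$message = $ok ? "no error" : format_error( $parser );

print $message . "\n";
```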

All the previous fragments are brought together in Listing 22.2.

Listing 22.2 Parsing an XML Document

 1: <?php
 2: $parser = xml_parser_create();
 3:
 4: xml_parser_set_option( $parser, XML_OPTION_CASE_FOLDING, 0 );
 5: xml_set_element_handler( $parser, "start_handler", "end_handler" );
 6: xml_set_character_data_handler( $parser, "char_data" );
 7:
 8: $xml_str = file_get_contents( "listing22.1.xml" );
 9:
10: xml_parse( $parser, $xml_str, true )
11: or die( format_error( $parser ) );
12:
13: function start_handler( $parser, $el_name, $attribs ) {
14:   print "START: $el_name: <br />\n";
15:   foreach( $attribs as $at_name=>$at_val ) {
16:     print "\t$at_name=>\"$at_val\"<br />\n";
17:   }
18:   print "\t<blockquote><div>\n";
19: }
20:
21: function end_handler( $parser, $el_name ) {
22:   print "\t</div></blockquote>\n";
23:   print "END: $el_name<br />\n";
24: }
25:
26: function char_data( $parser, $data ) {
27:   print "\tchar data:<i>".trim($data)."</i><br />\n";
28: }
29:
30: function format_error( $p ) {
31:   $code = xml_get_error_code( $p );
32:   $str = xml_error_string( $code );
33:   $line = xml_get_current_line_number ( $p );
34:   return "XML ERROR ($code): $str at line $line";
35: }
36: ?>

We create a parser on line 2 and establish our handlers (lines 5 and 6). We also declare the handler functions themselves, start_handler() on line 13, end_handler() on line 21, and char_data() on line 26. Listing 22.2 simply dumps all the data it encounters to the browser. This illustrates the parser code in action, but it is not very useful. In the next section, we will discuss a small script that outputs something more sensible.

An Example

We are running a banana-related news site. Our partner provides us with a news feed, consisting of an XML document. We would like to extract only the headlines and article authors for display on our site.

We already have all the tools we need to achieve this. The only thing new here is a technique: keeping a stack of the elements that are currently open. You can see the code in Listing 22.3.

Listing 22.3 An Example: Parsing an XML Document

 1: <?php
 2: $open_stack = array();
 3: $parser = xml_parser_create();
 4: xml_set_element_handler( $parser, "start_handler", "end_handler" );
 5: xml_set_character_data_handler( $parser, "character_handler");
 6: xml_parser_set_option( $parser, XML_OPTION_CASE_FOLDING, 0 );
 7: xml_parse( $parser, file_get_contents( "listing22.1.xml" ), true )
 8:   or die( format_error( $parser ) );
 9: xml_parser_free( $parser );
10:
11: function start_handler( $p, $name, $atts ) {
12:   global $open_stack;
13:   $open_stack[] = array($name, "");
14: }
15:
16: function character_handler( $p, $txt ) {
17:   global $open_stack;
18:   $cur_index = count($open_stack)-1;
19:   $open_stack[$cur_index][1] =
20:     $open_stack[$cur_index][1].$txt;
21: }
22:
23: function end_handler( $p, $name ) {
24:   global $open_stack;
25:   $el = array_pop( $open_stack );
26:   if ( $name == "headline") {
27:     print "<p><b>$el[1]</b><br />\n";
28:   }
29:   if ( $name == "byline") {
30:     print "<i>$el[1]</i></p>\n\n";
31:   }
32: }
33:
34: function format_error( $p ) {
35:   $code = xml_get_error_code( $p );
36:   $str = xml_error_string( $code );
37:   $line = xml_get_current_line_number ( $p );
38:   return "XML ERROR ($code): $str at line $line";
39: }
40:
41: ?>

We begin by establishing a global array variable, $open_stack, on line 2. We use it as a stack that tells us which element currently encloses the parser's position. The parser is initialized and the handlers are set, as you have already seen (lines 3 to 6).

When an element is encountered, start_handler() (declared on line 11) is called. We create a two-element array consisting of the element name and an empty string and add it to the end of the $open_stack array on line 13. As character data is encountered, the character_handler() function is called. We can access the most recently opened XML element by looking at the last array element in $open_stack. We append the character data to the second element of the array representing the currently open XML element (line 19).

When the end of an element is encountered, the end_handler() function (declared on line 23) is called. We first remove the last element of the $open_stack array on line 25. The array returned to us contains two elements: first, the name of the XML element that has just been closed and, second, any character data that was contained directly by that element. If the element in question is one we want to print, we can do so, adding any formatting we want.
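If you want to experiment with the stack technique without a feed file, the following sketch applies the same idea to an inline document. It uses anonymous functions and collects results in an array rather than printing HTML. The feed structure (newsitem, body) and the sample text are assumptions, because the actual contents of listing22.1.xml are not shown here:

```php
<?php
// A made-up feed; only headline and byline matter to the handlers.
$feed = <<<XML
<?xml version="1.0"?>
<banana-news>
  <newsitem>
    <headline>Banana prices rise</headline>
    <byline>A. Writer</byline>
    <body>Not shown.</body>
  </newsitem>
</banana-news>
XML;

$stack = array();                                           // open elements
$out   = array( "headline" => array(), "byline" => array() );

$parser = xml_parser_create();
xml_parser_set_option( $parser, XML_OPTION_CASE_FOLDING, 0 );
xml_set_element_handler( $parser,
  function ( $p, $name, $atts ) use ( &$stack ) {
    $stack[] = array( $name, "" );       // push name plus an empty text buffer
  },
  function ( $p, $name ) use ( &$stack, &$out ) {
    $el = array_pop( $stack );           // the element that has just closed
    if ( isset( $out[$name] ) ) {
      $out[$name][] = trim( $el[1] );
    }
  } );
xml_set_character_data_handler( $parser,
  function ( $p, $txt ) use ( &$stack ) {
    if ( count( $stack ) > 0 ) {
      $stack[count( $stack ) - 1][1] .= $txt; // append to the innermost element
    }
  } );

xml_parse( $parser, $feed, true ) or die( "parse failed" );
print_r( $out );
```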

You can see the output from Listing 22.3 (using a more substantial XML document) in Figure 22.2.

Figure 22.2. XML input parsed and formatted for output.

