An Introduction to the DOM XML Functions

The XML Parser functions are event based—that is, the document is read from top to bottom and handlers are triggered as and when the relevant features are encountered. The document object model (DOM) approach is tree-based. The entire XML document is read and rendered as a tree of objects. This means you can traverse the tree at your leisure, manipulating its nodes if you want. You can also construct your own document trees that can then be output to XML text.

PHP support for DOM was still undergoing some development at the time of writing, and the PHP manual (http://www.php.net/manual/en/ref.domxml.php) was temporarily out of date. However, the syntax is being brought into line with the official W3C specification for DOM at http://www.w3.org/TR/2003/WD-DOM-Level-3-Core-20030226/, so if in doubt, you can always go directly to the rule book! If you have worked with the DOM functions and objects in previous versions of PHP, the main change you will notice is that most method or property names have been altered now to use camel case—that is, firstChild rather than first_child, for example.

The DOM functions rely on the libxml2 library, which is bundled with the PHP 5 distribution. You shouldn't have to specify any configuration settings to gain access to DOM.

The first thing you need if you are going to work with the DOM functions is a DomDocument object. The DomDocument object is a container for all elements, which are themselves represented by objects.

Acquiring a `DomDocument` Object

You can create a DomDocument object directly using the new keyword. The constructor accepts a string containing the XML version number with which you will be working. This is always 1.0, so you can omit the argument altogether and the parser will provide the default for you:

$doc = new DomDocument();

Before we construct our own tree of XML elements, let's look briefly at the mechanism for loading XML from a file:

$doc = new DomDocument();
$doc->loadXML( file_get_contents("listing22.1.xml"));
print $doc->saveXML();

In the previous fragment, we create a DomDocument object and call its loadXML() method, which accepts an XML string and builds a model of it in memory. We then output raw XML again by calling saveXML(), which builds an output string from the XML tree.
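To try loadXML() without a file on disk, a minimal sketch along the following lines may help. The inline XML string is an assumption made for illustration and is not the content of listing22.1.xml. The check on the return value relies on loadXML() returning false when the string cannot be parsed:

```php
<?php
// Hypothetical inline document, standing in for a real file.
$xml = '<banana-news><newsitem type="world">'
     . '<headline>Banana sales</headline>'
     . '</newsitem></banana-news>';

$doc = new DomDocument();

// Suppress parser warnings so that a failure is handled by our own check.
if ( ! @$doc->loadXML( $xml ) ) {
  die( "Could not parse the XML string" );
}

// documentElement holds the root element of the loaded tree.
$rootname = $doc->documentElement->tagName;
print $rootname."\n";
print $doc->saveXML();
```

Reading the root through documentElement rather than firstChild is a design choice in this sketch: it still points at the root element if the document begins with a comment or a document type declaration.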

In the following examples, we will not load XML from a document. We will instead use the DOM methods to construct our own tree of XML elements.

The Root Element

Just as the DOM provides an analog for an XML document, it also provides an object to represent an element. The DomElement and DomDocument objects derive from a common parent class (DomNode) and are therefore similar in structure.

To create a root element in our document, we must first acquire a DomElement object and then add it. The DomDocument object acts as a factory, generating DomElement objects on request:

$rootel = $doc->createElement("banana-news");

We use the createElement() method, which accepts a string and generates an element named accordingly. Acquiring an element is not enough on its own. We must add any element we acquire with DomDocument::createElement() to a parent node in the same tree. In this case, we want to add $rootel to the document node itself:

$doc->appendChild( $rootel );

The DomNode::appendChild() method accepts a node and adds it to the end of the calling node's children. So, we have created a root element and added it as a child of the document object. Remember that the DomDocument class extends DomNode, which is why we can call appendChild() on our $doc object. Let's bring that all together:

$doc = new DomDocument();
$rootel = $doc->createElement("banana-news");
$doc->appendChild( $rootel );
print $doc->saveXML();

We do nothing new in the previous fragment. We create a DomDocument object and then use it to generate a DomElement object, which we add as the root element of the document using the appendChild() method. Finally, we write the minimal XML document to the browser. The output should look like this:

<?xml version="1.0"?>
<banana-news/>

Adding New `DomElement` Objects to the Tree

Now that we have a root element, we can repeat the mechanism we covered previously to add new elements to the XML document:

$headline = $doc->createElement("headline");
$rootel->appendChild( $headline );

We create a new <headline> element using DomDocument::createElement(). We then add the element to the root (<banana-news>) element. So far, we have worked with only two kinds of nodes: elements and documents. We need to add some text to the <headline> element. To do this we must create a text node:

$text = $doc->createTextNode("Banana related disasters");
$headline->appendChild( $text );

The DomDocument::createTextNode() method automates this process. We are given a DomText object, which extends DomNode, so we can add it to an element in the normal way.
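Pulling the last few fragments together, the following sketch builds the small tree and prints it. The headline text has been changed to include an ampersand (an assumption for illustration) to show that text-node content is escaped when the tree is serialized:

```php
<?php
$doc = new DomDocument();

// Root element, attached to the document node.
$rootel = $doc->createElement("banana-news");
$doc->appendChild( $rootel );

// Child element, attached to the root.
$headline = $doc->createElement("headline");
$rootel->appendChild( $headline );

// The ampersand is stored as-is and written out as &amp; by saveXML().
$text = $doc->createTextNode("Bananas & related disasters");
$headline->appendChild( $text );

$output = $doc->saveXML();
print $output;
```

The output should contain `<banana-news><headline>Bananas &amp; related disasters</headline></banana-news>` after the XML declaration.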

We now have enough information to use the DOM functions to create the XML document in Listing 22.1. We use data from an array of associative arrays (declared on line 3), but it could just as easily have been pulled from a database. You can see the code in Listing 22.4.

Listing 22.4 Constructing an XML Document with the DOM Functions

 1: <?php
 2:
 3: $news = array(
 4:   array( "headline" => "arf arf, mcGraph",
 5:     "image" => "/res/high.gif",
 6:     "byline" => "William Curvey",
 7:     "article" => "Research published today by...",
 8:     "type" => "world"
 9:     ),
10:
11:   array( "headline" => "Banana sales",
12:     "image" => "/res/high.gif",
13:     "byline" => "William Curvey",
14:     "article" => "Research published today by...",
15:     "type" => "world"
16:     ),
17:   array( "headline" => "Domestic banana use beggars belief",
18:     "image" => "/res/use.gif",
19:     "byline" => "Charles Split",
20:     "article" => "Bananas are for more than eating...",
21:     "type" => "world"
22:     )
23: );
24:
25:
26: $doc = new DomDocument("1.0");
27: $root = $doc->appendChild( $doc->createElement("banana-news") );
28: foreach( $news as $newselement ) {
29:   $item = $root->appendChild( $doc->createElement( "newsitem") );
30:   $item->setAttribute( "type", $newselement['type'] );
31:   foreach( array("headline", "image", "byline") as $tagname ) {
32:
33:     // PHP 5 lets us do this in one unreadable line:
34:     // $item->appendChild( $doc->createElement( $tagname ) )
35:     // ->appendChild(
36:     //   $doc->createTextNode( $newselement[$tagname] ) );
37:     // But we will use temporary variables:
38:
39:     $el = $doc->createElement( $tagname );
40:     $item->appendChild( $el );
41:     $text = $doc->createTextNode( $newselement[$tagname] );
42:     $el->appendChild( $text );
43:
44:   }
45: }
46:
47: print $doc->saveXML( );

There is very little that is new in Listing 22.4. On line 3 we set up an array of associative arrays to hold our news data. On line 26 we instantiate a DomDocument object before adding a <banana-news> root element on line 27. We then loop through our news array, using the createElement(), createTextNode(), and appendChild() methods to build up our tree.

We do introduce a new method on line 30: setAttribute(), which is defined in the DomElement class. It requires name and value arguments and adds an attribute node to the element. An attribute is a name/value pair included in the element's opening tag that qualifies the element in some way. In this case we are adding the type="world" attribute to each <newsitem> element.
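As a small sketch of how setAttribute() behaves, the following fragment sets an attribute, overwrites it, and reads it back. The getAttribute() and hasAttribute() methods are not used in Listing 22.4. They are brought in here only to inspect the result, and the "lang" attribute name is hypothetical:

```php
<?php
$doc = new DomDocument();
$item = $doc->appendChild( $doc->createElement("newsitem") );

$item->setAttribute( "type", "world" );
// Setting the same name again replaces the value rather than adding a duplicate.
$item->setAttribute( "type", "local" );

$type = $item->getAttribute("type");   // current value of the attribute
$has  = $item->hasAttribute("lang");   // false: we never set this one

// Passing a node to saveXML() serializes just that node.
$fragment = $doc->saveXML( $item );
print $fragment."\n";
```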

Getting Information from `DomElement` Objects

Usually, the first thing you will want to know about a DomElement is its name. This is stored in the $tagName property:

print "I am a ".$el->tagName." element";

After you know the name of an element, you will want to know whether it has any attributes, which are stored in DomAttr objects. You can acquire the DomAttr objects associated with an element by accessing the DomNode::attributes property. In released versions of PHP 5 and later, this property holds a DomNamedNodeMap object rather than a plain array: you can loop through it with foreach, but to fetch a single attribute by name you call its getNamedItem() method:

$type_attr = $el->attributes->getNamedItem('type');

To access the name and value of each DomAttr object, you can use the conveniently named $name and $value properties:

$atts = $el->attributes;

if ( ! empty( $atts ) ) {
  foreach( $atts as $name=>$att_ob ) {
    print $att_ob->name.": ".$att_ob->value."<br />\n";
  }
}
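The following self-contained sketch applies the same loop to a small inline element. It assumes that the attributes property is a DomNamedNodeMap, as it is in released versions of PHP 5 and later. The element and its "lang" attribute are made up for illustration:

```php
<?php
$doc = new DomDocument();
$doc->loadXML( '<newsitem type="world" lang="en"/>' );
$el = $doc->documentElement;

// Collect every attribute as name => value.
$found = array();
foreach ( $el->attributes as $att_ob ) {
  $found[$att_ob->name] = $att_ob->value;
}

// Fetch a single attribute node by name; null is returned if it is absent.
$type_attr = $el->attributes->getNamedItem("type");
$missing   = $el->attributes->getNamedItem("nope");

foreach ( $found as $name => $value ) {
  print $name.": ".$value."\n";
}
```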

To navigate an XML tree, you must take advantage of the properties and methods that DOM objects provide to describe their place in the structure.

Given a DomElement object, you can discover whether it has any child nodes (elements, text nodes, and so on) with the hasChildNodes() method. This method returns a boolean:

if ( $el->hasChildNodes() ) {
  print "I am blessed with progeny";
}

If the element has children, you can access the first child with the $firstChild property. If the element does not have children, $firstChild contains null, as shown here:

if ( $el->hasChildNodes() ) {
    $child = $el->firstChild;
}

You can traverse the tree vertically, but what about horizontally? Elements know about their siblings as well. You can access an element's next sibling with the $nextSibling property and its previous sibling with the $previousSibling property. Both of these properties contain null if there is no sibling to be found. Because siblings can be text nodes, which have no $tagName property, this loop prints $nodeName, which every DomNode provides:

for ( $sib = $el->firstChild; $sib; $sib = $sib->nextSibling ) {
  print $sib->nodeName."<br />";
}

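A parent, of course, can access all its children. The $childNodes property contains a DomNodeList object, which you can loop through with foreach as if it were an array of DomNode objects. If the element is childless, the list is simply empty. Again, we print $nodeName because the children may include text nodes:

$kids = $el->childNodes;
foreach( $kids as $child ) {
  print $child->nodeName."<br />";
}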

Children also know about their parents. The $parentNode property contains a node's parent, which is another element or, in the case of the root element, the document itself.
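To see these navigation properties working together, here is a minimal sketch run against a small inline document. The XML is an assumption for illustration and deliberately contains no whitespace between elements, so every child of <newsitem> is an element:

```php
<?php
$doc = new DomDocument();
$doc->loadXML( '<newsitem><headline>Banana sales</headline>'
             . '<image>/res/high.gif</image>'
             . '<byline>William Curvey</byline></newsitem>' );
$el = $doc->documentElement;

$first  = $el->firstChild;          // <headline>
$second = $first->nextSibling;      // <image>
$before = $first->previousSibling;  // null: nothing precedes the first child

// Walk across the children from first to last.
$names = array();
for ( $sib = $el->firstChild; $sib; $sib = $sib->nextSibling ) {
  $names[] = $sib->nodeName;
}

// Climb from a child back to its parent.
$parent = $second->parentNode->nodeName;

print implode( ", ", $names )." (children of ".$parent.")\n";
```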

Examining Text Nodes

Armed with the methods we have covered, we can now swing about an XML tree pretty well. But we haven't gotten down to the most important features of the tree. An element is not the only kind of node we want to deal with. Among its children are text nodes, comment nodes, and others beyond the scope of this book.

Our main concern is text nodes, which we use to acquire document content. The first thing we need to be able to do is to distinguish between DomElement objects and DomText objects. The DomElement and DomText classes share a common parent class: DomNode. All DomNode objects have a $nodeType property that contains an identifying integer. These integers can be tested using built-in constants. For DomElement and DomText objects, we use XML_ELEMENT_NODE and XML_TEXT_NODE, respectively:

if ( $child->nodeType == XML_ELEMENT_NODE ) {
  // work with the element
} elseif ( $child->nodeType == XML_TEXT_NODE ) {
  // work with the text node
}

After we have located a text node, we still need to access its contents. We can do this with the $nodeValue property:

if ( $child->nodeType == XML_TEXT_NODE ) {
  print $child->nodeValue;
}
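One practical consequence is worth a short sketch: when a document is indented, the whitespace between tags is itself stored in text nodes. The inline document below is an assumption for illustration, and the counts rely on the parser's default behavior of preserving whitespace:

```php
<?php
$doc = new DomDocument();
$doc->loadXML( "<newsitem>\n  <headline>Banana sales</headline>\n</newsitem>" );

$elements = 0;
$texts    = 0;
$content  = null;

foreach ( $doc->documentElement->childNodes as $child ) {
  if ( $child->nodeType == XML_ELEMENT_NODE ) {
    $elements++;
    // The element's own first child is the text node holding its content.
    $content = $child->firstChild->nodeValue;
  } elseif ( $child->nodeType == XML_TEXT_NODE ) {
    // Whitespace-only text nodes sit before and after <headline>.
    $texts++;
  }
}

print $elements." element, ".$texts." text nodes, content: ".$content."\n";
```

This is why code that walks child nodes usually tests $nodeType before assuming it is dealing with an element.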

Traversing a Tree: Two Approaches

We now have enough information to work our way through a tree, but how do we go about it? In this section, we examine two approaches to this task.

The first approach is designed to do the work of acquiring each node in turn and return it to the calling code. Listing 22.5 demonstrates.

Listing 22.5 Traversing a Tree of XML Nodes Using On-Demand Functions

 1: <?php
 2:
 3: $doc = new DomDocument("1.0");
 4: $doc->loadXML( file_get_contents("listing22.1.xml") );
 5: $root = $doc->firstChild;
 6: $pointer = $root;
 7:
 8: do {
 9:   print $pointer->tagName."<br />\n";
10: } while ( $pointer = next_element( $pointer ) );
11:
12: function next_element( DomNode $pointer ) {
13:   while ( $pointer = next_node( $pointer ) ) {
14:     if ( $pointer->nodeType == XML_ELEMENT_NODE ) {
15:       return $pointer;
16:     }
17:   }
18:   return false;
19: }
20:
21: function next_node( DomNode $pointer ) {
22:   if ( $pointer->hasChildNodes() ) {
23:     return $pointer->firstChild ;
24:   }
25:   if ( $next = $pointer->nextSibling ) {
26:     return $next;
27:   }
28:   while( $pointer = $pointer->parentNode ) {
29:     if ( $next=$pointer->nextSibling ) {
30:       return $next;
31:     }
32:   }
33: }
34: ?>

As you can see, the real work is done by the next_node() function on line 21. This accepts a node object and tests it to see whether it has any children. If so, it returns the first one on line 23. If the node has no children, we then look for a sibling, returning it on line 26 if it is found. If the node has no children or siblings, we then climb back up the tree in a while loop starting on line 28, looking for siblings as we go. As soon as we find a sibling object on our climb, we return it on line 30. By repeatedly calling next_node(), we will eventually traverse the entire tree.
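To make the visiting order concrete, the sketch below applies the same child, sibling, then climb logic to a tiny inline document and records every node it reaches. The function is renamed sketch_next_node() (a hypothetical name) so that it does not clash with the listing, and the XML is invented for illustration:

```php
<?php
// Same logic as next_node() in Listing 22.5, under a different name.
function sketch_next_node( DomNode $pointer ) {
  if ( $pointer->hasChildNodes() ) {
    return $pointer->firstChild;          // go down
  }
  if ( $next = $pointer->nextSibling ) {
    return $next;                         // go across
  }
  while ( $pointer = $pointer->parentNode ) {
    if ( $next = $pointer->nextSibling ) {
      return $next;                       // climb, then go across
    }
  }
  return null;                            // nothing left to visit
}

$doc = new DomDocument();
$doc->loadXML( '<a><b><c/></b><d>text</d><e/></a>' );

// Record the name of every node, text nodes included.
$visited = array();
for ( $node = $doc->documentElement; $node; $node = sketch_next_node( $node ) ) {
  $visited[] = $node->nodeName;
}
print implode( " ", $visited )."\n";   // a b c d #text e
```

The nodes come back in document order, which is the order in which their opening tags appear in the source text.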

The next approach traverses the tree in the same way. It differs from the previous example in that the calling code does not repeatedly request the next node. Instead, the traversing function calls itself recursively until the tree has been completely explored. You can see this in action in Listing 22.6.

Listing 22.6 Traversing a Tree of XML Nodes Using Recursion

 1: <?php
 2:
 3: $doc = new DomDocument("1.0");
 4: $doc->loadXML( file_get_contents("listing22.1.xml") );
 5: $root = $doc->firstChild;
 6: traverse( $root );
 7:
 8: function traverse( DomNode $node, $level=0 ){
 9:   handle_node( $node, $level );
10:   if ( $node->hasChildNodes() ) {
11:     $children = $node->childNodes;
12:     foreach( $children as $kid ) {
13:       if ( $kid->nodeType == XML_ELEMENT_NODE ) {
14:         traverse( $kid, $level+1 );
15:       }
16:     }
17:   }
18: }
19:
20: function handle_node( DomNode $node, $level ) {
21:   for ( $x=0; $x<$level; $x++ ) {
22:     print " ";
23:   }
24:   if ( $node->nodeType == XML_ELEMENT_NODE ) {
25:     print $node->tagName."<br />\n";
26:   }
27: }
28: ?>

The traverse() function, called on line 6 and declared on line 8, does all the work. Passed a node object, it looks for children. If children are present, it then works through them using a foreach loop on line 12, calling itself recursively with each child element in turn. Every time traverse() is called, it calls handle_node() (declared on line 20) where application-specific code can work with the node.
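One way to keep the application-specific code separate from the traversal is to pass the handler in as an argument. The following is a sketch of that variation, not the listing's own design: sketch_traverse() is a hypothetical name, the inline XML is invented, and the anonymous function assumes PHP 5.3 or later:

```php
<?php
// Recursive traversal that hands each element to a caller-supplied handler.
function sketch_traverse( DomNode $node, $handler, $level = 0 ) {
  call_user_func( $handler, $node, $level );
  // childNodes is an empty list for a childless element, so no guard is needed.
  foreach ( $node->childNodes as $kid ) {
    if ( $kid->nodeType == XML_ELEMENT_NODE ) {
      sketch_traverse( $kid, $handler, $level + 1 );
    }
  }
}

$doc = new DomDocument();
$doc->loadXML( '<a><b><c/></b><d>text</d></a>' );

// The handler builds an indented outline, two spaces per level.
$lines = array();
sketch_traverse( $doc->documentElement, function( $node, $level ) use ( &$lines ) {
  $lines[] = str_repeat( "  ", $level ).$node->tagName;
} );

print implode( "\n", $lines )."\n";
```

With this arrangement the same traversal function can drive different handlers, for example one that prints an outline and another that collects headlines.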
