Using the Tidy Parser

Beyond its ability to validate and repair HTML, tidy also provides a powerful parsing API. To understand how tidy can be used to parse documents, you must first understand a little about how a document is stored internally.

How Documents Are Stored in Tidy

When tidy parses a document, it creates a tree structure in memory to store the contents. This structure, called the document tree, reflects the hierarchical nature of the document and consists of a collection of nodes. For example, consider the following HTML document:

<HTML>
    <HEAD>
        <TITLE>Example basic HTML document</TITLE>
    </HEAD>
    <BODY BGCOLOR="#FFFFFF">
        <B>Hello, World!</B>
        <I>This is italic text.</I>
    </BODY>
</HTML>

Note that the tags that make up the document form a hierarchy. That is to say, the <HEAD> tag is "inside" (or a child of) the <HTML> tag. Likewise, the <TITLE> tag is a child of the <HEAD> tag and a grandchild of the <HTML> tag.

When you examine a graphical representation of the document tree created by tidy, the relationship between the document and the tree it generates becomes obvious. The tree begins with a root node. This node is created by tidy and does not correspond to anything in the document itself; rather, it serves as a "handle" to the entire tree. The root node has a single child node, HTML, representing the <HTML> tag. The <HTML> node in turn has two children, <HEAD> and <BODY>, and so on.
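As a rough textual stand-in for such a diagram, the following sketch prints an indented outline of the tree tidy builds for the sample document. It relies on the tidyNode properties and methods described later in this section; the outline_tree() helper and the "(root)" and "(text)" labels are illustrative names, not part of tidy.

```php
<?php
$html = <<<EOT
<HTML>
    <HEAD>
        <TITLE>Example basic HTML document</TITLE>
    </HEAD>
    <BODY BGCOLOR="#FFFFFF">
        <B>Hello, World!</B>
        <I>This is italic text.</I>
    </BODY>
</HTML>
EOT;

/* Hypothetical helper: returns one indented line per node in the tree */
function outline_tree($node, $depth = 0)
{
    if ($node->type == TIDY_NODETYPE_ROOT) {
        $label = '(root)';          /* the handle node created by tidy */
    } elseif ($node->isText()) {
        $label = '(text)';          /* text nodes have no tag name */
    } else {
        $label = $node->name;
    }
    $lines = array(str_repeat('    ', $depth) . $label);

    /* Recurse into the children, one indentation level deeper */
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            $lines = array_merge($lines, outline_tree($child, $depth + 1));
        }
    }
    return $lines;
}

$tidy = tidy_parse_string($html);
echo implode("\n", outline_tree($tidy->root())), "\n";
```

Each level of indentation in the output corresponds to one parent/child step in the document tree.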

Once a document has been parsed with tidy, its object-based parsing API lets PHP perform "screen scraping" (extracting selected portions of another HTML document) without resorting to such things as regular expressions.

The Tidy Node

To begin using the parsing API in tidy, a starting point in the document tree must be selected. Specifically, you can begin from the root, html, head, or body node by calling the corresponding method on a valid tidy document object. An example of this process is shown in Listing 15.7.

Listing 15.7. Retrieving an Entrance Node in Tidy

<?php
    $tidy = tidy_parse_file("http://www.php.net/");
    /* Retrieve the root node of the document tree */
    $root = $tidy->root();
?>
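The other entrance nodes are retrieved the same way. The sketch below is a variation on the listing that parses a small inline document with tidy_parse_string() instead of fetching a URL, so it runs without network access; the markup itself is made up for illustration.

```php
<?php
$tidy = tidy_parse_string(
    '<html><head><title>Entrance nodes</title></head>'
    . '<body><p>Hello</p></body></html>'
);

/* Each method returns a tidyNode for that part of the document tree */
$html = $tidy->html();
$head = $tidy->head();
$body = $tidy->body();

echo $html->name, "\n";
echo $head->name, "\n";
echo $body->name, "\n";

/* The root node is retrieved exactly as before */
$root = $tidy->root();
```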

When the preceding call to the $tidy->root() method is executed, PHP retrieves the appropriate node and returns an instance of the tidyNode class representing that node. The tidyNode class is an internal class created by tidy that provides the API used to navigate the document tree. For the sake of clarity, the following is a pseudo-class definition of the tidyNode class:

class tidyNode {

      public $value;
      public $name;
      public $type;
      public $id;

      public $attribute[];
      public $child[];

      public function hasChildren();
      public function hasSiblings();

      public function isComment();
      public function isHtml();
      public function isText();
      public function isJste();
      public function isAsp();
      public function isPhp();
}

NOTE

The preceding class definition is not a valid PHP class, nor is it even valid PHP syntax! It is merely pseudocode that lists the properties and methods available on the tidyNode class. For reference, those properties that have two brackets at the end of their names, such as

public $attribute[];

represent properties that are arrays (either associative or indexed).

As you can see, the tidyNode class itself is fairly simple. Beginning with the properties available, Table 15.1 describes each one:

Table 15.1. Properties of the tidyNode Class

Property        Description
$value          The contents of this node as a string of markup, including the node's own tags and all of its children.
$name           The name of the node, matching the name of the markup tag. For HTML documents, tidy reports names in lowercase (html, head, body, and so on).
$type           One of the following constants indicating the type of the node:

                Constant                   Meaning
                TIDY_NODETYPE_ROOT         The special root node of the tree
                TIDY_NODETYPE_DOCTYPE      The node representing the DOCTYPE tag
                TIDY_NODETYPE_COMMENT      A comment within the document
                TIDY_NODETYPE_PROCINS      An XML processing instruction
                TIDY_NODETYPE_TEXT         A text node
                TIDY_NODETYPE_START        A start (opening) tag
                TIDY_NODETYPE_END          An end (closing) tag
                TIDY_NODETYPE_STARTEND     An empty, self-closing tag such as <br />
                TIDY_NODETYPE_CDATA        A <![CDATA[ ]]> block
                TIDY_NODETYPE_SECTION      A section block
                TIDY_NODETYPE_ASP          An ASP code block
                TIDY_NODETYPE_JSTE         A JSTE code block
                TIDY_NODETYPE_PHP          A PHP code block
                TIDY_NODETYPE_XMLDECL      An XML declaration

$id             The ID of the node (available only on nodes of type TIDY_NODETYPE_START, TIDY_NODETYPE_END, and TIDY_NODETYPE_STARTEND).
$attribute[]    An associative array of attributes for the given node in attribute name/value pairs.
$child[]        An indexed array of all the children of this node.
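To make the table more concrete, here is a minimal sketch that reads several of these properties from the body node of a small inline document. The markup is invented for the example.

```php
<?php
$tidy = tidy_parse_string(
    '<html><head><title>Properties</title></head>'
    . '<body bgcolor="#FFFFFF"><b>Hello, World!</b></body></html>'
);
$body = $tidy->body();

echo "name: ", $body->name, "\n";
echo "is a start tag: ", ($body->type == TIDY_NODETYPE_START ? 'yes' : 'no'), "\n";
echo "is the body tag: ", ($body->id == TIDY_TAG_BODY ? 'yes' : 'no'), "\n";

/* $attribute is an associative array of name => value pairs */
foreach ($body->attribute as $attrName => $attrValue) {
    echo "attribute: $attrName = $attrValue\n";
}

/* $child is an indexed array of tidyNode objects */
foreach ($body->child as $index => $child) {
    echo "child $index: ", $child->name, "\n";
}
```

Nodes without attributes or children may not have usable arrays in these properties, so code that handles arbitrary nodes should check hasChildren() (or test the property) before looping.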

Although it may not be immediately obvious, these properties provide a great deal of power for parsing and extracting the contents of any HTML, XHTML, or even XML document. When dealing with HTML or XHTML documents, the $id property is useful when searching for a specific HTML tag or a set of tags. For those nodes that are actual HTML tags (of type TIDY_NODETYPE_START, TIDY_NODETYPE_END, or TIDY_NODETYPE_STARTEND), the value of the $id property is a constant of the form

TIDY_TAG_<TAGNAME>

Here, <TAGNAME> is the name of the HTML tag in question. For instance, TIDY_TAG_IMG represents an <IMG> tag, and TIDY_TAG_BODY represents the <BODY> tag. This method of searching for a particular node type will be used later in the chapter to assist in URL extraction.
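As a small sketch of the idea, the following code walks the tree and collects every node whose $id matches a given TIDY_TAG_ constant. The find_nodes_by_id() helper and the sample markup are hypothetical. Because $id is only meaningful on tag nodes, the helper checks it with isset() before comparing.

```php
<?php
/* Hypothetical helper: collects every node at or below $node
   whose $id matches the given TIDY_TAG_* constant */
function find_nodes_by_id($node, $tagId)
{
    $found = array();

    /* $id is only available on tag nodes, so test it first */
    if (isset($node->id) && $node->id == $tagId) {
        $found[] = $node;
    }
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            $found = array_merge($found, find_nodes_by_id($child, $tagId));
        }
    }
    return $found;
}

$tidy = tidy_parse_string(
    '<html><head><title>Images</title></head><body>'
    . '<p><img src="one.png" alt="One"> and <img src="two.png" alt="Two"></p>'
    . '</body></html>'
);

/* Print the src attribute of every <IMG> tag in the body */
foreach (find_nodes_by_id($tidy->body(), TIDY_TAG_IMG) as $img) {
    echo $img->attribute['src'], "\n";
}
```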

In the preceding pseudo-class definition, note the $value property. This property, as its name implies, is the value of the node. However, it is important to note that the value of a node includes not only the contents of the node itself, but those of its children as well. Thus, it is ideal for extracting entire HTML tables from within a document without writing a single regular expression. Simply use the $value property of a node whose $id property is TIDY_TAG_TABLE.
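One possible way to do this is sketched below. The find_first_by_id() helper and the sample markup are assumptions made for the example; the search simply returns the first node in document order with a matching $id.

```php
<?php
/* Hypothetical helper: depth-first search for the first node
   whose $id matches the given TIDY_TAG_* constant */
function find_first_by_id($node, $tagId)
{
    if (isset($node->id) && $node->id == $tagId) {
        return $node;
    }
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            $match = find_first_by_id($child, $tagId);
            if ($match !== null) {
                return $match;
            }
        }
    }
    return null;
}

$tidy = tidy_parse_string(
    '<html><head><title>Prices</title></head><body>'
    . '<p>Current prices:</p>'
    . '<table><tr><td>Widget</td><td>9.99</td></tr></table>'
    . '</body></html>'
);

$table = find_first_by_id($tidy->body(), TIDY_TAG_TABLE);
if ($table !== null) {
    /* $value holds the markup of the table and everything inside it */
    echo $table->value;
}
```

The paragraph that precedes the table is not part of the output, because it is a sibling of the table node rather than one of its children.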

NOTE

Tidy node objects are overloaded internally and evaluate to their $value property when treated as strings. This means that a tidy node can be used like a string whenever such behavior is desired. This is shown in the code that follows; in both cases, the output will be identical:

echo $mynode->value;
echo $mynode;