Using the Tidy Parser

Beyond its ability to validate and repair HTML, tidy also provides a powerful parsing API. To understand how tidy can be used to parse documents, you must first understand a little about how a document is stored internally.

How Documents Are Stored in Tidy

When tidy parses a document, it creates a tree structure in memory to store the contents. This structure, called the document tree, reflects the hierarchical nature of the document and consists of a collection of nodes. For example, consider the following HTML document:

    <HTML>
      <HEAD>
        <TITLE>Example basic HTML document</TITLE>
      </HEAD>
      <BODY BGCOLOR=#FFFFFF>
        <B>Hello, World!</B>
        <I>This is italic text.</I>
      </BODY>
    </HTML>

Note that the tags that compose the document form a hierarchy. That is to say, the <HEAD> tag is "inside" (or a child of) the <HTML> tag. Likewise, the <TITLE> tag is a child of the <HEAD> tag and a grandchild of the <HTML> tag. When you examine a graphical representation of the document tree created by tidy, the relationship between the document and the tree it generates becomes obvious.

The tree begins with a root node. This node is created by tidy and does not correlate to anything in the document itself; rather, it serves as a "handle" to the entire tree. The root node has a single child node, HTML, representing the <HTML> tag. The <HTML> node in turn has two children, <HEAD> and <BODY>, and so on.

Once a document has been parsed by tidy, its object-based parsing API makes PHP capable of "screen scraping" (extracting selected portions of another HTML document) without resorting to such things as regular expressions.

The Tidy Node

To begin using the parsing API in tidy, you must select a starting point in the document tree. Specifically, you can begin from one of the following nodes: root, html, head, or body. You do so by calling the appropriate method on a valid tidy document resource. An example of this process is shown in Listing 15.7.
Listing 15.7. Retrieving an Entrance Node in Tidy

    <?php
        $tidy = tidy_parse_file("http://www.php.net/");

        /* Retrieve the root node of the document tree */
        $root = $tidy->root();
    ?>

When the preceding call to the $tidy->root() method is executed, PHP retrieves the appropriate node and returns an instance of the tidyNode class representing that node. The tidyNode class is an internal class created by tidy that provides all the API used to navigate the document tree. For the sake of clarity, the following is a pseudo-class definition of the tidyNode class:

    class tidyNode {
        public $value;
        public $name;
        public $type;
        public $id;
        public $attribute[];
        public $child[];

        public function hasChildren();
        public function hasSiblings();
        public function isComment();
        public function isHtml();
        public function isText();
        public function isJste();
        public function isAsp();
        public function isPhp();
    }
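To make the shape of the document tree concrete, the following is a minimal sketch (not one of this chapter's listings) that walks a tree through the $child property and builds an indented outline of it. The helper name outlineTree() is hypothetical, the sample markup is made up for illustration, and the code assumes PHP 7 or later with the tidy extension loaded.

```php
<?php
/*
 * Hypothetical helper: returns an indented outline of the tree below $node.
 * Assumes PHP 7+ and the tidy extension.
 */
function outlineTree(tidyNode $node, int $depth = 0): string
{
    /* The root, text, and comment nodes are labeled by kind, not tag name */
    if ($node->type === TIDY_NODETYPE_ROOT) {
        $label = '(root)';
    } elseif ($node->isText()) {
        $label = '(text)';
    } elseif ($node->isComment()) {
        $label = '(comment)';
    } else {
        $label = $node->name;
    }

    $outline = str_repeat('  ', $depth) . $label . "\n";

    /* Recurse into the children, one indentation level deeper */
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            $outline .= outlineTree($child, $depth + 1);
        }
    }

    return $outline;
}

/* Illustrative input; any parsed tidy document can be used instead */
$html = '<html><head><title>Example</title></head>'
      . '<body><b>Hello, World!</b></body></html>';
$tidy = tidy_parse_string($html, [], 'utf8');

echo outlineTree($tidy->root());
?>
```

Starting from $tidy->body() instead of $tidy->root() would limit the outline to the contents of the <BODY> tag.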
As you can see, the tidyNode class itself is fairly simple. Beginning with the properties available, Table 15.1 is a description of each individual aspect:

Although it may not be immediately obvious, these properties represent an incredible amount of power to parse and extract the contents of any HTML, XHTML, or even XML document.

When dealing with HTML or XHTML documents, the $id property is useful when searching for a specific HTML tag (or a set of them). For those nodes that are actual HTML tags (of type TIDY_NODETYPE_START, TIDY_NODETYPE_END, or TIDY_NODETYPE_STARTEND), the value of the $id property will be a constant of the form TIDY_TAG_<TAGNAME>, where <TAGNAME> is the string name of the HTML tag in question. For instance, TIDY_TAG_IMG represents an <IMG> tag, and TIDY_TAG_BODY represents the <BODY> tag. This method of searching for a particular node type will be used later in the chapter to assist in URL extraction.

In the preceding pseudo-class definition, note the $value property. This property, as its name implies, is the value of the node. However, it is important to note that the value of a node includes not only the contents of the node itself, but those of its children as well. Thus, it is ideal for extracting entire HTML tables from within a document without concerning yourself with a single regular expression. Simply use the $value property of a node whose $id property is TIDY_TAG_TABLE.
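The table extraction just described might be sketched as follows. This is an illustration under stated assumptions rather than a listing from the chapter: findNodesById() is a hypothetical helper, the sample markup is invented, and the code assumes PHP 7 or later with the tidy extension loaded. Because the $id property may be absent or null on nodes that are not tags (the root and text nodes, for example), the sketch checks it with isset() before comparing.

```php
<?php
/*
 * Hypothetical helper: collects every node at or below $node whose
 * $id matches the given TIDY_TAG_* constant. Assumes PHP 7+ and tidy.
 */
function findNodesById(tidyNode $node, int $id): array
{
    /* isset() guards against nodes that carry no $id at all */
    $matches = (isset($node->id) && $node->id === $id) ? [$node] : [];

    /* $child is not an array for leaf nodes, so fall back to an empty list */
    foreach ($node->child ?? [] as $child) {
        $matches = array_merge($matches, findNodesById($child, $id));
    }

    return $matches;
}

/* Illustrative input; any parsed tidy document can be used instead */
$html = '<html><head><title>Example</title></head><body>'
      . '<p>Intro</p><table><tr><td>Cell</td></tr></table>'
      . '</body></html>';
$tidy = tidy_parse_string($html, [], 'utf8');

/* $value holds the markup of the table and everything inside it */
foreach (findNodesById($tidy->body(), TIDY_TAG_TABLE) as $table) {
    echo $table->value, "\n";
}
?>
```

Note that a table nested inside another table is returned as its own match in addition to appearing inside the outer table's $value. Passing a different constant, such as TIDY_TAG_IMG, collects a different kind of tag with the same helper.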