Applications of Tidy

Throughout this chapter, I have provided a number of examples of how tidy can be used within your applications. These examples should serve as a guide for your own uses of tidy. Many of the following examples also serve as a reference for some of the more useful tidy configuration options.

Tidy as an Output Buffer

Although not appropriate for every circumstance, tidy also provides an output buffer function that can be used with PHP's output buffering capabilities. This feature can be enabled by default by setting the tidy.clean_output php.ini configuration directive, or for a single script by registering the ob_tidyhandler() function when calling ob_start(). In either case, tidy will parse, clean, and repair all output passed through the PHP output buffer, using either the default configuration or the one specified by the tidy.default_config directive.

Converting Documents to CSS

In the modern HTML specification, the <FONT> tag is a deprecated method of specifying the font, size, and color of text within the document. Instead, cascading style sheets (CSS) should be used to specify these layout details. Tidy can strip a document of these deprecated tags and replace them with an embedded style sheet. To take advantage of this feature, set the clean tidy configuration option to true. Consider the following HTML input, clean.html, which is processed by the code in Listing 15.8:

    <!-- clean.html //-->
    <HTML>
    <HEAD><TITLE></TITLE></HEAD>
    <BODY>
    <FONT COLOR="red">Hello, World!</FONT><BR/>
    <B><FONT SIZE=4 COLOR=#c0c0c0>More Text...</FONT></B>
    </BODY>
    </HTML>

Listing 15.8. Replacing <FONT> Tags with CSS

    <?php
        /* The clean option replaces presentational markup with CSS rules */
        $tidy = tidy_parse_file("clean.html", array("clean" => true));
        tidy_clean_repair($tidy);
        echo $tidy;
    ?>

When executed, the output generated by tidy is as follows:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
    <html>
    <head>
    <title></title>
    <style type="text/css">
     b.c2 {color: #C0C0C0; font-size: 120%}
     span.c1 {color: red}
    </style>
    </head>
    <body>
    <span class="c1">Hello, World!</span><br>
    <b class="c2">More Text...</b>
    </body>
    </html>
Reducing Bandwidth Usage

Because tidy parses the entire document and regenerates its output, it is also useful for reducing the overall size of your HTML documents. This reduction can be maximized (at the cost of readability) by setting a number of configuration options, as shown in Listing 15.9:

Listing 15.9. Reducing Bandwidth Usage Using Tidy

    <?php
        $options = array("clean" => true,
                         "drop-proprietary-attributes" => true,
                         "drop-font-tags" => true,
                         "drop-empty-paras" => true,
                         "hide-comments" => true,
                         "join-classes" => true,
                         "join-styles" => true);

        $tidy = tidy_parse_file("http://www.php.net/", $options);
        tidy_clean_repair($tidy);
        echo $tidy;
    ?>

Although the size of the reduction depends largely on the file itself, for sites that handle a great deal of traffic, even a single kilobyte saved per page can mean megabytes a day in saved bandwidth. Furthermore, with this capability, source HTML can be stored in a human-readable format (complete with bandwidth-wasting comments) without sending that overhead to the client.

Beautifying Documents

In the previous section, I showed you how to use tidy to reduce the overall size of your documents. Tidy can also be used to take a document that is difficult to read or maintain and beautify it. An example is shown in Listing 15.10:

Listing 15.10. Beautifying HTML Using Tidy

    <?php
        $options = array("indent" => true,       /* Turn on beautification */
                         "indent-spaces" => 4,   /* Spaces per indenting level */
                         "wrap" => 4096);        /* Line length before wrapping */

        $tidy = tidy_parse_file("http://www.php.net/", $options);
        tidy_clean_repair($tidy);
        echo $tidy;
    ?>
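To see what effect option sets like those in Listings 15.9 and 15.10 have on document size, the output lengths can simply be compared. The following is a minimal sketch, not part of the original listings: the helper name tidy_with() and the inline sample document are hypothetical, and it assumes PHP 7 or later with the tidy extension loaded. It uses only options that appear in the listings above.

```php
<?php
// Hypothetical helper (an assumption for illustration): run tidy over an
// HTML string with the given options and return the repaired markup.
function tidy_with(string $html, array $options): string
{
    $tidy = tidy_parse_string($html, $options, "utf8");
    tidy_clean_repair($tidy);
    return tidy_get_output($tidy);
}

// Small sample document containing a comment that a visitor never needs.
$html = "<html><head><title>Demo</title></head><body>"
      . "<!-- maintainer note: keep this section in sync with the docs -->"
      . "<p>Hello, World!</p>"
      . "</body></html>";

// Compact output: drop comments and disable line wrapping.
$compact = tidy_with($html, array("hide-comments" => true, "wrap" => 0));

// Readable output: indent nested elements, as in Listing 15.10.
$pretty = tidy_with($html, array("indent" => true,
                                 "indent-spaces" => 4,
                                 "wrap" => 4096));

printf("compact:    %d bytes\n", strlen($compact));
printf("beautified: %d bytes\n", strlen($pretty));
```

The exact byte counts depend on the installed tidy library, so they are best measured against your own pages rather than assumed.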
Extracting URLs from a Document

The next application of tidy I'll discuss is a script to extract all the URLs from a given document (without a single regular expression). This script (really a function, dump_urls()) is found in Listing 15.11.

Listing 15.11. Extracting URLs Using Tidy

    <?php
        /* Recursively collect the href of every anchor at or below $node */
        function dump_urls(tidyNode $node, &$urls = NULL) {

            $urls = (is_array($urls)) ? $urls : array();

            if(isset($node->id)) {
                if($node->id == TIDY_TAG_A) {
                    /* Skip anchors that have no href attribute */
                    if(isset($node->attribute['href'])) {
                        $urls[] = $node->attribute['href'];
                    }
                }
            }

            if($node->hasChildren()) {
                foreach($node->child as $child) {
                    dump_urls($child, $urls);
                }
            }

            return $urls;
        }

        $tidy = tidy_parse_file("http://www.php.net/");
        $urls = dump_urls($tidy->body());
        print_r($urls);
    ?>

Although this is a relatively small script, its size reflects the power to parse HTML that the tidy extension provides. Beginning at the top, you can see that the dump_urls() function accepts two parameters: $node, the node from which URL extraction will begin (meaning it and all of its children), and an optional parameter, $urls. This second parameter is not intended to be passed directly, but is instead used internally by the dump_urls() function.

When called, dump_urls() begins by initializing the $urls parameter to ensure that it is an array. It then examines the passed $node object by checking its id property to see whether it is an anchor tag (TIDY_TAG_A). If the node is indeed an anchor and has an href attribute, the value of that attribute (the URL) is stored in the $urls array.

At this point in the function, the current node's URL (if it existed) has been extracted, and the function moves on to examine the child nodes (if any exist). This takes place in the foreach loop, where each child node is passed recursively to dump_urls() (along with the $urls array containing the URLs already found) until all the children have been examined.
When the dump_urls() function finally returns to the initial caller, it returns the $urls array filled with all the URLs found. This script can easily be modified to look for other types of data within HTML documents as well. For instance, by searching for the TIDY_TAG_IMG type instead of TIDY_TAG_A (and looking for the src attribute instead of href), this function will extract the locations of all the images within the document.
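The modification described above can be sketched by turning the tag constant and the attribute name into parameters. This is an illustration rather than code from the chapter: the function name collect_attribute() and the inline sample markup are hypothetical, and the sketch assumes PHP 7 or later with the tidy extension loaded (where the node class is named tidyNode).

```php
<?php
// Hypothetical generalization of dump_urls() (an assumption for
// illustration): collect the value of $attr from every element at or
// below $node whose tag id equals $tagId.
function collect_attribute(tidyNode $node, int $tagId, string $attr,
                           array &$found = array()): array
{
    // isset() guards nodes that carry no tag id (such as text nodes)
    // and elements that lack the requested attribute.
    if (isset($node->id) && $node->id === $tagId
        && isset($node->attribute[$attr])) {
        $found[] = $node->attribute[$attr];
    }

    // Recurse into the children, sharing the same result array.
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            collect_attribute($child, $tagId, $attr, $found);
        }
    }

    return $found;
}

// Inline sample document so the sketch runs without network access.
$html = '<html><head><title>Demo</title></head><body>'
      . '<a href="/docs"><img src="logo.png" alt="logo"></a>'
      . '<p><img src="photo.jpg" alt="photo"></p>'
      . '</body></html>';

$tidy = tidy_parse_string($html, array(), "utf8");

$images = collect_attribute($tidy->body(), TIDY_TAG_IMG, "src");
$links  = collect_attribute($tidy->body(), TIDY_TAG_A, "href");

print_r($images);
print_r($links);
```

Passing the tag id and attribute name as arguments means the same traversal serves links, images, or any other element type for which tidy defines a TIDY_TAG_* constant.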