Applications of Tidy

Throughout this chapter, I have provided (to the best of my ability) a number of examples of how tidy can be used within your applications. These examples should serve as a guide for your own uses of tidy. Many of the following examples also serve as a reference for some of the more useful tidy configuration options.

Tidy as an Output Buffer

Although not appropriate for every circumstance, tidy also provides an output buffer function that can be used with PHP's output buffering capabilities. This feature can be enabled by default by setting the tidy.clean_output php.ini configuration directive or by registering the ob_tidyhandler() function when calling ob_start(). In either case, tidy will parse, clean, and repair all output passed through the PHP output buffer using either the default configuration or that specified by the tidy.default_config directive.
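
As a minimal sketch of the second approach, the following registers ob_tidyhandler() for a single script. The outer buffer and the $captured variable are assumptions added purely so that the result can be inspected; a real page would normally need only the ob_start('ob_tidyhandler') call.

```php
<?php
// Minimal sketch: the outer buffer exists only so the result can be inspected.
ob_start();                   // outer buffer captures whatever tidy returns
ob_start('ob_tidyhandler');   // inner buffer hands its contents to tidy

echo "<html><head><title>Demo</title></head><body>";
echo "<p>An unclosed paragraph";
echo "</body></html>";

ob_end_flush();               // runs the handler, passes its result outward
$captured = ob_get_clean();   // hypothetical variable holding the final markup

echo $captured;
```

It is worth confirming on your own installation that the markup is actually repaired, because handler behavior can differ between PHP versions and configurations.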

CAUTION

Do not turn on the tidy.clean_output configuration directive on websites that output nonmarkup content dynamically! tidy does not check the type of content being processed by PHP; thus, it will attempt to parse and repair anything generated and buffered by PHP, such as dynamically generated images, PDF documents, and so on. Use this option only when you are sure that all the output being generated is HTML or similar.

Converting Documents to CSS

In the modern HTML specification, the <FONT> tag is considered a deprecated method of specifying the font, size, and color of text within the document. Instead, cascading style sheets (CSS) should be used to specify these layout details. Tidy can strip a document of these deprecated tags and replace them with an embedded style sheet. To take advantage of this feature, set the clean tidy configuration option to true. Consider the following HTML input, which is processed by the code in Listing 15.8:

<!-- clean.html //-->
<HTML>
    <HEAD><TITLE></TITLE></HEAD>
<BODY>
        <FONT COLOR="red">Hello, World!</FONT><BR/>
        <B><FONT SIZE=4 COLOR=#c0c0c0>More Text...</FONT></B>
</BODY>
</HTML>

Listing 15.8. Replacing `<FONT>` Tags with CSS

<?php
    $tidy = tidy_parse_file("clean.html", array("clean" => true));
    tidy_clean_repair($tidy);
    echo $tidy;
?>

When executed, the output generated by tidy is as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>

<style type="text/css">
 b.c2 {color: #C0C0C0; font-size: 120%}
 span.c1 {color: red}
</style>

</head>
<body>
<span class="c1">Hello, World!</span><br>
<b class="c2">More Text...</b>
</body>
</html>

NOTE

Currently, this cleaning works only sporadically. Check the bug report http://bugs.php.net/bug.php?id=28841 for updates on this issue.

Reducing Bandwidth Usage

Because tidy parses the entire document before generating any output, it is also useful for reducing the overall size of your HTML documents. This reduction can be maximized (at the cost of readability) by setting a number of configuration options, as shown in Listing 15.9:

Listing 15.9. Reducing Bandwidth Usage Using Tidy

<?php
    $options = array("clean" => true,
              "drop-proprietary-attributes" => true,
              "drop-font-tags" => true,
              "drop-empty-paras" => true,
              "hide-comments" => true,
              "join-classes" => true,
              "join-styles" => true);

    $tidy = tidy_parse_file("http://www.php.net/", $options);
    tidy_clean_repair($tidy);
    echo $tidy;
?>

Although the reduction of the file depends largely on the file itself, for sites that handle a great deal of traffic, even a single kilobyte reduction can mean megabytes a day in saved bandwidth. Furthermore, with this capability, source HTML can be stored in a human-readable format (complete with bandwidth-wasting comments) without the waste.
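
To see what a given set of options actually saves, you can compare byte counts before and after. The following is only a sketch under stated assumptions: the sample markup and the $source and $compact variable names are invented for illustration, and only the hide-comments option from Listing 15.9 is used.

```php
<?php
// Hypothetical sample document with a comment that visitors never see.
$source = "<html>\n<head><title>Demo</title></head>\n<body>\n"
        . "<!-- A long maintainer comment that visitors never see. -->\n"
        . "<p>Hello, World!</p>\n"
        . "</body>\n</html>\n";

// tidy_repair_string() parses, repairs, and returns the markup in one step.
$compact = tidy_repair_string($source, array("hide-comments" => true));

printf("source: %d bytes, tidy output: %d bytes\n",
       strlen($source), strlen($compact));
```

On a document this small, the markup tidy adds (such as a DOCTYPE) may outweigh the savings, so measure against your real pages before relying on any particular figure.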

Beautifying Documents

In the previous section, I showed you how to use tidy to reduce the overall size of your documents. Tidy can also be used to take a document that is difficult to maintain or read and beautify it. An example is shown in Listing 15.10:

Listing 15.10. Beautifying HTML Using Tidy

<?php
      $options = array("indent" => true,       /* Turn on beautification */
                       "indent-spaces" => 4,   /* Spaces per indenting level */
                       "wrap" => 4096);        /* Line length before wrapping */
      $tidy = tidy_parse_file("http://www.php.net/", $options);
      tidy_clean_repair($tidy);
      echo $tidy;
?>

NOTE

When passing configuration options to tidy at runtime, Boolean values must be represented using the PHP Boolean types true or false. However, when representing Boolean values from a tidy configuration file, yes, no, true, or false may be used.

Extracting URLs from a Document

Another application of tidy is a script that extracts all the URLs from a given document (without a single regular expression). This script (really a function, dump_urls()) is found in Listing 15.11.

Listing 15.11. Extracting URLs Using Tidy

<?php
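      function dump_urls(tidyNode $node, &$urls = NULL) {
            /* Start with an empty array on the initial (outermost) call */
            $urls = (is_array($urls)) ? $urls : array();

            if(isset($node->id)) {
                  /* Record the href of anchor tags, skipping anchors without one */
                  if($node->id == TIDY_TAG_A && isset($node->attribute['href'])) {
                        $urls[] = $node->attribute['href'];
                  }
            }

            if($node->hasChildren()) {
                  foreach($node->child as $child) {
                        dump_urls($child, $urls);
                  }
            }

            return $urls;
      }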

      $tidy = tidy_parse_file("http://www.php.net/");
      $urls = dump_urls($tidy->body());
      print_r($urls);
?>

Although this is a relatively small script, its size reflects the power to parse HTML that the tidy extension provides. Beginning at the top of the function, you can see that the dump_urls() function accepts two parameters: $node, the node from which the URL extraction will begin (meaning it and all of its children), and an optional parameter, $urls. This second parameter is not intended to be passed directly, but is instead used internally by the dump_urls() function.

After the function has been called, the dump_urls() function begins by initializing the $urls parameter to ensure that it is an array and starts examining the passed $node object by checking the $id property to see if it is an anchor tag (TIDY_TAG_A). If the node is indeed an anchor, we then store the HREF attribute of that tag (the URL) in the $urls array.

At this point in the function, the current node's URL (if it existed) has been extracted and the function now moves on to examine the child nodes (if any exist). This is the process that takes place in the foreach loop, where each child node is passed to the dump_urls() function again recursively (along with the $urls array containing already found URLs) until all the children have been examined. When the dump_urls() function finally returns to the initial caller, it returns the $urls array filled with all the URLs found.

This script can be easily modified to look for other types of data within HTML documents as well. For instance, by searching for the TIDY_TAG_IMG type instead of TIDY_TAG_A (and looking for the src attribute instead of href), this function will extract links to all the images within the document.
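
A minimal sketch of that modification might look like the following. The dump_images() function name and the sample markup are assumptions made for illustration, and the type hint assumes a PHP version in which the node class is named tidyNode. The document is parsed from a string so that the example does not depend on a network connection.

```php
<?php
// Hypothetical variant of dump_urls() that collects image sources instead.
function dump_images(tidyNode $node, &$images = NULL) {
    // Start with an empty array on the initial (outermost) call.
    $images = is_array($images) ? $images : array();

    // Only element nodes carry an id; skip <img> tags without a src.
    if (isset($node->id) && $node->id == TIDY_TAG_IMG
            && isset($node->attribute['src'])) {
        $images[] = $node->attribute['src'];
    }

    // Recurse into the children, sharing the same result array.
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            dump_images($child, $images);
        }
    }

    return $images;
}

// Invented sample markup with two images and one ordinary link.
$html = '<html><head><title>Demo</title></head><body>'
      . '<p><img src="logo.png" alt="Logo"> <a href="/about">About</a></p>'
      . '<p><img src="banner.jpg" alt="Banner"></p>'
      . '</body></html>';

$tidy = tidy_parse_string($html);
$images = dump_images($tidy->body());
print_r($images);
```

Because only the tag constant and the attribute name differ from Listing 15.11, the same pattern should carry over to other tags and attributes.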