Basic Tidy Usage

Parsing Input and Retrieving Output

To begin using tidy in PHP, let's take a look at the most basic usage of the extension. When you use tidy, all actions begin with the parsing of an input document. The primary function to accomplish this task is the tidy_parse_file() function whose syntax is as follows:

tidy_parse_file($document [, $options [, $encoding
                [, $use_include_path]]]);

$document is the file to process (local or remote). The remaining optional parameters are discussed later in the chapter, except for the $use_include_path parameter. This optional parameter is a Boolean indicating whether PHP should search the PHP include path if the document specified by $document was not initially found.

When this function is executed against a document (such as an HTML file) a number of things occur. First, the document is read into memory by tidy and the contents are parsed. During this process, tidy identifies the type of document being parsed (HTML 3.2, HTML 4.0, XHTML 1.0, for example) and corrects invalid syntax for the standard. That is, tag attributes without quotes are quoted, tags in the wrong order are corrected, and so on. To accomplish this, tidy uses a complex intelligent parsing system that attempts to correct any errors without changing the way the document will be interpreted. When parsing of the document is complete, tidy_parse_file() returns a resource representing the document in memory, which can be used with the other tidy functions.
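As a minimal sketch of the call described above, the following parses a local file and checks for failure before using the result. The file name example.html and the helper function parse_local_file() are hypothetical and exist only for illustration; the sketch also assumes the tidy extension is loaded and relies on tidy_parse_file() returning false when the file cannot be read.

```php
<?php
/* Hypothetical helper: parse a file, returning the tidy document or null on failure */
function parse_local_file($path)
{
    /* The @ operator suppresses the warning raised for an unreadable file */
    $tidy = @tidy_parse_file($path);
    if ($tidy === false) {
        return null;
    }
    return $tidy;
}

$doc = parse_local_file("example.html");   /* hypothetical file name */
if ($doc === null) {
    echo "Could not parse example.html\n";
} else {
    /* Output the document as corrected by tidy */
    echo tidy_get_output($doc);
}
```

Checking the return value matters most with remote documents, where a failed request would otherwise surface only when the result is used.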

NOTE

By default, tidy treats all input documents as if they are complete, standalone documents. This means that document fragments (such as the string "<B>this is a HTML fragment</B>"), after parsing by tidy, will include all the necessary tags to make a valid HTML document, including the <HTML>, <HEAD>, <TITLE>, and <BODY> tags. If you would like to retrieve only a corrected version of the input fragment, see the "Tidy Configuration Options" section in this chapter (specifically the show-body-only option).
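The difference can be sketched with the show-body-only option mentioned in this note. The example assumes the option is passed as an array in the $options parameter, which is covered in detail later in the chapter; the input string is the fragment quoted above.

```php
<?php
/* Ask tidy to emit only the contents of <body> rather than a full document */
$options = array("show-body-only" => true);

$tidy = tidy_parse_string("<B>this is a HTML fragment</B>", $options);
tidy_clean_repair($tidy);

/* Retrieve the repaired fragment without the surrounding document tags */
$fragment = tidy_get_output($tidy);
echo $fragment;
```

Without the option, the same call would produce a complete document wrapped in the <HTML>, <HEAD>, and <BODY> tags.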

Along with the tidy_parse_file() function, a similar function, tidy_parse_string(), is also available with the following syntax:

tidy_parse_string($data [, $options [, $encoding]]);

$data is a string containing the markup to parse. As was the case earlier, the optional parameters $options and $encoding are discussed later in the chapter. When executed, the tidy_parse_string() function returns a tidy resource representing the document.

After the document has been parsed, it can immediately be retrieved as a string using one of two methods. The first method is to use the tidy_get_output() function with the following syntax:

tidy_get_output($tidy);

$tidy is a valid tidy document resource. Alternatively, the $tidy resource itself is designed to be treated as a string, allowing you to echo its contents directly. An example of this behavior is shown in Listing 15.1:

Listing 15.1. Retrieving a Document from Tidy

<?php
     /* Parse a string */
     $tidy = tidy_parse_string("<B>This is a string</B>");
     /* Get the document as modified by tidy using tidy_get_output() */
     $data = tidy_get_output($tidy);

     /*
     The $tidy resource can be passed directly to echo
     to output the contents

     The following will output the contents of the modified document
     */
     echo $tidy;
?>

Listing 15.2 demonstrates using tidy to parse a remote HTML document using the tidy_parse_file() function.

Listing 15.2. Using the tidy_parse_file() Function

<?php
    $remote_tidy = tidy_parse_file("http://www.coggeshall.org/");
    echo $remote_tidy;
?>

Cleaning and Repairing Documents

After a document has been parsed, you can be sure that it is syntactically valid; however, that does not mean the document has been brought up to Web standards. For instance, a valid HTML 3.2 document requires a DOCTYPE declaration (among other things) to be standards compliant. To complete the transition from the input to standards-compliant output, we must introduce another function, tidy_clean_repair():

tidy_clean_repair($tidy);

$tidy is a valid tidy resource. When executed, tidy attempts to bring the provided document up to Web standards based on the current tidy configuration. An example of its use, this time with a deliberately malformed input string, is shown in Listing 15.3.

Listing 15.3. Using tidy_clean_repair()

<?php
    /* Parse a malformed HTML string */
    $tidy = tidy_parse_string("<B>This is a simple,</I> but
                               <B>malformed</B> <U>HTML Document<U>");
    tidy_clean_repair($tidy);
    echo $tidy;
?>

When the code in Listing 15.3 is executed, the result is a standards-compliant HTML 3.2 document as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<b>This is a simple,</b> but <b>malformed</b> <u>HTML Document</u>
</body>
</html>

Identifying Problems Within Documents

When a document is processed by tidy, the extension creates a log of potential (and corrected) problems in the input. This log (called the error buffer) can be retrieved at any point by executing the tidy_get_error_buffer() function.

tidy_get_error_buffer($tidy);

$tidy is a valid tidy document resource. When executed, this function returns a string containing a log of all the warnings and errors that occurred during the processing of the document, with entries separated by a newline (\n) character. An example of a tidy log follows:

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 1 column 1 - Warning: replacing unexpected i by </i>
line 1 column 43 - Warning: <u> is probably intended as </u>
line 1 column 1 - Warning: inserting missing 'title' element

As you can see, tidy has identified four potential problems with the given input. Each problem is identified by line number and column (relative to the original document).

Along with syntax and standards-related errors, the tidy extension also informs you of potential accessibility issues (such as the omission of an ALT attribute in an <IMG> tag).

Three further functions let you determine how many problems of each type occurred without processing the error buffer directly:

tidy_error_count($tidy);
tidy_warning_count($tidy);
tidy_access_count($tidy);

In all three cases, $tidy is a valid tidy document resource. When executed, these functions return an integer representing the number of errors of the specified type encountered for this document. For example, the number of accessibility warnings that exist can be determined by calling the tidy_access_count() function.
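The counting functions and the error buffer can be combined as in the following sketch, which reuses the malformed input from Listing 15.3. It assumes the tidy extension is loaded; the actual counts and log entries depend on the tidy version and configuration, so none are shown here.

```php
<?php
$tidy = tidy_parse_string("<B>This is a simple,</I> but
                           <B>malformed</B> <U>HTML Document<U>");
tidy_clean_repair($tidy);

/* Summarize the problems without examining the log text */
printf("Errors: %d, warnings: %d, accessibility: %d\n",
       tidy_error_count($tidy),
       tidy_warning_count($tidy),
       tidy_access_count($tidy));

/* Split the error buffer into individual log entries, one per line */
$entries = explode("\n", trim((string) tidy_get_error_buffer($tidy)));
foreach ($entries as $entry) {
    echo trim($entry), "\n";
}
```

Using the counts first is a cheap way to decide whether the full log is worth retrieving and displaying.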