Optimizing Your PHP Scripts

When writing Web-based applications, the efficiency of your code must always be a concern. After all, the goal of most websites is to attract as many visitors as possible. Although PHP isn't particularly slow as a scripting language, any substantial amount of traffic can bring your website to its knees if you have not given appropriate consideration to the efficiency of your scripts.

As with debugging, there are many different tools and techniques to use, each useful in its own right. Thus, the best any single chapter on this topic can do is teach you the fundamental general-purpose techniques for making your scripts as lean and efficient as possible.

The Secret to Finding Optimizations: Profiling

When optimization is discussed, regardless of language, the single most important task is to determine where the bottlenecks in your application are, a process called profiling. After all, if you do not know exactly what is slowing your application down, there is no way to fix the problem.

Although many professional-quality tools exist to assist you in profiling your applications, the techniques employed in optimization often don't require them. For all but the most heavily-used applications, profiling can be accomplished using nothing more than a simple PHP script. For your convenience, I have written such a script that is designed to be used as a template for your basic profiling needs (see Listing 10.6). This script will be used to profile all the optimization techniques I will be discussing in this chapter.

Listing 10.6. An Effective Basic PHP Profiler

<?php
    set_time_limit(0);
    class simple_profiler {

        private $start_time;

        private function get_time() {
            list($usec, $seconds) = explode(" ", microtime());
            return ((float)$usec + (float)$seconds);
        }

        function start_timer() {
            $this->start_time = $this->get_time();
        }

        function end_timer() {
            return ($this->get_time() - $this->start_time);
        }
    }
    $timer = new simple_profiler();

    /******************************************
     * Insert untimed initialization code here
     ******************************************/

    $timer->start_timer();
        /*********************************
         * Insert code for Method #1 here
         *********************************/
    $old_time = $timer->end_timer();

    $timer->start_timer();
        /*********************************
         * Insert code for Method #2 here
         *********************************/
    $new_time = $timer->end_timer();

    echo "Method one took $old_time seconds.\n";
    echo "Method two took $new_time seconds.\n\n";

    if($old_time > $new_time) {

        $percent = number_format(100 - (($new_time / $old_time) * 100), 2);
        echo "Method two was faster than Method one by $percent%<BR/>\n";

    } else {

        $percent = number_format(100 - (($old_time / $new_time) * 100), 2);
        echo "Method one was faster than Method two by $percent%<BR/>\n";

    }


?>

In Listing 10.6, I have defined a simple class suitable for most profiling needs, simple_profiler. Functionally, this profiler is nothing more than a fairly accurate clock that can be used to measure how long a particular segment of PHP code takes to execute. The remainder of this script serves as the template, which can be used to compare two different techniques used to accomplish the same task to determine the faster method.

In use, this template script has three segments of importance to us (identified by the comment placeholders). The first is the initialization segment, which provides a place to set up the portions of your script that you are not concerned with profiling. This segment is particularly useful for creating dummy data (when profiling data processing), including files, and so on. The second and third segments serve as placeholders for the actual code being profiled. There is no difference between the two segments, other than that each should contain a different method of accomplishing the same task.

When this script is executed, it records the time taken to execute each of the two methods and then compares them to determine the faster one. To give us an idea of exactly how much more efficient one method is than the other, the script reports, along with the execution times, the percentage by which the faster method improves on the slower one.
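As a rough sketch of how the template's timing and percentage logic might be packaged for reuse, the following helpers time a piece of code over many runs and compute the improvement. The names time_callable() and percent_faster() are hypothetical (they do not appear in Listing 10.6), and the code assumes PHP 5.3 or later for closures:

```php
<?php
// Hypothetical helper (not part of Listing 10.6): run $code $iterations
// times and return the total elapsed time in seconds.
function time_callable($code, $iterations = 100) {
    $start = microtime(true);            // current time as a float, in seconds
    for ($i = 0; $i < $iterations; $i++) {
        call_user_func($code);
    }
    return microtime(true) - $start;
}

// Same percentage formula the template uses to report its result.
function percent_faster($slow_time, $fast_time) {
    return 100 - (($fast_time / $slow_time) * 100);
}

$old_time = time_callable(function () { return strrev(str_repeat('x', 1000)); });
$new_time = time_callable(function () { return str_repeat('x', 1000); });

echo "Method one took $old_time seconds.\n";
echo "Method two took $new_time seconds.\n";
```

Running each method many times, as done here, helps when a single execution is too short to measure reliably.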

For the remainder of this section of the chapter I will not repeat the profiling code found in Listing 10.6.

Common PHP Bottlenecks and Solutions

In Web development, many different bottlenecks can exist for a given website. Following are a few of the more common bottlenecks encountered:

Processor (CPU)
Memory (RAM)
Bandwidth
Storage (hard disk)

Dealing with these bottlenecks to achieve the best performance from your Web applications is by no means an easy task. As will become clear later in the discussion, relieving one bottleneck often is done at the cost of increasing bottlenecks elsewhere. For instance, almost all optimizations that use less of your processor's resources do so at the cost of additional RAM or hard disk space.

It is because of this space-time trade-off (to borrow a term from computer science) that optimizations must be done on a case-by-case basis and with a firm understanding of the resource utilization of the application.

When dealing specifically with PHP, developers make a number of common mistakes that lead to inefficient programs or unnecessary resource bottlenecks. Sometimes these mistakes can be nothing more than a single line of code; other times, they can be slightly more complex. Compiled in this chapter are some of the more common optimization-related mistakes made by developers and possible solutions to them.

NOTE

Many factors contribute to how a particular script will perform. It is important to note that any time measurement taken generally carries a standard deviation of about ±5%.

Regular Expressions

One of the most common optimization mistakes made by PHP developers is the overuse or misuse of regular expressions in their PHP scripts. Compared to other text-manipulation operations, regular expressions are among the most costly operations that can be performed. Thus, any use of regular expressions should be done with great care. To illustrate this, consider searching 10,000 random strings for any run of three consecutive characters in the range "a" through "g" (that is, agb, bbb, cab, and so on).
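The test data for a comparison like this belongs in the untimed initialization segment of the profiling template. The following is only a sketch of how such data might be produced: the chapter does not state the length or alphabet of the random strings, so the 20-character lowercase strings and the make_random_strings() name are assumptions.

```php
<?php
// Hypothetical generator; string length and alphabet are assumed values.
function make_random_strings($count, $length) {
    $alphabet = 'abcdefghijklmnopqrstuvwxyz';
    $max = strlen($alphabet) - 1;        // computed once, outside the loops
    $strings = array();
    for ($i = 0; $i < $count; $i++) {
        $s = '';
        for ($j = 0; $j < $length; $j++) {
            $s .= $alphabet[mt_rand(0, $max)];
        }
        $strings[] = $s;
    }
    return $strings;
}

$strings = make_random_strings(10000, 20);
$found = 0;   // counter incremented by the search loops
```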

I will compare the two types of regular expression solutions provided by the ereg() and preg_match() functions. Let's start with the code required to solve the problem using ereg():

for($i = 0; $i < 10000; $i++) {
    if(ereg(".*[abcdefg]{3}.*", $strings[$i])) {
        $found++;
    }
}

Similarly, here is the solution to the problem using the preg_match() function instead:

for($i = 0; $i < 10000; $i++) {
    if(preg_match("/.*[abcdefg]{3}.*/", $strings[$i])) {
        $found++;
    }
}

When profiling these two methods against each other (the ereg() method is #1 and preg_match is #2), here is how they measured up:

Method one took 0.21848797798157 seconds.
Method two took 0.15077900886536 seconds.

Method two was faster than Method one by 30.99%

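As you can see, method #2 (preg_match()) was approximately 30% faster than the comparable ereg() method. In general, you will find that the preg_* functions are faster in text processing than their ereg() counterparts. Note also that the ereg family was deprecated in PHP 5.3 and removed in PHP 7.0, so the preg_* functions are the only option in current versions of PHP.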

Although regular expressions are sometimes the only reasonable method of parsing and processing text, solutions that avoid regular expressions are often considerably faster than either regular expression flavor. This is particularly true when attempting to find or replace string constants. To illustrate this, let's look at the profiles of preg_match() and strstr() when counting the number of strings that contain the substring jjj.

For method one, we'll use a regular expression and the preg_match() function similar to that found in the previous example:

for($i = 0; $i < 10000; $i++) {
    if(preg_match("/.*jjj.*/i", $strings[$i])) {
        $found++;
    }
}

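For method two, we'll use the strstr() function (which finds a constant substring within a string). Note that strstr() is case sensitive, whereas the /i modifier in method one makes the regular expression case insensitive; stristr() is the case-insensitive equivalent: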

for($i = 0; $i < 10000; $i++) {
    if(strstr($strings[$i], "jjj")) {
        $found++;
    }
}

Profiling these two methods, we find the following:

Method one took 0.11128091812134 seconds.
Method two took 0.05986499786377 seconds.

Method two was faster than Method one by 46.20%

Obviously, with a 46% performance increase against the faster of the two regular expressions, it is strongly recommended that the standard PHP string manipulation functions be used whenever possible.
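When only a yes-or-no answer is needed, strpos() is another standard string function that fits; the PHP manual recommends it over strstr() for simple containment checks. The sketch below was not part of the profiling runs above, and count_containing() is a hypothetical name used for illustration:

```php
<?php
// Count the strings that contain a fixed substring, without a regular expression.
function count_containing(array $strings, $needle) {
    $found = 0;
    foreach ($strings as $string) {
        // strpos() returns the match offset (which may be 0) or false,
        // so the result must be compared with !== false.
        if (strpos($string, $needle) !== false) {
            $found++;
        }
    }
    return $found;
}

$sample = array('xxjjjxx', 'jjjabc', 'abc', 'JJJ');
echo count_containing($sample, 'jjj'), "\n";   // prints 2 (the match is case sensitive)
```

For a case-insensitive check, stripos() can be substituted for strpos().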

Invariant Loop Optimization

Looping in any programming language is an absolutely fundamental tool. However, this same technique that makes our lives so much easier can also result in substantially slower code. For this illustration, consider a script that takes a string and creates a new shuffled version of that string. One solution to this problem is as follows:

$shuffled = array();

/* Loop once per character so the new string has the same length as the original */
for($i = 0; $i < strlen($string); $i++) {
    $shuffled[] = $string[rand(0, (strlen($string)-1))];
}
$new_string = implode($shuffled);

Notice that in this solution the strlen() function is called for every iteration of the for loop. In this case, the value returned from strlen() is constant for this loop (invariant) and needs to be calculated only once. Method #2 removes the strlen() calculation from the loop itself by calculating the value once and storing the result in a variable:

$str_len = strlen($string) - 1;   /* index of the last character */
$shuffled = array();

/* <= so the new string has the same length as the original */
for($i = 0; $i <= $str_len; $i++) {
    $shuffled[] = $string[rand(0, $str_len)];
}
$new_string = implode($shuffled);

When profiling these two methods against each other, we find the following results:

NOTE

For this particular code snippet, it is notable that the amount of time taken to execute the segment was so small that profiling information was inaccurate. To provide more accurate profiling information, both methods were executed 100 times.

Method one took 0.04446005821228 seconds.
Method two took 0.035489916801453 seconds.

Method two was faster than Method one by 20.18%

As you can see, something as simple as removing an invariant function call from a loop can provide a 20% increase. More importantly, failing to recognize and remove these invariants from loops can substantially slow down your applications (especially if you do it in many different places).

In this case, the invariant value was a call to the strlen() function. However, any nonscalar value used within a loop is a potential candidate for optimization. A very common example is a loop whose bound is an array element, such as the following (assume all variables are defined appropriately):

$myarray['myvalue'] = 1000000;
for($i = 0; $i < $myarray['myvalue']; $i++) {
    $count++;
}

Although a function is not called, every access to the $myarray array requires a hash-table lookup internally within the engine. This is substantially slower than the access time required for scalar values:

$myarray['myvalue'] = 1000000;
$myscalar = $myarray['myvalue'];
for($i = 0; $i < $myscalar; $i++) {
    $count++;
}

The profiling results of these two methods provide the following information:

Method one took 3.676020026207 seconds.
Method two took 2.6184829473495 seconds.

Method two was faster than Method one by 28.77%

As you can see, a near 30% performance increase is achieved by assigning an invariant array value to a scalar when in loops (even more dramatic than our original strlen() optimization).
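A common way to apply the same idea to a function call in the loop condition is to compute the value in the initializer of the for statement. The following sketch uses assumed sample data and hoists a count() call; it was not part of the profiling runs above:

```php
<?php
$items = range(1, 1000);   // sample data, assumed for illustration
$total = 0;

// $n is computed once in the initializer instead of on every pass
for ($i = 0, $n = count($items); $i < $n; $i++) {
    $total += $items[$i];
}

echo $total, "\n";   // prints 500500
```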

Output Optimizations

Thus far, we have discussed only optimizations that pertain specifically to CPU usage. The topic of output optimization, however, pertains not only to CPU usage but to bandwidth usage as well.

For bandwidth usage, the rule is quite obvious: The more output you have, the more bandwidth you will use. This is bad in many respects, such as slower pages, higher costs, and so on. Although reducing the amount of output your scripts require depends largely on the application, a number of things can be done regardless of application to reduce the bandwidth requirements, such as:

Storing client-side code (JavaScript, style sheets) in a separate file that is included on every page.
Taking advantage of the properties of HTML tags to avoid unnecessarily duplicating attributes.
Removing all unnecessary whitespace from output.
Compressing output before sending it to the client.

Again, it may seem that some of these optimizations are trivial (such as the removal of whitespace). However, consider a site that receives 200,000 hits a month and saves 300 bytes per hit by removing whitespace from its HTML documents. This simple optimization will save 60,000,000 bytes a month and 720,000,000 bytes a year in bandwidth. The improvement can be even more substantial when common cacheable items such as JavaScript code or style sheets are stored in separate files. More importantly, these simple optimizations tend to yield a website that is not only faster but also cheaper to run.
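The arithmetic behind these figures is easy to adapt to your own traffic. This short sketch simply recomputes the numbers quoted above; the variable names are illustrative:

```php
<?php
$hits_per_month      = 200000;
$bytes_saved_per_hit = 300;

$saved_per_month = $hits_per_month * $bytes_saved_per_hit;
$saved_per_year  = $saved_per_month * 12;

echo number_format($saved_per_month), " bytes saved per month\n";   // 60,000,000
echo number_format($saved_per_year), " bytes saved per year\n";     // 720,000,000
```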

From a PHP perspective, documents can be optimized through the use of output buffering and the zlib compression filter. HTML documents can be compressed prior to being sent to the browser. Although this does put an additional strain on the CPU, for sites that have limited bandwidth, the trade-off may be reasonable. Depending on the document, upward of 80% of the normal bandwidth can be saved by compressing the document prior to sending it to the client.
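To get a feel for the potential savings, the following sketch compresses a block of sample markup with gzencode() and reports the difference in size. It assumes the zlib extension is available, and the markup is invented for illustration; actual savings depend entirely on the document.

```php
<?php
// Sample markup, assumed for illustration; repetitive HTML compresses well.
$html = '<ul>' . str_repeat("<li class=\"book\">A book title</li>\n", 200) . '</ul>';

$compressed = gzencode($html, 6);   // gzip format, compression level 6
$saved = 100 - (strlen($compressed) / strlen($html) * 100);

printf("Original: %d bytes, compressed: %d bytes (%.1f%% saved)\n",
       strlen($html), strlen($compressed), $saved);

// In a real page, output buffering with the zlib handler performs the
// compression automatically for browsers that accept it:
// ob_start('ob_gzhandler');   // call before any output is sent
```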

Caching and PHP

Throughout computing, the technique of caching has proven itself as a viable method of increasing the efficiency of computer programs. In fact, not only has caching been a viable method, it has been an extremely effective one. Websites, by their very nature, lend themselves quite nicely to the caching model, which is effective only when multiple requests for the same information are made.

Consider a website that sells books for an example of how caching can improve performance. On this website is a complete catalogue of all the books that can be purchased, each on its own page with the details of the book in question. As you would expect, the basic implementation of this book catalogue is a script that executes the following operations:

Do any initialization, session management, etc.
Determine the book the user requested to view
Retrieve the relevant information about the book from the database
Create the HTML document for the book and output

Assuming the online bookstore is successful, chances are that any given page containing a particular book's details is being viewed quite frequently by potential consumers. However, we must ask: how often is the content of a book's detail page actually updated? Generally speaking, a book doesn't change much after it has been published, and thus chances are that the Web server is wasting all sorts of resources regenerating the same content for every request.

This extremely common situation in Web development is exactly when caching can have remarkable effects at reducing the unnecessary waste of server resources. Using caching, the same book-generating script would operate something like the following:

Do any initialization, session management, and so on.
Determine the book the user requested to view.
Check to see whether the cache has the required HTML for the request and output the cached HTML if it does.
If the cache entry doesn't exist, retrieve the relevant information about the book from the database.
Create the HTML document for the book.
Cache the output for future use and output.

By using caching in a situation like this, notice that the two most expensive steps of every request (the retrieval of the data from the database and the output generation) have in most cases been removed. Instead, the script will generate the content once and save it for future use, updating it only after the cached copy has expired. The performance that can be gained from this technique can be staggering (sometimes upwards of 88% faster).
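Before turning to a library, it may help to see the cached workflow above as a minimal file-based sketch. The cached_page() function, the file naming scheme, and the stand-in page generator are all hypothetical; a production cache would also need to handle concurrent writers and cleanup of stale files.

```php
<?php
// Hypothetical helper: return cached HTML for $cache_id, regenerating it
// with $generate when the cache file is missing or older than $ttl seconds.
function cached_page($cache_dir, $cache_id, $ttl, $generate) {
    $file = $cache_dir . '/page_' . md5($cache_id) . '.html';

    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);        // cache hit: skip the expensive steps
    }

    $html = call_user_func($generate);          // database query and HTML generation
    file_put_contents($file, $html, LOCK_EX);   // save for future requests
    return $html;
}

$calls = 0;
$build_book_page = function () use (&$calls) {
    $calls++;   // stands in for the expensive work
    return '<h1>Example Book</h1>';
};

$dir = sys_get_temp_dir();
$id  = 'book-' . uniqid();   // unique ID so the demonstration starts with a cold cache

$first  = cached_page($dir, $id, 60, $build_book_page);
$second = cached_page($dir, $id, 60, $build_book_page);

echo $calls, "\n";   // prints 1: the second request was served from the cache
```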

In PHP, caching has been made extremely simple thanks to the PEAR Cache library. Through the use of this library, not only can the entire output of a page be cached, but so can individual components such as function calls and database requests. You can even extend the functionality to create your own custom caches. To start, you will need PEAR Cache installed on your system. This can be done in one of two ways: either download the package from the PEAR website (http://pear.php.net/) or use the pear command:

[user@localhost]# pear install Cache

After installation, place the directory where you installed the Cache library into your include path. That's it!

Caching Entire Documents

After PEAR Cache is installed, using it to cache the output of your pages is incredibly easy. Listing 10.7 outlines the basic skeleton of a page cached using the PEAR output cache:

Listing 10.7. Using PEAR Output Caching

<?php
     require_once("Cache/Output.php");
     $cache = new Cache_Output('file', array('cache_dir' => '.'));

     /* Base the cache ID on the url, get variables and cookies */
     $key_params = array('url' => $_SERVER['REQUEST_URI'],
                    'get' => $_GET,
                    'cookies' => $_COOKIE);
     $cache_id = $cache->generateID($key_params);

     if($content = $cache->start($cache_id)) {
          echo $content;
          exit();
     }

     /* Generate content for page here */

     echo $cache->end();
?>

As you can see, it does not take a great deal of code to cache the output of your documents. To begin, the PEAR Cache must be loaded (generally Cache/Output.php by default) and an instance of an output cache object must be created. The constructor for this class accepts two parameters: the first indicates how the cached output should be stored (called the container), and the second is an array of parameters to pass to that container. Although for the purposes of our discussion I will be using only the file container to store cached data, the PEAR Cache supports a large variety of containers, including databases (db) and shared memory (shm). For usage of these containers, consult the PEAR Cache documentation on the PEAR website (http://pear.php.net/).

One thing that must always be passed to a new Cache object is a unique identifier for the particular data being cached. For this purpose the PEAR Cache provides a generateID() method that accepts an array of variables to construct a unique identifier from. In the case of caching dynamically generated HTML, often this key is (as shown) generated based on passed GET or POST parameters, cookie values, and the URL requested. The important thing when generating a key is that all input variables that determine the page to be displayed are included in the key. Failing to do so will result in cached pages that do not represent the appropriate page on your site being returned.
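To make the idea of a key concrete, here is one way an identifier could be derived from the inputs that determine a page. This is a hypothetical stand-in written for illustration, not the code PEAR Cache uses internally, and make_cache_id() is an invented name:

```php
<?php
// Hypothetical stand-in for a cache ID generator; this is not PEAR's code.
function make_cache_id(array $key_params) {
    ksort($key_params);                    // same inputs in any order give the same ID
    return md5(serialize($key_params));    // fixed-length identifier
}

$a = make_cache_id(array('url' => '/book.php', 'get' => array('id' => '42')));
$b = make_cache_id(array('get' => array('id' => '42'), 'url' => '/book.php'));
$c = make_cache_id(array('url' => '/book.php', 'get' => array('id' => '43')));

var_dump($a === $b);   // bool(true): same inputs, same cache entry
var_dump($a === $c);   // bool(false): a different book gets its own entry
```

If an input that changes the page (a cookie, for example) were left out of the array, two different pages would share one identifier, which is exactly the failure described above.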

To actually begin the process of using the Cache after it has been created, use the start() method. This method accepts two parameters: the first is the cache id generated by the generateID() method and the optional second parameter is a string representing the "group" to cache the data in. This parameter allows you to group large amounts of cached data into separate groups, reducing the amount of time required to find any particular cache ID. When the start() method is executed, it looks to see if any data is associated with the provided cache ID (in the optional group). If any is found, it is returned as a string and is ready to be displayed to the browser.

In the event the cache does not contain data for the given cache ID (or it is expired), start() will return an empty string and begin an output buffer to capture the generated output. After the output has been generated, the end() method must be called to actually display the generated content and save the content to the cache.

The end() method accepts an optional parameter: the amount of time, in seconds, before the cached version of the content expires. The result to the end user is that the page is displayed as usual; however, from an optimization standpoint the benefits are indisputable.
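The capture step between start() and end() relies on output buffering. The sketch below shows that mechanism in isolation using PHP's own ob_start() and ob_get_clean() functions; capture_output() is a hypothetical helper, and the code makes no claim about how PEAR Cache is implemented internally.

```php
<?php
// Hypothetical helper showing the buffering mechanism only.
function capture_output($render) {
    ob_start();                  // from here on, echoed text goes into a buffer
    call_user_func($render);
    return ob_get_clean();       // return the buffered text and discard the buffer
}

$html = capture_output(function () {
    echo "<h1>Example Book</h1>";
});

// $html can now be stored in a cache and then sent to the client.
echo $html;
```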

NOTE

To appreciate the power of the PEAR Cache, you need only recognize that Listing 10.8 (in the next section) is a relatively slow operation to perform. If you are interested in dynamic image generation, however, please refer to Chapter 27, "Working with Images."

Caching Function Calls

Beyond caching entire dynamically generated HTML documents, the PEAR cache also has facilities to cache smaller portions of your PHP scripts, such as the results from particularly expensive function calls. For instance, consider the following function, which uses the GD library to generate an image (Listing 10.8):

Listing 10.8. Image Generation Function Example

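<?php

     define('FONT', 4);
     function make_image_word($wordfile, $width, $height) {

          /* Pick a random word from the word list */
          $words = array_flip(file($wordfile));
          $word = trim(array_rand($words));

          $img = imagecreate($width, $height);

          $black = imagecolorallocate($img, 0x00, 0x00, 0x00);
          $white = imagecolorallocate($img, 0xFF, 0xFF, 0xFF);
          imagefill($img, 0, 0, $white);

          /* Draw 20 random ellipses as background noise */
          for($i = 0; $i < 20; $i++) {

               $start_x = rand(0, $width);
               $start_y = rand(0, $height);
               $c_width   = rand(0, (int)($width / 2));
               $c_height = rand(0, (int)($height / 2));
               $color   = imagecolorallocate($img, rand(0x00, 0xFF),
                                         rand(0x00, 0xFF), rand(0x00, 0xFF));

               imageellipse($img, $start_x, $start_y,
                                        $c_width, $c_height, $color);

          }

          /* Assumed completion: the listing as printed never drew the word
             or defined $data. Draw the word, capture the image as PNG data,
             and return both in an array. */
          imagestring($img, FONT, 5, 5, $word, $black);

          ob_start();
          imagepng($img);
          $data = array('word' => $word, 'image' => ob_get_clean());

          return $data;

     }

     $data = make_image_word("/usr/share/dict/words", 100, 50);
?>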

This function is a particularly expensive operation that returns an array containing a word and an image of that word, suitable for preventing autoregistration by Web bots. To cache the data generated by this function, we'll use the PEAR function cache as shown in Listing 10.9 (assume the function is still defined):

Listing 10.9. Caching Function Calls Using PEAR

 <?php
     require_once('Cache/Function.php');
     define('CACHE_EXPIRE', 30);
     $cache = new Cache_Function('file',
                                    array('cache_dir'       => '.',
                                          'filename_prefix' => 'cache_'),
                                    CACHE_EXPIRE);
     $data = $cache->call('make_image_word', '/usr/share/dict/words', 100, 50);
?>

As you can see, caching the results of a function call is a very straightforward process. Unlike the output cache example in Listing 10.7, the Cache_Function class requires not only a container and its parameters but also a third parameter representing the time in seconds to cache the function result. Furthermore, unlike the output cache, function caches do not have any sort of unique cache identifier associated with them. Rather, the function name and its parameters are automatically used. Calling the function through the PEAR function cache is done using the call() method of the Cache_Function class, where the first parameter is the name of the function and each following parameter is passed to the function being called.
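The idea of keying a cache on the function name and its arguments can be sketched in a few lines. The following in-memory version is a hypothetical illustration only (PEAR's Cache_Function stores results in a container so that they persist between requests), and it assumes PHP 5.6 or later for the variadic parameter:

```php
<?php
// Hypothetical in-memory function cache, keyed on the function name and arguments.
function cached_call(array &$cache, $ttl, $function, ...$args) {
    $key = md5(serialize(array($function, $args)));

    if (isset($cache[$key]) && $cache[$key]['expires'] > time()) {
        return $cache[$key]['result'];          // cache hit: skip the call
    }

    $result = call_user_func_array($function, $args);
    $cache[$key] = array('result' => $result, 'expires' => time() + $ttl);
    return $result;
}

$calls = 0;
function slow_square($n) {
    global $calls;
    $calls++;          // counts how often the real function runs
    usleep(1000);      // stands in for an expensive computation
    return $n * $n;
}

$cache = array();
echo cached_call($cache, 30, 'slow_square', 12), "\n";   // 144, computed
echo cached_call($cache, 30, 'slow_square', 12), "\n";   // 144, served from the cache
echo $calls, "\n";                                       // 1
```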

Now that you have an idea of how caching functions works, let's take a look at the performance gains by profiling the standard function call against the cached version:

Method one took 6.4777460098267 seconds.
Method two took 0.78488004207611 seconds.

Method two was faster than Method one by 87.88%

As is quite clear by the profiling statistics, caching provides staggering gains (nearly 90% faster) for high-cost operations.

NOTE

Many professional and open source tools exist to help you both debug and profile your PHP scripts. Zend's PHP development environment (called ZDE) is an excellent commercial product that does both of these things and much more. For those of you interested in a more open source approach, the Xdebug extension for PHP 5 provides much of the same functionality as the Zend IDE, even if it does so without as nice an interface. See http://pecl.php.net/xdebug for more information regarding the Xdebug extension for PHP.