
Hack 42. Compare Google's Results with Other Search Engines

Compare Google search results with results from other search engines.

True Google fanatics might not like to think so, but there's really more than one search engine. Google's competitors include the likes of Teoma and Yahoo!.

Equally surprising to the average Google fanatic is the fact that Google doesn't index the entire Web. There are, at the time of this writing, over eight billion web pages in the Google index, but that's just a fraction of the Web. You'd be amazed how much nonoverlapping content there is in each search engine. Some queries that bring only a few results on one search engine bring plenty on another search engine.

This hack gives you a program that compares result counts for Google and several other search engines, with an easy way to plug in new search engines that you want to include. In addition to getting the full count for the query itself, this version reruns the query restricted to the .com, .org, and .net top-level domains.

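This hack requires the SOAP::Lite, LWP::Simple (http://search.cpan.org/search?query=LWP%3A%3ASimple), and CGI modules to run, along with a Google API developer's key and a copy of the GoogleSearch.wsdl file.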
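
A quick way to confirm the prerequisites before installing the script is to try loading them. The sketch below is not part of the original hack: check_modules is a hypothetical helper, and the module list is simply taken from the use lines of the script that follows.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return a hash reference mapping each module
# name to 1 (it loads) or 0 (it does not).
sub check_modules {
    my @modules = @_;
    my %status;
    foreach my $module (@modules) {
        # Turn Some::Module into Some/Module.pm so require treats it as a path.
        (my $file = "$module.pm") =~ s{::}{/}g;
        $status{$module} = eval { require $file; 1 } ? 1 : 0;
    }
    return \%status;
}

# Modules named in the script's use lines.
my $status = check_modules(qw(SOAP::Lite LWP::Simple CGI));
foreach my $module (sort keys %$status) {
    printf "%-12s %s\n", $module, $status->{$module} ? "ok" : "MISSING";
}
```

Any module reported as MISSING would need to be installed (from CPAN, for instance) before the CGI script can run.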


2.24.1. The Code

Save the following code as a CGI script ["How to Run the Hacks" in the Preface] named google_compare.cgi in your web site's cgi-bin directory:

#!/usr/local/bin/perl
# google_compare.cgi
# Compares Google results against those of other search engines.

use strict;

use SOAP::Lite;
use LWP::Simple qw(get);
use CGI qw{:standard};

# Your Google API developer's key.
my $google_key = 'insert key here';

# Location of the GoogleSearch WSDL file.
my $google_wsdl = "./GoogleSearch.wsdl";

my $googleSearch = SOAP::Lite->service("file:$google_wsdl");

# Set up our browser output.
print "Content-type: text/html\n\n";
print "<html><title>Google Compare Results</title><body>\n";

# Ask and we shall receive.
my $raw_query = param('query');
unless ($raw_query) {
   print "<h1>No query defined.</h1></body></html>\n\n";
   exit; # If there's no query there's no program.
}

# Spit out the original before we encode, escaping any HTML in it.
print "<h1>Your original query was '", escapeHTML($raw_query), "'.</h1>\n";

# The Google API takes the query as typed; the scraped engines
# need it encoded for use in a URL.
my $query = $raw_query;
$query =~ s/\s/\+/g;   # changing the spaces to + signs
$query =~ s/\"/%22/g;  # changing the quotes to %22

# Create some hashes of queries for various search engines.
# We have four types of queries ("plain", "com", "org", and "net"),
# and three search engines ("Google", "AlltheWeb", and "Altavista").
# Each engine has a name, query, and regular expression used to
# scrape the results.
my $query_hash = {
   plain => {
      Google => { name => "Google", query => $raw_query, },
      AlltheWeb => {
         name   => "AlltheWeb",
         regexp => '<span class="ofSoMany">(.*)</span>',
         query  => "http://www.alltheweb.com/search?cat=web&q=$query",
      },
      Altavista => {
         name   => "Altavista",
         regexp => 'AltaVista found (.*) results',
         query  => "http://www.altavista.com/sites/search/web?q=$query",
      }
   },
   com => {
      Google => { name => "Google", query => "$raw_query site:com", },
      AlltheWeb => {
         name   => "AlltheWeb",
         regexp => '<span class="ofSoMany">(.*)</span>',
         query  => "http://www.alltheweb.com/search?cat=web&q=$query+domain%3Acom",
      },
      Altavista => {
         name   => "Altavista",
         regexp => 'AltaVista found (.*) results',
         query  => "http://www.altavista.com/sites/search/web?q=$query+domain%3Acom",
      }
   },
   org => {
      Google => { name => "Google", query => "$raw_query site:org", },
      AlltheWeb => {
         name   => "AlltheWeb",
         regexp => '<span class="ofSoMany">(.*)</span>',
         query  => "http://www.alltheweb.com/search?cat=web&q=$query+domain%3Aorg",
      },
      Altavista => {
         name   => "Altavista",
         regexp => 'AltaVista found (.*) results',
         query  => "http://www.altavista.com/sites/search/web?q=$query+domain%3Aorg",
      }
   },
   net => {
      Google => { name => "Google", query => "$raw_query site:net", },
      AlltheWeb => {
         name   => "AlltheWeb",
         regexp => '<span class="ofSoMany">(.*)</span>',
         query  => "http://www.alltheweb.com/search?cat=web&q=$query+domain%3Anet",
      },
      Altavista => {
         name   => "Altavista",
         regexp => 'AltaVista found (.*) results',
         query  => "http://www.altavista.com/sites/search/web?q=$query+domain%3Anet",
      }
   }
};

# Now we loop through each of our query types
# under the assumption there's a matching
# hash that contains our engines and string.
foreach my $query_type (keys (%$query_hash)) {
   print "<h2>Results for a '$query_type' search:</h2>\n";

   # Now, loop through each engine we have and get/print the results.
   foreach my $engine (values %{$query_hash->{$query_type}}) {
      my $results_count;

      # If this is Google, we use the API and not port 80.
      if ($engine->{name} eq "Google") {
         my $result = $googleSearch->doGoogleSearch(
             $google_key, $engine->{query}, 0, 1,
             "false", "", "false", "", "latin1", "latin1");
         $results_count = $result->{estimatedTotalResultsCount};

         # The Google API doesn't format numbers with commas, so
         # reverse the digits, add a comma after every third one,
         # and reverse them back.
         my $rresults_count = reverse $results_count;
         $rresults_count =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
         $results_count = scalar reverse $rresults_count;
      }

      # It's not Google, so we GET like everyone else.
      else {
         my $data = get($engine->{query});
         print "ERROR: could not fetch results from $engine->{name}.<br />\n"
            unless defined $data;

         # Only trust $1 if the pattern actually matched.
         if (defined $data && $data =~ /$engine->{regexp}/) {
            $results_count = $1;
         }
         else {
            $results_count = 0;
         }
      }

      # and print out the results.
      print "<strong>$engine->{name}</strong>: $results_count<br />\n";
   }
}

print "</body></html>\n";
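
To plug in another engine, each query type needs one more record with a name, a query URL, and a regular expression that captures the count. A minimal sketch of generating those records follows. The make_engine helper, the ExampleSearch engine, its example.com URL, and its pattern are all hypothetical, and the +domain%3A suffix is borrowed from the AlltheWeb and AltaVista entries; a real engine may use different syntax.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the record for one engine and one query type.
# $suffix is appended to the already-encoded query ("" for a plain search).
sub make_engine {
    my ($name, $base_url, $regexp, $query, $suffix) = @_;
    return {
        name   => $name,
        regexp => $regexp,
        query  => $base_url . $query . $suffix,
    };
}

# Hypothetical engine; example.com is a placeholder, not a real search engine.
my %new_engine = (
    name     => "ExampleSearch",
    base_url => "http://search.example.com/web?q=",
    regexp   => 'of about ([\d,]+) results',
);

# Already-encoded query, as the script would have it after substitution.
my $query = "google+hacks";

# Domain restriction suffixes, assumed to match the other scraped engines.
my %suffix_for = (
    plain => "",
    com   => "+domain%3Acom",
    org   => "+domain%3Aorg",
    net   => "+domain%3Anet",
);

# One record per query type.
my %entries;
foreach my $type (sort keys %suffix_for) {
    $entries{$type} = make_engine(
        @new_engine{qw(name base_url regexp)}, $query, $suffix_for{$type});
}

print "$_: $entries{$_}{query}\n" for sort keys %entries;
```

Inside the script, each record could then be added under its query type, for example by assigning $entries{com} to $query_hash->{com}{ExampleSearch}, before the loops run.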

2.24.2. Running the Hack

This hack runs as a CGI script, called from your web browser as google_compare.cgi?query=your query keywords.
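
If you build the link from another program rather than typing it into the browser, the keywords need to be URL-encoded first. The sketch below shows one way to do that in plain Perl; encode_query is a hypothetical helper, and the host and path are placeholders for wherever the script is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: percent-encode everything except the unreserved
# characters (letters, digits, hyphen, underscore, period, tilde).
# Assumes a byte string (plain ASCII here).
sub encode_query {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf("%%%02X", ord($1))/ge;
    return $text;
}

# Placeholder location; substitute your own server and cgi-bin path.
my $base = "http://www.example.com/cgi-bin/google_compare.cgi";
my $url  = $base . "?query=" . encode_query('"google hacks" perl');

print "$url\n";
# prints http://www.example.com/cgi-bin/google_compare.cgi?query=%22google%20hacks%22%20perl
```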

2.24.3. Why?

You might be wondering why you would want to compare result counts across search engines. It's a good idea to follow what different search engines offer in terms of results. While you might find that a phrase on one search engine provides only a few results, another engine might return results aplenty. It makes sense to spend your time and energy using the latter for the research at hand.
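
Once the counts are in hand, choosing where to start is a matter of comparing them as numbers rather than as comma-formatted strings. The following sketch ranks engines by count. The helper names are hypothetical and the counts are made up for illustration, not real results.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert a displayed count such as "1,234,567" to a plain number.
sub count_to_number {
    my ($count) = @_;
    (my $digits = $count) =~ s/\D//g;   # drop commas and other non-digits
    return length($digits) ? $digits + 0 : 0;
}

# Return the engine names sorted from most to fewest results,
# breaking ties alphabetically.
sub rank_engines {
    my (%count_for) = @_;
    return sort {
        count_to_number($count_for{$b}) <=> count_to_number($count_for{$a})
            or $a cmp $b
    } keys %count_for;
}

# Made-up counts, for illustration only.
my %counts = (Google => "12,400", AlltheWeb => "310", Altavista => "98,700");
my @ranked = rank_engines(%counts);

print "Start with $ranked[0]\n";   # prints "Start with Altavista"
```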

Tara Calishain and Kevin Hemenway
