Recipe 22.14. Using robots.txt to Control Web Crawlers

22.14.1 Problem

You like that search engines find your web sites, index them, and make it easy for visitors to find you. But some of them are driving you nuts by visiting too often, so you want to exclude them. Also, there are some directories or pages on your site that you don't want indexed.

22.14.2 Solution

Write a robots.txt file containing your restrictions, and place it in your root web site directory. A robots.txt file looks like this:

# all spiders/bots/crawlers etc.
# are not allowed to index these pages
User-agent: *
Disallow: /error/
Disallow: /users/
Disallow: /cgi-bin/
Disallow: /*.doc$
Disallow: /tmp/

# exclude these ill-mannered bots that
# suck up bandwidth
User-agent: BadBot/
Disallow: /

User-agent: VeryBadBot/
Disallow: /

22.14.3 Discussion

The two directives used in a robots.txt file are User-agent and Disallow. (Note that the /*.doc$ entry uses wildcard matching, which major crawlers such as Googlebot honor but which is not part of the original robots exclusion standard.) Cruise your web logs to find specific user agents. Some examples are:

"GET /robots.txt HTTP/1.1" "http://www.whois.sc/" "SurveyBot/2.3 (Whois Source)"
"GET /foo.htm HTTP/1.0" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
"GET /foo HTTP/1.0" "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"

The user agents here are SurveyBot/, Googlebot/, and msnbot/. If you wish to name them specifically, your entries would look like this:

User-agent: SurveyBot/
Disallow: /cgi-bin/

User-agent: msnbot/
Disallow: /

Search engines are not obliged to obey your robots.txt file; compliance is voluntary. The major ones are pretty well behaved, but there are many scammy crawlers that follow no rules at all. robots.txt is useful for keeping files out of the indexes, but don't make the mistake of thinking you can protect sensitive pages this way. Sensitive information should not be on a public web site at all. Keep an eye on your logs. If a crawler is really giving you fits, see the next recipe to learn how to exclude it completely using Apache's built-in access rules.
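Keeping an eye on the logs is easier with a couple of one-liners. This is a rough sketch, not part of the original recipe: it assumes Apache's combined log format, where the user agent is the last quoted field, and a log file at /var/log/apache2/access.log. Adjust the path and field number for your own setup.

# count requests per user agent, busiest first
# (with the combined log format, the user agent is the sixth "-delimited field)
awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20

# list which user agents actually request robots.txt before crawling
grep 'GET /robots.txt' /var/log/apache2/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn

The first command ranks the crawlers hitting your site hardest; the second shows which of them are polite enough to fetch robots.txt at all.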
22.14.4 See Also