Recipe 22.14. Using robots.txt to Control Web Crawlers

22.14.1 Problem

You like that search engines find your web sites, index them, and make it easy for visitors to find you. But some of them are driving you nuts by visiting too often, so you want to exclude them. Also, there are some directories or pages on your site that you don't want indexed.

22.14.2 Solution

Write a robots.txt file containing your restrictions, and place it in your root web site directory. A robots.txt file looks like this:

# all spiders/bots/crawlers etc.
# are not allowed to index these pages
User-agent: *
Disallow: /error/
Disallow: /users/
Disallow: /cgi-bin/
Disallow: /*.doc$
Disallow: /tmp/

# exclude these ill-mannered bots that
# suck up bandwidth
User-agent: BadBot/
Disallow: /

User-agent: VeryBadBot/
Disallow: /

22.14.3 Discussion

The two directives used in a robots.txt file are User-agent and Disallow. (Note that the /*.doc$ entry uses wildcard matching, which major crawlers such as Googlebot honor but which is not part of the original robots exclusion standard.) Cruise your web logs to find specific user agents. Some examples are:

"GET /robots.txt HTTP/1.1" "http://www.whois.sc/" "SurveyBot/2.3 (Whois Source)"
"GET /foo.htm HTTP/1.0" "Googlebot/2.1 (+http://www.googlebot.com/bot.html)"
"GET /foo HTTP/1.0" "-" "msnbot/0.11 (+http://search.msn.com/msnbot.htm)"

The user agents here are SurveyBot/, Googlebot/, and msnbot/. If you wish to name them specifically, your entries would look like this:

User-agent: SurveyBot/
Disallow: /cgi-bin/

User-agent: msnbot/
Disallow: /

Search engines are not obliged to obey your robots.txt file; compliance is voluntary. The major ones are pretty well behaved, but there are many scammy crawlers that follow no rules at all. robots.txt is useful for keeping files out of the indexes, but don't make the mistake of thinking you can protect sensitive pages this way. Sensitive information should not be on a public web site at all. Keep an eye on your logs. If a crawler is really giving you fits, see the next recipe to learn how to exclude it completely using Apache's built-in access rules.
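Keeping an eye on the logs is easier with a couple of one-liners. This is a rough sketch, not part of the original recipe: it assumes Apache's combined log format, where the user agent is the last quoted field, and a log file at /var/log/apache2/access.log. Adjust the path and field number for your own setup.

# count requests per user agent, busiest first
# (with the combined log format, the user agent is the sixth "-delimited field)
awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn | head -20

# list which user agents actually request robots.txt before crawling
grep 'GET /robots.txt' /var/log/apache2/access.log | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn

The first command ranks the crawlers hitting your site hardest; the second shows which of them are polite enough to fetch robots.txt at all.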
22.14.4 See Also