Web Robots and Information
What is a robot?
WWW robots, often known as crawlers, 'bots, scrubbers, indexers or spiders, are programs that visit websites on the World Wide Web and recursively retrieve linked pages in order to search for resources or to index sites for search engines. Although the vast majority of website owners want robots to visit and index their pages for the sake of a higher search-engine ranking, there are times when you do not want a robot to visit part of your website. Using the information below and selecting one of two methods of exclusion, you can make sure that an obscure page or directory deep within your site is not indexed before your main pages.
How can I stop a robot from indexing my site or a portion of it?
The first method of excluding robots from a server is to create a file on the server that specifies an access policy for robots. This file must be placed in your ~/www/htdocs directory and must be named "robots.txt". The robots.txt file starts with a User-agent line, followed by one or more Disallow lines. This method is usually used when entire directories are to be disallowed. The syntax of these lines is detailed below:
User-agent: This field contains either the name of a specific robot you want the record to apply to, or '*' to make it apply to all robots.
Disallow: This field specifies a partial URL that is not to be visited. This can be a full path or a partial path. Any URL that starts with this value will not be retrieved by the robot. For example, Disallow: /test would disallow both /test.html and /test/index.html, whereas Disallow: /test/ would disallow only /test/index.html. Disallow: /index.html would disallow only /index.html. (Note that there is no trailing "/" after an individual filename.)
Let's look at a couple of simple and useful examples. In this example the "/robots.txt" file specifies that no robots should visit any URL starting with "/test/documents/" or "/tmp/":
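Following the User-agent/Disallow syntax described above, such a file would read:

```
User-agent: *
Disallow: /test/documents/
Disallow: /tmp/
```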
This example specifies that no robot should visit "/test/documents/" except for the one called WebCrawler:
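Making an exception for one robot takes two records, separated by a blank line; an empty Disallow field means that robot may visit everything:

```
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /test/documents/
```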
This example specifies that no robot should visit "/document.html":
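Since "/document.html" is an individual file, the Disallow line carries no trailing "/":

```
User-agent: *
Disallow: /document.html
```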
This example tells all robots to go away:
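Disallowing "/" excludes every URL on the site, since every URL starts with "/":

```
User-agent: *
Disallow: /
```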
I use a sitemap to identify my site's contents to one or more search engines. Is there any ROBOT command for that? Is it universal?
As of the first quarter of 2007, yes (or as universal as anything in computing gets...)! In a rare agreement among the industry giants, Microsoft, Yahoo and Google have set the standard for such actions. On top of that, it is easy and simple: it goes on a single line in the robots.txt file.
The ONLY command

- Sitemap: http://www.yoursite.com/sitemapname.xml

The second method deals with META tags. It is more of a per-document answer.
Where do I put a Robots META tag?
Like any META tag it should be placed in the HEAD section of an HTML page:
Normal HTML commands

- <html>
- <head>
- <meta name="robots" content="noindex,nofollow">
- <meta name="description" content="This page ....">
- </head>
- <body>
- ... the entire body of your page ...

What do I put into the Robots META tag?
The content of the Robots META tag contains directives separated by commas. The currently defined (as of July, 2005) directives are [NO]INDEX and [NO]FOLLOW. The INDEX directive specifies whether an indexing robot should index the page or ignore it. The FOLLOW directive specifies whether a robot should follow links on the page or ignore them. The defaults are INDEX and FOLLOW: if no robot tags are in place, a robot will index the page and follow all links on that page, unless the robot finds other conditions that put the page outside the guidelines set by the robot's controller. The values ALL and NONE set all directives on or off: ALL=INDEX,FOLLOW and NONE=NOINDEX,NOFOLLOW. Some examples:
More normal HTML commands

- <meta name="robots" content="all">
- <meta name="robots" content="none">
- <meta name="robots" content="index,follow">
- <meta name="robots" content="noindex,follow">
- <meta name="robots" content="index,nofollow">
- <meta name="robots" content="noindex,nofollow">

Be aware that the tag name "robots" and the subsequent content are case insensitive, though not all robot software programmers have followed this directive. Not all robots respond to the last four of the above options. Some robots have attitudes! Some don't like to see multiple directives in the same statement or command; in that case, use multiple commands. TANS! (There ain't no standard!)
You must take care not to specify conflicting or repeating directives such as:

Abnormal HTML commands for robot attitudes

- <meta name="robots" content="index">
- <meta name="robots" content="follow">

A conflicting HTML command

- <meta name="robots" content="ALL,INDEX,NOINDEX,NOFOLLOW,FOLLOW,FOLLOW,NONE">

Avoid this problem, as the results may be unpredictable. (We do not know of any cases of robots eating a document in anger, but it could happen. Remember the attitude!) Double-check each entry for clarity. If you do NOT mind a robot indexing the page and following its links, it is best to make no entry at all.
How do I know what works on a Robots META tag?
The only way is trial and error. Experiment. Since the guidelines for robots are ambiguous at best, it is hard to follow a standard that really isn't a standard. Since not all servers run UNIX or a similar operating system, even the robots.txt file is not guaranteed to work. The commands below are as close to a standard as it gets.
A formal syntax for the Robots META tag content is:
- content = all | none | directives
- all = "ALL"
- none = "NONE"
- directives = directive ["," directives]
- directive = index | follow
- index = "INDEX" | "NOINDEX"
- follow = "FOLLOW" | "NOFOLLOW"
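As an illustration, the grammar and defaults above can be sketched as a small parser. This is a hypothetical helper for clarity, not part of any standard library or robot:

```python
def parse_robots_meta(content):
    """Parse a Robots META tag content value into (index, follow) booleans.

    Follows the formal syntax above: defaults are INDEX and FOLLOW,
    ALL = INDEX,FOLLOW and NONE = NOINDEX,NOFOLLOW, and directives
    are case insensitive, comma separated.
    """
    index, follow = True, True  # defaults when no directives are given
    for directive in content.split(","):
        d = directive.strip().upper()  # the content is case insensitive
        if d == "ALL":
            index, follow = True, True
        elif d == "NONE":
            index, follow = False, False
        elif d == "INDEX":
            index = True
        elif d == "NOINDEX":
            index = False
        elif d == "FOLLOW":
            follow = True
        elif d == "NOFOLLOW":
            follow = False
    return index, follow
```

Note that later directives simply overwrite earlier ones here, which is one plausible reading of a conflicting tag; a real robot's behavior is, as noted above, unpredictable.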
A working example of a no-robot page on our site
FAQs about robots
Co-operating Sponsors and Technology used on our Website
The above links were last checked on 3/17/2007.