Additions to the robots.txt Standard
Introduction
The robots.txt standard is a very useful tool for both webmasters and
the people who run web crawlers. This standard could be even more
useful with several additions. The additions suggested below were
inspired both by comments from webmasters and by front-line experience
developing and running the Excite web crawler.
Site naming
Site naming poses several problems for maintainers of web indexes.
Sites can be referenced by many names, and it can be hard to determine
which name the webmaster prefers. Also, large sites can be referenced
by many different physical IP addresses.
Multiple names
Most sites can be referenced by several names. To avoid duplication,
crawlers usually canonicalize these names by converting them to IP
addresses. When presenting the results of a search, it is desirable to
use the name instead of the IP address. Sometimes it is obvious which
of several names to use (e.g. the one that starts with www), but in
many cases it is not. The robots.txt file should have an entry that
states the preferred name for the site.
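One possible form, using a purely illustrative directive name, might
be:

    # The name the webmaster prefers for this site
    Preferred-Name: www.example.com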
Multiple IP addresses
Many high-traffic sites use multiple servers. Machines are added
frequently and their IP addresses often change. Crawlers do not have a
good, inexpensive way to understand and keep track of the ever-changing
mapping of servers to logical sites. This causes needless duplication
of effort by the crawler and higher traffic at the sites. The
robots.txt file is an ideal place to include a list of ip addresses
that map to a logical site.
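Such a list might look like the following sketch (the directive name
and the addresses are illustrative only):

    # All of these addresses serve this logical site
    Site-Address: 192.0.2.10
    Site-Address: 192.0.2.11
    Site-Address: 192.0.2.12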
Freshness of content
HTTP provides mechanisms for determining how recently a file has been
modified; it even provides mechanisms for avoiding data transfer costs
if the file has not changed since the last visit of a browser or
crawler. However, the performance of both crawlers and the sites they
visit could be improved by providing higher-level information about
when content on a site has changed.
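For reference, the existing HTTP mechanism works one page at a time:
a crawler sends a conditional request, and the server answers 304 Not
Modified if the page is unchanged:

    GET /index.html HTTP/1.0
    If-Modified-Since: Sat, 01 Jun 1996 08:00:00 GMT

    HTTP/1.0 304 Not Modified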
Freshness of web pages
One addition that could dramatically reduce traffic would be a
representation of modified dates for various parts of the site. Today
the only way to tell which pages need updating is to use the
If-Modified-Since request-header field, which costs a connection per
page. Having this information centralized in the robots.txt file would
decrease server loads. This information could be presented at a
directory or file level depending on the size of the site and the
granularity of information the webmaster wants to provide. A useful
representation might be a reverse-chronological list of files and the
dates that they were last modified.
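One illustrative sketch of such a list (the directive name and date
format are not meant as a concrete proposal):

    # Last-modified dates, most recent first
    Modified: 1996-06-18 /products/new.html
    Modified: 1996-06-02 /products/
    Modified: 1996-05-15 /papers/index.html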
Freshness of the robots.txt file
The robots.txt file needs to include a time-to-live (TTL) value. This
tells crawlers how often they should update the robots.txt information
for that site. Some sites very rarely change their robots.txt files
and do not want the extra traffic of having them frequently re-read
by multiple crawlers. Even if the If-Modified-Since request-header
field is used, a connection still has to be created each time. On the
other hand, some sites change their robots.txt files frequently and
are hurt when crawlers cache robots.txt information for too long.
Having an explicit TTL value would help crawlers satisfy each site's
requirements.
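A sketch of such a directive, assuming the value is given in seconds
(both the name and the units are illustrative):

    # Re-read this file no more than once a week
    TTL: 604800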
Flexibility of the robots.txt file
Although the simplicity of the robots.txt file is a benefit, many
sites on the Internet today have structures that are too complex to
represent with the current robots.txt format.
Multiple content providers
In some instances many people provide the content for a single site.
A good example is a university site with a separate area for each
student. Each of these individuals might want to control access to
his or her own section of the site. It is often impractical to allow
all of them to edit one global
robots.txt file. The robots.txt file should have a way to redirect
the crawler to read separate robots.txt files from further down in the
site. This allows different robots.txt information to be specified
for separate parts of the site.
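One illustrative way to express such a redirection (the directive
name is not part of any existing specification):

    # Consult these files for their parts of the site
    Delegate: /students/jdoe/robots.txt
    Delegate: /students/msmith/robots.txt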
Complex directory structures
The Disallow statement of the current robots.txt standard could be
made more powerful. For various reasons some sites cannot change
their on-disk layout and may have very large directories. It is very
cumbersome to exclude part of a large directory using the current
Disallow statement. A more powerful regular-expression syntax or an
'Allow' directive to override a Disallow for specific files would be
useful.
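For instance, an 'Allow' directive might override a broader Disallow
as in the following sketch:

    User-agent: *
    Disallow: /archive/
    # Proposed override: permit one subtree of the excluded directory
    Allow: /archive/abstracts/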
Description of the site
An optional description of the site would also be a welcome addition
to the robots.txt standard. A brief human-readable statement about
the site's purpose and the kind of content it contains would provide
useful information to the end users of the various repositories
created by web crawlers.
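For example (the directive name is illustrative):

    # A brief human-readable summary of the site
    Description: Research papers and course materials from Example University.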
Conclusion
We have described several improvements to the robots.txt standard.
These would improve the performance and usefulness of both web
crawlers and web sites. We have not provided details on what the
changes to the format should look like, but none of the improvements
seem difficult to specify.
Mike Frumkin (mfrumkin@excite.com)
Graham Spencer (gspencer@excite.com)
Excite Inc.
This page is part of the DISW 96 workshop.