Robots are no different from any other HTTP client program. They too need to abide by the rules of the HTTP specification. A robot that is making HTTP requests and advertising itself as an HTTP/1.1 client needs to use the appropriate HTTP request headers.
Many robots try to implement the minimum amount of HTTP needed to request the content they seek. This can lead to problems; however, it's unlikely that this behavior will change anytime soon. As a result, many robots make HTTP/1.0 requests, because that protocol has few requirements.
Despite the minimum amount of HTTP that robots tend to support, most do implement and send some identification headers, most notably the User-Agent HTTP header. It's recommended that robot implementors send some basic header information to notify the site of the capabilities of the robot, the robot's identity, and where it originated.
This is useful information both for tracking down the owner of an errant crawler and for giving the server some information about what types of content the robot can handle. Some of the basic identifying headers that robot implementors are encouraged to implement (shown together in the request sketch that follows this list) are:
User-Agent
Tells the server the name of the robot making the request.
From
Provides the email address of the robot's user/administrator. [8]
[8] An RFC 822 email address format.
Accept
Tells the server what media types are okay to send. [9] This can help ensure that the robot receives only content in which it's interested (text, images, etc.).
[9] Section 3.5.2.1 lists all of the accept headers; robots may find it useful to send headers such as Accept-Charset if they are interested in particular versions.
Referer
Provides the URL of the document that contains the current request-URL. [10]
[10] This can be very useful to site administrators that are trying to track down how a robot found links to their sites' content.
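A minimal sketch of a polite robot request carrying these identifying headers, using Python's standard http.client module, might look like the following. The robot name, contact address, and URLs are hypothetical placeholders, not values prescribed by HTTP.

import http.client

# Hypothetical robot identity and target URL -- substitute your own values.
conn = http.client.HTTPConnection("www.joes-hardware.com")
conn.request("GET", "/tools.html", headers={
    "User-Agent": "ExampleBot/1.0",                        # name of the robot
    "From": "bot-admin@example.com",                       # whom to contact about it
    "Accept": "text/html, text/plain;q=0.5",               # media types the robot can handle
    "Referer": "http://www.joes-hardware.com/index.html",  # page where the link was found
})
# http.client adds the required Host header automatically.
response = conn.getresponse()
body = response.read()
conn.close()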
Robot implementors need to support the Host header. Given the prevalence of virtual hosting ( Chapter 5 discusses virtually hosted servers in more detail), not including the Host HTTP header in requests can lead to robots identifying the wrong content with a particular URL. HTTP/1.1 requires the use of the Host header for this reason.
Most servers are configured to serve a particular site by default. Thus, a crawler that does not include the Host header can make a request to a server serving two sites, like those in Figure 9-5 ( www.joes-hardware.com and www.foo.com ), and if the server is configured to serve www.joes-hardware.com by default (and does not require the Host header), a request for a page on www.foo.com can result in the crawler getting content from the Joe's Hardware site. Worse yet, the crawler will actually think the content from Joe's Hardware came from www.foo.com . I am sure you can think of even more unfortunate situations if documents from two sites with opposing political or other views were served from the same server.
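To make the Host header's role concrete, here is a raw-socket sketch of an HTTP/1.1 request to a virtually hosted server; the host names follow the Figure 9-5 example, and the robot name is a hypothetical placeholder.

import socket

# Two sites share one server, so only the Host header tells the server
# which site's content is being requested.
request = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: www.foo.com\r\n"           # omit this and a default-configured server
    "User-Agent: ExampleBot/1.0\r\n"  # may answer with www.joes-hardware.com content
    "Connection: close\r\n"
    "\r\n"
)
sock = socket.create_connection(("www.foo.com", 80))
sock.sendall(request.encode("ascii"))
reply = b""
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    reply += chunk
sock.close()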
Given the enormous scale of some robotic endeavors, it often makes sense to minimize the amount of content a robot retrieves. In the case of Internet search-engine robots, with potentially billions of web pages to download, it makes sense to re-retrieve content only if it has changed.
Some of these robots implement conditional HTTP requests, [11] comparing timestamps or entity tags to see if the last version that they retrieved has been updated. This is very similar to the way that an HTTP cache checks the validity of the local copy of a previously fetched resource. See Chapter 7 for more on how caches validate local copies of resources.
[11] Section 3.5.2.2 gives a complete listing of the conditional headers that a robot can implement.
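A conditional revisit along those lines might look like the following sketch, again using Python's http.client; the URL and the cached validators (an ETag and a Last-Modified date assumed to have been saved on an earlier crawl of the same page) are hypothetical.

import http.client

# Validators assumed to have been saved from the previous fetch of this page.
cached_etag = '"abc123"'
cached_date = "Tue, 01 Jul 2003 12:00:00 GMT"

conn = http.client.HTTPConnection("www.joes-hardware.com")
conn.request("GET", "/tools.html", headers={
    "User-Agent": "ExampleBot/1.0",
    "If-None-Match": cached_etag,      # entity-tag comparison
    "If-Modified-Since": cached_date,  # timestamp comparison
})
response = conn.getresponse()
if response.status == 304:
    body = None                        # Not Modified: reuse the previously stored copy
else:
    body = response.read()             # content changed (or server ignored the validators)
conn.close()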
Because many robots are interested primarily in getting the content they request through simple GET methods, they often don't do much in the way of response handling. However, robots that use some features of HTTP (such as conditional requests), as well as those that want to better explore and interoperate with servers, need to be able to handle different types of HTTP responses.
In general, robots should be able to handle at least the common or expected status codes. All robots should understand HTTP status codes such as 200 OK and 404 Not Found. They also should be able to deal with status codes that they don't explicitly understand based on the general category of response. Table 3-2 in Chapter 3 gives a breakdown of the different status-code categories and their meanings.
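A rough sketch of that category-based fallback is shown below; the function name and the category labels are illustrative, not part of any standard.

# Map any status code to its general category when the exact code is unknown.
def classify_status(status):
    if 200 <= status < 300:
        return "success"       # e.g., 200 OK: process the entity body
    if 300 <= status < 400:
        return "redirection"   # follow the Location header, within a sane limit
    if 400 <= status < 500:
        return "client error"  # e.g., 404 Not Found: drop or flag the URL
    if 500 <= status < 600:
        return "server error"  # possibly retry the request later
    return "unknown"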
It is important to note that some servers don't always return the appropriate error codes. Some servers even return 200 OK HTTP status codes with the text body of the message describing an error! It's hard to do much about this; it's just something for implementors to be aware of.
Along with information embedded in the HTTP headers, robots can look for information in the entity itself. Meta HTML tags, [12] such as the meta http-equiv tag, are a means for content authors to embed additional information about resources.
[12] Section 9.4.7.1 lists additional meta directives that site administrators and content authors can use to control the behavior of robots and what they do with documents that have been retrieved.
The http-equiv tag itself is a way for content authors to override certain headers that the server handling their content may serve:
<meta http-equiv="Refresh" content="1;URL=index.html">
This tag instructs the receiver to treat the document as if its HTTP response header contained a Refresh HTTP header with the value "1;URL=index.html". [13]
[13] The Refresh HTTP header sometimes is used as a means to redirect users (or in this case, a robot) from one page to another.
Some servers actually parse the contents of HTML pages prior to sending them and include http-equiv directives as headers; however, some do not. Robot implementors may want to scan the HEAD elements of HTML documents to look for http-equiv information. [14]
[14] Meta tags must occur in the HEAD section of HTML documents, according to the HTML specification. However, they sometimes occur in other HTML document sections, as not all HTML documents adhere to the specification.
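One way such a scan might look, using Python's standard-library HTML parser, is sketched below; production crawlers often prefer a more fault-tolerant parser, precisely because many documents violate the specification.

from html.parser import HTMLParser

# Collect http-equiv directives that appear in the document's HEAD section.
class MetaHttpEquivParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_head = False
        self.http_equiv = {}   # maps header name to its content value

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head and "http-equiv" in attrs:
            self.http_equiv[attrs["http-equiv"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

parser = MetaHttpEquivParser()
parser.feed('<html><head><meta http-equiv="Refresh" content="1;URL=index.html"></head></html>')
print(parser.http_equiv)   # {'Refresh': '1;URL=index.html'}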
Web administrators should keep in mind that many robots will visit their sites and therefore should expect requests from them. Many sites optimize content for various user agents, attempting to detect browser types to ensure that various site features are supported. Because robots don't match the expected browser types, such sites end up serving error pages instead of content to them. Performing a text search for the phrase "your browser does not support frames" on some search engines will yield a list of results for error pages that contain that phrase, when in fact the HTTP client was not a browser at all, but a robot.
Site administrators should plan a strategy for handling robot requests. For example, instead of limiting their content development to specific browser support, they can develop catch-all pages for non-feature-rich browsers and robots. At a minimum, they should expect robots to visit their sites and not be caught off guard when they do. [15]
[15] Section 9.4 provides information for how site administrators can control the behavior of robots on their sites if there is content that should not be accessed by robots.