Hack 97 Making Your Resources Scrapable with a REST Interface


Consider offering alternative versions of site documents for a variety of human and machine visitors, based on how they present themselves.

Another way of looking at web resources is to consider them for the information they represent, rather than as applications that need automated access through scraping. Think about the two main facets of web site scraping: navigation and extraction. These facets reflect the architecture of the Web itself as a network of navigable links. Further, these navigable links point to resources from which we can acquire representations, among a few other things we can do.

This is a loose description of the Representational State Transfer (REST) architectural style (http://conveyor.com/RESTwiki/moin.cgi). In a way, the principles of REST seem like where we'd arrive if we took accommodating scrapers as far as we could go. Ultimately, REST is a fancy term for using URLs and HTTP GET/POST to represent ways of accessing resources.

Navigation between resources to find the resource we want is simple, because representations of resources (i.e., HTML pages, XML documents, etc.) contain clear links to other resources. And extracting data is simple too, since every resource should have multiple representations and the client can specify what it accepts in order to get what it needs. That is, even though people using browsers ask for HTML, a robot looking for something closer to its needs can ask for it using HTTP content negotiation.

There's a lot more to the REST style involving the manipulation of resources. But since we're concerned only with making existing resources easier to acquire, we'll leave the rest of REST to another book.

Navigating One URI at a Time

Part of the REST philosophy is careful and clean URI design. Since every resource on the Web should be addressable via a URI, good organization and understandable hierarchy make URIs more useful. Also, URIs should be put together in a way that tries to focus on the organization of the resources, hiding any underlying mechanisms such as CGI-BIN directories or file extensions. And since URIs are the way resources on your site are found, their structure and what they point to on your site should be well-documented, akin to a programming interface.

For example, you could establish a URI structure like this to acquire the current weather conditions in your, or any, area:

 http://my.weatherexample.com/locations/48103/conditions/current 

This shows a logical path from topic to topic, finally specifying current conditions in a specific Zip Code area. An application could take things a step at a time, though, first by requesting /locations for a list of available Zip Codes:

 <locations xmlns:xlink="http://www.w3.org/1999/xlink">
   ...
   <location xlink:href="48001/" code="48001" label="Algonac, MI" />
   <location xlink:href="48104/" code="48104" label="Ann Arbor, MI" />
   <location xlink:href="48105/" code="48105" label="Ann Arbor, MI" />
   ...
 </locations>

One of the relative Zip Code links could be followed (/locations/48103, for example) to list the different categories of weather information available:

 <reports xmlns:xlink="http://www.w3.org/1999/xlink">
   <report xlink:href="conditions/" label="Conditions" />
   <report xlink:href="forecasts/" label="Forecasts" />
   <report xlink:href="warnings/" label="Warnings" />
 </reports>

At each step of the way, XML documents containing URI links would lead the application to discover further resources, either automatically or perhaps by selection from a GUI.
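This discovery loop can be sketched with Python's standard library. The XML formats and the my.weatherexample.com host are the hypothetical ones shown above, so this is an illustration of the pattern, not a client for a real service:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# xlink:href expands to this fully qualified attribute name under ElementTree.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

def extract_links(base_url, xml_text):
    """Return {label: absolute URI} for every child element with an xlink:href."""
    links = {}
    for elem in ET.fromstring(xml_text):
        href = elem.get(XLINK_HREF)
        if href is not None:
            # Relative hrefs resolve against the URI the document came from.
            links[elem.get("label")] = urljoin(base_url, href)
    return links

locations_doc = """\
<locations xmlns:xlink="http://www.w3.org/1999/xlink">
  <location xlink:href="48104/" code="48104" label="Ann Arbor, MI" />
</locations>"""

links = extract_links("http://my.weatherexample.com/locations/", locations_doc)
# links["Ann Arbor, MI"] == "http://my.weatherexample.com/locations/48104/"
```

An application would fetch each discovered URI in turn, applying the same extraction to the reports document, until it reached the representation it wanted.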

But REST simply defines the architecture and the philosophy. It doesn't specify any file formats in particular as representations of resources. We could use XML, RDF triples, or even comma-separated lines. REST simply suggests that, whatever the format is, when other resources are mentioned, they must be referred to by URI in order to form the web of navigable links.

Negotiating Better Content

So, now that we have a simple, solid philosophy toward providing an easily navigated web of resources identified by URI, how can we get at the data? Well, in the REST architecture, representations of resources are acquired from the Web. A single resource can have many representations, such as HTML and XML, both of which are different views of the same conceptual thing. So, with this in mind, wouldn't it be nice if, instead of scraping data from an HTML file intended for a browser, our application could simply ask for something more appropriate?

The Apache HTTP Server's description of content negotiation (http://httpd.apache.org/docs/content-negotiation.html) states:

A resource may be available in several different representations. For example, it might be available in different languages or different media types, or a combination. One way of selecting the most appropriate choice is to give the user an index page, and let them select. However it is often possible for the server to choose automatically. This works because browsers can send, as part of each request, information about what representations they prefer.

For example, when a web browser is used to view the current weather conditions, the browser might send this as one of the headers in its request for the resource:

 Accept: text/html; q=1.0, text/*; q=0.8, image/gif; q=0.6,
         image/jpeg; q=0.6, image/*; q=0.5, */*; q=0.1

On the other hand, an application sent to extract the raw weather conditions data itself should send a header like this:

 Accept: application/weatherml+xml; q=1.0,
         application/xml; q=0.8,
         text/*; q=0.6, */*; q=0.1

These two headers each present the HTTP server with a priority list of content types preferred by each client. Accordingly, the web browser wants some form of HTML or text, possibly falling back to an image type. On the other hand, our application wants raw weather data, so it asks for a specific data format (presumably already well-defined), followed by generic XML, then text, and possibly whatever else is lying around.
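To make the matching concrete, here is a minimal sketch in Python of how a server might weigh these q-values against the representations it has on hand. This is deliberately simplified from what Apache's mod_negotiation actually does (which also considers language, charset, and source quality):

```python
def parse_accept(header):
    """Parse an Accept header into [(media_type, q)], highest q first."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type, q = fields[0], 1.0  # q defaults to 1.0 when omitted
        for param in fields[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        prefs.append((media_type, q))
    return sorted(prefs, key=lambda p: p[1], reverse=True)

def best_match(header, available):
    """Pick the available media type the client prefers most, honoring wildcards."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue
        for candidate in available:
            if (media_type == candidate or media_type == "*/*" or
                    (media_type.endswith("/*") and
                     candidate.startswith(media_type[:-1]))):
                return candidate
    return None

robot = ("application/weatherml+xml; q=1.0, application/xml; q=0.8, "
         "text/*; q=0.6, */*; q=0.1")
best_match(robot, ["text/html", "application/weatherml+xml"])
# → "application/weatherml+xml"
```

Given the browser's header instead, the same call would return text/html: each client gets the representation closest to the top of its own list.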

On the server side, with the Apache HTTP Server, these preferences can be handled by first mapping a file extension to the application/weatherml+xml content type, either with an AddType directive in the server configuration or with an equivalent entry in the server's mime.types file:

 AddType application/weatherml+xml .wea 

Then, the MultiViews option can be turned on for a directory:

 Options +MultiViews 

As long as Apache has the mod_negotiation module installed, turning on MultiViews will cause the server to try matching preferences up with resources available in the directory. Content types are identified automatically using file extensions and the mime.types mappings.

For example, the root folder of the weather site could contain two files: current.html and current.wea. When the web browser requests http://my.weatherexample.com/current, it will receive the preferred HTML representation. However, when our data-extracting client requests the same URL, it will receive the contents of current.wea, as per its preferences in the request.
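The client side of this exchange can be sketched with Python's standard library. Since my.weatherexample.com is the hack's hypothetical host, the request is only constructed here, not actually sent:

```python
import urllib.request

# Hypothetical endpoint from the example above; the host is not real.
url = "http://my.weatherexample.com/current"
accept = ("application/weatherml+xml; q=1.0, application/xml; q=0.8, "
          "text/*; q=0.6, */*; q=0.1")

# Attach the robot's Accept preferences to the request.
req = urllib.request.Request(url, headers={"Accept": accept})

# urllib.request.urlopen(req) would now receive current.wea from a
# MultiViews-enabled server, while a browser's default Accept header
# would yield current.html from the very same URL.
```

The point worth noticing is that the URL never changes; only the Accept header does, so human-readable and machine-readable clients can share one documented set of URIs.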

The REST architecture is a good alternative approach to handling information on the Web. Resources identified by URI, whose representations can contain links to further resources, provide a solid basis for building webs of interrelated data far more complex than initially planned for. And using a form of transparent content negotiation allows us to serve both human and machine visitors from the same resources. It's all a bit more complicated than simple scraping, but in terms of scalability and compatibility with future applications, it's worth it.

See Also

  • For an example of one service that provides a REST interface, as well as code to access it, check out [Hack #66].

  • If you want to read more about the REST architectural style, an open-access wiki and list of resources is available at http://internet.conveyor.com/RESTwiki/moin.cgi/FrontPage.

l.m.orchard



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
