Hack 7 Finding the Patterns of Identifiers


If the online database or resource you want to use assigns unique identification numbers, you can stretch what it does by combining it with other sites that recognize the same identifiers.

Some online data collections are just that: huge collections, put together in one place and relying on a search engine or database program to provide organization. These collections have no unique ID numbers and no rhyme or reason to their organization. But that's not always the case.

As more and more libraries put their collection information online, more and more records and pages have their own unique identification numbers.

So what? Here's what: when a web site uses an identifying method for its information that is recognized by other web sites, you can scrape data across multiple sites using that identifying method. For example, say you want to tour the country playing golf but you're afraid of pollution, so you want to play only in the cleanest areas. You could write a script that searches for golf courses at http://www.golfcourses.com, then takes the Zip Codes of the courses returned and checks them against http://www.scorecard.org to see which have the most (or least) polluted environment.

This is a silly example, but it shows how two different online data sources (a golf course database and an environmental pollution guide) can be linked together with a unique identifying number (a Zip Code, in this case).
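To make the idea concrete, here is a minimal sketch of the joining step in Python. It assumes the scraping has already happened: the course names, the pollution scores, and the function name `rank_courses_by_pollution` are all made up for illustration, and neither site's actual output format is shown here.

```python
# Hypothetical sample data standing in for results scraped from two sites.
# Zip Codes are kept as strings so that leading zeros survive.
courses = [
    {"name": "Pine Hollow", "zip": "74104"},
    {"name": "Cedar Ridge", "zip": "02139"},
    {"name": "Lakeside Links", "zip": "90210"},
]

# Hypothetical pollution scores keyed by Zip Code (lower means cleaner).
pollution_by_zip = {"74104": 42, "02139": 17}


def rank_courses_by_pollution(courses, pollution_by_zip):
    """Attach a pollution score to each course and sort cleanest first.

    Courses whose Zip Code has no score are dropped rather than guessed.
    """
    joined = [
        {**course, "pollution": pollution_by_zip[course["zip"]]}
        for course in courses
        if course["zip"] in pollution_by_zip
    ]
    return sorted(joined, key=lambda c: c["pollution"])


if __name__ == "__main__":
    for course in rank_courses_by_pollution(courses, pollution_by_zip):
        print(course["zip"], course["pollution"], course["name"])
```

The Zip Code is the only thing the two data sets have in common, which is exactly why it works as the join key.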

Generally speaking, there are three types of deliberate web data organization:

  • Arbitrary classification systems within a collection

  • Classification systems that use an established universal taxonomy within a collection

  • Classification systems that identify documents across a wide number of collections

Arbitrary Classification Systems Within a Collection

An arbitrary classification system is either not based on an established taxonomy or only loosely based on an established taxonomy. If I give 10 photographs unique codes based on their topics and how much blue they have in them, I have established an arbitrary classification system.

The arbitrary classification system's usefulness is limited. You cannot use its identifying code on other sites. You might be able to detect a pattern in it that allows you to spider large volumes of data, but, on the other hand, you may not. (In other words, files labeled 10A, 10B, 10C, and 10D might be useful, but files labeled CMSH113, LFFD917, and MDFS214 would not.)
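One rough way to test whether a set of labels has a usable pattern is sketched below in Python. It handles only one assumed shape of identifier (digits followed by a single uppercase letter, as in 10A), and the function name `next_ids` is hypothetical. A real collection may need a different pattern entirely.

```python
import re

# Matches IDs made of digits followed by one uppercase letter, e.g. "10A".
ID_PATTERN = re.compile(r"^(\d+)([A-Z])$")


def next_ids(known_ids, count=3):
    """Guess the IDs that follow a run such as 10A, 10B, 10C.

    Returns an empty list when the IDs do not share one numeric prefix
    with consecutive letter suffixes, i.e. when no pattern is evident.
    """
    matches = [ID_PATTERN.match(i) for i in known_ids]
    if not matches or not all(matches):
        return []
    prefixes = {m.group(1) for m in matches}
    if len(prefixes) != 1:
        return []
    letters = sorted(m.group(2) for m in matches)
    # Letters must be consecutive: their code points step by exactly one.
    if any(ord(b) - ord(a) != 1 for a, b in zip(letters, letters[1:])):
        return []
    prefix = prefixes.pop()
    start = ord(letters[-1]) + 1
    stop = min(start + count, ord("Z") + 1)
    return [prefix + chr(code) for code in range(start, stop)]


if __name__ == "__main__":
    print(next_ids(["10A", "10B", "10C", "10D"]))
    print(next_ids(["CMSH113", "LFFD917", "MDFS214"]))
```

The first call yields candidates worth requesting next; the second returns nothing, which is the script's way of saying the labels give a spider nothing to extrapolate from.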

Classification Systems that Use an Established Universal Taxonomy Within a Collection

The most overt example of classification systems that use an established universal taxonomy is a library card catalog that follows the Dewey Decimal, Library of Congress, or other established classification.

The benefit of such systems is mixed. Say I look up Google Hacks at the University of Tulsa. I'll discover that the LOC number is ZA4251.G66 C3 2003. Now, if I plug that number into Google, I'll find about 13 results. Here's the cool part: the results will be from a variety of libraries. So, if I wanted to, I could plug that search into Google and find other libraries that carry Google Hacks and extend that idea into a spidering script [Hack #65].

That's the good thing. The bad thing is that such a search won't list all libraries that carry Google Hacks. Other libraries have different classification systems, so if you're looking for a complete list of libraries carrying the book, you're not going to find it solely with this method. But you may find enough libraries to work well enough for your needs.
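As a small illustration of turning a call number into something a script can request, the Python sketch below builds a search URL with the number quoted as an exact phrase. The function name `call_number_search_url` and the default base URL are assumptions for the example. Fetching and parsing the results is a separate job and is not shown.

```python
from urllib.parse import urlencode


def call_number_search_url(call_number, base="https://www.google.com/search"):
    """Build a search URL that looks for the call number as an exact phrase."""
    # Wrapping the number in double quotes asks for an exact-phrase match;
    # urlencode takes care of escaping the quotes and spaces.
    return base + "?" + urlencode({"q": f'"{call_number}"'})


if __name__ == "__main__":
    print(call_number_search_url("ZA4251.G66 C3 2003"))
```

Quoting matters here because the call number contains spaces, and an unquoted search would match each fragment separately.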

Classification Systems that Identify Documents Across a Wide Number of Collections

Beyond site classifications that are based on an established taxonomy, there are systems that use an identification number that is universally recognized and applied. Examples of such systems include:


ISBN (International Standard Book Number)

As you might guess, this number is an identification system for books. There are similar numbers for serials, music, scientific reports, etc. You'll see ISBNs used everywhere from library catalogs to Amazon.com: anywhere books are listed.


EIN (Employer Identification Number)

Used by the IRS. You'll see this number in tax-filing databases, references to businesses and nonprofits, and so on.


Zip Code

Allows the U.S. Post Office to identify unique areas.

This list barely scratches the surface. Of course, you can go even further, including unique characteristics such as latitude and longitude, other business identification numbers, or even area codes. The challenge is identifying a unique classification number that needs little enough context to be usable by your spider. "918" is a three-digit string that has plenty of search results beyond just those related to area codes. So, you might not be able to find a way to eliminate your false positives when building a spider that depends on area codes for results.

On the other hand, an extended classification number, such as an LOC catalog number or ISBN, is going to have few, if any, false positives. The longer or more complicated an identifying number is, the better it serves the purposes of spidering.
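Part of what makes a longer identifier so dependable is that it can often be checked. The ten-digit ISBN, for instance, carries a check digit: multiply the digits by the weights 10 down to 1, and the total must be divisible by 11 (a final X stands for 10). The Python sketch below uses that rule to pull likely ISBNs out of a block of text. The function names are hypothetical, and it assumes unhyphenated ten-character ISBNs only; thirteen-digit ISBNs use a different checksum and are not handled.

```python
import re

# Candidate: nine digits plus a digit or X, not embedded in a longer number.
CANDIDATE = re.compile(r"(?<![\dX])\d{9}[\dX](?![\dX])")


def is_valid_isbn10(candidate):
    """Return True if the 10-character string passes the ISBN-10 checksum."""
    if not re.fullmatch(r"\d{9}[\dX]", candidate):
        return False
    values = [10 if ch == "X" else int(ch) for ch in candidate]
    # Weights run from 10 down to 1; a valid ISBN-10 sums to a multiple of 11.
    total = sum(w * v for w, v in zip(range(10, 0, -1), values))
    return total % 11 == 0


def find_isbn10s(text):
    """Return every ten-character candidate in text that passes the checksum."""
    return [c for c in CANDIDATE.findall(text) if is_valid_isbn10(c)]


if __name__ == "__main__":
    sample = "ISBN 0596005776, order 0596005777, area code 918"
    print(find_isbn10s(sample))
```

A three-digit area code has no such built-in check, so a spider has nothing but surrounding context to separate real hits from noise.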

Some Large Collections with ID Numbers

There are several places online that use unique classification numbers, which you can use across other sites. Here are a few you might want to play with:


Amazon.com (http://www.amazon.com), Abebooks (http://www.abebooks.com)

Both of these sites use ISBNs. Combining the two result sets lets you aggregate seller information and find the cheapest price on used books.


The International Standard Serial Number Register (http://www.issn.org)

You need to be a subscriber to use this site, but trials are available. ISSNs are used for both online and offline magazines.


United States Post Office (http://www.usps.com)

This site allows you to do both standard and nine-digit Zip Code lookups, allowing you to find more specific areas within a Zip Code (and eliminate false positive results from your spider).


GuideStar (http://www.guidestar.org/search/)

A database of nonprofit organizations, with a search page that allows you to search by EIN. A variety of other business databases also allow you to search by EIN.
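As a rough sketch of how result sets from two ISBN-based booksellers might be combined, the Python below keeps the cheapest listing for each ISBN. It assumes the listings have already been scraped into dictionaries; the seller names, prices, and the function name `cheapest_by_isbn` are sample values invented for the example.

```python
# Hypothetical listings standing in for results scraped from two booksellers.
# ISBNs are strings so that leading zeros and a trailing X are preserved.
site_a = [
    {"isbn": "0596005776", "seller": "A-1", "price": 12.50},
    {"isbn": "0596004478", "seller": "A-2", "price": 9.00},
]
site_b = [
    {"isbn": "0596005776", "seller": "B-1", "price": 10.75},
]


def cheapest_by_isbn(*result_sets):
    """Merge any number of result sets, keeping the lowest price per ISBN."""
    best = {}
    for results in result_sets:
        for listing in results:
            isbn = listing["isbn"]
            if isbn not in best or listing["price"] < best[isbn]["price"]:
                best[isbn] = listing
    return best


if __name__ == "__main__":
    for isbn, listing in sorted(cheapest_by_isbn(site_a, site_b).items()):
        print(isbn, listing["seller"], listing["price"])
```

Because both sites key their listings on the same number, no fuzzy matching on titles or authors is needed.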



Spidering Hacks
ISBN: 0596005776
EAN: 2147483647
Year: 2005
Pages: 157
