Spidering Hacks
By Tara Calishain, Kevin Hemenway
Publisher : O'Reilly
Pub Date : October 2003
ISBN : 0-596-00577-6
Pages : 424

Written for developers, researchers, technical assistants, librarians, and power users, Spidering Hacks provides expert tips on spidering and scraping methodologies. You'll begin with a crash course in spidering concepts, tools (Perl, LWP, out-of-the-box utilities), and ethics (how to know when you've gone too far: what's acceptable and unacceptable). Next, you'll collect media files and data from databases. Then you'll learn how to interpret and understand the data, repurpose it for use in other applications, and even build authorized interfaces to integrate the data into your own content.

  Table of Contents
        About the Authors
        Why Spidering Hacks?
        How This Book Is Organized
        How to Use This Book
        Conventions Used in This Book
        How to Contact Us
        Got a Hack?
        Chapter 1.   Walking Softly
        Hacks #1-7
        Hack  1.   A Crash Course in Spidering and Scraping
        Hack  2.   Best Practices for You and Your Spider
        Hack  3.   Anatomy of an HTML Page
        Hack  4.   Registering Your Spider
        Hack  5.   Preempting Discovery
        Hack  6.   Keeping Your Spider Out of Sticky Situations
        Hack  7.   Finding the Patterns of Identifiers
        Chapter 2.   Assembling a Toolbox
        Hacks #8-32
        Perl Modules
        Resources You May Find Helpful
        Hack  8.   Installing Perl Modules
        Hack  9.   Simply Fetching with LWP::Simple
        Hack  10.   More Involved Requests with LWP::UserAgent
        Hack  11.   Adding HTTP Headers to Your Request
        Hack  12.   Posting Form Data with LWP
        Hack  13.   Authentication, Cookies, and Proxies
        Hack  14.   Handling Relative and Absolute URLs
        Hack  15.   Secured Access and Browser Attributes
        Hack  16.   Respecting Your Scrapee's Bandwidth
        Hack  17.   Respecting robots.txt
        Hack  18.   Adding Progress Bars to Your Scripts
        Hack  19.   Scraping with HTML::TreeBuilder
        Hack  20.   Parsing with HTML::TokeParser
        Hack  21.   WWW::Mechanize 101
        Hack  22.   Scraping with WWW::Mechanize
        Hack  23.   In Praise of Regular Expressions
        Hack  24.   Painless RSS with Template::Extract
        Hack  25.   A Quick Introduction to XPath
        Hack  26.   Downloading with curl and wget
        Hack  27.   More Advanced wget Techniques
        Hack  28.   Using Pipes to Chain Commands
        Hack  29.   Running Multiple Utilities at Once
        Hack  30.   Utilizing the Web Scraping Proxy
        Hack  31.   Being Warned When Things Go Wrong
        Hack  32.   Being Adaptive to Site Redesigns
        Chapter 3.   Collecting Media Files
        Hacks #33-42
        Hack  33.   Detective Case Study: Newgrounds
        Hack  34.   Detective Case Study: iFilm
        Hack  35.   Downloading Movies from the Library of Congress
        Hack  36.   Downloading Images from Webshots
        Hack  37.   Downloading Comics with dailystrips
        Hack  38.   Archiving Your Favorite Webcams
        Hack  39.   News Wallpaper for Your Site
        Hack  40.   Saving Only POP3 Email Attachments
        Hack  41.   Downloading MP3s from a Playlist
        Hack  42.   Downloading from Usenet with nget
        Chapter 4.   Gleaning Data from Databases
        Hacks #43-89
        Hack  43.   Archiving Yahoo! Groups Messages with yahoo2mbox
        Hack  44.   Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups
        Hack  45.   Gleaning Buzz from Yahoo!
        Hack  46.   Spidering the Yahoo! Catalog
        Hack  47.   Tracking Additions to Yahoo!
        Hack  48.   Scattersearch with Yahoo! and Google
        Hack  49.   Yahoo! Directory Mindshare in Google
        Hack  50.   Weblog-Free Google Results
        Hack  51.   Spidering, Google, and Multiple Domains
        Hack  52.   Scraping Product Reviews
        Hack  53.   Receive an Email Alert for Newly Added Reviews
        Hack  54.   Scraping Customer Advice
        Hack  55.   Publishing Associates Statistics
        Hack  56.   Sorting Recommendations by Rating
        Hack  57.   Related Products with Alexa
        Hack  58.   Scraping Alexa's Competitive Data with Java
        Hack  59.   Finding Album Information with FreeDB and
        Hack  60.   Expanding Your Musical Tastes
        Hack  61.   Saving Daily Horoscopes to Your iPod
        Hack  62.   Graphing Data with RRDTOOL
        Hack  63.   Stocking Up on Financial Quotes
        Hack  64.   Super Author Searching
        Hack  65.   Mapping O'Reilly Best Sellers to Library Popularity
        Hack  66.   Using All Consuming to Get Book Lists
        Hack  67.   Tracking Packages with FedEx
        Hack  68.   Checking Blogs for New Comments
        Hack  69.   Aggregating RSS and Posting Changes
        Hack  70.   Using the Link Cosmos of Technorati
        Hack  71.   Finding Related RSS Feeds
        Hack  72.   Automatically Finding Blogs of Interest
        Hack  73.   Scraping TV Listings
        Hack  74.   What's Your Visitor's Weather Like?
        Hack  75.   Trendspotting with Geotargeting
        Hack  76.   Getting the Best Travel Route by Train
        Hack  77.   Geographic Distance and Back Again
        Hack  78.   Super Word Lookup
        Hack  79.   Word Associations with Lexical Freenet
        Hack  80.   Reformatting Bugtraq Reports
        Hack  81.   Keeping Tabs on the Web via Email
        Hack  82.   Publish IE's Favorites to Your Web Site
        Hack  83.   Spidering Game Prices
        Hack  84.   Bargain Hunting with PHP
        Hack  85.   Aggregating Multiple Search Engine Results
        Hack  86.   Robot Karaoke
        Hack  87.   Searching the Better Business Bureau
        Hack  88.   Searching for Health Inspections
        Hack  89.   Filtering for the Naughties
        Chapter 5.   Maintaining Your Collections
        Hacks #90-93
        Hack  90.   Using cron to Automate Tasks
        Hack  91.   Scheduling Tasks Without cron
        Hack  92.   Mirroring Web Sites with wget and rsync
        Hack  93.   Accumulating Search Results Over Time
        Chapter 6.   Giving Back to the World
        Hacks #94-100
        Hack  94.   Using XML::RSS to Repurpose Data
        Hack  95.   Placing RSS Headlines on Your Site
        Hack  96.   Making Your Resources Scrapable with Regular Expressions
        Hack  97.   Making Your Resources Scrapable with a REST Interface
        Hack  98.   Making Your Resources Scrapable with XML-RPC
        Hack  99.   Creating an IM Interface
        Hack  100.   Going Beyond the Book