Spidering Hacks
By Tara Calishain, Kevin Hemenway
|
Publisher: O'Reilly
Pub Date: October 2003
ISBN: 0-596-00577-6
Pages: 424
| | Copyright |
| | Credits |
| | | About the Authors |
| | | Contributors |
|
| | Preface |
| | | Why Spidering Hacks? |
| | | How This Book Is Organized |
| | | How to Use This Book |
| | | Conventions Used in This Book |
| | | How to Contact Us |
| | | Got a Hack? |
|
| | Chapter 1. Walking Softly |
| | | Hacks #1-7 |
| | | Hack 1. A Crash Course in Spidering and Scraping |
| | | Hack 2. Best Practices for You and Your Spider |
| | | Hack 3. Anatomy of an HTML Page |
| | | Hack 4. Registering Your Spider |
| | | Hack 5. Preempting Discovery |
| | | Hack 6. Keeping Your Spider Out of Sticky Situations |
| | | Hack 7. Finding the Patterns of Identifiers |
|
| | Chapter 2. Assembling a Toolbox |
| | | Hacks #8-32 |
| | | Perl Modules |
| | | Resources You May Find Helpful |
| | | Hack 8. Installing Perl Modules |
| | | Hack 9. Simply Fetching with LWP::Simple |
| | | Hack 10. More Involved Requests with LWP::UserAgent |
| | | Hack 11. Adding HTTP Headers to Your Request |
| | | Hack 12. Posting Form Data with LWP |
| | | Hack 13. Authentication, Cookies, and Proxies |
| | | Hack 14. Handling Relative and Absolute URLs |
| | | Hack 15. Secured Access and Browser Attributes |
| | | Hack 16. Respecting Your Scrapee's Bandwidth |
| | | Hack 17. Respecting robots.txt |
| | | Hack 18. Adding Progress Bars to Your Scripts |
| | | Hack 19. Scraping with HTML::TreeBuilder |
| | | Hack 20. Parsing with HTML::TokeParser |
| | | Hack 21. WWW::Mechanize 101 |
| | | Hack 22. Scraping with WWW::Mechanize |
| | | Hack 23. In Praise of Regular Expressions |
| | | Hack 24. Painless RSS with Template::Extract |
| | | Hack 25. A Quick Introduction to XPath |
| | | Hack 26. Downloading with curl and wget |
| | | Hack 27. More Advanced wget Techniques |
| | | Hack 28. Using Pipes to Chain Commands |
| | | Hack 29. Running Multiple Utilities at Once |
| | | Hack 30. Utilizing the Web Scraping Proxy |
| | | Hack 31. Being Warned When Things Go Wrong |
| | | Hack 32. Being Adaptive to Site Redesigns |
|
| | Chapter 3. Collecting Media Files |
| | | Hacks #33-42 |
| | | Hack 33. Detective Case Study: Newgrounds |
| | | Hack 34. Detective Case Study: iFilm |
| | | Hack 35. Downloading Movies from the Library of Congress |
| | | Hack 36. Downloading Images from Webshots |
| | | Hack 37. Downloading Comics with dailystrips |
| | | Hack 38. Archiving Your Favorite Webcams |
| | | Hack 39. News Wallpaper for Your Site |
| | | Hack 40. Saving Only POP3 Email Attachments |
| | | Hack 41. Downloading MP3s from a Playlist |
| | | Hack 42. Downloading from Usenet with nget |
|
| | Chapter 4. Gleaning Data from Databases |
| | | Hacks #43-89 |
| | | Hack 43. Archiving Yahoo! Groups Messages with yahoo2mbox |
| | | Hack 44. Archiving Yahoo! Groups Messages with WWW::Yahoo::Groups |
| | | Hack 45. Gleaning Buzz from Yahoo! |
| | | Hack 46. Spidering the Yahoo! Catalog |
| | | Hack 47. Tracking Additions to Yahoo! |
| | | Hack 48. Scattersearch with Yahoo! and Google |
| | | Hack 49. Yahoo! Directory Mindshare in Google |
| | | Hack 50. Weblog-Free Google Results |
| | | Hack 51. Spidering, Google, and Multiple Domains |
| | | Hack 52. Scraping Amazon.com Product Reviews |
| | | Hack 53. Receive an Email Alert for Newly Added Amazon.com Reviews |
| | | Hack 54. Scraping Amazon.com Customer Advice |
| | | Hack 55. Publishing Amazon.com Associates Statistics |
| | | Hack 56. Sorting Amazon.com Recommendations by Rating |
| | | Hack 57. Related Amazon.com Products with Alexa |
| | | Hack 58. Scraping Alexa's Competitive Data with Java |
| | | Hack 59. Finding Album Information with FreeDB and Amazon.com |
| | | Hack 60. Expanding Your Musical Tastes |
| | | Hack 61. Saving Daily Horoscopes to Your iPod |
| | | Hack 62. Graphing Data with RRDTOOL |
| | | Hack 63. Stocking Up on Financial Quotes |
| | | Hack 64. Super Author Searching |
| | | Hack 65. Mapping O'Reilly Best Sellers to Library Popularity |
| | | Hack 66. Using All Consuming to Get Book Lists |
| | | Hack 67. Tracking Packages with FedEx |
| | | Hack 68. Checking Blogs for New Comments |
| | | Hack 69. Aggregating RSS and Posting Changes |
| | | Hack 70. Using the Link Cosmos of Technorati |
| | | Hack 71. Finding Related RSS Feeds |
| | | Hack 72. Automatically Finding Blogs of Interest |
| | | Hack 73. Scraping TV Listings |
| | | Hack 74. What's Your Visitor's Weather Like? |
| | | Hack 75. Trendspotting with Geotargeting |
| | | Hack 76. Getting the Best Travel Route by Train |
| | | Hack 77. Geographic Distance and Back Again |
| | | Hack 78. Super Word Lookup |
| | | Hack 79. Word Associations with Lexical Freenet |
| | | Hack 80. Reformatting Bugtraq Reports |
| | | Hack 81. Keeping Tabs on the Web via Email |
| | | Hack 82. Publish IE's Favorites to Your Web Site |
| | | Hack 83. Spidering GameStop.com Game Prices |
| | | Hack 84. Bargain Hunting with PHP |
| | | Hack 85. Aggregating Multiple Search Engine Results |
| | | Hack 86. Robot Karaoke |
| | | Hack 87. Searching the Better Business Bureau |
| | | Hack 88. Searching for Health Inspections |
| | | Hack 89. Filtering for the Naughties |
|
| | Chapter 5. Maintaining Your Collections |
| | | Hacks #90-93 |
| | | Hack 90. Using cron to Automate Tasks |
| | | Hack 91. Scheduling Tasks Without cron |
| | | Hack 92. Mirroring Web Sites with wget and rsync |
| | | Hack 93. Accumulating Search Results Over Time |
|
| | Chapter 6. Giving Back to the World |
| | | Hacks #94-100 |
| | | Hack 94. Using XML::RSS to Repurpose Data |
| | | Hack 95. Placing RSS Headlines on Your Site |
| | | Hack 96. Making Your Resources Scrapable with Regular Expressions |
| | | Hack 97. Making Your Resources Scrapable with a REST Interface |
| | | Hack 98. Making Your Resources Scrapable with XML-RPC |
| | | Hack 99. Creating an IM Interface |
| | | Hack 100. Going Beyond the Book |
|
| | Colophon |
| | Index |