Hack 72 Automatically Finding Blogs of Interest

[Difficulty: moderate]

An easy way to find interesting new sites is to peruse an existing site's blogroll: a listing of blogs its author reads regularly. Let's create a spider to automate this by looking for keywords in the content of outbound links.

I enjoy reading blogs, but with the demands of the day, I find it difficult to read the dozen or so I like most, let alone discover new ones. I often have good luck when clicking through the blogrolls of writers I enjoy.

I set out to automate this process by creating a script that starts at one of my favorite sites and then visits each outbound link that site has to offer. As the script downloads each new page, it looks through the content for keywords I've defined, in hopes of finding a new daily read that matches my own interests.

The Code

Save the following script as blogfinder.pl:

#!/usr/bin/perl -w
use strict;
$|++; # autoflush STDOUT, so progress prints immediately.
use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;

# where should results go?
my $result_file  = "./result.html";
my $keywords_reg = qr/pipe-delimited search terms/;
my $starter_url  = "your favorite blog here";

# open and create the result.html file;
# the handle stays open for the whole run.
open(RESULT, ">$result_file") or die "Couldn't create: $!\n";
print RESULT "<html><head><title>Spider Findings</title></head><body>\n";

# our workhorse for access.
my $ua = LWP::UserAgent->new;
print "\nnow spidering: $starter_url\n";

# begin our link searching. LinkExtor takes a
# subroutine argument to handle found links,
# and then the actual data of the page.
HTML::LinkExtor->new(
  sub {
        my ($tag, %attr) = @_;
        return if $tag ne 'a';

        # make any href relative link into
        # an absolute value, and add to an
        # internal list of links to check out.
        my @links = map { url($_, $starter_url)->abs() }
                      grep { defined } @attr{qw/href/};

        # walk each link: download it, check for keyword
        # matches, and record hits in the open RESULT file.
        foreach my $link (@links) {
           print " + $link\n"; # hello!
           my $data = $ua->get($link)->content;
           if ($data =~ m/$keywords_reg/i) {
              print RESULT "<a href=\"$link\">$link</a><br>\n"; # one match, yes!
           }
        }

# and now, the actual content that
# HTML::LinkExtor goes through...
})->parse(
  do {
     my $r = $ua->get($starter_url);
     $r->content_type eq "text/html" ? $r->content : "";
  }
);

print RESULT "</body></html>";
close RESULT;
exit;

Once the LWP::UserAgent [Hack #10] object is created, we drop into the main workhorse of the spider: the link-extraction callback. Here is where the script decides which links to spider. Obviously, the seed link comes first, but as the spider traverses that first page, it is on the lookout for further links to extract. This is handled by the HTML::LinkExtor object (http://search.cpan.org/author/GAAS/HTML-Parser-3/lib/HTML/LinkExtor.pm). Each link, in turn, is passed to the HTML::LinkExtor callback, which downloads the linked page, looks for the magic keywords, and makes note of any matches in a newly created result.html file.
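To see the callback pattern in isolation, here is a minimal, self-contained sketch; the inline HTML string and the printed label are illustrative assumptions, not part of the hack itself:

#!/usr/bin/perl -w
use strict;
use HTML::LinkExtor;

# an inline page fragment, purely for demonstration.
my $html = '<p><a href="http://example.com/">a link</a></p>';

# the callback fires once per tag carrying link attributes;
# we ignore everything but anchor tags.
my $parser = HTML::LinkExtor->new(
    sub {
        my ($tag, %attr) = @_;
        return if $tag ne 'a';
        print "found: $attr{href}\n";
    }
);
$parser->parse($html);
$parser->eof; # flush the parser

Run against real pages, href values may be relative, which is why the full script passes each one through URI::URL's url() and abs() before fetching it.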

When the spider has finished its run, you will be left with an HTML file that contains links that match your search criteria. There is, of course, room for refinement. However, one thing I enjoy about this script is the subtle entropy that seems to arise in it. Through this unintended randomness, I am able to discover blogs I would never have discovered by other means. More often than not, such a discovery is one I would rather not have made. But every now and then, a real gem can be seen gleaming at the bottom of the trash heap that is so often our beloved Internet.

Running the Hack

The first thing you should do is replace the two lines at the top of the script with your favorite blog URL and a pipe-delimited (|) list of values, like so:

my $keywords_reg = qr/foaf|perl|os x/;
my $starter_url  = "http://myfavoriteblog.com";

The pipe is the equivalent of OR, so these lines mean "Spider myfavoriteblog.com and search for foaf OR perl OR os x." If you know regular expressions, you can modify this even further to check for word boundaries (so that perl would not match amityperl, for instance).
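For example, the keyword line might be rewritten to anchor each term at word boundaries:

my $keywords_reg = qr/\b(?:foaf|perl|os x)\b/;

Once these two lines are configured, run the script, like so: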

% perl blogfinder.pl

now spidering: http://www.myfavoriteblog.com
 + http://myfavoriteblog.com
 + http://www.luserinterface.net/index.cgi/colophon/
 + mailto:saf@luserinterface.net
 + http://jabber.org/
 + http://sourceforge.net/projects/gaim/
 + http://scottfallin.com/hacks/popBlosx.text

Once the script is finished spidering the outbound links, you'll have a new file (result.html) in the current directory, with a list of URLs that match your keyword criteria.
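For reference, here is roughly what the generated file looks like, assuming a single hypothetical match on http://jabber.org/; the markup follows directly from the script's print statements:

<html><head><title>Spider Findings</title></head><body>
<a href="http://jabber.org/">http://jabber.org/</a><br>
</body></html>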

Hacking the Hack

There are a few ways to modify the hack, the most interesting of which is to add another level of link crawling to begin creating "blog neighborhoods" similar to the idea of "Six Degrees of Kevin Bacon" (http://www.wired.com/news/culture/0,1284,49343,00.html; see also an implementation by Mark Pilgrim based on Google search results: http://diveintomark.org/archives/2002/06/04/who_are_the_people_in_your_neighborhood). One of the easiest additions, however, involves stopping the spider from indexing more data than necessary.

As you can see from the sample output, the spider will look at any URI that has been put into an HTML A tag, which could involve email addresses, IRC and FTP servers, and so forth. Since the spider isn't equipped to handle those protocols, telling it to skip over them is a simple modification:

foreach my $link (@links) {
   next unless $link =~ /^http/i;
   print " + $link\n"; # hello!
   my $data = $ua->get($link)->content;

Other possibilities could restrict spidering to third-party sites only (since you're not interested in spidering your favorite site, but rather the sites it links to) or add an upper limit to the number of sites spidered (i.e., "spider as much as you can, to a maximum of 200 sites").
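Here is a minimal sketch of both ideas, reusing the variable names from the script above; $max_sites, $count, and $home_host are new, hypothetical additions:

# near the top of the script, alongside $starter_url:
my $max_sites = 200; # hypothetical upper limit
my $count     = 0;
my $home_host = url($starter_url)->host;

# inside the callback, at the top of the foreach loop:
foreach my $link (@links) {
   next unless $link =~ /^http/i;     # HTTP links only
   next if $link->host eq $home_host; # third-party sites only (a naive
                                      # host match; www. variants would
                                      # need extra handling)
   last if ++$count > $max_sites;     # stop after $max_sites fetches
   print " + $link\n"; # hello!
   my $data = $ua->get($link)->content;
   # ...keyword matching as before...
}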

Scott Fallin


