Hack #89 Filtering for the Naughties


Use search engines to construct your own parental control ratings for sites.

As we've attempted to show several times in this book, your scripts don't have to start and end with simple Perl spidering. You can also incorporate various web APIs (such as Technorati [Hack #66]). In this hack, we're going to add some Google API magic to see if a list of domains pulled off a page contains prurient (i.e., naughty) content, as determined by Google's SafeSearch filtering mechanism.

As the hack is implemented, a list of domains is pulled off Fark (http://www.fark.com), a site known for its odd selection of daily links. For each domain, up to 50 of its URLs (gathered via a Google search) are put into an array, and each array item is checked to see if it appears in a Google search with SafeSearch enabled. If it does, it's considered to be a good URL. If it doesn't, it's put under suspicion of being a not-so-good URL. The idea is to get a sense of how much of an entire domain is being filtered, instead of just one URL.

Filtering mechanisms are not perfect. Sometimes they filter things that aren't bad at all, and sometimes they miss objectionable content. The goal of this script is to give you a good general idea of where a domain falls on the naughtiness scale, but it won't be perfect.
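If you'd like to experiment with the core test in isolation before running the full script, here's a minimal standalone sketch. The safecheck.pl name and its usage message are just for illustration; the doGoogleSearch call mirrors the one the full script makes, with the fifth argument turning on result filtering and the seventh turning on SafeSearch:

 #!/usr/bin/perl -w
 # safecheck.pl -- does a single URL survive a SafeSearch-enabled query?
 use strict;
 use SOAP::Lite;

 my $google_key = "your Google API key here";
 my $gsrch      = SOAP::Lite->service("file:GoogleSearch.wsdl");

 my $url = shift or die "usage: perl safecheck.pl <url>\n";
 $url =~ s!^http://!!; # search on the bare URL, as the full script does.

 # args: key, query, start, maxResults, filter, restrict,
 #       safeSearch, lr, ie, oe -- filter and safeSearch are both on.
 my $check = $gsrch->doGoogleSearch($google_key, $url,
                      0, 1, "true", "", "true", "", "", "");
 my $hits = $check->{estimatedTotalResultsCount} || 0;
 if ($hits) { print "$url appears with SafeSearch on; probably fine.\n"; }
 else       { print "$url is filtered out; under suspicion.\n"; }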


The Code

Save the following code as purity.pl:

 #!/usr/bin/perl -w
 use strict;
 use LWP::Simple;
 use SOAP::Lite;

 # fill in your google.com API information here.
 my $google_key  = "your Google API key here";
 my $google_wdsl = "GoogleSearch.wsdl";
 my $gsrch       = SOAP::Lite->service("file:$google_wdsl");

 # get our data from Fark's "friends".
 my $fark = get("http://www.fark.com/") or die $!;
 $fark =~ m!Friends:</td></tr>(.*?)<tr><td class="lmhead">Fun Games:!migs;
 my $farklinks = $1; # all our relevances are in here.

 # and now loop through each entry.
 while ($farklinks =~ m!href="(.*?)"!gism) {
    my $farkurl = $1; next unless $farkurl;
    my @checklist; # urls to check for safety.
    print "\n\nChecking $farkurl.\n";

    # getting the full result count for this URL.
    my $count = $gsrch->doGoogleSearch($google_key, $farkurl,
                         0, 1, "false", "", "false", "", "", "");
    my $firstresult = $count->{estimatedTotalResultsCount};
    print "$firstresult matching results were found.\n";
    if ($firstresult > 50) { $firstresult = 50; }

    # now, get a maximum of 50 results, with no safe search.
    my $counter = 0;
    while ($counter < $firstresult) {
        my $urls = $gsrch->doGoogleSearch($google_key, $farkurl,
                            $counter, 10, "false", "", "false", "", "", "");
        foreach my $hit (@{$urls->{resultElements}}) {
            push (@checklist, $hit->{URL});
        }
        $counter = $counter + 10;
    }

    # and now check each of the matching URLs.
    my (@goodurls, @badurls); # storage.
    foreach my $urltocheck (@checklist) {
        $urltocheck =~ s!^http://!!;
        my $firstcheck = $gsrch->doGoogleSearch($google_key, $urltocheck,
                                  0, 1, "true", "", "true", "", "", "");

        # check our results. if no matches, it's naughty.
        my $firstnumber = $firstcheck->{estimatedTotalResultsCount} || 0;
        if ($firstnumber == 0) { push @badurls, $urltocheck; }
        else { push @goodurls, $urltocheck; }
    }

    # and spit out some results.
    my ($goodcount, $badcount) = (scalar(@goodurls), scalar(@badurls));
    print "There are $goodcount good URLs and $badcount ".
          "possibly impure URLs.\n"; # wheeEeeeEE!

    # display bad URLs if there are only a few.
    unless ($badcount >= 10 or $badcount == 0) {
        print "The bad URLs are\n";
        foreach (@badurls) {
            print " http://$_\n";
        }
    }

    # happy percentage display.
    my $percent = $goodcount * 2;
    my $total   = $goodcount + $badcount;
    if ($total == 50) { print "This URL is $percent% pure!"; }
 }

Running the Hack

The hack requires no command-line arguments. Simply run it from the command line as you would any Perl script, and it'll return a list of domains and each domain's purity percentage (as determined by Google's SafeSearch):

 % perl purity.pl

 Checking http://www.aprilwinchell.com/.
 161 matching results were found.
 There are 36 good URLs and 14 possibly impure URLs.
 This URL is 72% pure!

 Checking http://www.badjocks.com/.
 47 matching results were found.
 There are 36 good URLs and 9 possibly impure URLs.
 The bad URLs are
  http://www.thepunchline.com/cgi-bin/links/bad_link.cgi?ID=4052&d=1
  http://www.ilovebacon.com/020502/i.shtml
  http://www.ilovebacon.com/022803/l.shtml
 ...

Hacking the Hack

You might find something else you want to scrape, such as the links on your site's front page. Are you linking to something naughty by mistake? How about performing due diligence on a site you're thinking about linking to; will you inadvertently be leading readers to sites of a questionable nature via a seemingly innocent intermediary? Perhaps you'd like to check entries from a specific portion of the Yahoo! or DMOZ directories [Hack #47]? Anything that generates a list of links is fair game for this script.
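For instance, here's a sketch of swapping out the Fark scrape for your own front page, using HTML::LinkExtor to gather the links. The http://www.example.com/ URL is a placeholder for your own site; the resulting @links can then play the role of the Fark URLs in purity.pl's outer loop:

 #!/usr/bin/perl -w
 # gather every <a href> from a page as a list of absolute URLs.
 use strict;
 use LWP::Simple;
 use HTML::LinkExtor;

 my $page = "http://www.example.com/"; # your site here.
 my $html = get($page) or die "couldn't fetch $page";

 my @links;
 my $parser = HTML::LinkExtor->new(
     sub {
         my ($tag, %attr) = @_;
         push @links, $attr{href} if $tag eq 'a' and $attr{href};
     },
     $page # base URL; relative links come back absolutized.
 );
 $parser->parse($html);

 print "$_\n" for @links;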

As it stands, the script checks a maximum of 50 URLs per domain. While this makes for a pretty thorough check, it also makes for a long wait, especially if you have a fair number of domains to check. You may decide that checking only 10 URLs per domain is a far better thing to do. In that case, just change this line:

 if ($firstresult > 10) { $firstresult = 10; }
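One caveat: the "happy percentage display" at the bottom of the script assumes exactly 50 checked URLs (it computes $goodcount * 2 and prints only when $total is 50). If you lower the cap, you'll want to generalize that math as well; one way to do it, sketched here as a drop-in replacement for that block:

 # percentage math that works for any number of checked URLs.
 my $total = $goodcount + $badcount;
 if ($total > 0) {
     my $percent = int(($goodcount / $total) * 100);
     print "This URL is $percent% pure!";
 }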

When Tara originally wrote the code, she was a little concerned that it might be used to parse naughty sites and generate lists of naughty URLs for porn peddling. So, she chose not to display the list of naughty URLs unless they were a sufficiently small proportion of the final results (currently, the threshold is no more than 10 of the 50 URLs). You might want to change that, especially if you're using this script to check links from your own site and want an idea of the kind of content you might be linking to. In that case, you'll need to change just one line:

 unless ($badcount >= 50 or $badcount == 0) {

By increasing the count to 50, you'll be informed of all the bad sites associated with the current domain. Just be forewarned: certain domains may return nothing but the naughties, and even the individual words that make up the returned URLs can be downright disturbing.


