Hack 36 Downloading Images from Webshots


Search a large collection of community-contributed images, based on keywords of your choice, and then download the visual findings.

Webshots (http://www.webshots.com/) bills itself as "your world of photos." The Community section (http://www.webshots.com/r/home/community) has thousands of user-contributed photos available for download, and the Gallery (http://www.webshots.com/r/home/gallery) has even more from professional photographers. To access the Gallery, you must be a registered user, but the Community section is prime for automated downloading.

The Code

You'll need the WWW::Mechanize Perl module (see [Hack #21] and [Hack #22]) installed to run this script.

Save the following code to a file called webshots.pl:

    #!/usr/bin/perl -w
    use strict;
    use URI;
    use WWW::Mechanize;
    use Getopt::Long;

    my $max = 10;
    GetOptions(
        "max=i" => \$max,
    );

    my $search = shift or die "Must specify a search term";

    my $w = WWW::Mechanize->new;
    $w->get( "http://www.webshots.com/explore/" );
    $w->success or die "Can't read the search page!\n";

    $w->submit_form(
        form_number => 1,
        fields => { words => $search },
    );
    $w->success or die "Search failed!\n";

    # execution of script stops if warning
    # about adult content is returned.
    if ( $w->content =~ /Adult content/i ) {
        die "Search term probably returns adult content\n";
    }

    my $ndownloads = 0;

    NEXT_PAGE_LOOP: while(1) {
        if ( $w->content =~ /Page (\d+) of (\d+)/ ) {
            warn "On page $1 of $2...\n";
        } else {
            warn "Can't find page count\n";
        }

        # Pull the "Next" link off before we download pictures
        my $nextlink = $w->find_link( text => "Next >" );
        my $currpage = $w->uri;

        my @links = $w->find_all_links( url_regex =>
            qr[http://community\.webshots\.com/photo/] );

        for my $link ( @links ) {
            my $url = $link->url;
            my $text = $link->text;
            next if $text eq "[IMG]";

            $w->get( $url );
            $w->success or die "Couldn't fetch $url";

            if ($w->content=~m[(http://community\.webshots\.com/.+?\.(jpg|gif|png))]) {
                my $imgurl = $1; my $type = $2;

                # Make a name based on the webshots title for the pic
                my $filename = lc $text;        # Lowercase everything
                $filename =~ s/\s+/-/g;         # Spaces become dashes
                $filename =~ s/[^0-9a-z-]+//g;  # Strip all nonalphanumeric
                $filename =~ s/(^-|-$)//g;      # Strip leading/trailing dashes
                $filename = "$filename.$type";

                # Bring down the image if we don't already have it
                if ( -e $filename ) { warn "Already have $filename\n"; }
                else {
                    # use LWP's :content_file to save our
                    # image directly to the filesystem,
                    # instead of processing it ourselves.
                    warn "Saving $filename...\n";
                    $w->get( $imgurl, ":content_file"=>$filename );
                    ++$ndownloads; last if $ndownloads >= $max;
                }
            } else { warn "Couldn't find an image on $url\n"; }
        }

        last unless $nextlink && ($ndownloads<$max);

        my $nexturl = URI->new_abs( $nextlink->url, $currpage )->as_string;
        $w->get( $nexturl ); die "$nexturl failed!\n" unless $w->success;
    }
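The filename cleanup in the script is easy to try on its own, without touching the network. The following is a minimal sketch that lifts those substitutions into a standalone subroutine; the name title_to_filename is hypothetical and does not appear in the script itself.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a photo title and an image type into a filename,
# using the same substitutions as webshots.pl.
sub title_to_filename {
    my ( $title, $type ) = @_;
    my $filename = lc $title;          # Lowercase everything
    $filename =~ s/\s+/-/g;            # Spaces become dashes
    $filename =~ s/[^0-9a-z-]+//g;     # Strip all nonalphanumeric
    $filename =~ s/(^-|-$)//g;         # Strip leading/trailing dashes
    return "$filename.$type";
}

print title_to_filename( "1969 Chevy Camaro Z28", "jpg" ), "\n";
# prints 1969-chevy-camaro-z28.jpg
```

Because the title is reduced this aggressively, two different titles can collapse to the same filename, which is why the script checks for an existing file before saving.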

Running the Hack

Invoke the script on the command line, passing it a search string, like this:

    % perl webshots.pl cars
    Saving bike02.jpg
    Saving escanaba-street-car-bridge.jpg
    Saving resort-community-of-playa-car-mexico.jpg
    Saving 1969-chevy-camaro.jpg
    Saving 1929-ford-roadster.jpg
    ...

If your search string is more than one word, wrap it in double quotes:

    % perl webshots.pl "chevy camaro"
    Already have 1969-chevy-camaro.jpg
    Saving 1969-chevy-camaro-z28.jpg

Note that the webshots script doesn't pull down identically named photos, which saves you time if you're searching for photos with different keywords. By default, the script downloads 10 unique photos and then stops. To download a different number of photos, use the --max switch:

    % perl webshots.pl --max=40 "german shepherd"
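To see how the --max switch and the search term are separated, here is a small sketch that parses an argument list the same way, using Getopt::Long's GetOptionsFromArray so it can run without real command-line arguments. The parse_args subroutine and the usage message are assumptions made for illustration. Note that the option must be bound to a reference (\$max), or the parsed value never reaches the variable.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper: parse an argument list the way webshots.pl does
# and return ( max, search term ).
sub parse_args {
    my @args = @_;
    my $max  = 10;    # default number of photos to download
    GetOptionsFromArray( \@args, "max=i" => \$max )
        or die "Usage: webshots.pl [--max=N] search-term\n";
    my $search = shift @args or die "Must specify a search term\n";
    return ( $max, $search );
}

my ( $max, $search ) = parse_args( "--max=40", "german shepherd" );
print "Would fetch up to $max photos matching '$search'\n";
```

Whatever is left in the list after option parsing is treated as the search term, which is why a multiword search must be quoted so the shell passes it as a single argument.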

Hacking the Hack

There are a few ways you can improve upon this hack.

Starting on a given page

The script will cheerfully skip over photos that it has already downloaded and move on to more pages. If you're downloading from dozens of pages, this might take some time. The script could, instead, modify the page=x parameter of the results URLs to start on a different page number.
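One way to do this is to rewrite the page parameter in a results URL before fetching it. The sketch below assumes only what the text states, namely that the URLs carry a page=x parameter; the url_for_page subroutine and the example.com address are hypothetical stand-ins.

```perl
use strict;
use warnings;

# Hypothetical helper: return $url with its page= parameter set to $page,
# adding the parameter if the URL does not have one yet.
sub url_for_page {
    my ( $url, $page ) = @_;
    # Replace an existing page=N value in place
    return $url if $url =~ s/([?&]page=)\d*/$1$page/;
    # Otherwise append the parameter with the right separator
    my $sep = $url =~ /\?/ ? '&' : '?';
    return "$url${sep}page=$page";
}

print url_for_page( "http://www.example.com/search?words=cars&page=1", 5 ), "\n";
```

In the script, you could add a start option next to max and pass the rewritten URL to $w->get before entering the page loop.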

Downloading from other areas

The script, as shown, only does keyword searching, but there are two other sections that may be of interest: Most Popular and New Photos. Downloading photos from these sections is different from downloading from search results, since they're based on albums of photos, not individual photos. You'll need to make the script follow the link to an album, download the photos from there, and then back out to the list of albums. WWW::Mechanize's back() method will help here.
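The visit-and-return pattern might look something like the following sketch. It takes a WWW::Mechanize object that is already on a page listing albums, a pattern that matches album links, and a callback that does the per-album work. The visit_albums name, the callback, and any album URL pattern you pass in are assumptions; the text does not describe how album URLs are laid out.

```perl
use strict;
use warnings;

# Hypothetical helper: visit each album linked from the current page,
# run $per_album on it, then return to the album list with back().
# $mech is a WWW::Mechanize object; $album_re is a guess at what
# album links look like and must be adapted to the real pages.
sub visit_albums {
    my ( $mech, $album_re, $per_album ) = @_;
    my $visited = 0;
    for my $link ( $mech->find_all_links( url_regex => $album_re ) ) {
        $mech->get( $link->url );
        if ( $mech->success ) {
            $per_album->($mech);    # e.g., download the photos in this album
            $visited++;
        }
        else {
            warn "Couldn't fetch " . $link->url . "\n";
        }
        $mech->back;                # back out to the list of albums
    }
    return $visited;
}
```

The callback would hold the same photo-downloading loop that the main script runs on a page of search results.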

Modifying filenames

Sometimes, different photos will have the same name; the script will therefore see them as the same and not download the "duplicate." Maybe a search on cats returns two different photos called stripes or, even more likely, untitled. If you find this to be the case, you can append an incrementing number to the filename if the file already exists. First, declare a counter alongside $max:

    my $dupe_count = 1;
    my $max = 10;

And then:

    # If we already have a file with this name, pick an unused numbered name
    if ( -e $filename ) {
        warn "Already have $filename\n";
        ( my $stem = $filename ) =~ s/\.\Q$type\E$//;   # name without extension
        while ( -e $filename ) {
            $filename = "$stem-$dupe_count.$type";
            $dupe_count++;
        }
    }
    warn "Saving $filename\n";
    $w->get( $imgurl, ":content_file"=>$filename );
    ++$ndownloads; last if $ndownloads >= $max;

Note that doing this means you may have duplicate photos with different names.

Bypassing the adult content warning

Currently, the script stops if the search returns a page warning about adult content. You can modify the hack to bypass the warning page automatically. Instead of letting the search end, "click" on the "continue searching" link:

    if ( $w->content =~ /adult content/i ) {
        $w->follow_link( text_regex => qr/Continue with webshots search/i );
        $w->success or die "Couldn't follow warning link";
    }

Andy Lester



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157