Hack 73 Scraping TV Listings


Freeing yourself from flipping through a weekly publication by visiting the TV Guide Online web site might sound like a good idea, but being forced to load heavy pages that show only a few hours at a time, along with channels you don't care about, isn't exactly the utopia you were hoping for.

To grab the latest TV listings from TV Guide Online (http://www.tvguide.com), we could write an HTML scraper from scratch using HTML::TableExtract [Hack #67] and similar modules, or we could go Borg on a script called tvlisting and assimilate it into our collective consciousness. Why reinvent the wheel if you don't have to, right? The author of tvlisting, Kurt V. Hindenburg, has extensively reverse-engineered TV Guide Online's dynamic site and created a script that can pull down all the TV listings for a whole day and output them in several different formats, including XML.

Grab tvlisting from http://www.cherrynebula.net/projects/tvlisting/tvlisting.html and follow the terse documentation to get it running on your platform. There are tons of options you can use when running tvlisting, most of which we won't cover for the sake of brevity. So, snoop around in the tvlisting code, as well as the included sample_rc file, and check out the various options available. For our purposes, we'll modify the sample_rc file and use command-line arguments when we call the script. Open the sample_rc file and save it as tvlisting_config; then we'll get started. Let's look at a small portion of our new tvlisting_config file:

    ## To use this script as a
    ## CGI; please read CGI.txt
    ## Choices : $TRUE, $FALSE
    $options{USE_CGI} = $FALSE;

    ## Choices : WGET, LYNX, CURL, LWPUSERAGENT
    $options{GET_METHOD} = qw(LWPUSERAGENT);

    ## Choices : HTML, TEXT, LATEX, XAWTV, XML
    $options{OUTPUT_FORMAT} = qw(XML);

    ## Choices : TVGUIDE
    $options{INPUT_SOURCE} = qw(TVGUIDE);

    ### Attributes dealing with channels.
    ## Should channels be run through the filter?
    ## Choices : $TRUE, $FALSE
    $options{FILTER_CHANNELS} = $TRUE;

    ## Filter by NAME and/or NUMBER?
    $options{FILTER_CHANNELS_BY_NAME} = $FALSE;
    $options{FILTER_CHANNELS_BY_NUMBER} = $TRUE;

    ## List of channels to OUTPUT
    $options{FILTER_CHANNELS_BY_NAME_LIST} =
        ["WTTV", "WISH", "WTHR", "WFYI", "WXIN", "WRTV", "WNDY", "WIPX"];
    $options{FILTER_CHANNELS_BY_NUMBER_LIST} =
        [qw( 2 3 4 5 6 7 9 11 12 14 15 16 18 28 29 30 31 32
            33 34 35 36 37 38 39 49 50 53 55 71 73 74 75 78)];

    ## Your personal Service ID, used by
    ## tvguide.com to localize your listings.
    $options{SERVICE_ID} = 359508;

As you can see, there are many options available (the preceding listing is about half of what you'd see in a normal configuration file). Starting from the top, I set USE_CGI to $FALSE, GET_METHOD to LWPUSERAGENT, and OUTPUT_FORMAT to XML. You may have noticed that you can output to HTML as well, but I'm not crazy about the quality of its HTML output. The FILTER_ options let us choose only the channels we are interested in, rather than weeding through hundreds of useless entries to find what we're looking for. The most important option, SERVICE_ID, is what TV Guide Online uses to determine the stations and channel numbers that are available in your area. Without this option set correctly, you'll receive listings that do not map to the channels on your TV, and that's no fun. The Readme.txt file has further information on how to hunt this ID down.
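To make the FILTER_ options a little more concrete, here is a minimal sketch of what number-based channel filtering amounts to. It is not tvlisting's own code: the filter_by_number function and the sample channel records are invented for illustration, and only the idea of an allow-list in the style of FILTER_CHANNELS_BY_NUMBER_LIST comes from the configuration file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a channel filter; NOT tvlisting's real code.
# Keeps only the channels whose number appears in the allow-list.
sub filter_by_number {
    my ($channels, $allowed) = @_;
    my %keep = map { $_ => 1 } @$allowed;    # hash lookup per channel
    return [ grep { $keep{ $_->{number} } } @$channels ];
}

# Made-up sample records (number and name), for illustration only.
my @channels = (
    { number => 53, name => 'TOON'  },
    { number => 76, name => 'WE'    },
    { number => 77, name => 'OXYGN' },
);

# An allow-list written the same way as FILTER_CHANNELS_BY_NUMBER_LIST.
my @wanted = qw( 2 3 4 53 );

my $kept = filter_by_number(\@channels, \@wanted);
print "$_->{number} $_->{name}\n" for @$kept;
```

Building the lookup hash once keeps the filter cheap even when the provider returns hundreds of channels.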

After configuration, it's simply a matter of running the script to get the current hour's listings for just the channels you're interested in. If you specified TEXT output, you'll see something like this (severely truncated for readability):

    % bin/tvlisting
               6:30 PM             7:00 PM             7:30 PM
               +---------+---------+---------+---------+---------+
    76 WE      Felicity             Hollywood Wives
    77 OXYGN   Can You Tell?        Beautiful

With the XML output format, you get a snippet like the following, which is readily parseable:

    % bin/tvlisting
    <Channel Name="TOON" Number="53">
      <Shows Title="Dexter's Laboratory" Sequence="1" Duration="6" />
      <Shows Title="Ed, Edd n Eddy" Sequence="2" Duration="6" />
      <Shows Title="Courage the Cowardly Dog" Sequence="3" Duration="6" />
      <Shows Title="Pokemon" Sequence="4" Duration="6" />
    </Channel>
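As a rough illustration of "readily parseable," the sketch below pulls channels and shows out of XML shaped like the snippet above. The parse_listing function is a made-up name, and the regular expressions assume the attributes always appear in the order shown here; for anything beyond a quick experiment, a real XML parser such as XML::Parser or XML::Simple would be the safer choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse tvlisting-style XML into a list of channels.
# Regex-based sketch that relies on the attribute order seen in the sample.
sub parse_listing {
    my ($xml) = @_;
    my @channels;
    while ($xml =~ m{<Channel\s+Name="([^"]*)"\s+Number="(\d+)"\s*>(.*?)</Channel>}gs) {
        my ($name, $number, $body) = ($1, $2, $3);
        my @shows;
        while ($body =~ m{<Shows\s+Title="([^"]*)"\s+Sequence="(\d+)"\s+Duration="(\d+)"\s*/>}g) {
            push @shows, { title => $1, sequence => $2, duration => $3 };
        }
        push @channels, { name => $name, number => $number, shows => \@shows };
    }
    return \@channels;
}

# Two entries copied from the sample output above.
my $sample = <<'XML';
<Channel Name="TOON" Number="53">
  <Shows Title="Dexter's Laboratory" Sequence="1" Duration="6" />
  <Shows Title="Ed, Edd n Eddy" Sequence="2" Duration="6" />
</Channel>
XML

# Print each channel followed by its shows in sequence order.
for my $ch (@{ parse_listing($sample) }) {
    print "$ch->{number} $ch->{name}\n";
    print "  $_->{sequence}. $_->{title}\n" for @{ $ch->{shows} };
}
```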

Even though you can filter by channels within tvlisting, there doesn't seem to be a way to filter by type of program, such as all "horror" movies or anything with Mister Miyagi. For that, we'd have to build our own quick scraper.

The Code

Save the following code as tvsearch.pl:

    #!/usr/bin/perl -w
    use strict;
    use Getopt::Long;
    use LWP::Simple;
    use URI::Escape;
    use HTML::TableExtract;
    my %opts;

    # our list of tvguide.com categories (already URL-encoded).
    my @search_categories = ( qw/ action+%26+adventure adult Movie
                                  comedy drama horror mystery+%26+suspense
                                  sci-fi+%26+paranormal western Sports
                                  Newscasts+%26+newsmagazines health+%26+fitness
                                  science+%26+technology education Children%27s
                                  talk+%26+discussion soap+opera
                                  shopping+%26+classifieds music / );

    # instructions for if the user doesn't
    # pass a search term or category. bah.
    sub show_usage {
        print "You need to pass either a search term (--search)\n";
        print "or use one of the category numbers below (--category):\n\n";
        my $i = 1;
        foreach my $cat (@search_categories) {
            # decode a copy for display; leave the encoded list untouched.
            (my $label = $cat) =~ s/\+/ /g;
            $label =~ s/%26/&/; $label =~ s/%27/'/;
            print "  $i) ", ucfirst($label), "\n"; $i++;
        }
        exit;
    }

    # define our command-line flags (long and short versions).
    GetOptions(\%opts, 'search|s=s',      # a search term.
                       'category|c=s',    # a search category.
    ); unless ($opts{search} || $opts{category}) { show_usage(); }

    # create some variables for use at tvguide.com.
    my ($day, $month) = (localtime)[3..4]; $month++;
    my $start_time = "8:00";         # this time is in military format
    my $time_span  = 20;             # number of hours of TV listings you want
    my $start_date = "$month/$day";  # set the current month and day
    my $service_id = 61058;          # our service id (see tvlisting readme)
    my $search_phrase = undef;       # final holder of what was searched for
    my $html_file = undef;           # the downloaded data from tvguide.com
    my $url = 'http://www.tvguide.com/listings/search/SearchResults.asp';

    # search by category.
    if ($opts{category}) {
        my $id = $opts{category}; # convenience.
        die "Search category must be a number!\n" unless $id =~ /^\d+$/;
        die "Category ID was invalid\n"
            unless ($id >= 1 && $id <= @search_categories);
        $html_file = get("$url?l=$service_id&FormCategories=".
                         "$search_categories[$id-1]");
        die "get(  ) did not return as we expected.\n" unless $html_file;
        $search_phrase = $search_categories[$id-1];
    }
    elsif ($opts{search}) {
        my $term = $opts{search}; # convenience.
        # escape the term so spaces and punctuation survive the query string.
        $html_file = get("$url?I=$service_id&FormText=" . uri_escape($term));
        die "get(  ) did not return as we expected.\n" unless $html_file;
        $search_phrase = $term;
    }

    # now begin printing out our matches.
    print "Search Results for '$search_phrase':\n\n";

    # create a new table extract object and pass it the
    # headers of the tvguide.com table in our data.
    my $table_extract =
       HTML::TableExtract->new(
            headers   => ["Date", "Start Time", "Title", "Ch#"],
            keep_html => 1 );
    $table_extract->parse($html_file);

    # now, with our extracted table, parse.
    foreach my $table ($table_extract->table_states) {
        foreach my $cols ($table->rows) {

            # this is not the best way to do this...
            if ($cols->[0] =~ /Sorry your search found no matches/i)
              { print "No matches found for your search!\n"; exit; }

            # get the date.
            my $date = $cols->[0];
            $date =~ s/<.*>//g; $date =~ s/\s*//g;
            $date =~ /(\w*)\D(\d*)/; $date = "$1/$2";

            # get the time.
            my $time = $cols->[1];
            $time =~ m/(\d*:\d*\s+\w+)/;
            $time = $1;

            # get the title, detail_url, detail_number, and station.
            $cols->[2] =~ /href="(.*\('\d*','(\d*)','\d*','\d*','(.*)',.*)"/i;
            my ($detail_url, $detail_num, $channel) = ($1, $2, $3);
            my $title = $cols->[2]; $title =~ s/<.*>//g;
            $title =~ /(\b(.*)\b)/; $title = $1;

            # get channel number
            my $channel_num = $cols->[3];
            $channel_num =~ m/>\s*(\d*)\s*</;
            $channel_num = $1;

            # turn the evil Javascript URL into a normal one.
            $detail_url =~ /javascript:cu\('(\d+)','(\d+)'/;
            my $iSvcId = $1; my $iTitleId = $2;
            $detail_url = "http://www.tvguide.com/listings/".
                          "closerlook.asp?I=$iSvcId&Q=$iTitleId";

            # now, print the results.
            print " $date at $time on chan$channel_num ($channel): $title\n";
            print "    $detail_url\n\n";
        }
    }
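The last step of the script, rewriting the JavaScript link into a plain URL, is easy to try offline. The sketch below isolates it in a hypothetical detail_url_from_href function; the sample href is invented to match the regular expressions used in the script and is not captured from the site.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a javascript:cu('svc','title',...) href
# into a closerlook.asp URL. Returns undef if the href doesn't match.
sub detail_url_from_href {
    my ($href) = @_;
    my ($svc_id, $title_id) = $href =~ /javascript:cu\('(\d+)','(\d+)'/
        or return;
    return "http://www.tvguide.com/listings/"
         . "closerlook.asp?I=$svc_id&Q=$title_id";
}

# Invented sample link, shaped to fit the patterns in tvsearch.pl.
my $href = "javascript:cu('61058','3508575','0','0','SCI-FI','x')";
my $plain = detail_url_from_href($href);
print defined $plain ? "$plain\n" : "unrecognized link\n";
```

Checking whether the match succeeded, rather than reading $1 and $2 unconditionally, avoids printing a stale or empty URL when the page layout changes.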

Running the Hack

A search for Farscape looks something like this:

    % perl tvsearch.pl --search farscape
    Search Results for 'farscape':

     Mon/28 at 12:00 AM on chan62 (SCI-FI): Farscape: What Was Lost: Sacrifice
        http://www.tvguide.com/listings/closerlook.asp?I=61058&Q=3508575

     Mon/4 at 12:00 AM on chan62 (SCI-FI): Farscape: What Was Lost: Resurrection
        http://www.tvguide.com/listings/closerlook.asp?I=61058&Q=3508576

William Eastler



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
