Recipe 20.19 Extracting Table Data

20.19.1 Problem

You have data in an HTML table, and you would like to turn that into a Perl data structure. For example, you want to monitor changes to an author's CPAN module list.

20.19.2 Solution

Use the HTML::TableContentParser module from CPAN:

use HTML::TableContentParser;

$tcp = HTML::TableContentParser->new;
$tables = $tcp->parse($HTML);

foreach $table (@$tables) {
  @headers = map { $_->{data} } @{ $table->{headers} };
  # attributes of table tag available as keys in hash
  $table_width = $table->{width};

  foreach $row (@{ $table->{rows} }) {
    # attributes of tr tag available as keys in hash
    foreach $col (@{ $row->{cells} }) {
      # attributes of td tag available as keys in hash
      $data = $col->{data};
    }
  }
}

20.19.3 Discussion

The HTML::TableContentParser module converts all tables in the HTML document into a Perl data structure. As with HTML tables, there are three layers of nesting in the data structure: the table, the row, and the data in that row.

Each table, row, and data tag is represented as a hash reference. The hash keys correspond to attributes of the tag that defined that table, row, or cell. In addition, the value for a special key gives the contents of the table, row, or cell. In a table, the value for the rows key is a reference to an array of rows. In a row, the cells key points to an array of cells. In a cell, the data key holds the HTML contents of the data tag.

For example, take the following table:

<table width="100%" bgcolor="#ffffff">
  <tr>
    <td>Larry &amp; Gloria</td>
    <td>Mountain View</td>
    <td>California</td>
  </tr>
  <tr>
    <td><b>Tom</b></td>
    <td>Boulder</td>
    <td>Colorado</td>
  </tr>
  <tr>
    <td>Nathan &amp; Jenine</td>
    <td>Fort Collins</td>
    <td>Colorado</td>
  </tr>
</table>

The parse method returns this data structure:

[
  {
    'width'   => '100%',
    'bgcolor' => '#ffffff',
    'rows'    => [
      {
        'cells' => [
          { 'data' => 'Larry &amp; Gloria' },
          { 'data' => 'Mountain View' },
          { 'data' => 'California' },
        ],
        'data' => "\n      "
      },
      {
        'cells' => [
          { 'data' => '<b>Tom</b>' },
          { 'data' => 'Boulder' },
          { 'data' => 'Colorado' },
        ],
        'data' => "\n      "
      },
      {
        'cells' => [
          { 'data' => 'Nathan &amp; Jenine' },
          { 'data' => 'Fort Collins' },
          { 'data' => 'Colorado' },
        ],
        'data' => "\n      "
      }
    ]
  }
]
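
To make the shape of that structure concrete, here is a minimal sketch that walks a hand-built copy of it and flattens each row into a list of cell strings. The helper name rows_as_lists is hypothetical and not part of HTML::TableContentParser, and the literal structure only stands in for what parse would return for the first two rows.

```perl
use strict;
use warnings;

# Hand-built stand-in for the structure shown above (first two rows only).
my $tables = [
  {
    width => '100%',
    rows  => [
      { cells => [ { data => 'Larry &amp; Gloria' },
                   { data => 'Mountain View' },
                   { data => 'California' } ] },
      { cells => [ { data => '<b>Tom</b>' },
                   { data => 'Boulder' },
                   { data => 'Colorado' } ] },
    ],
  },
];

# rows_as_lists is a hypothetical helper: it returns one array reference
# per row, each holding the raw HTML contents of that row's cells.
sub rows_as_lists {
  my ($table) = @_;
  return map { [ map { $_->{data} } @{ $_->{cells} || [] } ] }
         @{ $table->{rows} || [] };
}

# Print each row as one tab-separated line.
foreach my $row (rows_as_lists($tables->[0])) {
  print join("\t", @$row), "\n";
}
```

The cell strings come back exactly as they appear in the HTML, so tags and entities are still present at this point.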

The data values still contain HTML tags and entities. If you don't want them, remove them by hand using techniques from Recipe 20.6.
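
As a rough illustration of that cleanup, the sketch below strips tags with a deliberately naive regular expression and decodes entities with decode_entities from HTML::Entities. The helper name cell_text is hypothetical, and this is not necessarily the approach Recipe 20.6 takes; it is only adequate for simple cells like the ones in the sample table.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# cell_text is a hypothetical helper, not part of HTML::TableContentParser.
# The tag-stripping regex is naive: it will misfire on HTML comments or on
# attribute values that contain ">".
sub cell_text {
  my ($html) = @_;
  (my $text = $html) =~ s/<[^>]*>//g;   # drop anything that looks like a tag
  return decode_entities($text);        # turn &amp; and friends into characters
}

print cell_text('<b>Tom</b>'), "\n";
print cell_text('Larry &amp; Gloria'), "\n";
```

Stripping tags before decoding entities matters: decoding first could turn an escaped &lt; into a literal < that the regex would then mistake for the start of a tag.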

Example 20-11 fetches a particular CPAN author's page and displays in plain text the modules they own. You could use this as part of a system that notifies you when your favorite CPAN authors do something new.

Example 20-11. Dump modules for a particular CPAN author
  #!/usr/bin/perl -w
  # dump-cpan-modules-for-author - display modules a CPAN author owns
  use LWP::Simple;
  use URI;
  use HTML::TableContentParser;
  use HTML::Entities;
  use strict;

  our $URL = shift || 'http://search.cpan.org/author/TOMC/';
  my $tables = get_tables($URL);
  my $modules = $tables->[4];    # 5th table holds module data

  foreach my $r (@{ $modules->{rows} }) {
    my ($module_name, $module_link, $status, $description) =
        parse_module_row($r, $URL);
    print "$module_name <$module_link>\n\t$status\n\t$description\n\n";
  }

  sub get_tables {
    my $URL = shift;
    my $page = get($URL);
    my $tcp = HTML::TableContentParser->new;
    return $tcp->parse($page);
  }

  sub parse_module_row {
    my ($row, $URL) = @_;
    my ($module_html, $module_link, $module_name, $status, $description);

    # extract cells
    $module_html = $row->{cells}[0]{data};  # link and name in HTML
    $status      = $row->{cells}[1]{data};  # status string and link
    $description = $row->{cells}[2]{data};  # description only
    $status =~ s{<.*?>}{  }g; # naive link removal, works on this simple HTML

    # separate module link and name from html
    ($module_link, $module_name) = $module_html =~ m{href="(.*?)".*?>(.*)<}i;
    $module_link = URI->new_abs($module_link, $URL); # resolve relative links

    # clean up entities and tags
    decode_entities($module_name);
    decode_entities($description);

    return ($module_name, $module_link, $status, $description);
  }
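
For the notification idea, one possible approach is to keep the list of module names from the previous run and compare it with the current one. The sketch below shows only that comparison step; the helper name diff_modules and the module names are made up for illustration, and how the previous list is stored between runs is left open.

```perl
use strict;
use warnings;

# diff_modules is a hypothetical helper: given the module names seen on the
# previous run and the names seen now, it returns two array references,
# the names that appeared and the names that vanished.
sub diff_modules {
  my ($old, $new) = @_;
  my %was = map { $_ => 1 } @$old;
  my %is  = map { $_ => 1 } @$new;
  my @added   = sort { $a cmp $b } grep { !$was{$_} } keys %is;
  my @removed = sort { $a cmp $b } grep { !$is{$_} }  keys %was;
  return (\@added, \@removed);
}

# Made-up module names, for illustration only.
my @previous = qw(Foo::Bar Foo::Baz);
my @current  = qw(Foo::Bar Foo::Quux);

my ($added, $removed) = diff_modules(\@previous, \@current);
print "added: @$added\n";
print "removed: @$removed\n";
```

In a real monitor the current list would be the $module_name values collected in the loop of Example 20-11, and a message would be sent only when either list is non-empty.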

20.19.4 See Also

The documentation for the CPAN module HTML::TableContentParser; http://search.cpan.org



Perl Cookbook
Perl Cookbook, Second Edition
ISBN: 0596003137
Year: 2003
Pages: 501
