Hack 3 Anatomy of an HTML Page


Getting the knack of scraping takes more than just code; it takes knowing HTML and other kinds of web page files.

If you're new to spidering, figuring out what to scrape and why is not easy. Relative scraping newbies tend to grab too much information, too little, or information that's too likely to change. If you know how HTML files are structured, however, you'll find it easier to scrape them and zero in on the information you need.

HTML files are just text files with special formatting, and that's just the kind of file you'll spend most of your time scraping, both in this book and in your own spidering adventures. While we'll also be spidering and grabbing multimedia files (images, movies, and audio files), we won't be scraping and parsing them to glean embedded information.

Anatomy of an HTML Page

That's not to say, however, that there aren't about as many ways to format an HTML page as there are pages on the Web. To understand how your spider might be able to find patterns of information on an HTML page, you'll need to start with the basics, the very basics, of how an HTML web page looks, and then get into how the information within the body can be organized.

The core of an HTML page looks like this:

<html>
<head>
  <title>
    Title of the page
  </title>
</head>
<body>
  Body of the page
</body>
</html>

That's it. 99% of the HTML pages on the Web start out like this. They can get a lot more elaborate but, in the end, this is the core. What does this mean to your spider? It means that there's only one piece of information that's clearly marked by tags, and that's the page title. If all you need is the title, you're in gravy.
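Because the title is the one piece of information that's reliably tagged, pulling it out takes only a single pattern. The book's examples are in Perl, but the same regular expression idea carries over; here's a minimal sketch in Python:

```python
import re

html = """<html><head><title>
    Title of the page
</title></head><body>Body of the page</body></html>"""

# Grab whatever sits between <title> and </title>, spanning line
# breaks (DOTALL) and ignoring the tags' case, then trim whitespace.
match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
title = match.group(1).strip() if match else None
print(title)  # Title of the page
```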

But if you need information from the body of a page, say, a headline or a date, you have some detective work ahead of you. Many times, the body of a page has several tables, JavaScript, and other code that obscures what you're truly looking for, all annoyances that have much more to do with formatting information than truly organizing it. But, at the same time, the HTML language contains several standards for organizing data. Some of these standards make information stand out on the page as a heading; some organize information into lists within the body. If you understand how these standards work, you'll find it easier to pluck the information you want from the heavily coded confines of a web page body.

Header Information with the H Tags

Important information on a page (headlines, subheads, notices, and so forth) is usually marked with an <Hx> tag, where x is a number from 1 to 6. An <H1> tag is normally displayed as the largest, as it is highest in the headline hierarchy.

Depending on how the site is using them, you can sometimes get a good summary of information from a site just by scraping the H tags. For example, if you're scraping a news site and you know they always put headlines in <H2> tags and subheads in <H4> tags, you can scrape for that specific markup and get brief story extractions, without having to figure out the rest of the story's coding. In fact, if you know a site always does that, you can scrape the entire site just for those tags, without having to look at the rest of the site's page structure at all.
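That headline-and-subhead scrape can be sketched in a few lines. This is a minimal Python illustration (the headlines and site layout here are invented sample data, not from any real site):

```python
import re

# Hypothetical news-site markup: headlines in <h2>, subheads in <h4>.
page = """
<h2>Mars Rover Lands Safely</h2>
<p>story text we don't care about...</p>
<h4>Mission controllers celebrate</h4>
<h2>Markets Rally</h2>
<h4>Tech stocks lead the way</h4>
"""

# Pull every headline and subhead, in document order, ignoring
# the rest of the page's structure entirely.
headlines = re.findall(r"<h2>(.*?)</h2>", page, re.IGNORECASE | re.DOTALL)
subheads = re.findall(r"<h4>(.*?)</h4>", page, re.IGNORECASE | re.DOTALL)
print(headlines)  # ['Mars Rover Lands Safely', 'Markets Rally']
```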

List Information with Special HTML Tags

Not all web wranglers use specific HTML tags to organize lists; some of them just start new numbered paragraphs. But, for the more meticulous page-builder, there are specific tags for lists.

Ordered lists (lists of information that are automatically numbered) are bounded with <ol> and </ol> tags, and each item within is bounded by <li> and </li> tags. If you're using regular expressions to scrape for information, you can grab everything between <ol> and </ol>, parse each <li></li> element into an array, and go from there. Here's an ordered list:

<ol>
 <li>eggs</li>
 <li>milk</li>
 <li>butter</li>
 <li>sugar</li>
</ol>

Unordered lists are just like ordered lists, except that they appear in the browser with bullets instead of numbers, and the list is bounded with <ul></ul> instead of <ol></ol>.
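The grab-the-list-then-split-the-items approach described above looks like this in practice; a minimal Python sketch using the grocery list from the example:

```python
import re

html = """<ol>
 <li>eggs</li>
 <li>milk</li>
 <li>butter</li>
 <li>sugar</li>
</ol>"""

# First isolate everything between <ol> and </ol>...
list_match = re.search(r"<ol>(.*?)</ol>", html, re.DOTALL)

# ...then parse each <li></li> element into an array.
items = re.findall(r"<li>(.*?)</li>", list_match.group(1), re.DOTALL)
print(items)  # ['eggs', 'milk', 'butter', 'sugar']
```

Swapping `ol` for `ul` in the first pattern handles unordered lists the same way.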

Non-HTML Files

Some non-HTML files are just as nebulous as HTML files, while some are far better defined. Plain .txt files, for example (and there are plenty of them available on the Web), have no formatting at all, not even something as basic as "this is the title and this is the body." On the other hand, text files are sometimes easier to parse, because they have no HTML code soup to wade through.

At the other extreme are XML (Extensible Markup Language) files. XML's parts are defined more rigidly than HTML's. RSS, a syndication format and a simple form of XML, has clearly defined parts in its files for titles, content, links, and additional information. We often work with RSS files in this book; the precisely defined parts are easy to parse and write using Perl. See "Using XML::RSS to Repurpose Everything" [Hack #94].

The first thing you'll need to do when you decide you want to scrape something is determine what kind of file it is. If it's a plain .txt file, you won't be able to pinpoint your scraping. If it's an XML file, you'll be able to zoom in on what you want with regular expressions, or use any number of Perl XML modules (such as XML::Simple, XML::RSS, or XML::LibXML).
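To see why XML's clearly defined parts make scraping so painless, here's a sketch that pulls every item title out of a stripped-down, hypothetical RSS feed. The book does this with Perl modules like XML::RSS; this illustration uses Python's standard-library XML parser instead:

```python
import xml.etree.ElementTree as ET

# A minimal, made-up RSS 2.0 document for illustration.
rss = """<rss version="2.0">
  <channel>
    <title>Example Feed</title>
    <item><title>First story</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

root = ET.fromstring(rss)
# Every <item> has the same clearly defined parts, so there's
# no regex guesswork: just walk the items and read each <title>.
titles = [item.findtext("title") for item in root.iter("item")]
print(titles)  # ['First story', 'Second story']
```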

XHTML: An XML/HTML Hybrid

From our previous examples, you can see that, while there's plenty of formatting code in HTML, organization code is less apparent and far less common on the average web site. But a standard called XHTML (Extensible Hypertext Markup Language) is on the horizon. The idea is that XHTML will eventually replace HTML. Its coding rules are stricter than HTML's, and the code it produces is cleaner.
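A couple of those stricter rules are easy to see side by side. XHTML requires lowercase tag names, quoted attribute values, and a close for every element, even empty ones like <br> and <img> (the snippet below is an illustrative fragment, not from the book):

```html
<!-- Browsers tolerate this sort of HTML: -->
<P>A paragraph<BR>
<IMG SRC=photo.jpg>

<!-- XHTML requires this instead: -->
<p>A paragraph<br />
<img src="photo.jpg" alt="A photo" /></p>
```

That strictness is good news for scrapers: the more predictable the markup, the more reliable your patterns.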




Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
