Parsing Invalid Markup

Problem

You need to extract data from a document that's supposed to be HTML or XML, but that contains some invalid markup.

Solution

For a quick solution, use Rubyful Soup, written by Leonard Richardson and found in the rubyful_soup gem. It can build a document model even out of invalid XML or HTML, and it offers an idiomatic Ruby interface for searching the document model. It's good for quick screen-scraping tasks or HTML cleanup.

	require 'rubygems'
	require 'rubyful_soup'

	invalid_html = 'A lot of <b class=1>tags are <i class=2>never closed.'
	soup = BeautifulSoup.new(invalid_html)
	puts soup.prettify
	# A lot of
	# <b class="1">tags are
	#  <i class="2">never closed.
	#  </i>
	# </b>

	soup.b.i                                  # => <i class="2">never closed.</i>
	soup.i                                    # => <i class="2">never closed.</i>
	soup.find(nil, :attrs=>{'class' => '2'})  # => <i class="2">never closed.</i>
	soup.find_all('i')                        # => [<i class="2">never closed.</i>]

	soup.b['class']                           # => "1"

	soup.find_text(/closed/)                  # => "never closed."

If you need better performance, do what Rubyful Soup does and write a custom parser on top of the event-based parser SGMLParser (found in the htmltools gem). It works a lot like REXML's StreamListener interface.

Discussion

Sometimes it seems like the authors of markup parsers do their coding atop an ivory tower. Most parsers simply refuse to parse bad markup, but this cuts off an enormous source of interesting data. Most of the pages on the World Wide Web are invalid HTML, so if your application uses other people's web pages as input, you need a forgiving parser. Invalid XML is less common but by no means rare.
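To see what "refuse to parse" looks like in practice, here is a minimal sketch that feeds the same snippet to a strict parser. It assumes the rexml library is available (it ships with standard Ruby installations); the helper name strict_parse_ok? is made up for this illustration.

```ruby
require 'rexml/document'

# Hypothetical helper: returns true if REXML, a strict XML parser,
# accepts the markup, and false if it raises a parse error.
def strict_parse_ok?(markup)
  REXML::Document.new(markup)
  true
rescue REXML::ParseException
  false
end

well_formed = '<p>Every <b>tag</b> is closed.</p>'
invalid     = '<p>An <b>unclosed tag.</p>'

puts "well-formed accepted? #{strict_parse_ok?(well_formed)}"
puts "invalid accepted?     #{strict_parse_ok?(invalid)}"
```

A strict parser gives you nothing to work with once it raises, which is why the rest of this recipe relies on forgiving parsers instead.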

The SGMLParser class in the htmltools gem uses regular expressions to parse an XML-like data stream. When it finds an opening or closing tag, some data, or some other part of an XML-like document, it calls a hook method that you're supposed to define in a subclass. SGMLParser doesn't build a document model or keep track of the document state: it just generates events. If closing tags don't match up or if the markup has other problems, it won't even notice.
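The idea of regex scanning plus hook methods can be sketched with nothing but the standard library. The following is a hypothetical miniature, not the htmltools implementation: the class names TinyEventParser and EventLogger and the hook names handle_start_tag, handle_end_tag, and handle_data are assumptions made for this illustration.

```ruby
require 'strscan'

# Hypothetical miniature of an event-based parser. It only mimics the
# approach described above; it is not SGMLParser's real code or API.
class TinyEventParser
  TAG = /<(\/?)([A-Za-z][\w:-]*)([^>]*)>/

  def feed(markup)
    scanner = StringScanner.new(markup)
    until scanner.eos?
      if scanner.scan(TAG)
        closing, name, raw_attrs = scanner[1], scanner[2].downcase, scanner[3]
        if closing.empty?
          handle_start_tag(name, parse_attrs(raw_attrs))
        else
          handle_end_tag(name)
        end
      else
        # Everything up to the next "<" (or a stray "<" itself) is plain data.
        handle_data(scanner.scan(/[^<]+/) || scanner.getch)
      end
    end
    self
  end

  # Hook methods: subclasses override the ones they care about.
  def handle_start_tag(name, attrs); end
  def handle_end_tag(name); end
  def handle_data(text); end

  private

  # Turns ' class=1 id="x"' into [["class", "1"], ["id", "x"]].
  def parse_attrs(raw)
    pattern = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/
    raw.scan(pattern).map { |key, dq, sq, bare| [key.downcase, dq || sq || bare] }
  end
end

# Records every event in order, without checking that tags match up.
class EventLogger < TinyEventParser
  attr_reader :events

  def initialize
    @events = []
  end

  def handle_start_tag(name, attrs)
    @events << [:start, name, attrs]
  end

  def handle_end_tag(name)
    @events << [:end, name]
  end

  def handle_data(text)
    @events << [:data, text]
  end
end

logger = EventLogger.new
logger.feed('<b class=1>bold</i> text')
logger.events.each { |event| p event }
```

The stray closing I tag is reported as an ordinary end-tag event even though no I tag was ever opened. Noticing the mismatch, or repairing it, is left entirely to the subclass.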

Rubyful Soup's parser classes define SGMLParser hook methods that build a document model out of an ambiguous document. Its BeautifulSoup class is intended for HTML documents: it uses heuristics like a web browser's to figure out what an ambiguous document "really" means. These heuristics are specific to HTML; to parse XML documents, you should use the BeautifulStoneSoup class. You can also subclass BeautifulStoneSoup and implement your own heuristics.
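To make "building a model out of an ambiguous document" concrete, here is a rough standard-library sketch of a forgiving tree builder. The Node struct, the build_tree method, and its two heuristics (a closing tag also closes anything opened inside it; tags still open at the end of input are closed implicitly) are assumptions chosen for illustration. Rubyful Soup's real heuristics are richer and are not reproduced here.

```ruby
# Hypothetical sketch of a forgiving tree builder; not Rubyful Soup's code.
Node = Struct.new(:name, :attrs, :children) do
  # Serializes the node with every tag properly closed.
  def to_s
    attr_text = attrs.map { |key, value| %( #{key}="#{value}") }.join
    "<#{name}#{attr_text}>#{children.join}</#{name}>"
  end
end

# Matches either a tag (opening or closing) or a run of plain text.
TOKEN = /<(\/?)(\w+)([^>]*)>|([^<]+)/

def build_tree(markup)
  root = Node.new('root', {}, [])
  stack = [root] # currently open elements, innermost last
  markup.scan(TOKEN) do |slash, name, raw_attrs, text|
    if text
      stack.last.children << text
    elsif slash.empty?
      attrs = raw_attrs.scan(/(\w+)=["']?([^"'\s]+)["']?/).to_h
      node = Node.new(name.downcase, attrs, [])
      stack.last.children << node
      stack.push(node)
    else
      index = stack.rindex { |element| element.name == name.downcase }
      # Heuristic: a closing tag also closes anything opened inside it.
      # A closing tag with no matching open tag is silently ignored.
      stack.slice!(index..-1) if index && index > 0
    end
  end
  # Heuristic: tags still open at the end of input are closed implicitly,
  # because each node serializes itself with a closing tag.
  root
end

tree = build_tree('A lot of <b class=1>tags are <i class=2>never closed.')
puts tree.children.join
```

Because every node remembers its children, the model can be searched or printed back out as balanced markup, at the cost of holding the whole document in memory.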

Rubyful Soup builds a densely linked model of the entire document, which uses a lot of memory. If you only need to process certain parts of the document, you can implement the SGMLParser hooks yourself and get a faster parser that uses less memory.

Here's an SGMLParser subclass that extracts URLs from a web page. It checks every A tag for an href attribute, and keeps the results in a set. Note the similarity to the LinkGrabber class defined in Recipe 11.13.

	require 'rubygems'
	require 'html/sgml-parser'
	require 'set'

	html = %{<a name="anchor"><a href="http://www.oreilly.com">O'Reilly</a>
	 irrelevant<a href="http://www.ruby-lang.org/">Ruby</a>}

	class LinkGrabber < HTML::SGMLParser
	  attr_reader :urls

	  def initialize
	    @urls = Set.new
	    super
	  end

	  # Called for every opening A tag; attrs is a list of [name, value] pairs.
	  def do_a(attrs)
	    url = attrs.find { |attr| attr[0] == 'href' }
	    @urls << url[1] if url
	  end
	end

	extractor = LinkGrabber.new
	extractor.feed(html)
	extractor.urls
	# => #<Set: {"http://www.oreilly.com", "http://www.ruby-lang.org/"}>

The equivalent Rubyful Soup program is quicker to write and easier to understand, but it runs more slowly and uses more memory:

	require 'rubyful_soup'

	urls = Set.new
	BeautifulStoneSoup.new(html).find_all('a').each do |tag|
	  urls << tag['href'] if tag['href']
	end

You can improve performance by telling Rubyful Soup's parser to ignore everything except A tags and their contents:

	puts BeautifulStoneSoup.new(html, :parse_only_these => 'a')
	# <a name="anchor"></a>
	# <a href="http://www.oreilly.com">O'Reilly</a>
	# <a href="http://www.ruby-lang.org/">Ruby</a>

But the fastest implementation will always be a custom SGMLParser subclass. If your parser is part of a full application (rather than a one-off script), you'll need to find the best tradeoff between performance and code legibility.

See Also

  • Recipe 11.13, "Extracting All the URLs from an HTML Document"
  • The Rubyful Soup documentation (http://www.crummy.com/software/RubyfulSoup/documentation.html)
  • The htree library defines a forgiving HTML/XML parser that can convert a parsed document into a REXML Document object (http://cvs.m17n.org/~akr/htree/)
  • The HTML TIDY library can fix up most invalid HTML so that it can be parsed by a standard parser; it's a C library with Ruby bindings. See http://tidy.sourceforge.net/ for the library, and http://rubyforge.org/projects/tidy for the bindings






Ruby Cookbook (Cookbooks (O'Reilly))
ISBN: 0596523696