Parsing Invalid Markup

Problem

You need to extract data from a document that's supposed to be HTML or XML, but that contains some invalid markup.

Solution

For a quick solution, use Rubyful Soup, written by Leonard Richardson and found in the rubyful_soup gem. It can build a document model even out of invalid XML or HTML, and it offers an idiomatic Ruby interface for searching the document model. It's good for quick screen-scraping tasks or HTML cleanup.


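	require 'rubygems'
	require 'rubyful_soup'

	invalid_html = 'A lot of <b class=1>tags are <i class=2>never closed.'
	soup = BeautifulSoup.new(invalid_html)
	puts soup.prettify
	# A lot of
	# <b class="1">tags are
	#  <i class="2">never closed.
	#  </i>
	# </b>

	soup.b.i                                 # => <i class="2">never closed.</i>
	soup.i                                   # => <i class="2">never closed.</i>
	soup.find(nil, :attrs=>{'class' => '2'}) # => <i class="2">never closed.</i>
	soup.find_all('i')                       # => [<i class="2">never closed.</i>]

	soup.b['class']                          # => "1"

	soup.find_text(/closed/)                 # => "never closed."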

If you need better performance, do what Rubyful Soup does and write a custom parser on top of the event-based parser SGMLParser (found in the htmltools gem). It works a lot like REXML's StreamListener interface.
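
For comparison, here is a minimal sketch of the REXML StreamListener style of parsing. The EventRecorder class and the sample markup are made up for illustration. Note that REXML, unlike SGMLParser, expects well-formed input, so the sample document is valid XML.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: records the events REXML fires while it
# streams through a document.
class EventRecorder
  include REXML::StreamListener
  attr_reader :events

  def initialize
    @events = []
  end

  # Called for each opening tag; attrs is a hash of attribute names to values.
  def tag_start(name, attrs)
    @events << [:start, name, attrs]
  end

  # Called for each closing tag.
  def tag_end(name)
    @events << [:end, name]
  end

  # Called for each run of character data.
  def text(data)
    @events << [:text, data]
  end
end

recorder = EventRecorder.new
REXML::Document.parse_stream('<p>Some <b class="1">bold</b> text</p>', recorder)
recorder.events.each { |event| p event }
```

The listener never sees a document tree: it only receives a sequence of method calls, and it is up to you to keep whatever state you need.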

Discussion

Sometimes it seems like the authors of markup parsers do their coding atop an ivory tower. Most parsers simply refuse to parse bad markup, but this cuts off an enormous source of interesting data. Most of the pages on the World Wide Web are invalid HTML, so if your application uses other people's web pages as input, you need a forgiving parser. Invalid XML is less common but by no means rare.

The SGMLParser class in the htmltools gem uses regular expressions to parse an XML-like data stream. When it finds an opening or closing tag, some data, or some other part of an XML-like document, it calls a hook method that you're supposed to define in a subclass. SGMLParser doesn't build a document model or keep track of the document state: it just generates events. If closing tags don't match up or if the markup has other problems, it won't even notice.
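
To make the idea concrete, here is a rough sketch of a regular-expression-driven event parser that uses only the standard library. It is not the htmltools implementation: TinyEventParser, its hook names (handle_starttag, handle_endtag, handle_data), and EventLogger are hypothetical names invented for this illustration, and the regular expression ignores comments, entities, and many other details a real parser handles.

```ruby
# Hypothetical illustration only; not the htmltools SGMLParser code.
class TinyEventParser
  # Matches a closing tag, an opening tag (with its raw attribute text),
  # or a run of character data.
  TOKEN = %r{</(\w+)\s*>|<(\w+)([^>]*)>|([^<]+)}

  # Scans the markup and fires one hook per token. No state is kept,
  # so mismatched or missing closing tags go unnoticed.
  def feed(markup)
    markup.scan(TOKEN) do |close_name, open_name, raw_attrs, text|
      if close_name
        handle_endtag(close_name.downcase)
      elsif open_name
        handle_starttag(open_name.downcase, parse_attrs(raw_attrs))
      elsif text
        handle_data(text)
      end
    end
    self
  end

  # Hook methods: subclasses override the ones they care about.
  def handle_starttag(name, attrs); end
  def handle_endtag(name); end
  def handle_data(text); end

  private

  # Turns ' class=1 id="x"' into [["class", "1"], ["id", "x"]].
  def parse_attrs(raw)
    pattern = /([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/
    raw.scan(pattern).map do |name, double_quoted, single_quoted, bare|
      [name.downcase, double_quoted || single_quoted || bare]
    end
  end
end

# A subclass that simply records every event it receives.
class EventLogger < TinyEventParser
  attr_reader :events

  def initialize
    @events = []
  end

  def handle_starttag(name, attrs)
    @events << [:start, name, attrs]
  end

  def handle_endtag(name)
    @events << [:end, name]
  end

  def handle_data(text)
    @events << [:data, text]
  end
end

logger = EventLogger.new
logger.feed('A lot of <b class=1>tags are <i class=2>never closed.')
logger.events.each { |event| p event }
```

Fed the invalid document from the Solution, the logger sees two start-tag events and three data events, but no end-tag events at all. The parser reports what it finds and leaves the interpretation to the subclass.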

Rubyful Soup's parser classes define SGMLParser hook methods that build a document model out of an ambiguous document. Its BeautifulSoup class is intended for HTML documents: it uses heuristics like a web browser's to figure out what an ambiguous document "really" means. These heuristics are specific to HTML; to parse XML documents, you should use the BeautifulStoneSoup class. You can also subclass BeautifulStoneSoup and implement your own heuristics.
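
As a rough idea of what such a heuristic can look like, the sketch below builds a tree from tag soup using two simple rules: a closing tag closes the nearest open element with that name, and anything still open at the end of the input is closed there. This is an illustration under those assumptions, not Rubyful Soup's actual algorithm. Node, build_tree, and to_markup are hypothetical names, and attributes are discarded to keep the example short.

```ruby
# Hypothetical sketch; Rubyful Soup's real heuristics are more elaborate.
Node = Struct.new(:name, :children)

def build_tree(markup)
  root  = Node.new('[document]', [])
  stack = [root] # currently open elements, innermost last

  markup.scan(%r{</(\w+)\s*>|<(\w+)[^>]*>|([^<]+)}) do |close_name, open_name, text|
    if open_name
      node = Node.new(open_name.downcase, [])
      stack.last.children << node
      stack.push(node)
    elsif close_name
      # Rule 1: a closing tag closes the nearest open element of that name,
      # implicitly closing everything opened inside it. A closing tag with
      # no matching open element is ignored.
      index = stack.rindex { |open| open.name == close_name.downcase }
      stack.slice!(index..-1) if index && index > 0
    elsif text
      stack.last.children << text
    end
  end

  # Rule 2: anything still on the stack is treated as closed at end of input.
  root
end

# Serializes the tree back to markup, with every element properly closed.
def to_markup(node)
  return node if node.is_a?(String)
  inner = node.children.map { |child| to_markup(child) }.join
  node.name == '[document]' ? inner : "<#{node.name}>#{inner}</#{node.name}>"
end

tree = build_tree('A lot of <b>tags are <i>never closed.')
puts to_markup(tree)
# A lot of <b>tags are <i>never closed.</i></b>
```

A different document type might call for different rules (for instance, HTML paragraphs that cannot nest), which is why the heuristics live in subclasses rather than in the event parser itself.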

Rubyful Soup builds a densely linked model of the entire document, which uses a lot of memory. If you only need to process certain parts of the document, you can implement the SGMLParser hooks yourself and get a faster parser that uses less memory.

Here's an SGMLParser subclass that extracts URLs from a web page. It checks every A tag for an href attribute, and keeps the results in a set. Note the similarity to the LinkGrabber class defined in Recipe 11.13.

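	require 'rubygems'
	require 'html/sgml-parser'
	require 'set'

	html = %{<a name="anchor"><a href="">O'Reilly</a>
	 irrelevant<a href="">Ruby</a>}

	class LinkGrabber < HTML::SGMLParser
	  attr_reader :urls

	  def initialize
	    @urls = Set.new
	    super # let SGMLParser set up its own state
	  end

	  # Called for every opening A tag; attrs is a list of [name, value] pairs.
	  def do_a(attrs)
	    url = attrs.find { |attr| attr[0] == 'href' }
	    @urls << url[1] if url
	  end
	end

	extractor = LinkGrabber.new
	extractor.feed(html)
	extractor.urls
	# => #<Set: {...}>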

The equivalent Rubyful Soup program is quicker to write and easier to understand, but it runs more slowly and uses more memory:


	urls = Set.new
	BeautifulStoneSoup.new(html).find_all('a').each do |tag|
	  urls << tag['href'] if tag['href']
	end

You can improve performance by telling Rubyful Soup's parser to ignore everything except A tags and their contents:

	puts BeautifulStoneSoup.new(html, :parse_only_these => 'a')
	# <a name="anchor"></a>
	# <a href="">O'Reilly</a>
	# <a href="">Ruby</a>

But the fastest implementation will always be a custom SGMLParser subclass. If your parser is part of a full application (rather than a one-off script), you'll need to find the best tradeoff between performance and code legibility.

See Also

  • Recipe 11.13, "Extracting All the URLs from an HTML Document"
  • The Rubyful Soup documentation
  • The htree library defines a forgiving HTML/XML parser that can convert a parsed document into a REXML Document object
  • The HTML TIDY library can fix up most invalid HTML so that it can be parsed by a standard parser; it's a C library with Ruby bindings

Ruby Cookbook (Cookbooks (O'Reilly))
ISBN: 0596523696