Parsing Invalid Markup | XML and HTML

Problem

You need to extract data from a document that's supposed to be HTML or XML, but that contains some invalid markup.

Solution

For a quick solution, use Rubyful Soup, written by Leonard Richardson and found in the rubyful_soup gem. It can build a document model even out of invalid XML or HTML, and it offers an idiomatic Ruby interface for searching the document model. It's good for quick screen-scraping tasks or HTML cleanup.

	require 'rubygems'
	require 'rubyful_soup'

	invalid_html = 'A lot of <b class=1>tags are <i class=2>never closed.'
	soup = BeautifulSoup.new(invalid_html)
	puts soup.prettify
	# A lot of
	# <b class="1">tags are
	#  <i class="2">never closed.
	#  </i>
	# </b>

	soup.b.i                                 # => <i class="2">never closed.</i>
	soup.i                                   # => <i class="2">never closed.</i>
	soup.find(nil, :attrs=>{'class' => '2'}) # => <i class="2">never closed.</i>
	soup.find_all('i')                       # => [<i class="2">never closed.</i>]

	soup.b['class']                          # => "1"

	soup.find_text(/closed/)                 # => "never closed."

If you need better performance, do what Rubyful Soup does and write a custom parser on top of the event-based parser SGMLParser (found in the htmltools gem). It works a lot like REXML's StreamListener interface.

Discussion

Sometimes it seems like the authors of markup parsers do their coding atop an ivory tower. Most parsers simply refuse to parse bad markup, but this cuts off an enormous source of interesting data. Most of the pages on the World Wide Web are invalid HTML, so if your application uses other people's web pages as input, you need a forgiving parser. Invalid XML is less common but by no means rare.

The SGMLParser class in the htmltools gem uses regular expressions to parse an XML-like data stream. When it finds an opening or closing tag, some data, or some other part of an XML-like document, it calls a hook method that you're supposed to define in a subclass. SGMLParser doesn't build a document model or keep track of the document state: it just generates events. If closing tags don't match up or if the markup has other problems, it won't even notice.
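To make the event-driven style concrete, here is a minimal sketch that uses only the Ruby standard library. It is not HTML::SGMLParser: the class names TinyEventParser and EventLogger and the hook names start_tag, end_tag, and handle_data are hypothetical, chosen only for illustration. The parser scans the input with a regular expression and calls a hook for each piece it finds, without ever checking that the tags balance.

```ruby
# A toy event-based parser. NOT HTML::SGMLParser: every name here is
# hypothetical and exists only to illustrate the hook-method style.
class TinyEventParser
  # Matches a closing tag, an opening tag, or a run of text.
  TOKEN = /<\/(\w+)\s*>|<(\w+)([^>]*)>|([^<]+)/
  # Matches name="value", name='value', or name=value.
  ATTR  = /(\w+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/

  def feed(markup)
    markup.scan(TOKEN) do |close_name, open_name, raw_attrs, text|
      if close_name
        end_tag(close_name.downcase)
      elsif open_name
        start_tag(open_name.downcase, parse_attrs(raw_attrs))
      else
        handle_data(text)
      end
    end
    self
  end

  # Hook methods: a subclass overrides the ones it cares about.
  def start_tag(name, attrs); end
  def end_tag(name); end
  def handle_data(text); end

  private

  # Turns ' class=1 id="x"' into [["class", "1"], ["id", "x"]].
  def parse_attrs(raw)
    raw.scan(ATTR).map { |name, dq, sq, bare| [name.downcase, dq || sq || bare] }
  end
end

# Records every event in order. It never checks that tags match up.
class EventLogger < TinyEventParser
  attr_reader :events

  def initialize
    @events = []
  end

  def start_tag(name, attrs)
    @events << [:start, name, attrs]
  end

  def end_tag(name)
    @events << [:end, name]
  end

  def handle_data(text)
    @events << [:data, text]
  end
end

logger = EventLogger.new
# The <i> tag is never closed, and </b> arrives while <i> is still open.
logger.feed('A lot of <b class=1>tags are <i class=2>never closed.</b>')
logger.events.each { |event| p event }
```

The logger reports the mismatched closing tag as just another event. Deciding what that means is left entirely to the subclass, which is the division of labor the paragraph above describes.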

Rubyful Soup's parser classes define SGMLParser hook methods that build a document model out of an ambiguous document. Its BeautifulSoup class is intended for HTML documents: it uses heuristics like a web browser's to figure out what an ambiguous document "really" means. These heuristics are specific to HTML; to parse XML documents, you should use the BeautifulStoneSoup class. You can also subclass BeautifulStoneSoup and implement your own heuristics.
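As a rough illustration of building a document model from ambiguous markup, the following standard-library sketch keeps a stack of open elements and applies a single heuristic: a closing tag implicitly closes everything nested inside its match, and anything still open at the end of input is treated as closed. Node and ForgivingTreeBuilder are hypothetical names, and this one rule is an assumption made for the example; it is far simpler than Rubyful Soup's actual heuristics.

```ruby
# Toy model builder, stdlib only. Node and ForgivingTreeBuilder are
# hypothetical; the real Rubyful Soup heuristics are more elaborate.
Node = Struct.new(:name, :attrs, :children) do
  # Renders the node and its descendants as well-formed markup.
  def to_s
    attr_text = attrs.map { |key, value| %( #{key}="#{value}") }.join
    "<#{name}#{attr_text}>#{children.join}</#{name}>"
  end
end

class ForgivingTreeBuilder
  TOKEN = /<\/(\w+)\s*>|<(\w+)([^>]*)>|([^<]+)/
  ATTR  = /(\w+)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))/

  def self.parse(markup)
    root  = Node.new('root', [], [])
    stack = [root] # currently open elements, innermost last

    markup.scan(TOKEN) do |close_name, open_name, raw_attrs, text|
      if open_name
        attrs = raw_attrs.scan(ATTR).map { |n, dq, sq, bare| [n.downcase, dq || sq || bare] }
        node  = Node.new(open_name.downcase, attrs, [])
        stack.last.children << node
        stack.push(node)
      elsif close_name
        # Pop back to the matching open tag, implicitly closing anything
        # nested inside it. A closing tag with no match is ignored.
        index = stack.rindex { |el| el.name == close_name.downcase }
        stack.slice!(index..-1) if index && index > 0
      else
        stack.last.children << text
      end
    end

    root # elements still on the stack are treated as closed here
  end
end

tree = ForgivingTreeBuilder.parse('A lot of <b class=1>tags are <i class=2>never closed.')
puts tree.children.join
# Prints: A lot of <b class="1">tags are <i class="2">never closed.</i></b>
```

A real HTML heuristic would also need to know, for example, that some tags cannot nest inside themselves; that kind of tag-specific knowledge is what separates an HTML-aware parser from a generic one.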

Rubyful Soup builds a densely linked model of the entire document, which uses a lot of memory. If you only need to process certain parts of the document, you can implement the SGMLParser hooks yourself and get a faster parser that uses less memory.

Here's an SGMLParser subclass that extracts URLs from a web page. It checks every A tag for an href attribute, and keeps the results in a set. Note the similarity to the LinkGrabber class defined in Recipe 11.13.

	require 'rubygems'
	require 'html/sgml-parser'
	require 'set'

	html = %{<a name="anchor"><a href="http://www.oreilly.com">O'Reilly</a> irrelevant<a href="http://www.ruby-lang.org/">Ruby</a>}

	class LinkGrabber < HTML::SGMLParser
	  attr_reader :urls

	  def initialize
	    @urls = Set.new
	    super
	  end

	  # Called for every opening A tag; attrs is a list of [name, value] pairs.
	  def do_a(attrs)
	    url = attrs.find { |attr| attr[0] == 'href' }
	    @urls << url[1] if url
	  end
	end

	extractor = LinkGrabber.new
	extractor.feed(html)
	extractor.urls
	# => #<Set: {"http://www.oreilly.com", "http://www.ruby-lang.org/"}>

The equivalent Rubyful Soup program is quicker to write and easier to understand, but it runs more slowly and uses more memory:

	require 'rubyful_soup'

	urls = Set.new
	BeautifulStoneSoup.new(html).find_all('a').each do |tag|
	  urls << tag['href'] if tag['href']
	end

You can improve performance by telling Rubyful Soup's parser to ignore everything except A tags and their contents:

	puts BeautifulStoneSoup.new(html, :parse_only_these => 'a')
	# <a name="anchor"></a>
	# <a href="http://www.oreilly.com">O'Reilly</a>
	# <a href="http://www.ruby-lang.org/">Ruby</a>

But the fastest implementation will always be a custom SGMLParser subclass. If your parser is part of a full application (rather than a one-off script), you'll need to find the best tradeoff between performance and code legibility.

See Also

Recipe 11.13, "Extracting All the URLs from an HTML Document"
The Rubyful Soup documentation (http://www.crummy.com/software/RubyfulSoup/documentation.html)
The htree library defines a forgiving HTML/XML parser that can convert a parsed document into a REXML Document object (http://cvs.m17n.org/~akr/htree/)
The HTML TIDY library can fix up most invalid HTML so that it can be parsed by a standard parser; it's a C library with Ruby bindings; see http://tidy.sourceforge.net/ for the library, and http://rubyforge.org/projects/tidy for the bindings
