Converting HTML Documents from the Web into Text

Problem

You want to get a text summary of a web site.

Solution

The open-uri library is the easiest way to grab the content of a web page; it lets you open a URL as though it were a file:

	require 'open-uri'

	# URI.open needs Ruby 2.5 or later; older Rubies use plain open(...).
	example = URI.open('http://www.example.com/')
	# => #<StringIO:0x...>

	html = example.read

As with a file, the read method returns a string. You can chain a series of sub and gsub calls to strip the markup and leave more readable text.

	plain_text = html.sub(%r{<body.*?>(.*?)</body>}mi, '\1').gsub(/<.*?>/m, ' ').
	  gsub(%r{(\n\s*){2}}, "\n\n")

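A minimal offline sketch of these substitutions, run against a small hard-coded page instead of a live site. It assumes the first pattern is meant to unwrap the body element; strip_tags and SAMPLE_HTML are hypothetical names used only for illustration.

```ruby
# Hypothetical helper wrapping the three substitutions in one method.
def strip_tags(html)
  # Replace the <body>...</body> element with its contents.
  unwrapped = html.sub(%r{<body.*?>(.*?)</body>}mi, '\1')
  # Replace every remaining tag with a single space.
  untagged = unwrapped.gsub(/<.*?>/m, ' ')
  # Collapse any run of whitespace holding two or more newlines.
  untagged.gsub(%r{(\n\s*){2}}, "\n\n")
end

# Made-up sample page, so the sketch needs no network access.
SAMPLE_HTML = "<html><head><title>Demo</title></head>\n" \
              "<body>\n<h1>Hello</h1>\n\n\n\n" \
              "<p>Some <b>bold</b> text.</p>\n</body></html>\n"

puts strip_tags(SAMPLE_HTML)
```

The title text survives in the output, because the first substitution only unwraps the body element; it does not discard what surrounds it.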
Finally, you can use the standard CGI library to unescape HTML entities like &lt; into their ASCII equivalents (<):

	require 'cgi'
	plain_text = CGI.unescapeHTML(plain_text)
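Unescaping works best as the last step. The sketch below, using a made-up SNIPPET and two hypothetical helper methods, compares the two possible orders: if entities are unescaped before the tags are stripped, an escaped tag such as &lt;b&gt; turns into real markup and is then removed along with everything else.

```ruby
require 'cgi'

# Made-up input: a paragraph that mentions a tag by escaping it.
SNIPPET = '<p>Use the &lt;b&gt; tag for &quot;bold&quot; text &amp; more.</p>'

# Strip tags first, then unescape: the escaped <b> survives as literal text.
def strip_then_unescape(html)
  CGI.unescapeHTML(html.gsub(/<.*?>/m, ' '))
end

# Unescape first, then strip: the entity becomes a real tag and is lost.
def unescape_then_strip(html)
  CGI.unescapeHTML(html).gsub(/<.*?>/m, ' ')
end

puts strip_then_unescape(SNIPPET)
puts unescape_then_strip(SNIPPET)
```

Note that CGI.unescapeHTML only handles &amp;, &lt;, &gt;, &quot;, &apos;, and numeric entities; other named entities are left as they are.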

The final product:

	puts plain_text
	# Example Web Page
	#
	# You have reached this web page by typing "example.com",
	# "example.net",
	# or "example.org" into your web browser.
	# These domain names are reserved for use in documentation and are not available
	# for registration. See RFC
	# 2606 , Section 3.

Discussion

The open-uri library extends the open method so that you can access the contents of web pages and FTP sites with the same interface used for local files. In Ruby 3.0 and later the extended method is URI.open; plain open no longer accepts URLs.
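As a rough illustration of that shared interface, the sketch below defines a hypothetical first_line helper that accepts either a local path or a URL. It assumes Ruby 2.5 or later (for URI.open), and it is exercised with a temporary file so that it runs without network access.

```ruby
require 'open-uri'
require 'tempfile'

# Hypothetical helper: returns the first line of a local path or a URL.
# URI.open hands plain paths to the ordinary file-opening code.
def first_line(source)
  URI.open(source) { |io| io.gets }
end

# Writes some text to a temporary file and reads its first line back.
def first_line_of_temp_file(text)
  result = nil
  Tempfile.create('open-uri-demo') do |file|
    file.write(text)
    file.flush
    result = first_line(file.path)
  end
  result
end

puts first_line_of_temp_file("line one\nline two\n")
```

Given network access, first_line('http://www.example.com/') would go through the same code path.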

The simple regular expression substitutions above do nothing but remove HTML tags and clean up excess whitespace. They work well for well-formatted HTML, but the web is full of mean and ugly HTML, so you may want to consider a more involved approach. Let's define an HTMLSanitizer class to do our dirty business.
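To see where the simple substitutions fall short, here is a small sketch with two made-up inputs. The naive_strip helper and both sample strings are hypothetical; they only illustrate the kind of markup that a bare tag-stripping pattern mishandles.

```ruby
# Hypothetical helper: the tag-stripping substitution on its own.
def naive_strip(html)
  html.gsub(/<.*?>/m, ' ')
end

# Script contents are not markup, so they are left behind as text.
WITH_SCRIPT = '<script>var x = 1;</script><p>Visible text</p>'

# A ">" inside an attribute value ends the match too early.
WITH_ATTRIBUTE = '<img alt="a > b" src="pic.png"><p>Caption</p>'

puts naive_strip(WITH_SCRIPT)
puts naive_strip(WITH_ATTRIBUTE)
```

In the first case the JavaScript source leaks into the plain text; in the second, the tail of the img tag is left behind.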

An HTMLSanitizer will start off with some HTML and, through a series of search-and-replace operations, transform it into plain text. Different HTML tags will be handled differently. The contents of some HTML tags should simply be removed in a plaintext rendering. For example, you probably don't want to see the contents of and