Converting HTML Documents from the Web into Text

Problem

You want to get a text summary of a web site.

Solution

The open-uri library is the easiest way to grab the content of a web page; it lets you open a URL as though it were a file:

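	require 'open-uri'

	# URI.open is the open-uri entry point on Ruby 2.5 and later;
	# before Ruby 3.0, plain open also accepted URLs.
	example = URI.open('http://www.example.com/')
	# => #<StringIO:0x...>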

	html = example.read

As with a file, the read method returns a string. You can then chain sub and gsub calls to strip the markup and leave more readable text.

	plain_text = html.sub(%r{<body.*?>(.*?)</body>}mi, '\1').gsub(/<.*?>/m, ' ').
	  gsub(%r{(\n\s*){2}}, "\n\n")
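To see what the tag-stripping and whitespace substitutions do on their own, here is a minimal sketch run against a short literal string instead of a downloaded page. The sample markup and the variable names are made up for illustration.

```ruby
# Hypothetical sample markup standing in for a downloaded page.
html = "<h1>Hello</h1>\n\n\n\n<p>Fish <b>and</b> chips</p>\n"

# Replace every tag with a space, so words on either side of a tag stay apart.
no_tags = html.gsub(/<.*?>/m, ' ')

# Collapse a run of line breaks (with optional whitespace between them)
# into a single blank line.
tidy = no_tags.gsub(%r{(\n\s*){2}}, "\n\n")

puts tidy
```

Replacing tags with a space rather than an empty string leaves some doubled spaces behind, but it keeps words from running together when the markup had no whitespace around a tag.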

Finally, you can use the standard CGI library to unescape HTML entities like &lt; into their ASCII equivalents (<):

	require 'cgi'
	plain_text = CGI.unescapeHTML(plain_text)
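As a quick illustration of what this step does and does not decode, here is a small sketch with a made-up string. CGI.unescapeHTML handles a small set of named entities (such as &amp;amp;, &amp;lt;, &amp;gt;, and &amp;quot;) plus numeric references; other named entities, such as &amp;nbsp;, pass through unchanged.

```ruby
require 'cgi'

# Hypothetical input containing several kinds of entities.
escaped = 'Fish &amp; chips &lt;b&gt;now&lt;/b&gt; &quot;hot&quot; &#169;&nbsp;2006'

plain = CGI.unescapeHTML(escaped)
puts plain

# Named entities outside the small supported set are left as they were.
puts plain.include?('&nbsp;')
```

If a page relies heavily on other named entities, you would need an extra substitution pass or a fuller HTML library to decode them.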

The final product:

	puts plain_text
	# Example Web Page
	#
	# You have reached this web page by typing "example.com",
	# "example.net",
	# or "example.org" into your web browser.
	# These domain names are reserved for use in documentation and are not available
	# for registration. See RFC
	# 2606 , Section 3.

Discussion

The open-uri library extends the open method so that you can access the contents of web pages and FTP sites with the same interface used for local files.
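Because the object you get back behaves like an open file, the same reading code can serve both kinds of source. The following is a minimal sketch, assuming Ruby 2.5 or later (where open-uri provides URI.open); read_document is a hypothetical helper, not part of open-uri, and the demonstration uses a temporary local file so it needs no network access.

```ruby
require 'open-uri'
require 'tempfile'

# Hypothetical helper: read a whole document from either an http(s) URL
# or a local path. Both branches hand the block an IO-like object.
def read_document(location)
  if location =~ %r{\Ahttps?://}i
    URI.open(location) { |io| io.read }   # provided by open-uri
  else
    File.open(location) { |io| io.read }
  end
end

# Local demonstration with a temporary file.
Tempfile.create(['sample', '.html']) do |file|
  file.write('<p>Hello</p>')
  file.flush
  puts read_document(file.path)
end
```

Calling read_document('http://www.example.com/') would take the first branch and return the page's HTML as a string.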

The simple regular expression substitutions above do nothing but remove HTML tags and clean up excess whitespace. They work well for well-formatted HTML, but the web is full of mean and ugly HTML, so you may want to consider a more involved approach. Let's define an HTMLSanitizer class to do our dirty business.
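To make the limitation concrete, here is a small sketch that runs the tag-stripping substitution over a made-up fragment of awkward markup. Two things leak through: the body of a script element, and part of an attribute value that happens to contain a > character.

```ruby
# Hypothetical messy markup: inline JavaScript, a ">" inside an attribute,
# and an unclosed tag.
messy = %{<script>if (a > b) alert("hi");</script>} +
        %{<p title="a > b">Text</p><br>Unclosed <b>bold}

stripped = messy.gsub(/<.*?>/m, ' ')
puts stripped

# The JavaScript survives because only the tags themselves were removed.
puts stripped.include?('alert')

# The tag match stopped at the first ">", so the tail of the attribute leaks.
puts stripped.include?('b">Text')
```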

An HTMLSanitizer will start off with some HTML, and through a series of search-and-replace operations transform it into plain text. Different HTML tags will be handled differently. The contents of some HTML tags should simply be removed in a plaintext rendering. For example, you probably don't want to see the contents of and
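One way such a class might look is sketched below. This is an illustrative guess at the general shape, not the recipe's own implementation: the tag lists, the method names, and the order of the steps are all assumptions made for this example.

```ruby
require 'cgi'

# Hypothetical sketch of a search-and-replace HTML-to-text converter.
class HTMLSanitizer
  # Tags whose entire contents are dropped (an assumed list).
  REMOVE_WITH_CONTENTS = %w[script style head].freeze

  # Tags that become line breaks in the plain-text rendering (assumed list).
  LINE_BREAKING = %w[p div br h1 h2 h3 h4 h5 h6 li tr].freeze

  def initialize(html)
    @html = html
  end

  def to_text
    text = @html.dup

    # 1. Remove elements whose contents are noise in plain text.
    REMOVE_WITH_CONTENTS.each do |tag|
      text.gsub!(%r{<#{tag}\b.*?</#{tag}\s*>}mi, '')
    end

    # 2. Turn block-level tags into newlines so paragraphs stay separated.
    text.gsub!(%r{</?(?:#{LINE_BREAKING.join('|')})\b[^>]*>}i, "\n")

    # 3. Strip every remaining tag.
    text.gsub!(/<.*?>/m, '')

    # 4. Decode entities only after the tags are gone, so an escaped
    #    "&lt;b&gt;" is never mistaken for markup, then tidy whitespace.
    text = CGI.unescapeHTML(text)
    text.gsub(/[ \t]+/, ' ').gsub(/ ?\n ?/, "\n").gsub(/\n{3,}/, "\n\n").strip
  end
end

page = '<html><head><title>T</title><style>p{color:red}</style></head>' \
       '<body><h1>Title</h1><script>alert(1 > 0)</script>' \
       '<p>Fish &amp; <b>chips</b></p></body></html>'
puts HTMLSanitizer.new(page).to_text
```

Keeping each rule as its own substitution makes it easy to add or reorder steps, though a regular-expression approach like this can still be fooled by badly malformed markup.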

Ruby Cookbook
ISBN: 0596523696