Grabbing the Contents of a Web Page

Problem

You want to display or process a specific web page.

Solution

The simplest solution is to use the open-uri library. It lets you open a web page as though it were a file. This code fetches the oreilly.com homepage and prints out the first part of it:

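	require 'open-uri'
	# On Ruby 2.5 and later, call URI.open; older versions used a bare open.
	puts URI.open('http://www.oreilly.com/').read(200)
	# (prints the first 200 bytes of the page's HTML)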

For more complex applications, you'll need to use the net/http library. Use Net::HTTP.get_response to make an HTTP request and get the response as a Net::HTTPResponse object containing the response code, headers, and body.

	require 'net/http'
	response = Net::HTTP.get_response('www.oreilly.com', '/about/')
	response.code # => "200"
	response.body.size # => 21835
	response['Content-type']
	# => "text/html; charset=ISO-8859-1"
	puts response.body[0,200]
	# (prints the first 200 characters of the page's HTML)

Discussion

If you just want the text of the page, use Net::HTTP.get. If you also want the response code or the values of the HTTP response headers, use Net::HTTP.get_response.
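The difference between the two calls can be sketched with a pair of small helpers. The helper names (page_text, page_details) are hypothetical and only here for illustration; both take a URL string and wrap it in a URI object before handing it to Net::HTTP.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: returns only the body of the page, as a String.
def page_text(url)
  Net::HTTP.get(URI(url))
end

# Hypothetical helper: returns the status code and Content-Type header
# alongside the body, using the full response object.
def page_details(url)
  response = Net::HTTP.get_response(URI(url))
  { code: response.code, type: response['Content-Type'], body: response.body }
end

# Usage (requires network access):
#   page_text('http://www.oreilly.com/about/')
#   page_details('http://www.oreilly.com/about/')[:code]
```

Neither helper follows redirects or handles errors, so treat this as a starting point rather than a complete client.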

The get_response method returns an instance of some subclass of Net::HTTPResponse, which contains all the information about an HTTP response. There's one subclass for every response code defined in the HTTP standard: for instance, HTTPOK for the 200 response code, HTTPMovedPermanently for the 301 response code, and HTTPNotFound for the 404 response code. There's also an HTTPUnknownResponse subclass for any response codes not defined in HTTP.

The only difference between these subclasses is the class name and the code member. You can check the response code of an HTTP response by comparing specific classes with is_a?, or by checking the result of HTTPResponse#code, which returns a String:

	puts "Success!" if response.is_a? Net::HTTPOK
	# Success!

	puts case response.code[0] # Check the first byte of the response code.
	 when ?1 then "Status code indicates an HTTP informational response."
	 when ?2 then "Status code indicates success."
	 when ?3 then "Status code indicates redirection."
	 when ?4 then "Status code indicates client error."
	 when ?5 then "Status code indicates server error."
	 else "Non-standard status code."
	end
	# Status code indicates success.
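Besides the per-code classes, net/http groups them under broader parent classes such as Net::HTTPSuccess and Net::HTTPClientError, so a case statement on the response object can do the same job as inspecting the first character of the code. Here is a minimal sketch; the helper name response_category is hypothetical, and the responses are built by hand so the example runs without a network connection.

```ruby
require 'net/http'

# Hypothetical helper: maps a response to a coarse category by walking
# the class hierarchy instead of inspecting the status code string.
def response_category(response)
  case response
  when Net::HTTPInformation then :informational
  when Net::HTTPSuccess     then :success
  when Net::HTTPRedirection then :redirection
  when Net::HTTPClientError then :client_error
  when Net::HTTPServerError then :server_error
  else :unknown
  end
end

# Responses constructed directly (HTTP version, code, message).
ok        = Net::HTTPOK.new('1.1', '200', 'OK')
not_found = Net::HTTPNotFound.new('1.1', '404', 'Not Found')

response_category(ok)        # => :success
response_category(not_found) # => :client_error
```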

You can get the value of an HTTP response header by treating HTTPResponse as a hash, passing the header name into HTTPResponse#[]. The only difference from a real Hash is that the names of the headers are case-insensitive. Like a hash, HTTPResponse supports the iteration methods #each, #each_key, and #each_value:

	response['Server']
	# => "Apache/1.3.34 (Unix) PHP/4.3.11 mod_perl/1.29"
	response['SERVER']
	# => "Apache/1.3.34 (Unix) PHP/4.3.11 mod_perl/1.29"

	response.each_key { |key| puts key }
	# x-cache
	# p3p
	# content-type
	# date
	# server
	# transfer-encoding
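If you want the headers as an ordinary Hash, you can collect them yourself. The sketch below builds a response by hand so it runs offline; the header values are made up for illustration, and it uses each_header, which yields each header name together with its value.

```ruby
require 'net/http'

# Build a response by hand (no network needed). The header values are
# invented for this example.
response = Net::HTTPOK.new('1.1', '200', 'OK')
response['Content-Type'] = 'text/html; charset=UTF-8'
response['Server'] = 'ExampleServer/1.0'

# Lookups ignore the case of the header name.
response['content-type'] # => "text/html; charset=UTF-8"
response['CONTENT-TYPE'] # => "text/html; charset=UTF-8"

# each_header yields every header name (lowercased) with its value.
headers = {}
response.each_header { |name, value| headers[name] = value }
p headers
```

Because the names come back lowercased, the resulting Hash is case-sensitive: look up 'server', not 'Server'.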

If you make a request by calling Net::HTTP.get_response with no code block, Ruby will read the body of the web page into a string, which you can fetch with the HTTPResponse#body method. If you like, you can process the body as you read it, one segment at a time, by passing a code block to HTTPResponse#read_body:

	Net::HTTP.get_response('www.oreilly.com', '/about/') do |response|
	  response.read_body do |segment|
	    puts "Received segment of #{segment.size} byte(s)!"
	  end
	end
	# Received segment of 614 byte(s)!
	# Received segment of 1024 byte(s)!
	# Received segment of 848 byte(s)!
	# Received segment of 1024 byte(s)!
	# …

Note that you can only call read_body once per request. Also, there are no guarantees that a segment won't end in the middle of an HTML tag name or some other inconvenient place, so this is best for applications where you're not handling the web page as structured data: for instance, when you're simply piping it to some other destination.
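Piping the body somewhere else might look like the following sketch. The helper name stream_page and its port parameter are assumptions made for this example; it appends each segment to any object that responds to << (a File, a StringIO, or $stdout) and returns the number of bytes it copied.

```ruby
require 'net/http'

# Hypothetical helper: streams the body of the page at host + path into
# `out` one segment at a time, without holding the whole page in memory.
# Returns the total number of bytes written.
def stream_page(host, path, out, port = 80)
  total = 0
  Net::HTTP.get_response(host, path, port) do |response|
    response.read_body do |segment|
      out << segment
      total += segment.bytesize
    end
  end
  total
end

# Usage (requires network access):
#   File.open('about.html', 'wb') do |file|
#     stream_page('www.oreilly.com', '/about/', file)
#   end
```

The sketch copies the body whatever the status code is, so a real program would probably check the response before writing anything.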

See Also

  • Recipe 14.2, "Making an HTTPS Web Request"
  • Recipe 14.3, "Customizing HTTP Request Headers"
  • Recipe 14.20, "A Real-World HTTP Client," covers a lot of edge cases you'll need to handle if you want to write a general-purpose client
  • Most HTML you'll find on the web is invalid, so to parse it you'll need the tricks described in Recipe 11.5, "Parsing Invalid Markup"


Ruby Cookbook (Cookbooks (O'Reilly))
ISBN: 0596523696