Parsing URLs

Problem

You want to parse a string representation of a URL into a data structure that articulates the parts of the URL.

Solution

URI.parse transforms a string describing a URL into a URI object.[5] The parts of the URL can be determined by interrogating the URI object.

[5] The class name is URI, but I use both "URI" and "URL" because they are more or less interchangeable.

	require 'uri'

	URI.parse('https://www.example.com').scheme # => "https"
	URI.parse('http://www.example.com/').host # => "www.example.com"
	URI.parse('http://www.example.com:6060/').port # => 6060
	URI.parse('http://example.com/a/file.html').path # => "/a/file.html"

URI.split transforms a string into an array of URL parts. This is more efficient than URI.parse, but you have to know which parts correspond to which slots in the array:

	URI.split('http://example.com/a/file.html')
	# => ["http", nil, "example.com", nil, nil, "/a/file.html", nil, nil, nil]
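Remembering which slot holds which part is the awkward bit of URI.split. One way around it, sketched here under the assumption that the array always follows the nine-slot order given in the standard library documentation, is to zip the array with a list of names. The SPLIT_SLOTS constant and the split_to_hash helper are hypothetical names invented for this illustration:

```ruby
require 'uri'

# Slot order of the array returned by URI.split (assumed from the
# standard library documentation).
SPLIT_SLOTS = [:scheme, :userinfo, :host, :port, :registry,
               :path, :opaque, :query, :fragment].freeze

# Hypothetical helper: label each slot of URI.split's result.
def split_to_hash(url)
  SPLIT_SLOTS.zip(URI.split(url)).to_h
end

parts = split_to_hash('http://example.com/a/file.html?key=val#top')
parts[:scheme]   # => "http"
parts[:host]     # => "example.com"
parts[:path]     # => "/a/file.html"
parts[:query]    # => "key=val"
parts[:fragment] # => "top"
```

This gives up some of the speed advantage of URI.split, so it is mainly useful when readability matters more than raw throughput.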

Discussion

The URI module contains classes for five of the most popular URI schemes. Each one can store in a structured format the data that makes up a URI for that scheme. URI.parse creates an instance of the appropriate class for a particular URL's scheme.
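A quick way to see this dispatch in action is to ask each parsed object for its class. The uri_class_for helper below is a hypothetical name used only for this sketch, and the sample URLs are made up:

```ruby
require 'uri'

# Hypothetical helper: report which class URI.parse picks for a string.
def uri_class_for(string)
  URI.parse(string).class
end

uri_class_for('http://www.example.com/')        # => URI::HTTP
uri_class_for('https://www.example.com/')       # => URI::HTTPS
uri_class_for('ftp://ftp.example.com/file.txt') # => URI::FTP
uri_class_for('mailto:leonardr@example.com')    # => URI::MailTo
uri_class_for('ldap://ldap.example.com')        # => URI::LDAP

# A scheme with no dedicated class falls back to URI::Generic.
uri_class_for('ssh://example.com/')             # => URI::Generic
```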

Every URI can be decomposed into a set of components, joined by constant strings. For example, the components for an HTTP URI are the scheme ("http"), the hostname ("www.example.com"), and so on. Each URI scheme has its own components, and each of Ruby's URI classes stores the names of its components in an ordered array of symbols, called component:

	URI::HTTP.component
	# => [:scheme, :userinfo, :host, :port, :path, :query, :fragment]

	URI::MailTo.component
	# => [:scheme, :to, :headers]

Each of the components of a URI class has a corresponding accessor method, which you can call to get one component of a URI. You can also instantiate a URI class directly (rather than going through URI.parse) by passing the appropriate component symbols to its build method as a map of keyword arguments:

	URI::HTTP.build(:host => 'example.com', :path => '/a/file.html',
	                :fragment => 'section_3').to_s
	# => "http://example.com/a/file.html#section_3"
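The same approach works for the other components. As a minimal sketch, the query string can be assembled with URI.encode_www_form before it is handed to build. The build_search_url helper, the /search path, and the parameter names are all hypothetical:

```ruby
require 'uri'

# Hypothetical helper: build an HTTP URL whose query string is
# generated from a hash of parameters.
def build_search_url(host, path, params)
  URI::HTTP.build(:host => host, :path => path,
                  :query => URI.encode_www_form(params))
end

uri = build_search_url('example.com', '/search',
                       'q' => 'ruby uri', 'page' => 2)
uri.query # => "q=ruby+uri&page=2"
uri.to_s  # => "http://example.com/search?q=ruby+uri&page=2"
```

Letting URI.encode_www_form do the escaping avoids hand-building a query string that build might reject as malformed.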

The following debugging method iterates over the components handled by the scheme of a given URI object, and prints the corresponding values:

	class URI::Generic
	  def dump
	    component.each do |m|
	      puts "#{m}: #{send(m).inspect}"
	    end
	  end
	end
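If you would rather not reopen URI::Generic, or you want the values back as data instead of printed output, the same iteration can return a hash. This is a sketch of an alternative, not part of the recipe; component_hash is a hypothetical name:

```ruby
require 'uri'

# Hypothetical helper: collect a URI's components into a hash,
# keyed by the symbols in its class's component array.
def component_hash(uri)
  uri.component.map { |name| [name, uri.send(name)] }.to_h
end

parts = component_hash(URI.parse('http://example.com:6060/a/file.html'))
parts[:scheme] # => "http"
parts[:host]   # => "example.com"
parts[:port]   # => 6060
parts[:path]   # => "/a/file.html"
parts[:query]  # => nil
```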

URI::HTTP and URI::HTTPS are the most commonly encountered URI classes, since most URIs are the URLs to web pages. Both classes provide the same interface:

	url = 'http://leonardr:pw@www.subdomain.example.com:6060' +
	  '/cgi-bin/mycgi.cgi?key1=val1#anchor'
	URI.parse(url).dump
	# scheme: "http"
	# userinfo: "leonardr:pw"
	# host: "www.subdomain.example.com"
	# port: 6060
	# path: "/cgi-bin/mycgi.cgi"
	# query: "key1=val1"
	# fragment: "anchor"
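The userinfo and query components come back as single strings, but they can be broken down further. The sketch below uses the user and password accessors along with URI.decode_www_form. The query_params helper is a hypothetical name, and the second query parameter is added purely to show decoding:

```ruby
require 'uri'

# Hypothetical helper: decode a URI's query string into a hash.
# A URI with no query yields an empty hash.
def query_params(uri)
  URI.decode_www_form(uri.query || '').to_h
end

uri = URI.parse('http://leonardr:pw@www.subdomain.example.com:6060' +
                '/cgi-bin/mycgi.cgi?key1=val1&key2=val+2#anchor')
uri.user     # => "leonardr"
uri.password # => "pw"

params = query_params(uri)
params['key1'] # => "val1"
params['key2'] # => "val 2"
```

Note that converting to a hash keeps only the last value for a repeated key; use the array of pairs from URI.decode_www_form directly if duplicates matter.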

A URI::FTP object represents an FTP server, or a path to a file on an FTP server. The typecode component indicates whether the file in question is text, binary, or a directory; it typically won't be known unless you create a URI::FTP object and specify one.

	URI::parse('ftp://leonardr:password@ftp.example.com/a/file.txt').dump
	# scheme: "ftp"
	# userinfo: "leonardr:password"
	# host: "ftp.example.com"
	# port: 21
	# path: "/a/file.txt"
	# typecode: nil

A URI::MailTo object represents an email address, or even an entire message to be sent to that address. In addition to its component array, this class provides a method (to_mailtext) that formats the URI as an email message.

	uri = URI::parse('mailto:leonardr@example.com?Subject=Hello&body=Hi!')
	uri.dump
	# scheme: "mailto"
	# to: "leonardr@example.com"
	# headers: [["Subject", "Hello"], ["body", "Hi!"]]

	puts uri.to_mailtext
	# To: leonardr@example.com
	# Subject: Hello
	#
	# Hi!

A URI::LDAP object contains a path to an LDAP server or a query against one:

	URI::parse("ldap://ldap.example.com").dump
	# scheme: "ldap"
	# host: "ldap.example.com"
	# port: 389
	# dn: nil
	# attributes: nil
	# scope: nil
	# filter: nil
	# extensions: nil

	URI::parse('ldap://ldap.example.com/o=Alice%20Exeter,c=US?extension').dump
	# scheme: "ldap"
	# host: "ldap.example.com"
	# port: 389
	# dn: "o=Alice%20Exeter,c=US"
	# attributes: "extension"
	# scope: nil
	# filter: nil
	# extensions: nil

The URI::Generic class, superclass of all of the above, is a catch-all class that holds URIs with other schemes, or with no scheme at all. It holds much the same components as URI::HTTP, although there's no guarantee that any of them will be non-nil for a given URI::Generic object.
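For instance, a string with no scheme at all parses into a URI::Generic whose scheme and host are both nil. The summarize helper below is a hypothetical name used only to gather a few values for this sketch:

```ruby
require 'uri'

# Hypothetical helper: gather a few facts about a parsed string.
def summarize(string)
  uri = URI.parse(string)
  { :class => uri.class, :scheme => uri.scheme, :host => uri.host,
    :path => uri.path, :relative => uri.relative? }
end

info = summarize('/a/file.html?key=val')
info[:class]    # => URI::Generic
info[:scheme]   # => nil
info[:host]     # => nil
info[:path]     # => "/a/file.html"
info[:relative] # => true
```

Checking relative? (or testing a component for nil) before using it is a reasonable precaution when the input may not be a complete URL.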

URI::Generic also exposes two other components not used by any of its built-in subclasses. The first is opaque, which is the portion of a URL that couldn't be parsed (that is, everything after the scheme):

	uri = URI.parse('tag:example.com,2006,my-tag')
	uri.scheme # => "tag"
	uri.opaque # => "example.com,2006,my-tag"

The second is registry, which is only used for URI schemes whose naming authority is registry-based instead of server-based. It's likely that you'll never need to use registry, since almost all URI schemes are server-based (for instance, HTTP, FTP, and LDAP all use DNS to designate a host).

To combine the components of a URI object into a string, simply call to_s:

	uri = URI.parse('http://www.example.com/#anchor')
	uri.port = 8080
	uri.to_s # => "http://www.example.com:8080/#anchor"
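Other components can be changed through their setter methods in the same way. The sketch below applies a hash of changes and returns the resulting string; with_components is a hypothetical helper name. To my understanding, to_s leaves out a port that matches the scheme's default, which the second call illustrates:

```ruby
require 'uri'

# Hypothetical helper: parse a URL, assign each changed component
# through its setter method, and return the recombined string.
def with_components(url, changes)
  uri = URI.parse(url)
  changes.each { |name, value| uri.send("#{name}=", value) }
  uri.to_s
end

with_components('http://www.example.com/a/file.html#anchor',
                :path => '/b/other.html', :query => 'page=2',
                :fragment => nil)
# => "http://www.example.com/b/other.html?page=2"

# Port 80 is the default for HTTP, so it does not appear in the string.
with_components('http://www.example.com:8080/', :port => 80)
# => "http://www.example.com/"
```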

See Also

  • Recipe 11.13, "Extracting All the URLs from an HTML Document"
  • ri URI

Ruby Cookbook (O'Reilly Cookbooks)
ISBN: 0596523696