You want to parse a string representation of a URL into a data structure that articulates the parts of the URL.
URI.parse transforms a string describing a URL into a URI object.[5] The parts of the URL can be determined by interrogating the URI object.
[5] The class name is URI, but I use both "URI" and "URL" because they are more or less interchangeable.
require 'uri'
URI.parse('https://www.example.com').scheme         # => "https"
URI.parse('http://www.example.com/').host           # => "www.example.com"
URI.parse('http://www.example.com:6060/').port      # => 6060
URI.parse('http://example.com/a/file.html').path    # => "/a/file.html"
URI.split transforms a string into an array of URL parts. This is more efficient than URI.parse, but you have to know which parts correspond to which slots in the array:
URI.split('http://example.com/a/file.html')
# => ["http", nil, "example.com", nil, nil, "/a/file.html", nil, nil, nil]
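Because the array is positional, it can help to attach names to the slots. The following is a minimal sketch, not part of the URI library: SPLIT_SLOTS and split_to_hash are hypothetical names introduced here, and the slot order is the one documented for URI.split (scheme, userinfo, host, port, registry, path, opaque, query, fragment).

```ruby
require 'uri'

# Slot names in the order URI.split returns its values.
# (SPLIT_SLOTS is a hypothetical constant, not part of the URI library.)
SPLIT_SLOTS = [:scheme, :userinfo, :host, :port, :registry,
               :path, :opaque, :query, :fragment].freeze

# Hypothetical helper: pair each slot name with the value URI.split found.
def split_to_hash(url)
  SPLIT_SLOTS.zip(URI.split(url)).to_h
end

parts = split_to_hash('http://example.com/a/file.html')
parts[:scheme]   # => "http"
parts[:host]     # => "example.com"
parts[:path]     # => "/a/file.html"
parts[:query]    # => nil
```

This gives up some of the efficiency of working with the raw array, but spares you from remembering which index holds which part.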
The URI module contains classes for five of the most popular URI schemes. Each one can store in a structured format the data that makes up a URI for that scheme. URI.parse creates an instance of the appropriate class for a particular URL's scheme.
Every URI can be decomposed into a set of components, joined by constant strings. For example, the components of an HTTP URI are the scheme ("http"), the hostname ("www.example.com"), and so on. Each URI scheme has its own components, and each of Ruby's URI classes stores the names of its components in an ordered array of symbols, called component:
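To see which class URI.parse picks for a given string, you can ask the resulting object for its class. This is only an illustrative sketch; uri_class_for is a hypothetical helper name, and the URLs are sample values.

```ruby
require 'uri'

# Hypothetical helper: report which URI class URI.parse chooses for a string.
def uri_class_for(url)
  URI.parse(url).class
end

uri_class_for('http://www.example.com/')            # => URI::HTTP
uri_class_for('https://www.example.com/')           # => URI::HTTPS
uri_class_for('ftp://ftp.example.com/a/file.txt')   # => URI::FTP
uri_class_for('mailto:leonardr@example.com')        # => URI::MailTo
uri_class_for('ldap://ldap.example.com')            # => URI::LDAP
# A scheme with no dedicated class falls back to URI::Generic.
uri_class_for('tag:example.com,2006,my-tag')        # => URI::Generic
```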
URI::HTTP.component
# => [:scheme, :userinfo, :host, :port, :path, :query, :fragment]
URI::MailTo.component
# => [:scheme, :to, :headers]
Each of the components of a URI class has a corresponding accessor method, which you can call to get one component of a URI. You can also instantiate a URI class directly (rather than going through URI.parse) by passing in the appropriate component symbols as a map of keyword arguments.
URI::HTTP.build(:host => 'example.com', :path => '/a/file.html',
                :fragment => 'section_3').to_s
# => "http://example.com/a/file.html#section_3"
The following debugging method iterates over the components handled by the scheme of a given URI object, and prints the corresponding values:
class URI::Generic
  def dump
    component.each do |m|
      puts "#{m}: #{send(m).inspect}"
    end
  end
end
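If you would rather not reopen URI::Generic, the same iteration can live in a standalone method that returns the components instead of printing them. This is a sketch of one alternative, not the approach used in the rest of this recipe; component_hash is a hypothetical name.

```ruby
require 'uri'

# Hypothetical helper: collect a URI's components into a hash of
# component name => value, using the object's own component list.
def component_hash(uri)
  uri.component.map { |name| [name, uri.public_send(name)] }.to_h
end

h = component_hash(URI.parse('http://example.com:6060/a/file.html?key1=val1#anchor'))
h[:host]       # => "example.com"
h[:port]       # => 6060
h[:query]      # => "key1=val1"
h[:fragment]   # => "anchor"
```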
URI::HTTP and URI::HTTPS are the most commonly encountered URI classes, since most URIs are the URLs of web pages. Both classes provide the same interface.
url = 'http://leonardr:pw@www.subdomain.example.com:6060' +
      '/cgi-bin/mycgi.cgi?key1=val1#anchor'
URI.parse(url).dump
# scheme: "http"
# userinfo: "leonardr:pw"
# host: "www.subdomain.example.com"
# port: 6060
# path: "/cgi-bin/mycgi.cgi"
# query: "key1=val1"
# fragment: "anchor"
A URI::FTP object represents an FTP server, or a path to a file on an FTP server. The typecode component indicates whether the file in question is text, binary, or a directory; it typically won't be known unless you create a URI::FTP object and specify one.
URI::parse('ftp://leonardr:password@ftp.example.com/a/file.txt').dump
# scheme: "ftp"
# userinfo: "leonardr:password"
# host: "ftp.example.com"
# port: 21
# path: "/a/file.txt"
# typecode: nil
A URI::MailTo object represents an email address, or even an entire message to be sent to that address. In addition to its component array, this class provides a method (to_mailtext) that formats the URI as an email message.
uri = URI::parse('mailto:leonardr@example.com?Subject=Hello&body=Hi!')
uri.dump
# scheme: "mailto"
# to: "leonardr@example.com"
# headers: [["Subject", "Hello"], ["body", "Hi!"]]
puts uri.to_mailtext
# To: leonardr@example.com
# Subject: Hello
#
# Hi!
A URI::LDAP object contains a path to an LDAP server or a query against one:
URI::parse("ldap://ldap.example.com").dump
# scheme: "ldap"
# host: "ldap.example.com"
# port: 389
# dn: nil
# attributes: nil
# scope: nil
# filter: nil
# extensions: nil
URI::parse('ldap://ldap.example.com/o=Alice%20Exeter,c=US?extension').dump
# scheme: "ldap"
# host: "ldap.example.com"
# port: 389
# dn: "o=Alice%20Exeter,c=US"
# attributes: "extension"
# scope: nil
# filter: nil
# extensions: nil
The URI::Generic class, superclass of all of the above, is a catch-all class that holds URIs with other schemes, or with no scheme at all. It holds much the same components as URI::HTTP, although there's no guarantee that any of them will be non-nil for a given URI::Generic object.
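A relative reference is a simple case of a URI with no scheme at all. The sketch below shows what URI.parse returns for one; describe_generic is a hypothetical helper, and the path is a sample value.

```ruby
require 'uri'

# Hypothetical helper: summarize how URI.parse handles a string,
# returning the chosen class and a few of its components.
def describe_generic(str)
  uri = URI.parse(str)
  { :class => uri.class, :scheme => uri.scheme,
    :host => uri.host, :path => uri.path }
end

info = describe_generic('/a/file.html')
info[:class]    # => URI::Generic
info[:scheme]   # => nil
info[:host]     # => nil
info[:path]     # => "/a/file.html"
```

Only the path is populated here, which is why code that handles URI::Generic objects should be prepared for nil components.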
URI::Generic also exposes two other components not used by any of its built-in subclasses. The first is opaque, which is the portion of a URL that couldn't be parsed (that is, everything after the scheme):
uri = URI.parse('tag:example.com,2006,my-tag')
uri.scheme    # => "tag"
uri.opaque    # => "example.com,2006,my-tag"
The second is registry, which is only used for URI schemes whose naming authority is registry-based instead of server-based. It's likely that you'll never need to use registry, since almost all URI schemes are server-based (for instance, HTTP, FTP, and LDAP all use the DNS system to designate a host).
To combine the components of a URI object into a string, simply call to_s:
uri = URI.parse('http://www.example.com/#anchor')
uri.port = 8080
uri.to_s    # => "http://www.example.com:8080/#anchor"
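Other components have setter methods as well, so the same parse, modify, and to_s round trip works for them. The following is a small sketch under that assumption; with_path_and_query is a hypothetical helper, and the path and query values are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: return a copy of a URL string with a new path and
# query, leaving the other components (such as the fragment) untouched.
def with_path_and_query(url, path, query)
  uri = URI.parse(url)
  uri.path = path
  uri.query = query
  uri.to_s
end

with_path_and_query('http://www.example.com/#anchor',
                    '/docs/index.html', 'lang=en')
# => "http://www.example.com/docs/index.html?lang=en#anchor"
```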