Extracting All the URLs from an HTML Document

Problem

You want to find all the URLs on a web page.

Solution

Do you only want to find links (that is, URLs mentioned in the HREF attribute of an A tag)? Do you also want to find the URLs of embedded objects like images and applets? Or do you want to find all URLs, including ones mentioned in the text of the page?

The last case is the simplest. You can use URI.extract to get all the URLs found in a string, or to get only the URLs with certain schemes. Here we'll extract URLs from some HTML, whether or not they're inside A tags:

	require 'uri'

	text = %{"My homepage is at
	<a href="http://www.example.com/">http://www.example.com/</a>, and be sure
	to check out my weblog at http://www.example.com/blog/. Email me at <a
	href="mailto:bob@example.com">bob@example.com</a>.}

	 
	URI.extract(text)
	# => ["http://www.example.com/", "http://www.example.com/",
	# "http://www.example.com/blog/.", "mailto:bob@example.com"]

	# Get HTTP(S) links only.
	URI.extract(text, ['http', 'https'])
	# => ["http://www.example.com/", "http://www.example.com/",
	# "http://www.example.com/blog/."]

If you only want URLs that show up inside certain tags, you need to parse the HTML. Assuming the document is valid, you can do this with any of the parsers in the REXML library. Here's an efficient implementation using REXML's stream parser. It retrieves URLs found in the HREF attributes of A tags and the SRC attributes of IMG tags, but you can customize this behavior by passing a different map to the constructor.

	require 'rexml/document'
	require 'rexml/streamlistener'
	require 'set'

	class LinkGrabber
	  include REXML::StreamListener
	  attr_reader :links

	  def initialize(interesting_tags = {'a' => %w{href}, 'img' => %w{src}}.freeze)
	    @tags = interesting_tags
	    @links = Set.new
	  end

	  # Called by the stream parser for every opening tag in the document.
	  def tag_start(name, attrs)
	    @tags[name].each do |uri_attr|
	      @links << attrs[uri_attr] if attrs[uri_attr]
	    end if @tags[name]
	  end

	  def parse(text)
	    REXML::Document.parse_stream(text, self)
	  end
	end

	grabber = LinkGrabber.new
	grabber.parse(text)
	grabber.links
	# => #<Set: {"http://www.example.com/", "mailto:bob@example.com"}>

Discussion

The URI.extract solution uses regular expressions to find everything that looks like a URL. This is faster and easier to write than a REXML parser, but it will find every absolute URL in the document, including any mentioned in the text and any in the document's initial DOCTYPE. It will not find relative URLs hidden within HREF attributes, since those don't start with an access scheme like "http://".
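
A quick way to see the relative-URL limitation is to run URI.extract on markup that contains one relative and one absolute link. This is a minimal sketch; the snippet string is made up for the illustration and is not part of the recipe's sample text.

```ruby
require 'uri'

# Made-up markup: the first link is relative, the second is absolute.
snippet = %{<a href="/about.html">About</a> and <a href="http://www.example.com/">home</a>}

# Only the link that starts with a scheme is picked up.
found = URI.extract(snippet)
# => ["http://www.example.com/"]
```

The relative link "/about.html" is skipped because nothing in it looks like a scheme followed by a colon.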

URI.extract treats the period at the end of the first sentence ("check out my weblog at…") as though it were part of the URL. URLs contained within English text are often ambiguous in this way. "http://www.example.com/blog/." is a perfectly valid URL and might be correct, but that period is probably just punctuation. Accessing the URL is the only way to know for sure, but it's almost always safe to strip those characters:

	END_CHARS = %{.,?!:;}
	URI.extract(text, ['http']).collect { |u| END_CHARS.index(u[-1]) ? u.chop : u }
	# => ["http://www.example.com/", "http://www.example.com/",
	# "http://www.example.com/blog/"]
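
The one-liner above removes a single trailing character. If a URL might be followed by a run of punctuation (an ellipsis or "?!", say), a regular expression anchored at the end of the string is one possible alternative. This is a sketch only: the helper name strip_trailing_punctuation is hypothetical, and it assumes the same set of characters as END_CHARS.

```ruby
# Hypothetical helper: remove every trailing punctuation character,
# not just the last one. The character class mirrors END_CHARS.
def strip_trailing_punctuation(url)
  url.sub(/[.,?!:;]+\z/, '')
end

candidates = ['http://www.example.com/blog/.',
              'http://www.example.com/faq...',
              'http://www.example.com/']
cleaned = candidates.map { |u| strip_trailing_punctuation(u) }
# => ["http://www.example.com/blog/", "http://www.example.com/faq",
#     "http://www.example.com/"]
```

The trade-off is the same as before: a URL that really does end in one of these characters will be truncated.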

The parser solution defines a listener that hears about every tag in the document and acts on the ones present in its interesting_tags map. It checks each of those tags for attributes that tend to contain URLs: "href" for <a> tags and "src" for <img> tags, for instance. Every URL it finds goes into a set.

The use of a set here guarantees that the result contains no duplicate URLs. If you want to gather (possibly duplicate) URLs in the order they were found in the document, use an array, the way URI.extract does.
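
As a rough illustration of the order-preserving variant, here is a self-contained listener that stores its links in an array instead of a set. The class name OrderedLinkGrabber and the sample markup are assumptions made for this sketch; the sample is deliberately a well-formed fragment with a single root element so the REXML stream parser accepts it.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical variant of the listener: an Array keeps document order
# and duplicate URLs, where a Set would discard the duplicates.
class OrderedLinkGrabber
  include REXML::StreamListener
  attr_reader :links

  def initialize(interesting_tags = { 'a' => %w{href}, 'img' => %w{src} })
    @tags = interesting_tags
    @links = []
  end

  def tag_start(name, attrs)
    (@tags[name] || []).each do |uri_attr|
      @links << attrs[uri_attr] if attrs[uri_attr]
    end
  end

  def parse(text)
    REXML::Document.parse_stream(text, self)
    self
  end
end

# Made-up, well-formed sample markup.
sample = <<~HTML
  <p>
    <a href="http://www.example.com/">Home</a>
    <img src="logo.png" />
    <a href="http://www.example.com/">Home again</a>
  </p>
HTML

ordered = OrderedLinkGrabber.new.parse(sample).links
# => ["http://www.example.com/", "logo.png", "http://www.example.com/"]
```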

The LinkGrabber solution will not find URLs in the text portions of the document, but it will find relative URLs. Of course, you still need to know how to turn relative URLs into absolute URLs. If the document has a <base> tag, you can use that. Otherwise, the base depends on the original URL of the document.
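
To make the resolution step concrete, the standard library's URI.join combines a base URL with a relative reference. The base URL and link values below are made up for the illustration; in practice the base would come from a <base> tag or from the address the page was fetched from.

```ruby
require 'uri'

# Assumed URL of the document that contained the links.
base = 'http://www.example.com/blog/index.html'

links = %w{post.html ../about.html /images/logo.png http://other.example.org/}
absolute = links.map { |link| URI.join(base, link).to_s }
# => ["http://www.example.com/blog/post.html",
#     "http://www.example.com/about.html",
#     "http://www.example.com/images/logo.png",
#     "http://other.example.org/"]
```

Links that are already absolute pass through unchanged, so it is safe to run every collected link through the same call.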

Here's a subclass of LinkGrabber that changes relative links to absolute links if possible. Since it uses URI.join, which returns a URI object, your set will end up containing URI objects instead of strings:

	class AbsoluteLinkGrabber < LinkGrabber
	  def initialize(original_url = nil,
	                 interesting_tags = {'a' => %w{href}, 'img' => %w{src}}.freeze)
	    super(interesting_tags)
	    @base = original_url
	  end

	  def tag_start(name, attrs)
	    # A <base> tag overrides the URL passed to the constructor.
	    if name == 'base'
	      @base = attrs['href']
	    end
	    super
	  end

	  def parse(text)
	    super
	    # If we know of a base URL by the end of the document, use it to
	    # change all relative URLs to absolute URLs.
	    @links.collect! { |l| URI.join(@base, l) } if @base
	  end
	end

If you want to use the parsing solution, but the web page has invalid HTML that chokes the REXML parsers (which is quite likely), try the techniques mentioned in Recipe 11.5.

Almost 20 HTML tags can have URLs in one or more of their attributes. If you want to collect every URL mentioned in an appropriate part of a web page, here's a big map you can pass in to the constructor of LinkGrabber or AbsoluteLinkGrabber:

	URL_LOCATIONS = { 'a' => %w{href},
	                  'area' => %w{href},
	                  'applet' => %w{codebase},
	                  'base' => %w{href},
	                  'blockquote' => %w{cite},
	                  'body' => %w{background},
	                  'del' => %w{cite},
	                  'form' => %w{action},
	                  'frame' => %w{src longdesc},
	                  'head' => %w{profile},
	                  'iframe' => %w{src longdesc},
	                  'input' => %w{src usemap},
	                  'img' => %w{src longdesc usemap},
	                  'ins' => %w{cite},
	                  'link' => %w{href},
	                  'object' => %w{classid codebase data archive usemap},
	                  'q' => %w{cite},
	                  'script' => %w{src}}.freeze
