Further Securing Your Feeds | Professional Web APIs with PHP. eBay, Google, PayPal, Amazon, FedEx, Plus Web Feeds

In the previous examples, all recorded data was completely cleaned by replacing all HTML special characters (such as < and >) with their encoded counterparts (such as &lt; and &gt;, respectively), thus ensuring that any inline HTML is displayed to the end user, rather than interpreted by the user agent. Taking some action is imperative, because every cross-site scripting (XSS) attack that can be performed via data entry on a form can also be performed by simply providing a feed to be consumed.

Link Utility Functions

These functions will come in useful for handling tags that contain links, which may need to be rewritten entirely or modified slightly to be contained within your feed.

Strip URLs of Filenames

Using parse_url() makes it easy to obtain the path and filename portion of a URL; however, splitting the filename off to retrieve only the path component isn't quite as easy. At first glance the dirname() function seems appropriate because it accomplishes basically the same task. However, it always treats the last portion of the string as a filename and strips it, even when the string ends with a trailing slash, and that assumption doesn't generally hold true for URLs:

 function getPathOnly($path)
 {
   if (substr($path, -1, 1) == "/")
   {
     return $path;

If the path in question ends with a forward slash, there is no filename present and the path may be returned as is:

   } else
   {
     $pathComponents = explode("/", $path);
     $count = count($pathComponents);
     $last = $pathComponents[$count - 1];
     if (substr_count($last, ".") > 0)
     {
       array_pop($pathComponents);
     }

Explode the path using the forward slash into an array, and check the last element of the array (and hence the path). If it contains a period, assume it is a filename and pop it off the array to remove it. If no period is found, assume it is a directory and leave the last element as is:

     $final = implode("/", $pathComponents);
     return $final . "/";
   }
 }

Put the path back together using the forward slash as glue, and finally, append a forward slash to the end to return a properly formatted path.
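Assembled in one piece, the function can be tried against a few sample paths. This is only a sketch of the fragments above joined together; the sample paths are made up for illustration, and the last line shows the dirname() behavior that motivated writing a separate function.

```php
<?php
function getPathOnly($path)
{
    // A trailing slash means there is no filename to strip.
    if (substr($path, -1, 1) == "/") {
        return $path;
    }
    $pathComponents = explode("/", $path);
    $last = $pathComponents[count($pathComponents) - 1];
    // A period in the last element is taken to mean "this is a filename".
    if (substr_count($last, ".") > 0) {
        array_pop($pathComponents);
    }
    return implode("/", $pathComponents) . "/";
}

echo getPathOnly("/blog/feed.php"), "\n";  // /blog/
echo getPathOnly("/blog/archive"), "\n";   // /blog/archive/
echo getPathOnly("/blog/"), "\n";          // /blog/
echo dirname("/blog/archive/"), "\n";      // /blog (the directory is lost)
```

Keep in mind that the period test is a heuristic: a directory named something like v1.2 would be mistaken for a filename.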

Combine URLs

Many bloggers only consider their own website when blogging, not their feed and how it might be used. As such, you will occasionally see relative URLs used in images, or anchor links found in a feed. Obviously, when presenting the feed to users on your site these links won't work, which is far from desired. What to do? Ideally, you somehow need to be able to replace the relative URLs with an absolute URL, which should allow your aggregator to retrieve the images just fine.

The following function takes two URLs, the URL from which the feed was retrieved and the URL found in the content of the feed, and returns an absolute URL. This absolute URL may just be the URL found within the content of the feed, or some work may be done to interpret the link:

 function relativeToAbsolute($sourceURL, $link)
 {
   $sup = parse_url($sourceURL);

relativeToAbsolute() is called with the source URL and the link in question. The source URL is then divided into parts with parse_url(). The source URL needs to be worked with because it may include a script name (such as http://example.org/feed.php) or even query elements (http://example.org/feed.php?format=rss&sort=desc), which need to be removed. This is where the previous function comes in.

   if (!isset($sup['scheme']))
   {
     $sourceURL = "http://" . $sourceURL;
     $sup = parse_url($sourceURL);
   }

If the source URL didn't contain a scheme (ftp://, http://, https://), prepend the http scheme and re-parse the URL. This needs to be done because running parse_url() on URLs without a scheme yields undesirable results:
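The "undesirable results" are easy to see by parsing the same address with and without a scheme. The snippet below is a small illustration using a made-up address; without the scheme, parse_url() has no way to tell the host name apart from a relative path.

```php
<?php
$withoutScheme = parse_url("example.org/feed.php?format=rss");
$withScheme    = parse_url("http://example.org/feed.php?format=rss");

// Without a scheme there is no 'host' key at all;
// the host name ends up glued to the front of 'path'.
var_dump(isset($withoutScheme['host']));   // bool(false)
echo $withoutScheme['path'], "\n";         // example.org/feed.php

// With the scheme in place, host and path are separated as expected.
echo $withScheme['host'], "\n";            // example.org
echo $withScheme['path'], "\n";            // /feed.php
```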

 $sourceURL = $sup['scheme'] . "://" . $sup['host'] . getPathOnly($sup['path']);

Using the scheme and host from parse_url, combined with the path run through the previous function, should result in a nice clean URL to work with, with a trailing slash.

 $start = substr($link, 0, 1);

The first character of the passed link will be used a few times, so it is saved.

   if ($start == '.')
   {
     if (substr($link, 0, 2) == "./")
     {
       $final = $sourceURL . substr($link, 2);

If the link starts with a period, and more specifically with ./, the absolute URL for the link in question is simply the source URL followed by the link with the leading ./ removed.

     } else if (substr($link, 0, 3) == "../")
     {
       $sup = parse_url($sourceURL);
       $pathParts = explode("/", $sup['path']);
       array_pop($pathParts);
       while ((substr($link, 0, 3) == "../") && (count($pathParts) > 0))
       {
         array_pop($pathParts);
         $link = substr($link, 3);
       }
       $final = $sup['scheme'] . "://" . $sup['host'] . implode("/", $pathParts) . "/" . $link;

If the link starts with ../, the source URL must be traversed upwards. The source URL is again parsed into its parts, and the path is exploded into an array. Because the path has leading and trailing slashes, the first and last array elements are empty; this doesn't matter when implode() is called, but the empty trailing element must be removed with array_pop() before a content-filled element can be popped off the end. Placing the logic that traverses one directory up and strips the ../ from the link inside a loop allows relative links such as ../../logo.png to be processed.

     } else
     {
       $final = $sourceURL . $link;
     }

If the link begins with a period, but not ./ or ../, it is assumed that the period is merely part of the filename, and the final URL is created as such.

   } else if ($start == "/")
   {
     $final = $sup['scheme'] . "://" . $sup['host'] . $link;

If the link begins with a leading slash (/), the final URL is simply the scheme, the host, and the link appended together. In this case the leading slash in the link is needed.

   } else if (substr_count($link, "/") == 0)
   {
     $final = $sourceURL . $link;

If the link contains zero forward slashes, it is assumed to be the filename (for example, logo.png), and the final URL can be created merely by appending the link to the source URL.

   } else
   {
     $final = $link;
   }
   return $final;
 }

Finally, if none of the other situations has been appropriate, the link is assumed to be complete on its own and is not touched, and the resultant string from whatever operation was performed is returned.
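Because the function is presented in pieces, it may help to see it in one place and run it against each kind of link. The following is a condensed sketch of the same logic, not a replacement for it: the branches are flattened into early returns, and the fallback to "/" when the source URL has no path is an assumption added here. The addresses are made up.

```php
<?php
function getPathOnly($path)
{
    if (substr($path, -1, 1) == "/") {
        return $path;
    }
    $parts = explode("/", $path);
    if (substr_count(end($parts), ".") > 0) {
        array_pop($parts);                   // last element looks like a filename
    }
    return implode("/", $parts) . "/";
}

function relativeToAbsolute($sourceURL, $link)
{
    $sup = parse_url($sourceURL);
    if (!isset($sup['scheme'])) {
        $sup = parse_url("http://" . $sourceURL);
    }
    // Assumption: a source URL with no path is treated as "/".
    $path = getPathOnly(isset($sup['path']) ? $sup['path'] : "/");
    $root = $sup['scheme'] . "://" . $sup['host'];
    $sourceURL = $root . $path;

    if (substr($link, 0, 2) == "./") {
        return $sourceURL . substr($link, 2);
    }
    if (substr($link, 0, 3) == "../") {
        $pathParts = explode("/", $path);
        array_pop($pathParts);               // drop the empty trailing element
        while (substr($link, 0, 3) == "../" && count($pathParts) > 0) {
            array_pop($pathParts);           // move up one directory
            $link = substr($link, 3);
        }
        return $root . implode("/", $pathParts) . "/" . $link;
    }
    if (substr($link, 0, 1) == "/") {
        return $root . $link;
    }
    if (substr($link, 0, 1) == "." || substr_count($link, "/") == 0) {
        return $sourceURL . $link;
    }
    return $link;                            // assumed to be complete already
}

$feed = "http://example.org/blog/feed.php?format=rss";
echo relativeToAbsolute($feed, "./logo.png"), "\n";     // http://example.org/blog/logo.png
echo relativeToAbsolute($feed, "../logo.png"), "\n";    // http://example.org/logo.png
echo relativeToAbsolute($feed, "/img/logo.png"), "\n";  // http://example.org/img/logo.png
echo relativeToAbsolute($feed, "logo.png"), "\n";       // http://example.org/blog/logo.png
```

One consequence of the final branch is worth noting: a relative link that contains a slash but no leading ./ (such as images/logo.png) is returned untouched, just like a full URL.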

Image Tags

The image tag is often considered rather benign and thus safe to allow through and show to the end user, but this couldn't be further from the truth. The request that a browser generates when you click a link is identical to the request it generates when it attempts to load an image for you. Consider the following HTML document:

 <html>
 <head>
   <title>Example Page</title>
 </head>
 <body>
   <img src="http://www.google.ca/search?q=Paul+Reinheimer">
   <a href="http://www.google.ca/search?q=Paul+Reinheimer">Search Again!</a>
 </body>
 </html>

When the document is loaded, my browser generates the following request:

 GET /consume/security.html HTTP/1.1
 Host: example.preinheimer.com
 User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0
 Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
 Accept-Language: en-us,en;q=0.5
 Accept-Encoding: gzip,deflate
 Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
 Keep-Alive: 300
 Connection: keep-alive

Then, immediately thereafter (without clicking anything), it generates this request:

 GET /search?q=Paul+Reinheimer HTTP/1.1
 Host: www.google.ca
 User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0
 Accept: image/png,*/*;q=0.5
 Accept-Language: en-us,en;q=0.5
 Accept-Encoding: gzip,deflate
 Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
 Keep-Alive: 300
 Connection: keep-alive
 Referer: http://example.preinheimer.com/consume/security.html

Clicking the link on the page generates this request, and the Google search results page loads:

 GET /search?q=Paul+Reinheimer HTTP/1.1
 Host: www.google.ca
 User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0
 Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
 Accept-Language: en-us,en;q=0.5
 Accept-Encoding: gzip,deflate
 Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
 Keep-Alive: 300
 Connection: keep-alive
 Referer: http://example.preinheimer.com/consume/security.html

The only difference between the last two requests is in the Accept line. The request generated by the image tag indicates that it is expecting a png, but will also accept anything else (*/*). As far as Google is concerned, the request from the image tag is no different from the request from the link, and as such, it will respond to both in the same way (by sending the search results page). In this case, nothing really happened. Requesting search results is pretty benign (though in some corporate environments, sending out search requests on, shall we say, more colorful topics may get someone in trouble); however, the same technique could be used in hundreds of other places.

For example:

 <img src="http://my.stockbroker.com/buyshares?ticker=SCO&lot=2000">

The user's browser makes a request to a stockbroker, purchasing 2,000 shares of SCO — a charitable activity, but not one someone is likely to want to engage in while reading your feed aggregation site.

 <img src="http://www.amazon.com/oneclickbuy?isbn=0764589547&number=10">

A more useful image tag, this one orders multiple copies of this book from Amazon.

 <img src="http://my.florist.com/sendflowers?product=deluxe&address=home&from=Liz">

Finally, this image tag orders flowers to be sent to one's home, from Liz. That is not terrible, unless your significant other sees them and her name isn't Liz.

Note

All of these URLs are completely fictional, and are merely presented to try and make clear the dangers presented by these normally ignored tags.

These attacks may seem humorous, but they are a lot closer to reality than you might think. "Advances" such as remembering login information and single-click purchases are quite convenient, but are what allow attacks like these to take place. The original architects of the HTTP specification considered this (and things like this) and as such, GET requests are to be considered "safe," in that they should not generate any lasting action. Lasting actions, such as purchasing shares or ordering books or flowers, should be done with POST requests, which cannot be created with the same ease.

Many developers and development teams ignore these safety precautions, however, and allow "unsafe" transactions to occur over GET requests. Some of these vulnerabilities may stem from improper use of constructs such as $_REQUEST in PHP, which allows developers to easily access passed information from either request type. The problem is, however, that it doesn't specify from which request type the information came. It is usually safer to explicitly use either $_GET or $_POST for this reason.
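As a rough illustration of that advice, a handler for a lasting action can insist on POST and read its parameters only from the POST data. The function name handleOrder() and the product parameter are hypothetical, invented for this sketch; the request method and data are passed in as arguments so the function can be exercised without a web server.

```php
<?php
// Hypothetical handler for a lasting action (placing an order).
// $method would normally be $_SERVER['REQUEST_METHOD'] and $post would be $_POST.
function handleOrder($method, array $post)
{
    // An <img> tag can only ever trigger a GET, so refuse anything but POST.
    if (strtoupper($method) !== 'POST') {
        return 'rejected: lasting actions require POST';
    }
    // Read from the POST data explicitly rather than from $_REQUEST,
    // so a value smuggled into the query string is never used.
    if (!isset($post['product'])) {
        return 'rejected: missing product';
    }
    return 'ordered: ' . $post['product'];
}

echo handleOrder('GET', array()), "\n";
echo handleOrder('POST', array('product' => 'deluxe')), "\n";
```

In a real script the call would look something like handleOrder($_SERVER['REQUEST_METHOD'], $_POST).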

In the majority of circumstances I feel it is often permissible to strip all image tags from a given feed, and provide either a link to the images (URL shown) or just simply a link back to the original source. However, you may disagree. The next few sections discuss a few of the available options.

Retrieve All of the Image Links within a String

Retrieving all of the image links within the string (feed) isn't too difficult, with some regular expression (regex) fun. The following expression can be used to process HTML image tags, and it is also used, in this case, to take the opportunity to provide a crash course in regular expressions:

 $processedFeed = preg_replace('/<img\s+.*?src="([^\"\' >]*)"\s?(width="([0-9]*)")?\s?(height="([0-9]*)")?[^>]*>/ie',
              "cleanImage('$sourceURL', '\\0','\\1','\\2','\\3','\\4','\\5')",
              $feed);

The function preg_replace() will perform a regex search and replace on the given string. The form is preg_replace(pattern, replacement, subject, [limit]). So the pattern for this regex is as follows:

 /<img\s+.*?src="([^\"\' >]*)"\s?(width="([0-9]*)")?\s?(height="([0-9]*)")?[^>]*>/ie

The first forward slash and last forward slash are the delimiters, setting the left and right limits of the pattern. The last ie indicates that this should be a case-insensitive match, and that the compiler should execute the replacement as PHP, so you can call a function (in this case, cleanImage()).

This regex pattern uses several special characters, as outlined in the following table.

Character	Meaning
\s	This matches white space, like spaces or tabs.
.	This matches any single character except line breaks.
*	This applies to the previous character. It may be repeated 0 or more times.
?	This applies to the previous character. It may be present 0 or 1 times.
+	This applies to the previous item. It may be present 1 or more times.
()	This groups a segment of the pattern together. This affects what is returned, and can be used to mark a portion of the expression as optional, repeating, and so on.
[]	The contents define a character class. [0-9] will match any digit from zero to nine, [a-zA-Z0-9] will match any single lowercase letter, uppercase letter, or digit, and [a-zA-Z0-9]* will match any number of them.
^	When this is the first character of a character class, it negates everything within the class, so [^0-9] will match anything that is not a digit.

So, with this newfound knowledge in mind, you can examine the pattern:

<img\s+.*? : This seeks out <img, followed by white space (there must be at least one white space character, followed optionally by any other character any number of times).
src="([^\"\' >]*)"\s? : This seeks out the source portion of the tag. src=" is sought, followed by anything but a single quote, a double quote, a space, or a right angle bracket, repeated zero or more times, and then the closing double quote. All of that can optionally be followed by a white space character.
(width="([0-9]*)")?\s? : This seeks out an optional width parameter, in which any number of digits are sought. The width parameter may optionally be followed by a white space character.
(height="([0-9]*)")? : This seeks out an optional height parameter in the same manner.
[^>]*> : This matches anything but a right angle bracket any number of times, followed by the right angle bracket that closes the image tag.

This regex pattern will return up to six separate items, the entire match, followed by whatever matched inside each set of parentheses. Running the regex on the following string:

 Hi <img src="./logo.png" width="23" height="66"> Logo

results in the following:

 0 <img src=\"./logo.png\" width=\"23\" height=\"66\">
 1 ./logo.png
 2 width=\"23\"
 3 23
 4 height=\"66\"
 5 66

Each of those six returned elements is passed to the cleanImage() function, along with $sourceURL (a separate variable that is not related to the regex). Whatever cleanImage() returns will be put in place of the entire match.
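The capture groups can also be inspected on their own with preg_match(), which is a convenient way to experiment with a pattern before wiring it into a replacement. This is a small sketch using the same pattern without the e modifier; when run this way the quotes come back without backslashes, because that escaping is something the e modifier applies to the backreferences it substitutes.

```php
<?php
$pattern = '/<img\s+.*?src="([^\"\' >]*)"\s?(width="([0-9]*)")?\s?(height="([0-9]*)")?[^>]*>/i';
$subject = 'Hi <img src="./logo.png" width="23" height="66"> Logo';

// preg_match() fills $matches with the entire match (index 0),
// followed by one entry per set of parentheses.
preg_match($pattern, $subject, $matches);

foreach ($matches as $index => $text) {
    echo $index, ' ', $text, "\n";
}
```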

 function cleanImage($sourceURL, $entireMatch, $link, $widthE, $w, $heightE, $h)
 {
   $link = relativeToAbsolute($sourceURL, $link);
   return "<img src=\"$link\" height=\"$h\" width=\"$w\">";
 }

While the function initially doesn't seem to do much, because it merely replaces one properly formatted image tag with another, consider the following image tag:

 <img src="./logo.png" width="178" height="60" border="0" alt=" Logo" onClick="window.alert('Click');">

If a string containing that image tag were run through this function, the JavaScript (remember, other more malicious code could just as easily take its place) would be stripped, as would the border and alt attributes, and the src attribute would be set to the absolute URL, based on the URL the feed was received from.
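One practical caveat: the e modifier was deprecated in PHP 5.5 and removed in PHP 7.0, so on current versions of PHP the same idea has to be expressed with preg_replace_callback(). The following is a minimal sketch of that approach, assuming PHP 7 or later. The function name cleanImageTags() is hypothetical, and the URL resolver is passed in as a callable so that the example stays self-contained; in the context of this chapter it would wrap relativeToAbsolute().

```php
<?php
// Hypothetical helper: rebuilds every <img> tag from only its src, width, and height.
function cleanImageTags($feed, callable $resolve)
{
    $pattern = '/<img\s+.*?src="([^"\' >]*)"\s?(width="([0-9]*)")?\s?(height="([0-9]*)")?[^>]*>/i';

    return preg_replace_callback($pattern, function (array $m) use ($resolve) {
        // Encode the resolved URL so it cannot break out of the attribute.
        $link = htmlspecialchars($resolve($m[1]), ENT_QUOTES);
        // Unmatched trailing groups are absent from $m, hence the defaults.
        $w = $m[3] ?? '';
        $h = $m[5] ?? '';
        return "<img src=\"$link\" height=\"$h\" width=\"$w\">";
    }, $feed);
}

// Stand-in resolver for the example only; it handles nothing but a leading "./".
$resolve = function ($link) {
    return 'http://example.org/blog/' . preg_replace('#^\./#', '', $link);
};

$tag = '<img src="./logo.png" width="178" height="60" border="0" alt=" Logo" onClick="window.alert(\'Click\');">';
echo cleanImageTags($tag, $resolve), "\n";
// <img src="http://example.org/blog/logo.png" height="60" width="178">
```

The callback receives the matches as a plain array, so there is no string of PHP code to assemble and no quoting to worry about.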

This brief introduction to regex should be sufficient for the following examples to make sense. For a more in-depth look at regex (the more you learn about regex the more powerful it becomes and the more you want to learn), take a look at this excellent tutorial online: www.regular-expressions.info/tutorial.html.

Replace Images with Links

Replacing inline images with links to the image in question is trivial with the previous example:

 function replaceImages($sourceURL, $entireMatch, $link, $widthE, $w, $heightE, $h)
 {
   $link = relativeToAbsolute($sourceURL, $link);
   return "<a href=\"$link\" title=\"Inline Image\">(image)</a>";
 }

Pro

Lessened XSS vulnerability.

Cons

Removing images may destroy the layout of the text.

Images very pertinent to the story aren't immediately available.

Retrieve and Serve the Image in Question

Upon aggregation of any feed, the content is examined for embedded <img> tags. When found, the images in question are retrieved and saved on the server, and the image tags are rewritten to point to the new local location:

 function retrieveImages($sourceURL, $entireMatch, $link, $widthE, $w, $heightE, $h)
 {
   $localSavePath = "/www/domains/feedimages.preinheimer.com/";
   $localImageURL = "http://feedimages.preinheimer.com/";

The destination path for retrieved images, as well as the eventual URL for those images, is specified.

   $link = relativeToAbsolute($sourceURL, $link);
   $image = file_get_contents($link);

An absolute URL for the image is created (or merely confirmed) and the image is retrieved.

   $filename = md5($link);
   $filepath = $localSavePath . $filename;

A unique filename for the file is generated; if the remote filename was used (and feeds from multiple locations were aggregated), collisions would be likely. This also saves you from any sort of filename filtering — the MD5 string will be safe. The file is saved to the specified directory.

   file_put_contents($filepath, $image);
   $image = null;
   @list($lwidth, $lheight, $ltype, $lattr) = getimagesize($filepath);

The image is saved to disk, and the variable that contained the image is destroyed because it is no longer needed. Variables with information regarding the image are populated with getimagesize().

   if ($lwidth * $lheight == 0)
   {
     return "";

If either the width or height specified by getimagesize() is not present, the image is invalid (as would be the case in a cross-site scripting attack) and an empty string is returned. This effectively removes the image tag from the feed.

   } else
   {
     if ($w < 1)
     {
       $w = $lwidth;
     }
     if ($h < 1)
     {
       $h = $lheight;
     }
     return "<img src=\"" . $localImageURL . $filename . "\" width=\"$w\" height=\"$h\" alt=\" Original Source: $link\">";
   }
 }

If either the width or height is not present in the original feed, it is set to the correct value from getimagesize(). This should help speed page loads for users, because their browsers can make intelligent decisions about page layout. The values from getimagesize() are not simply used in every case, because many people still use the height and width attributes to stretch images (either smaller or larger), even though there are far more effective alternatives. The alt attribute is set to the original full URL of the image in question, giving credit where credit is due.
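The validity check at the heart of this function can be tried without touching the network or the disk. The following is a minimal sketch, assuming PHP 5.4 or later for getimagesizefromstring(); looksLikeImage() is a hypothetical helper name, and the sample data is a one-pixel GIF embedded as base64.

```php
<?php
// Hypothetical helper: true only when the bytes parse as an image
// with a non-zero width and height.
function looksLikeImage($bytes)
{
    // @ suppresses any notice raised for unreadable data,
    // in the same spirit as the @list() call above.
    $info = @getimagesizefromstring($bytes);
    return $info !== false && $info[0] * $info[1] > 0;
}

// A one-pixel GIF, embedded so the example needs no external file.
$gif = base64_decode('R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7');

var_dump(looksLikeImage($gif));                          // bool(true)
var_dump(looksLikeImage('<html>search results</html>')); // bool(false)
```

A response that is really an HTML page, as in the search example earlier, fails the check and the tag would be dropped.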

Pros

No XSS vulnerabilities.

Predictable image availability.

Images will show even for users who have instructed their browser not to load images from external sources.

Images remain available even when the originating site blocks remote image loads once it notices them.

Cons

There could be copyright issues. The owner of the image may not take kindly to your display of it outside its original source (this person may or may not be the one providing the feed).

You may find you end up with images of questionable moral values. Depending on the content of the image, you or your users may not desire to have it hosted or displayed on your site.

Some methods employed to prevent cross-site image linking may also prevent your script from retrieving the image in question, though it is quite likely that your users would have been able to load the image themselves if this is the case.

Retrieve Image Once, and Confirm It Is an Image

Download the image once to the server (when the feed is aggregated) and use a tool such as ImageMagick to examine the image. If it loads properly, it is likely safe to assume that it is not in fact an XSS attack.

Updating the previous script to point to the original URL and deleting the local file are trivial.

   unlink($filepath);
   return "<img src=\"" . $link . "\" width=\"$w\" height=\"$h\">";

unlink() is PHP's delete function; it removes the local copy of the file, and the function then returns an image tag that points at the correct (absolute) original image source. Note that this will still remove broken image links from feeds, and will set correct width and height values if they are not present.

Pros

Lessened XSS vulnerability.

Images don't have to be served locally (no bandwidth usage, fewer copyright concerns).

Con

An attacker could examine the request headers for the image to see whether the request comes from a known aggregator (or lacks a user agent or a referrer, as an automated download might) and, if so, return a genuine image. Otherwise, the attacker issues a header redirect to the targeted XSS site. This is definitely more complicated, but not outside the realm of possibility.

Link Tags

Allowing use of the <a href> tags will probably be required for any major aggregation, because the links themselves may often provide as much "content" as the rest of the feed. There are a few things to consider when allowing these tags to be used.

First, consider the following HTML snippet:

 <a href="home.php" onclick="window.open(this.href, 'home', 'width=480,height=480,scrollbars=yes'); return false;">Comments (0)</a>

The JavaScript possibilities within a link tag are endless. In short, you don't want them.

Second, titles can be misleading. Just because a link says it points somewhere doesn't mean it does. Consider this:

 ... Today we got a new puppy <a href="http://www.playboy.com">rover</a> he is cute!

Nothing tragic; the user merely ends up at a site different from the one they expected, but keep in mind the URL doesn't have to be that benign. Consider the example URLs provided in the image example. Yes, in this case the user would be aware that shares were purchased, books or flowers were ordered, or whatever, but in many cases canceling such transactions has repercussions (stock trades), or may in fact be impossible (some forum packages do not allow retraction of posts).

Although your technical users likely glance at the URL shown in the status bar of their browser before clicking on a link (thus immunizing themselves against such an attack), the less technically minded rarely do, and as such, these things should be considered.

The regex pattern used to pull links out of the feed is quite similar to the one used for images.

 $teststring = preg_replace('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?\s?(title=[\"\']?([^\"\'>]*)[\"\']?)?[^>]*>(.*?)<\/a>/ie',
              "cleanHREF('$sourceURL', '\\1', '\\3', '\\4')",
              $teststring);

The returns for this regex are as follows:

0 — Entire match
1 — URL in question
2 — Title match
3 — Title in itself
4 — The name for the link

Note that this regex is a little more flexible than the previous example. Rather than specifying that double quotes must encase all of the variables, [\"\']? is used, which allows either type of quote to be used, or none at all.

Note

While writing the examples for this book, "The Regex Coach" was a great tool. It shows the results to specific regex patterns against given text, highlights specific returns, and so on. If you are going to be using regex for long, I would recommend using this free tool.

The corresponding function returns the cleaned link:

 function cleanHREF($sourceURL, $link, $title, $name)
 {
   $link = relativeToAbsolute($sourceURL, $link);
   return "<a href=\"$link\" title=\"$title\">$name</a>";
 }

To deal with the possible issue of a misleading title, a small substitution can be made:

 function cleanAndDisplayHREF($sourceURL, $link, $title, $name)
 {
   $link = relativeToAbsolute($sourceURL, $link);
   return "<a href=\"$link\" title=\"$title\">$name</a> ($link)";
 }

Although these methods effectively deal with additional JavaScript elements that may be added to the tag, JavaScript can also be added to the href attribute. There are a few options to deal with this possibility.
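One such option is to accept only links whose scheme appears on a short list of allowed schemes, and to drop or neutralize everything else. The following is a sketch of that idea rather than a complete defense; isSafeHref() is a hypothetical helper name, and the default list of http and https is an assumption you may want to widen (to mailto, for instance).

```php
<?php
// Hypothetical helper: accepts a link only if its scheme is explicitly allowed.
function isSafeHref($url, array $allowed = array('http', 'https'))
{
    // parse_url() returns null when no scheme is present and false on
    // badly malformed input; both are rejected by the is_string() test.
    $scheme = parse_url(trim($url), PHP_URL_SCHEME);
    return is_string($scheme) && in_array(strtolower($scheme), $allowed, true);
}

var_dump(isSafeHref('https://example.org/post'));  // bool(true)
var_dump(isSafeHref('javascript:alert(1)'));       // bool(false)
var_dump(isSafeHref('home.php'));                  // bool(false)
```

Because relative links carry no scheme, a check like this belongs after the call to relativeToAbsolute(), once every legitimate link has been made absolute.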

IFrames

It really seems like the concepts of IFrames and feed syndication are in opposition to each other. A feed seeks to present all of the relevant information cohesively in one spot, whereas an IFrame functions to reference external information and present it internally. When presented with an IFrame in a feed to be consumed, you have a few attractive options:

Strip the IFrame. IFrames were designed with backward compatibility in mind, so they should include content between the frame tags that can be substituted when the IFrame itself is not loaded.
Retrieve the content of the IFrame and display it inline.
Provide a link to the content in question.

Fortunately, IFrames in feeds remain quite rare. If you notice that one of the feeds you are consuming is using the IFrame tag, the previous examples should be easily modified to meet your needs.
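As an illustration of the third option, the image-to-link approach can be adapted to IFrames with only a change of pattern. This is a sketch under a few assumptions: replaceIframes() is a hypothetical name, the pattern expects a src attribute and a closing </iframe> tag, and the link text is arbitrary. It uses preg_replace_callback(), which is the form this kind of replacement takes on current versions of PHP.

```php
<?php
// Hypothetical helper: swaps each <iframe> for a plain link to its source.
function replaceIframes($feed)
{
    // The s modifier lets .*? span line breaks inside the frame's fallback content.
    $pattern = '/<iframe\s+[^>]*?src=["\']?([^"\' >]*)["\']?[^>]*>.*?<\/iframe>/is';

    return preg_replace_callback($pattern, function (array $m) {
        // Encode the URL so it cannot break out of the href attribute.
        $url = htmlspecialchars($m[1], ENT_QUOTES);
        return "<a href=\"$url\" title=\"Embedded frame\">(embedded content)</a>";
    }, $feed);
}

echo replaceIframes('<p>See: <iframe src="http://example.org/widget" width="300">fallback</iframe></p>'), "\n";
// <p>See: <a href="http://example.org/widget" title="Embedded frame">(embedded content)</a></p>
```

The resulting links deserve the same scrutiny as any other link in the feed.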

Formatting Tags

Formatting tags include h1, h2, ..., h5, p, b, i, and so on.

These tags in themselves rarely present any security risk. A best practice would be to use htmlentities() on the feed, then use str_replace() on the feed to replace the encoded versions of the allowed tags with the tags themselves. For example, take the following string:

 <h1>Hi!</h1> my name is <b>paul</b>

Run htmlentities() on it, and you receive this:

 &lt;h1&gt;Hi!&lt;/h1&gt; my name is &lt;b&gt;paul&lt;/b&gt;

Use str_replace() to replace &lt;b&gt; and &lt;/b&gt; with <b> and </b>, respectively, and you have this:

 &lt;h1&gt;Hi!&lt;/h1&gt; my name is <b>paul</b>

The allowed tags make it through, and the other ones just get left behind.
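The two steps fit comfortably into one small function. The following is a minimal sketch of the technique; allowFormattingTags() is a hypothetical name, and the default list of allowed tags is only an example. Note that only bare tags survive: a tag carrying attributes (and therefore possibly an event handler) stays encoded, which is the point of the exercise.

```php
<?php
// Hypothetical helper: encode everything, then restore only bare allowed tags.
function allowFormattingTags($html, array $tags = array('b', 'i', 'p'))
{
    $safe = htmlentities($html, ENT_QUOTES, 'UTF-8');
    foreach ($tags as $tag) {
        $safe = str_replace(
            array("&lt;$tag&gt;", "&lt;/$tag&gt;"),
            array("<$tag>", "</$tag>"),
            $safe
        );
    }
    return $safe;
}

echo allowFormattingTags('<h1>Hi!</h1> my name is <b>paul</b>', array('b')), "\n";
// &lt;h1&gt;Hi!&lt;/h1&gt; my name is <b>paul</b>
```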

That concludes the section on how you can increase the security of your feeds. Of course, doing things like ensuring the feed provider is a legitimate business, that the feeds are administered, or that they are simply well-known names can go a long way to ensuring you avoid the snares and pitfalls of using external content. Alas, security is not the only thing you need to worry about when obtaining information from other people or sites. Look at another consideration.