Retrieving and Storing the Feed | Professional Web APIs with PHP. eBay, Google, PayPal, Amazon, FedEx, Plus Web Feeds

This section looks at a more advanced script, which makes use of a database to store feeds. Obviously, the table used will need to reflect the structure of the feed.

A Basic Storage Script

First, you need to create a MySQL table that will be populated with the information from the feed when the aggregateFeeds.php script is run:

    CREATE TABLE `03_feed_raw` (
      `id` varchar(32) NOT NULL default '',
      `source` varchar(75) NOT NULL default '',
      `title` varchar(255) NOT NULL default '',
      `date` timestamp NOT NULL,
      `content` text NOT NULL,
      `link` varchar(255) NOT NULL default '',
      PRIMARY KEY  (`id`)
    ) ENGINE=MyISAM;

The ID field will contain an MD5 hash, which is 32 hex characters long. The source field will contain the URL of the feed in question. Title, date, content, and link will all come from the feed:

    <?php
      include ("../common_db.php");
      $request = "http://rss.news.yahoo.com/rss/software";
      $response = file_get_contents($request);
      $xml = simplexml_load_string($response);
      echo "Updated " . processRSSFeed($xml, $request) . " feeds";

The URL for the feed is declared and the feed is retrieved. The response is then parsed into a SimpleXML object and sent for processing, and the number of stories saved is printed (this script would most likely be run by a cron job or other timed construct, so detailed output isn't really required). Updating is done by a processRSSFeed() function, which looks like this:

    function processRSSFeed($xml, $source)
    {
      $updatedStories = 0;
      foreach($xml->channel->item AS $story)
      {
        if (saveFeed($story->guid, $source, $story->title, $story->pubDate,
          $story->description, $story->link) == 2)
        {
          break;
        }
        $updatedStories += 1;
      }
      return $updatedStories;
    }

processRSSFeed() takes the RSS feed as a SimpleXML object, along with the source URL of the feed. The foreach loop gives easy access to each item's child elements: rather than $xml->channel->item[#]->title, you can simply use $story->title. Each story is saved in turn, and the return value of saveFeed() indicates whether the story was added to the database (1) or replaced one already present (2). If the story was already present, you can stop processing the remaining items because it is likely that they are present too. The function returns the number of stories added before that point.

Note

This method assumes that the feed is provided in reverse chronological order (as is the standard), with the most recent additions posted at the top. It also assumes that any updates to previous stories will be re-seeded at the top of the feed, rather than updated in their current position (a standard reporting practice is to report any corrections or updates in the same manner as the original story). Depending on how the feed you are consuming operates, you may want to process the entire feed regardless.
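For a feed that does not follow that ordering, a variant that walks every item could look roughly like the sketch below. The name processEntireRSSFeed() and the $save callback are assumptions made for illustration; the callback stands in for saveFeed() (same arguments, same 1/2 return convention) so the sketch can run without a database.

```php
<?php
// Hypothetical variant of processRSSFeed() that never stops early.
// $save is any callable with saveFeed()'s argument list and return
// convention (1 = inserted, 2 = replaced an existing row).
function processEntireRSSFeed(SimpleXMLElement $xml, $source, callable $save)
{
    $counts = array('added' => 0, 'updated' => 0);
    foreach ($xml->channel->item as $story) {
        $result = $save(
            (string) $story->guid,
            $source,
            (string) $story->title,
            (string) $story->pubDate,
            (string) $story->description,
            (string) $story->link
        );
        if ($result == 2) {
            $counts['updated'] += 1;   // already stored: keep going anyway
        } elseif ($result == 1) {
            $counts['added'] += 1;
        }
    }
    return $counts;
}
```

Passing 'saveFeed' as the third argument would use the real storage function. The trade-off is one database write per item on every run, which is why the early break is attractive for well-ordered feeds.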

Finally, to actually save the feed to the database, use the following code:

    function saveFeed($guid, $source, $title, $date, $content, $link)
    {
      if (strlen($guid) > 0)
      {
        $pk = md5($source . $guid);
      }else
      {
        $pk = md5($source . $title);
      }

A primary key is generally a good idea when storing data in a database; in this case it is a good idea to create one of your own. The GUID could be used, but it isn't always provided, and although sites guarantee theirs to be unique on their own site, there are no claims of cross-site uniqueness (there are likely several home-brewed RSS feed providers out there with GUIDs starting at 1, incrementing as appropriate), so you prepend the source URL to the GUID. For feeds that do not provide the GUID field, the title is used instead; date or link would be another good choice. In either case, the primary key is the MD5 (a one-way hashing algorithm that generates a key 32 hex characters long) of the resultant string.

      $linkID = db_connect();
      $title = mysql_real_escape_string(strip_tags($title));
      $content = mysql_real_escape_string(strip_tags($content));
      $link = mysql_real_escape_string($link);
      $source = mysql_real_escape_string($source);

A connection is established to the database (mysql_real_escape_string() needs one), HTML tags are stripped from the title and content, and each string is escaped to guard against SQL injection attacks. The variables are then ready to be saved to the database.

Note

It is considered a best practice to escape all data to be saved to the database with the database-specific function, rather than simply using addslashes(). This ensures that every character the specific database requires is escaped: addslashes() handles only ', ", \, and the NUL byte (\x00), whereas mysql_real_escape_string() also escapes \n, \r, and \x1a and takes the connection's character set into account. Other databases have similar functionality.

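      $date = strtotime($date);
      // strtotime() returns false on failure (-1 before PHP 5.1)
      if ($date === false || $date == -1)
      {
        // fall back to the current UNIX timestamp, as FROM_UNIXTIME() expects
        $date = time();
      }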

To process the date, you rely on the strtotime() function. It makes every effort to interpret the date presented and returns a UNIX timestamp. Although the RSS 2.0 specification requires dates in a specific format (RFC 822), and Atom feeds use ISO 8601 style timestamps, using strtotime() is not only easier than writing your own parser, it also understands most other textual date formats. If, however, the format is not understood, or the date is simply not there, the current time is used instead.
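As a standalone aside, the date handling just described can be isolated into a small helper. This is only a sketch: the name parseFeedDate() is an assumption and is not part of the script shown here.

```php
<?php
// Hypothetical helper: turn a feed's date string into a UNIX timestamp,
// falling back to the current time when the string cannot be parsed.
function parseFeedDate($raw)
{
    $timestamp = strtotime(trim((string) $raw));
    // strtotime() returns false on failure (-1 before PHP 5.1).
    if ($timestamp === false || $timestamp == -1) {
        return time();
    }
    return $timestamp;
}

// Both spellings of the same instant yield the same timestamp.
$rssStyle  = parseFeedDate('Thu, 16 Jun 2005 21:33:27 GMT');   // RFC 822
$atomStyle = parseFeedDate('2005-06-16T21:33:27Z');            // ISO 8601
var_dump($rssStyle === $atomStyle);
```

Because both strings carry an explicit time zone, the result should not depend on the server's configured zone; date strings without one are interpreted in the default time zone.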

      $query = "REPLACE INTO 03_feed_raw
        (`id`, `source`, `title`, `date`, `content`, `link`)
        VALUES
        ('$pk', '$source', '$title', FROM_UNIXTIME('$date'), '$content', '$link')";
      return replaceQuery($query, $linkID);
    }
    ?>

The REPLACE INTO syntax in MySQL is a real timesaver in this case, though it only works because you have a primary key. If the query is run, and no existing record has the same primary key, it will insert the record, and mysql_affected_rows() will return 1. If, however, a record exists with that primary key, it will be deleted, a new record will be created with the information in the query, and mysql_affected_rows() will return 2.

If your database system doesn't support REPLACE INTO or an equivalent (such as MySQL's alternative INSERT ... ON DUPLICATE KEY UPDATE), you still have a few choices. You can check for an existing record with a SELECT query each time, updating it if it exists and creating it if it doesn't. Or you could look up the most recent date in your database and insert only the feed elements that came afterward.
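The SELECT-then-write option might be sketched as follows. Everything here is an assumption made to keep the example runnable on its own: it uses PDO with an in-memory SQLite database (so the pdo_sqlite extension is required), a table named feed_raw with a reduced column list, and a hypothetical function name. Placeholders take the place of manual escaping.

```php
<?php
// Sketch of the SELECT-then-write approach for databases without
// REPLACE INTO. Returns 1 for an insert and 2 for an update, mirroring
// the affected-row counts REPLACE INTO produces in MySQL.
function saveStoryPortable(PDO $db, $id, $title, $content)
{
    $check = $db->prepare('SELECT COUNT(*) FROM feed_raw WHERE id = ?');
    $check->execute(array($id));
    $exists = $check->fetchColumn() > 0;

    if ($exists) {
        $write = $db->prepare('UPDATE feed_raw SET title = ?, content = ? WHERE id = ?');
        $write->execute(array($title, $content, $id));
        return 2;
    }
    $write = $db->prepare('INSERT INTO feed_raw (title, content, id) VALUES (?, ?, ?)');
    $write->execute(array($title, $content, $id));
    return 1;
}

// Assumed setup: a throwaway in-memory database with a simplified table.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE feed_raw (id TEXT PRIMARY KEY, title TEXT, content TEXT)');
```

If several copies of the script could run at once, the check and the write would need to share a transaction, since another process could insert the row between the two statements.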

As mentioned earlier, this script was designed to be called by a cron job or another automated process (Windows Scheduled Tasks, for example). The $request variable could be turned into an array and iterated through to grab multiple feeds.
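One way to iterate over several feeds is sketched below. The function aggregateFeeds(), its parameters, and the idea of mapping each URL to a feed type are assumptions for illustration; the fetcher and processors are passed in as callables so that the real functions (file_get_contents(), processRSSFeed()) can be swapped for stubs.

```php
<?php
// Hypothetical driver. $feeds maps each feed URL to a type label,
// $fetch returns the raw XML for a URL (or false on failure), and
// $processors maps a type label to a function taking ($xml, $source).
function aggregateFeeds(array $feeds, callable $fetch, array $processors)
{
    $totals = array();
    foreach ($feeds as $url => $type) {
        $response = $fetch($url);
        $xml = ($response === false) ? false : simplexml_load_string($response);
        if ($xml === false || !isset($processors[$type])) {
            $totals[$url] = false;   // fetch, parse, or type lookup failed
            continue;
        }
        $totals[$url] = call_user_func($processors[$type], $xml, $url);
    }
    return $totals;
}
```

With the script above, a call might look like aggregateFeeds(array('http://rss.news.yahoo.com/rss/software' => 'rss'), 'file_get_contents', array('rss' => 'processRSSFeed')). A failed feed is recorded as false rather than aborting the run, so one unreachable site does not block the rest.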

Extending the Script to Include Atom Support

Extending the script to grab other feed types should be trivial. This function (in place of the preceding processRSSFeed() function) will grab the specified Atom feed and save it. This script was tested against the Google Blog (www.google.com/googleblog/atom.xml) where Google employees post on a semiregular basis.

Here is a snippet of Google's Blog for reference (trimmed for space):

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?>
    <feed xmlns="http://purl.org/atom/ns#" version="0.3" xml:lang="en-US">
      <link href="http://www.blogger.com/atom/10861780" rel="service.post" title="Google Blog" type="application/atom+xml"/>
      <link href="http://www.blogger.com/atom/10861780" rel="service.feed" title="Google Blog" type="application/atom+xml"/>
      <title mode="escaped" type="text/html">Google Blog</title>
      <tagline mode="escaped" type="text/html"></tagline>
      <link href="http://googleblog.blogspot.com" rel="alternate" title="Google Blog" type="text/html"/>
      <id>tag:blogger.com,1999:blog-10861780</id>
      <modified>2005-06-16T21:33:27Z</modified>
      <generator url="http://www.blogger.com/" version="5.15">Blogger</generator>
      <info mode="xml" type="text/html">
        <div xmlns="http://www.w3.org/1999/xhtml">This is an Atom formatted XML site feed. It is intended to be viewed in a Newsreader or syndicated to another site. Please visit the <a href="http://help.blogger.com/bin/answer.py?answer=697">Blogger Help</a> for more info.</div>
      </info>
      <entry xmlns="http://purl.org/atom/ns#">
        <link href="http://www.blogger.com/atom/10861780/111775901581356827" rel="service.edit" title="Dot what?" type="application/atom+xml"/>
        <author>
          <name>A Googler</name>
        </author>
        <issued>2005-06-03T13:03:00-07:00</issued>
        <modified>2005-06-06T13:32:53Z</modified>
        <created>2005-06-03T00:36:55Z</created>
        <link href="http://googleblog.blogspot.com/2005/06/dot-what.html" rel="alternate" title="Dot what?" type="text/html"/>
        <id>tag:blogger.com,1999:blog-10861780.post-111775901581356827</id>
        <title mode="escaped" type="text/html">Dot what?</title>
        <content mode="escaped" type="text/html" xml:base="http://googleblog.blogspot.com" xml:space="preserve">&lt;span &gt;Posted by Tom Stocky, Product Marketing Manager &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There's been a lot of talk lately about ICANN's preliminary approval of some new top level Internet domains (.cat, .jobs, .mobi, .post, .travel, and .xxx),...</content>
      </entry>
    </feed>

The processAtomFeed() function follows:

    function processAtomFeed($xml, $source)
    {
      $updatedStories = 0;
      foreach($xml->entry AS $story)
      {
        // Atom keeps the URL in the href attribute of <link>, and an entry
        // can carry several links; use the one marked rel="alternate"
        $link = '';
        foreach($story->link AS $candidate)
        {
          if ((string) $candidate['rel'] == 'alternate')
          {
            $link = (string) $candidate['href'];
            break;
          }
        }
        if (saveFeed($story->id, $source, $story->title, $story->issued,
          $story->content, $link) == 2)
        {
          break;
        }
        $updatedStories += 1;
      }
      return $updatedStories;
    }

As you can tell, changing the script to allow different feed types to be retrieved is quite simple. Examine the feed in question, determine your needs, and modify the loop and the database table to match.

Note

It may seem like a neat idea to have your script autodetect the format of the specified feed (RSS versus Atom), but in the majority of cases, it isn't too useful. The frequency with which new feeds will be added for retrieval is generally low, so you might as well have the user specify the feed type. If you do require autodetection of the feed type, do it once, when the feed is added to the retrieval list, rather than on each run of this script.
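If one-time detection is needed, looking at the name of the root element is usually enough to tell the two formats shown in this section apart. The helper below is a sketch under that assumption; detectFeedType() is a hypothetical name, and other formats (RSS 1.0, for instance, whose root element is rdf:RDF) simply come back as 'unknown'.

```php
<?php
// Hypothetical helper: classify a parsed feed by its root element name.
// Intended to run once, when a feed is added to the retrieval list.
function detectFeedType(SimpleXMLElement $xml)
{
    switch (strtolower($xml->getName())) {
        case 'rss':
            return 'rss';     // <rss><channel><item>...
        case 'feed':
            return 'atom';    // <feed><entry>...
        default:
            return 'unknown';
    }
}
```

The result could be stored alongside the feed URL and used later to choose between processRSSFeed() and processAtomFeed().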

Retrieving Enclosures

The RSS specification includes the enclosure element, which is a subelement of item. Its length, type, and url attributes give the file size, MIME type, and URL of a file attached to the item. This would commonly be used to attach a song to a post by a band, or an image related to a specific post. Updating the processRSSFeed() function to retrieve and save the specified enclosure is also relatively painless.

    function processRSSFeedWithEnclosure($xml, $source)
    {
      $updatedStories = 0;
      $MaxSize = 1000000;
      foreach($xml->channel->item AS $story)
      {
        if (saveFeed($story->guid, $source, $story->title, $story->pubDate,
          $story->description, $story->link) == 2)
        {
          break;
        }else if (isset($story->enclosure['url']) && isset($story->enclosure['length'])
          && ($story->enclosure['length'] < $MaxSize))
        {
          $filename = basename($story->enclosure['url']);
          $file = file_get_contents($story->enclosure['url']);
          file_put_contents("/tmp/" . $filename, $file);
        }
        $updatedStories += 1;
      }
      return $updatedStories;
    }

The enclosure check sits in the else if branch that follows the save attempt, so it runs only for stories that were newly inserted. This avoids repeatedly downloading the same enclosure for an unchanged lead item. Because a return value of 2 ends the loop, the enclosure of an updated story is fetched again only when the story reappears under a new primary key (for example, with a new GUID). The if portion of the else if statement is a little tricky:

    if (isset($story->enclosure['url']) && isset($story->enclosure['length'])
      && ($story->enclosure['length'] < $MaxSize))

First, check for the existence of the url attribute of enclosure (note that attributes use array syntax rather than the -> used for child elements), then the existence of the length attribute, and finally ensure that the length attribute indicates a file size less than the specified maximum. This works because conditions are evaluated in order: with all AND operations, as soon as one fails, the rest are skipped.

Assuming that the enclosure exists and is of an appropriate size, it is downloaded with file_get_contents() and saved to disk. Depending on how the feed and enclosures are used, you will want to add at least one additional step, saving information on the enclosures to a separate table or to the same table, moving files somewhere "safer" on disk, running a virus scan, double-checking the encoding of the file, and so forth. You could also add additional logic to retrieve only certain file types (in other words, only images, or everything but .pdf files). As with anything, the possibilities are endless.
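The file-type logic mentioned above might be sketched like this. The function name shouldDownloadEnclosure() and the $allowedPrefixes parameter are assumptions for illustration. The check relies on the type and length that the feed declares, which a publisher can get wrong, so it is a first filter rather than a guarantee.

```php
<?php
// Hypothetical filter: decide whether an <enclosure> is worth downloading.
// $allowedPrefixes lists acceptable MIME type prefixes, e.g. 'image/'.
function shouldDownloadEnclosure(SimpleXMLElement $enclosure, $maxSize, array $allowedPrefixes)
{
    // All three attributes must be present before anything else is checked.
    if (!isset($enclosure['url'], $enclosure['length'], $enclosure['type'])) {
        return false;
    }
    // Attribute values are strings in the XML; compare as an integer.
    if ((int) $enclosure['length'] >= $maxSize) {
        return false;
    }
    $type = strtolower((string) $enclosure['type']);
    foreach ($allowedPrefixes as $prefix) {
        if (strpos($type, $prefix) === 0) {
            return true;
        }
    }
    return false;
}
```

Inside the loop, a call such as shouldDownloadEnclosure($story->enclosure, $MaxSize, array('image/')) would take the place of the three-part condition and restrict downloads to images.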

Note

The file in this example was saved to /tmp, and though this works great for an example, it is a bad idea in any real-world application. Save your files to a directory where only your web server has access, outside the document root. Have the files virus-scanned and moved elsewhere by a batch process called after all feeds have been updated.