Professional Web APIs with PHP. eBay, Google, PayPal, Amazon, FedEx, Plus Web Feeds
Authors: Reinheimer P
Published year: 2006
Pages: 28-29/130

Keep a Copy of the Feed Structure

Drop the feed into either a SimpleXML or MiniXML construct, and use print_r() to view the contents. From here you should be able to spec out your code in detail. Keep print_r() handy while coding because it is a great tool to re-examine the branch of a feed you are working with when you are having problems. I often like to keep a hard copy of a print_r() dump handy while playing with an XML document.

Doing a quick dump of the Yahoo! feed results in the following (some items have been shortened for space, such as the description tag, which was cut to fit on one line, and the indenting has been modified):


SimpleXMLElement Object
(
    [channel] => SimpleXMLElement Object
        (
            [title] => Yahoo! News: Technology - Software
            [copyright] => Copyright (c) 2004 Yahoo! Inc. All rights reserved.
            [description] => Technology - Software
            [language] => en-us
            [lastBuildDate] => Fri, 26 Nov 2004 18:50:07 GMT
            [ttl] => 5
            [image] => SimpleXMLElement Object
                (
                    [title] => Yahoo! News
                    [width] => 142
                    [height] => 18
                    [url] => http://us.i1.yimg.com/us.yimg.com/i/us/nws/th/main_142.gif
                )
            [item] => Array
                (
                    [0] => SimpleXMLElement Object
                        (
                            [title] => Recording Industry, File-Share Face Off (AP)
                            [link] => http://us.rd.yahoo.com/dailynews/rss/software/*http://story.news.yahoo.com/news?tmpl=story2&u=/ap/20041126/ap_on_bi_ge/kazaa_trial
                            [guid] => ap/20041126/kazaa_trial
                            [pubDate] => Fri, 26 Nov 2004 18:50:07 GMT
                            [description] => AP - The next chapter in the global legal battle between...
                        )
                    [1] => SimpleXMLElement Object
                        (
                            [title] => Britons Offered 'Real' Windows XP (AP)
                            [guid] => ap/20041126/britain_microsoft_piracy
                            [pubDate] => Fri, 26 Nov 2004 16:25:12 GMT
                            [description] => AP - Owners of pirated copies of Microsoft Corp.'s Windows...
                        )
                    ...
                    [49] => SimpleXMLElement Object
                        (
                            [title] => GPL 3 to Take on IP, Patents (Ziff Davis)
                            [guid] => zd/20041122/139714
                            [pubDate] => Mon, 22 Nov 2004 06:21:23 GMT
                            [description] => Ziff Davis - With a relatively hostile environment that has...
                        )
                )
        )
)

Note 

Note that running print_r() on a SimpleXML object will not reveal attributes, which are used in several places throughout feeds. If something seems to be missing, go back and look at the original source.
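As a rough illustration of the note above, the sketch below reads attributes directly rather than relying on a dump. The XML fragment and its values are invented for the example; the array-style access and attributes() are standard SimpleXML.

```php
<?php
// Hypothetical RSS fragment; the enclosure values are invented for illustration.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <item>
      <title>Sample item</title>
      <enclosure url="http://example.com/song.mp3" length="12345" type="audio/mpeg"/>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($rss);
$enclosure = $xml->channel->item[0]->enclosure;

// Array-style access reads a single attribute; cast it to get a plain string.
$enclosureUrl = (string) $enclosure['url'];

// attributes() returns every attribute on the element.
$enclosureAttributes = array();
foreach ($enclosure->attributes() as $name => $value)
{
    $enclosureAttributes[$name] = (string) $value;
}

// Attributes on the root element are read the same way.
$feedVersion = (string) $xml['version'];
```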

Looking at the feed in this manner, it is obvious not only where loops belong (a foreach around the item elements would work perfectly), but also what the syntax should be to access any particular element ($xml->channel->item[0]->pubDate to get the publication date of the most recently posted item).
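The access patterns described above can be tried against a small inline feed. The two-item document below is a made-up stand-in for the saved Yahoo! copy; only the element names are taken from the dump.

```php
<?php
// Made-up two-item feed standing in for the real one.
$rss = <<<XML
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First story</title>
      <pubDate>Fri, 26 Nov 2004 18:50:07 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <pubDate>Fri, 26 Nov 2004 16:25:12 GMT</pubDate>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($rss);

// Direct access to one element of the first item.
$latestDate = (string) $xml->channel->item[0]->pubDate;

// The loop visits every <item> under <channel> in document order.
$titles = array();
foreach ($xml->channel->item as $story)
{
    $titles[] = (string) $story->title;
}
```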

As an additional piece of wisdom to make your life easier, I suggest you grab the feed once (I like to use file_get_contents() and then file_put_contents() myself, to avoid any encoding "fun" taking it on and off my Windows box) and save it on your test server. There is simply no need to pester the source of the feed constantly while testing everything. In this instance, a copy of the feed was saved with the file_ functions as yahoo.xml.
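One possible way to fetch the feed once and reuse the local copy is sketched below. The function name loadFeedCopy() and the cache path are assumptions made for illustration, not part of the book's script.

```php
<?php
// Fetch the feed only when no local copy exists yet.
// $cacheFile is a hypothetical path; pick one that suits your test server.
function loadFeedCopy($url, $cacheFile)
{
    if (!file_exists($cacheFile))
    {
        $raw = file_get_contents($url);
        if ($raw === false)
        {
            return false;   // nothing retrieved, nothing cached
        }
        file_put_contents($cacheFile, $raw);
    }
    // Every later call parses the saved copy and leaves the source alone.
    return simplexml_load_file($cacheFile);
}
```

Called as loadFeedCopy("http://rss.news.yahoo.com/rss/software", "yahoo.xml"), it would contact the feed only on the first run; deleting yahoo.xml forces a fresh download.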



Retrieving and Storing the Feed

This section looks at a more advanced script, which makes use of a database to store feeds. Obviously, the table used will need to reflect the structure of the feed.

A Basic Storage Script

First, you need to create a MySQL table that will be populated with the information from the feed when the aggregateFeeds.php script is run:


CREATE TABLE `03_feed_raw` (
  `id` varchar(32) NOT NULL default '',
  `source` varchar(75) NOT NULL default '',
  `title` varchar(255) NOT NULL default '',
  `date` timestamp NOT NULL,
  `content` text NOT NULL,
  `link` varchar(255) NOT NULL default '',
  PRIMARY KEY (`id`)
) ENGINE=MyISAM;

The ID field will contain an MD5 hash, which is 32 hex characters long. The source field will contain the URL of the feed in question. Title, date, content, and link will all come from the feed:


<?php
include ("../common_db.php");

$request = "http://rss.news.yahoo.com/rss/software";
$response = file_get_contents($request);
$xml = simplexml_load_string($response);
echo "Updated " . processRSSFeed($xml, $request) . " feeds";

The URL for the feed is declared and the feed is retrieved. The feed is then processed into the SimpleXML object. The feed is sent for processing, and the total number of feeds updated is printed (this script would most likely be run by a cron job, or other timed construct, so a detailed output isn't really required). Updating is done by a processRSSFeed() function, which looks like this:


function processRSSFeed($xml, $source)
{
    $updatedStories = 0;
    foreach ($xml->channel->item as $story)
    {
        if (saveFeed($story->guid, $source, $story->title, $story->pubDate,
            $story->description, $story->link) == 2)
        {
            break;
        }
        $updatedStories += 1;
    }
    return $updatedStories;
}

processRSSFeed() takes the input RSS feed as a SimpleXML object, as well as the source URL of the feed. The foreach loop provides an easy way to access each item element; rather than $xml->channel->item[#]->title, you can simply use $story->title. Each story is saved in turn, and the return value of saveFeed() indicates whether this was an addition to the database or merely an update to a story already present. If the story was merely an update, you can stop processing other items because it is likely that they are already present. The function returns the total number of stories saved.

Note 

This method assumes that the feed is provided in reverse chronological order (as is the standard), with the most recent additions posted at the top. It also assumes that any updates to previous stories will be re-seeded at the top of the feed, rather than updated in their current position (a standard reporting practice is to report any corrections or updates in the same manner as the original story). Depending on how the feed you are consuming operates, you may want to process the entire feed regardless.
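If the feed you consume does not follow these assumptions, a variant that never stops early might look like the sketch below. processAllItems() and the $save callable are hypothetical; $save stands in for saveFeed() and is assumed to return 1 for a new row and 2 for a replaced one.

```php
<?php
// Walks every item in the feed, whatever the save result, and reports
// how many stories were new and how many replaced an existing row.
function processAllItems($xml, $source, $save)
{
    $counts = array('added' => 0, 'replaced' => 0);
    foreach ($xml->channel->item as $story)
    {
        $result = call_user_func($save, $story->guid, $source, $story->title,
            $story->pubDate, $story->description, $story->link);
        if ($result == 2)
        {
            $counts['replaced'] += 1;   // keep going instead of breaking
        }
        else
        {
            $counts['added'] += 1;
        }
    }
    return $counts;
}
```

The trade-off is one query per item on every run, which matters little for a 50-item feed fetched by a scheduled job.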

Finally, to actually save the feed to the database, use the following code:


function saveFeed($guid, $source, $title, $date, $content, $link)
{
    if (strlen($guid) > 0)
    {
        $pk = md5($source . $guid);
    }
    else
    {
        $pk = md5($source . $title);
    }

A primary key is generally a good idea when storing data into the database; in this case it is a good idea to create one of your own. GUID could be used — however, it isn't always provided, and although sites guarantee theirs to be unique on their own site, there are no claims of cross-site uniqueness (there are likely several home-brewed RSS feed providers out there with GUIDs starting at 1, incrementing as appropriate), so you prepend the source URL to the GUID. In cases of feeds that do not provide the GUID field, the title is used — date or link would be another good choice. In either case, the primary key is the MD5 (a one-way hashing algorithm that generates a key 32 hex characters long) of the resultant string:


    $linkID = db_connect();
    $title = mysql_real_escape_string(strip_tags($title));
    $content = mysql_real_escape_string(strip_tags($content));
    $link = mysql_real_escape_string($link);
    $source = mysql_real_escape_string($source);

A connection is established to the database, HTML tags are stripped from the title and content, special characters are escaped to avoid SQL injection attacks, and the variables are ready to be saved to the database.
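The key derivation at the top of saveFeed() can be pulled out and tried on its own. makeStoryKey() is a hypothetical helper name, and it takes plain strings rather than SimpleXML objects.

```php
<?php
// Mirrors the key derivation in saveFeed(): prefer the GUID and fall back
// to the title, always prefixed with the feed's source URL.
function makeStoryKey($source, $guid, $title)
{
    $basis = (strlen($guid) > 0) ? $guid : $title;
    return md5($source . $basis);
}
```

Two feeds that both number their GUIDs from 1 still get different keys, because the source URL is part of the hashed string.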

Note 

It is considered a best practice to escape all data to be saved to the database with the database-specific function, rather than simply using addslashes(). This ensures that all characters that the specific database requires are escaped, rather than simply ', ", \, and the NUL byte (mysql_real_escape_string() also escapes \n, \r, and \x1a). Other databases have similar functionality. The mysql_ functions used here were removed in PHP 7.0; on current versions, mysqli_real_escape_string() or prepared statements fill the same role.


    $date = strtotime($date);
    // strtotime() returns false on failure (-1 before PHP 5.1)
    if ($date === false || $date == -1)
    {
        $date = time();
    }

To process the date, you rely on the strtotime() function. It makes every effort to interpret the date presented and returns a UNIX timestamp. Although the RSS specification requires dates in a specific format (RFC 822), using strtotime() is not only easier than writing your own parser, it also understands most other textual date formats, including the ISO 8601 style dates used in Atom feeds. If the format is not understood, or the date is simply missing, strtotime() reports failure (false, or -1 before PHP 5.1) and the current time is used instead.
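The date handling can be sketched as a small helper. parseFeedDate() and its $fallback parameter are assumptions added here so the behaviour is easy to check; the failure test covers both the false returned by current PHP versions and the -1 returned before PHP 5.1.

```php
<?php
// Returns a UNIX timestamp for the feed's date string, or $fallback
// (the current time by default) when the string cannot be parsed.
function parseFeedDate($raw, $fallback = null)
{
    $timestamp = strtotime((string) $raw);
    if ($timestamp === false || $timestamp === -1)
    {
        return ($fallback === null) ? time() : $fallback;
    }
    return $timestamp;
}
```

The same helper accepts the RSS pubDate value and the Atom issued value from the feeds shown in this chapter.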


    $query = "REPLACE INTO 03_feed_raw
        (`id`, `source`, `title`, `date`, `content`, `link`)
        VALUES
        ('$pk', '$source', '$title', FROM_UNIXTIME('$date'), '$content', '$link')";
    return replaceQuery($query, $linkID);
}
?>

The REPLACE INTO syntax in MySQL is a real timesaver in this case, though it only works because you have a primary key. If the query is run, and no existing record has the same primary key, it will insert the record, and mysql_affected_rows() will return 1. If, however, a record exists with that primary key, it will be deleted, a new record will be created with the information in the query, and mysql_affected_rows() will return 2.

If your database system doesn't support a REPLACE INTO syntax (or MySQL's alternative INSERT ... ON DUPLICATE KEY UPDATE) or something to that effect, you still have a few choices. You can check for an existing record in each instance with a SELECT query, and create it if it doesn't exist. Alternatively, you could look up the most recent date in your database and only insert feed elements that came afterwards, and so on.
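The first alternative, checking with a SELECT before writing, might look roughly like this. It is a sketch under several assumptions: it uses PDO prepared statements rather than the mysql_ functions used elsewhere in this chapter, the function name saveWithoutReplace() is made up, and the table is called feed_raw here because not every database accepts an unquoted name that starts with a digit.

```php
<?php
// Sketch only: $db is any PDO connection to a database holding a feed_raw
// table with the columns described earlier. Returns 1 for an insert and
// 2 for an update, mimicking what the text reports for REPLACE INTO.
function saveWithoutReplace(PDO $db, $pk, $source, $title, $date, $content, $link)
{
    $check = $db->prepare('SELECT COUNT(*) FROM feed_raw WHERE id = ?');
    $check->execute(array($pk));
    $exists = ((int) $check->fetchColumn() > 0);

    if ($exists)
    {
        $sql = 'UPDATE feed_raw SET source = ?, title = ?, date = ?, content = ?, link = ? WHERE id = ?';
    }
    else
    {
        $sql = 'INSERT INTO feed_raw (source, title, date, content, link, id) VALUES (?, ?, ?, ?, ?, ?)';
    }
    // Both statements take their parameters in the same order.
    $statement = $db->prepare($sql);
    $statement->execute(array($source, $title, $date, $content, $link, $pk));

    return $exists ? 2 : 1;
}
```

The return values 1 and 2 are chosen to match what the text reports for REPLACE INTO, so the calling loop would not need to change. The check and the write are separate statements, so two copies of the script running at once could both decide to insert; a scheduled job that runs one at a time avoids that.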

As mentioned earlier, this script was designed to be called by a cron job or other automated process (Windows Scheduled Tasks, for example). The $request variable could be turned into an array and iterated through to grab multiple feeds, and so on.
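Turning $request into a list could look something like the following. aggregateFeeds() and the $fetch parameter are hypothetical; the array maps each feed URL to the name of the function that should process it.

```php
<?php
// $feeds maps each feed URL to the name of the function that processes it.
// $fetch is a callable that returns the raw XML for a URL; by default that
// role is played by file_get_contents(), as in the script above.
function aggregateFeeds($feeds, $fetch = 'file_get_contents')
{
    $totals = array();
    foreach ($feeds as $url => $processor)
    {
        $response = call_user_func($fetch, $url);
        $xml = ($response === false) ? false : simplexml_load_string($response);
        if ($xml === false)
        {
            $totals[$url] = false;   // could not fetch or parse; skip this feed
            continue;
        }
        $totals[$url] = call_user_func($processor, $xml, $url);
    }
    return $totals;
}
```

Called as aggregateFeeds(array($request => 'processRSSFeed')), it would return an array of per-feed counts keyed by URL, with false for any feed that could not be fetched or parsed.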

Extending the Script to Include Atom Support

Extending the script to grab other feed types should be trivial. This function (in place of the preceding processRSSFeed() function) will process the specified Atom feed and save it. This script was tested against the Google Blog (www.google.com/googleblog/atom.xml), where Google employees post on a semiregular basis.

Here is a snippet of Google's Blog for reference (trimmed for space):


<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?>
<feed xmlns="http://purl.org/atom/ns#" version="0.3" xml:lang="en-US">
  <link href="http://www.blogger.com/atom/10861780" rel="service.post"
        title="Google Blog" type="application/atom+xml"/>
  <link href="http://www.blogger.com/atom/10861780" rel="service.feed"
        title="Google Blog" type="application/atom+xml"/>
  <title mode="escaped" type="text/html">Google Blog</title>
  <tagline mode="escaped" type="text/html"></tagline>
  <link href="http://googleblog.blogspot.com" rel="alternate" title="Google Blog"
        type="text/html"/>
  <id>tag:blogger.com,1999:blog-10861780</id>
  <modified>2005-06-16T21:33:27Z</modified>
  <generator url="http://www.blogger.com/" version="5.15">Blogger</generator>
  <info mode="xml" type="text/html">
    <div xmlns="http://www.w3.org/1999/xhtml">This is an Atom formatted XML site
    feed. It is intended to be viewed in a Newsreader or syndicated to another site.
    Please visit the <a href="http://help.blogger.com/bin/answer.py?answer=697">Blogger
    Help</a> for more info.</div>
  </info>
  <entry xmlns="http://purl.org/atom/ns#">
    <link href="http://www.blogger.com/atom/10861780/111775901581356827"
          rel="service.edit" title="Dot what?" type="application/atom+xml"/>
    <author>
      <name>A Googler</name>
    </author>
    <issued>2005-06-03T13:03:00-07:00</issued>
    <modified>2005-06-06T13:32:53Z</modified>
    <created>2005-06-03T00:36:55Z</created>
    <link href="http://googleblog.blogspot.com/2005/06/dot-what.html"
          rel="alternate" title="Dot what?" type="text/html"/>
    <id>tag:blogger.com,1999:blog-10861780.post-111775901581356827</id>
    <title mode="escaped" type="text/html">Dot what?</title>
    <content mode="escaped" type="text/html"
             xml:base="http://googleblog.blogspot.com" xml:space="preserve">&lt;span
    class="byline-author"&gt;Posted by Tom Stocky, Product Marketing Manager
    &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There's been a lot of talk lately about
    ICANN's preliminary approval of some new top level Internet domains (.cat, .jobs,
    .mobi, .post, .travel, and .xxx),...
    </content>
  </entry>
</feed>


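function processAtomFeed($xml, $source)
{
    $updatedStories = 0;
    foreach ($xml->entry as $story)
    {
        // Atom stores each URL in a link element's href attribute, so pick
        // the rel="alternate" link rather than casting $story->link itself.
        $link = '';
        foreach ($story->link as $candidate)
        {
            if ((string) $candidate['rel'] == 'alternate')
            {
                $link = (string) $candidate['href'];
                break;
            }
        }
        if (saveFeed($story->id, $source, $story->title, $story->issued,
            $story->content, $link) == 2)
        {
            break;
        }
        $updatedStories += 1;
    }
    return $updatedStories;
}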

As you can tell, changing the script to allow different feed types to be retrieved is quite simple. Examine the feed in question, determine your needs, and modify the loop, database tables, whatever.

Note 

It may seem like a neat idea to have your script autodetect the format used by the specified feed (RSS versus Atom), but in the majority of cases, it isn't too useful. The frequency with which new feeds will be added for retrieval is generally low, so you might as well have the user specify the feed type. If you do require auto detection of feed type, do it once, when the feed is added to the retrieval list, rather than on each run of this script.
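If detection is needed, checking the name of the root element once is usually enough. detectFeedType() is a hypothetical helper, and it only distinguishes the two formats discussed in this chapter.

```php
<?php
// Guess the feed type from the root element name: <rss> for RSS 0.9x/2.0,
// <feed> for Atom. Anything else is reported as unknown.
function detectFeedType($xml)
{
    switch (strtolower($xml->getName()))
    {
        case 'rss':
            return 'rss';
        case 'feed':
            return 'atom';
        default:
            return 'unknown';
    }
}
```

The result would be stored alongside the feed URL when the feed is added, so later runs can go straight to the matching processing function.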

Retrieving Enclosures

The RSS specification includes the enclosure element, which is a subelement of item. It carries the file size (length), MIME type, and URL of a file attached to the item as attributes. This would commonly be used to attach a song to a post by a band, or an image related to a specific post. Updating the processRSSFeed() function to retrieve and save the specified enclosure is also relatively painless.


function processRSSFeedWithEnclosure($xml, $source)
{
    $updatedStories = 0;
    $MaxSize = 1000000;
    foreach ($xml->channel->item as $story)
    {
        if (saveFeed($story->guid, $source, $story->title, $story->pubDate,
            $story->description, $story->link) == 2)
        {
            break;
        }else if (isset($story->enclosure['url']) && isset($story->enclosure['length'])
            && ($story->enclosure['length'] < $MaxSize))
        {
            $filename = basename($story->enclosure['url']);
            $file = file_get_contents($story->enclosure['url']);
            file_put_contents("/tmp/" . $filename, $file);
        }
        $updatedStories += 1;
    }
    return $updatedStories;
}

The check for an enclosure is done after the save attempt for a couple of reasons, primarily to avoid repeatedly downloading the same enclosure for an unchanged lead item. Note that because the loop breaks as soon as saveFeed() reports an existing story, an enclosure is fetched only when its story is saved for the first time; a story that is re-posted under a new GUID counts as new and has its enclosure downloaded again. The if portion of the else if statement is a little tricky:

if (isset($story->enclosure['url']) && isset($story->enclosure['length']) && ($story->enclosure['length'] < $MaxSize))

First, check for the existence of the url attribute of enclosure (note the different syntax for attributes), then the existence of the length attribute, and finally ensure that the length attribute indicates a file size less than the specified maximum. This works because the conditions are evaluated in order; as soon as one fails (in this case, with all AND operations), the rest are skipped.
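The same test can be wrapped in a small function to see the short-circuit behaviour in isolation. enclosureIsWanted() is a hypothetical name, and the explicit (int) cast is an addition that makes the numeric comparison unambiguous.

```php
<?php
// Returns true only when the item has an enclosure with both a url and a
// length attribute, and the declared length is below $maxSize bytes.
// Each condition is evaluated only if the ones before it were true.
function enclosureIsWanted($story, $maxSize)
{
    return isset($story->enclosure['url'])
        && isset($story->enclosure['length'])
        && ((int) $story->enclosure['length'] < $maxSize);
}
```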

Assuming that the enclosure exists and is of an appropriate size, it is downloaded with file_get_contents() and saved to disk. Depending on how the feed and enclosures are used, you will want to add at least one additional step: saving information on the enclosures to a separate table or to the same table, moving files somewhere "safer" on disk, running a virus scan, double-checking the encoding of the file, and so forth. You could also add logic to retrieve only certain file types (for example, only images, or everything but .pdf files). As with anything, the possibilities are endless.

Note 

The file in this example was saved to /tmp , and though this works great for an example, it is a bad idea in any real-world application. Save your files to a directory where only your web server has access, outside the document root. Have the files virus-scanned and moved elsewhere by a batch process called after all feeds have been updated.
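A first step toward the safer handling described in the note could be to decide where a file should go before downloading it. In the sketch below, enclosureTarget(), the storage directory, and the list of allowed types are all assumptions; the type reported by the feed is supplied by the publisher and should not be trusted on its own.

```php
<?php
// Builds a destination path for an enclosure inside $storageDir (a
// hypothetical directory outside the document root). Returns false when
// the declared MIME type is not in $allowedTypes.
function enclosureTarget($url, $type, $storageDir, $allowedTypes)
{
    if (!in_array(strtolower($type), $allowedTypes, true))
    {
        return false;
    }
    // Use only the path component so query strings do not end up in the name.
    $path = parse_url($url, PHP_URL_PATH);
    $name = basename((string) $path);
    // Replace anything outside a conservative character set.
    $name = preg_replace('/[^A-Za-z0-9._-]/', '_', $name);
    if ($name === '' || $name === '.' || $name === '..')
    {
        $name = 'enclosure';
    }
    // Prefix a hash of the full URL so that same-named files from
    // different sites do not overwrite each other.
    return rtrim($storageDir, '/') . '/' . md5($url) . '_' . $name;
}
```

The virus scan and any move to a final location would still happen afterwards, in the batch process the note describes.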

