Drop the feed into either a SimpleXML or MiniXML construct, and use print_r() to view the contents. From here you should be able to spec out your code in detail. Keep print_r() handy while coding because it is a great tool to re-examine the branch of a feed you are working with when you are having problems. I often like to keep a hard copy of a print_r() dump handy while playing with an XML document.
Doing a quick dump of the Yahoo! feed results in the following (some items have been
SimpleXMLElement Object
(
    [channel] => SimpleXMLElement Object
        (
            [title] => Yahoo! News: Technology - Software
            [copyright] => Copyright (c) 2004 Yahoo! Inc. All rights reserved.
            [description] => Technology - Software
            [language] => en-us
            [lastBuildDate] => Fri, 26 Nov 2004 18:50:07 GMT
            [ttl] => 5
            [image] => SimpleXMLElement Object
                (
                    [title] => Yahoo! News
                    [width] => 142
                    [height] => 18
                    [url] => http://us.i1.yimg.com/us.yimg.com/i/us/nws/th/main_142.gif
                )
            [item] => Array
                (
                    [0] => SimpleXMLElement Object
                        (
                            [title] => Recording Industry, File-Share Face Off (AP)
                            [link] => http://us.rd.yahoo.com/dailynews/rss/software/*http://story.news.yahoo.com/news?tmpl=story2&u=/ap/20041126/ap_on_bi_ge/kazaa_trial
                            [guid] => ap/20041126/kazaa_trial
                            [pubDate] => Fri, 26 Nov 2004 18:50:07 GMT
                            [description] => AP - The next chapter in the global legal battle between...
                        )
                    [1] => SimpleXMLElement Object
                        (
                            [title] => Britons Offered 'Real' Windows XP (AP)
                            [guid] => ap/20041126/britain_microsoft_piracy
                            [pubDate] => Fri, 26 Nov 2004 16:25:12 GMT
                            [description] => AP - Owners of pirated copies of Microsoft Corp.'s Windows...
                        )
                    ...
                    [49] => SimpleXMLElement Object
                        (
                            [title] => GPL 3 to Take on IP, Patents (Ziff Davis)
                            [guid] => zd/20041122/139714
                            [pubDate] => Mon, 22 Nov 2004 06:21:23 GMT
                            [description] => Ziff Davis - With a relatively hostile environment that has...
                        )
                )
        )
)
| Note |
Note that running print_r() on a SimpleXML object will not reveal attributes, which are used in several places throughout feeds. If something seems to be missing, go back and look at the original source. |
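As a small illustration of reading attributes that a dump may not show, the sketch below parses a made-up item and reads the attributes of its enclosure element, first with array-style access and then with attributes(). The XML string, its URL, and its values are invented for this example and do not come from the Yahoo! feed.

```php
<?php
// Hypothetical sample item; the URL and sizes are invented for illustration.
$sample = '<item>'
    . '<title>Example story</title>'
    . '<enclosure url="http://example.com/audio/show.mp3" length="12345" type="audio/mpeg"/>'
    . '</item>';

$item = simplexml_load_string($sample);

// Array-style access reads a single attribute; cast it to get a plain string.
$url = (string) $item->enclosure['url'];

// attributes() returns every attribute on the element, keyed by name.
$attributes = array();
foreach ($item->enclosure->attributes() as $name => $value) {
    $attributes[$name] = (string) $value;
}

echo $url . "\n";
print_r($attributes);
```

Casting to string matters because both forms of access return SimpleXMLElement objects rather than plain values.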
Looking at the feed in this manner, it is obvious not only where
As an additional piece of wisdom to make your life easier, I suggest you grab the feed once (I like to use file_get_contents() then file_put_contents()
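One way to sketch the grab-once idea is a small helper that prefers a saved copy and only downloads when no copy exists. The function name loadFeed() and the $cacheFile parameter are assumptions made for this example, not part of the original script.

```php
<?php
/**
 * Return the feed body, preferring a local copy when one exists.
 * loadFeed() and $cacheFile are hypothetical names chosen for this sketch.
 */
function loadFeed($url, $cacheFile)
{
    if (is_readable($cacheFile)) {
        // Work from the saved copy instead of hitting the remote server again.
        return file_get_contents($cacheFile);
    }

    $body = file_get_contents($url);
    if ($body === false) {
        return false; // the download failed, so there is nothing to cache
    }

    file_put_contents($cacheFile, $body);
    return $body;
}
```

While developing, delete the cache file whenever you want a fresh copy of the feed.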
This section looks at a more advanced script, which makes use of a database to store feeds. Obviously, the table used will need to reflect the structure of the feed.
First, you need to create a MySQL table that will be
CREATE TABLE `03_feed_raw` (
  `id` varchar(32) NOT NULL default '',
  `source` varchar(75) NOT NULL default '',
  `title` varchar(255) NOT NULL default '',
  `date` timestamp(14) NOT NULL,
  `content` text NOT NULL,
  `link` varchar(255) NOT NULL default '',
  PRIMARY KEY (`id`)
) TYPE=MyISAM;
The ID field will contain an MD5 hash, which is 32 hex
<?php
include("../common_db.php");

$request = "http://rss.news.yahoo.com/rss/software";
$response = file_get_contents($request);
$xml = simplexml_load_string($response);
echo "Updated " . processRSSFeed($xml, $request) . " feeds";
The URL for the feed is declared and the feed is retrieved. The feed is then
function processRSSFeed($xml, $source)
{
    $updatedStories = 0;
    foreach ($xml->channel->item as $story) {
        if (saveFeed($story->guid, $source, $story->title, $story->pubDate, $story->description, $story->link) == 2) {
            break;
        }
        $updatedStories += 1;
    }
    return $updatedStories;
}
processRSSFeed() takes the input RSS feed as a SimpleXML object, as well as the source URL of the feed provided. The foreach loop provides an easy method to access each element in the item; rather than $xml->channel->item[#]->title, you can simply use $story->title. Each story is saved in
| Note |
This method assumes that the feed is provided in reverse chronological order (as is the standard), with the most recent additions posted at the top. It also assumes that any updates to previous stories will be re-
|
Finally, to actually save the feed to the database, use the following code:
function saveFeed($guid, $source, $title, $date, $content, $link)
{
    if (strlen($guid) > 0) {
        $pk = md5($source . $guid);
    } else {
        $pk = md5($source . $title);
    }
A primary key is
    $linkID = db_connect();
    $title = mysql_real_escape_string(strip_tags($title));
    $content = mysql_real_escape_string(strip_tags($content));
    $link = mysql_real_escape_string($link);
    $source = mysql_real_escape_string($source);
A connection is established to the database, the strings are stripped of any HTML encoding,
| Note |
It is
|
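    $date = strtotime($date);
    if ($date === false || $date == -1) {
        // strtotime() could not parse the date (false in current PHP,
        // -1 in older versions), so fall back to the current time
        $date = time();
    }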
To process the date, you rely on the strtotime() function. It will make every effort to interpret the date presented, and return it as a UNIX timestamp. Although the RSS specification requires the date to be in a specific format (RFC 822), using strtotime() is not only easier than writing your own function, but it also understands most other textual date formats. If, however, the format is not
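To make the date handling concrete, here is a minimal sketch of the same parse-with-fallback idea, wrapped in a helper so it can be tried in isolation. The name parseFeedDate() is an assumption for this example. The two sample strings are the pubDate style from the Yahoo! dump and the ISO 8601 style used by the Atom sample later in this section.

```php
<?php
/**
 * Convert a feed date to a UNIX timestamp, falling back to the current time.
 * parseFeedDate() is a hypothetical helper, not part of the original script.
 */
function parseFeedDate($date)
{
    $timestamp = strtotime((string) $date);
    // Current PHP returns false on failure; older versions returned -1.
    if ($timestamp === false || $timestamp == -1) {
        return time();
    }
    return $timestamp;
}

// Print both sample dates normalized to UTC.
echo gmdate('Y-m-d H:i:s', parseFeedDate('Fri, 26 Nov 2004 18:50:07 GMT')) . "\n";
echo gmdate('Y-m-d H:i:s', parseFeedDate('2005-06-03T13:03:00-07:00')) . "\n";
```

Falling back to the current time keeps a story with an unreadable date from being discarded, at the cost of a slightly inaccurate timestamp.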
    $query = "REPLACE INTO 03_feed_raw
        (`id`, `source`, `title`, `date`, `content`, `link`)
        VALUES
        ('$pk', '$source', '$title', FROM_UNIXTIME('$date'), '$content', '$link')";
    return replaceQuery($query, $linkID);
}
?>
The REPLACE INTO syntax in MySQL is a real timesaver in this case, though it only works because you have a primary key. If the query is run, and no existing record has the same primary key, it will insert the record, and mysql_affected_rows() will return 1. If, however, a record exists with that primary key, it will be deleted, a new record will be created with the information in the query, and mysql_affected_rows() will return 2.
If your database system doesn't support REPLACE INTO (or MySQL's alternative, INSERT ... ON DUPLICATE KEY UPDATE) or something to that effect, you still have a few choices. You can check for an existing record in each instance with a SELECT query, and create it if it doesn't exist. Alternatively, you could look up the most recent date in your database and insert only the feed elements that came afterward.
As mentioned earlier, this script was designed to be called by a cron job or other automated process (Windows Scheduled Tasks, for example). The $request variable could be turned into an array and iterated through to grab multiple feeds and so on.
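The idea of iterating over several feed URLs could be sketched roughly as follows. The function name processFeeds() and the two callback parameters are assumptions for this example. They are injected so the loop can be tried without a network connection or database. In the script above they would correspond to file_get_contents() and processRSSFeed().

```php
<?php
/**
 * Run a processing callback over several feeds.
 * processFeeds(), $fetch, and $process are hypothetical names for this sketch.
 */
function processFeeds(array $requests, $fetch, $process)
{
    libxml_use_internal_errors(true); // keep malformed feeds from emitting warnings
    $results = array();
    foreach ($requests as $request) {
        $response = call_user_func($fetch, $request);
        $xml = ($response === false) ? false : simplexml_load_string($response);
        if ($xml === false) {
            $results[$request] = false; // fetch or parse failed; move on to the next feed
            continue;
        }
        $results[$request] = call_user_func($process, $xml, $request);
    }
    return $results;
}
```

A call such as processFeeds($urls, 'file_get_contents', 'processRSSFeed') would return the number of updated stories per URL, with false marking any feed that could not be fetched or parsed.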
Extending the script to grab other feed types should be trivial. This function (in place of the
Here is a snippet of Google's Blog for reference (
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?>
<feed xmlns="http://purl.org/atom/ns#" version="0.3" xml:lang="en-US">
  <link href="http://www.blogger.com/atom/10861780" rel="service.post" title="Google Blog" type="application/atom+xml"/>
  <link href="http://www.blogger.com/atom/10861780" rel="service.feed" title="Google Blog" type="application/atom+xml"/>
  <title mode="escaped" type="text/html">Google Blog</title>
  <tagline mode="escaped" type="text/html"></tagline>
  <link href="http://googleblog.blogspot.com" rel="alternate" title="Google Blog" type="text/html"/>
  <id>tag:blogger.com,1999:blog-10861780</id>
  <modified>2005-06-16T21:33:27Z</modified>
  <generator url="http://www.blogger.com/" version="5.15">Blogger</generator>
  <info mode="xml" type="text/html">
    <div xmlns="http://www.w3.org/1999/xhtml">This is an Atom formatted XML site feed. It is intended to be viewed in a Newsreader or syndicated to another site. Please visit the <a href="http://help.blogger.com/bin/answer.py?answer=697">Blogger Help</a> for more info.</div>
  </info>
  <entry xmlns="http://purl.org/atom/ns#">
    <link href="http://www.blogger.com/atom/10861780/111775901581356827" rel="service.edit" title="Dot what?" type="application/atom+xml"/>
    <author>
      <name>A Googler</name>
    </author>
    <issued>2005-06-03T13:03:00-07:00</issued>
    <modified>2005-06-06T13:32:53Z</modified>
    <created>2005-06-03T00:36:55Z</created>
    <link href="http://googleblog.blogspot.com/2005/06/dot-what.html" rel="alternate" title="Dot what?" type="text/html"/>
    <id>tag:blogger.com,1999:blog-10861780.post-111775901581356827</id>
    <title mode="escaped" type="text/html">Dot what?</title>
    <content mode="escaped" type="text/html" xml:base="http://googleblog.blogspot.com" xml:space="preserve">&lt;span class="byline-author"&gt;Posted by Tom Stocky, Product Marketing Manager &lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There's been a lot of talk lately about ICANN's preliminary approval of some new top level Internet domains (.cat, .jobs, .mobi, .post, .travel, and .xxx),...</content>
  </entry>
</feed>

function processAtomFeed($xml, $source)
{
    $updatedStories = 0;
    foreach ($xml->entry as $story) {
        if (saveFeed($story->id, $source, $story->title, $story->issued, $story->content, $story->link) == 2) {
            break;
        }
        $updatedStories += 1;
    }
    return $updatedStories;
}
As you can tell, changing the script to allow different feed types to be retrieved is quite simple. Examine the feed in question, determine your needs, and modify the loop, database tables, whatever.
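One detail worth checking when adapting the loop to Atom: in the sample above, each link element carries its URL in an href attribute rather than as element text, and an entry has more than one link. An empty element cast to a string yields an empty string, so it may be worth picking the rel="alternate" link explicitly. The helper below is a minimal sketch of that idea. The name atomEntryLink() is an assumption and is not part of the original script.

```php
<?php
/**
 * Return the href of an entry's rel="alternate" link, or '' if none is found.
 * atomEntryLink() is a hypothetical helper, not part of the original script.
 */
function atomEntryLink($entry)
{
    foreach ($entry->link as $link) {
        // Attributes use array syntax; cast them to compare as plain strings.
        if ((string) $link['rel'] === 'alternate') {
            return (string) $link['href'];
        }
    }
    return '';
}
```

Inside the Atom loop, atomEntryLink($story) could then be passed to saveFeed() in place of $story->link.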
| Note |
It may seem like a neat idea to have your script autodetect the format used by the specified feed (RSS versus Atom), but in the majority of cases, it isn't too useful. The frequency with which new feeds will be added for retrieval is generally low, so you might as well have the
|
The RSS specification includes the enclosure element, which is a subelement of item. It contains the filesize, type, and URL for a file attached to the item element. This would commonly be used to attach a song to a post by a
function processRSSFeedWithEnclosure($xml, $source)
{
    $updatedStories = 0;
    $MaxSize = 1000000;
    foreach ($xml->channel->item as $story) {
        if (saveFeed($story->guid, $source, $story->title, $story->pubDate, $story->description, $story->link) == 2) {
            break;
        } else if (isset($story->enclosure['url']) && isset($story->enclosure['length']) && ((int) $story->enclosure['length'] < $MaxSize)) {
            // Cast the attribute to a string before using it as a URL and filename
            $url = (string) $story->enclosure['url'];
            $filename = basename($url);
            $file = file_get_contents($url);
            if ($file !== false) {
                // Only write the file if the download succeeded
                file_put_contents("/tmp/" . $filename, $file);
            }
        }
        $updatedStories += 1;
    }
    return $updatedStories;
}
The check for an enclosure with the particular item is done after the save attempt for a couple of reasons, primarily to avoid repeatedly downloading the same enclosure for an unchanged lead item. This also ensures that the file is downloaded again if the story is updated. The if portion of the else if statement is a little tricky:
if (isset($story->enclosure['url']) && isset($story->enclosure['length']) && ((int) $story->enclosure['length'] < $MaxSize))
First, check for the existence of the url attribute of enclosure (note the different syntax for attributes), then the existence of the length attribute, and finally ensure that the length attribute indicates a file size less than the specified max
Assuming that the enclosure exists and is of an appropriate size, it is downloaded with file_get_contents() and saved to disk. Depending on how the feed and enclosures are used, you will want to add at least one additional step: saving information on the enclosures to a separate table or to the same table, moving files somewhere "safer" on disk, running a virus scan, double-checking the encoding of the file, and so forth. You could also add additional logic to retrieve only certain file types (in other words, only images, or everything but .pdf files). As with anything, the possibilities are endless.
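The file-type filtering mentioned above might look something like the following sketch, which combines the existing size check with a whitelist of MIME types read from the type attribute of enclosure. The function name wantEnclosure() and the $allowedTypes parameter are assumptions for this example. The declared type comes from the feed itself, so treat it as a hint rather than a guarantee of what the file contains.

```php
<?php
/**
 * Decide whether an item's enclosure is worth downloading.
 * wantEnclosure() and $allowedTypes are hypothetical names for this sketch.
 */
function wantEnclosure($item, $maxSize, array $allowedTypes)
{
    if (!isset($item->enclosure['url']) || !isset($item->enclosure['length'])) {
        return false; // no enclosure, or no declared size
    }
    if ((int) $item->enclosure['length'] >= $maxSize) {
        return false; // declared size is at or above the limit
    }
    // Compare the declared MIME type against the whitelist.
    return in_array((string) $item->enclosure['type'], $allowedTypes, true);
}
```

For example, wantEnclosure($story, 1000000, array('image/jpeg', 'image/png')) would accept only small images.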
| Note |
The file in this example was saved to /tmp, and though this works great for an example, it is a bad idea in any real-world application. Save your files to a directory where only your web server has access, outside the document root. Have the files virus-scanned and moved elsewhere by a batch process called after all feeds have been updated. |