In-Depth Feed Consumption


So far, you have looked at only the basics of consuming feeds. Building a robust aggregator is harder: the devil is in the details, and quite a number of factors can come into play.

Know Thy Elements

In this section, you continue to consume the Yahoo! Software feed, found at http://rss.news.yahoo.com/rss/software. Viewing the feed in your browser (or a text editor, for that matter) confirms that you have the exact URL and lets you familiarize yourself with the feed's content. You may have been lured into a false sense of security by the specifications presented thus far; it looks simple enough, after all. Unfortunately, this isn't quite the case: different programmers interpret the specifications differently and add their own elements via namespacing. Some feeds (even major ones) don't follow any of the specifications at all, which leaves quite a few details that may require special cases in your code. Understanding exactly how the feed you are about to consume works is critical if you want your aggregator to work properly, so always look for edge cases (really long or really short entries, or entries with other unique characteristics). The following sections discuss some important things to note.

Which Elements Do You Want?

Chances are you aren't going to need the entire feed, only certain elements of it. In this case, you want to grab the <title> of the feed itself, as well as the <link> to the feed's web page. The lastBuildDate tag can be used to check whether anything new has been posted, and the ttl (Time To Live) tag indicates that the feed should be cached for 5 minutes. You won't save the TTL, but keep it in mind when deciding how often to hit the page. The image tag can be ignored; you don't need it for now. Finally, you want the item tag and everything underneath it (title, link, guid, pubDate, description).
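As a minimal sketch of pulling out just those elements with SimpleXML, the following parses a small hand-written stand-in for the feed. The sample XML, its values, and the array keys are illustrative assumptions, not the real Yahoo! feed content.

```php
<?php
// Hypothetical sample standing in for the real feed; all values are made up.
$xml = <<<'XML'
<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0">
  <channel>
    <title>Sample Software News</title>
    <link>http://example.com/news</link>
    <lastBuildDate>Mon, 02 Jan 2006 15:04:05 GMT</lastBuildDate>
    <ttl>5</ttl>
    <image><url>http://example.com/logo.gif</url></image>
    <item>
      <title>First story</title>
      <link>http://example.com/story/1</link>
      <guid isPermaLink="false">story-1</guid>
      <pubDate>Mon, 02 Jan 2006 15:00:00 GMT</pubDate>
      <description>Short summary of the first story.</description>
    </item>
  </channel>
</rss>
XML;

$rss     = simplexml_load_string($xml);
$channel = $rss->channel;

// Keep only the elements of interest; the image tag is deliberately skipped.
$feed = array(
    'title'         => (string) $channel->title,
    'link'          => (string) $channel->link,
    'lastBuildDate' => strtotime((string) $channel->lastBuildDate),
    'ttlMinutes'    => (int) $channel->ttl, // not stored long term, only a polling hint
    'items'         => array(),
);

foreach ($channel->item as $item) {
    $feed['items'][] = array(
        'title'       => (string) $item->title,
        'link'        => (string) $item->link,
        'guid'        => (string) $item->guid,
        'pubDate'     => strtotime((string) $item->pubDate),
        'description' => (string) $item->description,
    );
}

print_r($feed);
```

Casting each node to a string (or int) copies the value out of the SimpleXML object, so the resulting array can be cached or stored without keeping the parsed document around.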

Which Elements Repeat?

Some elements repeat in the feed (item, for example, in RSS feeds), whereas others have only one instance. Some subelements usually appear only once but may repeat in some cases, so your code should be ready to deal with this. Think of a listing of books: most have only one author, but some (like this one) have more than one. Because this is indeed an RSS feed, the item tag repeats.
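One way to stay ready for subelements that may appear zero, one, or several times is to always iterate over them rather than reading them as a single value. The sketch below assumes a made-up feed in which items carry a varying number of category elements; the sample data is hypothetical.

```php
<?php
// Hypothetical sample: items with one, two, and zero category elements.
$xml = <<<'XML'
<rss version="2.0">
  <channel>
    <item><title>One category</title><category>Software</category></item>
    <item><title>Two categories</title><category>Software</category><category>Security</category></item>
    <item><title>No category</title></item>
  </channel>
</rss>
XML;

$rss   = simplexml_load_string($xml);
$items = array();

foreach ($rss->channel->item as $item) {
    $categories = array();
    // foreach behaves the same whether the element occurs 0, 1, or many times.
    foreach ($item->category as $category) {
        $categories[] = (string) $category;
    }
    $items[(string) $item->title] = $categories;
}

print_r($items);
```

Reading `(string) $item->category` directly would return only the first category, silently dropping the rest, which is why the loop is the safer default.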

Plain Text versus HTML Encoded Data?

Look at what type of data the feed is providing. You may need to either add tags yourself (encapsulate the content of a feed in <p></p> tags, for example), or strip tags to be better displayed within your templates. Remember to view the document's source for this, because your browser will automatically turn things like &amp; into just &.
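Both directions can be sketched with PHP's built-in string functions. The helper names toPlainText and toParagraph are hypothetical, and the sample strings are invented for illustration.

```php
<?php
// Hypothetical helper: reduce HTML content to plain text for a text-only template.
function toPlainText($text)
{
    return trim(strip_tags($text));
}

// Hypothetical helper: wrap plain text in a paragraph for an HTML template.
// Escaping first keeps a stray < or & in the text from breaking the markup.
function toParagraph($text)
{
    return '<p>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '</p>';
}

$htmlDescription  = '<p>New <b>browser</b> released. <a href="http://example.com/">Read more</a></p>';
$plainDescription = 'Vendors ship patches & updates';

echo toPlainText($htmlDescription), "\n";
echo toParagraph($plainDescription), "\n";
```

Note that strip_tags() discards the link target along with the tag, so it is only a reasonable choice when the markup carries nothing you need to keep.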

The Yahoo! feed appears to provide all of its data in plain text, with the appropriate characters encoded as HTML entities. Running strip_tags() on all of the data should be fine because there aren't any meaningful links in the content. It is interesting to note that this feed presently sends its root link as shown here:

  • http://news.yahoo.com/news?tmpl=index&amp;cid=1209

You will need to change &amp; to simply & if you want this to be a functional link.
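A short sketch of that fix, using html_entity_decode() on the link value quoted above. The variable names are illustrative, and the parse_str() call is included only to show that the repaired query string splits into the expected parameters.

```php
<?php
// The link as it arrives, with the ampersand still entity-encoded.
$link = 'http://news.yahoo.com/news?tmpl=index&amp;cid=1209';

// Decode entities so the query string separator is a real ampersand.
$fixed = html_entity_decode($link, ENT_QUOTES, 'UTF-8');
echo $fixed, "\n";

// Sanity check: the query string now yields two separate parameters.
parse_str(parse_url($fixed, PHP_URL_QUERY), $query);
print_r($query);
```

If you output the link inside an HTML attribute later, encode it again at that point (for example with htmlspecialchars()) rather than storing the encoded form.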

Am I Using UTF-8 Encoding? Is The Feed Really Encoded That Way?

Document encoding is rather important. Often (thanks to the magic that is PHP) we pretty much ignore it and trust that everything will work well, when in fact, with tools like SimpleXML, this simply (if you will pardon the pun) isn't the case. Content providers are often careful to ensure that the information they provide is correctly encoded, yet they still manage to send characters in another encoding when including user-submitted text (such as product reviews on Amazon, or posts in a forum). SimpleXML will return errors and warnings if it receives such incorrectly encoded data; in these cases you may need to either massage the data yourself or contact the provider of the feed and inform them of the encoding issues.

For your purposes, the Yahoo! feed declares itself to be ISO-8859-1, and all the content provided appears to be encoded correctly.
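When you do have to massage the data yourself, one possible approach is to check whether a piece of text is valid UTF-8 and, if not, convert it from the encoding you suspect it is really in. The sketch below assumes the mbstring extension is available; normalizeToUtf8 is a hypothetical helper name, and treating invalid text as ISO-8859-1 is a guess that suits this scenario, not a general rule.

```php
<?php
// Hypothetical helper; requires the mbstring extension.
// Returns $text unchanged if it is already valid UTF-8, otherwise converts it
// from $fallbackEncoding (an assumption about what the provider really sent).
function normalizeToUtf8($text, $fallbackEncoding = 'ISO-8859-1')
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', $fallbackEncoding);
}

$latin1 = "Caf\xE9";      // "Café" as ISO-8859-1 bytes, invalid as UTF-8
$utf8   = "Caf\xC3\xA9";  // "Café" as UTF-8 bytes

// Both inputs end up as the same UTF-8 byte sequence.
echo bin2hex(normalizeToUtf8($latin1)), "\n";
echo bin2hex(normalizeToUtf8($utf8)), "\n";
```

This is a heuristic: some byte sequences are valid in more than one encoding, so the check can only catch text that is definitely not UTF-8.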

Note 

Don't kid yourself and think that big professional sites don't make these mistakes; they do! While this book was being written, Amazon corrected an error where it was returning ISO-8859-1 characters in its REST-based API while declaring the stream to be UTF-8. Test your script well, trap errors, and if the script fails while trying to grab a feed, have it notify someone and continue to use the cached copy. Don't trust foreign data, ever!
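The trap, notify, and fall-back pattern described in this note might look something like the following sketch. parseFeedOrFallback is a hypothetical helper, the notifier here only collects messages in an array (a real one might send e-mail), and the sample feeds are invented; the broken one declares UTF-8 but contains an ISO-8859-1 byte.

```php
<?php
// Hypothetical helper: try the fresh feed, and on failure notify and use the cache.
function parseFeedOrFallback($freshXml, $cachedXml, $notify)
{
    // Collect libxml errors instead of letting them surface as PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $feed = ($freshXml === false || $freshXml === '')
        ? false
        : simplexml_load_string($freshXml);

    if ($feed === false) {
        $messages = array();
        foreach (libxml_get_errors() as $error) {
            $messages[] = trim($error->message);
        }
        libxml_clear_errors();
        call_user_func($notify, 'Feed failed to parse: ' . implode('; ', $messages));

        // Keep serving the last known good copy.
        $feed = simplexml_load_string($cachedXml);
    }

    libxml_use_internal_errors($previous);
    return $feed;
}

$cached = '<rss version="2.0"><channel><title>Cached copy</title></channel></rss>';

// Declares UTF-8 but carries the ISO-8859-1 byte \xE9, which is not valid UTF-8.
$broken = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
        . "<rss version=\"2.0\"><channel><title>Caf\xE9</title></channel></rss>";

$alerts = array();
$feed = parseFeedOrFallback($broken, $cached, function ($message) use (&$alerts) {
    $alerts[] = $message; // stand-in for e-mailing or paging someone
});

echo (string) $feed->channel->title, "\n";
echo count($alerts), " alert(s) raised\n";
```

Passing the raw XML in as a string keeps the fetching step separate, so the same function works whether the data came from the network or from a test fixture.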

Browse the Site Providing the Feed

Look around the site for copyright information and restrictions regarding the feed you want to consume. Also look to see if there is more detailed information with regard to what will be included in the feed (in terms of HTML formatting, content-type, and so on).

Browsing the Yahoo! site indicates that the feed will be provided in the RSS format, and lays out the terms of use for the content of the feed. You can view this information here: http://news.yahoo.com/rss.

Reuters also provides a feed, as well as clear terms of use. If you are planning to make use of a feed in any commercial context, be especially careful when consuming feeds. You (or someone from the legal department) would be well advised to contact the feed provider to request permission before coding. Here is what Reuters spells out as its terms of use:

Note 

What are the terms of use?

Reuters offers RSS as a free service to any individual user or non-profit organization that would like to access it for non-commercial use. For all other usage requests, contact us. By accessing our RSS service you are indicating your understanding and agreement that you will not use Reuters RSS for commercial purposes. Reuters requests that your use of our RSS service be accompanied by proper attribution to Reuters as the source. Reuters reserves the right to discontinue this service at any time and further reserves the right to request the immediate cessation of any specific use of our RSS service.




Professional Web APIs with PHP. eBay, Google, PayPal, Amazon, FedEx, Plus Web Feeds
ISBN: 764589547
Year: 2006
Pages: 130
