Putting It All Together | Professional Web APIs with PHP. eBay, Google, PayPal, Amazon, FedEx, Plus Web Feeds

Using the methods described earlier, this section shows a complete script for retrieving and recording a foreign feed. In this case the script was run against several RSS-based blog feeds. These feeds did not provide a description element as the earlier Yahoo! samples did, so the content:encoded element is used in its place:

    function processRSSFeed($xml, $source) {
      $updatedStories = 0;
      foreach($xml->channel->item AS $story)
      {
        $content = $story->children("http://purl.org/rss/1.0/modules/content/");
        $storyContent = $content->encoded;

Because content:encoded belongs to the content module's namespace, it is not reachable as a plain child of the item. Calling children() with the namespace URI accesses it directly.

        if (saveFeed($story->guid, $source, $story->title, $story->pubDate, $storyContent, $story->link) == 2)
        {
          break;
        }
        $updatedStories += 1;
      }
      return $updatedStories;
    }

As you can tell, there haven't been many changes to the processRSSFeed function.
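The namespace lookup can be tried in isolation. The following is a minimal sketch, assuming a small inline feed in place of a real download; the sample XML and its values are invented for illustration, and only the namespace URI is taken from the listing.

```php
<?php
// Hypothetical sample feed; a real one would be fetched from the blog.
$rss = <<<XML
<?xml version="1.0"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>First post</title>
      <content:encoded>&lt;p&gt;Hello, world&lt;/p&gt;</content:encoded>
    </item>
  </channel>
</rss>
XML;

$xml = simplexml_load_string($rss);
foreach ($xml->channel->item as $story) {
    // Namespaced elements are not visible as plain properties of $story,
    // so request the children that belong to the content namespace.
    $content = $story->children("http://purl.org/rss/1.0/modules/content/");
    // Cast to string to get the text rather than a SimpleXMLElement object.
    $storyContent = (string) $content->encoded;
    echo $story->title, ": ", $storyContent, "\n";
}
```

Casting to a string is worth doing before the value is passed to string functions or stored, since SimpleXML otherwise hands back an object.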

    function saveFeed($guid, $source, $title, $date, $content, $link) {
      if (strlen($guid) > 0)
      {
        $pk = md5($source . $guid);
      }else
      {
        $pk = md5($source . $title);
      }

The big plus with MD5 (or any other hashing algorithm, for that matter) is that no matter what the input is, the output always has a predictable length and format (for MD5, a 32-character hexadecimal string), so this is one case where no changes are needed.
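To make the fixed-length property concrete, here is a small sketch of the same key logic. The helper name feedKey and the sample values are hypothetical and do not appear in the original script.

```php
<?php
// Hypothetical helper mirroring the key logic in saveFeed().
function feedKey($source, $guid, $title)
{
    // Fall back to the title when the feed supplies no guid.
    $basis = strlen($guid) > 0 ? $guid : $title;
    return md5($source . $basis);
}

$short = feedKey('example-blog', '42', 'A short title');
$long  = feedKey('example-blog', str_repeat('x', 5000), 'ignored');
$again = feedKey('example-blog', '42', 'A different title');

// md5() always yields 32 hexadecimal characters, whatever the input size.
echo strlen($short), ' ', strlen($long), "\n";
// The same source and guid always produce the same key.
var_dump($short === $again);
```

Because the key depends only on the source and the guid (or title), saving the same story twice targets the same row instead of creating a duplicate.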

      $linkID = db_connect();
      //We still don't want any HTML tags in the title of the item
      $title = mysql_real_escape_string(strip_tags($title));
      //Clean broken HTML first, to avoid problems with other steps
      $config = array('indent' => TRUE,
                      'output-html' => TRUE,
                      'wrap' => 200,
                      'clean' => TRUE,
                      'show-body-only' => TRUE);
      $tidy = tidy_parse_string($content, $config, 'UTF8');
      tidy_clean_repair($tidy);
      $content = tidy_get_output($tidy);

The configuration asks for HTML output, wrapped at 200 characters and cleaned up, containing only what would appear within the body element rather than an entire page.

      //Confirm HTML links are absolute, and append the url to the link
      //(a callback replaces the /e modifier, which PHP 7 no longer supports)
      $content = preg_replace_callback(
        '/<a\s+.*?href=[\"\']?([^\"\'>]*)[\"\']?\s?(title=[\"\']?([^\"\'>]*)[\"\']?)?[^>]*>(.*?)<\/a>/i',
        function ($m) use ($source) {
          return cleanAndDisplayHREF($source, $m[1], $m[3], $m[4]);
        },
        $content);
      //Display images as images, but load from local server
      $content = preg_replace_callback(
        '/<img\s+.*?src="([^\"\' >]*)"\s?(width="([0-9]*)")?\s?(height="([0-9]*)")?[^>]*>/i',
        function ($m) use ($source) {
          //Unmatched trailing groups are omitted, so pad the array to six entries
          $m = array_pad($m, 6, '');
          return retrieveImages($source, $m[0], $m[1], $m[2], $m[3], $m[4], $m[5]);
        },
        $content);
      $content = mysql_real_escape_string(strip_tags($content, "<p><img><a>"));

Deal with any and all links or images within the provided text, then strip out any HTML tags that aren't images, links, or paragraph markers. It is strongly advisable to take a strict whitelist approach to which tags you want to allow, especially if the content will appear nested within other items. Having a first-level header appear in the middle of what should be smooth text can ruin your day and your page layout.
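The effect of the whitelist is easy to see on a small fragment. This is only a sketch: the sample HTML is invented, while the allowed-tags string is the one used in the listing.

```php
<?php
// Hypothetical fragment of feed content.
$html = '<h1>Big heading</h1><p>Hello <b>world</b>, see '
      . '<a href="http://example.com/">this</a>.</p>';

// Same whitelist as the listing: paragraphs, images, and links survive.
$clean = strip_tags($html, "<p><img><a>");

echo $clean, "\n";
// Big heading<p>Hello world, see <a href="http://example.com/">this</a>.</p>
```

Note that strip_tags() removes only the tags themselves: the heading text stays, and attributes on allowed tags are left untouched, so a whitelist alone does not vet what those attributes contain.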

      $link = mysql_real_escape_string($link);
      $source = mysql_real_escape_string($source);
      $date = strtotime($date);
      //strtotime() returns FALSE on failure as of PHP 5.1; older versions returned -1
      if ($date === FALSE || $date == -1)
      {
        $date = time();
      }
      $query = "REPLACE INTO 03_feed_raw
      (`id`, `source`, `title`, `date`, `content`, `link`)
      VALUES
      ('$pk', '$source', '$title', FROM_UNIXTIME('$date'), '$content', '$link')";
      return replaceQuery($query, $linkID);
    }

Finally, using the escaped link and source information, along with the now properly formatted date, the information is replaced in the database for future use.
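The date handling can be sketched on its own. The helper name feedTimestamp is hypothetical; the check covers both failure values that strtotime() has used, FALSE as of PHP 5.1 and -1 in older releases.

```php
<?php
// Hypothetical helper: convert a feed date to a Unix timestamp,
// falling back to the current time when the date cannot be parsed.
function feedTimestamp($date)
{
    $ts = strtotime($date);
    // FALSE signals failure as of PHP 5.1; older releases returned -1.
    if ($ts === false || $ts == -1) {
        return time();
    }
    return $ts;
}

// An RFC 822 style date, as found in an RSS pubDate element.
$parsed = feedTimestamp('Mon, 02 Jan 2006 15:04:05 +0000');
echo gmdate('Y-m-d H:i:s', $parsed), "\n";
// 2006-01-02 15:04:05

// An unparseable value falls back to the current time.
$fallback = feedTimestamp('not a date');
```

Falling back to the current time keeps the story in the database with a plausible date, at the cost of losing the feed's original ordering for that item.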