Hack 70 Using the Link Cosmos of Technorati

Similar to other indexing sites like Blogdex, the Link Cosmos at Technorati keeps track of an immense number of blogs, correlating popular links and topics for all to see. With the recently released API, developers can now integrate the results into their own scripts.

Technorati (http://www.technorati.com) walks, crawls, investigates, and generally mingles around weblog-style web sites and indexes them, gathering loads of information. I mean loads: it keeps track of articles on the web site, what links to it, what it links to, how popular it is, how popular the web sites that link to it are, how popular the people that read it are, and who is most likely to succeed. Well, it does most of those things.

Need Some REST?

The current version of the Technorati API uses a REST (Representational State Transfer) interface. REST interfaces transfer data through ordinary HTTP GET or POST requests to a URL. We will initially use the interface to access the Technorati Cosmos data. The Cosmos is the set of data that keeps track of who links to whom, which in effect records who thinks who is interesting. Technorati allows queries of the following information via the REST interface:


Link Cosmos

Who you link to, who links to whom, and when.


Blog info

General information about a specified weblog, including the weblog name, URL, RSS URL (if one exists), how many places it links to, how many places link to it, and when it was last updated. This is the same information that is returned for each weblog in the Cosmos lookup.


Outbound blogs

A list of web sites that the specified URL links to.

We're going to focus on the Link Cosmos information, which in my bloated opinion is the most important. The following small piece of code uses the Technorati interface to grab the current weblog listing and print the XML it returns. You'll need to become a member of the site to receive your developer's API key:

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my $key       = "your developer key";
    my $searchURL = "http://www.perceive.net/";
    my $restAPI   = "http://api.technorati.com/cosmos?key=$key&url=".
                    "$searchURL&type=weblog&format=xml";

    # get() returns undef if the request fails.
    my $xml = get($restAPI);
    die "Request to the Technorati API failed\n" unless defined $xml;
    print "$xml\n";

Dave Sifry, the developer of Technorati, has also made a small distinction between general web sites and weblogs. Notice type=weblog in the URL of the previous code. You can change this to type=link, and you'll get the last 20 web sites that link to your site, rather than just the last 20 blogs. This is a small distinction, but one that could be useful.
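Switching between the two types is easier if the query URL is built in one place. The following is a minimal sketch, not part of any Technorati library: cosmos_url is a hypothetical helper name, and it assumes only the parameters already shown in this hack (key, url, type, format). It also escapes each value with URI::Escape, which the hack's own code does not do, so that characters such as : and / in the target URL cannot be mistaken for parts of the query string.

```perl
use strict;
use warnings;
use URI::Escape qw(uri_escape);

# Hypothetical helper: builds a Cosmos query URL from the parameters
# used in this hack. 'type' defaults to 'weblog'; pass 'link' for all sites.
sub cosmos_url {
    my (%arg) = @_;
    my @params = (
        key    => $arg{key},
        url    => $arg{url},
        type   => $arg{type} || 'weblog',
        format => 'xml',
    );

    # Consume the list two elements at a time to keep the parameter order.
    my @pairs;
    while (my ($name, $value) = splice(@params, 0, 2)) {
        push @pairs, "$name=" . uri_escape($value);
    }
    return 'http://api.technorati.com/cosmos?' . join('&', @pairs);
}

print cosmos_url(
    key  => 'YOUR_KEY',
    url  => 'http://www.perceive.net/',
    type => 'link',
), "\n";
```

The string it returns can be handed to get() exactly as $restAPI is in the listing above.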

The returned result is a chunk of XML, which resembles this:

    <item>
      <weblog>
        <name>phil ringnalda dot com</name>
        <url>http://philringnalda.com</url>
        <rssurl>http://www.philringnalda.com/index.xml</rssurl>
        <inboundblogs>339</inboundblogs>
        <inboundlinks>471</inboundlinks>
        <lastupdate>2003-07-11 21:09:28 GMT</lastupdate>
      </weblog>
    </item>

Many REST interfaces use XML as the format for returning data to the requestor. This allows the data to be parsed easily and used in various ways, such as creating HTML for your web site:

    use XML::Simple;

    # ForceArray keeps {item} an array reference even when only one item is returned.
    my $parsed_data = XMLin($xml, ForceArray => ['item']);
    my $items = $parsed_data->{document}->{item};

    print qq{<ol>\n};
    for my $item (@$items) {
        my ($name, $url) = ($item->{weblog}->{name}, $item->{weblog}->{url});
        print qq{<li><a href="$url">$name</a></li>\n};
    }
    print qq{</ol>\n};

First, we load the XML::Simple module, which will allow us to load the data into a hash. The XMLin function does this for us and returns a hash of hashes and arrays. After XMLin has loaded the data, we get an array of weblog items and iterate through it, printing some HTML with links to the web sites. We could just as easily have printed it as a comma-delimited file or anything else we could cook up in our silly little heads.
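As an illustration of the comma-delimited option, here is a rough sketch that turns a Cosmos response into CSV. It runs against an inline sample rather than a live request. The tapi and document wrapper elements in the sample are an assumption inferred from the {document}->{item} lookups used in this hack, and cosmos_to_csv and csv_field are hypothetical helper names.

```perl
use strict;
use warnings;
use XML::Simple;

# Sample response. The <tapi>/<document> wrapper is an assumption; the
# <item> content is taken from the excerpt shown in this hack.
my $sample_xml = <<'XML';
<tapi version="1.0">
  <document>
    <item>
      <weblog>
        <name>phil ringnalda dot com</name>
        <url>http://philringnalda.com</url>
        <inboundblogs>339</inboundblogs>
        <inboundlinks>471</inboundlinks>
      </weblog>
    </item>
  </document>
</tapi>
XML

# Quote one CSV field, doubling any embedded double quotes.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    $value =~ s/"/""/g;
    return qq{"$value"};
}

# Convert the parsed items into a header row plus one row per weblog.
sub cosmos_to_csv {
    my ($xml) = @_;
    my $data = XMLin(
        $xml,
        ForceArray    => ['item'],  # always an array, even for one item
        KeyAttr       => [],        # no folding of arrays into hashes
        SuppressEmpty => '',        # empty elements become empty strings
    );
    my @columns = qw(name url inboundblogs inboundlinks);
    my @rows    = (join ',', @columns);
    for my $item (@{ $data->{document}{item} || [] }) {
        my $weblog = $item->{weblog} || {};
        push @rows, join ',', map { csv_field($weblog->{$_}) } @columns;
    }
    return join("\n", @rows) . "\n";
}

print cosmos_to_csv($sample_xml);
```

Quoting every field is a deliberately simple choice: weblog names often contain commas, and quoting avoids having to decide field by field.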

The most interesting part of all of this is the transfer and use of the information; Technorati allows us to see who has created links to our web site and use that data for free. Dave obviously learned how to share in kindergarten.

A Skeleton Key for Words

In addition to the lovely Cosmos API, Technorati provides us with an interface to query for weblog posts that contain a specified keyword. For instance, say you really like Perl; you can query the API periodically to get all the recent posts that contain "Perl." I can imagine some handy uses for that: if you have keywords attached to posts in your weblog, you could have a Related Posts link that queries Technorati for other posts containing those keywords and shows a list of articles similar to yours.

The API to retrieve this information is also a REST interface, following the lead of the Cosmos API. We can alter the code for the Cosmos API to provide access to this data:

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    my $key        = "your developer key";
    my $searchTerm = "Perl";
    my $restAPI    = "http://api.technorati.com/search?key=$key".
                     "&query=$searchTerm&format=xml";

    # get() returns undef if the request fails.
    my $xml = get($restAPI);
    die "Request to the Technorati API failed\n" unless defined $xml;
    print "$xml\n";

Searching using the Keyword API returns more information in the XML stream, which gives some context to why it returned a match for a given item:

    <context>
      <excerpt>
        Ben Trott has uploaded version 0.02 of XML::FOAF to CPAN.
        This is a <b>Perl</b> module designed to make it...
      </excerpt>
      <title>New version of XML::FOAF in CPAN</title>
      <link>http://rdfweb.org/mt/foaflog/archives/000033.html</link>
    </context>

The returned data consists of an excerpt of the words that appear near the search keyword (the keyword itself is wrapped in HTML bold tags, <b>Perl</b> in this example), the title of the article it was found in, and a URL to the item. The result also contains the same information about the weblog it was found in, such as inbound and outbound links.

We can slightly modify the previous code from the Cosmos API to display these related articles in a nice, concise format:

    use XML::Simple;

    # ForceArray keeps {item} an array reference even when only one item is returned.
    my $parsed_data = XMLin($xml, ForceArray => ['item']);
    my $items = $parsed_data->{document}->{item};

    print qq{<dl>\n};
    for my $item (@$items) {
        my ($weblog, $context, $title, $link) =
          ($item->{weblog}->{name}, $item->{context}->{excerpt},
           $item->{context}->{title}, $item->{context}->{link});
        print qq{<dt><a href="$link">$weblog : $title</a></dt>\n};
        print qq{<dd>$context</dd>\n};
    }
    print qq{</dl>\n};
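Weblog names and post titles come from other people's sites, so it may be worth escaping them before dropping them into your own page. The sketch below is one way to do that, under a few assumptions: the items are shaped like the parsed search results above, escape_html and related_posts_html are hypothetical helper names, and the weblog name in the usage example is made up. The excerpt is escaped as well, and then only the bold tags around the keyword are restored.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text and attribute values.
sub escape_html {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: render search items as a definition list.
sub related_posts_html {
    my ($items) = @_;
    my $html = "<dl>\n";
    for my $item (@$items) {
        my $context = $item->{context} || {};
        next unless defined $context->{link};   # nothing to link to

        my $weblog  = escape_html($item->{weblog}{name});
        my $title   = escape_html($context->{title});
        my $link    = escape_html($context->{link});

        # Escape the excerpt, then restore only <b> and </b>.
        my $excerpt = escape_html($context->{excerpt});
        $excerpt =~ s{&lt;(/?)b&gt;}{<$1b>}g;

        $html .= qq{<dt><a href="$link">$weblog : $title</a></dt>\n};
        $html .= qq{<dd>$excerpt</dd>\n};
    }
    return $html . "</dl>\n";
}

# Usage with one hand-built item (the weblog name is invented).
my @items = ({
    weblog  => { name => 'FOAF & friends' },
    context => {
        title   => 'New version of XML::FOAF in CPAN',
        link    => 'http://rdfweb.org/mt/foaflog/archives/000033.html',
        excerpt => 'This is a <b>Perl</b> module designed to make it...',
    },
});
print related_posts_html(\@items);
```

Returning the HTML as a string rather than printing it directly makes the helper easier to reuse in a template or a Related Posts sidebar.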

The Technorati API is a useful method for retrieving information about weblogs, and it can help in the aggregation of useful data. With the attention that is paid to Technorati, I'm sure that these interfaces will become even more robust and useful as the development progresses. With the information in this hack, you are capable of using and expanding on these interfaces, creating uses of the data that are even more interesting. Further information is available at the Technorati Developer Wiki (http://developers.technorati.com/wiki/) and mailing list (http://developers.technorati.com/mailman/listinfo/api-discuss).

Eric Vitiello



Spidering Hacks
ISBN: 0596005776
Year: 2005
Pages: 157
