Downloading Usenet Articles

The core of Usenet, of course, is downloading and reading articles. This lab shows how to learn which articles are available in a group, and then download the most recent ones.

9.2.1. How Do I Do That?

Call NNTPClient's fetchGroup method to get the article count and first and last message numbers for a newsgroup. Then call fetchArticle to download each article. Example 9-2 demonstrates the technique.

Example 9-2. nntpdownload.py




from twisted.news import nntp
from twisted.internet import protocol, defer
import time, email


class NNTPGroupDownloadProtocol(nntp.NNTPClient):

    def connectionMade(self):
        nntp.NNTPClient.connectionMade(self)
        # ask the server about the group as soon as the connection is up
        self.fetchGroup(self.factory.newsgroup)

    def gotGroup(self, groupInfo):
        articleCount, first, last, groupName = groupInfo
        first = int(first)
        last = int(last)
        # go back articleCount articles, but never past the first available one
        start = max(first, last - self.factory.articleCount + 1)
        self.articlesToFetch = range(start, last + 1)
        self.articleCount = len(self.articlesToFetch)
        self.fetchNextArticle()

    def fetchNextArticle(self):
        if self.articlesToFetch:
            nextArticleIdx = self.articlesToFetch.pop(0)
            # trailing comma: stay on this line until "OK" or an error is printed
            print "Fetching article %i of %i..." % (
                self.articleCount - len(self.articlesToFetch),
                self.articleCount),
            self.fetchArticle(nextArticleIdx)
        else:
            # all done
            self.quit()
            self.factory.deferred.callback(0)

    def gotArticle(self, article):
        print "OK"
        self.factory.handleArticle(article)
        self.fetchNextArticle()

    def getArticleFailed(self, errorMessage):
        # report the failure, then carry on with the next article
        print errorMessage
        self.fetchNextArticle()

    def getGroupFailed(self, errorMessage):
        self.factory.deferred.errback(Exception(errorMessage))
        self.quit()
        self.transport.loseConnection()

    def connectionLost(self, error):
        # only report the lost connection if nothing else has fired the deferred
        if not self.factory.deferred.called:
            self.factory.deferred.errback(error)


class NNTPGroupDownloadFactory(protocol.ClientFactory):
    protocol = NNTPGroupDownloadProtocol

    def __init__(self, newsgroup, outputfile, articleCount=10):
        self.newsgroup = newsgroup
        self.articleCount = articleCount
        self.output = outputfile
        self.deferred = defer.Deferred()

    def handleArticle(self, articleData):
        parsedMessage = email.message_from_string(articleData)
        # unixfrom=True adds the "From " line that separates messages in an mbox
        self.output.write(parsedMessage.as_string(unixfrom=True))
        self.output.write('\n\n')


if __name__ == "__main__":
    from twisted.internet import reactor
    import sys

    def handleError(error):
        print >> sys.stderr, error.getErrorMessage()
        reactor.stop()

    if len(sys.argv) != 4:
        print >> sys.stderr, (
            "Usage: %s nntpserver newsgroup outputfile" % sys.argv[0])
        sys.exit(1)

    server, newsgroup, outfile = sys.argv[1:4]
    factory = NNTPGroupDownloadFactory(newsgroup, file(outfile, 'w+b'))
    factory.deferred.addCallback(
        lambda _: reactor.stop()).addErrback(
        handleError)
    reactor.connectTCP(server, 119, factory)
    reactor.run()

Run nntpdownload.py with the name of an NNTP server, a newsgroup, and the filename to which the messages should be written. It will connect to the server, download the most recent 10 messages from that newsgroup, and then quit:


    $ python nntpdownload.py freetext.usenetserver.com comp.lang.python \
    > comp.lang.python-latest.mbox
    Fetching article 1 of 10... OK
    Fetching article 2 of 10... OK
    Fetching article 3 of 10... OK
    Fetching article 4 of 10... OK
    Fetching article 5 of 10... OK
    Fetching article 6 of 10... OK
    Fetching article 7 of 10... OK
    Fetching article 8 of 10... OK
    Fetching article 9 of 10... OK
    Fetching article 10 of 10... OK

9.2.2. How Does That Work?

The NNTPGroupDownloadProtocol class, a subclass of nntp.NNTPClient, does most of the work in nntpdownload.py. The self.fetchGroup method asks the server for information about the newsgroup. When the server responds, gotGroup is called with the returned information: the total number of articles in the group, the index of the first article, the index of the last article, and the group name. NNTPGroupDownloadProtocol then goes back the number of articles specified by self.factory.articleCount (unless there aren't that many messages, in which case it just goes back to the first available article) and uses Python's range function to create a list of every number from the starting message index to the ending message index. Then it calls fetchNextArticle to begin downloading the set of messages.
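The window calculation described above can be tried on its own, without a server. The following is a minimal sketch in Python 3; recent_article_numbers is a hypothetical helper, not part of Twisted, and this variant includes the first available article when the group holds fewer articles than requested.

```python
def recent_article_numbers(first, last, count):
    """Return the numbers of the `count` most recent articles in first..last.

    Hypothetical helper for illustration; `first` and `last` may be given as
    strings, since the example converts the group information with int().
    """
    first, last = int(first), int(last)
    # Go back `count` articles, but never past the first available one.
    start = max(first, last - count + 1)
    # An empty list results when last < first (a group with no articles).
    return list(range(start, last + 1))
```

Article numbers on a real server are not guaranteed to be contiguous, so some of the numbers in this list may refer to articles that no longer exist; the protocol's failure handler covers that case.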

fetchNextArticle takes the remaining list of article indexes and downloads the first one with a call to self.fetchArticle. The gotArticle method, called when the article has been successfully downloaded, passes the article data to self.factory.handleArticle and then calls self.fetchNextArticle to move on to the next index. If an article download fails, the getArticleFailed method is called instead. getArticleFailed prints the error message but doesn't abort the entire operation; it simply goes on to the next message. When the list is empty, fetchNextArticle calls quit and fires the factory's Deferred, which stops the reactor.
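The control flow of that loop is easier to see with the networking removed. Here is a rough Python 3 sketch of the same queue-and-callback pattern; FetchQueue, lookup, and handle are hypothetical names, and a plain dictionary stands in for the news server.

```python
class FetchQueue:
    """Hypothetical stand-in for the protocol's fetch loop (no networking)."""

    def __init__(self, numbers, lookup, handle):
        self.pending = list(numbers)
        self.lookup = lookup    # callable(number) -> article text or None
        self.handle = handle    # callable(article text)
        self.errors = []
        self.done = False

    def fetch_next(self):
        if not self.pending:
            # Nothing left: the real protocol quits and fires its Deferred here.
            self.done = True
            return
        number = self.pending.pop(0)
        article = self.lookup(number)
        if article is None:
            self.get_article_failed("no such article: %d" % number)
        else:
            self.got_article(article)

    def got_article(self, article):
        self.handle(article)
        self.fetch_next()

    def get_article_failed(self, message):
        # Record the failure and keep going rather than aborting the run.
        self.errors.append(message)
        self.fetch_next()
```

Because the callbacks here run immediately, each article adds a level of recursion. That is acceptable for a short list in a sketch; in the real client the callbacks are invoked later by the reactor, so the stack does not grow.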

An alternative to the approach used here is to use the fetchNewNews method, which takes a date and returns a list of all the articles posted since. Unfortunately, the underlying NEWNEWS command is not supported by many servers.
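For reference, a NEWNEWS request is a single line of text. The sketch below formats one in the RFC 977 style (two-digit year, with the optional GMT flag); newnews_command is a hypothetical helper, not part of Twisted, and it assumes it is given a timezone-aware datetime.

```python
from datetime import datetime, timezone


def newnews_command(newsgroup, since):
    """Build an RFC 977 style NEWNEWS line for articles newer than `since`.

    Hypothetical helper; `since` is assumed to be a timezone-aware datetime.
    """
    # RFC 977 expects the date as YYMMDD and the time as HHMMSS.
    stamp = since.astimezone(timezone.utc).strftime("%y%m%d %H%M%S")
    return "NEWNEWS %s %s GMT" % (newsgroup, stamp)
```

A server that supports the command answers with a list of message IDs rather than article numbers, so a client built this way would fetch articles by ID.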

Because Usenet articles are in the same format as email, they can be stored in the same Unix mbox format used by the mail client examples shown in Chapter 7. The NNTPGroupDownloadFactory's handleArticle method parses the message using the email module and writes it to the output file in mbox format, followed by two blank lines to ensure that it will be clearly delimited from the next article in the file.
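The mbox-writing step can be exercised without a server. This is a small Python 3 sketch of the same idea, using an in-memory buffer in place of the output file; append_article and build_mbox are hypothetical names introduced for illustration.

```python
import email
import io


def append_article(output, article_text):
    """Parse one raw article and append it to `output` in mbox style."""
    message = email.message_from_string(article_text)
    # unixfrom=True emits a leading "From " line, which is what separates
    # one message from the next in an mbox file.
    output.write(message.as_string(unixfrom=True))
    output.write("\n\n")


def build_mbox(articles):
    """Return the mbox text for a list of raw article strings."""
    buffer = io.StringIO()
    for article_text in articles:
        append_article(buffer, article_text)
    return buffer.getvalue()
```

One caveat: as_string does not escape body lines that themselves begin with "From ", so an article containing such a line could be split in two by a strict mbox reader. The Generator class in the email package offers a mangle_from_ option for that case.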

Twisted Network Programming Essentials
ISBN: 0596100329
Year: 2004
Pages: 107
Authors: Abe Fettig
