Hack 42 Downloading from Usenet with nget

Even though common wisdom states that porn peddlers and spam pushers have overrun Usenet, there are still a number of groups resolutely producing good content for good folks. In this hack, we'll show how to download files from the newsgroups of your choice.

nget (http://nget.sourceforge.net/) is an open source Usenet downloader, available for Linux, FreeBSD, Mac OS X, cygwin32, and mingw32. Much as wget [Hack #26] is a downloader for the Web, nget is a downloader for Usenet newsgroups, and it offers a large number of configuration choices for archiving the files posted there.

Once nget is installed, you'll need to copy the default .ngetrc configuration file into either ~/.ngetrc4/ or ~/_ngetrc/. nget reads .ngetrc each time it runs, and the file contains a hefty dose of sane values to form a basis for your future operations. To get started, we'll have to edit this file to point to our own Usenet server. If your ISP doesn't provide you with one, you can find a list of public access servers at http://dmoz.org/Computers/Usenet/Public_News_Servers/ (though you might have trouble finding one that supports alt.binaries.*, which is what this hack assumes you have access to). Open the .ngetrc file and scroll down until you see this:

    //hostname aliases
    {halias
     {<yourhostalias>
      addr=<yourhostaddress>
      id=1
    #optional host config settings:
    #  user=<name>
    #  pass=<password>
    #  fullxover=1
    #  shortname=<y>
    #  maxstreaming=64
    #  idletimeout=300
    #  linelenience=0
     }
    #Examples:
    # {host1
    #  addr=news.host1.com
    #  fullxover=1
    #  id=384845
    #  linelenience=0,2
    # }

For nget to know which Usenet server to download from, we'll need to modify the <yourhostalias> and <yourhostaddress> values, like this:

    //hostname aliases
    {halias
     {readfreenews
      addr=biggulp.readfreenews.net
      id=1
    ... etc ...
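If you would rather script the change than open an editor, the two placeholders can be substituted with sed. This is only a sketch: `fill_host` is a name made up for this example, the sample file is built on the spot, and it assumes the placeholders appear in your .ngetrc exactly as shown above.

```shell
# fill_host is a hypothetical helper, not part of nget.
# It prints the config file with both placeholders filled in.
#   $1 = host alias, $2 = server address, $3 = path to the config file
fill_host() {
    sed -e "s/<yourhostalias>/$1/" \
        -e "s/<yourhostaddress>/$2/" "$3"
}

# Build a small sample file so the sketch can run on its own.
sample=$(mktemp)
printf '%s\n' '{halias' ' {<yourhostalias>' '  addr=<yourhostaddress>' '  id=1' ' }' > "$sample"

# Print the filled-in version; redirect to a new file to keep it.
fill_host readfreenews biggulp.readfreenews.net "$sample"
rm -f "$sample"
```

The function writes to standard output instead of editing in place, so the original .ngetrc stays untouched until you are happy with the result.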

We can configure as many servers as necessary, simply by creating more hostname alias blocks. What you set for <yourhostalias> can be used on the command line to specify different servers for different downloads (such as nget --host readfreenews or nget --host host1).

Save the file after making those changes. Technically, we're done. With our server configured, we could use the -g flag to pass the group we want to download, and nget would resolutely start downloading headers:

    % nget -g alt.binaries.pictures.comics
    make_connection(1,biggulp.readfreenews.net,nntp,0xbfffea20,512)
    Connecting to 209.98.153.154:119
    r >> 201 NNRP BIGGULP (white) - problems - email news@readfreenews.com
    r << GROUP alt.binaries.pictures.comics
    r >> 211 7934 2518935 2526868 alt.binaries.pictures.comics
    Retrieving headers 2519140-2526868 : 2087/7728/7729 27% 3313B/s 7m44s
    ...
    saving cache: 7932 parts, 5067 files.. done. (7932 sa)
    r << QUIT
    OK: 1 group
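With more than one host alias configured, the same header update can be repeated against each server from a small script. The loop below is a rough sketch: the alias names are the ones used in the configuration examples earlier, while the `NGET` variable and the `update_headers` function are additions of this example so that the command can be printed as a dry run. It also assumes that --host may be given before -g on the same command line.

```shell
# NGET defaults to the real binary; set NGET="echo nget" for a dry run.
NGET=${NGET:-nget}

# update_headers is a hypothetical wrapper: $1 is the group, the
# remaining arguments are host aliases defined in .ngetrc.
update_headers() {
    group=$1
    shift
    for host in "$@"; do
        # $NGET is left unquoted on purpose so "echo nget" splits in two.
        $NGET --host "$host" -g "$group"
    done
}

# Dry run: print the commands instead of contacting any server.
NGET="echo nget"
update_headers alt.binaries.pictures.comics readfreenews host1
```

Remove the `NGET="echo nget"` line once the printed commands look right.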

This is all fine and dandy, but no files were actually downloaded, only the message headers. To retrieve files, use the -r flag, which receives a regular expression that defines which files you'd like to download. Regular expressions are a discussion for another time and place, but take a look at the following examples:

    # this would retrieve every message available.
    % nget -g alt.binaries.pictures.comics -r ".*"

    # this would retrieve messages with a subject
    # line that contained "jpg" somewhere in it.
    % nget -g alt.binaries.pictures.comics -r ".*jpg"

    # this would retrieve messages with a subject
    # line that matches either "Donald" or "Mickey".
    % nget -g alt.binaries.pictures.comics -r "(Donald|Mickey)"

    # the exact same command as the above, only
    # this time, check for duplicates of files
    # we've already downloaded.
    % nget -g alt.binaries.pictures.comics -df -r "(Donald|Mickey)"

    # download all messages that DO have "jpg" in the
    # subject line, but DON'T have the word "unknown".
    % nget -g alt.binaries.pictures.comics -r '(?=^(?:(?!unknown).)*$).*jpg'

    # same thing as the above, only slightly more readable.
    % nget -g alt.binaries.pictures.comics -R "subject jpg == subject unknown != &&"
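Before pointing a pattern at a live server, it can help to try it against a few subject lines locally. The sketch below uses grep for that. The subject lines are invented, `try_pattern` and `with_without` are names made up here, and grep -E does not support the lookahead syntax, so the "jpg but not unknown" case is modeled as two separate filters instead. nget's own regex engine may differ in details, so treat this as a sanity check only.

```shell
# Invented subject lines standing in for real Usenet headers.
subjects='Donald Duck 001.jpg (1/1)
Mickey Mouse 014.jpg (1/2)
unknown scan 77.jpg (1/1)
Goofy index.txt (1/1)'

# try_pattern: print the subjects that match extended regex $1.
try_pattern() {
    printf '%s\n' "$subjects" | grep -E -- "$1"
}

# with_without: print the subjects that match $1 but do not match $2.
# This mirrors the intent of: -R "subject jpg == subject unknown != &&"
with_without() {
    printf '%s\n' "$subjects" | grep -E -- "$1" | grep -E -v -- "$2"
}

try_pattern '.*jpg'             # the three subjects containing "jpg"
try_pattern '(Donald|Mickey)'   # the Donald and Mickey subjects
with_without 'jpg' 'unknown'    # "jpg" subjects without "unknown"
```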

There's one problem with these examples: we've assumed that you know the name of the group you want to download from. What if you want to find other comic groups to grab files from? Besides investigating on the Net, how are you going to know which groups you want?

Thankfully, there are a few different ways that nget can handle this for you. Depending on the server you've configured, you might be able to search the server's group list directly with nget -XT -r "comics". This may not always work, and the alternative is to download (or update) the group listing first and search it locally:

    % nget -a -T -r "comics"
    ... etc ...
    r alt.alt.comics.jack-chick ? [r]
    r alt.binaries.pictures.comics ? [r]
    r alt.binaries.pictures.comics.reposts ? [r]
    r alt.binaries.pictures.erotica.comics ? [r]
    r alt.comics.alan-moore Quis custodiet ipsos custodes. [r]
    r alt.comics.batman Marketing mania. [r]
    ...etc...

Once you know of other groups to download from, you can include them all at once in the same command, as a comma-separated list passed to -g. This next example shows how to download everything from two newsgroups, ignore messages that have no binary attachments, check for file duplicates, and, if file duplicates are found, set a header in the message cache so they are not checked again:

    % nget --text ignore -dfim -g alt.binaries.pictures.comics,alt.binaries.pictures.comics.reposts -r ".*"
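If the list of groups lives in a file or grows long, the comma-separated argument for -g can be assembled rather than typed. A minimal sketch, assuming one group name per line on standard input; `join_groups` and `group_list` are names made up here, and the final command is only printed, not run.

```shell
# join_groups: read group names (one per line) on stdin and print them
# as a single comma-separated string with no spaces.
join_groups() {
    paste -s -d, -
}

group_list=$(printf '%s\n' \
    alt.binaries.pictures.comics \
    alt.binaries.pictures.comics.reposts | join_groups)

# Print the command rather than running it, so nothing is downloaded.
echo nget --text ignore -dfim -g "$group_list" -r '".*"'
```

To read the names from a file instead, something like `join_groups < groups.txt` would do, where groups.txt is a hypothetical file you maintain yourself.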

nget supports a number of other options, including the ability to filter by author, date, number of lines in the actual message, and so forth. More information about nget and the .ngetrc configuration file can be found at http://nget.sourceforge.net/ or with man nget on your command line.



Spidering Hacks
ISBN: 0596005776
EAN: 2147483647
Year: 2005
Pages: 157
