Recipe 10.10. Blocking Duplicate Mails

Credit: Marina Pianu, Peter Cogolo

Problem

Many of the mails you receive are duplicates. You need to block the duplicates with a fast, simple filter before they reach a more time-consuming step, such as an anti-spam filter, in your email pipeline.

Solution

Many mail systems, such as the popular procmail and KDE's KMail, let you control your mail-reception pipeline. Specifically, you can insert your own filter programs into the pipeline; each filter receives a message on standard input, may modify it, and emits it again on standard output. Here is one such filter, with the specific purpose of performing the task described in the Problem: blocking messages that are duplicates of other messages that you have received recently:

#!/usr/bin/python
import time, sys, os, email
now = time.time()
# get archive of previously-seen message-ids and times
kde_dir = os.path.expanduser('~/.kde')
if not os.path.isdir(kde_dir):
    os.mkdir(kde_dir)
arfile = os.path.join(kde_dir, 'duplicate_mails')
duplicates = {}
try:
    archive = open(arfile)
except IOError:
    # no archive yet: start with an empty one
    pass
else:
    # each line holds a timestamp, a space, and a message-id
    for line in archive:
        when, msgid = line[:-1].split(' ', 1)
        duplicates[msgid] = float(when)
    archive.close()
redo_archive = False
# suck message in from stdin and study it
msg = email.message_from_file(sys.stdin)
msgid = msg['Message-ID']
if msgid:
    if msgid in duplicates:
        # duplicate message: alter its subject
        subject = msg['Subject']
        if subject is None:
            msg['Subject'] = '**** DUP **** ' + msgid
        else:
            # assigning a header appends it, so delete the old one first
            del msg['Subject']
            msg['Subject'] = '**** DUP **** ' + subject
    else:
        # non-duplicate message: redo the archive file
        redo_archive = True
        duplicates[msgid] = now
else:
    # invalid (missing message-id) message: alter its subject
    subject = msg['Subject']
    if subject is None:
        msg['Subject'] = '**** NID **** '
    else:
        del msg['Subject']
        msg['Subject'] = '**** NID **** ' + subject
# emit message back to stdout
print(msg)
if redo_archive:
    # redo archive file, keep only msgs from the last two hours
    keep_last = now - 2*60*60.0
    archive = open(arfile, 'w')
    for msgid, when in duplicates.items():
        if when > keep_last:
            archive.write('%9.2f %s\n' % (when, msgid))
    archive.close()

Discussion

Whether it is because of spammers' malice or incompetence, or because of hiccups at my ISP (Internet service provider), at times I get huge numbers of duplicate messages that can overload my mail-reception pipeline, particularly the antispam filters. Fortunately, like many other mail systems, KDE's KMail, the one I use, lets me insert my own filters in the mail-reception pipeline. In particular, I can diagnose duplicate messages, alter their headers (I use "Subject" for clarity), and tell later stages in the filters' pipeline to throw away messages with such subjects or to shunt them aside into a dedicated mailbox for later perusal, without passing them on to the antispam and other filters.
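As a minimal sketch of that tagging step in isolation, the following shows why the filter deletes "Subject" before reassigning it: in the email package, assigning to a header appends a new field rather than replacing the existing one. The helper names tag_subject and is_tagged are hypothetical, and is_tagged only illustrates the kind of test a later stage might apply; in KMail itself that test would more likely be a filter rule than Python code.

```python
import email

DUP_TAG = '**** DUP **** '   # same marker the recipe's filter uses


def tag_subject(msg, tag=DUP_TAG):
    """Prefix msg's Subject with tag, keeping exactly one Subject header."""
    subject = msg['Subject']
    if subject is None:
        # Simplification: the recipe falls back to the message ID here.
        subject = ''
    else:
        # Assignment appends a header, so remove the old one first.
        del msg['Subject']
    msg['Subject'] = tag + subject
    return msg


def is_tagged(msg, tag=DUP_TAG):
    """Hypothetical check a later stage could use to shunt duplicates aside."""
    return (msg['Subject'] or '').startswith(tag)


raw = 'Message-ID: <1@example.org>\nSubject: hello\n\nbody\n'
msg = tag_subject(email.message_from_string(raw))
print(msg['Subject'])                # the tagged subject
print(len(msg.get_all('Subject')))   # number of Subject headers
print(is_tagged(msg))
```

Skipping the del would leave two "Subject" headers in the emitted message, and which of the two a later stage sees would depend on that stage.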

The email package from the Python Standard Library performs all the required parsing of the message and lets me access headers with dictionary-like indexing syntax. I need some "memory" of recently seen messages. Fortunately, I have noticed that all duplicates arrive within a few minutes of each other, so I don't have to keep that memory for long: two hours are plenty. Therefore, I keep that memory in a simple text file, which records the time when each message was received and its message ID. I thought I might have to find a more advanced way to keep this kind of FIFO (first-in, first-out) archive, but I tried a simple approach first: a simple text file that is entirely rewritten whenever a new nonduplicate message arrives. This approach appears to perform quite adequately for my needs (at most a couple hundred messages an hour), even on my somewhat dated PC. "Do the simplest thing that could possibly work" strikes again!
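The rewrite-on-update archive described above can be sketched on its own as follows. The function names load_archive and save_archive and the max_age parameter are assumptions made for illustration; the file format (a timestamp, a space, then the message ID on each line) follows the recipe, and the demonstration writes to a temporary directory rather than to ~/.kde.

```python
import os
import tempfile
import time


def load_archive(path):
    """Return {msgid: timestamp}; a missing file means an empty archive."""
    seen = {}
    try:
        archive = open(path)
    except IOError:
        return seen
    with archive:
        for line in archive:
            when, msgid = line.rstrip('\n').split(' ', 1)
            seen[msgid] = float(when)
    return seen


def save_archive(path, seen, now, max_age=2 * 60 * 60):
    """Rewrite the whole file, dropping entries older than max_age seconds."""
    cutoff = now - max_age
    with open(path, 'w') as archive:
        for msgid, when in seen.items():
            if when > cutoff:
                # No field width, so a line never starts with a space.
                archive.write('%.2f %s\n' % (when, msgid))


# Round trip: one stale entry (three hours old) and one recent entry.
now = time.time()
path = os.path.join(tempfile.mkdtemp(), 'duplicate_mails')
seen = {'<old@example.org>': now - 3 * 60 * 60,
        '<new@example.org>': now - 60}
save_archive(path, seen, now)
print(sorted(load_archive(path)))   # only the recent message ID remains
```

Because pruning happens only when the file is rewritten, stale entries can linger on disk until the next nonduplicate message arrives; that is harmless here, since they are discarded at the next rewrite.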

See Also

Documentation about package email and modules time, sys and os in the Library Reference and Python in a Nutshell.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420
