Recipe 13.9. Fixing Messages Parsed by Python 2.4 email.FeedParser

Credit: Matthew Cowles

Problem

You're using Python 2.4's new email.FeedParser module, but when it parses badly malformed incoming messages, it sometimes produces message objects that are internally inconsistent. For example, a message may carry a content-type header that says the message is multipart while the body is not. You need to fix those inconsistencies.

Solution

Python 2.4's new standard library module email.FeedParser is very useful, but a little post-processing on the messages it returns can heuristically fix some inconsistencies and make it even better. Here's a module containing a class and a few functions to help with this task:

import email, email.FeedParser
import re, sys, sgmllib
# what chars are non-Ascii, what max fraction of them can be in a text part
kGuessBinaryThreshold = 0.2
kGuessBinaryRE = re.compile("[\\0000-\\0025\\0200-\\0377]")
# what max fraction of HTML tags can be in a text (non-HTML) part
kGuessHTMLThreshold = 0.05
class Cleaner(sgmllib.SGMLParser):
    entitydefs = {"nbsp": " "}  # I'll break if I want to
    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.result = []
    def do_p(self, *junk):
        self.result.append('\n')
    def do_br(self, *junk):
        self.result.append('\n')
    def handle_data(self, data):
        self.result.append(data)
    def cleaned_text(self):
        return ''.join(self.result)
def stripHTML(text):
    ''' return text, with HTML tags stripped '''
    c = Cleaner()
    try:
        c.feed(text)
    except sgmllib.SGMLParseError:
        return text
    else:
        return c.cleaned_text()
def guessIsBinary(text):
    ''' return whether we can heuristically guess 'text' is binary '''
    if not text: return False
    nMatches = float(len(kGuessBinaryRE.findall(text)))
    return nMatches/len(text) >= kGuessBinaryThreshold
def guessIsHTML(text):
    ''' return whether we can heuristically guess 'text' is HTML '''
    if not text: return False
    lt = len(text)
    textWithoutTags = stripHTML(text)
    tagsChars = float(lt-len(textWithoutTags))
    return tagsChars/lt >= kGuessHTMLThreshold
def getMungedMessage(openFile):
    openFile.seek(0)
    p = email.FeedParser.FeedParser()
    p.feed(openFile.read())
    m = p.close()
    # Fix up multipart content-type when message isn't multi-part
    if m.get_main_type()=="multipart" and not m.is_multipart():
        t = m.get_payload(decode=1)
        if guessIsBinary(t):
            # Use generic "opaque" type
            m.set_type("application/octet-stream")
        elif guessIsHTML(t):
            m.set_type("text/html")
        else:
            m.set_type("text/plain")
    return m

Discussion

FeedParser is a new module in the Python 2.4 Standard Library's email package. The module's name comes from the fact that it maintains a buffer, so you don't have to give it all the text at once: you call the parser's feed method as many times as you like and then call close to get the parsed message. Possibly more interesting is that the module doesn't raise an error when called on malformed messages; instead, it tries to make some sense of them and return a useful email.Message object. That's useful because so much mail is spam and so much spam is malformed.
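The following sketch illustrates both properties. It assumes a current Python 3 interpreter, where the module is spelled email.feedparser rather than email.FeedParser; the addresses and text are made up for the example.

```python
# Python 3 spelling of the module; Python 2.4 called it email.FeedParser.
from email.feedparser import FeedParser

# A message arriving in arbitrary pieces, split mid-header and mid-line.
raw_chunks = [
    "From: alice@example.com\nSub",
    "ject: hello\n\nFirst line of ",
    "the body.\n",
]

parser = FeedParser()
for chunk in raw_chunks:
    parser.feed(chunk)      # the parser buffers incomplete lines itself
msg = parser.close()        # close() returns the finished message object

# Input that is not a valid message at all: no exception is raised,
# the parser simply treats everything as body text.
junk_parser = FeedParser()
junk_parser.feed("this is not a header line\nmore junk\n")
junk = junk_parser.close()

print(msg["Subject"])
print(repr(junk.get_payload()))
```

The chunk boundaries are deliberately awkward to show that callers need not align their reads with line or header boundaries.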

The other side of the coin, given that the heroic feed parser works on incorrect messages, is that you can get back from it an email.Message object that's internally inconsistent. This recipe tries to make sense of one kind of inconsistency: a message whose content-type header says that the message is multipart, but whose body isn't (the main type is "multipart", yet is_multipart returns false).
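A minimal way to reproduce that inconsistency, again assuming Python 3's email.feedparser and an invented sample message, might look like this. It uses get_content_maintype, the modern counterpart of the get_main_type call in the recipe.

```python
from email.feedparser import FeedParser

# Invented sample: the header claims multipart but gives no boundary,
# and the body is a single run of plain text.
raw = (
    "From: spammer@example.com\n"
    "Content-Type: multipart/mixed\n"
    "\n"
    "Just one plain body, no parts at all.\n"
)

parser = FeedParser()
parser.feed(raw)
msg = parser.close()

claims_multipart = msg.get_content_maintype() == "multipart"
really_multipart = msg.is_multipart()   # True only if the payload is a list of parts
is_inconsistent = claims_multipart and not really_multipart

print(is_inconsistent)
# The parser also notes the problems it tolerated on the message itself.
print([type(d).__name__ for d in msg.defects])
```

The same test, header says multipart but the payload is not a list of parts, is what getMungedMessage checks before it starts guessing a replacement type.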

The heuristics that the recipe uses to guess at the correct content-type are inevitably messy. Still, better to have such messy heuristics in recipes, rather than embedded forever in the Python Standard Library.
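To see how rough these guesses are, here is a sketch of the same two ratios for Python 3. It is an adaptation, not the recipe's code: html.parser stands in for sgmllib (which Python 3 no longer ships), the character class SUSPICIOUS is an assumption of this sketch rather than the recipe's regular expression, and the functions work on already-decoded text instead of the result of get_payload(decode=1). All names below are hypothetical.

```python
import email
import re
from html.parser import HTMLParser  # stand-in for Python 2's sgmllib

BINARY_THRESHOLD = 0.2   # same values as the recipe's constants
HTML_THRESHOLD = 0.05
# Assumption of this sketch: anything that is not tab, LF, CR,
# or printable ASCII counts as a "binary-looking" character.
SUSPICIOUS = re.compile(r"[^\t\n\r\x20-\x7e]")

class TextOnly(HTMLParser):
    """Collect character data and drop the tags."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def strip_html(text):
    p = TextOnly()
    p.feed(text)
    p.close()
    return "".join(p.chunks)

def guess_is_binary(text):
    """True if the fraction of suspicious characters reaches the threshold."""
    if not text:
        return False
    return len(SUSPICIOUS.findall(text)) / len(text) >= BINARY_THRESHOLD

def guess_is_html(text):
    """True if markup accounts for at least HTML_THRESHOLD of the text."""
    if not text:
        return False
    tag_chars = len(text) - len(strip_html(text))
    return tag_chars / len(text) >= HTML_THRESHOLD

def guess_content_type(text):
    if guess_is_binary(text):
        return "application/octet-stream"
    if guess_is_html(text):
        return "text/html"
    return "text/plain"

def fix_multipart_claim(msg):
    """Replace a false multipart claim with a guessed content type."""
    if msg.get_content_maintype() == "multipart" and not msg.is_multipart():
        msg.set_type(guess_content_type(msg.get_payload()))
    return msg

sample = email.message_from_string(
    "Content-Type: multipart/mixed\n\nplain words only\n")
print(fix_multipart_claim(sample).get_content_type())
```

Both thresholds are plain character-count ratios, so a short message with one stray tag, or a text part with a few unusual characters, can land on the wrong side of either test. That is the kind of messiness the paragraph above has in mind.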
