12.8 Converting Ad-Hoc Text into XML Markup


Credit: Luther Blissett

12.8.1 Problem

You have plain text that follows certain common conventions (e.g., paragraphs are separated by empty lines, text meant to be highlighted is marked up _like this_), and you need to mark it up automatically as XML.

12.8.2 Solution

Producing XML markup from data that is otherwise structured, including plain text that rigorously follows certain conventions, is really quite easy:

def markup(text, paragraph_tag='paragraph', inline_tags={'_':'highlight'}):     # First we must escape special characters, which have special meaning in XML     text = text.replace('&', "&amp;")\                .replace('<', "&lt;")\                .replace(''', "&quot;")\                .replace('>', "&gt;")     # paragraph markup; pass any false value as the paragraph_tag argument to disable     if paragraph_tag:         # Get list of lines, removing leading and trailing empty lines:         lines = text.splitlines(1)         while lines and lines[-1].isspace(): lines.pop(  )         while lines and lines[0].isspace(  ): lines.pop(0)         # Insert paragraph tags on empty lines:         marker = '</%s>\n\n<%s>' % (paragraph_tag, paragraph_tag)         for i in range(len(lines)):             if lines[i].isspace(  ):                 lines[i] = marker                 # remove 'empty paragraphs':                 if i!=0 and lines[i-1] == marker:                     lines[i-1] = ''         # Join text again         lines.insert(0, '<%s>'%paragraph_tag)         lines.append('</%s>\n'%paragraph_tag)         text = ''.join(lines)     # inline-tags markup; pass any false value as the inline_tags argument to disable     if inline_tags:         for ch, tag in inline_tags.items(  ):             pieces = text.split(ch)             # Text should have an even count of ch, so pieces should have             # odd length. But just in case there's an extra unmatched ch:             if len(pieces)%2 == 0: pieces.append('')             for i in range(1, len(pieces), 2):                 pieces[i] = '<%s>%s</%s>'%(tag, pieces[i], tag)             # Join text again             text = ''.join(pieces)     return text if _ _name_ _ == '_ _main_ _':     sample = """ Just some _important_ text, with inlike "_markup_" by convention. Oh, and paragraphs separated by empty (or all-whitespace) lines. Sometimes more than one, wantonly. I've got _lots_ of old text like that around -- don't you? """     print markup(sample)

12.8.3 Discussion

Sometimes you have a lot of plain text that needs to be automatically marked up in a structured way usually, these days, as XML. If the plain text you start with follows certain typical conventions, you can use them heuristically to get each text snippet into marked-up form with reasonably little effort.

In my case, the two conventions I had to work with were: paragraphs are separated by blank lines (empty or with some spaces, and sometimes several redundant blank lines for just one paragraph separation), and underlines (one before, one after) are used to indicate important text that should be highlighted. This seems to be quite far from the brave new world of:

<paragraph>blah blah</paragraph>

But in reality, it isn't as far as all that, thanks, of course, to Python! While you could use regular expressions for this task, I prefer the simplicity of the split and splitlines methods, with join to put the strings back together again.

12.8.4 See Also

StructuredText, the latest incarnation of which, ReStructuredText, is part of the docutils package (http://docutils.sourceforge.net/).



Python Cookbook
Python Cookbook
ISBN: 0596007973
EAN: 2147483647
Year: 2005
Pages: 346

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net