Recipe 16.7. Merging and Splitting Tokens

Credit: Peter Cogolo

Problem

You need to tokenize an input language whose tokens are almost the same as Python's, with a few exceptions that require token merging and splitting.

Solution

The standard library module tokenize is very handy; we need only wrap it with a generator that post-processes its output, doing a little splitting and merging of tokens. The merging requires the ability to "peek ahead" in an iterator. We can get that ability by wrapping any iterator in a small dedicated iterator class:

    class peek_ahead(object):
        sentinel = object()
        def __init__(self, it):
            self._nit = iter(it).next
            self.preview = None
            self._step()
        def __iter__(self):
            return self
        def next(self):
            result = self._step()
            if result is self.sentinel:
                raise StopIteration
            else:
                return result
        def _step(self):
            result = self.preview
            try:
                self.preview = self._nit()
            except StopIteration:
                self.preview = self.sentinel
            return result

Armed with this tool, we can easily split and merge tokens. Say, for example, that by the rules of the language we're lexing we must consider each of ':=' and ':+' to be a single token, but a floating-point token that is a '.' with digits on both sides, such as '31.17', must be given as a sequence of three tokens ('31', '.', '17' in this case). Here's how (using Python 2.4 code, with comments on how to change it if you're stuck with version 2.3):

    import tokenize, cStringIO
    # in 2.3, also do 'from sets import Set as set'
    mergers = {':': set('=+'), }

    def tokens_of(x):
        it = peek_ahead(toktuple[1]
            for toktuple in tokenize.generate_tokens(cStringIO.StringIO(x).readline))
        # in 2.3, you need to add brackets [ ] around the arg to peek_ahead
        for tok in it:
            if it.preview in mergers.get(tok, ()):
                # merge with next token, as required
                yield tok + it.next()
            elif tok[:1].isdigit() and '.' in tok:
                # split if digits on BOTH sides of the '.'
                before, after = tok.split('.', 1)
                if after:
                    # both sides -> yield as 3 separate tokens
                    yield before
                    yield '.'
                    yield after
                else:
                    # nope -> yield as one token
                    yield tok
            else:
                # not a merge or split case, just yield the token
                yield tok

Discussion

Here's an example of the use of this recipe's code:

    >>> x = 'p{z:=23, w:+7}: m :+ 23.4'
    >>> print ' / '.join(tokens_of(x))
    p / { / z / := / 23 / , / w / :+ / 7 / } / : / m / :+ / 23 / . / 4 /

In this recipe, I yield tokens only as substrings of the string I'm lexing, rather than the whole tuples yielded by tokenize.generate_tokens, which include such items as the token's position within the overall string (by line and column). If your needs are more sophisticated than mine, you should simply peek_ahead on whole token tuples (while I'm simplifying things by picking up just the substring, item 1, of each token tuple, by passing peek_ahead a generator expression), and compute start and end positions appropriately when splitting or merging. For example, if you're merging two adjacent tokens, the overall token has the same start position as the first, and the same end position as the second, of the two tokens you're merging.

The peek_ahead iterator wrapper class can be useful in many kinds of lexing and parsing tasks, exactly because such tasks are well suited to operating on streams (which are well represented by iterators) but often require a level of peek-ahead and/or push-back ability. You can often get by with just one level; if you need more than one level, consider having your wrapper hold a container of peeked-ahead or pushed-back tokens. Python 2.4's collections.deque container implements a double-ended queue, which is particularly well suited to such tasks. For a more powerful look-ahead iterator wrapper, see Recipe 19.18.

See Also

Library Reference and Python in a Nutshell sections on the Python Standard Library modules tokenize and cStringIO; Recipe 19.18 for a more powerful look-ahead iterator wrapper.
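The recipe targets Python 2.3/2.4. On a current Python 3 interpreter the same merge-and-split technique needs only small adaptations: io.StringIO instead of cStringIO, __next__ instead of next, and the .string attribute of the TokenInfo named tuples that tokenize.generate_tokens now yields. The following is a minimal sketch of such a port (PeekAhead is just a renamed peek_ahead; the end-of-input NEWLINE and ENDMARKER tokens are filtered out here, a small departure from the recipe). Note that since Python 3.8 the walrus operator makes ':=' a single token natively, so only ':+' still exercises the merge path, but the printed result is the same:

```python
import io
import tokenize

class PeekAhead:
    """One-token-lookahead wrapper around any iterator (Python 3 port)."""
    _sentinel = object()

    def __init__(self, it):
        self._it = iter(it)
        self.preview = None
        self._step()

    def __iter__(self):
        return self

    def __next__(self):
        result = self._step()
        if result is self._sentinel:
            raise StopIteration
        return result

    def _step(self):
        # Return the current preview, advancing the underlying iterator.
        result = self.preview
        self.preview = next(self._it, self._sentinel)
        return result

mergers = {':': set('=+')}
_skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}

def tokens_of(x):
    it = PeekAhead(tok.string
                   for tok in tokenize.generate_tokens(io.StringIO(x).readline)
                   if tok.type not in _skip)
    for tok in it:
        if it.preview in mergers.get(tok, ()):
            yield tok + next(it)            # merge ':' with a following '=' or '+'
        elif tok[:1].isdigit() and '.' in tok:
            before, after = tok.split('.', 1)
            if after:                       # digits on BOTH sides: three tokens
                yield before
                yield '.'
                yield after
            else:
                yield tok
        else:
            yield tok
```

With this port, ' / '.join(tokens_of('p{z:=23, w:+7}: m :+ 23.4')) reproduces the Discussion's token stream, minus the trailing empty end-of-input token.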
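If you do peek ahead on whole token tuples rather than bare substrings, the Discussion's rule for positions (keep the start of the first token and the end of the second) can be captured in a small helper. Here is a sketch in Python 3, where generate_tokens yields TokenInfo named tuples; merge_tokens is a hypothetical name introduced here, not part of the recipe:

```python
import io
import tokenize

def merge_tokens(first, second):
    # The merged token takes the start position of the first token and
    # the end position of the second, as the Discussion describes.
    return tokenize.TokenInfo(
        type=tokenize.OP,
        string=first.string + second.string,
        start=first.start,
        end=second.end,
        line=first.line,
    )

# Example: merge the adjacent ':' and '+' of 'w:+7' into one ':+' token.
toks = list(tokenize.generate_tokens(io.StringIO('w:+7').readline))
colon, plus = toks[1], toks[2]
merged = merge_tokens(colon, plus)
```

The merged token here spans columns 1 through 3 of line 1, exactly the region the two source tokens covered together.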
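For the more-than-one-level case the Discussion mentions, a deque-backed wrapper gives arbitrary-depth peeking plus push-back. The following is only an illustrative sketch along the lines the Discussion suggests, written for Python 3 (the class and method names are made up here, and this is not the book's Recipe 19.18):

```python
from collections import deque

class LookAhead:
    """Iterator wrapper with arbitrary-depth peek and push-back,
    buffering pending items in a deque."""

    def __init__(self, it):
        self._it = iter(it)
        self._buffer = deque()

    def __iter__(self):
        return self

    def __next__(self):
        # Pushed-back / already-peeked items come first, in order.
        if self._buffer:
            return self._buffer.popleft()
        return next(self._it)

    def peek(self, n=0):
        """Return the item n positions ahead without consuming it
        (raises StopIteration if the stream is too short)."""
        while len(self._buffer) <= n:
            self._buffer.append(next(self._it))
        return self._buffer[n]

    def push_back(self, item):
        """Arrange for item to be the next one yielded."""
        self._buffer.appendleft(item)
```

Because deque appends and pops are O(1) at both ends, peeking deeply and pushing back cost no more than the one-level preview attribute of peek_ahead.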