Recipe 16.7. Merging and Splitting Tokens

Credit: Peter Cogolo

Problem

You need to tokenize an input language whose tokens are almost the same as Python's, with a few exceptions that require token merging and splitting.

Solution

The standard library module tokenize is very handy; we need only wrap it with a generator that post-processes its output, doing a little splitting and merging of tokens. The merging requires the ability to "peek ahead" in an iterator. We can get that ability by wrapping any iterator in a small dedicated iterator class:

    class peek_ahead(object):
        sentinel = object()
        def __init__(self, it):
            self._nit = iter(it).next
            self.preview = None
            self._step()
        def __iter__(self):
            return self
        def next(self):
            result = self._step()
            if result is self.sentinel:
                raise StopIteration
            else:
                return result
        def _step(self):
            result = self.preview
            try:
                self.preview = self._nit()
            except StopIteration:
                self.preview = self.sentinel
            return result

Armed with this tool, we can easily split and merge tokens. Say, for example, that by the rules of the language we're lexing we must consider each of ':=' and ':+' to be a single token, but a floating-point token that is a '.' with digits on both sides, such as '31.17', must be given as a sequence of three tokens ('31', '.', '17' in this case). Here's how (using Python 2.4 code, with comments on how to change it if you're stuck with version 2.3):

    import tokenize, cStringIO
    # in 2.3, also do 'from sets import Set as set'
    mergers = {':': set('=+'), }

    def tokens_of(x):
        it = peek_ahead(toktuple[1]
            for toktuple in tokenize.generate_tokens(cStringIO.StringIO(x).readline))
        # in 2.3, you need to add brackets [ ] around the arg to peek_ahead
        for tok in it:
            if it.preview in mergers.get(tok, ()):
                # merge with next token, as required
                yield tok + it.next()
            elif tok[:1].isdigit() and '.' in tok:
                # split if digits on BOTH sides of the '.'
                before, after = tok.split('.', 1)
                if after:
                    # both sides -> yield as 3 separate tokens
                    yield before
                    yield '.'
                    yield after
                else:
                    # nope -> yield as one token
                    yield tok
            else:
                # not a merge or split case, just yield the token
                yield tok

Discussion

Here's an example of the use of this recipe's code:

    >>> x = 'p{z:=23, w:+7}: m :+ 23.4'
    >>> print ' / '.join(tokens_of(x))
    p / { / z / := / 23 / , / w / :+ / 7 / } / : / m / :+ / 23 / . / 4 /

In this recipe, I yield tokens only as substrings of the string I'm lexing, rather than the whole tuples yielded by tokenize.generate_tokens, which include such items as the token's position within the overall string (by line and column). If your needs are more sophisticated than mine, you should simply peek_ahead on whole token tuples (while I'm simplifying things by picking up just the substring, item 1, of each token tuple, by passing peek_ahead a generator expression), and compute start and end positions appropriately when splitting or merging. For example, if you're merging two adjacent tokens, the overall token has the same start position as the first, and the same end position as the second, of the two tokens you're merging.

The peek_ahead iterator wrapper class can be useful in many kinds of lexing and parsing tasks, exactly because such tasks are well suited to operating on streams (which are well represented by iterators) but often require a level of peek-ahead and/or push-back ability. You can often get by with just one level; if you need more than one level, consider having your wrapper hold a container of peeked-ahead or pushed-back tokens. Python 2.4's collections.deque container implements a double-ended queue, which is particularly well suited to such tasks. For a more powerful look-ahead iterator wrapper, see Recipe 19.18.

See Also

Library Reference and Python in a Nutshell sections on the Python Standard Library modules tokenize and cStringIO; Recipe 19.18 for a more powerful look-ahead iterator wrapper.
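The recipe targets Python 2.3/2.4. On a current Python 3 interpreter the same merge-and-split technique needs only small adaptations: io.StringIO instead of cStringIO, __next__ instead of next, and the .string attribute of the TokenInfo named tuples that tokenize.generate_tokens now yields. The following is a minimal sketch of such a port (PeekAhead is just a renamed peek_ahead; the end-of-input NEWLINE and ENDMARKER tokens are filtered out here, a small departure from the recipe). Note that since Python 3.8 the walrus operator makes ':=' a single token natively, so only ':+' still exercises the merge path, but the printed result is the same:

```python
import io
import tokenize

class PeekAhead:
    """One-token-lookahead wrapper around any iterator (Python 3 port)."""
    _sentinel = object()

    def __init__(self, it):
        self._it = iter(it)
        self.preview = None
        self._step()

    def __iter__(self):
        return self

    def __next__(self):
        result = self._step()
        if result is self._sentinel:
            raise StopIteration
        return result

    def _step(self):
        # Return the current preview, advancing the underlying iterator.
        result = self.preview
        self.preview = next(self._it, self._sentinel)
        return result

mergers = {':': set('=+')}
_skip = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}

def tokens_of(x):
    it = PeekAhead(tok.string
                   for tok in tokenize.generate_tokens(io.StringIO(x).readline)
                   if tok.type not in _skip)
    for tok in it:
        if it.preview in mergers.get(tok, ()):
            yield tok + next(it)            # merge ':' with a following '=' or '+'
        elif tok[:1].isdigit() and '.' in tok:
            before, after = tok.split('.', 1)
            if after:                       # digits on BOTH sides: three tokens
                yield before
                yield '.'
                yield after
            else:
                yield tok
        else:
            yield tok
```

With this port, ' / '.join(tokens_of('p{z:=23, w:+7}: m :+ 23.4')) reproduces the Discussion's token stream, minus the trailing empty end-of-input token.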
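If you do peek ahead on whole token tuples rather than bare substrings, the Discussion's rule for positions (keep the start of the first token and the end of the second) can be captured in a small helper. Here is a sketch in Python 3, where generate_tokens yields TokenInfo named tuples; merge_tokens is a hypothetical name introduced here, not part of the recipe:

```python
import io
import tokenize

def merge_tokens(first, second):
    # The merged token takes the start position of the first token and
    # the end position of the second, as the Discussion describes.
    return tokenize.TokenInfo(
        type=tokenize.OP,
        string=first.string + second.string,
        start=first.start,
        end=second.end,
        line=first.line,
    )

# Example: merge the adjacent ':' and '+' of 'w:+7' into one ':+' token.
toks = list(tokenize.generate_tokens(io.StringIO('w:+7').readline))
colon, plus = toks[1], toks[2]
merged = merge_tokens(colon, plus)
```

The merged token here spans columns 1 through 3 of line 1, exactly the region the two source tokens covered together.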
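For the more-than-one-level case the Discussion mentions, a deque-backed wrapper gives arbitrary-depth peeking plus push-back. The following is only an illustrative sketch along the lines the Discussion suggests, written for Python 3 (the class and method names are made up here, and this is not the book's Recipe 19.18):

```python
from collections import deque

class LookAhead:
    """Iterator wrapper with arbitrary-depth peek and push-back,
    buffering pending items in a deque."""

    def __init__(self, it):
        self._it = iter(it)
        self._buffer = deque()

    def __iter__(self):
        return self

    def __next__(self):
        # Pushed-back / already-peeked items come first, in order.
        if self._buffer:
            return self._buffer.popleft()
        return next(self._it)

    def peek(self, n=0):
        """Return the item n positions ahead without consuming it
        (raises StopIteration if the stream is too short)."""
        while len(self._buffer) <= n:
            self._buffer.append(next(self._it))
        return self._buffer[n]

    def push_back(self, item):
        """Arrange for item to be the next one yielded."""
        self._buffer.appendleft(item)
```

Because deque appends and pops are O(1) at both ends, peeking deeply and pushing back cost no more than the one-level preview attribute of peek_ahead.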