Recipe 1.15. Expanding and Compressing Tabs

Credit: Alex Martelli, David Ascher


You want to convert tabs in a string to the appropriate number of spaces, or vice versa.


Changing tabs to the appropriate number of spaces is a reasonably frequent task, easily accomplished with Python strings' expandtabs method. Because strings are immutable, the method returns a new string object, a modified copy of the original one. However, it's easy to rebind a string variable name from the original to the modified-copy value:

mystring = mystring.expandtabs()

This doesn't change the string object to which mystring originally referred, but it does rebind the name mystring to a newly created string object, a modified copy of mystring in which tabs are expanded into runs of spaces. expandtabs, by default, uses a tab length of 8; you can pass expandtabs an integer argument to use as the tab length.
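As a small illustration of the default and explicit tab lengths (the sample string and variable names here are made up for the example), note that each tab advances to the next tab stop rather than becoming a fixed number of spaces:

```python
line = "a\tbc\tdef"

# Default: tab stops every 8 columns.
default_expanded = line.expandtabs()

# Explicit tab length: tab stops every 4 columns.
narrow_expanded = line.expandtabs(4)

print(repr(default_expanded))
print(repr(narrow_expanded))

# The original string object is untouched; only new strings were created.
print(repr(line))
```

In the first result the run after "a" is seven spaces and the run after "bc" is six, because both tabs pad out to a multiple of 8 columns.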

Changing spaces into tabs is a rare and peculiar need. Compression, if that's what you're after, is far better performed in other ways, so Python doesn't offer a built-in way to "unexpand" spaces into tabs. We can, of course, write our own function for the purpose. String processing tends to be fastest in a split/process/rejoin approach, rather than with repeated overall string transformations:

import re

def unexpand(astring, tablen=8):
    # split into alternating non-space and space sequences
    pieces = re.split(r'( +)', astring.expandtabs(tablen))
    # keep track of the total length of the string so far
    lensofar = 0
    for i, piece in enumerate(pieces):
        thislen = len(piece)
        lensofar += thislen
        # only runs of spaces are rewritten (a lone '\n' is left alone)
        if piece.startswith(' '):
            # change each space sequence into tabs+spaces; the trailing
            # blanks can never outnumber the spaces in the run itself
            numblanks = min(lensofar % tablen, thislen)
            # integer division, since we need a whole number of tabs
            numtabs = (thislen - numblanks + tablen - 1) // tablen
            pieces[i] = '\t'*numtabs + ' '*numblanks
    return ''.join(pieces)

Function unexpand, as written in this example, works only for a single-line string; to deal with a multi-line string, use ''.join([ unexpand(s) for s in astring.splitlines(True) ]).


While regular expressions are never indispensable for the purpose of manipulating strings in Python, they are occasionally quite handy. Function unexpand, as presented in the recipe, for example, takes advantage of one extra feature of re.split with respect to string's split method: when the regular expression contains a (parenthesized) group, re.split returns a list where the split pieces are interleaved with the "splitter" pieces. So, here, we get alternate runs of nonblanks and blanks as items of list pieces; the for loop keeps track of the length of string it has seen so far, and changes pieces that are made of blanks to as many tabs as possible, plus as many blanks as are needed to maintain the overall length.
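The difference that a parenthesized group makes to re.split can be seen in a short sketch (the sample text and variable names are illustrative only):

```python
import re

text = "ab   cd  e"

# Without a group, the separators (runs of spaces) are discarded.
without_group = re.split(r' +', text)

# With a capturing group, each separator is kept as its own list item,
# so non-blank and blank pieces alternate.
with_group = re.split(r'( +)', text)

print(without_group)
print(with_group)

# Because nothing is lost, joining the pieces rebuilds the original string.
print(''.join(with_group) == text)
```

Keeping the separators is what lets a split/process/rejoin function rewrite only the blank pieces and still reassemble the whole line.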

Some programming tasks that could still be described as expanding tabs are unfortunately not quite as easy as just calling the expandtabs method. One that comes up with some regularity is fixing Python source files that use a mix of tabs and spaces for indentation (a very bad idea), so that they instead use spaces only (which is the best approach). This could entail extra complications, for example, when you need to guess the tab length (and want to end up with the standard four spaces per indentation level, which is strongly advisable). Complications also arise when you need to preserve tabs that are inside strings, as opposed to tabs used for indentation (because somebody erroneously used actual tabs, rather than '\t', to indicate tabs in strings), or when you're asked to treat docstrings differently from other strings. Some cases are not too bad: for example, when you want to expand tabs that occur only within runs of whitespace at the start of each line, leaving any other tab alone. A little function using a regular expression suffices:

import re

def expand_at_linestart(P, tablen=8):
    def exp(mo):
        # expand tabs only within the matched leading whitespace
        return mo.group(0).expandtabs(tablen)
    return ''.join([ re.sub(r'^\s+', exp, s) for s in P.splitlines(True) ])

This function expand_at_linestart exploits the re.sub function, which looks for a regular expression in a string and, each time it gets a match, calls a function, passing the match object as the argument, to obtain the string to substitute in place of the match. For convenience, expand_at_linestart is coded to deal with a multiline string argument P, performing the list comprehension over the results of the splitlines call, and the ''.join of the whole. Of course, this convenience does not stop the function from being able to deal with a single-line P.
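The callback form of re.sub can be tried on a single line in isolation. This is only a sketch: the helper name expand_match, the sample line, and the tab length of 4 are assumptions made for the example.

```python
import re

def expand_match(mo):
    # mo.group(0) is the full text matched by the pattern, here the
    # leading whitespace; only that text gets its tabs expanded.
    return mo.group(0).expandtabs(4)

line = "\t  x\ty"

# The pattern is anchored at the start, so it can match at most once
# per line, and the tab between x and y is never seen by the callback.
result = re.sub(r'^\s+', expand_match, line)

print(repr(result))
```

Expanding just the matched text gives the right columns here only because the match starts at column 0; a run of whitespace in the middle of a line would need its starting column taken into account.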

If your specifications regarding which tabs are to be expanded are even more complex, such as needing to deal differently with tabs depending on whether they're inside or outside of strings, and on whether or not strings are docstrings, at the very least, you need to perform a tokenization. In addition, you may also have to perform a full parse of the source code you're dealing with, rather than using simple string or regular-expression operations. If this is the case, you can expect a substantial amount of work. Some beginning pointers to help you get started may be found in Chapter 16.
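As a hedged starting point for the tokenization step, the standard library's tokenize module can report which string tokens contain actual tab characters. The function name strings_with_literal_tabs and the sample source are hypothetical, and the sketch only locates such strings; it does not rewrite anything. It also assumes plain string literals, since f-strings may be reported under different token types on newer Python versions.

```python
import io
import tokenize

def strings_with_literal_tabs(source):
    """Yield (line_number, token_text) for string tokens holding a real tab."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        # tok.string is the literal exactly as written in the source, so an
        # escape such as \t appears as backslash + 't', not as a tab.
        if tok.type == tokenize.STRING and '\t' in tok.string:
            yield tok.start[0], tok.string

# Line 1 holds an actual tab inside the literal; line 2 uses the escape.
sample = 'x = "a\tb"\ny = "c\\td"\n'

found = list(strings_with_literal_tabs(sample))
print(found)
```

Because the tokenizer distinguishes string tokens from indentation, a fuller tool could expand tabs in the indentation while leaving, or separately reporting, the strings found this way.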

If you ever find yourself sweating out this kind of task, you will no doubt get excellent motivation in the future for following the normal and recommended Python style in the source code you write or edit: only spaces, four per indentation level, no tabs, and always '\t', never an actual tab character, to include a tab in a string literal. Your favorite editor can no doubt be told to enforce all of these conventions whenever a Python source file is saved; the editor that comes with IDLE (the free integrated development environment that comes with Python), for example, supports these conventions. It is much easier to arrange your editor so that the problem never arises, rather than striving to fix it after the fact!

See Also

Documentation for the expandtabs method of strings in the "Sequence Types" section of the Library Reference; Perl Cookbook recipe 1.7; Library Reference and Python in a Nutshell documentation of module re.

