Recipe 1.6. Combining Strings

Credit: Luther Blissett

Problem

You have several small strings that you need to combine into one larger string.

Solution

To join a sequence of small strings into one large string, use the string method join. Say that pieces is a list whose items are strings, and you want one big string with all the items concatenated in order; then, you should code:

largeString = ''.join(pieces)

To put together pieces stored in a few variables, the string-formatting operator % can often be even handier:

largeString = '%s%s something %s yet more' % (small1, small2, small3)
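As a minimal runnable sketch of the two idioms above, assuming Python 3 syntax for print and some made-up sample values (the contents of pieces, small1, small2, and small3 are purely illustrative):

```python
# Hypothetical sample data, chosen only for illustration.
pieces = ['spam', 'eggs', 'ham']
small1, small2, small3 = 'one', 'two', 'three'

# Join every item of a list, in order, with nothing in between.
largeString = ''.join(pieces)
print(largeString)   # spameggsham

# Interpolate a few separately named pieces into a template.
formatted = '%s%s something %s yet more' % (small1, small2, small3)
print(formatted)     # onetwo something three yet more
```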

Discussion

In Python, the + operator concatenates strings and therefore offers seemingly obvious solutions for putting small strings together into a larger one. For example, when you have pieces stored in a few variables, it seems quite natural to code something like:

largeString = small1 + small2 + ' something ' + small3 + ' yet more'

And similarly, when you have a sequence of small strings named pieces, it seems quite natural to code something like:

largeString = ''
for piece in pieces:
    largeString += piece

Or, equivalently, but more fancifully and compactly:

import operator
largeString = reduce(operator.add, pieces, '')

However, it's very important to realize that none of these seemingly obvious solutions is good: the approaches shown in the "Solution" are vastly superior.
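The loop, reduce, and join approaches differ in cost, not in result. The following sketch checks that they build the same string. It assumes Python 3, where reduce must be imported from functools, and uses made-up sample data:

```python
from functools import reduce  # on Python 3, reduce is no longer a built-in
import operator

pieces = ['a', 'bc', 'def']   # hypothetical sample data

# Approach 1: repeated += in a loop.
looped = ''
for piece in pieces:
    looped += piece

# Approach 2: reduce with operator.add.
reduced = reduce(operator.add, pieces, '')

# Approach 3: a single call to join.
joined = ''.join(pieces)

print(looped == reduced == joined)   # True
```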

In Python, string objects are immutable. Therefore, any operation on a string, including string concatenation, produces a new string object, rather than modifying an existing one. Concatenating N strings thus involves building and then immediately throwing away each of N-1 intermediate results. Performance is therefore vastly better for operations that build no intermediate results, but rather produce the desired end result at once.
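A small sketch of what immutability means in practice, assuming Python 3 syntax; the variable names are illustrative only:

```python
original = 'abc'
alias = original       # a second name bound to the same string object

original += 'def'      # builds a new string and rebinds the name `original`

print(original)        # abcdef
print(alias)           # abc (the old object was never modified)

# In-place modification is simply not supported.
rejected = False
try:
    alias[0] = 'X'
except TypeError:
    rejected = True
print(rejected)        # True
```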

Python's string-formatting operator % is one such operation, particularly suitable when you have a few pieces (e.g., each bound to a different variable) that you want to put together, perhaps with some constant text in addition. Performance is not a major issue for this specific kind of task. However, the % operator also has other potential advantages, when compared to an expression that uses multiple + operations on strings. % is more readable, once you get used to it. Also, you don't have to call str on pieces that aren't already strings (e.g., numbers), because the format specifier %s does so implicitly. Another advantage is that you can use format specifiers other than %s, so that, for example, you can control how many significant digits the string form of a floating-point number should display.
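To illustrate the points about implicit str calls and other format specifiers, here is a short sketch assuming Python 3 syntax and made-up values for name, count, and price:

```python
name, count, price = 'widget', 3, 2.5   # hypothetical values

# With +, every non-string piece needs an explicit str call.
with_plus = name + ': ' + str(count) + ' at ' + str(price)

# With %, the %s specifier applies str implicitly.
with_percent = '%s: %s at %s' % (name, count, price)
print(with_plus == with_percent)   # True

# Other specifiers control the numeric formatting.
fixed = '%s: %d at %.2f' % (name, count, price)
print(fixed)                       # widget: 3 at 2.50

# %.3g keeps three significant digits.
significant = '%.3g' % 3.14159
print(significant)                 # 3.14
```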

What Is "a Sequence?"

Python does not have a specific type called sequence, but sequence is still an often-used term in Python. Strictly speaking, a sequence is a container that can be iterated on, to get a finite number of items, one at a time, and that also supports indexing, slicing, and being passed to the built-in function len (which gives the number of items in a container). Python lists are the "sequences" you'll meet most often, but there are many others (strings, unicode objects, tuples, array.arrays, etc.).

Often, one does not need indexing, slicing, and len; the ability to iterate, one item at a time, suffices. In that case, one should speak of an iterable (or, to focus on the finite number of items issue, a bounded iterable). Iterables that are not sequences include dictionaries (iteration gives the keys of the dictionary, one at a time in arbitrary order), file objects (iteration gives the lines of the text file, one at a time), and many more, including iterators and generators. Any iterable can be used in a for loop statement and in many equivalent contexts (the for clause of a list comprehension or Python 2.4 generator expression, and also many built-ins such as min, max, zip, sum, str.join, etc.).

At http://www.python.org/moin/PythonGlossary, you can find a Python Glossary that can help you with these and several other terms. However, while the editors of this cookbook have tried to adhere to the word usage that the glossary describes, you will still find many places where this book says a sequence or an iterable or even a list, where, by strict terminology, one should always say a bounded iterable. For example, at the start of this recipe's Solution, we say "a sequence of small strings" where, in fact, any bounded iterable of strings suffices. The problem with using "bounded iterable" all over the place is that it would make this book read more like a mathematics textbook than a practical programming book! So, we have deviated from terminological rigor where readability, and maintaining in the book a variety of "voices", were better served by slightly imprecise terminology that is nevertheless entirely clear in context.
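The following sketch shows join accepting a bounded iterable that is not a sequence. It assumes Python 3 syntax; the generator function shout is a hypothetical helper written only for this illustration:

```python
def shout(words):
    """Yield each word upper-cased; generators have no len or indexing."""
    for word in words:
        yield word.upper()

gen = shout(['a', 'b', 'c'])

# A generator is iterable but is not a sequence: len fails on it.
try:
    len(gen)
    has_len = True
except TypeError:
    has_len = False
print(has_len)           # False

# join only needs to iterate, so the generator is good enough.
joined = '-'.join(gen)
print(joined)            # A-B-C

# A tuple, which is a true sequence, works just as well.
from_tuple = ''.join(('x', 'y'))
print(from_tuple)        # xy
```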


When you have many small string pieces in a sequence, performance can become a truly important issue. The time needed to execute a loop using + or += (or a fancier but equivalent approach using the built-in function reduce) grows with the square of the number of characters you are accumulating, since the time to allocate and fill a large string is roughly proportional to the length of that string. Fortunately, Python offers an excellent alternative. The join method of a string object s takes as its only argument a sequence of strings and produces a string result obtained by concatenating all items in the sequence, with a copy of s joining each item to its neighbors. For example, ''.join(pieces) concatenates all the items of pieces in a single gulp, without interposing anything between them, and ', '.join(pieces) concatenates the items putting a comma and a space between each pair of them. It's the fastest, neatest, and most elegant and readable way to put a large string together.
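The quadratic growth can be made concrete with a simple cost model rather than a timing run. The sketch below assumes Python 3 syntax and counts how many characters would be written if every += built a fresh string, compared with building the result once. Both function names are hypothetical, and the model deliberately ignores any optimization a particular interpreter may apply:

```python
def chars_written_by_plus(pieces):
    """Count characters written if every += builds a brand-new string."""
    total = 0
    length = 0
    for piece in pieces:
        length += len(piece)   # size of the new intermediate string
        total += length        # all of it has to be filled in again
    return total

def chars_written_by_join(pieces):
    """Count characters written if the result is built exactly once."""
    return sum(len(piece) for piece in pieces)

for n in (10, 100, 1000):
    pieces = ['x'] * n         # n one-character pieces
    print(n, chars_written_by_plus(pieces), chars_written_by_join(pieces))
```

Under this model, n one-character pieces cost n(n+1)/2 character writes with += (55, 5050, and 500500 for the three sizes above), but only n writes when the string is built once.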

When the pieces are not all available at the same time, but rather come in sequentially from input or computation, use a list as an intermediate data structure to hold the pieces (to add items at the end of a list, you can call the append or extend methods of the list). At the end, when the list of pieces is complete, call ''.join(thelist) to obtain the big string that's the concatenation of all pieces. Of all the many handy tips and tricks I could give you about Python strings, I consider this one by far the most significant: the most frequent reason some Python programs are too slow is that they build up big strings with + or +=. So, train yourself never to do that. Use, instead, the ''.join approach recommended in this recipe.
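A minimal sketch of the accumulate-then-join pattern, assuming Python 3 syntax; the function squares_report and its output format are invented for illustration:

```python
def squares_report(limit):
    """Build one string from pieces that are computed one at a time."""
    thelist = []
    for n in range(limit):
        # append adds a single piece at the end of the list
        thelist.append('%d squared is %d\n' % (n, n * n))
    # extend adds several pieces at once
    thelist.extend(['done', '\n'])
    # one join at the very end builds the big string
    return ''.join(thelist)

report = squares_report(3)
print(report)
```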

Python 2.4 makes a heroic attempt to ameliorate the issue, slightly reducing the performance penalty due to such erroneous use of +=. While ''.join is still way faster and in all ways preferable, at least some newbie or careless programmer gets to waste somewhat fewer machine cycles. Similarly, psyco (a specializing just-in-time [JIT] Python compiler found at http://psyco.sourceforge.net/) can reduce the += penalty even further. Nevertheless, ''.join remains the best approach in all cases.

See Also

The Library Reference and Python in a Nutshell sections on string methods, string-formatting operations, and the operator module.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420
