Introduction


Credit: Tim Peters, PythonLabs

Algorithm research is what drew me to Pythonand I fell in love. It wasn't love at first sight, but it was an attraction that grew into infatuation, which grew steadily into love. And that love shows no signs of fading. Why? I've worked in fields pushing the state of the art, and, in a paradoxical nutshell, Python code is easy to throw away!

When you're trying to solve a problem that may not have been solved before, you may have some intuitions about how to proceed, but you rarely know in advance exactly what needs to be done. The only way to proceed is to try things, many things, everything you can think of, just to see what happens. Python makes such exploration easier by minimizing the time and pain from conception to code: if your colleagues are using, for example, C or Java, it's not unusual for you to try (and discard) six different approaches in Python while they're still getting the bugs out of their first attempt.

In addition, you will have naturally grown classes and modules that capture key parts of the problem domain, simply because you find the need to keep reinventing them when starting over from scratch. I've used many languages in my computer career, and I know of none more productive than Python for prototyping. Best of all, while being an expert is often helpful, moderate skill in Python is much easier to obtain than for many other languages, yet much more productive for research and prototyping than merely moderate skill in any other language I've used. You don't have to be an expert to start!

So if you're in the research businessand every programmer who doesn't know everything occasionally isyou've got a nearly perfect language in Python. How then do you develop the intuitions that can generate a myriad of plausible approaches to try? Experience is the final answer, as we all get better at what we do often, but studying the myriad approaches other people have tried develops a firm base from which to explore. Toward that end, here are the most inspiring algorithm books I've read. They'll teach you possibilities you may never have discovered on your own:


John Bentley, Programming Pearls and More Programming Pearls (Addison-Wesley)

Every programmer should read these books from cover to cover for sheer joy. The chapters are extended versions of a popular column Bentley wrote for the Communications of the Association for Computing Machinery (CACM). Each chapter is generally self-contained, covering one or two lovely (and often surprising, in the "Aha! why didn't I think of that?!" sense) techniques of real practical value.


Robert Sedgewick, Algorithms in C++ or Algorithms in C (Addison-Wesley)

These books cover the most important general algorithms, organized by problem domain, and provide brief but cogent explanations, along with working code. The books cover the same material; the difference is in which computer language is used for the code. I recommend the C++ book for Python programmers, because idiomatic Python is closer to C++ than to C. Sedgewick's use of C++ is generally simple and easily translated to equivalent Python. This is the first book to reach for when you need to tackle a new area quickly.


Donald Knuth, The Art of Computer Programming, series (Addison-Wesley)

For experts (and those who aspire to expertise), this massive series in progress is the finest in-depth exposition of the state of the art. Nothing compares to its unique combination of breadth and depth, rigor, and historical perspective. Note that these books aren't meant to be read, they have to be actively studied, and many valuable insights are scattered in answers to the extensive exercises. While the books include detailed analysis, there's virtually no working code, except for programs written in assembly language for a hypothetical machine of archaic design (yes, it can be maddeningly obscure). It can be hard going at times, but few books so richly reward time invested.

To hone your skills, you can practice on an endless source of problems from the wonderful On-Line Encyclopedia of Integer Sequences, at http://www.research.att.com/~njas/sequences/Seis.html. When stress-testing upcoming Python releases, I sometimes pick a sequence at random from its list of sequences needing more terms and write a program to attempt an extension the sequence. Sometimes I'm able to extend a sequence that hasn't been augmented in years, in large part because Python has so many powerful features for rapid construction of new algorithms. Then the new terms are contributed to the database, where they benefit others. Give it a try! You may love it, but even if you hate it, you'll certainly find it challenging.

Timing and timeit.py

The first edition of this book contained a lengthy discussion of the difficulties in timing alternative approaches. Such difficulties include the fact that the resolution of time.time varies across platforms, and time.clock measures different things on different platforms (e.g., process CPU time on Linux systems, wall-clock time on Windows).

It may still be important for some to learn all those details, but Python 2.3 introduced a new timeit module, which captures best practice and is perfectly suited to timing small programs with a minimum of fuss and pitfalls. Everyone should learn how to use timeit, and basic usage is very easy to learn.

The simplest use of timeit is to pass one or more Python statements on the command line. Of course, shell syntax varies across platforms, so you may need to adjust these statements to the shell you use:

% python timeit.py "100 + 200" 10000000 loops, best of 3: 0.0932 usec per loop % python timeit.py "100 - 200" 10000000 loops, best of 3: 0.0931 usec per loop

As expected, integer addition and subtraction are just about equally expensive. (Don't fall into the trap of attributing any significance to differences as tiny as this one!) timeit picks the best way of measuring time on your platform and runs your code in a loop. The module tries a few times first to determine how many iterations to use in the loop, aiming at a total loop time between 0.2 and 2 seconds. When it determines a suitable number of iterations for the loop, it then runs the loop three times, reports the shortest time, and computes the time per loop iteration. The iterations per loop, and number of loops to run, can be forced to specific values with command-line options. See the Python Library Reference for details. (It's part of Python's online documentation and probably also comes in some handy form with your version of Python.)

As always, you should keep your machine as quiet as possible when running timing tests. The primary reason timeit runs three repetitions of the loop and reports the minimum time is to guard against skewed results due to other machine activity. This is especially important when running snippets that do very little work, such as the preceding examples. In such cases, even just one unfortunate interruption can grossly increase the reported time. Even so, on my quiet machine, snippets that run this fast can still yield confusing results:

% python timeit.py "100 + 200; 100 - 200" 10000000 loops, best of 3: 0.151 usec per loop % python timeit.py "100 + 200" "100 - 200" 10000000 loops, best of 3: 0.151 usec per loop

One correct conclusion is that modern Python no longer has a time penalty for writing two statements on two lines, instead of squashing them together on one line separated by a semicolon. Older Pythons generated a SET_LINENO opcode at the start of each logical line of code, and those opcodes consumed time to execute!

A more puzzling result is that adding and subtracting in one shot took 0.151 usec, but adding alone and subtracting alone took 0.0932 usec each. Why didn't we get 2*0.093 = 0.186 usec in the second experiment? The explanation is quite simple: timeit uses a fast iteration technique and doesn't try to subtract the iteration overhead from the reported results. When timing very fast snippets, this can be mildly disconcerting. Let's try to measure the overhead by timing a do-nothing statement:

% python timeit.py "pass" 10000000 loops, best of 3: 0.0203 usec per loop

While 0.02 usec is tiny, it's significant compared to the 0.093 usec reported for an integer add! Of course this effect diminishes to insignificance when timing more expensive operations:

% python timeit.py "100**100" 100000 loops, best of 3: 4.04 usec per loop % python timeit.py "200**200" 100000 loops, best of 3: 9.03 usec per loop % python timeit.py "100**100" "200**200" 100000 loops, best of 3: 13.1 usec per loop

Large integer exponentiation is much more expensive than adding small integers, and here the sum of the times for doing 100**100 and 200**200 in isolation is very close to the time for doing both at once.

The timeit module supports several other command-line options, and a programmatic interface too, but I'll defer to the Python Library Reference for that information. To start making productive use of timeit, the only other option you need to know about is the ability to pass "setup" statements on the command line. These statements execute once, outside the loop containing the code you're timing. For example, import statements are often used, as well as code that populates data structures. For example (assuming a backslash \ is your shell's way to indicate that a long logical line continues in the next physical line):

% python timeit.py -s "import random" \   -s "x=range(100000); random.shuffle(x)" "sorted(x)" 10 loops, best of 3: 152 msec per loop

For each of the three loops, timeit constructed the randomly ordered array just once, then ran sorted(x) repeatedly inside the loop. This was so expensive that timeit ran only 10 iterations per loop and changed its reporting units from microseconds to milliseconds. (In Python 2.3, timeit always reported in microseconds, but in version 2.4, it tries to be more helpful by picking the appropriate reporting units.) This is very different from:

% python timeit.py "import random" \   "x=range(100000); random.shuffle(x)" "sorted(x)" 10 loops, best of 3: 309 msec per loop

This snippet timed all the operations: importing random, building the list, randomly permuting the list, and sorting the list. This preparation code takes longer than sorting does! You may be surprised that we see from the reported times that it took at least as long to build and shuffle the list as it took to sort it. The first two operations take O(n) time, but sorting random data takes O(n log n) time; given this, how can this strange measurement be explained? Why didn't sorting take longer?

I won't explain that mystery here but will point out a more significant lesson: timing code always uncovers mysteries, and a timing tool as easy to use as timeit can be addictive. So be careful what you measure! Measuring itself will consume more of your time than you expect. As noted innumerable times by innumerable authors, the speed of most of your code doesn't matter at all. Find the 10% that consumes most of the time before worrying about any of it. When you find the true bottlenecks, timeit can help you measure the speed of alternatives objectivelyand you may be surprised by what you find.



Python Cookbook
Python Cookbook
ISBN: 0596007973
EAN: 2147483647
Year: 2004
Pages: 420

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net