Recipe17.7.Accessing a Python Sequence Item-by-Item with the Iterator Protocol


Recipe 17.7. Accessing a Python Sequence Item-by-Item with the Iterator Protocol

Credit: Luther Blissett

Problem

You want to write a Python-callable C extension that takes as an argument a Python sequence (or other iterable) and accesses it sequentially, one item at a time, requiring no extra storage.

Solution

If you can afford to access the sequence item-by-item, without knowing in advance the number of items it has, you can often save memory by using PyObject_GetIter instead of PySequence_Fast:

#include <Python.h> static PyObject *totalIter(PyObject *self, PyObject *args) {     PyObject* seq;     PyObject* item;     double result;     /* get one argument as an iterator */     if(!PyArg_ParseTuple(args, "O", &seq))         return 0;     seq = PyObject_GetIter(seq);     if(!seq)         return 0;     /* process data sequentially */     result = 0.0;     while((item=PyIter_Next(seq))) {         PyObject *fitem;         fitem = PyNumber_Float(item);         if(!fitem) {             Py_DECREF(seq);             Py_DECREF(item);             PyErr_SetString(PyExc_TypeError, "all items must be numbers");             return 0;         }         result += PyFloat_AS_DOUBLE(fitem);         Py_DECREF(fitem);         Py_DECREF(item);     }     /* clean up and return result */     Py_DECREF(seq);     return Py_BuildValue("d", result); } static PyMethodDef totitMethods[  ] = {     {"totit", totalIter, METH_VARARGS, "Sum a sequence of numbers."},     {0} /* sentinel */ }; void inittotit(void) {     (void) Py_InitModule("totit", totitMethods); }

Discussion

PyObject_GetIter is appropriate only when it's OK for the rest of your C code to get the items one at a time, without knowing in advance the number of items in total. When this condition is met, PyObject_GetIter gives you roughly the same performance as PySequence_Fast (if the input argument is a list or tuple), but it can save memory allocation, and therefore can run faster, if the input argument is an iterator or another kind of sequence or iterable. In this recipe's function, since we are just summing the items, it is indeed perfectly OK to get them one at a time, and we don't need to know in advance the total number; therefore, using PyObject_GetIter is preferable. (In the real world, you would use Python's built-in function sum for this specific functionality, rather than coding a dedicated C function, but then, this is meant to be just an example!)

PyObject_GetIter takes one argument: a Python object from which an iterator is desired (much like Python's iter built-in function). It either returns 0, indicating an error, or an iterator object, on which you can repeatedly call PyIter_Next to get the next item (or 0, NULL, which does not indicate an error, but rather indicates the end of the iteration). Both PyObject_GetIter and PyIter_Next return new references, so we must use Py_DECREF when we're done with the respective objects.

As usual, the best way to build this extension (assuming that you've saved it as a file named totit.c) is with the distutils package. Place in the same directory as the C source a file named setup.py such as:

from distutils.core import setup, Extension setup(name="totit", maintainer="Luther Blissett", maintainer_email=     "situ@tioni.st", ext_modules=[Extension('totit', sources=['totit.c'])] )

then build and install by running:

<m>$ python setup.py install</m>

Part of the appeal of this approach is that it works on any platform, assuming that you have access to the same C compiler used to build your version of Python, and permission to write on the site-packages directory where the resulting dynamically loaded library gets installed.

Since Python extensions are often coded in C to maximize performance, it's interesting to measure performance compared to pure Python code dealing with the same task. A typical measurement setup might be a script such as the following timon.py:

import timeit, operator from total import total from totit import totit def timo(fn, sq, init):     T = timeit.Timer('timon.%s(%s)'%(fn,sq), 'import timon\n'+init)     print ' %5.5s: %5.2f' % (fn, T.timeit(40000)) def totpy(x):     result = 0.0     for item in x: result += item     return result def totre(x):     return reduce(operator.add, x, 0.0) def totsu(x):     return sum(x, 0.0) if _ _name_ _ == '_ _main_ _':     print 'on lists:'     for f in 'totre totpy total totit totsu'.split( ):         timo(f, 'seq', 'seq=range(2000)')     print 'on iters:'     for f in 'totre totpy total totit totsu'.split( ):         timo(f, 'g( )', 'def g( ):\n  for x in range(2000): yield x')

This script uses the timeit module of the Python Standard Library to measure accurately 40,000 calls to each function on 2,000-item lists and 2,000-item generators. The timeit.Timer constructor takes two string arguments: first the statement we're timing, then the setup statements that run before timing begins. Here, the statement we're timing calls functions in this module; therefore, the setup statements must import this modulewhich is why we add the import timon at the beginning of the setup statement string. I have also taken care to make all these functions strictly comparable, by having them all sum floats (not just ints). This purpose is the reason that I provide the explicit 0.0 initial arguments to built-in functions reduce and sum.

On my machine, running with the command-line switch -O so that Python can optimize operations a little bit, the timing results on Python 2.3 are:

<m>$ python -O timon.py</m> on lists:  totre: 136.04  totpy: 118.18  total: 56.61  totit: 59.66  totsu: 74.11 on iters:  totre: 220.86  totpy: 198.98  total: 199.72  totit: 201.70  totsu: 157.44

As you can see, the most important optimization is to avoid the "attractive nuisance" of the reduce built-in function: even a pure Python loop is faster! When we're dealing with lists, the special-purpose C-coded extensions presented in the last two recipes are fastest; but when we're dealing with generators, the fastest solution is provided by the built-in function sum. In practice, one would always use sum for this functionality, rather than bothering to code or expose special-purpose C functions.

See Also

The Extending and Embedding manual is available as part of the standard Python documentation set at http://www.python.org/doc/current/ext/ext.html; documentation on the Python C API is at http://www.python.org/doc/current/api/api.html; the section "Distributing Python Modules" in the standard Python documentation set is still incomplete but is a good source of information on the distutils package: Chapter 19 of this book covers iterators and generators in pure Python terms; Python in a Nutshell covers the essentials of extending and embedding, of the Python C API, of the distutils package, and of iterators; Python's Library Reference covers the timeit module.



Python Cookbook
Python Cookbook
ISBN: 0596007973
EAN: 2147483647
Year: 2004
Pages: 420

flylib.com © 2008-2017.
If you may any questions please contact us: flylib@qtcs.net