Recipe 2.5. Counting Lines in a File

Credit: Luther Blissett

Problem

You need to compute the number of lines in a file.

Solution

The simplest approach for reasonably sized files is to read the file as a list of lines, so that the count of lines is the length of the list. If the file's path is in a string bound to a variable named thefilepath, all the code you need to implement this approach is:

count = len(open(thefilepath, 'rU').readlines())

For a truly huge file, however, this simple approach may be very slow or even fail to work. If you have to worry about humongous files, a loop on the file always works:

count = -1
for count, line in enumerate(open(thefilepath, 'rU')):
    pass
count += 1

A tricky alternative, potentially faster for truly humongous files, for when the line terminator is '\n' (or has '\n' as a substring, as happens on Windows):

count = 0
thefile = open(thefilepath, 'rb')
while True:
    buffer = thefile.read(8192*1024)
    if not buffer:
        break
    count += buffer.count('\n')
thefile.close()

The 'rb' argument to open is necessary if you're after speed; without that argument, this snippet might be very slow on Windows.

Discussion

When an external program counts a file's lines, such as wc -l on Unix-like platforms, you can of course choose to use that (e.g., via os.popen). However, it's generally simpler, faster, and more portable to do the line-counting in your own program. You can rely on almost all text files having a reasonable size, so that reading the whole file into memory at once is feasible. For all such normal files, the len of the result of readlines gives you the count of lines in the simplest way.

If the file is larger than available memory (say, a few hundred megabytes on a typical PC today), the simplest solution can become unacceptably slow, as the operating system struggles to fit the file's contents into virtual memory. It may even fail, when swap space is exhausted and virtual memory can't help any more. On a typical PC, with 256MB RAM and virtually unlimited disk space, you should still expect serious problems when you try to read into memory files above, say, 1 or 2 GB, depending on your operating system. (Some operating systems are much more fragile than others in handling virtual-memory issues under such overly stressed load conditions.) In this case, looping on the file object, as shown in this recipe's Solution, is better. The enumerate built-in keeps the line count without your code having to do it explicitly.
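The same memory-friendly loop can also be written as a one-liner, assuming Python 2.4 or later (for generator expressions): sum consumes the file's lines one at a time, so only the current line is ever in memory. The helper name linecount_gen below is mine, not the recipe's:

```python
import os
import tempfile

def linecount_gen(path):
    """Count lines without building a list: iterating the file object
    yields lines lazily, and sum() adds 1 for each line it sees."""
    f = open(path)
    try:
        return sum(1 for line in f)
    finally:
        f.close()

# Quick self-check on a small temporary file:
fd, path = tempfile.mkstemp()
os.write(fd, b'alpha\nbeta\ngamma\n')
os.close(fd)
assert linecount_gen(path) == 3
os.remove(path)
```

Like the enumerate loop, this never holds more than one line in memory, so it remains usable on files far larger than RAM.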

Counting line-termination characters while reading the file by bytes in reasonably sized chunks is the key idea in the third approach. It's probably the least immediately intuitive, and it's not perfectly cross-platform, but you might hope that it's fastest (e.g., when compared with recipe 8.2 in the Perl Cookbook).

However, in most cases, performance doesn't really matter all that much. When it does matter, the time-sink part of your program might not be what your intuition tells you it is, so you should never trust your intuition in this matter; instead, always benchmark and measure. For example, consider a typical Unix syslog file of middling size, a bit over 18 MB of text in 230,000 lines:

[situ@tioni nuc]$ wc nuc
 231581 2312730 18508908 nuc

And consider the following testing-and-benchmark framework script, bench.py:

import time
def timeo(fun, n=10):
    start = time.clock()
    for i in xrange(n): fun()
    stend = time.clock()
    thetime = stend-start
    return fun.__name__, thetime
import os
def linecount_w():
    return int(os.popen('wc -l nuc').read().split()[0])
def linecount_1():
    return len(open('nuc').readlines())
def linecount_2():
    count = -1
    for count, line in enumerate(open('nuc')): pass
    return count+1
def linecount_3():
    count = 0
    thefile = open('nuc', 'rb')
    while True:
        buffer = thefile.read(65536)
        if not buffer: break
        count += buffer.count('\n')
    return count
for f in linecount_w, linecount_1, linecount_2, linecount_3:
    print f.__name__, f()
for f in linecount_1, linecount_2, linecount_3:
    print "%s: %.2f" % timeo(f)

First, I print the line-counts obtained by all methods, thus ensuring that no anomaly or error has occurred (counting tasks are notoriously prone to off-by-one errors). Then, I run each alternative 10 times, under the control of the timing function timeo, and look at the results. Here they are, on the old but reliable machine I measured them on:

[situ@tioni nuc]$ python -O bench.py
linecount_w 231581
linecount_1 231581
linecount_2 231581
linecount_3 231581
linecount_1: 4.84
linecount_2: 4.54
linecount_3: 5.02

As you can see, the performance differences hardly matter: your users will never even notice a difference of 10% or so in one auxiliary task. However, the fastest approach (for my particular circumstances, on an old but reliable PC running a popular Linux distribution, and for this specific benchmark) is the humble loop-on-every-line technique, while the slowest one is the fancy, ambitious technique that counts line terminators by chunks. In practice, unless I had to worry about files of many hundreds of megabytes, I'd always use the simplest approach (i.e., the first one presented in this recipe).

Measuring the exact performance of code snippets (rather than blindly using complicated approaches in the hope that they'll be faster) is very important; so important, indeed, that the Python Standard Library includes a module, timeit, specifically designed for such measurement tasks. I suggest you use timeit, rather than coding your own little benchmarks as I have done here. The benchmark I just showed you is one I've had around for years, since well before timeit appeared in the standard Python library, so I think I can be forgiven for not using timeit in this specific case!
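As a minimal sketch of what a timeit-based version of such a measurement might look like (the generated sample file, repetition count, and variable names below are illustrative assumptions, not the book's benchmark):

```python
import os
import tempfile
import timeit

# Build a small sample file so the measurement is self-contained.
fd, path = tempfile.mkstemp()
os.write(fd, b'x\n' * 10000)
os.close(fd)

# timeit compiles and times the statement string; setup runs once.
setup = "path = %r" % path
readlines_time = timeit.timeit("len(open(path).readlines())",
                               setup=setup, number=10)
loop_time = timeit.timeit("sum(1 for line in open(path))",
                          setup=setup, number=10)
os.remove(path)
```

timeit takes care of repetition and timer selection for you, which is exactly the boilerplate the hand-rolled timeo function above reimplements.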

See Also

The Library Reference and Python in a Nutshell sections on file objects, the enumerate built-in, os.popen, and the time and timeit modules; Perl Cookbook recipe 8.2.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420