Recipe 2.1. Reading from a File

Credit: Luther Blissett

Problem

You want to read text or data from a file.

Solution

Here's the most convenient way to read all of the file's contents at once into one long string:

all_the_text = open('thefile.txt').read()     # all text from a text file
all_the_data = open('abinfile', 'rb').read()  # all data from a binary file

However, it is safer to bind the file object to a name, so that you can call close on it as soon as you're done, to avoid ending up with open files hanging around. For example, for a text file:

file_object = open('thefile.txt')
try:
    all_the_text = file_object.read()
finally:
    file_object.close()

You don't necessarily have to use the try/finally statement here, but it's a good idea to use it, because it ensures the file gets closed even when an error occurs during reading.
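On Python 2.6 and later (or Python 2.5 with `from __future__ import with_statement`), the with statement is a more compact way to get the same guarantee: a file object can act as a context manager that closes itself when the block exits, even on error. This is a swapped-in idiom, not the one the recipe uses. A minimal sketch, in which the temporary path is a hypothetical stand-in for 'thefile.txt':

```python
import os
import tempfile

# Create a small throwaway file so the sketch is self-contained
# (the path is hypothetical and stands in for 'thefile.txt').
path = os.path.join(tempfile.mkdtemp(), 'thefile.txt')
with open(path, 'w') as out:
    out.write('first line\nsecond line\n')

# The with statement closes the file when the block exits,
# whether normally or because of an exception.
with open(path) as file_object:
    all_the_text = file_object.read()

print(repr(all_the_text))   # 'first line\nsecond line\n'
print(file_object.closed)   # True
```

As with try/finally, the call to open happens before the protected block is entered, so a failure to open the file never triggers a close.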

The simplest, fastest, and most Pythonic way to read a text file's contents at once as a list of strings, one per line, is:

list_of_all_the_lines = file_object.readlines()

This leaves a '\n' at the end of each line; if you don't want that, you have alternatives, such as:

list_of_all_the_lines = file_object.read().splitlines()
list_of_all_the_lines = file_object.read().split('\n')
list_of_all_the_lines = [L.rstrip('\n') for L in file_object]
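These three alternatives are not quite interchangeable. The sketch below (Python 3 syntax, with io.StringIO used as an illustrative stand-in for an open text file) shows that split('\n') produces a trailing empty string when the text ends with a newline, while splitlines and the list comprehension do not:

```python
import io

text = 'alpha\nbeta\n'

# Each alternative gets its own fresh file-like object, since reading
# consumes the contents.
via_splitlines = io.StringIO(text).read().splitlines()
via_split = io.StringIO(text).read().split('\n')
via_rstrip = [L.rstrip('\n') for L in io.StringIO(text)]

print(via_splitlines)  # ['alpha', 'beta']
print(via_split)       # ['alpha', 'beta', '']
print(via_rstrip)      # ['alpha', 'beta']
```

If the trailing empty string would upset later processing, splitlines or the list comprehension is the safer choice.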

The simplest and fastest way to process a text file one line at a time is simply to loop on the file object with a for statement:

for line in file_object:
    process(line)

This approach also leaves a '\n' at the end of each line; you may remove it by starting the for loop's body with:

    line = line.rstrip('\n')

or even, when you're OK with getting rid of trailing whitespace from each line (not just a trailing '\n'), the generally handier:

    line = line.rstrip()
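The difference between the two calls is easy to see on a line that carries trailing spaces and tabs. A small sketch (Python 3 syntax; io.StringIO stands in for an open text file, which is an assumption made for illustration):

```python
import io

# io.StringIO stands in for an open text file (an illustrative assumption).
file_object = io.StringIO('name = value  \t\n\n')

newline_only = []
all_whitespace = []
for line in file_object:
    newline_only.append(line.rstrip('\n'))  # strips trailing newlines only
    all_whitespace.append(line.rstrip())    # strips all trailing whitespace

print(newline_only)    # ['name = value  \t', '']
print(all_whitespace)  # ['name = value', '']
```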

Discussion

Unless the file you're reading is truly huge, slurping it all into memory in one gulp is often fastest and most convenient for any further processing. The built-in function open creates a Python file object (equivalently, you can call the built-in type file). You call the read method on that object to get all of the contents (whether text or binary) as a single long string. If the contents are text, you may choose to immediately split that string into a list of lines with the split method or the specialized splitlines method. Since splitting into lines is frequently needed, you may also call readlines directly on the file object for faster, more convenient operation.

You can also loop directly on the file object, or pass it to callables that require an iterable, such as list or max. When thus treated as an iterable, a file object open for reading has the file's text lines as the iteration items (therefore, this should be done for text files only). This kind of line-by-line iteration is cheap in terms of memory consumption and fairly speedy too.
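As an illustration of passing a file object to callables that expect an iterable, the following sketch finds the longest line with max and collects all lines with list. The file name and contents are hypothetical, and the key argument to max assumes Python 2.5 or later:

```python
import os
import tempfile

# Hypothetical sample file, created only so the sketch runs on its own.
path = os.path.join(tempfile.mkdtemp(), 'thefile.txt')
with open(path, 'w') as out:
    out.write('short\na much longer line\nmid-size\n')

# max consumes the file object one line at a time.
file_object = open(path)
try:
    longest_line = max(file_object, key=len)
finally:
    file_object.close()

# list gives the same result as readlines.
file_object = open(path)
try:
    all_lines = list(file_object)
finally:
    file_object.close()

print(repr(longest_line))  # 'a much longer line\n'
print(all_lines)           # ['short\n', 'a much longer line\n', 'mid-size\n']
```

The file is reopened for the second pass because iterating over a file object consumes it: once max has reached the end, there are no lines left for list to collect.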

On Unix and Unix-like systems, such as Linux, Mac OS X, and other BSD variants, there is no real distinction between text files and binary data files. On Windows and very old Macintosh systems, however, line terminators in text files are encoded, not with the standard '\n' separator, but with '\r\n' and '\r', respectively. Python translates these line-termination characters into '\n' on your behalf. This means that you need to tell Python when you open a binary file, so that it won't perform such translation. To do so, use 'rb' as the second argument to open. This is innocuous even on Unix-like platforms, and it's a good habit to distinguish binary files from text files even there, although it's not mandatory in that case. Such good habits will make your programs more immediately understandable, as well as more compatible with different platforms.

If you're unsure about which line-termination convention a certain text file might be using, use 'rU' as the second argument to open, requesting universal endline translation. This lets you freely interchange text files among Windows, Unix (including Mac OS X), and old Macintosh systems, without worries: all kinds of line-ending conventions get mapped to '\n', whatever platform your code is running on.
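The effect of binary mode versus newline translation can be sketched as follows. This example assumes Python 3, where text mode applies universal newline translation by default, so no 'U' flag is passed; the file name and contents are hypothetical:

```python
import os
import tempfile

# Write raw bytes that mix the three line-ending conventions.
path = os.path.join(tempfile.mkdtemp(), 'mixed.txt')
with open(path, 'wb') as out:
    out.write(b'unix\nwindows\r\nold mac\rlast')

# Binary mode: the bytes come back exactly as stored.
with open(path, 'rb') as file_object:
    raw = file_object.read()

# Text mode (Python 3 default): every line ending is mapped to '\n'.
with open(path) as file_object:
    text = file_object.read()

print(repr(raw))   # b'unix\nwindows\r\nold mac\rlast'
print(repr(text))  # 'unix\nwindows\nold mac\nlast'
```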

You can call methods such as read directly on the file object produced by the open function, as shown in the first snippet of the solution. When you do so, you no longer have a reference to the file object as soon as the reading operation finishes. In practice, Python notices the lack of a reference at once, and immediately closes the file. However, it is better to bind a name to the result of open, so that you can call close yourself explicitly when you are done with the file. This ensures that the file stays open for as short a time as possible, even on platforms such as Jython, IronPython, and other hypothetical future versions of Python, on which more advanced garbage-collection mechanisms might delay the automatic closing that the current version of C-based Python performs at once. To ensure that a file object is closed even if errors happen during its processing, the most solid and prudent approach is to use the try/finally statement:

file_object = open('thefile.txt')
try:
    for line in file_object:
        process(line)
finally:
    file_object.close()

Be careful not to place the call to open inside the try clause of this try/finally statement (a rather common error among beginners). If an error occurs during the opening, there is nothing to close, and besides, nothing gets bound to the name file_object, so you definitely don't want to call file_object.close()!
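The consequence of this mistake can be seen in a short sketch. The function names read_wrong and read_right are hypothetical, and the exception names shown are those of Python 3. When open fails inside the try clause, the finally clause raises a second error that hides the real cause:

```python
import os
import tempfile

def read_wrong(path):
    try:
        file_object = open(path)   # mistake: open is inside the try clause
        return file_object.read()
    finally:
        file_object.close()        # file_object is unbound if open failed

def read_right(path):
    file_object = open(path)       # a failure here propagates untouched
    try:
        return file_object.read()
    finally:
        file_object.close()

# A path that is guaranteed not to exist.
missing = os.path.join(tempfile.mkdtemp(), 'missing.txt')

errors = {}
for reader in (read_wrong, read_right):
    try:
        reader(missing)
    except Exception as err:
        errors[reader.__name__] = err

print(type(errors['read_wrong']).__name__)  # UnboundLocalError
print(type(errors['read_right']).__name__)  # FileNotFoundError
```

With read_right, the caller sees the genuine I/O error and can react to it. With read_wrong, the caller sees a confusing complaint about an unbound local variable.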

If you choose to read the file a little at a time, rather than all at once, the idioms are different. Here's one way to read a binary file 100 bytes at a time, until you reach the end of the file:

file_object = open('abinfile', 'rb')
try:
    while True:
        chunk = file_object.read(100)
        if not chunk:
            break
        do_something_with(chunk)
finally:
    file_object.close()

Passing an argument N to the read method ensures that read will read only the next N bytes (or fewer, if fewer than N bytes remain before the end of the file). read returns the empty string when it reaches the end of the file.

Complicated loops are best encapsulated as reusable generators. In this case, we can encapsulate the logic only partially, because in Python 2.4 and earlier a generator's yield keyword is not allowed in the try clause of a try/finally statement. Giving up on the assurance of file closing afforded by try/finally, we can therefore settle for:

def read_file_by_chunks(filename, chunksize=100):
    file_object = open(filename, 'rb')
    while True:
        chunk = file_object.read(chunksize)
        if not chunk:
            break
        yield chunk
    file_object.close()

Once this read_file_by_chunks generator is available, your application code to read and process a binary file by fixed-size chunks becomes extremely simple:

for chunk in read_file_by_chunks('abinfile'):
    do_something_with(chunk)
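Since Python 2.5 (PEP 342), yield is permitted inside the try clause of a try/finally statement, so on such interpreters the generator can keep the closing guarantee. The following is a sketch under that assumption, not the recipe's own code. The name read_file_by_chunks_safely and the sample file are hypothetical, and the demonstration uses Python 3 syntax:

```python
import os
import tempfile

def read_file_by_chunks_safely(filename, chunksize=100):
    file_object = open(filename, 'rb')
    try:
        while True:
            chunk = file_object.read(chunksize)
            if not chunk:          # empty result means end of file
                break
            yield chunk
    finally:
        file_object.close()        # runs on exhaustion, error, or close()

# Hypothetical 250-byte sample file.
path = os.path.join(tempfile.mkdtemp(), 'abinfile')
data = bytes(range(250))
with open(path, 'wb') as out:
    out.write(data)

chunks = list(read_file_by_chunks_safely(path))
print([len(chunk) for chunk in chunks])  # [100, 100, 50]

# Abandoning the generator early: close() raises GeneratorExit at the
# yield, so the finally clause still closes the file.
gen = read_file_by_chunks_safely(path)
first = next(gen)
gen.close()
```

The last chunk is shorter than chunksize because only 50 bytes remain, which matches the behavior of read described above.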

Reading a text file one line at a time is a frequent task. Just loop on the file object, as in:

for line in open('thefile.txt', 'rU'):
    do_something_with(line)

Here, too, in order to be 100% certain that no uselessly open file object will ever be left just hanging around, you may want to code this snippet in a more rigorously correct and prudent way:

file_object = open('thefile.txt', 'rU')
try:
    for line in file_object:
        do_something_with(line)
finally:
    file_object.close()
