Recipe 2.14. Rewinding an Input File to the Beginning

Credit: Andrew Dalke

Problem

You need to make an input file object (with data coming from a socket or other input file handle) rewindable back to the beginning so you can read it over.

Solution

Wrap the file object into a suitable class:

    from cStringIO import StringIO

    class RewindableFile(object):
        """ Wrap a file handle to allow seeks back to the beginning. """
        def __init__(self, input_file):
            """ Wraps input_file into a file-like object with rewind. """
            self.file = input_file
            self.buffer_file = StringIO()
            self.at_start = True
            try:
                self.start = input_file.tell()
            except (IOError, AttributeError):
                self.start = 0
            self._use_buffer = True
        def seek(self, offset, whence=0):
            """ Seek to a given byte position.
            Must be: whence == 0 and offset == self.start
            """
            if whence != 0:
                raise ValueError("whence=%r; expecting 0" % (whence,))
            if offset != self.start:
                raise ValueError("offset=%r; expecting %s" % (offset, self.start))
            self.rewind()
        def rewind(self):
            """ Simplified way to seek back to the beginning. """
            self.buffer_file.seek(0)
            self.at_start = True
        def tell(self):
            """ Return the current position of the file (must be at start).  """
            if not self.at_start:
                raise TypeError("RewindableFile can't tell except at start of file")
            return self.start
        def _read(self, size):
            if size < 0:
                # read all the way to the end of the file; drain the buffer
                # first, so that the write below appends at the buffer's end
                # instead of overwriting data that has not been replayed yet
                x = self.buffer_file.read()
                y = self.file.read()
                if self._use_buffer:
                    self.buffer_file.write(y)
                return x + y
            elif size == 0:
                # no need to actually read the empty string
                return ""
            x = self.buffer_file.read(size)
            if len(x) < size:
                y = self.file.read(size - len(x))
                if self._use_buffer:
                    self.buffer_file.write(y)
                return x + y
            return x
        def read(self, size=-1):
            """ Read up to 'size' bytes from the file.
            Default is -1, which means to read to end of file.
            """
            x = self._read(size)
            if self.at_start and x:
                self.at_start = False
            self._check_no_buffer()
            return x
        def readline(self):
            """ Read a line from the file. """
            # Can we get it out of the buffer_file?
            s = self.buffer_file.readline()
            if s[-1:] == "\n":
                return s
            # No, so read a line from the input file
            t = self.file.readline()
            if self._use_buffer:
                self.buffer_file.write(t)
            self._check_no_buffer()
            return s + t
        def readlines(self):
            """read all remaining lines from the file"""
            return self.read().splitlines(True)
        def _check_no_buffer(self):
            # If 'nobuffer' has been called and we're finished with the buffer file,
            # get rid of the buffer, redirect everything to the original input file.
            if not self._use_buffer and \
                   self.buffer_file.tell() == len(self.buffer_file.getvalue()):
                # for top performance, we rebind all relevant methods in self
                for n in 'seek tell read readline readlines'.split():
                    setattr(self, n, getattr(self.file, n, None))
                del self.buffer_file
        def nobuffer(self):
            """tell RewindableFile to stop using the buffer once it's exhausted"""
            self._use_buffer = False

Discussion

Sometimes, data coming from a socket or other input file handle isn't what it was supposed to be. For example, suppose you are reading from a buggy server, which is supposed to return an XML stream, but sometimes returns an unformatted error message instead. (This scenario often occurs because many servers don't handle incorrect input very well.)

This recipe's RewindableFile class helps you solve this problem. r = RewindableFile(f) wraps the original input stream f into a "rewindable file" instance r which essentially mimics f's behavior but also provides a buffer. Read requests to r are forwarded to f, and the data thus read gets appended to a buffer, then returned to the caller. The buffer contains all the data read so far.

r can be told to rewind, meaning to seek back to the start position. Subsequent read requests are then served from the buffer until it is exhausted, at which point r resumes reading from the input stream. The newly read data is also appended to the buffer.
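The buffering and replay described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the recipe's class: it is written for Python 3 (io.BytesIO stands in for cStringIO), it handles only non-negative sizes, and the names ReplayReader and OneWayStream are hypothetical.

```python
import io

class ReplayReader:
    """Hypothetical, stripped-down illustration of buffer-and-replay reads."""

    def __init__(self, source):
        self.source = source          # input stream that cannot seek
        self.buffer = io.BytesIO()    # holds every byte read so far

    def read(self, size):
        # Serve as much as possible from the buffer first.
        data = self.buffer.read(size)
        missing = size - len(data)
        if missing > 0:
            # The buffer is exhausted, so its position is at the end
            # and this write appends the freshly read data.
            fresh = self.source.read(missing)
            self.buffer.write(fresh)
            data += fresh
        return data

    def rewind(self):
        self.buffer.seek(0)


class OneWayStream:
    """Stand-in for a socket-like stream: read() only, no seek()."""

    def __init__(self, payload):
        self._inner = io.BytesIO(payload)

    def read(self, size=-1):
        return self._inner.read(size)


r = ReplayReader(OneWayStream(b"abcdefgh"))
first = r.read(3)      # fetched from the stream and buffered
r.rewind()
replayed = r.read(5)   # three bytes from the buffer, two more from the stream
rest = r.read(10)      # whatever the stream still has
```

After the rewind, the second read returns the three buffered bytes followed by two new ones, which is the behavior the discussion attributes to r.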

When buffering is no longer needed, call the nobuffer method of r. This tells r that, once it's done reading the buffer's current contents, it can throw the buffer away. After nobuffer is called, the behavior of seek is no longer defined.
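As a rough sketch of what "throwing the buffer away" means, the hypothetical class below (Python 3, with io.BytesIO standing in for cStringIO) stops appending to its buffer after nobuffer is called and drops the buffer once its contents have been replayed. It assumes non-negative sizes, and it uses a simple None check rather than the method rebinding the recipe performs.

```python
import io

class DrainThenDirect:
    """Hypothetical illustration: replay a buffer, then read directly."""

    def __init__(self, source, buffered):
        self.source = source                 # rest of the input stream
        self.buffer = io.BytesIO(buffered)   # data read earlier, rewound
        self.use_buffer = True

    def nobuffer(self):
        self.use_buffer = False

    def read(self, size):
        if self.buffer is None:
            # Buffer already dropped: go straight to the source.
            return self.source.read(size)
        data = self.buffer.read(size)
        if len(data) < size:
            fresh = self.source.read(size - len(data))
            if self.use_buffer:
                self.buffer.write(fresh)
            data += fresh
        drained = self.buffer.tell() == len(self.buffer.getvalue())
        if not self.use_buffer and drained:
            self.buffer = None               # nothing left to replay
        return data


d = DrainThenDirect(io.BytesIO(b"defgh"), b"abc")
d.nobuffer()
part1 = d.read(2)                    # still inside the buffer
still_buffered = d.buffer is not None
part2 = d.read(3)                    # finishes the buffer, then the source
dropped = d.buffer is None
part3 = d.read(10)                   # direct read from the source
```

Once the buffer is gone there is nothing left to rewind to, which is why seek is undefined after that point.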

For example, suppose you have a server that gives either an error message of the form ERROR: cannot do that, or an XML data stream, starting with '<?xml'...:

    import urllib2
    import RewindableFile

    infile = urllib2.urlopen("http://somewhere/")
    infile = RewindableFile.RewindableFile(infile)
    s = infile.readline()
    if s.startswith("ERROR:"):
        raise Exception(s[:-1])
    infile.seek(0)
    infile.nobuffer()   # Don't buffer the data any more
    # ... process the XML from infile ...

One sometimes-useful Python idiom is not supported by the class in this recipe: you can't reliably stash away the bound methods of a RewindableFile instance. (If you don't know what bound methods are, no problem, of course, since in that case you surely won't want to stash them anywhere!) The reason for this limitation is that, once nobuffer has been called and the buffer is exhausted, the RewindableFile code rebinds the input file's seek, tell, read, readline, and readlines methods as instance attributes of self. This gives slightly better performance, at the cost of not supporting the infrequently used idiom of saving bound methods. See Recipe 6.11 for another example of a similar technique, where an instance irreversibly changes its own methods.
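To see why a stashed bound method goes stale, consider this small Python 3 sketch. The class Switching is hypothetical and much simpler than RewindableFile; it only mimics the idea of an instance rebinding one of its own methods.

```python
import io

class Switching:
    """Hypothetical class that rebinds its own read method on first use."""

    def __init__(self, source):
        self.source = source
        self.calls_through_wrapper = 0

    def read(self, size=-1):
        self.calls_through_wrapper += 1
        data = self.source.read(size)
        # From now on, looking up self.read finds the source's method.
        self.read = self.source.read
        return data


s = Switching(io.BytesIO(b"abcdef"))
stashed = s.read        # bound method captured before the rebinding
a = stashed(2)          # runs Switching.read, which triggers the rebinding
b = s.read(2)           # fresh lookup: calls the source's read directly
c = stashed(2)          # stale reference: still runs Switching.read
```

The stashed reference keeps calling the wrapper method even after the instance has switched over. In the recipe, such a stale method would go on to use self.buffer_file, which _check_no_buffer has already deleted.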

The tell method, which gives the current location of a file, can be called on an instance of RewindableFile only while the instance is at the start of the file: right after wrapping or after a rewind, and before any reading. At wrapping time, RewindableFile tries to get the real position from the wrapped file and records it as the beginning location. If the wrapped file does not support tell, the beginning location is taken to be 0. In either case, the RewindableFile implementation of tell just returns that recorded location.
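The way the beginning location is determined can be illustrated in isolation. The helper start_offset and the class NoTell below are hypothetical names; the try/except mirrors the one in the recipe's __init__, here run under Python 3 with io.BytesIO as the seekable file.

```python
import io

def start_offset(input_file):
    """Return the file's current position, or 0 if it cannot tell."""
    try:
        return input_file.tell()
    except (IOError, AttributeError):
        return 0


class NoTell:
    """Stand-in for a socket-like object with read() but no tell()."""

    def read(self, size=-1):
        return b""


f = io.BytesIO(b"header\nbody\n")
f.readline()                             # consume the first line
offset_seekable = start_offset(f)        # position after "header\n"
offset_stream = start_offset(NoTell())   # no tell(): falls back to 0
```

Because the offset is recorded when the file is wrapped, a file that has already been partly read is rewound to the wrapping point, not to byte 0.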

See Also

Site http://www.dalkescientific.com/Python/ for the latest version of this recipe's code; Library Reference and Python in a Nutshell docs on file objects and module cStringIO; Recipe 6.11 for another example of an instance effecting an irreversible behavior change on itself by rebinding its methods.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420
