Recipe 1.11. Checking Whether a String Is Text or Binary

Credit: Andrew Dalke

Problem

Python can use a plain string to hold either text or arbitrary bytes, and you need to determine (heuristically, of course: there can be no precise algorithm for this) which of the two cases holds for a certain string.

Solution

We can use the same heuristic criteria as Perl does, deeming a string binary if it contains any nulls or if more than 30% of its characters have the high bit set (i.e., codes of 128 and above) or are strange control codes. We have to code this ourselves, but this also means we easily get to tweak the heuristics for special application needs:

    from __future__ import division           # ensure / does NOT truncate
    import string
    text_characters = "".join(map(chr, range(32, 127))) + "\n\r\t\b"
    _null_trans = string.maketrans("", "")
    def istext(s, text_characters=text_characters, threshold=0.30):
        # if s contains any null, it's not text:
        if "\0" in s:
            return False
        # an "empty" string is "text" (arbitrary but reasonable choice):
        if not s:
            return True
        # Get the substring of s made up of non-text characters
        t = s.translate(_null_trans, text_characters)
        # s is 'text' if less than 30% of its characters are non-text ones:
        return len(t)/len(s) <= threshold
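The recipe's code targets Python 2, where a plain string can hold arbitrary bytes. As a rough sketch only, the same heuristic might be written for Python 3 bytes objects as follows. The name istext_bytes and the constant TEXT_BYTES are hypothetical, chosen for this illustration, and are not part of the recipe:

```python
# Python 3 adaptation (an assumption; the recipe itself is Python 2 code).
# The printable ASCII bytes plus newline, carriage return, tab, backspace.
TEXT_BYTES = bytes(range(32, 127)) + b"\n\r\t\b"

def istext_bytes(data, text_bytes=TEXT_BYTES, threshold=0.30):
    # Any null byte marks the data as binary.
    if b"\0" in data:
        return False
    # Empty data counts as text (the same arbitrary choice the recipe makes).
    if not data:
        return True
    # Delete every "text" byte; what remains are the non-text bytes.
    nontext = data.translate(None, text_bytes)
    # In Python 3, / is always true division, so no __future__ import is needed.
    return len(nontext) / len(data) <= threshold

print(istext_bytes(b"Hello, world!\n"))            # True
print(istext_bytes(b"\x89PNG\r\n\x1a\n\x00\x00"))  # False (contains a null)
print(istext_bytes(bytes(range(128, 256))))        # False (all high-bit bytes)
```

Passing None as the translation table to bytes.translate plays the role of the recipe's _null_trans identity table: nothing is remapped, and only the bytes in the second argument are deleted.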

Discussion

You can easily do minor customizations to the heuristics used by function istext by passing in specific values for the threshold, which defaults to 0.30 (30%), or for the string of those characters that are to be deemed "text" (which defaults to normal ASCII characters plus the four "normal" control characters, meaning ones that are often found in text). For example, if you expected Italian text encoded as ISO-8859-1, you could add the accented letters used in Italian, "àèéìòù", to the text_characters argument.
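To make the two customization knobs concrete, here is a small sketch, again assuming a Python 3 bytes adaptation of the heuristic (istext_bytes, ASCII_TEXT, and ITALIAN_TEXT are hypothetical names used only for this example). It shows a short Italian sample being rejected by the defaults and accepted once the accented letters are added or the threshold is raised:

```python
# Hypothetical Python 3 bytes version of the recipe's heuristic.
ASCII_TEXT = bytes(range(32, 127)) + b"\n\r\t\b"

def istext_bytes(data, text_bytes=ASCII_TEXT, threshold=0.30):
    if b"\0" in data:
        return False
    if not data:
        return True
    nontext = data.translate(None, text_bytes)
    return len(nontext) / len(data) <= threshold

# Extend the "text" set with the Italian accented letters, as ISO-8859-1 bytes.
ITALIAN_TEXT = ASCII_TEXT + "àèéìòù".encode("iso-8859-1")

# Five bytes, two of which (è and ù) have the high bit set: a 40% non-text ratio.
sample = "è più".encode("iso-8859-1")

print(istext_bytes(sample))                           # False: 0.4 > 0.30
print(istext_bytes(sample, text_bytes=ITALIAN_TEXT))  # True: no non-text bytes
print(istext_bytes(sample, threshold=0.5))            # True: 0.4 <= 0.5
```

Very short inputs like this one are where the defaults are most fragile, since a couple of accented letters can push the ratio past 30%.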

Often, what you need to check as being either binary or text is not a string, but a file. Again, we can use the same heuristics as Perl, checking just the first block of the file with the istext function shown in this recipe's Solution:

    def istextfile(filename, blocksize=512, **kwds):
        return istext(open(filename).read(blocksize), **kwds)
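A variant worth considering, sketched here under the assumption of Python 3, opens the file in binary mode (so no newline translation or decoding happens before the check) and uses a with statement so that the file is closed promptly. The helper istext_bytes is a hypothetical bytes adaptation of the recipe's istext, and the file names are made up for the demonstration:

```python
import os
import tempfile

# Hypothetical Python 3 bytes version of the recipe's heuristic.
ASCII_TEXT = bytes(range(32, 127)) + b"\n\r\t\b"

def istext_bytes(data, text_bytes=ASCII_TEXT, threshold=0.30):
    if b"\0" in data:
        return False
    if not data:
        return True
    nontext = data.translate(None, text_bytes)
    return len(nontext) / len(data) <= threshold

def istextfile(filename, blocksize=512, **kwds):
    # Read only the first block, in binary mode, and close the file afterwards.
    with open(filename, "rb") as f:
        return istext_bytes(f.read(blocksize), **kwds)

# Demonstration on two throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "notes.txt")
    with open(text_path, "wb") as f:
        f.write(b"plain ASCII text\n" * 100)

    bin_path = os.path.join(tmp, "blob.bin")
    with open(bin_path, "wb") as f:
        f.write(bytes(range(256)) * 4)

    print(istextfile(text_path))  # True
    print(istextfile(bin_path))   # False: the first block contains a null byte
```

Because only the first blocksize bytes are examined, a file that starts with a long textual header followed by binary data would still be classified as text.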

Note that, by default, the expression len(t)/len(s) used in the body of function istext would truncate the result to 0, since it is a division between integer numbers. In some future version (probably Python 3.0, a few years away), Python will change the meaning of the / operator so that it performs division without truncation; if you really do want truncation, you should use the truncating-division operator, //.
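The effect of truncation on the heuristic can be seen with a tiny example. The sketch below assumes true-division semantics for / (Python 3, or a Python 2 module that begins with from __future__ import division) and uses made-up counts, with // standing in for the old truncating behavior:

```python
# Made-up counts: 9 non-text characters in a 10-character string.
nontext_count, total = 9, 10

true_ratio = nontext_count / total    # true division: 0.9
floor_ratio = nontext_count // total  # truncating division: 0

print(true_ratio)           # 0.9
print(floor_ratio)          # 0
print(true_ratio <= 0.30)   # False: correctly classified as binary
print(floor_ratio <= 0.30)  # True: truncation would wrongly say "text"
```

With truncating division, the ratio is 0 for any string that has at least one text character, so nearly every null-free string would pass the threshold test.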

However, Python has not yet changed the semantics of division, keeping the old one by default in order to ensure backwards compatibility. It's important that the millions of lines of code of Python programs and modules that already exist keep running smoothly under all new 2.x versions of Python; only upon a change of major language version number, no more often than every decade or so, is Python allowed to change in ways that aren't backwards-compatible.

Since, in the small module containing this recipe's Solution, it's handy for us to get the division behavior that is scheduled for introduction in some future release, we start our module with the statement:

    from __future__ import division

This statement doesn't affect the rest of the program, only the specific module that starts with this statement; throughout this module, / performs "true division" (without truncation). As of Python 2.3 and 2.4, division is the only thing you may want to import from __future__. Other features that used to be scheduled for the future, nested_scopes and generators, are now part of the language and cannot be turned off. It's innocuous to import them, but it makes sense to do so only if your program also needs to run under some older version of Python.

See Also

Recipe 1.10 for more details about function maketrans and string method translate; Language Reference for details about true versus truncating division.



Python Cookbook
ISBN: 0596007973
Year: 2004
Pages: 420
