
Recipe 12.4. Autodetecting XML Encoding

Credit: Paul Prescod

Problem

You have XML documents that may use a large variety of Unicode encodings, and you need to find out which encoding each document is using.

Solution

This task is one that we need to code ourselves, rather than getting an existing package to perform it, if we want complete generality:

import codecs, encodings

""" Caller will hand this library a buffer string, and ask us to convert
    the buffer, or autodetect what codec the buffer probably uses. """

# 'None' stands for a potentially variable byte ("##" in the XML spec...)
autodetect_dict = { # byte pattern         : encoding name
                (0x00, 0x00, 0xFE, 0xFF) : "ucs4_be",
                (0xFF, 0xFE, 0x00, 0x00) : "ucs4_le",
                (0xFE, 0xFF, None, None) : "utf_16_be",
                (0xFF, 0xFE, None, None) : "utf_16_le",
                (0x00, 0x3C, 0x00, 0x3F) : "utf_16_be",
                (0x3C, 0x00, 0x3F, 0x00) : "utf_16_le",
                (0x3C, 0x3F, 0x78, 0x6D) : "utf_8",
                (0x4C, 0x6F, 0xA7, 0x94) : "EBCDIC",
                }

def autoDetectXMLEncoding(buffer):
    """ buffer -> encoding_name
        The buffer string should be at least four bytes long.
        Returns "utf_8", the XML default, if no other encoding can be
        detected.
        Note that encoding_name might not have an installed
        decoder (e.g., EBCDIC)
    """
    # A more efficient implementation would not decode the whole
    # buffer at once, but then we'd have to decode a character at
    # a time looking for the quote character, and that's a pain
    encoding = "utf_8" # According to the XML spec, this is the default
                       # This code successively tries to refine the default:
                       # Whenever it fails to refine, it falls back to
                       # the last place encoding was set
    # The pattern must be a tuple (a list cannot be a dictionary key);
    # bytearray yields the integer value of each byte
    byte1, byte2, byte3, byte4 = first4 = tuple(bytearray(buffer[0:4]))
    enc_info = autodetect_dict.get(first4, None)
    if not enc_info: # Try autodetection again, removing potentially
                     # variable bytes
        enc_info = autodetect_dict.get((byte1, byte2, None, None))
    if enc_info:
        encoding = enc_info # We have a guess...these are
                            # the new defaults
        # Try to find a more precise encoding using XML declaration
        try:
            secret_decoder_ring = codecs.lookup(encoding)[1]
        except LookupError:
            return encoding # No installed decoder (e.g., EBCDIC)
        # "replace" keeps undecodable bytes later in the buffer from
        # aborting the scan of the declaration
        decoded, length = secret_decoder_ring(buffer, "replace")
        # Drop a decoded byte-order mark, if any, before looking at line 1
        first_line = decoded.lstrip(u"\ufeff").split("\n", 1)[0]
        if first_line and first_line.startswith(u"<?xml"):
            encoding_pos = first_line.find(u"encoding")
            if encoding_pos != -1:
                # Look for double quotes
                quote_pos = first_line.find('"', encoding_pos)
                if quote_pos == -1:
                    # Look for single quote
                    quote_pos = first_line.find("'", encoding_pos)
                if quote_pos > -1:
                    quote_char = first_line[quote_pos]
                    rest = first_line[quote_pos+1:]
                    encoding = rest[:rest.find(quote_char)]
    return encoding

Discussion

The XML specification describes the outline of an algorithm for detecting the Unicode encoding that an XML document uses. This recipe implements that algorithm and helps your XML-processing programs determine which encoding is being used by a specific document.

The default encoding (unless we can determine another one specifically) must be UTF-8, as it is part of the specifications that define XML. Certain byte patterns in the first four, or sometimes even just the first two, bytes of the text can identify a different encoding. For example, if the text starts with the two bytes 0xFF, 0xFE we can be certain that these bytes are a byte-order mark that identifies the encoding type as little-endian (low byte before high byte in each character) and the encoding itself as UTF-16 (or the 32-bits-per-character UCS-4, if the next two bytes in the text are 0, 0).
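The byte-order-mark part of this check can also be sketched on its own with the BOM constants that the standard codecs module provides. This is only an illustration, not the recipe's code: the names BOM_TABLE and sniff_bom are invented here, and the sketch uses Python's utf_32_be and utf_32_le codec names for the four-byte marks, where the recipe's table says ucs4_be and ucs4_le.

```python
import codecs

# Longest marks first: the 4-byte little-endian mark begins with the
# 2-byte UTF-16 little-endian mark, so the order of the checks matters.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "utf_32_be"),   # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "utf_32_le"),   # FF FE 00 00
    (codecs.BOM_UTF16_BE, "utf_16_be"),   # FE FF
    (codecs.BOM_UTF16_LE, "utf_16_le"),   # FF FE
]

def sniff_bom(data):
    """Return a codec name if data starts with a known BOM, else None."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name
    return None
```

A text with no byte-order mark yields None, and the caller then falls back to the UTF-8 default or to the other byte patterns.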

If one of the byte patterns gives us a guess, we must also examine the first line of the text. For this purpose, we decode the text from a bytestring into Unicode with the encoding determined so far, and take everything up to the first line-end '\n' character. If the first line begins with u'<?xml', it's an XML declaration and may explicitly specify an encoding by using the keyword encoding as a pseudo-attribute. The nested if statements in the recipe check for that case and, if they find an encoding thus specified, the recipe returns it. This step is absolutely crucial, since any text starting with the single-byte ASCII-like representation of the XML declaration, <?xml, would otherwise be erroneously identified as encoded in UTF-8, while its explicit encoding attribute may specify it as being, for example, one of the ISO-8859 standard encodings.
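The following short sketch illustrates why the declaration has to be consulted. The sample document and the variable names are made up for the illustration: its first four bytes are the plain ASCII <?xm, which byte sniffing alone maps to UTF-8, yet the content is not valid UTF-8 at all.

```python
# A made-up document that declares ISO-8859-1 and contains one
# non-ASCII character (e with acute accent, the single byte 0xE9).
doc = (u'<?xml version="1.0" encoding="iso-8859-1"?>\n'
       u'<name>caf\u00e9</name>').encode("iso-8859-1")

# Byte sniffing alone sees the ASCII pattern for "<?xm".
looks_like_utf8 = doc[:4] == b"<?xm"

# Decoding the whole document as UTF-8 fails on the 0xE9 byte.
try:
    doc.decode("utf_8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the declared encoding succeeds.
text = doc.decode("iso-8859-1")
```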

This recipe makes the assumption that the XML declaration, if any, sits entirely on the first line of the text, that is, before the first '\n' character. The XML specification does not guarantee this: a declaration may spread its pseudo-attributes over several lines, and the recipe then misses an encoding that is named on a later line. If you need to deal with such documents, or with almost-XML documents whose declaration is malformed, you may need to apply some heuristic adjustments, for example, through regular expressions. However, it's impossible to offer precise suggestions for malformed input, since malformedness may come in such a wide variety of errant forms.
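As one possible heuristic of the regular-expression kind, the sketch below looks for the encoding inside the declaration without relying on line ends at all. It is an assumption-laden illustration, not part of the recipe: the names XML_DECL_RE and declared_encoding are invented here, the text is assumed to be already decoded to Unicode, and the pattern is deliberately looser than the grammar in the XML specification.

```python
import re

# Matches a leading XML declaration and captures the encoding name.
# [^>]*? may cross line breaks but never leaves the declaration, and
# \1 requires the closing quote to match the opening one.
XML_DECL_RE = re.compile(
    r"""<\?xml\s[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1"""
)

def declared_encoding(text):
    """Return the encoding named in a leading XML declaration, or None."""
    # Ignore a decoded byte-order mark in front of the declaration.
    match = XML_DECL_RE.match(text.lstrip(u"\ufeff"))
    return match.group(2) if match else None
```

Because the pattern is anchored at the start of the text and cannot cross a '>' character, an attribute named encoding on a later element is not mistaken for the declaration's.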

This code detects a variety of encodings, including some that are not yet supported by Python's Unicode decoders. So, the fact that you can decipher the encoding does not guarantee that you can then decipher the document itself!
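A caller can check whether a detected name is usable before trying to decode with it. The helper below is a minimal sketch (has_decoder is a name invented for this illustration) that relies only on codecs.lookup raising LookupError for an unknown encoding.

```python
import codecs

def has_decoder(encoding_name):
    """Return True if Python has a codec registered under this name."""
    try:
        codecs.lookup(encoding_name)
    except LookupError:
        return False
    return True
```

For example, has_decoder("utf_8") is true, while has_decoder("EBCDIC") is false, because Python registers its EBCDIC code pages under specific names such as cp500 rather than under a generic EBCDIC name.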
