(New in 2.0) The unicodedata module contains Unicode character properties, such as character categories, decomposition data, and numerical values. Its use is shown in Example 8-3.
Example 8-3. Using the unicodedata Module
File: unicodedata-example-1.py import unicodedata for char in [u"A", u"-", u"1", u"N{LATIN CAPITAL LETTER O WITH DIAERESIS}"]: print repr(char), print unicodedata.category(char), print repr(unicodedata.decomposition(char)), print unicodedata.decimal(char, None), print unicodedata.numeric(char, None) u'A' Lu '' None None u'-' Pd '' None None u'1' Nd '' 1 1.0 u'303226' Lu '004F 0308' None None
Note that in Python 2.0, properties for CJK ideographs and Hangul syllables are missing. This affects characters in the range 0x3400-0x4DB5, 0x4E00-0x9FA5, and 0xAC00-D7A3. The first character in each range has correct properties, so you can work around this problem by simply mapping each character to the beginning:
def remap(char): # fix for broken unicode property database in Python 2.0 c = ord(char) if 0x3400 <= c <= 0x4DB5: return unichr(0x3400) if 0x4E00 <= c <= 0x9FA5: return unichr(0x4E00) if 0xAC00 <= c <= 0xD7A3: return unichr(0xAC00) return char
This bug has been fixed in Python 2.1.
Core Modules
More Standard Modules
Threads and Processes
Data Representation
File Formats
Mail and News Message Processing
Network Protocols
Internationalization
Multimedia Modules
Data Storage
Tools and Utilities
Platform-Specific Modules
Implementation Support Modules
Other Modules