Section 9.6. Unicode

9.6. Unicode

Plain strings are converted into Unicode strings either explicitly, with the unicode built-in, or implicitly, when you pass a plain string to a function that expects Unicode. In either case, the conversion is done by an auxiliary object known as a codec (for coder-decoder). A codec can also convert Unicode strings to plain strings, either explicitly, with the encode method of Unicode strings, or implicitly.

To identify a codec, pass the codec name to unicode or encode. When you pass no codec name, and for implicit conversion, Python uses a default encoding, normally 'ascii'. You can change the default encoding in the startup phase of a Python program, as covered in "The site and sitecustomize Modules" on page 338; see also setdefaultencoding on page 170. However, such a change is not a good idea for most "serious" Python code: it might too easily interfere with code in the standard Python libraries or third-party modules, written to expect the normal 'ascii'.

Every conversion has a parameter errors, a string specifying how conversion errors are to be handled. The default is 'strict', meaning any error raises an exception. When errors is 'replace', the conversion replaces each character that causes an error with '?' in a plain-string result and with u'\ufffd' in a Unicode result. When errors is 'ignore', the conversion silently skips characters that cause errors. When errors is 'xmlcharrefreplace', the conversion replaces each character that causes an error with the XML character reference representation of that character in the result. You may also code your own function to implement a conversion-error-handling strategy and register it under an appropriate name by calling codecs.register_error.

9.6.1. The codecs Module

The mapping of codec names to codec objects is handled by the codecs module. This module also lets you develop your own codec objects and register them so that they can be looked up by name, just like built-in codecs. Module codecs also lets you look up any codec explicitly, obtaining the functions the codec uses for encoding and decoding, as well as factory functions to wrap file-like objects. Such advanced facilities of module codecs are rarely used, and are not covered further in this book.

The codecs module, together with the encodings package of the standard Python library, supplies built-in codecs useful to Python developers dealing with internationalization issues. Python comes with over 100 codecs; a list of these codecs, with a brief explanation of each, is at http://docs.python.org/lib/standard-encodings.html. Any supplied codec can be installed as the site-wide default by module sitecustomize, but the preferred usage is to always specify the codec by name whenever you are converting explicitly between plain and Unicode strings. The codec installed by default is 'ascii', which accepts only characters with codes between 0 and 127, the 7-bit range of the American Standard Code for Information Interchange (ASCII) that is common to almost all encodings. A popular codec is 'latin-1', a fast, built-in implementation of the ISO 8859-1 encoding that offers a one-byte-per-character encoding of all special characters needed for Western European languages.

The codecs module also supplies codecs implemented in Python for most ISO 8859 encodings, with codec names from 'iso8859-1' to 'iso8859-15'. On Windows systems only, the codec named 'mbcs' wraps the platform's multibyte character set conversion procedures. Many codecs specifically support Asian languages. Module codecs also supplies several standard code pages (codec names from 'cp037' to 'cp1258'), Mac-specific encodings (codec names from 'mac-cyrillic' to 'mac-turkish'), and Unicode standard encodings 'utf-8' and 'utf-16' (the latter also has specific big-endian and little-endian variants: 'utf-16-be' and 'utf-16-le'). For use with UTF-16, module codecs also supplies attributes BOM_BE and BOM_LE, byte-order marks for big-endian and little-endian machines, respectively, and BOM, the byte-order mark for the current platform.

Module codecs also supplies a function to let you register your own conversion-error-handling functions.

register_error
register_error(name,func)

name must be a string. func must be callable with one argument e that's an instance of exception UnicodeDecodeError and must return a tuple with two items: the Unicode string to insert in the converted-string result and the index from which to continue the conversion (the latter is normally e.end). The function's body can use e.encoding, the name of the codec of this conversion, and e.object[e.start:e.end], the substring that caused the conversion error.

Module codecs also supplies two functions to ease dealing with files of encoded text.

EncodedFile
EncodedFile(file,datacodec,filecodec=None,errors='strict')

Wraps the file-like object file, returning a file-like object ef that implicitly and transparently applies the given encodings to all data read from or written to the file. When you write a string s to ef, ef first decodes s with the codec named by datacodec, then encodes the result with the codec named by filecodec and writes it to file. When you read a string, ef applies filecodec first, then datacodec. When filecodec is None, ef uses datacodec for both steps in either direction.

For example, if you want to write strings that are encoded in latin-1 to sys.stdout and have the strings come out in utf-8, use the following:

import sys, codecs sys.stdout = codecs.EncodedFile(sys.stdout,'latin-1','utf-8')

open
open(filename,mode='rb',encoding=None,errors='strict', buffering=1)

Uses the built-in function open (covered in "Creating a File Object with open" on page 216) to supply a file-like object that accepts and/or provides Unicode strings to/from Python client code, while the underlying file can either be in Unicode (when encoding is None) or use the codec named by encoding. For example, if you want to write Unicode strings to file uni.txt and have the strings implicitly encoded as latin-1 in the file, replacing with '?' any character that cannot be encoded in Latin-1, use the following:

import codecs flout = codecs.open('uni.txt','w','latin-1','replace') # now you can write Unicode strings directly to flout flout.write(u'élève') flout.close( )

9.6.2. The unicodedata Module

The unicodedata module supplies easy access to the Unicode Character Database. Given any Unicode character, you can use functions supplied by module unicodedata to obtain the character's Unicode category, official name (if any), and other, more exotic information. You can also look up the Unicode character (if any) that corresponds to a given official name. Such advanced facilities are rarely needed, and are not covered further in this book.