If you’ve ever come across text in a foreign language that contains lots of
???? characters or have written some Python code and received a message such as
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6: ordinal not in range(128) then you have run into a problem with character sets, encodings, Unicode and the like.
The truth is that many developers are put off by Unicode because most of the time it is possible to muddle through rather than take the time to learn the basics. To make the problem worse if you have a system that manages to fudge the issues and just about work and then start trying to do things properly with Unicode it often highlights problems in other parts of your code.
The good news is that Python has great Unicode support, so the rest of
this article will show you how to correctly use Unicode in Pylons to avoid
? characters and
What is Unicode?¶
When computers were first being used the characters that were most important were unaccented English letters. Each of these letters could be represented by a number between 32 and 127 and thus was born ASCII, a character set where space was 32, the letter “A” was 65 and everything could be stored in 7 bits.
Most computers in those days were using 8-bit bytes so people quickly realized that they could use the codes 128-255 for their own purposes. Different people used the codes 128-255 to represent different characters and before long these different sets of characters were also standardized into code pages. This meant that if you needed some non-ASCII characters in a document you could also specify a codepage which would define which extra characters were available. For example Israel DOS used a code page called 862, while Greek users used 737. This just about worked for Western languages provided you didn’t want to write an Israeli document with Greek characters but it didn’t work at all for Asian languages where there are many more characters than can be represented in 8 bits.
Unicode is a character set that solves these problems by uniquely defining
every character that is used anywhere in the world. Rather than defining a
character as a particular combination of bits in the way ASCII does, each
character is assigned a code point. For example the word
hello is made
from code points
U+0048 U+0065 U+006C U+006C U+006F. The full list of code
points can be found at http://www.unicode.org/charts/.
There are lots of different ways of encoding Unicode code points into bits but the most popular encoding is UTF-8. Using UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes. This has the useful side effect that English text looks exactly the same in UTF-8 as it did in ASCII, because for every ASCII character with hexadecimal value 0xXY, the corresponding Unicode code point is U+00XY. This backwards compatibility is why if you are developing an application that is only used by English speakers you can often get away without handling characters properly and still expect things to work most of the time. Of course, if you use a different encoding such as UTF-16 this doesn’t apply since none of the code points are encoded to 8 bits.
The important things to note from the discussion so far are that:
- Unicode can represent pretty much any character in any writing system in widespread use today
- Unicode uses code points to represent characters and the way these map to bits in memory depends on the encoding
- The most popular encoding is UTF-8 which has several convenient properties
- It can handle any Unicode code point
- A Unicode string is turned into a string of bytes containing no embedded zero bytes. This avoids byte-ordering issues, and means UTF-8 strings can be processed by C functions such as strcpy() and sent through protocols that can’t handle zero bytes
- A string of ASCII text is also valid UTF-8 text
- UTF-8 is fairly compact; the majority of code points are turned into two bytes, and values less than 128 occupy only a single byte.
- If bytes are corrupted or lost, it’s possible to determine the start of the next UTF-8-encoded code point and resynchronize.
Since Unicode 3.1, some extensions have even been defined so that the defined range is now U+000000 to U+10FFFF (21 bits), and formally, the character set is defined as 31-bits to allow for future expansion. It is a myth that there are 65,536 Unicode code points and that every Unicode letter can really be squeezed into two bytes. It is also incorrect to think that UTF-8 can represent less characters than UTF-16. UTF-8 simply uses a variable number of bytes for a character, sometimes just one byte (8 bits).
Unicode in Python¶
In Python Unicode strings are expressed as instances of the built-in
unicode type. Under the hood, Python represents Unicode strings as either
16 or 32 bit integers, depending on how the Python interpreter was compiled.
unicode() constructor has the signature
errors]). All of its arguments should be 8-bit strings. The first argument is
converted to Unicode using the specified encoding; if you leave off the
encoding argument, the ASCII encoding is used for the conversion, so characters
greater than 127 will be treated as errors:
>>> unicode('hello') u'hello' >>> s = unicode('hello') >>> type(s) <type 'unicode'> >>> unicode('hello' + chr(255)) Traceback (most recent call last): File "<stdin>", line 1, in ? UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6: ordinal not in range(128)
errors argument specifies what to do if the string can’t be decoded to
ascii. Legal values for this argument are
'strict' (raise a
'replace' (replace the character that
can’t be decoded with another one), or
'ignore' (just leave the character
out of the Unicode result).
>>> unicode('\x80abc', errors='strict') Traceback (most recent call last): File "<stdin>", line 1, in ? UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0: ordinal not in range(128) >>> unicode('\x80abc', errors='replace') u'\ufffdabc' >>> unicode('\x80abc', errors='ignore') u'abc'
It is important to understand the difference between encoding and decoding.
Unicode strings are considered to be the Unicode code points but any
representation of the Unicode string has to be encoded to something else, for
example UTF-8 or ASCII. So when you are converting an ASCII or UTF-8 string to
Unicode you are decoding it and when you are converting from Unicode to UTF-8
or ASCII you are encoding it. This is why the error in the example above says
that the ASCII codec cannot decode the byte
0x80 from ASCII to Unicode
because it is not in the range(128) or 0-127. In fact
0x80 is hex for 128
which the first number outside the ASCII range. However if we tell Python that
0x80 is encoded with the
'8859' character sets (which incidentally are different names for the same
thing) we get the result we expected:
>>> unicode('\x80', encoding='latin-1') u'\x80'
The character encodings Python supports are listed at http://docs.python.org/lib/standard-encodings.html
Unicode objects in Python have most of the same methods that normal Python
strings provide. Python will try to use the
'ascii' codec to convert
strings to Unicode if you do an operation on both types:
>>> a = 'hello' >>> b = unicode(' world!') >>> print a + b u'hello world!'
You can encode a Unicode string using a particular encoding like this:
>>> u'Hello World!'.encode('utf-8') 'Hello World!'
Unicode Literals in Python Source Code¶
In Python source code, Unicode literals are written as strings prefixed with the ‘u’ or ‘U’ character:
>>> u'abcdefghijk' >>> U'lmnopqrstuv'
You can also use
''' versions too. For example:
>>> u"""This ... is a really long ... Unicode string"""
Specific code points can be written using the
\u escape sequence, which is
followed by four hex digits giving the code point. If you use
you specify 8 hex digits instead of 4. Unicode literals can also use the same
escape sequences as 8-bit strings, including
\x only takes two
hex digits so it can’t express all the available code points. You can add
characters to Unicode strings using the
unichr() built-in function and find
out what the ordinal is with
Here is an example demonstrating the different alternatives:
>>> s = u"\x66\u0072\u0061\U0000006e" + unichr(231) + u"ais" >>> # ^^^^ two-digit hex escape >>> # ^^^^^^ four-digit Unicode escape >>> # ^^^^^^^^^^ eight-digit Unicode escape >>> for c in s: print ord(c), ... 97 102 114 97 110 231 97 105 115 >>> print s français
Using escape sequences for code points greater than 127 is fine in small doses but Python 2.4 and above support writing Unicode literals in any encoding as long as you declare the encoding being used by including a special comment as either the first or second line of the source file:
#!/usr/bin/env python # -*- coding: latin-1 -*- u = u'abcdé' print ord(u[-1])
If you don’t include such a comment, the default encoding used will be ASCII. Versions of Python before 2.4 were Euro-centric and assumed Latin-1 as a default encoding for string literals; in Python 2.4, characters greater than 127 still work but result in a warning. For example, the following program has no encoding declaration:
#!/usr/bin/env python u = u'abcdé' print ord(u[-1])
When you run it with Python 2.4, it will output the following warning:
sys:1: DeprecationWarning: Non-ASCII character '\xe9' in file testas.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
and then the following output:
For real world use it is recommended that you use the UTF-8 encoding for your file but you must be sure that your text editor actually saves the file as UTF-8 otherwise the Python interpreter will try to parse UTF-8 characters but they will actually be stored as something else.
Windows users who use the SciTE editor can specify the encoding of their file from the menu using the
If you are working with Unicode in detail you might also be interested in the
unicodedata module which can be used to find out Unicode properties such as a character’s name, category, numeric value and the like.
Input and Output¶
We now know how to use Unicode in Python source code but input and output can also be different using Unicode. Of course, some libraries natively support Unicode and if these libraries return Unicode objects you will not have to do anything special to support them. XML parsers and SQL databases frequently support Unicode for example.
If you remember from the discussion earlier, Unicode data consists of code
points. In order to send Unicode data via a socket or write it to a file you
usually need to encode it to a series of bytes and then decode the data back to
Unicode when reading it. You can of course perform the encoding manually
reading a byte at the time but since encodings such as UTF-8 can have variable
numbers of bytes per character it is usually much easier to use Python’s
built-in support in the form of the
The codecs module includes a version of the
open() function that
returns a file-like object that assumes the file’s contents are in a specified
encoding and accepts Unicode parameters for methods such as
The function’s parameters are open(filename, mode=’rb’, encoding=None,
mode can be ‘r’, ‘w’, or ‘a’, just like the
corresponding parameter to the regular built-in
open() function. You can
+ character to update the file.
buffering is similar to the
standard function’s parameter.
encoding is a string giving the encoding to
use, if not specified or specified as
None, a regular Python file object
that accepts 8-bit strings is returned. Otherwise, a wrapper object is
returned, and data written to or read from the wrapper object will be converted
errors specifies the action for encoding errors and can be one
of the usual values of
'replace' which we
saw right at the begining of this document when we were encoding strings in
Python source files.
Here is an example of how to read Unicode from a UTF-8 encoded file:
import codecs f = codecs.open('unicode.txt', encoding='utf-8') for line in f: print repr(line)
It’s also possible to open files in update mode, allowing both reading and writing:
f = codecs.open('unicode.txt', encoding='utf-8', mode='w+') f.write(u"\x66\u0072\u0061\U0000006e" + unichr(231) + u"ais") f.seek(0) print repr(f.readline()[:1]) f.close()
Notice that we used the
repr() function to display the Unicode data. This
is very useful because if you tried to print the Unicode data directly, Python
would need to encode it before it could be sent the console and depending on
which characters were present and the character set used by the console, an
error might be raised. This is avoided if you use
The Unicode character
U+FEFF is used as a byte-order mark or BOM, and is often written as the first character of a file in order to assist with auto-detection of the file’s byte ordering. Some encodings, such as UTF-16, expect a BOM to be present at the start of a file, but with others such as UTF-8 it isn’t necessary.
When such an encoding is used, the BOM will be automatically written as the first character and will be silently dropped when the file is read. There are variants of these encodings, such as ‘utf-16-le’ and ‘utf-16-be’ for little-endian and big-endian encodings, that specify one particular byte ordering and don’t skip the BOM.
Some editors including SciTE will put a byte order mark (BOM) in the text file when saved as UTF-8, which is strange because UTF-8 doesn’t need BOMs.
Most modern operating systems support the use of Unicode filenames. The filenames are transparently converted to the underlying filesystem encoding. The type of encoding depends on the operating system.
On Windows 9x, the encoding is
On Mac OS X, the encoding is
On Unix, the encoding is the user’s preference according to the result of nl_langinfo(CODESET), or None if the nl_langinfo(CODESET) failed.
On Windows NT+, file names are Unicode natively, so no conversion is performed.
getfilesystemencoding still returns
mbcs, as this is the encoding that
applications should use when they explicitly want to convert Unicode strings to
byte strings that are equivalent when used as file names.
mbcs is a special encoding for Windows that effectively means “use
whichever encoding is appropriate”. In Python 2.3 and above you can find out
the system encoding with
Most file and directory functions and methods support Unicode. For example:
filename = u"\x66\u0072\u0061\U0000006e" + unichr(231) + u"ais" f = open(filename, 'w') f.write('Some data\n') f.close()
Other functions such as
os.listdir() will return Unicode if you pass a
Unicode argument and will try to return strings if you pass an ordinary 8 bit
string. For example running this example as
filename = u"Sample " + unichar(5000) f = open(filename, 'w') f.close() import os print os.listdir('.') print os.listdir(u'.')
will produce the following output:
['Sample?', 'test.py'] [u'Sample\u1388', u'test.py']
Applying this to Web Programming¶
So far we’ve seen how to use encoding in source files and seen how to decode text to Unicode and encode it back to text. We’ve also seen that Unicode objects can be manipulated in similar ways to strings and we’ve seen how to perform input and output operations on files. Next we are going to look at how best to use Unicode in a web app.
The main rule is this:
Your application should use Unicode for all strings internally, decoding any input to Unicode as soon as it enters the application and encoding the Unicode to UTF-8 or another encoding only on output.
If you fail to do this you will find that
UnicodeDecodeError s will start
popping up in unexpected places when Unicode strings are used with normal 8-bit
strings because Python’s default encoding is ASCII and it will try to decode
the text to ASCII and fail. It is always better to do any encoding or decoding
at the edges of your application otherwise you will end up patching lots of
different parts of your application unnecessarily as and when errors pop up.
Unless you have a very good reason not to it is wise to use UTF-8 as the default encoding since it is so widely supported.
The second rule is:
Always test your application with characters above 127 and above 255 wherever possible.
If you fail to do this you might think your application is working fine, but as soon as your users do put in non-ASCII characters you will have problems. Using arabic is always a good test and www.google.ae is a good source of sample text.
The third rule is:
Always do any checking of a string for illegal characters once it’s in the form that will be used or stored, otherwise the illegal characters might be disguised.
For example, let’s say you have a content management system that takes a Unicode filename, and you want to disallow paths with a ‘/’ character. You might write this code:
def read_file(filename, encoding): if '/' in filename: raise ValueError("'/' not allowed in filenames") unicode_name = filename.decode(encoding) f = open(unicode_name, 'r') # ... return contents of file ...
This is INCORRECT. If an attacker could specify the ‘base64’ encoding, they
L2V0Yy9wYXNzd2Q= which is the base-64 encoded form of the string
'/etc/passwd' which is a file you clearly don’t want an attacker to get
hold of. The above code looks for
/ characters in the encoded form and
misses the dangerous character in the resulting decoded form.
Those are the three basic rules so now we will look at some of the places you might want to perform Unicode decoding in a Pylons application.
Pylons automatically coerces incoming form parameters (
GET (quote GET) and
params) into unicode objects (as of Pylons 0.9.6).
The request object contains a
charset (encoding) attribute defining what the parameters should be decoded to (via value.decode(charset, errors)), and the decoding
The unicode conversion of parameters can be disabled when
charset is set to
def index(self): #request.charset = 'utf-8' # utf-8 is the default charset #request.errors = 'replace' # replace is the default error handler # a MultiDict-like object of string names and unicode values decoded_get = request.GET # The raw data is always still available when charset is None request.charset = None raw_get = request.GET raw_params = request.params
Pylons can also be configured to not coerece parameters to unicode objects by
default. This is done by setting the following in the Pylons config object (at
the bottom of your project’s
# Don't coerce parameters to unicode config['pylons.request_options']['charset'] = None # You can also change the default error handler #config['pylons.request_options']['errors'] = 'strict'
request object is instructed to always automatically decode to
unicode via the
request_settings dictionary, the dictionary’s
value acts as a fallback charset. If a
charset was sent by the browser (via
Content-Type header), the browser’s value will take precedent: this
takes place when the
request object is constructed.
FieldStorage (file upload) objects will be handled specially for unicode
parameters: what’s provided is a copy of the original
with a unicode version of its
See File Uploads for more information on working with file uploads/
Only parameter values (not their associated names) are decoded to unicode by default. Since parameter names commonly map directly to Python variable names (which are restricted to the ASCII character set), it’s usually preferable to handle them as strings. For example, passing form parameters to a function as keyword arguments (e.g. **request.params.mixed()) doesn’t work with unicode keys.
WSGIRequest decode parameter names anyway, enable the
decode_param_names option on either the WSGIRequest object or the
name attributes are
also decoded to unicode when this option is enabled.
Pylons uses Mako as its default templating language. Mako handles all content
as unicode internally. It only deals in raw strings upon the final rendering of
the template (the Mako
render() function, used by the Pylons
function/Buffet plugin). The encoding of the rendered string can be configured;
Pylons sets the default value to UTF-8. To change this value, edit your
config/environment.py file and add the following option:
# Customize templating options via this variable tmpl_options = config['buffet.template_options'] tmpl_options['mako.output_encoding'] = 'utf-8'
utf-8 with the encoding you wish to use.
More information can be found at Mako’s Unicode Chapter.
Web pages should be generated with a specific encoding, most likely UTF-8. At
the very least, that means you should specify the following in the
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
The charset should also be specified in the
Content-Type header (which
Pylons automatically does for you):
response.headers['Content-type'] = 'text/html; charset=utf-8'
Pylons has a notion of
response_options, complimenting the
request_options mentioned in the Request Parameters section above. The
default request charset can be changed by setting the following in the Pylons
config object (at the bottom of your project’s
config['pylons.response_options']['charset'] = 'utf-8'
utf-8 with the charset you wish to use.
If you specify that your output is UTF-8, generally the web browser will
give you UTF-8. If you want the browser to submit data using a different
character set, you can set the encoding by adding the
tag to your form. Here is an example:
<form accept-encoding="US-ASCII" ...>
However, be forewarned that if the user tries to give you non-ASCII text, then:
- Firefox will translate the non-ASCII text into HTML entities.
- IE will ignore your suggested encoding and give you UTF-8 anyway.
The lesson to be learned is that if you output UTF-8, you had better be
prepared to accept UTF-8 by decoding the data in
described in the section above entitled Request Parameters‘.
Another technique which is sometimes used to determine the character set is to use an algorithm to analyse the input and guess the encoding based on probabilities.
For instance, if you get a file, and you don’t know what encoding it is encoded in, you can often rename the file with a .txt extension and then try to open it in Firefox. Then you can use the “View->Character Encoding” menu to try to auto-detect the encoding.
Your database driver should automatically convert from Unicode objects to a particular charset when writing and back again when reading. Again it is normal to use UTF-8 which is well supported.
You should check your database’s documentation for information on how it handles Unicode.
For example MySQL’s Unicode documentation is here http://dev.mysql.com/doc/refman/5.0/en/charset-unicode.html
Also note that you need to consider both the encoding of the database and the encoding used by the database driver.
If you’re using MySQL together with SQLAlchemy, see the following, as there are some bugs in MySQLdb that you’ll need to work around:
Hopefully you now understand the history of Unicode, how to use it in Python and where to apply Unicode encoding and decoding in a Pylons application. You should also be able to use Unicode in your web app remembering the basic rule to use UTF-8 to talk to the world, do the encode and decode at the edge of your application.
This information is based partly on the following articles which can be consulted for further information.:
Please feel free to report any mistakes to the Pylons mailing list or to the author. Any corrections or clarifications would be gratefully received.