Python Unicode: Difference between revisions

Revision as of 03:43, 6 January 2007

Python has good Unicode support, but it is not always easy to use correctly. Some things to note:

You must test your application with real Unicode (not ASCII-encodable) text. You can miss lots of bugs if you just use normal ASCII text (i.e., a-z, no accents).
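
As a minimal sketch of the kind of bug that ASCII-only test data hides, consider a length check that counts characters where it should count encoded bytes. The function names and the 9-byte limit are hypothetical, and UTF-8 is assumed as the storage encoding.

```python
def fits_in_field(text, max_bytes):
    """Hypothetical buggy check: counts characters, not encoded bytes."""
    return len(text) <= max_bytes


def fits_in_field_fixed(text, max_bytes, encoding="utf-8"):
    """Counts the bytes the text occupies once encoded (UTF-8 assumed)."""
    return len(text.encode(encoding)) <= max_bytes


ascii_sample = "televisao"          # 9 characters, 9 bytes in UTF-8
unicode_sample = "televis\u00e3o"   # 9 characters, 10 bytes in UTF-8

# With ASCII input both versions agree, so the bug goes unnoticed.
ascii_agrees = fits_in_field(ascii_sample, 9) == fits_in_field_fixed(ascii_sample, 9)

# With non-ASCII input the two versions disagree, which exposes the bug.
unicode_agrees = fits_in_field(unicode_sample, 9) == fits_in_field_fixed(unicode_sample, 9)
```

Only the second sample distinguishes the two implementations, which is why test data should include text that cannot be encoded as ASCII.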

Be careful not to confuse 8-bit strings (binary data, of type "str") with text (Unicode data, of type "unicode"). It is easy to substitute one for the other as long as everything is ASCII; as soon as non-ASCII text appears, you will get a UnicodeEncodeError or UnicodeDecodeError.
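
The sketch below illustrates the distinction. It is written for Python 3, where the two types are named "bytes" and "str" rather than "str" and "unicode" as in the text above; the helper names are hypothetical. It shows that converting between the two with an ASCII codec works only while the data is ASCII, and that naming a suitable encoding (UTF-8 is assumed here) makes the conversion succeed.

```python
def decode_or_none(data, encoding):
    """Turn binary data into text, or return None if the codec rejects it."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


def encode_or_none(text, encoding):
    """Turn text into binary data, or return None if the codec rejects it."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


text = "televis\u00e3o"        # text containing one non-ASCII character
raw = text.encode("utf-8")     # the same content as binary data

ascii_only = decode_or_none(b"abc", "ascii")    # plain ASCII decodes fine
bad_decode = decode_or_none(raw, "ascii")       # fails: byte outside ASCII
good_decode = decode_or_none(raw, "utf-8")      # succeeds with the right codec
bad_encode = encode_or_none(text, "ascii")      # fails: character outside ASCII
```

A common convention, offered here as a suggestion and not as something the original text prescribes, is to decode bytes once at the input boundary, work with text internally, and encode once on output.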

Some resources to learn about Unicode:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky -- general Unicode information
Unicode HOWTO by Andrew M Kuchling -- Python unicode help
Unicode in Python by Jason Orendorff -- some more Python help
ascii to latin1 (and back again). A thread about searching unicode text, so that for example a search for "televisao" matches "televisão"
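Normalization FAQ (not Python-specific) -- the same text can be represented by more than one sequence of code points, for example a precomposed accented letter versus a base letter followed by a combining mark. This introduces ambiguity when comparing strings; the FAQ explains how normalization addresses it.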
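
The last two items can be sketched with the standard "unicodedata" module. This is one possible approach under stated assumptions, not necessarily the method used in the linked thread: the helper names are hypothetical, and stripping combining marks after NFD decomposition is only a rough way to ignore accents (it does not handle letters such as "ø" that have no decomposition).

```python
import unicodedata

composed = "televis\u00e3o"      # "ã" as a single precomposed code point
decomposed = "televisa\u0303o"   # "a" followed by COMBINING TILDE


def nfc(text):
    """Normalize to the composed form (NFC) so equal text compares equal."""
    return unicodedata.normalize("NFC", text)


def strip_accents(text):
    """Decompose (NFD), then drop combining marks such as accents."""
    parts = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in parts if not unicodedata.combining(ch))


def accent_insensitive_contains(haystack, needle):
    """Hypothetical search helper: ignores accents and letter case."""
    return strip_accents(needle).lower() in strip_accents(haystack).lower()


same_without_normalizing = composed == decomposed        # False
same_after_normalizing = nfc(composed) == nfc(decomposed)  # True
found = accent_insensitive_contains("A televis\u00e3o", "televisao")  # True
```

The two spellings render identically but compare unequal until both are normalized to the same form, which is the ambiguity the Normalization FAQ describes.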

@@ Line 16: / Line 16: @@
 * [http://www.tutorialsall.com/PYTHON/ascii-latin/ ascii to latin1] (and back again).  A thread about searching unicode text, so that for example a search for "televisao" matches "televisão"
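 * [http://unicode.org/faq/normalization.html Normalization FAQ] (not Python-specific) -- the same text can be represented by more than one sequence of code points, for example a precomposed accented letter versus a base letter followed by a combining mark.  This introduces ambiguity when comparing strings; the FAQ explains how normalization addresses it.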
+[[Category:Software development]]
+[[Category:Developers]]
+[[Category:Language support]]
+[[Category:Languages (international)]]