Unicode

From OLPC
Revision as of 19:48, 19 November 2007 by Crazy-chris (talk | contribs) (added python & pysqlite unicode howto)

About Unicode

Unicode is the Universal Character Set (UCS) standard. It is precisely equivalent to the ISO/IEC 10646-1 standard in the characters defined and their numeric code points. The Chinese GB18030 standard contains precisely the same set of characters, but encodes them differently. Unicode adds considerably more information than ISO 10646, including character properties and recommended algorithms for some essential text-handling processes, such as mixing left-to-right and right-to-left writing systems in documents. Unicode 4.0 defines more than 90,000 characters, including more than 70,000 Chinese/Japanese/Korean (CJK) characters (hanzi, kanji, hanja). Work on the CJK repertoire is carried out by the Ideographic Rapporteur Group, made up primarily of experts from the countries that use these writing systems for their principal languages.
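The character properties mentioned above can be inspected from Python's standard unicodedata module. The following is a minimal sketch: the helper name describe is hypothetical, and the properties reported are those of the Unicode database bundled with the interpreter, which may be newer than Unicode 4.0.

```python
import unicodedata

def describe(ch):
    """Return a few Unicode character properties for a single character."""
    return {
        "name": unicodedata.name(ch),           # official character name
        "category": unicodedata.category(ch),   # general category, e.g. Lu = uppercase letter
        "bidi": unicodedata.bidirectional(ch),  # class used when mixing text directions
    }

latin = describe(u'A')        # a left-to-right letter
hebrew = describe(u'\u05d0')  # HEBREW LETTER ALEF, a right-to-left letter
```

The bidi value is the input to the algorithm that lays out mixed left-to-right and right-to-left text: 'L' marks a left-to-right character and 'R' a right-to-left one.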

There are several controversies surrounding Unicode, including allegations of technical incompetence and of a conspiracy to destroy cultures. One persistent urban legend holds that Unicode is a 16-bit encoding that cannot handle more than 65,536 characters. In fact, Unicode defines a fixed-width 32-bit encoding form (UTF-32) and variable-length encoding forms (UTF-8 and UTF-16), all of which cover the full code space of more than a million code points (17 planes of 65,536 code points each, 1,114,112 in total).
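The size of the code space, and the fact that characters outside the first 65,536 code points can be encoded, can be checked with a short sketch. The helper name encoded_sizes is hypothetical; the codec names are the ones shipped with Python.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 65536
TOTAL_CODE_POINTS = PLANES * CODE_POINTS_PER_PLANE  # 1,114,112

def encoded_sizes(ch):
    """Return the number of bytes needed for one character in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The -be variants are used so that no byte order mark is counted.
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }

ascii_letter = encoded_sizes(u'A')                # code point U+0041
plane2_ideograph = encoded_sizes(u'\U00020000')   # a CJK ideograph in plane 2
```

A character beyond U+FFFF, such as U+20000, takes four bytes in all three forms: UTF-16 represents it as a pair of 16-bit surrogate code units rather than failing.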

Unicode has the characters needed for more than 30 writing systems, including enough to write all of the official languages of every country in the world. Work continues on encoding the extra characters required for minority languages and for scholarship.

Unicode also includes a large number of symbols, including many for the typesetting of mathematical texts.

Unicode is the basis of all new computer standards where character handling beyond the level of US-ASCII is required. This includes standards from ISO, IETF/W3C (Internet Engineering Task Force and World Wide Web Consortium), IEEE (Institute of Electrical and Electronics Engineers), individual governments (ANSI in the US, DIN in Germany, etc.), and industry standards. Although several character sets have been proposed to compete with Unicode, none has achieved any official standing.

The Unicode Consortium's website is at http://www.unicode.org, where many resources, such as the code charts, are available.

Developer Information

Python

Strings

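Python 2.x has two different string types: an 8-bit byte string type (str) and a Unicode string type (unicode), which is stored internally as 16-bit or 32-bit code units depending on how the interpreter was built. Unicode string literals are written with a leading u.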

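# -*- coding: utf-8 -*-
# The coding line above is required because this file contains non-ASCII literals.
question1 = u'\u00bfHabla espa\u00f1ol?'  # ¿Habla español?
question2 = u'Wo ist Österreich?'
print question2                                  # Wo ist Österreich?
# encode() returns a byte string (str); how it displays depends on the terminal's encoding.
print question2.encode('iso-8859-1', 'replace')  # the same text as ISO-8859-1 bytes
print question2.encode('utf-8', 'replace')       # the same text as UTF-8 bytes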
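What encode() and the 'replace' error handler do becomes clearer when the resulting bytes are inspected rather than printed. The following is a minimal sketch written to run unchanged on Python 2.6+ and Python 3.3+; the variable names are illustrative only.

```python
text = u'Wo ist \u00d6sterreich?'  # \u00d6 is the letter Ö

utf8_bytes = text.encode('utf-8')             # Ö becomes two bytes
latin1_bytes = text.encode('iso-8859-1')      # Ö becomes one byte
ascii_bytes = text.encode('ascii', 'replace') # Ö cannot be represented, so it becomes '?'

# Decoding with the matching codec restores the original Unicode string.
restored_from_utf8 = utf8_bytes.decode('utf-8')
restored_from_latin1 = latin1_bytes.decode('iso-8859-1')
```

The 'replace' handler only matters when the target encoding lacks a character, as with ASCII here. UTF-8 can represent every Unicode character, so it never needs a replacement.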

File Input

import codecs
# Open a UTF-8 file in read mode
infile = codecs.open("infile.txt", "r", "utf-8")
# Read its contents as one large Unicode string.
text = infile.read()
# Close the file.
infile.close()
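Writing works the same way as reading. The sketch below writes a Unicode string to a UTF-8 file and reads it back. It uses io.open, which accepts an encoding argument much like codecs.open does and is available on Python 2.6+ and Python 3; the function name roundtrip_utf8 and the file name outfile.txt are hypothetical.

```python
import io
import os
import tempfile

def roundtrip_utf8(text):
    """Write text to a temporary UTF-8 file, then read it back as Unicode."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "outfile.txt")
    try:
        # Unicode strings are encoded to UTF-8 bytes on write...
        with io.open(path, "w", encoding="utf-8") as outfile:
            outfile.write(text)
        # ...and decoded back to a Unicode string on read.
        with io.open(path, "r", encoding="utf-8") as infile:
            return infile.read()
    finally:
        # Clean up the temporary file and directory.
        if os.path.exists(path):
            os.remove(path)
        os.rmdir(tmpdir)

result = roundtrip_utf8(u'\u00bfHabla espa\u00f1ol?')
```

The with statement closes each file automatically, so no explicit close() call is needed.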

Unicode and Pysqlite

In pysqlite 1.x, you have two ways to trigger the use of a converter:

  • The magic "-- types" comment
  • Using the converter name as the column type in your table definition, e.g. create table test(mytext unicode)

# -*- coding: ISO-8859-1 -*-
# The coding line must match the encoding this file is saved in.
import sqlite

data = u"Österreich"

con = sqlite.connect(":memory:", client_encoding="utf-8")
cur = con.cursor()
# Magic comment: apply the unicode converter to the result column of the next query.
cur.execute("-- types unicode")
cur.execute("select %s", (data,))
print cur.fetchone()
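The second approach, naming the converter as the column type, can be sketched with the sqlite3 module from the Python standard library. This is an assumption made for illustration: sqlite3 descends from pysqlite 2.x, not 1.x, so its API differs from the example above (it uses ? placeholders and has no "-- types" comment). The names to_text and store_and_fetch are hypothetical.

```python
import sqlite3

def to_text(raw):
    """Converter: sqlite3 passes the stored value as bytes; decode it as UTF-8."""
    return raw.decode("utf-8")

# Register the converter under the type name used in the table definition.
sqlite3.register_converter("unicode", to_text)

def store_and_fetch(value):
    """Store a Unicode string in a column declared as 'unicode' and read it back."""
    # PARSE_DECLTYPES makes sqlite3 look up a converter by the declared column type.
    con = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
    try:
        cur = con.cursor()
        cur.execute("create table test(mytext unicode)")
        cur.execute("insert into test(mytext) values (?)", (value,))
        cur.execute("select mytext from test")
        return cur.fetchone()[0]
    finally:
        con.close()

fetched = store_and_fetch(u"\u00d6sterreich")
```

The sqlite3 module already returns text columns as Unicode strings by default, so the converter here only demonstrates the lookup by declared type.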

Further Reading