Unicode: Difference between revisions

m (Reverted edits by 79.174.64.146 (Talk) to last version by 211.239.124.90)
 
(8 intermediate revisions by 7 users not shown)
{{RightTOC}}
[http://www.unicode.org/ Unicode] is the Universal Character Set (UCS) standard. It is precisely equivalent to the ISO/IEC 10646-1 standard in the characters defined and their numeric code points. The Chinese GB18030 standard contains precisely the same set of characters, but encodes them differently. Unicode adds considerably more information than ISO 10646, including character properties and recommended algorithms for some essential text-handling processes, such as mixing left-to-right and right-to-left writing systems in documents. Unicode 4.0 defines more than 90,000 characters, including more than 70,000 Chinese/Japanese/Korean (CJK) characters (hanzi, kanji, hanja). Work on the CJK repertoire is carried out by the Ideographic Rapporteur Group, made up primarily of experts from the countries that use these writing systems for their principal languages.
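
As a rough illustration of the character properties mentioned above, Python's standard <code>unicodedata</code> module exposes several of them. The sketch below assumes Python 3 (the other examples on this page use Python 2 syntax), and the sample characters and the helper name <code>describe</code> are chosen only for illustration.

```python
import unicodedata

# Sample characters chosen for illustration only:
# Latin A, n with tilde, Hebrew alef, and a CJK ideograph.
samples = ["A", "\u00f1", "\u05d0", "\u4e2d"]


def describe(ch):
    """Return (code point, name, general category, bidirectional class)."""
    return (
        "U+%04X" % ord(ch),
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )


for ch in samples:
    print(describe(ch))
```

The bidirectional class in the last field ("L" for left-to-right, "R" for right-to-left, and so on) is one of the inputs to the algorithm for mixing left-to-right and right-to-left text.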


There are several controversies surrounding Unicode, including allegations of a conspiracy to destroy cultures and of technical incompetence. One persistent urban legend holds that Unicode is a 16-bit encoding that cannot handle more than 65,536 characters. In fact, Unicode defines a 32-bit encoding form (UTF-32) and variable-length encoding forms (UTF-8 and UTF-16), all of which cover the full code space of more than a million code points (17 planes of 65,536 code points each).

= Developer Infos =
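
The size of the code space is easy to check by hand. The following sketch, again assuming Python 3, computes the total from the 17 planes and compares how many bytes one character inside the first plane and one character beyond it occupy in several encodings. The helper name <code>encoded_lengths</code> and the two sample characters are illustrative choices.

```python
# Size of the Unicode code space: 17 planes of 65,536 code points each.
PLANES = 17
POINTS_PER_PLANE = 2 ** 16
total_code_points = PLANES * POINTS_PER_PLANE
highest_code_point = total_code_points - 1


def encoded_lengths(ch):
    """Byte length of a single character in four encodings."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be", "gb18030")
    return {enc: len(ch.encode(enc)) for enc in encodings}


bmp_char = "\u4e2d"        # a CJK ideograph inside the first plane
beyond_bmp = "\U00020000"  # first code point of plane 2

print(total_code_points, hex(highest_code_point))
print(encoded_lengths(bmp_char))
print(encoded_lengths(beyond_bmp))
```

A character outside the first plane does not fit in a single 16-bit unit, which is why UTF-16 spends four bytes (a surrogate pair) on it.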
== Python ==
=== Strings ===
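Python 2 has two different string types: an 8-bit, non-Unicode byte string type (str) and a Unicode string type (unicode), whose internal code units are 16 or 32 bits wide depending on how the interpreter was built.
Unicode string literals are written with a leading u.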
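 # -*- coding: utf-8 -*-
 # The declaration above is needed for the non-ASCII literal below and must match how the file is saved.
 question1 = u'\u00bfHabla espa\u00f1ol?'  # ¿Habla español?
 question2 = u'Wo ist Österreich?'
 print question2                                  # Unicode string; print encodes it for the terminal
 print question2.encode('iso-8859-1', 'replace')  # byte string in Latin-1
 print question2.encode('utf-8', 'replace')       # byte string in UTF-8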
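
For readers on Python 3, where every <code>str</code> is a Unicode string and <code>encode</code> returns a <code>bytes</code> object, the same idea might look like the sketch below. It is a hedged illustration and not part of the original example; the variable names are arbitrary.

```python
question = "Wo ist \u00d6sterreich?"

# The same text as byte strings in two different encodings.
latin1_bytes = question.encode("iso-8859-1")
utf8_bytes = question.encode("utf-8")

# ASCII has no capital O with diaeresis; with the 'replace' handler
# the encoder substitutes '?' instead of raising UnicodeEncodeError.
ascii_bytes = question.encode("ascii", "replace")

# Decoding with the matching codec recovers the original text.
round_trip = utf8_bytes.decode("utf-8")

print(len(question), len(latin1_bytes), len(utf8_bytes))
print(ascii_bytes)
print(round_trip == question)
```

The UTF-8 form is one byte longer than the Latin-1 form because the one non-ASCII letter takes two bytes in UTF-8.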


Unicode has the characters needed for more than 30 writing systems, including enough to write all of the official languages of every country in the world. Work continues on encoding the extra characters required for minority languages and for scholarship.

=== Files Input ===
 import codecs
 # Open a UTF-8 file in read mode
 infile = codecs.open("infile.txt", "r", "utf-8")
 # Read its contents as one large Unicode string.
 text = infile.read()
 # Close the file.
 infile.close()
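
A round trip makes the effect of the encoding argument visible. The sketch below assumes Python 3 and uses <code>io.open</code> with an explicit encoding in place of <code>codecs.open</code>; the scratch file is created in a temporary directory purely for the demonstration, and all names are illustrative.

```python
import io
import os
import tempfile

text_out = "\u00bfHabla espa\u00f1ol?"

# Hypothetical scratch file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "infile.txt")

# Write the text as UTF-8; the file is closed when the block ends.
with io.open(path, "w", encoding="utf-8") as outfile:
    outfile.write(text_out)

# Read it back as one Unicode string.
with io.open(path, "r", encoding="utf-8") as infile:
    text_in = infile.read()

# Read the same file as raw bytes to see the encoded form.
with io.open(path, "rb") as rawfile:
    raw = rawfile.read()

print(text_in == text_out)
print(len(text_in), len(raw))
```

The decoded string is shorter than the raw byte count because the two non-ASCII characters each occupy two bytes in UTF-8.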


Unicode also includes a large number of symbols, including many for typesetting mathematical texts.

Unicode is the basis of all new computer standards where character handling beyond the level of US-ASCII is required. This includes standards from ISO, the IETF (Internet Engineering Task Force), the W3C (World Wide Web Consortium), the IEEE (Institute of Electrical and Electronics Engineers), national standards bodies (ANSI in the US, DIN in Germany, etc.), and industry standards. Although several character sets have been proposed to compete with Unicode, none has achieved any official standing.

The Unicode website is at http://www.unicode.org, where many resources such as code charts are available.

== Unicode and Pysqlite ==
In pysqlite 1.x, you have two ways to trigger the use of a converter:

* The magic "-- types" comment
* Using the converter name as the type in your table definition, e.g. create table test(mytext unicode)

 #-*- coding: ISO-8859-1 -*-
 import sqlite
 data = u"Österreich"
 # client_encoding is the encoding used for Unicode strings exchanged with the database
 con = sqlite.connect(":memory:", client_encoding="utf-8")
 cur = con.cursor()
 # The magic comment names the converter for the result column of the next statement
 cur.execute("-- types unicode")
 cur.execute("select %s", (data,))
 print cur.fetchone()
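
For comparison, here is a minimal sketch using the <code>sqlite3</code> module from the Python 3 standard library instead of pysqlite 1.x. This is a different API from the one shown above: it uses <code>?</code> placeholders, and text columns come back as Unicode strings by default, so no converter is needed. The table and column names follow the <code>create table</code> example in the list above, with the column declared as plain <code>text</code>.

```python
import sqlite3

data = "\u00d6sterreich"

# In-memory database, as in the pysqlite example.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# Illustrative table; the column is declared as ordinary text.
cur.execute("create table test(mytext text)")
cur.execute("insert into test(mytext) values (?)", (data,))

cur.execute("select mytext from test")
row = cur.fetchone()
con.close()

print(row)
print(row[0] == data)
```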


== Further Reading ==
* [http://www.jorendorff.com/articles/unicode/python.html Unicode in Python]
* [http://joelonsoftware.com/articles/Unicode.html The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)]
* [http://code2000.net Unicode support for your browser] (XO Browse Activity does NOT support full unicode)
[[Category:Developers]]
[[Category:Language support]]

Latest revision as of 02:43, 18 December 2008

