Unicode

Unicode is the Universal Character Set (UCS) standard. It is precisely equivalent to the ISO/IEC 10646-1 standard in the characters defined and their numeric code points. The Chinese GB18030 standard contains precisely the same set of characters, but encodes them differently. Unicode adds considerably more information than ISO 10646, including character properties and recommended algorithms for some essential text-handling processes, such as mixing left-to-right and right-to-left writing systems in documents. Unicode 4.0 defines more than 90,000 characters, including more than 70,000 Chinese/Japanese/Korean (CJK) characters (hanzi, kanji, hanja). Work on the CJK repertoire is carried out by the Ideographic Rapporteur Group, made up primarily of experts from the countries that use these writing systems for their principal languages.
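The character properties mentioned above can be inspected from most programming environments. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `describe` is hypothetical and chosen only for illustration. It looks up the name, general category, and bidirectional class of a Latin, a Hebrew, and a CJK character. The bidirectional class is one of the properties used when mixing left-to-right and right-to-left text.

```python
import unicodedata


def describe(ch):
    """Return a few Unicode properties of a single character.

    `describe` is a hypothetical helper for illustration only; the
    values come from the Unicode database bundled with Python.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),         # e.g. "Lu" = uppercase letter
        "bidi_class": unicodedata.bidirectional(ch),  # e.g. "L" or "R"
    }


if __name__ == "__main__":
    # Latin capital A, Hebrew letter alef, and a common CJK ideograph.
    for ch in ["A", "\u05D0", "\u4E2D"]:
        print(describe(ch))
```

In this sketch the Latin and CJK characters report the left-to-right class `L`, while the Hebrew letter reports the right-to-left class `R`.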

There are several controversies surrounding Unicode, including allegations of technical incompetence and of a conspiracy to destroy cultures. One persistent urban legend holds that Unicode is a 16-bit encoding that cannot handle more than 65,536 characters. In fact, its encoding forms include a fixed-width 32-bit form (UTF-32) and variable-length forms (UTF-8 and UTF-16), all of which cover a code space of more than a million code points (17 planes of 65,536 code points each, 1,114,112 in total).
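The size of the code space can be checked with a little arithmetic. The following is a minimal sketch, assuming Python's built-in codecs; the constant names and the helper `encoded_lengths` are hypothetical and used only for illustration. It computes the number of code points in 17 planes and shows how a character outside the first plane is stored in three common encoding forms.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 code points in each plane

TOTAL_CODE_POINTS = PLANES * CODE_POINTS_PER_PLANE  # 1,114,112
MAX_CODE_POINT = TOTAL_CODE_POINTS - 1              # 0x10FFFF


def encoded_lengths(ch):
    """Return the number of bytes `ch` occupies in three encoding forms.

    Hypothetical helper for illustration; the big-endian codecs are used
    so that no byte order mark is included in the count.
    """
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}


if __name__ == "__main__":
    print(TOTAL_CODE_POINTS, hex(MAX_CODE_POINT))

    # U+20000 lies in plane 2, beyond the reach of a single 16-bit unit.
    beyond_16_bits = "\U00020000"
    print(encoded_lengths(beyond_16_bits))

    # In UTF-16 it is stored as a pair of 16-bit units (a surrogate pair).
    print(beyond_16_bits.encode("utf-16-be").hex())
```

A character such as U+20000 needs four bytes in each of these forms, which is why a pure 16-bit view of Unicode is inaccurate.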

Unicode has the characters needed for more than 30 writing systems, including enough to write all of the official languages of every country in the world. Work continues on encoding the extra characters required for minority languages and for scholarship.

Unicode is the basis of all new computer standards where character handling beyond the level of US-ASCII is required. This includes standards from ISO, the IETF (Internet Engineering Task Force), the W3C (World Wide Web Consortium), the IEEE (Institute of Electrical and Electronics Engineers), national standards bodies (ANSI in the US, DIN in Germany, etc.), and industry standards. Although several character sets have been proposed to compete with Unicode, none has achieved any official standing.
