Japanese

Written Japanese includes three character sets. The first is called "Kanji": Chinese characters imported from China around the 5th century. Quite a few Kanji, and new words written with Kanji, were invented in Japan over time and exported back to China. The second is called "Hiragana", originally created as simplified forms of Kanji. While each Kanji character is closer to a "word" in the Indo-European sense, Hiragana characters are phonetic. The third is "Katakana", which is also phonetic. In standard writing, Katakana is used for foreign words and phrases, and often also for emphasis (like italics in English). Hiragana and Katakana are collectively called "Kana". Words or names in the Latin alphabet also appear in sentences, especially in technical documents.


Several thousand Kanji, about 50 Hiragana (or about 70 if you count the ones with diacritic marks), and about the same number of Katakana are used in typical books and newspapers.
There ''is'' a Japanese keyboard layout, although most (perhaps all) modern systems simply use phonetic input on the standard 101-key layout, after which a list of candidate character strings is displayed based on the phonetic input. A Japanese IME is thus harder to reproduce than input methods for most other languages (I don't know of any other language with a similar input method), and I have yet to see a Linux equivalent of the Japanese IME on Windows.

Note that you can write complete text solely in Hiragana or Katakana. Children's books are usually written in Hiragana, and children gradually learn Kanji. Kanji provides a great optimization: the meaning that the characters carry, and the visible transitions between the different character types, serve as visual cues for word boundaries.
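As a minimal sketch of that word-boundary cue (illustrative only, not from the original article), the following Python snippet approximates word edges simply by splitting wherever the script class changes between Kanji and Kana:

 # A rough script classifier by Unicode block (illustrative only).
 def script(ch):
     o = ord(ch)
     if 0x3040 <= o <= 0x309F: return "hiragana"
     if 0x30A0 <= o <= 0x30FF: return "katakana"
     if 0x4E00 <= o <= 0x9FFF: return "kanji"
     return "other"
 
 def rough_segments(text):
     """Split text wherever the script class changes."""
     segments, current = [], ""
     for ch in text:
         if current and script(ch) != script(current[-1]):
             segments.append(current)
             current = ""
         current += ch
     if current:
         segments.append(current)
     return segments
 
 print(rough_segments("日本語を勉強する"))
 # ['日本語', 'を', '勉強', 'する'] -- the script changes hint at word edges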

Japanese pioneered multi-byte character representation. In the early encodings, several thousand characters, including Kanji, Hiragana, Katakana, and other graphical symbols, were encoded in two 8-bit bytes (often in a 94x94 = 8,836 code space). The first 7 bits were (mostly) compatible with ASCII, so users could mix ASCII and the extended characters in a single plain text. The first such scheme was JIS C 6226-1978, and its characteristics, such as the 94x94 format and the code point allocation of symbols, were later adopted by the early domestic encodings in China and Korea. Revisions and variations of this encoding are still in use today (perhaps more than Unicode-based representations).
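As a minimal sketch (using Python's standard codecs; the sample string is our own), here is how the same text mixing ASCII and Japanese looks in three JIS-derived encodings:

 # Encoding a string that mixes ASCII and Japanese in JIS-derived
 # encodings shipped with Python's standard library.
 text = "OLPC は日本語"  # ASCII letters, a space, then Kana and Kanji
 
 for name in ("iso2022_jp", "euc_jp", "shift_jis"):
     print(name, text.encode(name))
 
 # The ASCII bytes pass through unchanged in all three; each Kana/Kanji
 # character becomes a two-byte sequence, and ISO-2022-JP additionally
 # brackets the Japanese run with escape sequences to switch charsets.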

While the Unicode's promise that the round trip conversion between the existing domestic encoding and Unicode was false, Unicode is faily well used to represent Japanese language today. Even with some reluctance, many system supports Unicode and use Unicode as their internal representations. It will be more so in the future.
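One well-known example of the round-trip problem (our illustration, not from the original text) is the "wave dash": the same Shift_JIS byte pair decodes to two different Unicode code points depending on the mapping table used.

 # The same two bytes decode differently under the JIS mapping and
 # Microsoft's cp932 mapping -- a classic round-trip hazard.
 raw = b"\x81\x60"                         # the wave dash in Shift_JIS
 print(hex(ord(raw.decode("shift_jis"))))  # 0x301c WAVE DASH
 print(hex(ord(raw.decode("cp932"))))      # 0xff5e FULLWIDTH TILDE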

It is notable that, given Japan's early adoption of computing, the freely available resources, and the developers' willingness to make their work public, much of the multilingualization and internationalization work in major open-source software projects was done by, or with large contributions from, Japanese developers. Examples include Emacs, the free Unices, and the X Window System.

There are a few ways of entering text with a keyboard. Usually, the characters are first input phonetically, either in alphabetic spelling (Romaji) or directly in Kana, and then converted into a meaningful mix of Kanji and Kana using a dictionary and the user's choice among candidates. The software module that converts Kana into the Kanji-Kana mix is called an Input Method or Input Method Editor (IME).
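As a minimal sketch of the two-step conversion just described (the tables here are tiny, hypothetical stand-ins for a real dictionary):

 # Step 1: Romaji -> Kana, by greedy longest-match over a syllable table.
 ROMAJI_TO_KANA = {"ni": "に", "ho": "ほ", "n": "ん", "go": "ご"}
 
 # Step 2: a toy dictionary mapping a Kana reading to candidate spellings;
 # a real Input Method would rank thousands of entries.
 KANA_TO_CANDIDATES = {"にほんご": ["日本語", "にほんご", "ニホンゴ"]}
 
 def romaji_to_kana(romaji):
     kana, i = "", 0
     while i < len(romaji):
         for length in (2, 1):          # try longer syllables first
             chunk = romaji[i:i + length]
             if chunk in ROMAJI_TO_KANA:
                 kana += ROMAJI_TO_KANA[chunk]
                 i += length
                 break
         else:
             raise ValueError("no syllable matches " + repr(romaji[i:]))
     return kana
 
 reading = romaji_to_kana("nihongo")    # -> "にほんご"
 print(KANA_TO_CANDIDATES[reading])     # the user picks, e.g., 日本語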

There are many free implementations of such Input Methods. The first was perhaps "Wnn" in 1985. It was later extended to handle Chinese and Korean, and has been used successfully in terminal emulators and text editors such as Emacs on Unix and the X Window System. Similar systems such as Canna and SKK are available as well. These systems communicated with the X server first through their own invented protocols, and later through the standardized XIM or XInput protocols. While inputting Japanese text requires additional software modules, the environment has long been well supported.
