Input methods: Difference between revisions
(→SCIM Smart Common Input Method platform: Add content) |
Chaosconst (talk | contribs) |
||
(56 intermediate revisions by 21 users not shown) | |||
Latest revision as of 05:35, 23 March 2012
In order to input text in any particular language and writing system, we need a Unicode font to display it in, a rendering engine that knows how to display it, and a keyboard layout or Input Method Editor (IME) that provides a way to get all of the needed characters. Most alphabetic and syllabic languages can be typed on fairly simple keyboards that produce one Unicode character per key combination, using the ordinary typing keys together with Meta (usually Alt) and Compose keys. Any of several keys, including Menu and Windows keys, can be set to act as Compose. Then on Latin keyboards Compose-a-' produces á, Compose-c-, produces ç, and so on. Any accented letter that is included in Unicode in precomposed form falls within this capability. This covers letters that occur in any widely-used pre-Unicode character set, such as Latin-1 (ISO-8859-1), which supports French, German, Spanish, Italian, Scandinavian languages, and some other languages that use only the accented letters in Latin-1.
Multiple diacritics can be entered sequentially on simple keyboards of this type, while more elaborate input methods can enter more than one Unicode character code into the input buffer for each key combination. Yoruba is an example of a language that poses this choice, because it has vowel letters with an acute accent above and a dot below that are not available precomposed in Unicode.
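The difference between a precomposed letter and one that needs a separate combining mark can be checked with Unicode normalization. The following is a minimal sketch using Python's standard unicodedata module; the helper name compose is an assumption made for illustration and is not part of any input method.

```python
import unicodedata


def compose(sequence: str) -> str:
    """Return the NFC (canonically composed) form of a character sequence."""
    return unicodedata.normalize("NFC", sequence)


# Latin-1 case: a + combining acute has a precomposed form (U+00E1).
a_acute = compose("a\u0301")

# Yoruba case: e + combining dot below + combining acute.
# Only e-with-dot-below (U+1EB9) is precomposed; the acute stays separate.
e_dot_acute = compose("e\u0323\u0301")

for label, text in (("a + acute", a_acute),
                    ("e + dot below + acute", e_dot_acute)):
    codes = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{label}: {len(text)} code point(s): {codes}")
```

A simple one-character-per-key layout can therefore cover the first case directly, while the second needs either two keystrokes or an input method that emits two code points at once.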
The most elaborate IMEs are for input of CJKV characters for Chinese, Japanese, Korean, and the historical Vietnamese Chữ Nôm writing. Each of these languages requires several thousand characters at a minimum, and there is a desire to have much more extensive CJKV sets available, including a number of Hong Kong characters and other recent additions, and the tens of thousands of historical characters important for scholarship.
Several hundred methods for entering CJKV characters have been invented over several decades. Among the most important (due to efficiency of use or ease of learning, or in a few cases both) are language-specific phonetic conversion systems for Chinese, Japanese, or Korean, and shape-based systems that are in principle independent of language, but in practice specific to particular countries up to now.
See also countries, languages, writing systems, fonts, locales, and keyboard layouts.
Tools
Tools for keyboard layouts, to come. The loadkeys utility loads keyboard layouts for the Linux console.
Tools for IMEs, to come.
Input Methods
Phonetic conversion
The concept of phonetic conversion is that any CJKV language typed in any alphabet or other sound-based writing system can be converted using a combination of dictionary lookup together with grammatical and semantic analysis. The first successful phonetic conversion word processor was the Xerox 8010 J-Star, an outgrowth of the Xerox Alto computer and Smalltalk programming language in 1981. Thanks go to Alan Kay for the Alto and Smalltalk ideas, and to Joseph Becker for the language handling software. Phonetic conversion to CJKV characters exists for the following combinations, in many variations.
- Romanization (Latin alphabet, for example Pinyin) or Zhuyin to either Traditional or Simplified Chinese hanzi 漢字
- Hangeul 한글 (Korean alphabet) to Korean hanja 漢字
- Romaji ローマ字 (Latin alphabet) or hiragana ひらがな syllabary to Japanese kanji 漢字
Phonetic conversion systems depend on a native alphabetic or syllabic representation, or on one or more Romanizations of the target language.
- Chinese: Pinyin 拼音, Gwoyeu Romatzyh 國語羅馬字, Wade-Giles, and Yale are a few of hundreds
- Japanese: Hepburn, Kunrei-shiki, Nippon-shiki, Yale
- Korean: McCune-Reischauer (MR), Revised Romanization of Korean (RR), Yale
(Yes, the Yale Department of Linguistics was busy on the issue for decades.)
Dasher - gesture text entry
Dasher is an information-efficient text-entry interface, driven by natural continuous pointing gestures. It is a competitive text-entry system wherever a full-size keyboard cannot be used, for example when operating a computer one-handed or by joystick, touchscreen, trackball, or mouse. This is particularly interesting when the OLPC is in e-book mode. It can be used for many languages (more than 60) and is extensible through XML files. You can try it in your browser (requires Java).
Shape-based
It has been evident to everyone who has studied Chinese characters, including ancient Chinese scholars, that one can analyze most Chinese characters into a variety of smaller parts, many of which are characters themselves, and further, into a number of specific brush strokes (or the corresponding elements for characters cast in bronze, carved into seals, scratched into oracle bones, and so on). Many attempts were made over thousands of years to analyze the character shapes for various reasons, among them dictionary making. The best known is the system of 214 radicals set forth in the 1615 CE Zihui (字彙 "Character Glossary"), edited by Mei Yingzuo (梅膺祚) during the Ming Dynasty, and made a de facto standard by the 1716 CE Kangxi Zidian (康熙字典 "Kangxi Dictionary"), compiled under the Kangxi Emperor of the Qing Dynasty.
In the computer age, numerous analyses of character shapes were made as the basis for methods for typing Chinese, including those described below, which remain in widespread use, and others. As with the radicals, shapes are generally somewhat variable. In shape-based IMEs, several shapes may be grouped together and assigned a single key. Different shape-based IMEs are usually used for Traditional and Simplified characters.
Methods by language
Traditional Chinese
Zhuyin conversion
Zhuyin 注音, or Bopomofo ㄅㄆㄇㄈ, is a Chinese alphabet used for teaching children as well as for typing Chinese input. It has multiple keyboard layouts. Zhuyin is one of the standard conversion methods for Chinese, appearing on almost all computers and some cell phones.
[Unicode: Bopomofo] code table (PDF)
Phonetic conversion method
Pinyin conversion
Phonetic conversion method, using the standard Romanization of Chinese, converted to Simplified Chinese characters.
Example: zhongguo results in 中国
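The dictionary-lookup part of phonetic conversion can be sketched as follows. This is a toy illustration only: the table entries and the function name convert are assumptions made for the example, not data or code from any real IME, and the grammatical and semantic analysis mentioned earlier is left out entirely.

```python
# Toy dictionary: Pinyin strings mapped to candidate words.
# Entries are illustrative assumptions, not data from a real IME.
CANDIDATES = {
    "zhongguo": ["中国"],
    "zhong": ["中", "种", "重"],
    "guo": ["国", "过", "果"],
}


def convert(pinyin: str) -> list[str]:
    """Return candidate conversions for a Pinyin string (empty if unknown)."""
    # Normalize case and drop spaces so "Zhong Guo" matches "zhongguo".
    key = pinyin.lower().replace(" ", "")
    return CANDIDATES.get(key, [])


print(convert("zhongguo"))
```

Single syllables typically have several candidates, which is why practical systems show a selection list and prefer whole-word matches such as zhongguo.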
Cangjie
` 1 2 3 4 5 6 7 8 9 0 [ ]
手 田 水 口 廿 卜 山 戈 人 心 [ ] 、
日 尸 木 火 土 竹 十 大 中 ; ‘
z 難 金 女 月 弓 一 , 。 /
Shape-based. The Cangjie keyboard layout has 24 simple Chinese characters on it, plus a key for "difficult" 難 characters. All of the common brush strokes and many common combinations are mapped to the 24 base characters. Characters are analyzed into these combinations, and then 1 to 5 of them are selected for typing, according to a moderately complex set of rules.
Example: 日月 results in 明.
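The example can be traced with a small sketch that maps QWERTY keys to the Cangjie shapes in the layout above and then looks the key sequence up in a code table. The names KEY_TO_SHAPE, CODE_TABLE, shapes_for, and lookup are hypothetical, and the code table holds only the single entry given in the text; a real table has many thousands of entries.

```python
from typing import Optional

# Key-to-shape assignment, read off the keyboard layout rows
# (top letter row, home row, bottom row starting at x).
KEY_TO_SHAPE = dict(zip(
    "qwertyuiopasdfghjklxcvbnm",
    "手田水口廿卜山戈人心日尸木火土竹十大中難金女月弓一",
))

# Tiny illustrative code table; only the example from the text is included.
CODE_TABLE = {"ab": "明"}


def shapes_for(keys: str) -> str:
    """Translate typed keys into the Cangjie shapes printed on them."""
    return "".join(KEY_TO_SHAPE[k] for k in keys.lower())


def lookup(keys: str) -> Optional[str]:
    """Return the character for a key sequence, or None if not in the table."""
    return CODE_TABLE.get(keys.lower())


print(shapes_for("ab"), "->", lookup("ab"))
```

Typing the keys a and b thus selects the shapes 日 and 月, and the table resolves that pair to 明.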
Four Corners
Shape-based. The corners of a character are encoded, and a code for the whole character created from them.
etc.
Simplified Chinese
Pinyin
Simplified Chinese characters are a reduced form of the original (Traditional) Chinese characters.
Wubi
etc.
Fcitx In Sugar
install
1. Switch to tty2 and log in with root privileges.
2. Run
yum install fcitx
auto start
1. Run the following command.
echo "fcitx" >> ~/.profile
screenshot
Japanese
Romaji (ASCII) conversion
Kana conversion
etc.
Korean
Romaja (ASCII) conversion
Generally not used.
Hangeul conversion
etc.
Descriptions of IME engines
SCIM Smart Common Input Method platform
Moved to separate SCIM page.
The SCIM platform supports the following IMEs on Linux and other forms of Unix.
- English/European
- Arabic
- Ethiopic/Amharic
- Cyrillic
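- Alphabets and Languages of India
  - Assamese
  - Gujarati
  - Hindi (Devanagari alphabet)
  - Kannada
  - Malayalam
  - Oriya
  - Punjabi (Gurmukhi alphabet)
  - Tamil
  - Telugu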
- Bengali
- Nepali
- Greek
- Iranian (Persian, Farsi)
- Tibetan
- Hebrew
- Cambodian
- Laotian
- Thai
- Inuktitut
- IPA
- Vietnamese
- Japanese
- Korean
- Simplified Chinese
- Traditional Chinese
Chinput
Help
Tests of IMEs are needed; expert users are invited to help.
External links
- Wikipedia article on IMEs with links to articles on IMEs for specific languages
- Wikipedia article on Chinese dictionaries
- Official Cangjie Home Page
- Cangjie Method book in English
- SCIM
- IIIMF
- How to create an IME in scim