Localization

  This page is monitored by the OLPC team.

Internationalization technology is the technology for representing and composing the languages spoken, taught or used in your countries. Localization is the process of taking software or content and adapting it for local use.

Localization involves fonts, script layout, input methods, speech synthesis, musical instrumentation, collating order, number & date formats, dictionaries, and spelling checkers, among other issues.

Linux is already more widely localized than Microsoft Windows, since localizing it requires no cooperation from a vendor. That said, cooperation with the free software and content community is vital to reduce the overall work required.

The size of the problem is huge. Ethnologue has extensive information on the languages of the world.

See also Wikipedia's definition

This page outlines some of the core topics, tools, and issues of localization.

(If you need to localize the keyboard symbols for a laptop issued during the developer-phase of our program, please refer to the instructions found on the Customizing NAND images#Keyboard page of the wiki.)

Basic Localization Topics

Character Sets

Unicode is fully supported in “modern” applications and toolkits used in free software. Legacy character set support is also present, but modern applications use Unicode.

Collation order (the text sorting order) is generally well supported in the C library.
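As a rough illustration of what "supported in the C library" means in practice, the sketch below sorts a word list with the collation rules of a given locale through Python's standard locale module, which wraps the C library calls. The function name sort_words is hypothetical, and the result depends on which locales are installed on the machine; if the requested locale is missing, the sketch falls back to the "C" locale, which simply sorts by code point.

```python
import locale


def sort_words(words, locale_name=""):
    """Sort words using the C library's collation rules for locale_name.

    An empty locale_name means "use the user's environment". If the
    requested locale is not installed, fall back to the "C" locale.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            locale.setlocale(locale.LC_COLLATE, "C")
        # strxfrm transforms each string so that plain comparison
        # follows the active locale's collation order.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)
```

For example, sort_words(names, "es_ES.UTF-8") would use Spanish ordering where that locale is installed.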

See also: Category:Fonts, Unicode.

Script Layout

OLPC uses the Pango library, which is able to lay out most of the “hard” languages, including Arabic, the Indic languages, Hebrew, Persian, Thai, etc. It has a modular, pluggable layout engine and supports vertical text as well as bi-directional layout. Some issues remain, but Pango can already handle most scripts; where it cannot, modules can be built to handle new scripts, as documented in Pango's reference manual.
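Bi-directional layout depends on the directional class that Unicode assigns to each character. The sketch below does not use Pango at all; it only uses Python's standard unicodedata module to show the kind of per-character information a layout engine works from. The function names are hypothetical.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, bidi class) pairs for each character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


def has_right_to_left(text):
    """True if any character has a strong right-to-left class.

    "R" covers scripts such as Hebrew, "AL" covers Arabic letters.
    """
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)
```

A string that mixes Latin and Hebrew or Arabic characters makes has_right_to_left return True, which is a hint that the text needs a layout engine with bi-directional support.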

See also: Category:Languages (international)

Fonts

To share content and preserve cultural heritage OLPC's goal must be and is full coverage of all the world's languages. By using the Fontconfig system Linux has a better concept of language coverage of fonts than other systems. Fontconfig is used to configure the font system and determine what set of fonts are needed to cover a set of languages.
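To see which installed fonts Fontconfig considers suitable for a language, the fc-list command-line tool can be queried with a :lang= pattern. The following is a minimal sketch that wraps that call from Python; the function names are hypothetical, and it assumes the fontconfig command-line tools are installed.

```python
import subprocess


def fc_list_command(lang_code):
    """Build an fc-list command listing font families that cover lang_code."""
    return ["fc-list", ":lang=" + lang_code, "family"]


def fonts_for_language(lang_code):
    """Run fc-list and return the sorted, de-duplicated family names.

    Requires the fontconfig command-line tools to be installed.
    """
    result = subprocess.run(
        fc_list_command(lang_code),
        capture_output=True,
        text=True,
        check=True,
    )
    return sorted({line.strip() for line in result.stdout.splitlines() if line.strip()})
```

An empty list from fonts_for_language("sw") would suggest that no installed font claims Swahili coverage.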

The formats of fonts supported on Linux include OpenType, TrueType and many others: see Freetype for details. Most of the font formats supported by Freetype are obsolete, and by far the best results on the screen will be had from OpenType and TrueType format fonts, particularly if they are hinted well. Type 1 fonts are useful primarily for printing; the Type 1 renderer we have today in Freetype is not very good, and Type 1 does not support programmatic hinting for low resolution screens.

The OLPC XO-1 has a high resolution screen. High resolution helps OLPC considerably, particularly in grayscale mode at 200DPI. Wikipedia, as usual, is a starting point for free fonts. "Font foundries" are companies who will contract to produce fonts.

See also: Category:Fonts, Fonts, HIG-The Sugar Interface/Text and Fonts

Free Fonts

Free fonts are available for most scripts in the world, though some fonts are licensed incorrectly for completely free redistribution.

Need for Screen Fonts

Applications and content should be usable on other screens everywhere, not just on OLPC's high resolution screen. Therefore the OLPC community needs to work together on extending the coverage of high quality screen fonts. The "DejaVu" font family (derived from Bitstream Vera) covers most Latin alphabets and some other languages. This family has in general good "hinting" for screen use. The Red Hat "Liberation" family recently became available to help substitute for the Microsoft family of fonts, but does not yet have very wide coverage.

SIL International also builds fonts for a number of additional languages of local interest.

Helping with these or other efforts to build fonts or to increase coverage of existing fonts is greatly appreciated. Pooling efforts on hinting glyphs, which is boring but important work, and/or donations and buyouts are also being investigated.

Keyboards

OLPC Keyboard layouts document OLPC's currently available keyboard layouts: further layouts are a modest amount of work if there are existing designs for those languages. People with local expertise will need to work with OLPC staff to generate new layouts.

See also: Category:Keyboard, HIG-Input Systems-Keyboard

Input Methods

An input method is software that allows typing of scripts with many more characters than keyboard keys. Examples include languages such as Chinese, Japanese, and Korean.

Free software systems now are using SCIM - Smart Common Input Method Platform. SCIM is replacing older input method systems.

To design keyboards that are most useful in each country, we need to know which languages are native to an area as well as which are taught as “foreign” languages. For example, the Nigerian keyboard is designed to allow easy entry of English, Hausa, and Yoruba, which are common languages in much of Nigeria. The "US/International" layout covers most of the western European languages.

Some issues remain in our base technology. For example, Arabic ligatures could present problems; by not putting them on the keyboard we avoided the need for an input method. However, such workarounds may not be feasible for your language.

See also: Input methods, HIG-Input Systems

Accessibility and Usability

Speech Synthesis

Speech synthesis has a set of complex tradeoffs of synthesizer size versus fidelity versus effort to localize a new language. The Wikipedia speech synthesis article discusses software that is available, which includes festival, flite, and espeak.

Espeak is small enough that we can often bundle it, and it covers quite a few languages: ~10 languages are currently supported and tuned by native speakers. Localization to ten more languages is underway.

Synthesis is essential for making content accessible to people with vision problems, and will need to be integrated with the ATK library we use; it also has uses in literacy training and elsewhere in a GUI. Full localization therefore involves selection of a suitable synthesis system and integration into the ATK framework, along with localization of that system for the particular language involved.

Speech synthesis is usually not a good guide for pronunciation – but it may be better than a poor teacher who has never had the opportunity to learn from a native speaker of that language.

See also Category:Accessibility

Music and Sound Samples

We want much more than dead white male western instruments for dead white male composers!

Clean samples of your musical instruments and music needed!

Samples need appropriate licensing terms.

See also TamTam: Sounds

Dictionaries, Spelling Checkers, Thesaurus

There is existing support for most major languages.

Spelling, Hyphenation, Thesaurus dictionaries may be needed for different parts of Linux, which may or may not apply to OLPC directly; for example you can check:

Of these, the first three are most immediately interesting to OLPC, as we use versions of these codebases as part of the Sugar environment.

Character Recognition

Stroke/character recognizer localization is of some interest with the pen/tablet: in the future (Gen 2) when we have a touch screen they will become essential. xstroke is one such individual character/stroke recognizer, sufficient for alphabets of up to about 100 characters.

Considerations

Current Shortcomings

There are some real shortcomings where help is needed. These include:

  • Non-Gregorian calendars
  • Non-Latin digits (Roozbeh Pournader has patches, but these are not yet integrated and may need help).
  • and the sheer scale of the localization problem will eventually require changes in free software projects.
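To make the digits item concrete: the sketch below converts between ASCII digits and Arabic-Indic digits (U+0660 to U+0669) in plain Python. It is only an illustration of the problem, not the patches mentioned above, and the names ARABIC_INDIC, to_arabic_indic and parse_int are hypothetical.

```python
# Map ASCII digits 0-9 to Arabic-Indic digits U+0660..U+0669.
ARABIC_INDIC = {ord("0") + i: chr(0x0660 + i) for i in range(10)}


def to_arabic_indic(text):
    """Replace every ASCII digit in text with its Arabic-Indic form."""
    return text.translate(ARABIC_INDIC)


def parse_int(text):
    """Parse an integer written with any Unicode decimal digits.

    Python's int() already accepts non-Latin decimal digits.
    """
    return int(text)
```

Displaying numbers is the harder half: every place an application formats a number needs to go through a conversion like to_arabic_indic, which is why toolkit-level support matters.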

Localization Techniques

It only takes a small team to localize Linux for a language: for example, Welsh and Icelandic, which are relatively small languages, have been pretty fully localized by small teams.

You can do the work yourself, hire the work out, or find volunteers among universities (worldwide), the world wide internet and free software community. Add to existing projects whenever possible. By checking with some of the major free software projects (e.g. Gnome, OpenOffice, Mozilla, KDE), you can often locate people already at work in your language.

Work directly in the software and content projects whenever possible. This makes your work available worldwide, while lessening the ongoing work. If you keep your localization work local, others cannot benefit from your work and effort, and your software and content will be that much harder to localize.

Tools

Some example tools include pootle, kbabel and rosetta. Most software uses the GNU “gettext” libraries and standard .po files, including Sugar; Firefox and OpenOffice have their own systems for historical reasons. Wordforge is a good place to get plugged into tools and the community efforts.

The cldr project is worth watching, though OpenOffice is the first major project using this.

Remember, contribute your translations to the “upstream” projects to minimize long term effort: share your work with the world. Do not presume that if one Linux distribution has your effort that you are finished; some Linux distributions are not good about working with the community that builds and distributes the original software.

Licensing

Translated strings will often be useful among many projects, not just the project you are working on translating. Since the MIT/BSD (3 clause) licenses are usable by all projects, these are the safest licenses to use for translations to enable the widest sharing.

The SIL OFL license is recommended for fonts. An often overlooked issue with fonts is that they are incorporated into documents themselves (for example, into PDF documents) and that therefore licensing needs to be considered carefully.

See also Software licensing

Next Steps

Localization is by nature local, but languages often cross borders. Please contact Jim Gettys to identify issues.

We need identified people/organizations responsible for language, translation, keyboards, and speech synthesis, and effective free software community leaders to help with local deployment and "on the ground" knowledge.

Sugar Localization

Sugar and sugar applications use standard .po files, and can be localized using the usual tools.

General Linux Localization

By looking at the gnome, mozilla, OpenOffice, KDE projects, you can get plugged into translating other Linux software of general interest.

Localization within Python/Pygame

Following the wxpython tutorial below, I added the following code at the top of my application:

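import gettext
import os

# Assumption: get_bundle_path() comes from the Sugar activity API.
from sugar.activity.activity import get_bundle_path

# Install _() as a builtin for the 'kuku' domain.
gettext.install('kuku', './locale')

# Directory holding the compiled translations inside the activity bundle.
locale_dir = os.path.join(get_bundle_path(), 'locale')

# One line for each language.
presLan_en = gettext.translation('kuku', locale_dir, languages=['en'])
presLan_sw = gettext.translation('kuku', locale_dir, languages=['sw'])

# Only install one language - add program logic later.
presLan_en.install()
# presLan_sw.install()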

Here my application is called kuku.py, and I am using 'kuku' as the domain of my i18n. Next I chose which strings I needed to localize within my application file kuku.py and surrounded these strings with _(). For example

message = _('Begin!')

Next I need to create the i18n files. First I create a directory called 'locale' within my activity directory (this is the directory referred to in the presLan_en ... lines above). The first step is to make a .pot file, for which I use pygettext.py to process kuku.py:

python <path to your python distribution>/Tools/i18n/pygettext.py -o kuku.pot kuku.py

which creates kuku.pot. When first created it looks like

# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR ORGANIZATION
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2007-06-19 17:45+EDT\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=CHARSET\n"
"Content-Transfer-Encoding: ENCODING\n"
"Generated-By: pygettext.py 1.5\n"


#: kuku.py:501
msgid "Begin!"
msgstr ""

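The last little bit is the stuff we have to translate. I had to modify the header at the top to replace the CHARSET and ENCODING placeholders. I changed both of these to utf-8 (the charset is the value that matters when the catalog is read; the conventional value for Content-Transfer-Encoding is 8bit), so my file now reads: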

# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR ORGANIZATION
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2007-06-19 17:15+EDT\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: utf-8\n"
"Generated-By: pygettext.py 1.5\n"


#: kuku.py:500
msgid "Begin!"
msgstr ""

Now I moved kuku.pot to ./locale . Then for each language I want to localize to, I create a subdirectory within ./locale named after its language code. Within each of these subdirectories, I create a subdirectory called LC_MESSAGES. For now I am using English and Swahili, so my directory structure looks like

locale/
  kuku.pot
  en/
    LC_MESSAGES/
  sw/
    LC_MESSAGES/
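The same layout can be created from a script instead of by hand. This is a minimal sketch; the function name make_locale_tree is hypothetical, and it only creates the directories, not the .pot or .po files.

```python
import os


def make_locale_tree(base_dir, languages):
    """Create base_dir/locale/<lang>/LC_MESSAGES for each language code.

    Returns the list of LC_MESSAGES directories. Existing directories
    are left untouched, so the function can be run again when a new
    language is added.
    """
    paths = []
    for lang in languages:
        path = os.path.join(base_dir, "locale", lang, "LC_MESSAGES")
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths
```

For the example in this section the call would be make_locale_tree(".", ["en", "sw"]), run from the activity directory.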

Now we do translations. I copied kuku.pot into ./locale/en/LC_MESSAGES/kuku.po and ./locale/sw/LC_MESSAGES/kuku.po, and performed the translations:

#./locale/en/LC_MESSAGES/kuku.po
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR ORGANIZATION
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2007-06-19 17:15+EDT\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: utf-8\n"
"Generated-By: pygettext.py 1.5\n"


#: kuku.py:500
msgid "Begin!"
msgstr "Begin!"

#./locale/sw/LC_MESSAGES/kuku.po
# SOME DESCRIPTIVE TITLE.
# Copyright (C) YEAR ORGANIZATION
# FIRST AUTHOR <EMAIL@ADDRESS>, YEAR.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"POT-Creation-Date: 2007-06-19 17:15+EDT\n"
"PO-Revision-Date: YEAR-MO-DA HO:MI+ZONE\n"
"Last-Translator: FULL NAME <EMAIL@ADDRESS>\n"
"Language-Team: LANGUAGE <LL@li.org>\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"Content-Transfer-Encoding: utf-8\n"
"Generated-By: pygettext.py 1.5\n"


#: kuku.py:500
msgid "Begin!"
msgstr "Kuanza!"
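Before compiling, it can be handy to check that no msgstr was left empty. The following is a simplified sketch, not a full .po parser: it assumes each msgid and msgstr fits on one line, as in the files above, and it ignores plural forms and multi-line strings. The function name untranslated_msgids is hypothetical.

```python
def untranslated_msgids(po_text):
    """Return the msgids whose msgstr is empty.

    Simplified: handles only single-line msgid/msgstr pairs and skips
    the header entry (the one with an empty msgid).
    """
    missing = []
    msgid = None
    for line in po_text.splitlines():
        line = line.strip()
        if line.startswith("msgid "):
            msgid = line[len("msgid "):].strip('"')
        elif line.startswith("msgstr ") and msgid is not None:
            msgstr = line[len("msgstr "):].strip('"')
            if msgid and not msgstr:
                missing.append(msgid)
            msgid = None
    return missing
```

Running it on the text of a freshly copied kuku.po lists every string that still needs a translation; on a finished file it returns an empty list.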

Now my directory structure looks like

locale/
  kuku.pot
  en/
    LC_MESSAGES/
      kuku.po
  sw/
    LC_MESSAGES/
      kuku.po

One last step before we are ready to go. We need to make the binary files used by gettext. We do that with msgfmt.py:

cd <project path>/locale/en/LC_MESSAGES/
python <path to your python distribution>/Tools/i18n/msgfmt.py kuku.po
cd <project path>/locale/sw/LC_MESSAGES/
python <path to your python distribution>/Tools/i18n/msgfmt.py kuku.po
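With more than a couple of languages, repeating these two commands by hand gets tedious. The sketch below finds every <lang>/LC_MESSAGES/<domain>.po file under the locale directory and runs msgfmt.py on each one from that file's own directory, mirroring the manual steps. The function names are hypothetical, and msgfmt_script must be the path to the msgfmt.py in your own Python distribution.

```python
import glob
import os
import subprocess
import sys


def find_po_files(locale_dir, domain):
    """Return all <lang>/LC_MESSAGES/<domain>.po files under locale_dir."""
    pattern = os.path.join(locale_dir, "*", "LC_MESSAGES", domain + ".po")
    return sorted(glob.glob(pattern))


def compile_all(locale_dir, domain, msgfmt_script):
    """Run msgfmt.py on every .po file found for the given domain.

    msgfmt_script is the path to Tools/i18n/msgfmt.py; without an
    explicit output option it writes <domain>.mo next to the .po file.
    """
    for po_path in find_po_files(locale_dir, domain):
        subprocess.run(
            [sys.executable, msgfmt_script, os.path.basename(po_path)],
            cwd=os.path.dirname(po_path),
            check=True,
        )
```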

This creates binary .mo files, and now my directory structure looks like:

locale/
  kuku.pot
  en/
    LC_MESSAGES/
      kuku.po
      kuku.mo
  sw/
    LC_MESSAGES/
      kuku.po
      kuku.mo

To add new languages, we need to add a subdirectory for each language, perform the translations, create the .mo files, and add the relevant code in the application to select the language.
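The "relevant code in the application to select the language" could look roughly like the sketch below. It uses gettext.translation with fallback=True, so a language without a compiled .mo file leaves the strings untranslated instead of raising an error. The function name make_translator is hypothetical; how the language code is chosen (a setting, a menu, the environment) is left to the application.

```python
import gettext


def make_translator(domain, locale_dir, lang):
    """Return a translation function for the given language code.

    With fallback=True, gettext.translation returns a NullTranslations
    object when no matching .mo file exists, so the returned function
    simply hands back the original string.
    """
    translation = gettext.translation(
        domain, locale_dir, languages=[lang], fallback=True
    )
    return translation.gettext
```

In the application this would be used as _ = make_translator('kuku', locale_dir, 'sw'), after which message = _('Begin!') picks up the Swahili string if kuku.mo exists for that language.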

Resources

These are the two docs that I used to learn about i18n (with no prior knowledge). Read the WxPython reference first, and instead of using the mki18n.py file mentioned on the WxPython page, use the tools in the Python standard distribution: pygettext.py and msgfmt.py.

Python Reference

WxPython i18n

Current l10n projects

library exchange

activities

Add / include links to upstream localization where appropriate.

  • camera — en | es | ko | pt | zh-CN
  • web?
  • read?
  • write?
  • blockparty?

games

See also
Translators & Translating for the localization of this wiki.

i18n & l10n

The following table is focused on the list of languages present in the currently 'green status' countries (Argentina, Brazil, Ethiopia, India, Libya, Nepal, Nigeria, Pakistan, Peru, Romania, Russia, Rwanda, Thailand, United States, Uruguay). Countries with other 'status' may benefit from efforts for the 'green languages', plus add their own set of languages. Each language must be fully supported for the Localization effort.

Language Green Countries Red Countries Orange
Arabic Libya Bahrain, Egypt, Iraq (+), Israel (+), Jordan, Kuwait, Lebanon (+), Morocco, Oman, Palestine, Saudi Arabia, Sudan (+), Syria (+), Tunisia, Yemen
English Nigeria,
Rwanda,
USA (+)
Belize (+), Pakistan (+), Philippines (+) Canada (+), Gambia, Guyana, India (+), Kenya (+), Mauritius (+), Namibia (+), Saint Kitts and Nevis, Sierra Leone, Singapore (+), South Africa (+), St. Lucia, Trinidad and Tobago, Uganda (+), Zimbabwe (+)
French Rwanda Haiti (+) Benin, Cameroon (+), Democratic Republic of the Congo (+), Gabon, Mali, Niger, Senegal, St. Martin (+), Togo
Hausa Nigeria
Igbo Nigeria
Kinyarwanda Rwanda
Portuguese Brazil Angola Mozambique, Portugal, São Tomé and Príncipe
Spanish Argentina,
Peru (+),
Uruguay,
USA (+)
Belize, Costa Rica, Dominican Republic, El Salvador, Guatemala (+), Honduras, México (+), Nicaragua, Panamá Bolivia (+), Chile, Colombia, Cuba, Ecuador, Paraguay (+), Puerto Rico (+), Spain (+), Venezuela (+)
Thai Thailand
Yoruba Nigeria
Other non-green languages Ethiopia, Indonesia, Philippines (+), Pakistan (+), Vietnam Afghanistan, Albania, Armenia, Azerbaijan, Bangladesh, Bhutan (+), Bosnia and Herzegovina, Cambodia, China (+), Croatia, Cyprus, Eritrea, Estonia, Georgia, Greece, Hungary, Iceland, India (+), Iran, Italy, Japan, Kyrgyzstan, Latvia, Lithuania, Macedonia, Malaysia, Moldova, Mongolia, Romania, Russia, Slovenia, South Korea, Sri Lanka, Tajikistan, Tanzania, Turkey, Ukraine, Uzbekistan, Vatican City


The following table presents on a per country base the target languages that must be considered for the Localization effort of the countries with 'green status' (Argentina, Brazil, Ethiopia, India, Libya, Nepal, Nigeria, Pakistan, Peru, Romania, Russia, Rwanda, Thailand, United States, Uruguay).

Country Target Languages Major/important languages Minor/relevant languages
Argentina
EthnologueAR
[spa] Spanish [quh] Quechua (0.85M - 2.1%) See OLPC Argentina/Languages
Brazil
EthnologueBR
[por] Portuguese none reported by Ethnologue BR above 50,000 speakers.
Libya
EthnologueLY
[arb] Arabic, Standard [ayl] Arabic, Libyan Spoken (4.2M - 75%),
[jbn] Nafusi (0.14M - 2.5%)
[rmt] Domari (0.03M - 0.6%)
Nigeria
EthnologueNG
[eng] English,
[hau] Hausa (18.5M - 13.5%),
[yor] Yoruba (18.9M - 13.8%)
[bin] Edo (1.0M - 0.7%) official,
[efi] Efik (0.4M - 0.3%) official,
[fub] Fulfulde, Adamawa (7.6M - 5.6%) official,
[fuv] Fulfulde, Nigerian (1.7M - 1.2%),
[ibb] Ibibio (1.5M to 2.0M - 1.0-1.5%),
[idu] Idoma (0.6M - 0.4%) official,
[ibo] Igbo (18.0M - 13.1%) official,
[knc] Kanuri, Central (3.0M - 2.2%) official,
[tiv] Tiv (2.2M - 1.6%)
See OLPC Nigeria/Languages
Peru
EthnologuePE
[spa] Spanish pending See OLPC Peru/Languages
Rwanda
EthnologueRW
[kin] Kinyarwanda,
[fra] French,
[eng] English
[swh] Swahili (0.01M - 1.3%)
Thailand
EthnologueTH
Thai (dialects?) [nan] Chinese, Min Nan (1.1M - 1.7%),
[kxm] Khmer, Northern (1.1M - 1.8%),
[mfa] Malay, Pattani (3.1M - 4.8%),
[tha] Thai (20.2M - 32%),
[tts] Thai, Northeastern (15.0M - 23%),
[nod] Thai, Northern (6.0M - 9.2%),
[sou] Thai, Southern (5.0M - 7.7%)
[ksw] Karen, S'gaw (0.3M - 0.5%),
[kdt] Kuy (0.3M - 0.5%)
Uruguay
EthnologueUY
[spa] Spanish none other reported by Ethnologue UY
USA
EthnologueUS
[eng] English [spa] Spanish (22.4M - 7.5%),
[___] Polish (3.4M - 1.1%),
[deu] German, Standard (6.1M - 2.0%),
[___] Arabic (3.0M - 1.0%)
[___] Armenian (1.1M - 0.4%),
[___] Chinese (1.6M - 0.5%),
[___] Czech (1.5M - 0.5%),
[___] Eastern Yiddish (1.3M - 0.4%),
[___] French (1.1M - 0.4%),
[frc] French, Cajun (1.0M - 0.3%),
[hwc] Hawai'i Creole English (0.6M - 0.2%),
[___] Italian (0.9M - 0.3%),
[___] Japanese (0.8M - 0.3%),
[___] Korean (1.8M - 0.6%),
[___] Philippines (1.4M - 0.5%),
[___] Portuguese (1.3M - 0.4%),
[___] Swedish (0.6M - 0.2%),
[___] Ukrainian (0.8M - 0.3%),
[___] Vietnamese (0.9M - 0.3%),
[___] Vlax Romani (0.7M - 0.2%),
[___] Western Farsi (0.9M - 0.3%)

Country groups and descriptions

Korean-based Communities

Korea map.gif

People using Korean as their native language are those in South Korea (한국인) and North Korea (조선인). Some Chinese and people of other nationalities living in the northern part of Korea also use Korean as their second language, for historical reasons. They are called 고려인 (Korea-in) and 조선족 (Chosun-zok or Korean Chinese) respectively.

Currently OLPC Korea (or XO Korea) is covering all those nations and regions. In the near future, we hope there will be regional XO groups for those.