Bityi/GSoC

From OLPC
Revision as of 16:52, 26 March 2008

The basic idea

Why can you only program in English-based programming languages? Why can't people program in something much closer to their own natural language, but have the resulting program be fully portable?

A use-case answer

Pepito wants to program his XO. He opens Develop and creates a new activity based on the new-activity template. The file is in English on disk, but he sees it and edits it in Spanish. When he copies and pastes in some English example code, it switches to Spanish automatically.
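The "English on disk, Spanish on screen" idea can be sketched with the standard `tokenize` module. This is only an illustration, not the proposed implementation: the `TO_SCREEN` entries and the function name `translate_source` are invented for the example. It rewrites identifiers and keywords and leaves strings and comments alone.

```python
import io
import tokenize

# Hypothetical on-disk (English) -> on-screen (Spanish) mapping; the
# entries are invented for illustration only.
TO_SCREEN = {"import": "importa", "print": "imprime", "for": "para", "in": "en"}
# The reverse direction is used when saving what the user typed.
TO_DISK = {screen: disk for disk, screen in TO_SCREEN.items()}


def translate_source(source, mapping):
    """Rewrite NAME tokens through `mapping`, preserving layout.

    Strings, comments, numbers, and operators are never NAME tokens,
    so they pass through unchanged.
    """
    lines = source.splitlines(keepends=True)
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    # Walk backwards so that earlier column offsets on a line stay valid
    # after a replacement changes the line length.
    for tok in reversed(tokens):
        if tok.type == tokenize.NAME and tok.string in mapping:
            row, col = tok.start  # row is 1-based, col is 0-based
            line = lines[row - 1]
            end = col + len(tok.string)
            lines[row - 1] = line[:col] + mapping[tok.string] + line[end:]
    return "".join(lines)


disk = 'import xml\nfor x in data:\n    print("for x in data")  # in\n'
screen = translate_source(disk, TO_SCREEN)
print(screen)
```

Because the tokenizer does not check grammar, the same function can translate the Spanish view back with `TO_DISK`, which is what makes copy and paste in either language plausible.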

He adds "importa xml" to a file, and, since the xml module already has a translation, he can use the functions in that module by their Spanish names, which he can see in a module browser. Except that he had already defined a variable called "analiza" (Spanish for parse), so, in order to avoid a naming conflict, his variable is called "mymodule__analiza" on-screen (and "es__analiza" on disk, as it was before the import) and the xml function is called "xml__analiza" on-screen (and "parse" on disk, of course). By right clicking on mymodule__analiza, he gets a context menu for setting translation, and he selects the option to harmonize the translation; now his variable is called "parse" on disk.

Now he wants to use some example code he found which uses the cgi module. He adds "importa cgi", but the cgi module has no translations yet. He copies and pastes in the example code, and it stays in English. Then he right-clicks on a word to add a translation. By scanning the imported modules, his computer can guess that the translation should be associated with the cgi module, so it puts that module at the top of the list of options where to add the translation. He chooses that module, and gets a dialog for adding his translation (hopefully, seeded with guesses from local and remote dictionaries). He chooses a reasonable translation, and, with another click, his choice is uploaded to a central server so that others can use it.

Now he wants to look up documentation for a function. Suspecting that his question is too specialized to have an answer in Spanish, he hovers the mouse over the Spanish function name, and gets a tooltip with the English name so that he can search it in Google. Or, if he wants, with a simple menu option he can switch his whole view to English, and get the Spanish only in the tooltips.

Later, he sends the module he created to his friend Janinha in Brazil. She imports it and adds Portuguese translations to the functions she uses, but leaves his internal variables untranslated because she doesn't care about them.

A manifesto answer

The time has come that people should be able to program in their own language.

One of the few positive things left where the US leads the world is software. Obviously some of that comes from brain drain, but I do not think it is a coincidence that this field has proven to give a greater overall advantage to native English speakers than other engineering fields. Grown-ups will learn English, but if you talk to a good (or even a middling) software engineer they will tell you some story of programming when they were a kid. For kids, there is no question that having resources in your own language can help; the only question is by how much. But even if it only means 10% more people learn to program, that is still millions of dollars - er, I mean, stable currency units - of income down the road.

In other words, I understand that this feature may do less to build the platform than other features, maybe even less than other features within Develop. But in my opinion, there are very few features that would do more to advance the core mission of OLPC - as a tool which, through education (including a good dollop of self-education), will foster development in the third world.

Why does OLPC need this?

If you intend to have a "view source" key that lets any user modify their applications (and more); if you are shipping to the non-English-speaking world; and if the target audience is ALL children; then you need this. Arguably, any two of these factors would not require localized coding; but the combination of all three does.

But...

  • But many non-native English speakers are programmers, and they will generally tell you that English was not a barrier to learning programming for them. After all, "raise string(variable)", while it is composed of English words, makes as much English sense as "lift twine(capricious)".
    • That's true, but these are a pretty self-selected group. If you intend to expose / teach all the children in a country to programming, forcing them to do it in a foreign language is going to be a significant hurdle. Also, even if the language hurdle is, in retrospect, a minor one, it comes at the very outset of learning to program. Experience in widely varying areas shows consistently that removing initial barriers can have a disproportionate effect on participation.
  • But anyone who aspires to be a good programmer will eventually want to learn at least some English anyway.
    • Exactly. By letting more people get a taste of programming, you will let more people aspire to be good programmers. The end result will be more people learning more English, not the reverse; but along the way, this will also be encouraging viable communities based in non-English languages. (based on the concept: the more programmers always just talk in English, the less I can google for a more-local community; and if I can't google an online community, it doesn't grow.)
  • But this has been attempted before, and it has failed.
    • Apple tried to do something similar with AppleScript in the 90s. There are several important differences this time. That was before Wikipedia - before the principle of "many hands make light work" was really operational on the web. AppleScript was never an important language for initially learning programming, as Logo, Basic, Pascal, and now Python all are or have been. The translations existed only for the language built-ins; it was impossible to have an actual program exist in two forms. Unicode had less penetration at the time. Etc.
I believe that if a tool makes creating and sharing translations in real time easy, the translations will happen.

Definitions, design, deliverables, and dates

For all of the below, I will use "Spanish" (ISO code "es") to refer to an arbitrary non-English user language.

A few definitions:

  • A "dictionary" is a one-to-one list of identifiers in English and Spanish. Each Python file can have up to two dictionaries, one for the public interface (including that which comes from imports), and one for the locally-defined private internals (not including imports).
  • A "mapping" is an (almost) one-to-one mapping from on-disk (presumably English) words to on-screen (presumably Spanish) words. (It is permissible for several on-screen words to map to one English word as long as it is unlikely that users would type any but one of them.) A mapping is simply the disambiguated union of one or more "dictionaries" - the public and private dictionaries of the current file, and the public dictionaries of any direct imports. Mappings operate exclusively at the level of a single identifier. (This means that the Python two-word keyword "is not" would be discouraged in favor of "not (a is b)".)
  • "Disambiguated" means that if two dictionaries disagree on the translation for an English word (dictone:is<->es;dicttwo:is<->esta) then the Spanish shows both (dictone__es__dicttwo__esta). If they disagree on the translation for a Spanish word (dictone:do<->hacer;dicttwo:make<->hacer), then the two English words are shown with disambiguating prefixes (do<->dictone__hacer, make<->dicttwo__hacer).
  • In version 1, the public and private dictionaries for a given module are created manually but with computer help (for instance, moving something from any dictionary to the public one is trivial). All dictionary entries also point to their "source version" (either themselves or some direct or indirect import) to enable propagation of dictionary changes. (Terminology: source entry/copy entry. This is different from coincidental duplicate entries, such as two modules with similarly-named functions. It only happens when an element of the public interface of one module becomes part of the public interface of an importer module - for instance, when the importer makes a subclass, and so many method names are by definition the same.)
    • A later improvement would be to use a pylint-like static analysis to automatically (re)build the public dictionary by adding public global variables (including classes) and the imported classes they use (superclasses, declared types on functions, and, where static analysis reveals it, instance types). Note that this analysis would also enable many intelligent features such as argument tooltips, intelligent (Eclipse-like) auto-completion, etc.
  • Each module has a "preferred language". This affects the handling of words not in the mapping (not-yet-translated). If the word is entered while the editor is set to the "preferred language", it is left untranslated on disk. If it is entered while the editor is set to another language, it is escaped with a language prefix (like es___algunNombre) and a tinycode-based system is used to ensure typeability in spite of Unicode.
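The "disambiguated union" described above can be sketched as follows. This is a minimal illustration under stated assumptions: dictionaries are plain `{english: spanish}` dicts keyed by a dictionary name, and `build_mapping` is a hypothetical helper, not part of any existing module. The case where both kinds of conflict hit the same word is deliberately not handled.

```python
from collections import defaultdict


def build_mapping(dictionaries):
    """Merge named one-to-one dictionaries into an on-disk -> on-screen mapping.

    `dictionaries` maps a dictionary name to {english: spanish}.
    """
    by_english = defaultdict(dict)  # english -> {dictionary name: spanish}
    by_spanish = defaultdict(set)   # spanish -> set of english words
    for name, entries in dictionaries.items():
        for english, spanish in entries.items():
            by_english[english][name] = spanish
            by_spanish[spanish].add(english)

    mapping = {}
    for english, proposals in by_english.items():
        distinct = set(proposals.values())
        if len(distinct) > 1:
            # One English word, several Spanish candidates: show all of
            # them, each tagged with the dictionary it came from.
            mapping[english] = "__".join(
                f"{name}__{spanish}" for name, spanish in sorted(proposals.items())
            )
            continue
        spanish = distinct.pop()
        if len(by_spanish[spanish]) > 1:
            # One Spanish word claimed by several English words: prefix
            # it with the (alphabetically first) dictionary that defines it.
            mapping[english] = f"{sorted(proposals)[0]}__{spanish}"
        else:
            mapping[english] = spanish
    return mapping


mapping = build_mapping({
    "dictone": {"is": "es", "do": "hacer", "print": "imprime"},
    "dicttwo": {"is": "esta", "make": "hacer", "print": "imprime"},
})
for english, spanish in sorted(mapping.items()):
    print(english, "<->", spanish)
```

Entries on which the dictionaries agree ("print" here) pass through without a prefix, so prefixes only appear where there is a real conflict for the user to resolve.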

Design

Some choices:

  • All files are in Unicode (Python already supports this as an option). Nevertheless, untranslated identifiers not in a file's preferred language (noted in a magic comment near the top) would be escaped by a tinycode-like scheme to pure ASCII (unfortunately, tinycode includes hyphens, which are not valid in Python identifiers, so it would not be exactly tinycode).
  • I have a scheme of patterns to recognize and translate the most common types of occurrences of identifiers in docstrings and comments. Other than these, docstrings and comments would be translated (or, more commonly, not) as blocks of text, a la gettext.
  • The large majority of the work would be done in Python, but some changes to GtkSourceView in C would be necessary to support syntax coloring of translated code (accepting language definitions dynamically regenerated from Python).
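To make the escaping rule concrete, here is a rough sketch. The hex-based `_x…_` encoding is a stand-in invented for this example, not the tinycode-like scheme itself, and the function names are hypothetical. It only shows the shape of the idea: a language prefix plus a reversible, hyphen-free ASCII form.

```python
import re

LANG_SEPARATOR = "___"  # taken from the es___algunNombre example


def escape_identifier(name, editor_lang, preferred_lang):
    """Return the on-disk form of a not-yet-translated identifier."""
    if editor_lang == preferred_lang:
        return name  # preferred-language words are stored as typed
    # Stand-in encoding (an assumption): each non-ASCII character becomes
    # _x<hex code point>_, which keeps the result a valid identifier.
    ascii_name = "".join(
        ch if ch.isascii() else f"_x{ord(ch):x}_" for ch in name
    )
    return f"{editor_lang}{LANG_SEPARATOR}{ascii_name}"


def unescape_identifier(disk_name):
    """Return (language, on-screen name); language is None if not escaped."""
    lang, sep, rest = disk_name.partition(LANG_SEPARATOR)
    if not sep or not lang.isalpha():
        return None, disk_name
    name = re.sub(r"_x([0-9a-f]+)_", lambda m: chr(int(m.group(1), 16)), rest)
    return lang, name


disk = escape_identifier("año_nuevo", "es", "en")
print(disk)
print(unescape_identifier(disk))
```

The sketch assumes that ordinary identifiers never contain `___` or a literal `_x…_` sequence. A real scheme would need to check the prefix against known language codes and escape such sequences.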

Modules:

  • Non-UI
    • MappingMgr module parses files for imports, finds dictionary files, asks for them to be parsed, and passes them to the Mapping module. Handles changes to dictionaries. Validates the consistency of "source entries" and "copy entries".
    • Mapping module handles mappings when given parsed dictionaries. Dictionary objects signal the mapping when they are updated. Mapping module then reparses the file and the GtkSourceView language definition to maintain consistency.
    • Dictionary module is an abstraction layer over the dictionary file format.
    • When a user wants to create a new translation, NewTranslationGuesser guesses where it belongs.
  • UI
    • Add a toolbar for preferences/ toggling languages/ etc.
    • Add tooltips for seeing translations, but leave hooks for further data in tooltips.
    • Add contextual menus for adding translations, again, leave hooks.
    • Consider whether it is feasible to stash language metadata in the clipboard so that cut/paste can work its magic. If not, cut all text as English unless it contains an odd number of string boundaries, in which case ??.
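Two of the non-UI pieces above, the import scan done by MappingMgr and the guess made by NewTranslationGuesser, could start out roughly like this. The function names, the `"<local>"` marker for the file's own private dictionary, and the ranking heuristic are assumptions made for the sketch. It uses only the standard `ast` module and never imports the scanned modules.

```python
import ast


def direct_imports(source):
    """Return the module names imported directly by `source`, without duplicates."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            if name not in found:
                found.append(name)
    return found


def guess_dictionary(source, word):
    """Rank the dictionaries where a new translation of `word` might belong.

    Heuristic (an assumption): modules on which `word` is used as an
    attribute, e.g. cgi.FieldStorage, come first; then the file's own
    private dictionary; then the remaining imports.
    """
    imports = direct_imports(source)
    used_on = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Attribute)
                and node.attr == word
                and isinstance(node.value, ast.Name)
                and node.value.id in imports
                and node.value.id not in used_on):
            used_on.append(node.value.id)
    others = [name for name in imports if name not in used_on]
    return used_on + ["<local>"] + others


src = "import os\nimport cgi\nform = cgi.FieldStorage()\nprint(os.getcwd())\n"
print(direct_imports(src))
print(guess_dictionary(src, "FieldStorage"))
```

Aliased imports (`import xml as x`) and dotted module names are not resolved here. The static analysis mentioned as a later improvement would be the natural place to handle them.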

Schedule

Before April 14: Finish work on current Develop features, add further features but freeze a week or two before and just work on documentation so it's in good shape to start. Clarify proposed API and file formats. "Hello world" ready (for translation in final week).

April 14 - May 26: Since I can only commit about 30+ hours per week to GSoC, I will use this time to get a real head start with the project. Starting May 26th, I will give weekly progress reports and have a working check-in weekly at least.

July 7: Translation works, including processing of imported files. Handles case with/without write permissions in the directory of imported files. UI to set editor language. UI to set file's preferred language. UI to add new translations, including scanning imported files to suggest where to add the new translations. UI to resolve translation conflicts. Work on GtkSourceView started but not necessarily finished. Static analysis and networking features NOT started (these are extra).

July 7 - July 14: Review and polishing for mid-term evaluation, no new work.

August 11: GtkSourceView work done. API for other apps to display/use translated code done. Documentation decent. Additional features (docstrings/comments, static analysis, remote storage/queries of translations, class browser, etc.) done cleanly or not started, as time allows.

August 11-August 18: Primarily work on Spanish translation of a "hello world" activity, along with any cleanup necessary.

Related work

Wikipedia is a good place to start.

A paper saying that localizing python source code is hard. Does not consider the option of leaving it unlocalized on-disk.

Chinese Python. A reprogrammed interpreter; code is Chinese on-disk. This would be a great source of initial translations for Chinese.

AppleScript retrospective paper. See page 20 for a discussion of "dialects".

Of course, there is also the infuriating example of MS Office, in particular the function names in Excel. The big problem here is that there is no way to figure out "what is the Spanish word for stdev?", so you are constantly reading through the list of functions one by one to find what you need. (I have Excel installed in Spanish and I hate it; it is a thousand times worse than having to search for the apostrophe key all the time.) The clear lesson to be learned is that for languages to be useful, there must be tools for easily moving from one language to another in real time. This design incorporates that lesson.