Bityi/GSoC
Note:
This page is currently under heavy work. I plan to have a presentable version by 23:59 UTC, March 25th, and probably 6-8 hours before then.
The basic idea
Why can you only program in English-based programming languages? Why can't people program in something much closer to their own natural language, but have the resulting program be fully portable?
A use-case answer
Pepito wants to program his XO. He opens Develop and creates a new activity based on the new-activity template. The file is in English on disk, but he sees it and edits it in Spanish. When he copies and pastes in some English example code, it switches to Spanish automatically.
He adds "importa xml" to a file, and, since the xml module already has a translation, he can use the functions in that module by their Spanish names, which he can see in a module browser. However, he had already defined a variable called "analiza" (Spanish for parse), so, in order to avoid a naming conflict, his variable is called "mymodule__analiza" on-screen (and "es__analiza" on disk, as it was before the import) and the xml function is called "xml__analiza" on-screen (and "parse" on disk, of course). By right-clicking on mymodule__analiza, he gets a context menu for setting the translation, and he selects the option to harmonize the translation; now his variable is called "parse" on disk.
Now he wants to use some example code he found which uses the cgi module. He adds "importa cgi", but the cgi module has no translations yet. He copies and pastes in the example code, and it stays in English. Then he right-clicks on a word to add a translation. By scanning the imported modules, his computer can guess that the translation should be associated with the cgi module, so it puts that module at the top of the list of options where to add the translation. He chooses that module, and gets a dialog for adding his translation (hopefully, seeded with guesses from local and remote dictionaries). He chooses a reasonable translation, and, with another click, his choice is uploaded to a central server so that others can use it.
Now he wants to look up documentation for a function. Suspecting that his question is too specialized to have an answer in Spanish, he hovers the mouse over the Spanish function name, and gets a tooltip with the English name so that he can search it in Google. Or, if he wants, with a simple menu option he can switch his whole view to English, and get the Spanish only in the tooltips.
Later, he sends the module he created to his friend Janinha in Brazil. She imports it and adds Portuguese translations to the functions she uses, but leaves his internal variables untranslated because she doesn't care about them.
Why does OLPC need this?
If you intend to have a "view source" key that lets any user modify their applications (and more); if you are shipping to the non-English-speaking world; and if the target audience is ALL children; then you need this. Arguably, any two of these factors would not require localized coding; but the combination of all three does.
But...
- But many non-native English speakers are programmers, and they will generally tell you that English was not a barrier to learning programming for them. After all, "raise string(variable)", while it is composed of English words, makes as much English sense as "lift twine(capricious)".
- That's true, but these are a pretty self-selected group. If you intend to expose / teach all the children in a country to programming, forcing them to do it in a foreign language is going to be a significant hurdle. Also, even if the language hurdle is, in retrospect, a minor one, it comes at the very outset of learning to program. Experience in widely varying areas shows consistently that removing initial barriers can have a disproportionate effect on participation.
- But anyone who aspires to be a good programmer will eventually want to learn at least some English anyway.
- Exactly. By letting more people get a taste of programming, you will let more people aspire to be good programmers. The end result will be more people learning more English, not the reverse; but along the way, this will also encourage viable communities based in non-English languages. (The reasoning: the more programmers talk only in English, the harder it is to find a more local community through Google; and an online community that cannot be found through a search does not grow.)
- But this has been attempted before, and it has failed.
- Apple tried to do something similar with AppleScript in the 90s. There are several important differences this time. That was before Wikipedia, before the principle of "many hands make light work" was really operational on the web. AppleScript was never an important language for initially learning programming, as Logo, Basic, Pascal, and now Python all are or have been. The translations existed only for the language built-ins; it was impossible to have an actual program exist in two forms. Unicode had less penetration. Etc.
- I believe that if a tool makes creating and sharing translations in real time easy, the translations will happen.
Definitions, design, deliverables, and dates
For all of the below, I will use "Spanish" (ISO code "es") to refer to an arbitrary non-English user language.
A few definitions:
- A "dictionary" is a one-to-one list of identifiers in English and Spanish. Each Python file can have up to two dictionaries, one for the public interface (including that which comes from imports), and one for the locally-defined private internals (not including imports).
- A "mapping" is an (almost) one-to-one mapping from on-disk (presumably English) words to on-screen (presumably Spanish) words. (It is permissible for several on-screen words to map to one English word as long as it is unlikely that users would type any but one of them.) A mapping is simply the disambiguated union of one or more "dictionaries": the public and private dictionaries of the current file, and the public dictionaries of any direct imports. Mappings operate exclusively at the level of a single identifier. (This means that the two-word Python operator "is not" would be discouraged in favor of "not (a is b)".)
- "Disambiguated" means that if two dictionaries disagree on the translation for an English word (dictone:is<->es;dicttwo:is<->esta) then the Spanish shows both (dictone__es__dicttwo__esta). If they disagree on the translation for a Spanish word (dictone:do<->hacer;dicttwo:make<->hacer), then the two English words are shown with disambiguating prefixes (do<->dictone__hacer, make<->dicttwo__hacer).
- In version 1, the public and private dictionaries for a given module are created manually but with computer help (for instance, moving something from any dictionary to the public one is trivial). All dictionary entries also point to their "source version" (either themselves or some direct or indirect import) to enable propagation of dictionary changes. (Terminology: source entry/copy entry. This is different from coincidental duplicate entries, such as two modules with similarly-named functions. It only happens when an element of the public interface of one module becomes part of the public interface of an importer module - for instance, when the importer makes a subclass, and so many method names are by definition the same.)
- A later improvement would be to use a pylint-like static analysis to automatically (re)build the public dictionary by adding public global variables (including classes) and the imported classes they use (superclasses, declared types on functions, and, where static analysis reveals it, instance types). Note that this analysis would also enable many intelligent features such as argument tooltips, intelligent [eclipse-like] auto-completion, etc.
- Each module has a "preferred language". This affects the handling of words not in the mapping (not-yet-translated). If the word is entered while the editor is set in the "preferred language" it is left untranslated on disk. If it is entered while the editor is set in another language, it is escaped with a language prefix (like es___algunNombre) and a tinycode-based system is used to ensure typeability in spite of unicode.
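The definitions above can be sketched roughly in Python. This is only an illustration under assumptions, not the proposed API: the function names `build_mapping` and `translate_source` and the dictionary layout (a plain dict of English-to-Spanish entries per dictionary name) are hypothetical. It builds a disambiguated mapping from several dictionaries and then rewrites only identifier tokens, leaving strings and comments alone.

```python
import io
import tokenize


def build_mapping(dictionaries):
    """Merge {dict_name: {english: spanish}} into one on-disk -> on-screen mapping."""
    # Collect every translation proposed for each English word.
    by_english = {}
    for dict_name, entries in dictionaries.items():
        for english, spanish in entries.items():
            by_english.setdefault(english, []).append((dict_name, spanish))

    mapping = {}
    for english, proposals in by_english.items():
        if len({spanish for _, spanish in proposals}) == 1:
            mapping[english] = proposals[0][1]
        else:
            # Dictionaries disagree on one English word: show every proposal.
            mapping[english] = "__".join(f"{d}__{s}" for d, s in proposals)

    # Several English words share one Spanish word: prefix each with its dictionary.
    by_spanish = {}
    for english, spanish in mapping.items():
        by_spanish.setdefault(spanish, []).append(english)
    for spanish, english_words in by_spanish.items():
        if len(english_words) > 1:
            for english in english_words:
                owner = by_english[english][0][0]
                mapping[english] = f"{owner}__{spanish}"
    return mapping


def translate_source(source, mapping):
    """Replace NAME tokens (identifiers and keywords) found in the mapping."""
    lines = source.splitlines(keepends=True)
    edits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string in mapping:
            edits.append((tok.start, tok.end, mapping[tok.string]))
    # Apply edits from the end so earlier column offsets stay valid.
    for (row, col), (_, end_col), new in reversed(edits):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + new + line[end_col:]
    return "".join(lines)


# Hypothetical dictionaries, reusing the examples from the definitions.
dictionaries = {
    "dictone": {"is": "es", "do": "hacer"},
    "dicttwo": {"is": "esta", "make": "hacer", "import": "importa"},
}
mapping = build_mapping(dictionaries)
on_disk = 'import xml\nx = make("do it")  # do\n'
on_screen = translate_source(on_disk, mapping)
```

Working at the token level is what keeps the word "do" inside the string and the comment untouched. The sketch does not handle two English words from the same dictionary sharing one Spanish word, which a real implementation would have to reject or resolve.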
Design
Some choices:
- All files are in Unicode (Python already supports this as an option). Nevertheless, untranslated identifiers not in a file's preferred language (noted in a magic comment near the top) would be escaped by a tinycode-like scheme to pure ASCII. (Unfortunately, tinycode includes hyphens, which are not valid in Python identifiers, so it would not be exactly tinycode.)
- I have a scheme of patterns to recognize and translate the most common type of occurrences of identifiers in docstrings and comments. Other than these, docstrings and comments would be translated (or, more commonly, not) as blocks of text, a la gettext.
- The large majority of the work would be done in Python, but some changes to GtkSourceView in C would be necessary to support syntax coloring of translated code (accepting language definitions dynamically regenerated from Python).
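To make the first design choice more concrete, here is a minimal sketch of escaping an untranslated identifier to pure ASCII with a language prefix. The exact escaping scheme is not specified in this proposal, so the `_uXXXXXX_` encoding below is purely a stand-in I made up for illustration, as are the function names; only the three-underscore prefix follows the "es___algunNombre" example.

```python
import re

PREFIX_SEP = "___"  # three underscores, as in the "es___algunNombre" example

# Hypothetical escape format: _u + six hex digits + _ for each non-ASCII character.
_ESCAPE = re.compile(r"_u([0-9a-f]{6})_")


def escape_identifier(name, lang):
    """Return an ASCII-only on-disk form of an untranslated identifier."""
    body = "".join(
        ch if ch.isascii() else f"_u{ord(ch):06x}_"
        for ch in name
    )
    return f"{lang}{PREFIX_SEP}{body}"


def unescape_identifier(escaped):
    """Split the language prefix off and restore the original characters."""
    lang, _, body = escaped.partition(PREFIX_SEP)
    name = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), body)
    return lang, name


escaped = escape_identifier("año", "es")
restored = unescape_identifier(escaped)
```

The result contains only letters, digits, and underscores, so it remains a valid Python identifier on disk. One known weakness of this stand-in: a name that already contains text shaped like an escape sequence would be decoded incorrectly, so a real scheme would need to escape that case as well.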
Modules:
- Non-UI
- MappingMgr module parses files for imports, finds dictionary files, asks for them to be parsed, and passes them to the mapping module. Handles changes to dictionaries. Validates the consistency of "source entries" and "copy entries".
- Mapping module handles mappings when given parsed dictionaries. Dictionary objects signal the mapping when they are updated. Mapping module then reparses the file and the GtkSourceView language definition to maintain consistency.
- Dictionary module is an abstraction layer over the dictionary file format.
- When a user wants to create a new translation, NewTranslationGuesser guesses where it belongs.
- UI
- Add a toolbar for preferences/ toggling languages/ etc.
- Add tooltips for seeing translations, but leave hooks for further data in tooltips.
- Add contextual menus for adding translations, again, leave hooks.
- Consider whether it is feasible to stash language metadata in the clipboard so that cut/paste can work its magic. If not, cut all text as English unless it contains an odd number of string boundaries, in which case ??.
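The import scanning that MappingMgr and NewTranslationGuesser depend on could look something like the following sketch. It uses the standard `ast` module to list imported modules without executing the file. The ranking function, its name, and the `public_names` argument (a dict from module name to the set of names it exports) are assumptions for illustration; how those names are actually gathered is left open.

```python
import ast


def imported_modules(source):
    """Return the module names imported by the given Python source."""
    modules = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            if name not in modules:
                modules.append(name)
    return modules


def rank_targets(word, modules, public_names, local_name="(this file)"):
    """Order the candidate dictionaries for a new translation of `word`.

    Modules that export `word` come first, then the current file,
    then the remaining imports.
    """
    owners = [m for m in modules if word in public_names.get(m, ())]
    others = [m for m in modules if m not in owners]
    return owners + [local_name] + others


source = "import xml\nimport cgi\nfrom os import path\n"
modules = imported_modules(source)

# Hypothetical export lists; a real tool would read these from the modules.
public_names = {"cgi": {"escape", "parse_header"}, "xml": {"dom"}}
candidates = rank_targets("escape", modules, public_names)
```

In the use case above, this is the step that would put the cgi module at the top of the list when Pepito right-clicks a word from his pasted example code.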
Schedule
Before April 14: Finish work on current Develop features, add further features but freeze a week or two before and just work on documentation so it's in good shape to start. Clarify proposed API and file formats.
April 14 - May 26: Since I can only commit about 30+ hours per week to GSoC, I will use this time to get a real head start with the project. Starting May 26th, I will give weekly progress reports and have a working check-in weekly at least.
July 7: Translation works, including processing of imported files. Handles case with/without write permissions in the directory of imported files. UI to set editor language. UI to set file's preferred language. UI to add new translations, including scanning imported files to suggest where to add the new translations. UI to resolve translation conflicts. Work on GtkSourceView started but not necessarily finished. Static analysis and networking features NOT started (these are extra).
July 7 - July 14: Review and polishing for mid-term evaluation, no new work.
August 11: GtkSourceView work done. API for other apps to display/use translated code done. Documentation decent. Additional features (docstrings/comments, static analysis, remote storage/queries of translations, class browser, etc.) done as time allows.
August 11 - August 18: Primarily work on Spanish translation of a "hello world" activity, along with any cleanup necessary.