Bityi/GSoC

From OLPC
Revision as of 12:26, 19 April 2008 by Homunq (talk | contribs)

Last modified Homunq 11:58, 29 March 2008 (EDT). If you're a mentor, this link may work to take you to see the app inside Google's system.

The basic idea

Why can you only program in English-based programming languages? Why can't people program in something much closer to their own natural language, but have the resulting program be fully portable?

A use-case answer

Pepito wants to program his XO. He opens Develop and creates a new activity based on the new-activity template. The file is in English on disk, but he sees it and edits it in Spanish. When he copies and pastes in some English example code, it switches to Spanish automatically.

He adds "importa xml" to a file, and, since the xml module already has a translation, he can use the functions in that module by their Spanish names, which he can see in a module browser. Except that he had already defined a variable called "analiza" (Spanish for parse), so, in order to avoid a naming conflict, his variable is called "mymodule__analiza" on-screen (and "es__analiza" on disk, as it was before the import) and the xml function is called "xml__analiza" on-screen (and "parse" on disk, of course). By right-clicking on mymodule__analiza, he gets a context menu for setting translation, and he selects the option to harmonize the translation; now his variable is called "parse" on disk.

Now he wants to use some example code he found which uses the cgi module. He adds "importa cgi", but the cgi module has no translations yet. He copies and pastes in the example code, and it stays in English. Then he right-clicks on a word to add a translation. By scanning the imported modules, his computer can guess that the translation should be associated with the cgi module, so it puts that module at the top of the list of options where to add the translation. He chooses that module, and gets a dialog for adding his translation (hopefully, seeded with guesses from local and remote dictionaries). He chooses a reasonable translation, and, with another click, his choice is uploaded to a central server so that others can use it.

Now he wants to look up documentation for a function. Suspecting that his question is too specialized to have an answer in Spanish, he hovers the mouse over the Spanish function name, and gets a tooltip with the English name so that he can search it in Google. Or, if he wants, with a simple menu option he can switch his whole view to English, and get the Spanish only in the tooltips.

Later, he sends the module he created to his friend Janinha in Brazil. She imports it and adds Portuguese translations to the functions she uses, but leaves his internal variables untranslated because she doesn't care about them. Note that any functions or variables he uses that have the same name as anything he imports may already show up on Janinha's screen in Portuguese, so she can probably understand them as she browses his code.

A manifesto answer

The time has come that people should be able to program in their own language.

One of the few positive things left where the US unquestionably leads the world is software. Obviously some of that comes from brain drain, but I do not think it is a coincidence that this field has proven to give a greater overall advantage to a country of native English speakers than other engineering fields have. Grown-ups will learn English, but if you talk to any good (or even middling) software engineer, they will tell you some story of programming when they were a kid. For kids, there is no question that having resources in your own language can help, the only question is by how much. But even if it only means 10% more people learn to program, that is still millions of dollars - er, I mean, stable currency units - of income down the road.

OLPC is the right platform on which to implement this feature, for many reasons. But - again, IMO - this feature will have an importance beyond OLPC and will eventually be ported to other IDEs / computer languages. It will be a source of pride for OLPC, and a way to further the educational goals of OLPC beyond the XO laptop.

In other words, I understand that this feature may do less to build the platform than other features, maybe even less than other features within Develop. But in my opinion, there are very few features that would do more to advance the core mission of OLPC - as a tool which, through education (including a good dollop of self-education), will foster development in the third world.

Homunq 13:01, 26 March 2008 (EDT)

Why does OLPC need this?

If you intend to have a "view source" key that lets any user modify their applications (and more); if you are shipping to the non-English-speaking world; and if the target audience is ALL children, not just geeky ones; then you need this. Arguably, any two of these factors would not require localized coding; but the combination of all three does.

But...

  • But many non-native English speakers are programmers, and they will generally tell you that English was not a barrier to learning programming for them. After all, "raise string(variable)", while it is composed of English words, makes as much English sense as "lift twine(capricious)".
    • That's true, but these are a pretty self-selected group. If you intend to expose / teach all the children in a country to programming, forcing them to do it in a foreign language is going to be a significant hurdle. Also, even if the language hurdle is, in retrospect, a minor one, it comes at the very outset of learning to program. Experience in widely varying areas shows consistently that removing initial barriers can have a disproportionate effect on participation - even if those barriers appear small once you're past them.
  • But anyone who aspires to be a good programmer will eventually want to learn at least some English anyway.
    • Exactly. By letting more people get a taste of programming, you will let more people aspire to be good programmers. The end result will be more people learning more English, not the reverse; but along the way, this will also encourage viable communities based in non-English languages. (This is based on the following concept: the more programmers always just talk in English, the less I can google for a more-local community; and if I can't google an online community, it doesn't grow.) And even "lift twine(capricious)" is more approachable to English-speaking kids than "поднять строку (переменную)"; the same point holds true for any language.
  • But this has been attempted before, and it has failed.
    • Apple tried to do something similar with AppleScript in the 90s. There are several important differences this time. That was before Wikipedia - before the principle of "many hands make light work" was really operational on the web. AppleScript was never an important language for initially learning programming, as Logo, Basic, Pascal, and now Python all are/have been. The translations existed only for the language built-ins; it was impossible to have an actual program exist in two forms. It was also difficult to toggle visible languages or look up synonyms in a different language. Unicode had less penetration at the time. Etc.
I believe that if a tool makes creating and sharing translations in real time easy, the translations will happen.

References and prior work

Wikipedia is a good place to start.

A paper saying that localizing Python source code is hard. It does not consider the option of leaving it unlocalized on-disk.

Chinese python. A reprogrammed interpreter, code is Chinese on-disk. This would be a great source of initial translations for Chinese.

AppleScript retrospective paper. See page 20 for a discussion of "dialects".

Of course, there is also the infuriating example of MS Office, in particular the function names in Excel. The big problem here is that there is no way to figure out "what is the Spanish word for stdev?" so you are constantly reading through the list of functions one-by-one to find what you need. (I have Excel installed in Spanish and I hate it; it is a thousand times worse than having to search for the apostrophe key all the time.) The clear lesson to be learned is that for languages to be useful, there must be tools for easily moving from one language to another in real time. This design incorporates that lesson.

Idly-develop

OLPC hosts idly develop, where I was working on this starting from IDLE. I am embarrassed to say that I thought that tk was a direct ancestor of gtk and so it would be relatively easy to port later. Nevertheless, in 4.5 weeks of work-time (6 weeks real time - 3 weeks 90%, and 3 weeks 50%), I got to a point where translation was working, imports partly worked (without the public-private distinction described below, although that was planned from the start), the UI for switching languages mostly worked, syntax coloring worked, and tooltips worked. About 40% of this code (one clean chunk, the mapping class) is reusable as-is. Certainly, the experience I gained with the issues involved makes me confident that I can do it better and faster now. It is clear to me that I can finish all (or at least a usable most) of my targets for this project in the GSoC timeframe.

Definitions, design, deliverables, and dates

For all of the below, I will use "Spanish" (ISO code "es") to refer to an arbitrary non-English user language.

A few definitions:

  • The "editor preference language" would default to the UI language, and it would be easy to toggle back and forth to English. All words would show up in the editor preference language when a translation exists, or otherwise with a prefix to indicate their language of origin. So Janinha, browsing Pepito's source code, would see a mix of Portuguese (keywords, builtins, and common modules), en__englishwords (less-common modules), and es__palabrasespañoles (defined by Pepito). She could quickly toggle to an English view if she preferred it, though Pepito's creations would still show up as es__palabrasespañoles until she gave them a translation.
  • A "dictionary" is a one-to-one list of identifiers in English and Spanish. Each python file can have up to two dictionaries, one for the public interface (including that which comes from imports), and one for the locally-defined private internals (not including imports).
  • A "mapping" is an (almost) one-to-one mapping from on-disk (presumably English) words to on-screen (presumably Spanish) words. (It is permissible for several on-screen words to map to one English word as long as it is unlikely that users would type any but one of them). A mapping is simply the disambiguated union of one or more "dictionaries" - the public and private dictionaries of the current file, and the public dictionaries of any direct imports. Mappings operate exclusively at the level of a single identifier. (This means that the python two-word keyword "is not" would be discouraged in favor of "not (a is b)").
  • "Disambiguated" means that if two dictionaries disagree on the translation for an English word (dictone:is<->es;dicttwo:is<->esta) then the Spanish shows both (dictone__es__dicttwo__esta). If they disagree on the translation for a Spanish word (dictone:do<->hacer;dicttwo:make<->hacer), then the two English words are shown with disambiguating prefixes (do<->dictone__hacer, make<->dicttwo__hacer).
  • In version 1, the public and private dictionaries for a given module are created manually but with computer help (for instance, moving something from any dictionary to the public one is trivial). All dictionary entries also point to their "source version" (either themselves or some direct or indirect import) to enable propagation of dictionary changes. (Terminology: source entry/copy entry. This is different from coincidental duplicate entries, such as two modules with similarly-named functions. It only happens when an element of the public interface of one module becomes part of the public interface of an importer module - for instance, when the importer makes a subclass, and so many method names are by definition the same.)
    • A later improvement would be to use a pylint-like static analysis to automatically (re)build the public dictionary by adding public global variables (including classes) and the imported classes they use (superclasses, declared types on functions, and, where static analysis reveals it, instance types). Note that this analysis would also enable many intelligent features such as argument tooltips, intelligent [eclipse-like] auto-completion, etc.
  • Each module has a "preferred language". This affects the handling of words not in the mapping (not-yet-translated). If the word is entered while the editor is set in the "preferred language" it is left untranslated on disk. If it is entered while the editor is set in another language, it is escaped with a language prefix (like es___algunNombre) and a tinycode-based system is used to ensure typeability in spite of unicode.
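
The "mapping" and "disambiguated" definitions above can be sketched roughly as follows. This is only an illustration, assuming each dictionary is a plain {english: spanish} dict identified by a name; the function build_mapping and its data layout are hypothetical and are not the mapping class from IdlyDevelop.

```python
from collections import defaultdict


def build_mapping(dictionaries):
    """Merge named {english: spanish} dictionaries into one on-disk -> on-screen mapping.

    `dictionaries` maps a (hypothetical) dictionary name to its entries.
    """
    # Collect every (dictionary name, Spanish word) pair offered for each English word.
    offers = defaultdict(list)
    for name in sorted(dictionaries):
        for english, spanish in dictionaries[name].items():
            offers[english].append((name, spanish))

    # Record which English words claim each Spanish word.
    claimants = defaultdict(set)
    for english, pairs in offers.items():
        for _, spanish in pairs:
            claimants[spanish].add(english)

    mapping = {}
    for english, pairs in offers.items():
        distinct = {spanish for _, spanish in pairs}
        if len(distinct) > 1:
            # Dictionaries disagree on the Spanish for one English word: show all.
            mapping[english] = "__".join(f"{name}__{spanish}" for name, spanish in pairs)
        else:
            name, spanish = pairs[0]
            if len(claimants[spanish]) > 1:
                # One Spanish word stands for several English words: add a prefix.
                mapping[english] = f"{name}__{spanish}"
            else:
                mapping[english] = spanish
    return mapping


example = build_mapping({
    "dictone": {"is": "es", "do": "hacer", "print": "imprime"},
    "dicttwo": {"is": "esta", "make": "hacer", "print": "imprime"},
})
```

With the two example dictionaries, "is" maps to dictone__es__dicttwo__esta, "do" and "make" map to dictone__hacer and dicttwo__hacer, and "print" maps to plain imprime because both dictionaries agree. The sketch does not handle a word that falls into both conflict cases at once.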

Design

Some choices:

  • This would be a part of Develop, initially coded entirely within that app, but with a view to moving some modules into Sugar later if they get approval (not part of this project).
  • All files are in unicode (python already supports this as an option). Nevertheless, untranslated identifiers not in a file's preferred language (noted in a magic comment near the top) would be escaped by a tinycode-like scheme to pure ascii (unfortunately, tinycode includes hyphens which are not alphanumeric in python, so it would not be exactly tinycode.)
  • Adding a non-English translation would not affect files on-disk. However, adding an English translation would. Since this could break code that imports the changed module, the changed file would be saved as a copy, and there would be a separate tool to carefully clean up the mess.
  • I have a scheme of patterns to recognize and translate the most common type of occurrences of identifiers in docstrings and comments. Other than these, docstrings and comments would be translated (or, more commonly, not) as blocks of text, a la gettext. (I will only program this aspect if time allows)
  • The large majority of the work would be done in Python, but some changes to gtksourceview in C would be necessary to support syntax coloring of translated code (accept the language definitions dynamically regenerated from Python).
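
To make the escaping idea concrete, here is a minimal sketch of one possible reversible, hyphen-free escape to pure ASCII. It is not tinycode and not the scheme proposed above; the fixed-width hex format, the function names, and the assumption that language codes contain no underscore are all hypothetical choices made for illustration.

```python
import re

# An escape is "_" followed by exactly six lowercase hex digits (a code point).
_ESCAPE = re.compile(r"_([0-9a-f]{6})")


def escape_identifier(name, lang):
    """Return a pure-ASCII on-disk form of `name`, prefixed with its language code."""
    parts = []
    for ch in name:
        if ch.isascii() and ch.isalnum():
            parts.append(ch)
        else:
            # Underscores are escaped too, so every "_" in the body starts an escape.
            parts.append(f"_{ord(ch):06x}")
    return f"{lang}__{''.join(parts)}"


def unescape_identifier(escaped):
    """Invert escape_identifier, returning (lang, original name)."""
    # Assumes the language code itself contains no underscore.
    lang, _, body = escaped.partition("__")
    name = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), body)
    return lang, name


on_disk = escape_identifier("palabrasespañoles", "es")
round_trip = unescape_identifier(on_disk)
```

The escaped form contains only ASCII letters, digits, and underscores, so it remains a valid Python identifier, and ASCII-only names stay readable on disk.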

Modules:

  • Non-UI
    • MappingMgr module parses files for imports, finds dictionary files, asks for them to be parsed, and passes them to mapping module. Handles changes to dictionaries. Validates the consistency of "source entries" and "copy entries". (status: preliminary version exists in IdlyDevelop, would need rewriting, though it was useful as an exercise.)
    • Mapping module handles mappings when given parsed dictionaries. Dictionary objects signal the mapping when they are updated. Mapping module then reparses the file and the GtkSourceView language definition to maintain consistency. (status: exists in idlydevelop, could be used almost unchanged)
    • Dictionary module is an abstraction layer over the dictionary file format. (status: a useful beginning exists in idlyDevelop, I would have to subclass further but could use the code. Schedule, assuming that file format is clarified before coding: 1 day to code, 1 to code tests, and 1 to debug.)
    • When a user wants to create a new translation, NewTranslationGuesser guesses where it belongs. (status:planned. Schedule: 2 days to code and debug a working hack, improve later as needed.)

Total schedule for above: 13 days, call it 3 weeks.

    • LanguageRedefiner would parse and fix gtksourceview language definitions files. Some limitations on the regex magic that these files perform internally may be necessary, and I have no compunctions about doing this. I only need this to work for python initially, and to be extensible one language at a time.

This is a 3-4 day task.
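
As an illustration of the import scanning that MappingMgr and NewTranslationGuesser rely on, the following is a rough sketch using Python's standard ast module on the on-disk (English) source. The function names, the scoring rule, and the "<current file>" fallback are assumptions for the example, not the planned implementation.

```python
import ast


def imported_modules(source):
    """Map each name bound by an import statement to the module it comes from."""
    origins = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the name "a"; "import a.b as c" binds "c".
                bound = alias.asname or alias.name.split(".")[0]
                origins[bound] = alias.name
        elif isinstance(node, ast.ImportFrom) and node.module:
            for alias in node.names:
                origins[alias.asname or alias.name] = node.module
    return origins


def guess_translation_targets(source, word):
    """Rank the places where a translation of `word` might belong, best guess first."""
    origins = imported_modules(source)
    scores = {module: 0 for module in origins.values()}
    for node in ast.walk(ast.parse(source)):
        # Case "cgi.FieldStorage": `word` is an attribute of an imported name.
        if (isinstance(node, ast.Attribute) and node.attr == word
                and isinstance(node.value, ast.Name) and node.value.id in origins):
            scores[origins[node.value.id]] += 1
        # Case "from cgi import FieldStorage": `word` itself was imported.
        elif isinstance(node, ast.Name) and node.id == word and word in origins:
            scores[origins[word]] += 1
    ranked = sorted(scores, key=lambda module: (-scores[module], module))
    # The file being edited is always offered as a last option.
    return ranked + ["<current file>"]


sample = "import xml\nimport cgi\nform = cgi.FieldStorage()\n"
targets = guess_translation_targets(sample, "FieldStorage")
```

Here the cgi module is listed first because the word is used as an attribute of that import. The source is only parsed, never imported, so the modules it names do not need to be installed.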

  • UI
    • Add a toolbar for preferences/ toggling languages/ etc.
    • Add tooltips for seeing translations, but leave hooks for further data in tooltips.
    • Add contextual menus for adding translations, again, leave hooks. (this is probably 1 week's work, assuming I have gained gtk familiarity in the previous step)
    • For cut/paste: cut all text as English if both selection boundaries are outside of strings. Otherwise, cut on-screen text and give a nonmodal warning. Similar rule for pasting. Also give a warning when pasting text with an odd number of string boundaries. (2-3 days work)
    • Modifications to gtksourceview: Basically, allow it to accept dynamically-generated language lists. I envision about 6-10 closely-related functions (get/set for the source file, the dynamically modified file, and some simple state info) which would mostly be just data type housekeeping. Still, since I am unfamiliar with gtk programming I am budgeting this as the largest single task.
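
The cut/paste rule above depends on knowing whether a selection boundary falls inside a string. One way this might be checked is sketched below with Python's standard tokenize module; the function names and the (row, column) position convention are assumptions. The sketch expects source that tokenizes cleanly, and on Python 3.12 and later f-strings are reported with different token types, so they are not covered here.

```python
import io
import tokenize


def string_spans(source):
    """Return (start, end) positions for each string literal in `source`.

    Positions are (row, col) tuples with 1-based rows and 0-based columns.
    """
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [(tok.start, tok.end) for tok in tokens if tok.type == tokenize.STRING]


def inside_string(source, position):
    """True if `position` falls strictly inside a string literal."""
    return any(start < position < end for start, end in string_spans(source))


def cut_mode(source, selection_start, selection_end):
    """Decide how a selection is cut: as English, or as on-screen text with a warning."""
    if inside_string(source, selection_start) or inside_string(source, selection_end):
        return "on-screen"
    return "english"


text = 'x = "hola mundo"\ny = 1\n'
whole = cut_mode(text, (1, 0), (2, 5))
partial = cut_mode(text, (1, 6), (2, 5))
```

In this example, a selection covering both lines is cut as English, while a selection that starts in the middle of "hola mundo" is cut as on-screen text. A position exactly at the opening or just after the closing quote counts as outside the string.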

Schedule

Day-by-day schedule for illustrative purposes only.

Before April 14: Finish work on current Develop features, add further features but freeze a week or two before and just work on documentation so it's in good shape to start. Clarify proposed API and file formats. "Hello world" ready (for translation in final week).

April 14 - May 26: Since I can only commit about 30+ hours per week to GSoC, I will use this time to get a real head start with the project. Starting May 26th, I will give weekly progress reports and have a working check-in at least weekly.

Starting May 26:

Week 1: MappingMgr: 3 days to code basic structure, 1 to code tests, and 2 to debug

Week 2: Dictionary: (assuming file format defined) 1 day to code, 1 to code tests, and 1 to debug. NewTranslationGuesser: 2 days to code and debug a working hack, improve later as needed.

Week 3: Back to MappingMgr to handle new translations: 3 days. 2 days to polish the first 4 modules.

Week 4: toolbar for preferences: 2 days. Cut/paste and switching translations: 3 days.

Week 5: Gtk Tooltips (in IdlyDevelop/tk, implementing this feature took me under a day. Could take up to a week in gtk, speaking very conservatively, simply because I do not have experience with the tooltips there)

Week 6: Contextual menus for adding translations: 4 days coding, 1 day testing (testing extends into midterm week)

July 7: Translation works, including processing of imported files. Handles case with/without write permissions in the directory of imported files. UI to set editor language. UI to set file's preferred language. UI to add new translations, including scanning imported files to suggest where to add the new translations. UI to resolve translation conflicts. GtkSourceView, static analysis, and networking features NOT started (these are extra).

July 7 - July 14: Review and polishing for mid-term evaluation, no new work.

Week 1: work on Gtksourceview. Much slower because working in C, must budget at least 2 days to set up a build system that mirrors an OLPC, and deal with gtksourceview versions, and other network-heavy pain. First just add a simple "currentNaturalLanguage" member variable, to see how hard it is.

End of week one: evaluate whether progress is being made on gtksourceview. If not, abort, and start porting python-based source coloring from idly-develop/tk. (This is inferior because gtksourceview's coloring is better, more flexible, and more widely-used/ portable to existing IDEs.) Either way, I have 3 weeks left to get source coloring working and do any extra features there's time for.

August 11: Gtksourceview work done. API for other apps to display/use translated code done. Documentation decent. Additional features (docstrings/comments, static analysis, remote storage/queries of translations, class browser, etc.) done cleanly or not started, as time allows.

August 11-August 18: Primarily work on Spanish translation of a "hello world" activity, along with any cleanup necessary.