Bityi (translating code editor)/design

I've been thinking about the design issues for coding i18n in Develop. The problem is that if you want to translate identifiers at all, you immediately have to work with multiple dictionaries, one for each module you're importing. Initially I just slogged in and tried to start coding something that would keep a whole import tree in its head and magically decide for you where any changes in the dictionary should end up. It took me about a day to realize that that way lay only nasty pointed teeth (I should have realized sooner).

So after thinking about what simplifying assumptions I can make, I have a basic design that I'm kinda proud of. I'll try to explain it below, and to keep the "just LOOK at how hairy the problems are if you don't do it my way" details to a minimum. Trust me, they're hairy. Nonetheless, I know y'all need no map from me: if you see another path, or a minor modification, which still avoids the hair and teeth, by all means suggest it. Also speak up if you see some hair on my path that I've missed, of course.

Summary:

1-As discussed already, any identifier or keyword with a translation is presented in the user's preferred language on screen, but in English on disk.

1a-Translation is based on the concept of alphanumeric words (each starts with an alpha character or _, and continues with alphanumerics). The only parsing prior to translation is to separate code, strings, and comments (the latter two generally untranslated, with certain exceptions based on simple ASCII markup). This makes the solution relatively easy to generalize to other computer languages. (It also means that the prefixes "_" and "__" are not necessarily preserved in translation.)

2-The identifier-translating dictionary for any given file - say, "somemodule.py" - stays in parallel files in the same directory. English and the file's preferred language for exported identifiers go in the master translation file "somemodule.t9n". Further languages go in "somemodule.languagecode.t9n" (indexed against the master translation file).

2a- The editor would have to understand import statements and have the ability to find the relevant .t9n's. "from" and "as" modifiers would be ignored, except when combined, because even a single imported item could carry with it all the identifiers.

3-Initially, this dictionary ONLY contains translations for the "public interface" of somemodule.py, that is, those identifiers which are used in importer modules. It also defines a single, unchanging "preferred language" for that file, which is the assumed language for all non-translated identifiers in that file.

3a. It would be possible to convert a file to English/multilanguage and add a private dictionary ("somemodule.private.t9n" and "somemodule.private.languagecode.t9n") for identifiers that are not imported with the file. This would involve adding an original-language tag to all identifiers in that file which had not yet been translated. The file could then be edited in any language, and new identifiers added during those sessions would be similarly tagged with a prefix indicating the language they were supposedly written in.

4-There would be good UI support for creating a new translation for a word. However, the assumed user model is that words will be translated INTO a user's preferred language; FROM the context of an importer module (you'd generally not add translations for a module from that module itself, since generally you wouldn't even have modules open whose preferred language is not your own); and therefore WITH an explicit user decision as to which module this translation belongs in (if they want to use their language for an identifier X which is in English, they must have had a reason to write it in English rather than their own language, so they presumably know what imported module it comes from). This user model is not strictly enforced, but it is encouraged, as I expect it will focus effort on the highest-quality, most-useful translations.

5-As a consequence of points 1 and 4, when you add a translation to a module whose preferred language is not English, that results in a change on disk to the Python code for that file. (This is unlike the case of adding a translation for a file whose preferred language is English, which only ever results in safe on-screen changes.) To enable the EDITOR to intelligently propagate these changes to other importers of the changed module, and the INTERPRETER to dumbly continue to work for these other importers before the editor gets to them, the changed file (and its dictionary) is given a new name (for instance "importedmodule.i18n.v1.py"). The old version is not deleted and keeps the old name.

6. Point 5 has notable disadvantages: it pollutes the filesystem and, worse, the import/pythonpath namespace with old versions, while the best version of a file would always have a name like "importedmodule.i18n.v37.py". So there would be one change to the Python core to facilitate cleanup. If someone deleted all the old copies and renamed the aforementioned best version to just importedmodule.py again, the default __import__ function would know how to find it when it couldn't find importedmodule.i18n.v37.py. This new feature would have no impact on any existing Python code and, to be honest, I think that its presence in the "changes in python3001" lists would be (minor but useful) propaganda for the new i18n features.

(Obviously, a good delinting tool would take care of all the issues created by point 5 at once.)

7. Docstrings and comments, as always, are a separate issue, but I think that they're also a soluble one.
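
To make points 1 and 1a a bit more concrete, here is a minimal sketch of word-level translation. Everything named in it is an assumption made for illustration: the TOKEN pattern, the translate function, and the tiny English/Spanish dictionary are invented here, not taken from the actual editor. It also only handles one-line string literals and "#" comments, so triple-quoted strings and the ASCII markup exceptions are left out.

```python
import re

# Hypothetical tokenizer: at each position, try a string literal, then a
# comment, then an alphanumeric word. Anything else (operators, spaces)
# is skipped by re.sub and so passes through untouched.
TOKEN = re.compile(r"""
    (?P<string> "(?:\\.|[^"\\\n])*" | '(?:\\.|[^'\\\n])*' )  # one-line string literal
  | (?P<comment> \#[^\n]* )                                  # comment up to end of line
  | (?P<word> [^\W\d]\w* )                                   # alpha or _ first, then alphanumerics
""", re.VERBOSE)


def translate(source, dictionary):
    """Replace every word that has an entry in `dictionary`.

    Strings and comments are matched only so that they can be skipped.
    """
    def swap(match):
        if match.lastgroup == "word":
            return dictionary.get(match.group(), match.group())
        return match.group()  # string or comment: leave as-is

    return TOKEN.sub(swap, source)


# Made-up dictionary; a real one would come from the .t9n files.
EN_TO_ES = {"def": "definir", "return": "devolver", "price": "precio"}
ES_TO_EN = {shown: stored for stored, shown in EN_TO_ES.items()}

on_disk = 'def price(x):  # return the price\n    return x + len("price")\n'
on_screen = translate(on_disk, EN_TO_ES)   # what the user would see
back_on_disk = translate(on_screen, ES_TO_EN)  # what would be saved
```

Because the function only knows about words, strings, and comments, the same approach should carry over to other programming languages by swapping the string and comment patterns. Note that "_price" is a different word from "price" here, which is one way the "_" prefixes can fail to be preserved.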

.................

Is all this clear? Do y'all understand why it's necessary? Do you have any other ideas, or see problems with the above that I missed? Do you think I've made any intolerable or unnecessary compromises? Or do you just think that it's absolutely brilliant?

Homunq 17:14, 10 August 2007 (EDT)

Later thoughts

(This has now been included above as point 3a, but I am leaving this further expansion of what I mean here).

After discussion on email and further thought, I have decided to initially implement a two-level design.

Level 1: files with an intrinsic preferred language and no translation of internals (no private translation dict). This would work essentially as outlined above. These files would be editable only in their preferred language, though they would be importable from any language.

Level 2: files with no intrinsic preferred language and an internal/private translation dict. These are editable in any language. Any identifiers added while editing in a non-English language are tagged on disk with the editor language in use when they were created, until they can be translated. When editing in a non-English language, untranslated English identifiers are marked as such instead of just being presented as-is.
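
As a rough illustration of the level 2 tagging, here is a sketch under stated assumptions. The tag format (a two-letter language code followed by "__", as in "es__cantidad"), the function names, and the shape of the private dictionary are all hypothetical; the design above only says that new identifiers get tagged with their language and that untranslated English identifiers get marked.

```python
import re

# Hypothetical on-disk tag: two-letter language code, "__", then the word.
TAG = re.compile(r"(?P<lang>[a-z]{2})__(?P<word>.+)")


def new_identifier_on_disk(word, editor_lang):
    """On-disk form of an identifier just typed in `editor_lang`."""
    if editor_lang == "en":
        return word  # English is stored untagged
    return f"{editor_lang}__{word}"  # tagged until someone translates it


def to_screen(disk_word, editor_lang, private_dict):
    """Return (text to show, needs_translation_mark) for one on-disk word.

    `private_dict` maps an English identifier to {language code: translation}.
    """
    tagged = TAG.fullmatch(disk_word)
    if tagged:
        # Written in some non-English language and not yet translated.
        return tagged.group("word"), tagged.group("lang") != editor_lang
    if editor_lang == "en":
        return disk_word, False
    shown = private_dict.get(disk_word, {}).get(editor_lang)
    if shown is not None:
        return shown, False
    return disk_word, True  # untranslated English: show it, but marked


# Made-up private dictionary for the example.
PRIVATE = {"price": {"es": "precio"}}
```

With this sketch, to_screen("price", "es", PRIVATE) shows "precio" unmarked, while to_screen("total", "es", PRIVATE) shows "total" with the mark set, since it has no Spanish translation yet.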

Conversion from level 1 to level 2 would entail changing the file on disk. This is conceived as being a step that someone would take not initially upon sharing the file, but only when the module's public interface is relatively well-translated.

(Another addition above: multiple translation files for multiple languages)
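
The file naming from points 2 and 3a can be sketched as a small helper. The function name t9n_paths and its parameters are made up for illustration; only the file name patterns ("somemodule.t9n", "somemodule.languagecode.t9n", and the ".private" variants) come from the design above.

```python
from pathlib import Path


def t9n_paths(module_path, language=None, private=False):
    """List the dictionary files that sit next to a module, master file first.

    language: a language code such as "es"; None means master file only.
    private:  if True, name the private dictionary described in point 3a.
    """
    module = Path(module_path)
    base = module.stem + (".private" if private else "")  # "somemodule" from "somemodule.py"
    paths = [module.with_name(base + ".t9n")]
    if language is not None:
        paths.append(module.with_name(f"{base}.{language}.t9n"))
    return paths


public = [p.as_posix() for p in t9n_paths("pkg/somemodule.py", "es")]
private = [p.as_posix() for p in t9n_paths("pkg/somemodule.py", "pt", private=True)]
```

Here `public` names "pkg/somemodule.t9n" and "pkg/somemodule.es.t9n", and `private` names the two ".private" files. The helper only builds names; it does not check that the files exist or handle the versioned names from point 5.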

Current status:

See Bityi (translating code editor)