Bityi (translating code editor)/design: Difference between revisions

I'm going to use this space for my own notes on how to do this. I am currently operating under the following assumptions, though any one of these may change:

*This is desirable.
*All problems are solvable (though things like ReST comments will wait until last).
*It should be done in OLPC / Sugar first, then ported. This means doing it in Python. It also means that good design is important: the code should avoid touching Sugar unless it has to.
*For shorthand, I'm assuming Spanish and Python in my examples.


OK. So. Looking for projects that already have Python-based lexers for many languages, I came across Pygments. Most of its lexers are one-pass, which is crucial for a real-time system. I think that this can be used.

One funny case I can think of: say you add a quote that suddenly makes the rest of the file into a string literal. Two behaviours are possible: preserving the localized string (the user sees nothing but a color change, but the entire source code from there on is invisibly rewritten in Spanish) or preserving the underlying Python (the user sees that suddenly everything turns string-colored and all the keywords change to English). The latter behaviour is far easier to accomplish and seems to me "better" (among other things, it makes an easily "discoverable" quick-and-dirty way to see the English source), but there would be problems if someone "forgot to put the quotes in" until too late (random bits of their UI strings could get quasi-translated into English).

Similar issues arise for anything that, by changing context, changes the meaning of the code that follows. The significant case is commenting out an existing section or line fragment but, depending on the language involved, there could be other examples. Obviously, the answer for comments is different: you want the text to remain unchanged onscreen, though you may want to continue to run the translation as if it were real code in some cases.

If you turned off translation and just used the Spanish text for comments, that would give a workaround for the "I don't want my string to change" problem. Another workaround would be cut/paste.

Another issue this raises: for a lexer with state and without a uniform definition of token boundaries, reverse translation is hard in the general case. In practice, this problem is much easier: you have all the lexer state info from the actual Python, and most languages (Brainf*ck aside) have a single definition of tokens, so you just backtranslate one token at a time, using state if necessary.

But you need a generalized backtranslation algorithm for typing, pasting, and possibly for uncommenting. I think you could just assume that the tokens are decent and that all hard keywords show up literally in the regexes, and do a search-and-replace on the lexers. Debug the ones that barf offline; people won't be making lexers on the fly.


== harder than I thought ==

Simple case: the lexer grabs one token at a time. The definition of a token is stateless; the meaning of a token may be stateful.
Solution: a backtranslation dictionary for each state. For each token, backtranslate, then lex; that gives you the state for the next token.
Advantages: no need to touch the lexer.
Disadvantages: dictionaries need to be defined for states which were intended as lexer internals.

Relax that: actually, the definition of a token needn't be stateless.
Disadvantages: you must hand-define tokens for each state, losing much of the advantage of having predefined lexers. That, or make a program that is smarter about regexes than I care to think.
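A minimal sketch of the "one backtranslation dictionary per state" idea, under loud assumptions: the state names, the dictionary contents, and the toy `next_state` function are all hypothetical stand-ins for whatever a real lexer would provide.

```python
# Hypothetical per-state dictionaries (Spanish -> English). The state names
# "code" and "string" are assumptions, not taken from any real lexer.
BACKTRANSLATE = {
    "code": {"si": "if", "sino": "else", "imprimir": "print"},
    "string": {},  # nothing inside a string literal is backtranslated
}


def next_state(state, token):
    """Toy stand-in for the lexer: a double quote toggles code/string."""
    if token == '"':
        return "string" if state == "code" else "code"
    return state


def backtranslate(tokens, state="code"):
    """Backtranslate each token with the current state's dictionary,
    then lex the English token to get the state for the next one."""
    english_tokens = []
    for token in tokens:
        english = BACKTRANSLATE[state].get(token, token)
        english_tokens.append(english)
        state = next_state(state, english)
    return english_tokens


tokens = ["si", "x", ":", "imprimir", "(", '"', "si", '"', ")"]
print(backtranslate(tokens))
```

Note that the second "si" sits inside a string, so it is left alone; that is the point of keying the dictionary on lexer state.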


grrr... OK, what happens if you just throw syntax to the wind, and statelessly translate all alphanumeric "words"?

Possible problems:
-a language with two uses of the same keyword with different meanings: not likely outside Brainf*ck.
-a program with the same problem: actually a minor issue, and this would encourage fixing it.
-you still need at least 3 concepts: program text, string literals, and comments. Ideally there would be some way to distinguish commented-out code from other comments (ReST?).

All you really need is a way to tell when to start / stop translation. The editor needs to know whether the cursor is in a translation or a no-translation zone in order for translate-as-you-type to happen.


keystroke handler:

   formerCursorPos = cursorPos
   retranslate token behind cursor, if any
   cursorPos = curLexerPos = GetCursorPos()
   while (lexerState[curLexerPos] != formerLexerState[adjustForNewChars(curLexerPos)]):
       lexOneStep()
   redisplayUpTo(curLexerPos)
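The loop in the keystroke handler (keep lexing until the new lexer state agrees with the old one) might be sketched roughly as follows. This is only an illustration under assumptions: it works per line rather than per character, the edit is assumed not to change the number of lines, and `lex_line` is a toy lexer in which a double quote toggles between two hypothetical states.

```python
def lex_line(line, state):
    """Toy lexer: return the state at the end of `line`.
    The states "code" and "string" are assumptions for this sketch."""
    for ch in line:
        if ch == '"':
            state = "string" if state == "code" else "code"
    return state


def relex_from(lines, end_states, start):
    """Re-lex from line `start` until the new end-of-line state matches the
    cached one. Updates `end_states` in place; returns the number of lines
    that had to be lexed again (and so would need redisplaying)."""
    state = end_states[start - 1] if start > 0 else "code"
    relexed = 0
    for i in range(start, len(lines)):
        new_state = lex_line(lines[i], state)
        relexed += 1
        if new_state == end_states[i]:
            break  # states have converged; later lines are unaffected
        end_states[i] = new_state
        state = new_state
    return relexed


# An opening quote typed on line 0 changes the state of every later line.
lines = ['x = "1', 'y = "a"', 'z = 2']
states = ["code", "code", "code"]
print(relex_from(lines, states, 0), states)
```

An edit that does not change the lexer state stops after a single line, which is what keeps the handler cheap for ordinary typing.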


behaviour:

Quotes cause visible backtranslation. Workaround: cut and paste, or a menu option "put selection in quotes".

:Possible solution: Whenever the total number of quotes in a document is odd (i.e. a string has been started but not ended), all translation is paused until the issue is resolved (until another quote is inserted). Paused means nothing new will be translated, and nothing old will be untranslated. There are still ways to "confuse" this (e.g. two missing quotes in widely separated bits of the program), but they'll be much less common. [[User:Hello1024|Hello1024]] 18:53, 30 July 2007 (EDT)
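The quote-parity check suggested above is small enough to sketch directly. The function name is hypothetical, and the sketch deliberately ignores escaped quotes, single quotes, and triple-quoted strings, all of which a real editor would have to consider.

```python
def translation_paused(text, quote='"'):
    """Return True while the document contains an odd number of quotes,
    i.e. a string has been opened but not yet closed."""
    return text.count(quote) % 2 == 1


print(translation_paused('saludo = "hola'))   # string still open
print(translation_paused('saludo = "hola"'))  # string closed again
```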


Comments in Python source code read as dumbly backtranslated. No wait: if you use some ReST convention to distinguish source from non-source in comments, the experience is that you type "//" and everything flickers to English, then type "::" and it flickers back to what it was. With, as above, two menu options for "comment out (source code)" and "comment out (Spanish)".


Latest revision as of 22:21, 23 August 2007

I've been thinking about the design issues for coding i18n in develop. The problem is that if you want to translate identifiers at all, you immediately have to work with multiple dictionaries for all the modules you're importing. Initially I just slogged in and tried to start coding something that would keep a whole import tree in its head, and magically decide for you where any changes in the dictionary should end up. I took about a day to realize that thence lay only nasty pointed teeth (I should have realized sooner).

So after thinking about what simplifying assumptions I can make, I have a basic design that I'm kinda proud of. I'll try to explain it below, and to keep the "just LOOK at how hairy the problems are if you don't do it my way" details to a minimum. Trust me, they're hairy. Nonetheless, I know y'all need no map from me, if you see another path, or a minor modification, which still avoids the hair and teeth, by all means, suggest it. Also speak up if you see some hair on my path I've missed, of course.

Summary:

1-As discussed already, any identifier or keyword with a translation is presented in the user's preferred language on screen, but in English on disk.

1a-Translation is based on the concept of alphanumeric words (start with an alpha, including _, and continue with alphanum). The only parsing prior to translation is to separate code, strings, and comments (the latter two generally untranslated, with certain exceptions based on simple ascii markup). This makes the solution relatively easy to generalize to other computer languages. (It also means that the prefixes "_" and "__" are not necessarily preserved).
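As a rough illustration of point 1a, here is a sketch that translates only the alphanumeric words in code and leaves strings and comments alone. It leans on Python's standard `tokenize` module as a stand-in for the simpler code/string/comment split described above, and the English-to-Spanish dictionary is entirely hypothetical.

```python
import io
import tokenize

# Hypothetical dictionary; real entries would come from the .t9n files.
EN_TO_ES = {"if": "si", "else": "sino", "print": "imprimir", "total": "suma"}


def translate_source(source, dictionary):
    """Replace every NAME token (keywords and identifiers) that has a
    translation. String literals and comments are separate token types,
    so they pass through unchanged."""
    lines = source.splitlines(keepends=True)
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    names = [tok for tok in tokens if tok.type == tokenize.NAME]
    # Work backwards so earlier column offsets stay valid after a replacement.
    for tok in reversed(names):
        row, start = tok.start
        end = tok.end[1]
        word = dictionary.get(tok.string, tok.string)
        line = lines[row - 1]
        lines[row - 1] = line[:start] + word + line[end:]
    return "".join(lines)


source = 'if total:\n    print("if total")  # print total\n'
print(translate_source(source, EN_TO_ES))
```

The same function with the dictionary inverted would give the screen-to-disk direction.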

2-The identifier-translating dictionary for any given file - say, "somemodule.py" - stays in parallel files in the same directory. English and the file's preferred language for exported identifiers go in the master translation file "somemodule.t9n". Further languages go in "somemodule.languagecode.t9n" (indexed against the master translation file).

2a- The editor would have to understand import statements and have the ability to find the relevant .t9n's. "from" and "as" modifiers would be ignored, except when combined, because even a single imported item could carry with it all the identifiers.
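One way the lookup in 2 and 2a could go, sketched with the standard `ast` module: collect the imported module names, ignoring the "from" and "as" details, and build the file names "somemodule.t9n" and "somemodule.languagecode.t9n" next to where each module would live. The function name and the assumption that modules sit under a single directory are mine, and the combined "from ... as" exception is not handled.

```python
import ast
from pathlib import Path


def t9n_candidates(source, directory, language):
    """Return the .t9n paths an editor might look for, two per imported
    module: the master file and the per-language file."""
    modules = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.append(node.module)  # the whole module, not just the item
    paths = []
    for name in dict.fromkeys(modules):  # drop duplicates, keep order
        base = Path(directory).joinpath(*name.split("."))
        paths.append(base.with_name(base.name + ".t9n"))
        paths.append(base.with_name(base.name + "." + language + ".t9n"))
    return paths


source = "import m1, m2\nfrom pkg.sub import thing as t\n"
for path in t9n_candidates(source, "src", "es"):
    print(path)
```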

3-Initially, this dictionary ONLY contains translations for the "public interface" of somemodule.py, that is, those identifiers which are used in importer modules. It also defines a single, unchanging "preferred language" for that file, which is the assumed language for all non-translated identifiers in that file.

3a. It would be possible to convert a file to English/multilanguage and add a private dictionary ("somemodule.private.t9n" and "somemodule.private.languagecode.t9n" for identifiers that are not imported with the file). This would involve adding an original-language tag to all identifiers in that file which had not yet been translated. The file could then be edited in any language, and new identifiers added during those sessions would be similarly tagged with a prefix indicating the language they were supposedly written in.

4-There would be good UI support for creating a new translation for a word. However, the assumed user model is that words will be translated INTO a user's preferred language; FROM the context of an importer module (you'd generally not add translations for a module from that module itself, since generally you wouldn't even have modules open whose preferred language is not your own); and therefore WITH an explicit user decision as to which module this translation belongs in (they want to use their language for identifier X which is in English; well, they must have had a reason to write it in English rather than their language, so they presumably know what imported module it comes from). This user model is not strictly enforced, but it is encouraged, as it is felt this will result in effort being focused on the highest-quality, most-useful translations.

5-As a consequence of points 1 and 4, when you add a translation to a module whose preferred language is not English, that results in an on-disk change to the Python code for that file. (This is unlike adding a translation for a file whose preferred language is English, which only ever results in safe on-screen changes.) To enable the EDITOR to intelligently propagate these changes to other importers of the changed module, and the INTERPRETER to dumbly continue to work for these other importers before the editor gets to them, the changed file (and its dictionary) is given a new name (for instance "importedmodule.i18n.v1.py"). The old version is not deleted and keeps the old name.

6. Due to the notable disadvantages of point 5 (polluting the filesystem and, worse, the import/pythonpath namespace with old versions, whereas the best version of a file would always have a name like "importedmodule.i18n.v37.py"), there would be one change to the Python core to facilitate cleanup. If someone deleted all the old copies and renamed the aforementioned best version to just importedmodule.py again, the default __import__ function would know how to find it when it couldn't find importedmodule.i18n.v37.py. This new feature would have no impact on any existing Python code, and, to be honest, I think that its presence in the "changes in python3001" lists would be (minor but useful) propaganda for the new i18n features.

(obviously, a good delinting tool would take care of all the issues created by 5 at once.)
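The naming scheme in points 5 and 6 can be illustrated with plain file-name logic. This is not how `__import__` works today and does not touch the import machinery; both function names are hypothetical, and the only thing taken from the text is the "importedmodule.i18n.vN.py" pattern.

```python
import re

# Matches names such as "importedmodule.i18n.v37.py".
VERSIONED = re.compile(r"^(?P<base>.+)\.i18n\.v(?P<version>\d+)\.py$")


def latest_version(base, available):
    """Return the highest-numbered versioned file for `base`, or None."""
    best = None
    for name in available:
        match = VERSIONED.match(name)
        if match and match.group("base") == base:
            version = int(match.group("version"))
            if best is None or version > best[0]:
                best = (version, name)
    return best[1] if best else None


def resolve_module_file(requested, available):
    """Point 6's fallback: if a versioned file is gone, try the plain name."""
    if requested in available:
        return requested
    match = VERSIONED.match(requested)
    if match and match.group("base") + ".py" in available:
        return match.group("base") + ".py"
    return None


files = ["importedmodule.py", "importedmodule.i18n.v4.py", "importedmodule.i18n.v37.py"]
print(latest_version("importedmodule", files))
print(resolve_module_file("importedmodule.i18n.v37.py", ["importedmodule.py"]))
```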

7. Docstrings and comments, as always, are a separate issue, but I think that they're also a soluble one.

.................

Is all this clear? Do y'all understand why it's necessary? Do you have any other ideas, or see problems with the above that I missed? Do you think I've made any intolerable or unnecessary compromises? Or do you just think that it's absolutely brilliant?

Homunq 17:14, 10 August 2007 (EDT)

Later thoughts

(This has now been included above as point 3a, but I am leaving this further expansion of what I mean here).

After discussion on email and further thought, I have decided to initially implement a two-level design.

Level 1: files with an intrinsic preferred language and no translation of internals (no private translation dict). This would work essentially as outlined above. These files would be editable only in their preferred language, though they would be importable from any language.

Level 2: files with no intrinsic preferred language and an internal/private translation dict. These are editable in any language. Any identifiers added when editing in a non-English language are tagged on-disk with the editor language when they were created, until they can be translated. When editing in non-English, untranslated English identifiers are marked as such instead of just being presented as-is.

Conversion from level 1 to level 2 would entail changing the file on disk. This is conceived as being a step that someone would take not initially upon sharing the file, but only when the module's public interface is relatively well-translated.
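The on-disk language tagging used by level 2 files could look something like the sketch below. The three-underscore separator follows the "en___X" spelling used in the difficult case further down; the function names and the set of known language codes are assumptions.

```python
SEPARATOR = "___"


def tag(identifier, language):
    """Mark an identifier with the language it was written in."""
    return language + SEPARATOR + identifier


def split_tag(name, known_languages):
    """Return (language, identifier); language is None for untagged names.
    Only prefixes that are known language codes count as tags, so an
    ordinary identifier that happens to contain '___' is left alone."""
    prefix, separator, rest = name.partition(SEPARATOR)
    if separator and rest and prefix in known_languages:
        return prefix, rest
    return None, name


languages = {"en", "es", "fr"}
print(tag("maFonction", "fr"))
print(split_tag("fr___maFonction", languages))
print(split_tag("my___var", languages))
```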

(Another addition above: multiple translation files for multiple languages)

One difficult case (from email)

Suppose two modules m1 and m2 define classes c1 and c2 respectively, both with a method X in their public interface. m1 is translated, and states that X is iks in the current target language. m2 is untranslated.
We are trying to display a file that goes like this:
   import m1, m2
   def maFonction_i18n_fr(aParam):
       aParam.X()
   EOF
We do not, cannot know whether aParam refers to an m1.c1 or m2.c2 instance; possibly maFonction could receive both instances. So we cannot know whether X is translated or not. I am not sure how to handle it; how do you?


Good job stating the case. Just one detail that I think you misstated: m1 states X is english, not the current target language, otherwise X would not be showing up in the file on disk (as you state the problem, at worst it would be ***m1lang___X*** which is NOT in the example I want).

Let's call the third module, the one doing the importing, m3.

As stated, our editor knows that X exists in m1, because it has an entry in the public dict. In my vision, the editor does not go willy-nilly scanning actual .py files, it just imports public dicts, so it may or may not know that m2 exports an X, depending on whether that's in m2.t9n.

If it knows, things are fine. As soon as you put an identifier in the public dict of m2, that identifier is tagged on disk for safety. So we have m2lang___X, and no collision, though this can show up either as (X, m2lang___X), (en___X, X), or (en___X, m2lang___X) depending on m3lang. (and if m1 has a good translation that would go in the first element of those lists, and if m1 gives the null translation of X=X you could even end up seeing the disambiguation (m1___X, m2___X). If you type X in that last case, it shows up bright red as an error and the file refuses to save until you fix it.)

If it does NOT know, then we have a problem, and our goal should be to let an aware user notice it and fix it as soon as possible. Remember, for now this file works as intended, but if someone comes along and gives m2 a new English translation for X, everything will break.

If m3lang == m2lang, the original m3 programmer should have seen either en___X or m1translation. If they type X, this will automatically be changed to m3lang___X on disk, so their code never ran, and so they catch the error and right-click on X and say 'add translation to file m2' and leave the English blank because they don't speak English. And everything works fine then because both files use m3lang___X on disk.

If m3lang == en, then we have a tougher problem waking the programmer up. They would have to be moderately aware to notice that when they looked into m2 to find that the method was called X, everything was colored non-English and was in some funny language. So they should by instinct want to write m2lang___X if they mean the m2 one. And then when that doesn't work, they can easily fix the problem as above.

Say they don't wake up. Then somebody else comes along and adds a translation to m2, saying that X is exported (either giving an English translation Y or leaving it as m2lang___X; let's call those both Y). The next time our English-speaking m3 programmer opens their file, our logic tries to add the new translation, notices the conflict, and complains. It turns all the Xes into WARNING___X or something, then tells the programmer to search through for WARNING___X and change it either into X or Y, depending on whether it's from m1 or m2. (If they are the one to add the new translation, all the better; if m3 is open when the translation is added, they get the warning right away.)

That last paragraph is the only explicit coding we have to do for this case. The rest of it all happens naturally, as a consequence of how things work anyway.
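A simplified sketch of that explicit step: find identifiers that more than one imported module's public dict claims, and rewrite them as WARNING___X. The shape of the public dicts (module name mapped to a set of exported identifiers) is an assumption, and the plain regex substitution would also touch strings and comments, which a real editor would want to skip.

```python
import re


def find_conflicts(public_dicts):
    """public_dicts maps module name -> set of exported identifiers.
    Return {identifier: [modules]} for identifiers exported by 2+ modules."""
    exporters = {}
    for module, names in public_dicts.items():
        for name in names:
            exporters.setdefault(name, []).append(module)
    return {name: sorted(mods) for name, mods in exporters.items() if len(mods) > 1}


def mark_conflicts(source, conflicts):
    """Rewrite each conflicting identifier as WARNING___<identifier>."""
    for name in conflicts:
        pattern = r"\b" + re.escape(name) + r"\b"
        source = re.sub(pattern, "WARNING___" + name, source)
    return source


public_dicts = {"m1": {"X", "c1"}, "m2": {"X", "c2"}}
conflicts = find_conflicts(public_dicts)
print(conflicts)
print(mark_conflicts("aParam.X()\n", conflicts))
```

Because the underscore counts as a word character, an identifier that is already marked as WARNING___X is not marked a second time.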

Translating inside docstrings and comments and reST

Most words inside docstrings and comments should be considered to be in natural language, not identifiers. However, identifiers must be caught and translated. reST provides a way to mark something as an identifier; however, there are many docstrings that are not religiously rest-ed. The most common cases are all catchable algorithmically:

  1. argument : `lists, or`
  2. method -- `lists`
  3. funnyCapitalization `or` underscore_words
  4. function(calls,like=this)
  5. index[lookups:or:slices]
  6. [lists,like,this] `or` {dictionaries:like,this}

All the above should get caught as translatable. I use backquotes above to show what would not otherwise be translatable. Note that some words (such as McQueen) have funnyCapitalization but are not identifiers. This may be more common in some languages. However, a slightly-overzealous translator is seen as far better than either a completely-overzealous or a deficient one.
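A few of the cases listed above can be approximated with regular expressions. The sketch below is deliberately partial and slightly overzealous: it covers the "argument :" and "method --" line openers, funnyCapitalization, underscore_words, and the name in front of a call or index, but not the words inside brackets or bare lists and dictionaries. All names in it are hypothetical.

```python
import re

WORD = r"[A-Za-z_]\w*"
PATTERNS = [
    # "argument : description" or "method -- description" at the start of a line
    re.compile(r"^\s*(%s)\s+(?::|--)\s" % WORD, re.MULTILINE),
    # funnyCapitalization: a lowercase letter directly followed by an uppercase one
    re.compile(r"\b(\w*[a-z][A-Z]\w*)\b"),
    # underscore_words
    re.compile(r"\b(%s_\w+)\b" % WORD),
    # the name in front of "(" or "[", as in function(...) or index[...]
    re.compile(r"\b(%s)(?=[(\[])" % WORD),
]


def likely_identifiers(docstring):
    """Return the set of words in a docstring that look like identifiers."""
    found = set()
    for pattern in PATTERNS:
        found.update(pattern.findall(docstring))
    return found


doc = (
    "count : number of items\n"
    "Call parseLine(text) on each row, or use max_width and rows[0].\n"
    "Written by McQueen.\n"
)
print(sorted(likely_identifiers(doc)))
```

As the text predicts, McQueen is caught as a false positive by the capitalization rule.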

I'm currently writing this up as a PEP.

Current status:

See Bityi (translating code editor)