Bityi (translating code editor)/design: Difference between revisions

From OLPC
I'm going to use this space for my own notes on how to do this. I am currently operating under the following assumptions, though any one of these may change:


*This is desirable.


*All problems are solvable (though things like ReST comments will wait till last).
*It should be done in OLPC / Sugar first, then ported. This means doing it in Python. It also means that good design is important: the code should avoid touching Sugar unless it has to.


*For shorthand, I'm assuming Spanish and Python in my examples.


OK. So. Looking for projects that already have Python-based lexers for many languages, I came across Pygments. Most of its lexers are one-pass, which is crucial for a real-time system. I think this can be used.


One funny case I think of: say you add a quote that suddenly makes the rest of the file into a string literal. Two behaviours are possible: preserving the localized string (user sees nothing but a color change, but entire source code from there on out is invisibly rewritten in Spanish) or preserving the underlying python (user sees that suddenly everything turns string-colored and all the keywords change to English). The latter behaviour is far easier to accomplish and seems to me "better" (among other things, it makes an easily "discoverable" quick-and-dirty way to see the English source) but there would be problems if someone "forgot to put the quotes in" until too late (random bits of their UI strings could get quasi-translated into English).


There are similar issues for anything that, by changing context, changes the meaning of the code that follows. The most significant case is commenting out an existing section or line fragment, but depending on the language involved there could be other examples. Obviously, the answer for comments is different - you want the text to remain unchanged onscreen, though you may want to continue to run the translation as if it were real code in some cases.


If you turned off translation and just used the Spanish text for comments, that would give a workaround for the "I don't want my string to change" problem. Another workaround would be cut/paste.


Another issue this raises: for a lexer with state and without a uniform definition of token boundaries, reverse translation is hard in the general case. In practice, the problem is much easier - you have all the lexer state info from the actual Python, and most languages (Brainf*ck aside) have a single definition of tokens, so you just backtranslate one token at a time, using state if necessary.


But you need a generalized backtranslation algorithm for typing, pasting, and possibly for uncommenting. I think you could just assume that the tokens are decent and that all hard keywords show up literally in the regexes, and do a search-and-replace on the lexers. Debug the ones that barf offline; people won't be making lexers on the fly.
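
The search-and-replace-on-the-lexers idea can be sketched roughly as below. This is my own minimal sketch, not working code from the project: it assumes the keywords appear in the lexer's patterns as plain alternation groups like <code>(if|else|while)</code>, and anything fancier really would need the smarter-about-regexes program mentioned above.

```python
import re

def translate_lexer_pattern(pattern, dictionary):
    """Rewrite literal keywords inside a lexer regex by search-and-replace.

    Only handles keywords written as alternation groups like r'\b(if|else)\b';
    anything fancier would need a real regex parser.
    """
    def repl(match):
        word = match.group(0)
        return dictionary.get(word, word)

    # A word counts as a keyword only when delimited by '(' or '|' on the
    # left and '|' or ')' on the right, i.e. inside an alternation group.
    return re.sub(r'(?<=[(|])\w+(?=[|)])', repl, pattern)

# Hypothetical Spanish keyword dictionary.
es = {'if': 'si', 'else': 'sino', 'while': 'mientras'}
print(translate_lexer_pattern(r'\b(if|else|while)\b', es))  # \b(si|sino|mientras)\b
```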


== harder than I thought ==


simple case: lexer is grabbing one token at a time. Definition of token is stateless. Meaning of token may be stateful.
solution: a backtranslate dictionary for each state. For each token, backtranslate, then lex; that gives you the state for the next token.
advantages: no need to touch the lexer
dis: dictionaries need to be defined for states which were intended as lexer internals.
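
As a toy concretion of the per-state dictionary idea: the two-state lexer below (code vs. string literal) picks a dictionary per state, and string contents survive untouched. All the names and the tiny dictionary are my stand-ins, not project code.

```python
# Toy illustration of "a backtranslate dictionary for each state":
# a two-state lexer (CODE vs STRING) chooses which dictionary, if any,
# applies to each word. Dictionary contents are hypothetical.
import re

CODE, STRING = 'code', 'string'

BACKTRANSLATE = {
    CODE: {'si': 'if', 'imprimir': 'print'},
    STRING: {},  # string literals are left untouched
}

def backtranslate(text):
    out, state = [], CODE
    for token in re.findall(r'\w+|\W', text):
        if token == '"':                      # quotes flip the state
            state = STRING if state == CODE else CODE
            out.append(token)
        else:
            # backtranslate with this state's dictionary, then move on
            out.append(BACKTRANSLATE[state].get(token, token))
    return ''.join(out)

print(backtranslate('si x: imprimir "si"'))   # -> if x: print "si"
```

Note that the "si" inside the quotes comes through untranslated, which is exactly the string-preserving behaviour discussed above.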


relax that: actually, definition of token needn't be stateless.
dis: you must hand-define tokens for each state, losing much of the advantage of having predefined lexers. That, or make a program that is smarter about regexes than I care to think about.

grrr... OK, what happens if you just throw syntax to the wind, and translate statelessly all alphanum "words"?

Possible problems:
-a language with two uses of the same keyword with different meanings - not likely outside brainf*ck.
-a program with ditto - actually a minor issue, and this would encourage fixing it.
-you still need at least 3 concepts - program text, string literals, and comments. Ideally there would be some way to distinguish commented-out code from other comments - ReST?

All you really need is a way to tell when to start / stop translation. The editor needs to be aware of whether the cursor is in a translation or a no-translation zone, so that it can translate as typing happens.

keystroke handler:
    formerCursorPos = cursorPos
    retranslate the token behind the cursor, if any
    cursorPos = curLexerPos = GetCursorPos()
    while lexerState[curLexerPos] != formerLexerState[adjustForNewChars(curLexerPos)]:
        lexOneStep()
    redisplayUpTo(curLexerPos)
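
The loop above can be made concrete with a deliberately tiny model, where the only lexer state is "inside a string literal" (quote parity). The names <code>lex_states</code> and <code>redisplay_extent</code> are my stand-ins for the real <code>lexerState</code> / <code>formerLexerState</code> machinery; this is a sketch, not the editor's actual code.

```python
def lex_states(text):
    """Per-position lexer state: True when inside a "..." literal."""
    states, inside = [], False
    for ch in text:
        if ch == '"':
            inside = not inside
        states.append(inside)
    return states

def redisplay_extent(old_text, pos, inserted):
    """Insert `inserted` at `pos`, then lex until the new state agrees
    with the pre-edit state at the corresponding position. Returns
    (new_text, how far redisplay/retranslation must reach)."""
    new_text = old_text[:pos] + inserted + old_text[pos:]
    old_states = lex_states(old_text)
    new_states = lex_states(new_text)
    extent = pos + len(inserted)        # at minimum, the inserted chars
    for i in range(pos + len(inserted), len(new_text)):
        if new_states[i] != old_states[i - len(inserted)]:
            extent = i + 1              # states still disagree: keep going
    return new_text, extent
```

Inserting a plain character leaves the extent at the edit itself; inserting a lone quote flips the string state for everything after it, so the extent runs to the end of the file - exactly the "everything turns string-colored" behaviour described above.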


behaviour:

Quotes cause visible backtranslation. Workaround: cut and paste, or a menu option "put selection in quotes".

:Possible solution: Whenever the total number of quotes in a document is odd (i.e. a string has been started but not ended), all translation is paused until the issue is resolved (until another quote is inserted). Paused means nothing new will be translated, and nothing old will be untranslated. There are still ways to "confuse" this (e.g. 2 missing quotes in widely separated bits of the program), but they'll be much less common. [[User:Hello1024|Hello1024]] 18:53, 30 July 2007 (EDT)
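
:That heuristic is small enough to write down. A minimal sketch, assuming a single quote style for simplicity (real Python would also need <code>'</code>, triple quotes, and escapes):

```python
def translation_paused(text):
    # Quote-parity heuristic: an odd number of quotes means a string
    # is open, so pause all translation and untranslation until the
    # quote is closed. (Single quote style only, for simplicity.)
    return text.count('"') % 2 == 1

translation_paused('x = "hol')     # True: string still open, hold off
translation_paused('x = "hola"')   # False: safe to translate again
```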

Comments in Python source code would read as dumbly backtranslated. No wait - if you use some ReST convention to distinguish source from non-source in comments, the experience is: type "#" and everything flickers to English, then type "::" and it flickers back to what it was. With, as above, two menu options for "comment out (source code)" and "comment out (Spanish)".

:Suggestion: You seem to be considering translation as an "on the fly" thing on every keystroke. This seems very CPU intensive, and therefore power consuming. Since a code editor is basically a text editor, which is one of the simplest programs around, it shouldn't use much power. If on every keystroke the entire program is going to be parsed and every keyword looked up in a dictionary, it's not going to be power efficient. How about using translation on file open and save only? (Save includes "run".) That solves a lot of the issues you're considering above, and would make implementation a lot easier. Translation could even be implemented as a command line tool for testing - e.g. "pyconv input.py lang=ES >output.py". [[User:Hello1024|Hello1024]] 19:00, 30 July 2007 (EDT)

You're absolutely right, that is the obvious general solution. Include menu items for commenting/uncommenting/quoting/unquoting with translation of sections of the open file, but never translate the open file on keystrokes, just for save/run.
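
A rough cut at the "pyconv" idea is sketched below, leaning on the standard-library tokenizer (which lexes without caring whether the keywords are English, so Spanish-keyword source still tokenizes). The dictionary here is a tiny hypothetical stand-in, and the function name just echoes the suggested tool.

```python
# Sketch of translate-on-open/save: run the whole file through a
# dictionary once, instead of on every keystroke.
import io
import tokenize

ES_TO_EN = {'si': 'if', 'sino': 'else', 'imprimir': 'print'}  # stand-in dict

def pyconv(source, dictionary):
    """Translate NAME tokens (identifiers and keywords) through
    `dictionary`; strings, comments and numbers pass through untouched."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        text = tok.string
        if tok.type == tokenize.NAME:
            text = dictionary.get(text, text)
        out.append((tok.type, text))
    return tokenize.untokenize(out)
```

One caveat: <code>untokenize</code> in this two-tuple mode may normalize spacing, which is fine for a save-time translator but not for on-screen display.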

==Another tar-baby==

The problem: included files have their own preferred languages and multilanguage translation dicts. You want to be intelligent about using those dicts in the open file. This is actually quite hard. I knew that, but I didn't realize quite how hard until I was hip-deep in code. (By the way: before I started coding this mess, I actually got a simple version working which does English-Spanish keywords in IDLE. It didn't fix the syntax coloring, or allow turning it on/off, or do any tooltips, but each of those should be pretty easy. Hooray!!)

Before I start explaining all the subproblems, I'm wondering if there's a way to avoid the issue, or at least radically simplify it, by changing the use model, as above?

#No multiple levels of inheritance. Each file exports a dict only for the identifiers it creates, not for the identifiers it imports. Disadvantages: loss of abstract generality of the solution but no concrete problems I can think of. OK: done.
#No "public" and "private" keywords: a module exports the entire translation dict that it has (except as above). Advantages: no need for funky AI to figure out which identifiers are used by importers, and it simplifies dictionary creation because it encourages translators to ONLY translate public classes/methods/input variables. Disadvantages: overzealous exportation could lead to more "clashes" (though this is also good, see below). A module is externally usable in English or third languages, but it is hard to read or edit if its internals are still in the original language. (This is also good: consider that a "preferred *English" module should not be edited by a *Spanish editor who would add new internal variables with *Spanish names without creating translations for them.) So: no module could have mutually-unintelligible coder communities internally, although externally (i.e., "import module") it could still work. Y'know, that's really true anyway: how can coders collaborate on the same piece of code if they can't even talk to each other? At that point, the variable names might as well all be just "x" and "y" anyway. OK: done.
#Do all dictionary changes in the current file's dictionary. Advantages: much easier to code. Disadvantages: let's see...

:*underlying file has complete translation for public stuff from English to Spanish: no problem, this is correct behavior

:*underlying file has incomplete translation and is in English:

:**translation is not an error: gets the right behavior for the current file, but other importers of the underlying file must duplicate the work (two steps of work: (1) recognizing that the keyword being translated is from the underlying file, whether it (A) is not or (B) is already marked as such in the underlying file's dictionary, possibly because an Arabic translation exists; and (2) putting the translation into the underlying file. 2 and 1B are relatively simple programmatically, and 1A could be manually assisted, so this is not so hard to do "right". Since 1A is really the whole point of dynamically adding translations, it is not beyond the pale to ask for manual help.)

:**translation is an error: that is, you're associating a Spanish word with an English word which, unbeknownst to you, is already in the imported file. The simplest cases of this could be handled by careful use of namespaces (a warning for "from EnglishFile import *"), but methods of class instances could be in error. I don't actually see how to catch this error even if you wanted to without MASSIVE effort, and even then, at runtime. So, just tell people how to be careful when translating method names.
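
:The "careful use of namespaces" warning could be mechanized with a quick pass over the AST that flags <code>from X import *</code>, the form that makes these clashes invisible. A minimal sketch using only the standard library; the function name is my own:

```python
# Flag star-imports, which defeat the namespace-based protection
# against accidental translation clashes.
import ast

def star_imports(source):
    """Return the names of modules that `source` star-imports."""
    return [node.module
            for node in ast.walk(ast.parse(source))
            if isinstance(node, ast.ImportFrom)
            and any(alias.name == '*' for alias in node.names)]

star_imports('from EnglishFile import *\nimport os\n')  # ['EnglishFile']
```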

:*underlying file has incomplete translation and is in Spanish:

:**Translation belongs in underlying file: leads to ERROR ERROR ERROR.

:**Translation does not belong in underlying file: hides an ERROR ERROR ERROR (programmer improperly reusing an identifier) that should be flagged.
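
For concreteness, the bookkeeping that sits under all of these cases - each imported module exporting one flat dict for the identifiers it creates, the editor merging them and spotting "clashes" - might look like the sketch below. The dict shapes and names are my assumptions, not the project's actual format:

```python
# Merge per-module export dicts into one screen-name table, reporting
# clashes (one on-screen word standing for two different on-disk names).
def merge_export_dicts(export_dicts):
    """export_dicts: {module: {on_disk_name: on_screen_name}}.
    Returns (screen_to_disk, clashes)."""
    screen_to_disk, clashes = {}, []
    for module, exports in export_dicts.items():
        for disk_name, screen_name in exports.items():
            prior = screen_to_disk.get(screen_name)
            if prior is not None and prior != disk_name:
                # Same on-screen word already maps to a different name.
                clashes.append((screen_name, prior, disk_name))
            else:
                screen_to_disk[screen_name] = disk_name
    return screen_to_disk, clashes
```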


==EUREKA!!!!!!!!!!!!!!==

Forget the entire point 3. The simplifying assumption is exactly the reverse:

# NO dynamic changes to the dictionary are allowed for the current file, ALL dynamic changes are applied to manually-chosen imported files, these files MUST prefer the non-UI-preferred language. This is much closer to reality in terms of the user model -- why would the same author write in Spanish if they could also provide their own English translation? (If they really want to, let them make a dummy English file which imports the Spanish and do the translation from there.)

**i If UI-preferred language is Spanish, case 1B above pushes the underlying file in question to the top of the line. All is beautiful and pretty easy.

***ia Other Spanish files which import that English module would magically change next time they're opened. You don't even need any special logic, except rechecking the dictionary. Even clash resolution is automatic.

**ii If the UI-preferred language is English, this involves going in and changing the .py files in question. That could break them for:

***iia Spanish files which import them - the interpreter could actually catch and fix this error. If not, the editor could be smart enough to automatically fix them next time you touched them.

***iib English files which import them - if an English file imports a Spanish file with incomplete translation, it could use an "untranslated Spanish" directive. This would be a rare case anyway. This would get the above behaviour (allow editor or interpreter to fix the problem).

***iic Spanish files they import - you could actually have a manual option to kick the translation back up the chain a link; it would be rarely used, and if you forgot, things would break. Oh well.

***iid English files they import - you should not have done this. Things break. You fix them. (But: it wasn't even yours! You break everyone! Any ideas?)

.......

1... With good coding style, synonyms (even across files) are synonyms, so "clashes" are good. Disadvantages: poor coding style could lead to extra "clashes" on ambiguous abbreviations. This would lead to confusing code on one side (this is a problem anyway, but

==OK, things are simple enough now. What are the edge cases?==
my brain hurts. my daughter is undiagnosably just-a-little-bit-sick (last week, stomach; yesterday, ear; today, just snot and fever and "I'm fine") and my nephew has some horrible blood disease, so I'm really tired. If you can even understand the above, can you see any funny cases we have to worry about?

1. multiple copies of a module and horrible runtime which-one-is-it cases...

....

==To cache or not to cache==

So, since translation dictionaries are to be focused on *imported* files, the question becomes whether or not to cache them at the level of the import-er.

Con: not caching is simpler, and preserves the DRY/SPOT principle (in other words, prevents weird "where is THAT coming from, and how do I stop it" errors).

Pro: prevents funny behaviour where suddenly I can't read "my own" file. Also, it may be necessary for the above "autofix" stuff; let's take it case-by-case.

ia caching is exactly what you DON'T need.
iia no need to cache what's old, as long as you can tell what's new... datestamps? flimsy... the minimum right answer is to cache what's translated, just source, no dest...
iib same
iic & iid: not an issue. (Though the right answer .... whoops, another EUREKA.)

don't cache at the importer, cache at the import-ee. Only for case ii, that is, when you're actually changing a .py. Give each version of the .py a UUID. Now the question is, do you Replace-And-Backup (somemodule.py is new version, somemodule.lesstranslated.py is old version) or do you Make A New Version (somemodule.py is old, somemodule.moretranslated.py is new)?

RAB: worse for case iic & iid. and iia & iib. OK so this is WORSE.

MANV: Now you don't even need any changes to the interpreter! The only problem is: how can you tell when to finally get rid of the old version? Your magic editor will redirect clients to the new version every time it touches them (for cases iia and iib), and presumably you'll (with nice tools... TODO) notice the errors created (by cases iic and iid), but how do you know when you have touched all clients? Watching the last file access on the old version, and giving up after 6 months? Manually? (But if you rename the file, it just breaks all your magic updates on the clients, doesn't it? Well, at least you can fix that with your magic, as soon as you touch the broken files... OR, ha ha ha ha, you make just ONE change to the interpreter, so that "import somemodule.more.more.moretranslated" will default to "import somemodule" if it doesn't find .more.more.moretranslated. HAH, a no-brainer add-in to the interpreter, NO performance hit whatsoever, and as an added bonus it does its own propaganda!)
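
That one interpreter change can be sketched as a wrapper around the default import function. In the real proposal the fallback would live inside the interpreter's import machinery; the wrapper, the helper name, and the choice to strip only the ".i18n.vN" scheme are my assumptions for illustration.

```python
# Sketch of the fallback: if "somemodule.i18n.v37" can't be found,
# retry with the base name "somemodule".
import builtins

_real_import = builtins.__import__

def base_name(name):
    """'somemodule.i18n.v37' -> 'somemodule' (other schemes analogous)."""
    return name.split('.i18n.', 1)[0]

def i18n_import(name, *args, **kwargs):
    try:
        return _real_import(name, *args, **kwargs)
    except ModuleNotFoundError:
        fallback = base_name(name)
        if fallback != name:
            return _real_import(fallback, *args, **kwargs)
        raise

builtins.__import__ = i18n_import  # install the fallback
```

Since the wrapper only runs on the failure path, existing imports pay no cost - the "NO performance hit" claim above.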

Let's step back. I now have a smart, simple system for doing on-the-fly wiki-style translation. The user model is pretty manageable, the bugs introduced are kept to a minimum (to 0 with simple precautions?), and only one very simple and peripheral change is needed to the Python core (in the exception handling of the default __import__ function)... On the down side, in one relatively-rare case (English-based clients of Spanish-based modules) it has a tendency to make (multiple) old versions poop up the file system and, more seriously, the "import xxx" namespace of ALL modules that import the Spanish-based ones. In other words, eventually it will call for some global PYTHONPATH depooping tool.

OK, I guess I can (ask others to) live with that... Now, to write it up.

[[User:Homunq|Homunq]] 13:24, 9 August 2007 (EDT)

Revision as of 21:14, 10 August 2007

I've been thinking about the design issues for coding i18n in develop. The problem is that if you want to translate identifiers at all, you immediately have to work with multiple dictionaries for all the modules you're importing. Initially I just slogged in and tried to start coding something that would keep a whole import tree in its head, and magically decide for you where any changes in the dictionary should end up. I took about a day to realize that thence lay only nasty pointed teeth (I should have realized sooner).

So after thinking about what simplifying assumptions I can make, I have a basic design that I'm kinda proud of. I'll try to explain it below, and to keep the "just LOOK at how hairy the problems are if you don't do it my way" details to a minimum. Trust me, they're hairy. Nonetheless, I know y'all need no map from me, if you see another path, or a minor modification, which still avoids the hair and teeth, by all means, suggest it. Also speak up if you see some hair on my path I've missed, of course.

Summary: 1-As discussed already, any identifier or keyword with a translation is presented in the user's preferred language on screen, but in English on disk.

2-The identifier-translating dictionary for any given file - say, "somemodule.py" - stays in a parallel file in the same directory, say ".somemodule.p4n".

2a- The editor would have to understand import statements and have the ability to find the relevant .p4n's. "from" and "as" modifiers would be ignored, except when combined, because even a single imported item could carry with it all the identifiers.

3-This dictionary ONLY contains translations for the "public interface" of somemodule.py, that is, those identifiers which are used in importer modules. It also defines a single, unchanging "preferred language" for that file, which is the assumed language for all non-translated identifiers in that file.

4-There is good UI support for creating a new translation for a word. However, the assumed user model is that words will be translated INTO a user's preferred language; FROM the context of an importer module (you'd generally not add translations for a module from that module itself, since generally you wouldn't even have modules open whose preferred language is not your own); and therefore WITH an explicit user decision as to which module this translation belongs in (they want to use their language for identifier X, which is in English; well, they must have had a reason to write it in English rather than their language, so they presumably know what imported module it comes from.)

5-As a consequence of points 1 and 4, when you add a translation to a module whose preferred language is not English, that results in a change on disk of the Python code for that file. (Unlike the case of adding a translation for a file whose preferred language is English, which only ever results in safe on-screen changes.) To enable the EDITOR to intelligently propagate these changes to other importers of the changed module, and the INTERPRETER to dumbly continue to work for these other importers before the editor gets to them, the changed file (and its dictionary) is given a new name (for instance "importedmodule.i18n.v1.py"). The old version is not deleted and keeps the old name.

6. Due to the notable disadvantages of point 5 (polluting the filesystem and, worse, the import/PYTHONPATH namespace with old versions, whereas the best version of a file would always have a name like "importedmodule.i18n.v37.py"), there would be one change to the Python core to facilitate cleanup. If someone deleted all the old copies and renamed the aforementioned best version to just importedmodule.py again, the default __import__ function would know how to find it when it couldn't find importedmodule.i18n.v37.py. This new feature would have no impact on any existing Python code, and, to be honest, I think that its presence in the "changes in python3001" lists would be (minor but useful) propaganda for the new i18n features.

(obviously, a good delinting tool would take care of all the issues created by 5 at once.)

7. Docstrings and comments, as always, are a separate issue, but I think that they're also a soluble one.

.................

Is all this clear? Do y'all understand why it's necessary? Do you have any other ideas, or see problems with the above that I missed? Do you think I've made any intolerable or unnecessary compromises? Or do you just think that it's absolutely brilliant?

Homunq 17:14, 10 August 2007 (EDT)