Bityi (translating code editor)/design
I'm going to use this space for my own notes on how to do this. I am currently operating under the following assumptions, though any one of these may change:
- This is desirable.
- All problems are solvable (though things like ReST comments will wait until last).
- It should be done on OLPC / Sugar first, then ported. This means doing it in Python. It also means that good design is important: the code should avoid touching Sugar unless it has to.
- For shorthand, I'm assuming Spanish and Python in my examples.
OK. So. Looking for projects that already have Python-based lexers for many languages, I come across Pygments. Most of its lexers are one-pass - crucial for a real-time system. I think this can be used.
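As a quick sanity check on that, here is a minimal sketch (not Bityi code; the keyword table is a made-up fragment) of one-pass display translation over a Pygments token stream:

 from pygments.lexers import PythonLexer
 from pygments.token import Keyword
 # Made-up fragment of an English -> Spanish keyword table.
 EN_TO_ES = {'if': 'si', 'else': 'sino', 'while': 'mientras', 'return': 'devolver'}
 def display_text(english_source):
     """One pass over the token stream: keywords get swapped for display;
     strings, comments, names and everything else pass through untouched."""
     out = []
     for ttype, value in PythonLexer().get_tokens(english_source):
         if ttype in Keyword and value in EN_TO_ES:
             out.append(EN_TO_ES[value])
         else:
             out.append(value)
     return ''.join(out)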
One funny case I think of: say you add a quote that suddenly makes the rest of the file into a string literal. Two behaviours are possible: preserving the localized string (user sees nothing but a color change, but entire source code from there on out is invisibly rewritten in Spanish) or preserving the underlying python (user sees that suddenly everything turns string-colored and all the keywords change to English). The latter behaviour is far easier to accomplish and seems to me "better" (among other things, it makes an easily "discoverable" quick-and-dirty way to see the English source) but there would be problems if someone "forgot to put the quotes in" until too late (random bits of their UI strings could get quasi-translated into English).
Similar issues arise for anything that, by changing context, changes the meaning of the code that follows. Most significantly, commenting out an existing section or line fragment - but, depending on the language involved, there could be other examples. Obviously, the answer for comments is different - you want the text to remain unchanged onscreen, though in some cases you may want to keep running the translation as if it were real code.
If you turned off translation and just used the Spanish text for comments, that would give a workaround for the "I don't want my string to change" problem. Another workaround would be cut/paste.
Another issue this raises: for a lexer with state and without a uniform definition of token boundaries, reverse translation is hard in the general case. In practice, this problem is much easier - you have all the lexer state info from the actual Python, and most languages (Brainf*ck aside) have a single definition of tokens, so you just backtranslate one token at a time, using state if necessary.
But you need a generalized backtranslation algorithm for typing, pasting, and possibly for uncommenting. I think you could just assume that the tokens are decent and that all hard keywords show up literally in the regexes, and do a search-and-replace on the lexers. Debug the ones that barf offline; people won't be making lexers on the fly.
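To make the search-and-replace idea concrete, a hedged sketch that rewrites the raw tokens dict of a Pygments RegexLexer. It only touches rules whose pattern is a plain regex string; entries built with helpers like words() or include() are passed through and would need their word lists translated separately. The keyword table is again a made-up fragment.

 import re
 from pygments.lexers import PythonLexer
 EN_TO_ES = {'if': 'si', 'else': 'sino', 'while': 'mientras', 'return': 'devolver'}
 def translate_tokens_dict(tokens):
     """Copy a RegexLexer 'tokens' definition, replacing English keywords that
     appear literally in plain string patterns with their Spanish forms.
     Non-string rules (include(), words(), default(), ...) are left alone."""
     out = {}
     for state, rules in tokens.items():
         new_rules = []
         for rule in rules:
             if isinstance(rule, tuple) and isinstance(rule[0], str):
                 pattern = rule[0]
                 for eng, spa in EN_TO_ES.items():
                     pattern = re.sub(r'\b%s\b' % re.escape(eng), spa, pattern)
                 new_rules.append((pattern,) + rule[1:])
             else:
                 new_rules.append(rule)
         out[state] = new_rules
     return out
 spanish_rules = translate_tokens_dict(PythonLexer.tokens)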
harder than I thought
Simple case: the lexer grabs one token at a time, the definition of a token is stateless, but the meaning of a token may be stateful. Solution: a backtranslate dictionary for each state. For each token, backtranslate, then lex; that gives you the state for the next token. Advantages: no need to touch the lexer. Disadvantages: dictionaries need to be defined for states which were intended as lexer internals.
Relax that: actually, the definition of a token needn't be stateless. Disadvantage: you must hand-define tokens for each state, losing much of the advantage of having predefined lexers. That, or write a program that is smarter about regexes than I care to think about.
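A minimal sketch of the "simple case" above. The state names, the dictionaries, and the lex_one() hook are hypothetical stand-ins for whatever the real incremental lexer exposes:

 # One backtranslate dictionary per lexer state (made-up fragments).
 BACKTRANSLATE = {
     'root':    {'si': 'if', 'sino': 'else', 'mientras': 'while'},
     'string':  {},   # inside a string literal: translate nothing
     'comment': {},   # inside a comment: translate nothing
 }
 def backtranslate(spanish_tokens, lex_one, state='root'):
     """spanish_tokens: the on-screen tokens, in order.  lex_one(state, token)
     -> next state is a hypothetical hook into the real lexer.  Backtranslate
     each token with the dictionary for the current state, then lex the
     English result to get the state for the next token."""
     english = []
     for text in spanish_tokens:
         eng = BACKTRANSLATE.get(state, {}).get(text, text)
         english.append(eng)
         state = lex_one(state, eng)
     return ''.join(english)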
Grrr... OK, what happens if you just throw syntax to the wind and statelessly translate all alphanumeric "words"?
Possible problems:
- A language with two uses of the same keyword with different meanings - not likely outside Brainf*ck.
- A program with ditto - actually a minor issue, and this would encourage fixing it.
- You still need at least 3 concepts - program text, string literals, and comments. Ideally there would be some way to distinguish commented-out code from other comments - ReST?
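A sketch of that stateless approach: every alphanumeric "word" found in the table gets swapped, and it is up to the caller to apply this only to program text, not to string literals or comments. The table is a made-up fragment.

 import re
 # Made-up fragment of a Spanish -> English table.
 ES_TO_EN = {'si': 'if', 'sino': 'else', 'mientras': 'while', 'devolver': 'return'}
 def backtranslate_words(text):
     """Statelessly replace every whole word found in the table.  The caller
     is responsible for skipping string literals and comments."""
     return re.sub(r'\w+', lambda m: ES_TO_EN.get(m.group(0), m.group(0)), text)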
All you really need is a way to tell when to start / stop translation. The editor needs to be aware of whether the cursor is in a translation or a no-translation zone so it can translate as typing happens.
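One way to answer the which-zone-is-the-cursor-in question, assuming the underlying English source is at hand: re-lex up to the cursor and look at the token type there. This is only a sketch - a real editor would cache lexer state rather than re-lex the whole buffer.

 from pygments.lexers import PythonLexer
 from pygments.token import String, Comment
 def in_no_translation_zone(english_source, cursor_offset):
     """True if cursor_offset falls inside a string literal or a comment."""
     for start, ttype, value in PythonLexer().get_tokens_unprocessed(english_source):
         if start <= cursor_offset < start + len(value):
             return ttype in String or ttype in Comment
     return False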
keystroke handler:
 formerCursorPos = cursorPos
 # retranslate the token behind the cursor, if any
 retranslateTokenBehindCursor()
 cursorPos = curLexerPos = GetCursorPos()
 # keep lexing forward until the lexer state again matches what it was
 # before the edit (lexOneStep() is assumed to advance curLexerPos)
 while lexerState[curLexerPos] != formerLexerState[adjustForNewChars(curLexerPos)]:
     lexOneStep()
 redisplayUpTo(curLexerPos)
behaviour:
Quotes cause visible backtranslation. Workaround: cut and paste, or a menu option "put selection in quotes".
- Possible solution: Whenever the total number of quotes in a document is odd (i.e. a string has been started but not ended), all translation will be paused until the issue is resolved (until another quote is inserted). Paused means nothing new will be translated, and nothing old will be untranslated. There are still ways to "confuse" this (e.g. two missing quotes in widely separated parts of the program), but they'll be much less common. Hello1024 18:53, 30 July 2007 (EDT)
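A sketch of that heuristic (it ignores escaped quotes, triple quotes, and quotes inside comments, all of which a real check would have to handle):

 def translation_paused(buffer_text):
     """True if a quote has been opened but not closed, going by a simple
     odd/even count of quote characters in the whole buffer."""
     return buffer_text.count('"') % 2 == 1 or buffer_text.count("'") % 2 == 1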
Comments in Python source code read as dumbly backtranslated. No wait - if you use some ReST convention to distinguish source from non-source in comments, the experience is: type "#" and everything flickers to English, then type "::" and it flickers back to what it was. With, as above, two menu options for "comment out (source code)" and "comment out (Spanish)".
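A sketch of how that comment classification might look. The exact marker and its placement are open questions, so "::" right after the "#" is a placeholder convention here, not real ReST handling:

 def comment_is_source(comment_text):
     """True if a comment should keep being translated as code (i.e. it was
     'commented out (source code)'), going by a placeholder '#::' marker."""
     return comment_text.lstrip('#').lstrip().startswith('::')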