Talk:WiXi


some previous talk at Talk:Twext.. also, someone added a similar system:

similar systems

This is similar to ruby annotation.

ruby annotation

http://www.i18nguy.com/unicode/unicode-example-ruby.html <- any better link?

olovieltrac

OmegaWiki

A multi-language dictionary is being developed by wikimedia as a future replacement for the current single language wiktionaries. A beta of the software is online at [[1]]. Would this help?

yes much thanks! including all languages in one database will help wixi a lot.. omegawiki also discusses annotation.. omegawiki (formerly wiktionaryz) link now updated at http://wixi.cc/integrate.. thanks

aligned parallel corpora

ALIGNED PARALLEL CORPORA: Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. one common way of annotating a corpus is part-of-speech tagging, or POS-tagging.. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.

gloss

GLOSS: A gloss (from Koine Greek glossa, meaning 'tongue') is a note made in the margins or between the lines of a book, in which the meaning of the text in its original language is explained, sometimes in another language. As such, glosses can vary in thoroughness and complexity, from simple marginal notations of words one reader found difficult or obscure, to entire interlinear translations of the original text and cross references to similar passages.. A collection of glosses is a glossary.. In linguistics, an interlinear gloss is often placed between a text and its translation when it is important to understand the structure of the language being glossed..

post

POST: Part-of-speech tagging (POS tagging or POST), also called grammatical tagging, is the process of marking up each word in a text as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of speech, in accordance with a set of descriptive tags.
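
To make "definition plus context" concrete, here is a minimal toy sketch in Python. The tiny lexicon, the tag names and the two context rules are all made up for illustration; real taggers are statistical and far more thorough than this.

```python
# Hypothetical toy lexicon: each word lists the tags its definition allows.
LEXICON = {
    "the": ["DET"],
    "a": ["DET"],
    "to": ["PART"],
    "i": ["PRON"],
    "want": ["VERB"],
    "flight": ["NOUN"],
    "book": ["NOUN", "VERB"],  # ambiguous by definition alone
}


def tag(words):
    """Tag each word using its lexicon entry, then context if ambiguous."""
    tags = []
    for word in words:
        # Unknown words are guessed to be nouns (an assumption of this toy).
        candidates = LEXICON.get(word.lower(), ["NOUN"])
        previous = tags[-1] if tags else None
        if len(candidates) == 1:
            choice = candidates[0]
        elif previous == "DET" and "NOUN" in candidates:
            choice = "NOUN"  # context rule: determiner is followed by a noun
        elif previous in ("PART", "PRON") and "VERB" in candidates:
            choice = "VERB"  # context rule: "to"/pronoun is followed by a verb
        else:
            choice = candidates[0]
        tags.append(choice)
    return list(zip(words, tags))
```

Here "book" comes out as a verb in "I want to book a flight" and as a noun in "the book", purely because of the word before it.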

lemma

LEMMA: In linguistics, and particularly in morphology, a lemma or citation form is the canonical form of a lexeme. Lexeme refers to the set of all the forms that have the same meaning, and lemma refers to the particular form that is chosen by convention to represent the lexeme.

This needs review by a doctor!!!

This page needs to be reviewed by a pediatric doctor and also by an eye specialist. The process of refocusing used by this system will put strain on the readers' eyes. Since children are still in the process of physical development, this may be especially DANGEROUS for them. These pages should be removed from the wiki unless and until this is proven to be not harmful to the children.

please refer to
2b8.png
seen at WiXi page.. note the 3rd paragraph: computer users can adjust visibility.. if twext on paper, not.. but only a few complaints so far, in general only from older users
wixi does not intend to make kids go blind, but rather to enable kids (and adults) to adjust the visibility of the interlinear annotation where needed to facilitate language learning.. as the program evolves, it will be able to reduce the visibility (and distraction) of words and phrases that are more likely to be known:
http://twext.cc/pix/TwexterOutputFading.jpg

computer vision syndrome

re: review by a doctor!!! it may also be prudent to remove this entire wiki and while at it, this entire laptop project: The American Optometric Association here says children are especially vulnerable to "Computer Vision Syndrome" because:

  1. kids use electronic display gadgets for long hours
  2. kids have adaptive idea of what's normal (they don't notice if their vision blurs)

so it may be that a fraction of olpc users (with or without wixi) will suffer from eye damage.. note: above link recommends minimizing Computer Vision Syndrome, in part, with REGULAR REFOCUS (on objects more distant than the screen).. meanwhile, "refocusing" is consistently cited, not as a "strain" but rather as an exercise to strengthen eye muscles..

informal tests show that it's not kids who complain about "strain".. those who complain tend to be adults over forty.. "As our eyes age, focusing on small print becomes difficult. Most people over the age of forty start to develop presbyopia. Letters look fuzzy.." please cite links to validate your opinion that it may be "especially DANGEROUS!!!" for kids to refocus to learn a little.. thanks :) Duke 13:05, 2 February 2007 (EST)

update: study shows video games improve vision

twexter software specs

xcroll

xcroll scrolls two windows with one xcrollbar

  • demo by Waqas Hussein good but can get much better
  • another demo by Gerardo Iglesias might be useful

twext won't happen without xcroll.. and xcroll might prove useful for regular mediawiki editing:

http://meta.wikimedia.org/wiki/Image:XcrollWiki.jpg
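
A minimal sketch of the xcroll idea in Python, not taken from either demo above: one scrollbar position is mapped proportionally to a top-line offset in each window. The function name, the line-based units and the proportional mapping are assumptions for illustration only.

```python
def xcroll_offsets(position, pane_lengths, view_height):
    """Map one scrollbar position (0.0 to 1.0) to a top-line offset per pane.

    pane_lengths: total number of lines in each window's content.
    view_height:  number of lines visible at once (assumed equal for all).
    """
    position = min(max(position, 0.0), 1.0)  # clamp out-of-range input
    offsets = []
    for length in pane_lengths:
        # A pane shorter than the view cannot scroll at all.
        scrollable = max(length - view_height, 0)
        offsets.append(round(position * scrollable))
    return offsets
```

With this mapping both windows reach their ends together even when one text is longer than the other, which is usually the case for a text and its translation.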

twext line break

TWEXT LINE BREAK
line break method for a chunk of bitext

http://twext.cc/dev/TwextLineBreakWidthChange.png

twext line breaks enable twext translations 
to fit within variable page or column widths.
problem: 
if chunk of bitext doesn't fit on line, then
it must be broken and continued on next line

solution:
1. find where chunk of bitext must break
2. break both text and twext proportionally
   a. pre-break twext part flushes to the right >
   b. post-break twext part flushes left <

example

if page width is quite
narrow, like twenty-two 
characters, then many
twext line breaks are
needed

 I F   P A G E   W I D T H   I S   Q U I T E
 si  .    ancho de pagina  .       esta muy e-           

 N A R R O W ,   L I K E   T W E N T Y - T W O
-strecho      .    como  .         veinte y dos

 C H A R A C T E R S ,   T H E N   M A N Y
 caracteres           . entonces .        -

 T W E X T   L I N E   B R E A K S   A R E
-muchas quebraditas twextos        .    se -     

 N E E D E D
-necesitan

so when a bitext chunk is broken, 
twext part of first half flushes right, while
twext part of second half flushes left
thus, reader understands chunk is broken
twext line breaks are less necessary when 
formatting lyrics or poems, so it may be
useful to first focus on lyric formats
http://code.twext.com/old/format/c1/lyric_TwEXT.c
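
The two solution steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the method used in the c code linked on this page: the function name break_chunk is made up, widths are counted in characters (real twext is set in a smaller font), the text breaks at the last space that fits, and the twext breaks at the same proportion of its own length, even mid-word.

```python
def break_chunk(text, twext, room):
    """Split one bitext chunk so its first part fits in `room` characters.

    Returns ((text_head, twext_head), (text_tail, twext_tail)).
    """
    if len(text) <= room:
        return (text, twext), ("", "")  # chunk fits: nothing to break
    # Step 1: find where the text must break (last space that still fits).
    cut = text.rfind(" ", 0, room + 1)
    if cut <= 0:
        cut = room  # no usable space: hard break (assumption)
    text_head, text_tail = text[:cut], text[cut:].lstrip()
    # Step 2: break the twext at the same proportion of its length.
    split = round(len(twext) * cut / len(text))
    twext_head = twext[:split].rstrip() + "-"   # hyphen marks the break
    twext_tail = "-" + twext[split:].lstrip()
    return (text_head, twext_head), (text_tail, twext_tail)


if __name__ == "__main__":
    head, tail = break_chunk("is quite narrow", "esta muy estrecho", room=8)
    width = 22
    print(head[0].rjust(width))  # text before the break, at end of its line
    print(head[1].rjust(width))  # 2a: pre-break twext flushes right
    print(tail[0].ljust(width))  # text after the break, on the next line
    print(tail[1].ljust(width))  # 2b: post-break twext flushes left
```

The trailing and leading hyphens play the same role as in the example above: they tell the reader that the chunk continues on the next line.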

flowchart

http://code.twext.com/old/format/TwextLineBreakFlowchart.pdf

code

some c code able to do twext line breaks:
http://code.twext.com/old/format/c1/
http://code.twext.com/old/format/c1/paper_TwEXT.c
http://code.twext.com/old/format/c1/print_TwEXT.c

soc

http://twext.com/google#summer_of_code

2008

this year there's an open Summer_of_Code/2008/Ideas page.. maybe this year twext has a chance.. Alexander http://Gelbukh.com may offer to mentor again

2007

apparently the OLPC team has their hands full, so WiXi for SOC 2007 doesn't seem likely.. maybe the team would change their mind if an outstanding student application arrived.. a highly qualified professor, Alexander Gelbukh, is eager to mentor.. little team risk, maybe great reward..

at little risk WiXi offers a good chance to affirm "learning learning" by enabling kids to construct learning.. wixi may also prove to be of service to both the distributed translation and one dictionary per child projects.. soc 2008? 8P


any support out there?

ja ja..

yaddayadda

wixi.cc will be back up soon with related read.fm wikis for language learners.. on distant roadmap is a code_wiki

ambition did a code wiki here

SpeechTwext WixiFication of Widgets

Twext sounds great! Need to combine with (audio) speech and subtitle animation of widgets, tooltips and context sensitive help in Sugar UI and activities.

is that positive feedback? thanks! (here's a little more.. synxi w/ speech, image and subtle animation would be quite interesting to explore.. pending delivery of semi-functional multilingual wixi may spark more participatory development of wixi interface..)

A special form of twext is marked up phonetic transcription. Not normally displayed to the "reader" (only the plain text of two languages is shown). But corresponding audio speech for either language can be played through speakers or ear phones.

phonetic twext may be relatively easy to offer, but association with actual audio pronunciation will probably produce more language learning.. phonetic transcription, at least here in mexico, doesn't seem to produce intelligible pronunciation.. animation of text (and even twext) in fine-grain synchronization w/ crystal clear audio pronunciation could be more productive for learners..


Understood. I did not make it clear but I was suggesting (below) crystal clear audio pronunciation entered by audio annotation using microphone, combined with "karaoke" style dynamic highlighting of the phonemes. Synchronizing this properly would benefit from phonetic transcription markup. Authoring the synchronization of timed text could be done while listening to the audio and pressing space bar for each phoneme, but only if the phonemes have been tagged in the phonetic transcription text so that the text highlight moves as you press the spacebar. Otherwise it is too hard to synchronize the karaoke timed text, as purely algorithmic recognition would need manual correction in most vernacular languages. BTW MELPe codecs permit crystal clear audio at 2400bps, i.e. about 1MB per hour (2400 / 8 x 3600 = 1,080,000 bytes), and intelligible audio at 600bps.
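
A minimal Python sketch of the spacebar authoring idea, assuming the phonemes are already tagged and that each key press has been recorded as a time in seconds. The function name build_cues and the (start, stop, phoneme) cue shape are illustrative only and are not taken from 3GPP Timed Text or any other format.

```python
def build_cues(phonemes, press_times, end_time):
    """Pair each tagged phoneme with the interval between key presses.

    phonemes:    tagged phonemes in reading order.
    press_times: time (seconds) of the space bar press for each phoneme.
    end_time:    time at which the last phoneme's highlight should stop.
    """
    if len(phonemes) != len(press_times):
        raise ValueError("need exactly one key press per phoneme")
    cues = []
    for i, (phoneme, start) in enumerate(zip(phonemes, press_times)):
        # A highlight lasts until the next press, or until the audio ends.
        stop = press_times[i + 1] if i + 1 < len(press_times) else end_time
        cues.append((start, stop, phoneme))
    return cues
```

The resulting list of cues is what a karaoke-style highlighter would step through during playback. The one-press-per-phoneme check reflects the point above: without tagged phonemes there is nothing to attach each press to.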

Here are some references to clarify about "timed text".

GPAC 3gpp Timed Text dynamic highlighting (karaoke). See also wikipedia link to 3gpp doc.

Ogg Writ for subtitles, song lyrics and transcripts (work in progress).

Text is spoken and the syllables, phonemes and other linguistic features of the written text are highlighted synchronized with the audio.

since twext is formatted from vertically arrayed variable chunks, horizontal space per chunk could host markup info to add meaning to specific chunks in particular context..

Combine with i18n, l10n and a11y.

Integrate with wixified widget libraries and the OLPC "view source" Develop activity. Icons and widget labels should always have accompanying SpeechTwext that is played along with the twext displayed in tooltips and context sensitive help.

associated audio will be helpful.. also, visual imagery can add instructive context.. commercial products, e.g. rosetta stone and transparent language, rely heavily on visuals to create instructive context for sound and text

Ideal for OLPC target of youngest primary school children (5 to 6 years old) in least developed areas without electrification. These are often being taught literacy in a regional lingua franca or national standard language while currently only speaking (not reading) a local vernacular dialect. Sugar UI aims to be usable by pre-literate children and needs to combine speech with multi-lingual text.

again.. visual, audio will be very helpful

OLPC needs to enable children and educators to annotate software with unofficial audio in both local vernacular dialect and the educational language the children are being taught and enable listening to it while reading parallel twext. Not just in content documents, but also in the textual elements of the Sugar UI and activities themselves.

tight integration of language learning tools w/ all sugar activities would provide useful hands-on context.. e.g. if student is comfortable w/ spanish UI for some activities, switching the UI to english might create immediate practical context that can produce language learning.. if confused, switch UI back to known lang.. toggle back and forth to scaffold toward target lang..

Similar techniques to accessibility (a11y) for the blind combined with wixification.

Young children can add audio annotation before they can write. So even young children can help provide l10n translations between speech provided in the educational language and speech in the local vernacular as well as between texts in both languages. Educators and older children can subsequently add text, phonetic marked up text and improve to incorporate in the redistributed l10n.

Annotations initially appear in journal and can be reviewed and improved incrementally while distributed up towards l10n projects and back down as software updates.

Utterances collected by distributed speech annotation process can be used to improve speech engines in both local vernacular and national language.

local vernacular gets interesting in multilingual context.. here in mexico it's common to hear "brodder" (brother), "don't worry be happy" and the uber reaction "oh my god!".. possible scenario is olpc kids cross-fertilizing languages and enriching vernaculars.. priority focus may be to collect time-tested words of the wise in various cultures.. old sayings like "reap what you sow" etc could be interesting for multiculturalizing little language learners..

Volume control fade settings could do something similar to the visual bifocal fading in twext.

all kindsa room to play.. if headphones, text language could lead right w/ fading translation left.. probably has been tried before.. the key is to give kids the tools to play.. they are quite likely to inject fun into the learning process

OLPC sharing of activities across the mesh, with built in voice and text chat for children to interact with each other should facilitate this combination.

there's plenty of room to grow.. basic twexter available at http://sf.net/projects/twexter .. simple multilingual wixi will soon be testable.. your ideas are plentiful and appreciated.. please consider helping to deliver some actual tools :)

Thanks for the responses. Unfortunately I won't be able to contribute to development of tools but hope to return and contribute some more ideas eventually. One thing I'll add now to clarify my stress on "speech" and references to vernacular. I am not just thinking of local dialects and pronunciations like Mexican Spanish or Iraqi Arabic but the more difficult problem that XO is designed to be usable by 5 or 6 year old children in remote villages with no electricity and perhaps no school (although of course many initial deployments will be simpler). In many areas these children that do not have electricity or schools also speak a different language from the official educational languages used by the country concerned.

vernacular

The educational language in which primary school is taught might be for example in Nigeria Hausa or Yoruba and these are the languages the XO software would be localized for, but there are several hundred local vernacular languages spoken in Nigeria. The first two or more years of primary education, if it is extended to such areas at all, would need to teach literacy in the vernacular as preliminary or at least together with literacy in the educational language for which the XO might have some educational content available. This implies teaching of Hausa for example, using one of several hundred vernacular languages as the language in which Hausa is taught. Can only be done with speech since child is still illiterate. So XO needs strong capabilities to combine speech with text pervasively and in particular for children and educators to add speech annotations themselves to such UI elements as tooltips or balloon help as well as to e-book texts that are only in the educational language.