Projects/Automatic translation software

From OLPC
Latest revision as of 19:31, 19 December 2010

In a word

This project aims to develop machine translation software for the OLPC. For more details see the Mission Statement and Objectives. We already have translations running on the OLPC. What could we do from there to make it a usable tool? What could *you* do? See Project Ideas below. This is becoming real... and very exciting!

Getting started

So you want to see how automatic translation on the OLPC works? You've come to the right place. We've put together a working demo of the Moses decoder ('translator') for you to try out. It's a command-line program, so you have to be familiar with Linux. But hey, the OLPC is a Linux box out of the box!

More details of the Moses toolkit can be found here:

  http://www.statmt.org/moses

Steps:

 1. Run a virtual machine of the OLPC (if you have an actual OLPC, skip this step). I personally use VirtualBox (by Sun/Oracle; it's free and open source)
        http://www.virtualbox.org/
    There are prepackaged VM files located here
        http://dev.laptop.org/pub/virtualbox/
    I use version 656 because I have an OLPC of the same version, but take your pick
 2. Download the model file onto the OLPC
        http://groups.inf.ed.ac.uk/hoang/hieu/olpc/en-ht.tgz
    and the decoder
        http://groups.inf.ed.ac.uk/hoang/hieu/olpc/moses
    This may be a bit tricky as the Web browser on the OLPC takes some getting used to. You can install wget instead
        su root
        yum install wget
 3. Make the moses decoder executable
        chmod +x moses
 4. Unpack the model file and cd into the directory
        tar zxf en-ht.tgz
        cd model
 5. Run the decoder and wait about 30 seconds
        ../moses -f moses.ini
    until the prompt:
        Created input-output object : [1.000] seconds
 6. Type in some English and watch it translate (here, into Haitian Creole)
          i am a doctor
          Translating: i am a doctor 
          
          reading bin ttable
          size of OFF_T 8
          binary phrasefile loaded, default OFF_T: -1
          Collecting options took 0.200 seconds
          Search took 0.280 seconds
          BEST TRANSLATION: mwen se yon doktè [1111]  [total=-0.520] <<0.000, -4.000, 0.000, -0.511, 0.000, 0.000, 0.000, 0.000, 0.000, -16.452, 0.000, -5.911, -0.693, -3.622, 1.000>>
          mwen se yon doktè 
          Translation took 0.280 seconds
          Finished translating
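
A front end will eventually need to pick the translation out of this log. The following is a minimal Python sketch of how that could be done. The line format is assumed from the single example above, the function name `parse_best_translation` is hypothetical, and the reading of the bracketed `[1111]` field as a source-coverage bitmap is an assumption.

```python
import re

# Pattern for the "BEST TRANSLATION" log line, inferred from the demo output
# above. The real decoder may print other variants; this is an assumption.
BEST_RE = re.compile(
    r"^BEST TRANSLATION:\s*(?P<text>.*?)\s*"
    r"\[(?P<coverage>[01]+)\]\s*"
    r"\[total=(?P<total>-?\d+(?:\.\d+)?)\]\s*"
    r"<<(?P<scores>.*)>>\s*$"
)


def parse_best_translation(line):
    """Return the parsed fields of a BEST TRANSLATION line, or None."""
    match = BEST_RE.match(line.strip())
    if match is None:
        return None
    return {
        # The translated sentence itself.
        "text": match.group("text"),
        # Assumed to mark which source words were translated.
        "coverage": match.group("coverage"),
        # Overall model score of the translation.
        "total": float(match.group("total")),
        # Individual feature scores, in the order the decoder printed them.
        "scores": [float(s) for s in match.group("scores").split(",")],
    }
```

Feeding it the `BEST TRANSLATION` line from the session above should give back the text `mwen se yon doktè` along with the total score and the list of feature scores. Any other log line returns `None`.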


If you read on and think this is a great project, you've got a bit of spare time, and would like to help out in some way, we would love to hear from you.

We're especially looking for software developers and linguists, but general dogsbodies are welcome too, as we can all make our own contribution.

Please email hieuhoang (at) gmail in the first instance for more info.

Mission Statement and Objectives

We are creating a language translation application for the OLPC using the latest cutting-edge tools and systems taken from the automatic translation research community. Specifically, we are developing the application based on the Moses toolkit as the core component.

Unlike many automatic translation approaches, the toolkit supports any language pair. Just add data! For example, our friends at Edinburgh University have used the same toolkit to translate 11 European language pairs.

  http://www.statmt.org/matrix/

Moses is a cutting-edge, open-source statistical machine translation (SMT) system. It is a widely used tool for academic research into automatic translation. Its reliability and maturity have also helped it gain traction outside academia as a core component in commercial translation systems.

The challenge of creating an SMT system on the OLPC is immense.

Firstly, it will be a challenge to run a complex translation system on a resource-constrained device like the OLPC. We intend to use our expertise as the creators and developers of the Moses system to reduce the resource requirements, thus enabling the application to run with acceptable performance.

Secondly, a front-end graphical user interface (GUI) needs to be developed. The Moses system was designed to work with a GUI; however, only a command-line interface has so far been implemented for its research-oriented user base.

Lastly, parallel text corpora need to be collated, from which the translation ‘dictionary’ can be created for each language pair. The collection of such corpora is ad hoc and may differ from country to country, but they are usually created from the output of governmental or mass media organisations. With the help of these resources, we will be able to build translation systems for language pairs in many developing countries which are poorly served by commercial translation systems. The parallel corpora will be collated in collaboration with other researchers, volunteers and other interested parties.

Aside from the philanthropic aspects of the proposal, we hope that this project will increase interest and research into translation of minority languages in developing countries.

Project Ideas

Running a resource intensive application such as the Moses decoder is a challenge even on large servers in a well-funded academic institution.

We have outlined some ideas below for developing a system that lets school children in developing countries, equipped with a low-resource laptop, use our translation software.

 1. Create a client-server application which will run the resource-intensive application on a server. Clients will be a Web browser or a custom Python app.
     - Skills required: Python, Apache, C++
 2. Fork the decoder source code to enable it to run on the OLPC. Minimize memory consumption, discard code not likely to be used by the application. 
     - Skills required: C++
 3. Minimize the work the decoder has to do by using a greedy search instead of a beam search, or by using a very tight beam and other thresholds.
     - Skills required: C++, statistical machine translation
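
To make idea 3 concrete, here is a toy Python sketch of stack decoding with a configurable beam. It is not the Moses search: it translates left to right without reordering, and the phrase table, the alternative entry `m`, the bigram bonus and all scores are invented for illustration. It only shows the trade-off: a beam of 1 does the least work but can throw away a hypothesis that would have won later.

```python
import heapq

# Toy model, invented for illustration only (not taken from the real en-ht model).
TOY_PHRASES = {
    ("i",): [(("mwen",), -0.5), (("m",), -0.4)],
    ("am",): [(("se",), -0.3)],
    ("a",): [(("yon",), -0.2)],
    ("doctor",): [(("doktè",), -0.1)],
}
# Stand-in for a language model: extra score for some adjacent target words.
TOY_BIGRAMS = {("mwen", "se"): 0.5}


def decode(source, phrases, bigrams, beam_size):
    """Monotone stack decoding; stacks[i] holds hypotheses covering i source words."""
    n = len(source)
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append((0.0, ()))  # (score, target words so far)
    for i in range(n):
        # Pruning step: keep only the beam_size best hypotheses in this stack.
        survivors = heapq.nlargest(beam_size, stacks[i])
        for score, output in survivors:
            # Try every source phrase that starts at position i.
            for j in range(i + 1, n + 1):
                for target, logprob in phrases.get(tuple(source[i:j]), []):
                    new_score = score + logprob
                    prev = output[-1] if output else "<s>"
                    for word in target:
                        new_score += bigrams.get((prev, word), 0.0)
                        prev = word
                    stacks[j].append((new_score, output + tuple(target)))
    if not stacks[n]:
        return None  # no way to cover the whole sentence
    best_score, best_output = max(stacks[n])
    return " ".join(best_output), best_score
```

With these toy numbers, `decode("i am a doctor".split(), TOY_PHRASES, TOY_BIGRAMS, beam_size=1)` keeps only the locally cheaper `m` after the first word, while `beam_size=2` also keeps `mwen` and ends up with the higher-scoring sentence. On a real model the question would be how tight the beam can get before translation quality drops too far.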

Other ideas and to-dos:

 4. Different language pairs
 5. Speech-to-speech translation
 6. Integrating Optical Character Recognition (OCR) with translation
 7. Enable sharing of user vocabulary via the OLPC Mesh network
 8. Distributed training of data on the OLPC

To find out more about the Moses automatic translation toolkit, check out

  http://www.statmt.org/moses/

and sign up to the mailing list

  Moses Support: http://mailman.mit.edu/mailman/listinfo/moses-support

Progress

12th March 2009

OLPC laptops received! They're beautiful. And small.

DSCF6759.JPG

DSCF6767.JPG

What we've found out about the hardware:

 430 MHz CPU (AMD Geode x86 processor)
 237MB RAM
 1GB flash disk
 Linux OS

The UI feels a bit sluggish. I'm apprehensive about putting a resource-intensive decoder onto the machine.

3rd May 2009

Finally got some time to look into developing a GUI for the OLPC. The OLPC uses something called Sugar, which I think is its desktop environment running on top of X Windows. The preferred language is Python, using a library called PyGTK, which looks like Tcl/Tk from way back. After a few hours of googling and mashing up example programs, I managed to knock up a rudimentary GUI that should be able to function as the basis for the rest of the development. It's pretty basic, but has most of what's needed, development-wise.

Ui1.JPG

It's able to call the executable with the required ini file. However, the input and output of the decoder are completely detached from the GUI. The next step is to integrate the decoder and the GUI. The suggestions so far have been to use TCP/IP ports or to wrap the decoder in a SOAP server. However, I've just been thinking about this and may be able to use named pipes. This was done a while ago but had reliability issues. Will look at this in more detail next time.
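
For comparison, here is a minimal Python sketch of the pipe idea. It uses the anonymous pipes that `subprocess` sets up rather than named pipes, and a tiny stand-in child process instead of the real decoder. The class name `LineProcess` and the `FAKE_DECODER` command are hypothetical. The real decoder also prints log lines and may buffer its output when writing to a pipe, so a real wrapper would need to deal with both. That may be one source of the reliability issues mentioned above.

```python
import subprocess
import sys


class LineProcess:
    """Talk to a child process that reads one line and answers with one line."""

    def __init__(self, command):
        self.proc = subprocess.Popen(
            command,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,   # exchange str instead of bytes
            bufsize=1,   # line-buffered on our side of the pipes
        )

    def request(self, line):
        # Send one line, flush so the child sees it, then wait for one reply line.
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        # Closing stdin signals end of input; then wait for the child to exit.
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


# Stand-in for the decoder: upper-cases each input line and flushes immediately.
FAKE_DECODER = [
    sys.executable, "-u", "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n",
]
```

A GUI callback could then hold one `LineProcess(FAKE_DECODER)` open and call `request("i am a doctor")` for each sentence, which avoids paying the model loading time on every translation.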

31st May 2009

Hooked the Moses lib into a socket server so we can request translations client-server style. Hopefully, this will be a reliable and flexible way of calling the decoder from the front end. It will also allow us to move the server part to the OLPC school server should the laptop be too feeble to handle decoding for realistic models.
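
The project's server code is not shown on this page. As a rough Python sketch of the same client-server shape, assuming a simple one-sentence-per-line protocol over TCP: `fake_translate` is a stand-in for the call into the decoder library, and all names here are hypothetical.

```python
import socket
import socketserver
import threading


def fake_translate(sentence):
    """Stand-in for the decoder call; it just reverses the word order."""
    return " ".join(reversed(sentence.split()))


class TranslationHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # One sentence per line in, one translation per line out,
        # until the client closes the connection.
        for raw in self.rfile:
            sentence = raw.decode("utf-8").strip()
            reply = fake_translate(sentence) + "\n"
            self.wfile.write(reply.encode("utf-8"))


def start_server(host="127.0.0.1", port=0):
    """Start the server in a background thread; port 0 lets the OS pick a free port."""
    server = socketserver.ThreadingTCPServer((host, port), TranslationHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def translate_remote(address, sentence):
    """Client side: send one sentence and read back one line."""
    with socket.create_connection(address) as conn:
        conn.sendall((sentence + "\n").encode("utf-8"))
        with conn.makefile("r", encoding="utf-8") as reply:
            return reply.readline().rstrip("\n")
```

Because the client only needs a host and port, pointing `translate_remote` at a school server instead of `127.0.0.1` would need no change on the laptop side, which is the flexibility described above.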

Integrated a tokenizer borrowed from Hoang Industries Conglomerates.
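
The borrowed tokenizer itself is not shown here. As a rough idea of what this step does, a minimal Python sketch could split punctuation off from words and lower-case the text. The lower-casing is an assumption based on the lower-case demo input, and the function name `tokenize` is hypothetical.

```python
import re

# A word (optionally with internal apostrophes), or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def tokenize(text):
    """Lower-case the text and return its words and punctuation as a list of tokens."""
    return TOKEN_RE.findall(text.lower())
```

For example, `tokenize("I am a doctor.")` separates the final full stop into its own token, so the decoder sees the same word forms as in its training data.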

Picture 1.jpg

12th July 2009

Finally got the Python GUI to talk to the Moses server! The end of the beginning.

OLPC-trans-12-7-09.png

30th April 2010

After a long break, came back to the project with a trained model for Haitian Creole and a refreshed webpage.

Credits

Lovingly created and nurtured by Hieu Hoang, with support from his office mates Loic Dugast and Abhishek Arun, and with the help and encouragement of the Edinburgh Uni SMT team.