Projects/Automatic translation software
!!! Help wanted !!!
If you read on and think this is a great project, you've got a bit of spare time, and would like to help out in some way, we would love to hear from you.
We're especially looking for software developers and linguists, but will also take on general dogsbodies, as we can all make our own unique contribution.
Please email hieuhoang (at) gmail in the first instance for more info.
Mission Statement and Objectives
We are creating a language translation application for the OLPC using the latest cutting-edge tools and systems taken from the automatic translation research community. Specifically, we are developing the application based on the Moses toolkit as the core component.
We are concentrating our effort on developing a system for Quechua-Spanish translation to be deployed in Peru. However, the application framework and core will support any language pair.
Moses is a cutting-edge, open-source statistical machine translation (SMT) system. It is widely used for academic research into automatic translation. Its reliability and maturity have also helped it gain traction outside academia as a core component in commercial translation systems.
The challenge of creating an SMT system on the OLPC is immense.
Firstly, it will be a challenge to run a complex translation system on a resource-constrained device like the OLPC. We intend to use our expertise as the creators and developers of the Moses system to reduce its resource requirements, thus enabling the application to run with acceptable performance.
Secondly, a front-end graphical user interface (GUI) needs to be developed. The Moses system was designed to work with a GUI; however, only a command-line interface has so far been implemented for the research-oriented user base.
Lastly, the parallel text corpora from which the translation ‘dictionary’ for each language pair is created need to be collated. The collection of such corpora is ad hoc and may differ from country to country; however, such corpora are usually created from the output of governmental or mass media organisations. With the help of these resources, we will be able to build translation systems for language pairs in many developing countries which are poorly served by commercial translation systems. The parallel corpora will be collated in collaboration with other researchers, volunteers and other interested parties.
Aside from the philanthropic aspects of the proposal, we hope that this project will increase interest and research into translation of minority languages in developing countries.
Project Ideas
Running a resource-intensive application such as the Moses decoder is a challenge even on large servers in a well-funded academic institution.
We have outlined some ideas below for developing a system that enables school children in developing countries, equipped with a low-resource laptop, to use our translation software.
1. Create a client-server application which will run the resource-intensive application on a server. Clients will be a Web browser or a custom Python app.
   - Skills required: Python, Apache, C++
2. Fork the decoder source code to enable it to run on the OLPC. Minimize memory consumption and discard code not likely to be used by the application.
   - Skills required: C++
3. Minimize the work the decoder has to do by using a greedy search instead of a beam search, or by using a very tight beam and other thresholds.
   - Skills required: C++, statistical machine translation
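The trade-off behind idea 3 can be illustrated with a toy example. The following is only a minimal sketch, assuming a monotone, word-by-word model; the tokens, probabilities and the function name `beam_search` are invented for illustration and are not taken from Moses, whose decoder is far more elaborate. A beam size of 1 is the greedy search.

```python
import math

# Hypothetical toy phrase table: source token -> [(target token, log-probability)].
# All entries are invented placeholders, not real translation data.
PHRASE_TABLE = {
    "a": [("x", math.log(0.6)), ("y", math.log(0.4))],
    "b": [("p", math.log(0.5)), ("q", math.log(0.5))],
}

# Hypothetical toy bigram language model: (previous, current) -> log-probability.
BIGRAM = {
    ("<s>", "x"): math.log(0.5),
    ("<s>", "y"): math.log(0.5),
    ("x", "p"): math.log(0.1),
    ("x", "q"): math.log(0.05),
    ("y", "p"): math.log(0.9),
    ("y", "q"): math.log(0.1),
}
UNSEEN = math.log(0.01)  # assumed score for bigrams missing from the table


def beam_search(source, beam_size):
    """Translate left to right, keeping only the best `beam_size` hypotheses."""
    beam = [(0.0, ["<s>"])]  # each hypothesis is (log-score, target tokens so far)
    for word in source:
        candidates = []
        for score, words in beam:
            for target, tm_score in PHRASE_TABLE[word]:
                lm_score = BIGRAM.get((words[-1], target), UNSEEN)
                candidates.append((score + tm_score + lm_score, words + [target]))
        # Prune: a smaller beam means less work but a higher risk of search errors.
        candidates.sort(key=lambda hyp: hyp[0], reverse=True)
        beam = candidates[:beam_size]
    best_score, best_words = beam[0]
    return best_score, best_words[1:]  # drop the start-of-sentence marker


greedy_score, greedy_output = beam_search(["a", "b"], beam_size=1)
beam_score, beam_output = beam_search(["a", "b"], beam_size=2)
print(greedy_output, beam_output)
```

In this toy model the greedy search commits to the locally best first token and ends with a lower-scoring output than the beam of size 2, which is the quality cost that has to be weighed against the savings in time and memory.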
Other ideas and to-do's:
4. Different language pairs
5. Speech-to-speech translation
6. Integrating Optical Character Recognition (OCR) with translation
7. Enable sharing of user vocabulary via the OLPC Mesh network
8. Distributed training of data on the OLPC
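As a rough illustration of the client-server approach in idea 1, here is a minimal sketch of the server side. It assumes Python's built-in `http.server` as a stand-in for Apache, a JSON request format of our own invention, and a placeholder `translate_stub` function where a call to the real decoder would go; none of these names come from Moses.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def translate_stub(text):
    """Placeholder for the real decoder call; it only upper-cases the input."""
    return text.upper()


def handle_request(body, translate=translate_stub):
    """Turn a JSON request body (bytes) into a JSON response body (bytes)."""
    request = json.loads(body.decode("utf-8"))
    result = translate(request["text"])
    return json.dumps({"translation": result}).encode("utf-8")


class TranslationHandler(BaseHTTPRequestHandler):
    """Accepts POST requests such as {"text": "..."} from the laptop client."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = handle_request(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def make_server(port=8080):
    """Build the server without starting it; the port is an arbitrary choice."""
    return HTTPServer(("", port), TranslationHandler)
```

Calling `make_server().serve_forever()` would start the server. Keeping `handle_request` separate from the HTTP handler means the translation logic can be tested without a network connection, and the heavy decoder stays on the server while the laptop only sends and receives short text.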
To find out more about the Moses automatic translation toolkit, check out
http://www.statmt.org/moses/
and sign up to the mailing list
Moses Support
Progress
12th March 2009
OLPC laptops received! They're beautiful. And small.
What we've found out about the hardware:
430MHz CPU (AMD Geode x86 processor)
237MB RAM
1GB flash disk
Linux OS
The UI feels a bit sluggish. I'm apprehensive about putting a resource-intensive decoder onto the machine.
14th March 2009
Learning more about the OLPC hardware & software, and starting to collect the tools needed to begin development.
A possibly useful bit of technology is the ability to emulate the system on your own computer so that development & testing can be done without the hardware.
Found pre-packaged VMware virtual machines for different versions of the OLPC software here:
http://dev.laptop.org/pub/virtualbox/
After downloading and trying some of them, the version that matches our laptops turns out to be build 656.
Tried installing gcc, g++, automake & svn onto the virtual machine to enable building Moses, but g++ wouldn't install.
However, compiling on a normal Linux desktop (32-bit Fedora) and transferring the executable to the OLPC seems to work fine! Yippee!
Let's go to work.