Your Voice on XO
This is a proposal for a new activity for the XO that would advance localization efforts in TTS development and promote the involvement of the local community. "Your voice on XO" would be a long-term, community-based project to build, or further develop, a synthetic voice for the language used locally (for more on synthetic-voice building, go here, and here).
The overall activity would proceed as follows. A teacher in a community supported by OLPC whose language or dialect is not available via TTS (or, for that matter, who uses a regional or dialectal form of a language that is already supported, e.g., French in West Africa or Haiti) would work with a child, a teacher, or community volunteers to build a synthetic voice using Festival. Such an undertaking would require a considerable amount of recorded speech and textual corpora for any given language. Given access to the Internet (particularly to reasonably large multilingual corpora, such as those at Wikipedia) and some effort, however, this activity could become a reasonably accessible endeavor for communities supported by OLPC.
 Use-case scenario
During the three-month GSoC 2008, I plan to complete the development that will integrate the eSpeak editor component into Sugar, thereby allowing students, educators and other members of the community to manipulate phonetic data. This stage would be followed by development of the voice- and textual-data capture and management components that are necessary in order to add a new language on the XO. The editor component can make for an entertaining classroom activity, since one can manipulate the sounds in a given set of words or sentences according to pitch and pitch range, producing interesting sounds and, ultimately, useful phonetic data as well.
- The user launches the “Your voice on XO” activity and is prompted to choose from a list of supported languages, as well as a voice (male, female, old, young, regional/dialectal, etc.) from two or more options (where available). Alternatively, the user may choose the generic option “add a new language.”
- The language list would be limited to a couple of choices, based on regional usage and given data storage constraints. English would likely be supported by default.
- Given a supported language, the user is presented with a sample sentence recording, which is read aloud once during startup, e.g., “Hi, where should we begin?”
- The user launches the Voice Editor application from a menu of choices and is presented with a window with a set of tabs containing frames, each consisting of a short-time frequency spectrum covering the period of one cycle at the sound's pitch.
- A “Spect” tab, located to the side of the main window, displays modifiable information about the current frame and sequence—e.g., formants (width, height, frequency), amp, sequence length in time (ms), etc.
- Using keyboard shortcuts such as <Shift> + <directional arrows>, the user can modify elements of the sequence such as pitch, frequency, etc.
- This can be done for both the Formants and Prosody editors, where the former represents the vowel and consonant sounds of a given language, and the latter the aggregate phoneme data in a given sequence, such as the introductory sequence (i.e., “Hi, where should we begin?”).
- Once the user is satisfied with the composed sounds, they can choose to save the result in a format of their choice.
- By default, the eSpeak editor uses the WAV format, though the resulting file could then be compressed automatically into a more compact, lossless format.
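As a rough illustration of the "hear a sentence, change its pitch, save it as WAV" steps in the list above, the following Python sketch builds a call to the espeak command-line program. It is not the eSpeak editor itself and not part of any existing Sugar activity. The helper names espeak_command and render, and the output file name intro.wav, are assumptions made for this example; the -v, -p, -s and -w options are standard espeak flags.

```python
import shutil
import subprocess


def espeak_command(text, wav_path, voice="en", pitch=50, speed=160):
    """Build the argument list that asks espeak to render `text` into a WAV file.

    Hypothetical helper for illustration; only the espeak flags are real.
    """
    if not 0 <= pitch <= 99:
        raise ValueError("espeak pitch must be between 0 and 99")
    return [
        "espeak",
        "-v", voice,        # voice (language) name
        "-p", str(pitch),   # pitch adjustment, 0 to 99
        "-s", str(speed),   # speaking rate in words per minute
        "-w", wav_path,     # write the audio to this WAV file instead of playing it
        text,
    ]


def render(text, wav_path, **options):
    """Run espeak when it is installed; return False when it is not available."""
    if shutil.which("espeak") is None:
        return False
    subprocess.run(espeak_command(text, wav_path, **options), check=True)
    return True


if __name__ == "__main__":
    # Raise the pitch of the introductory sentence and save it as a WAV file.
    command = espeak_command("Hi, where should we begin?", "intro.wav", pitch=70)
    print(command)
    render("Hi, where should we begin?", "intro.wav", pitch=70)
```

Keeping the command construction separate from the call that runs it makes the pitch and speed handling easy to check on a machine where espeak is not installed.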
This use-case scenario focuses only on the Voice Editor component of eSpeak. The end goal would be to combine this component with the voice recorder and a front end that integrates the other components managing the voice recordings and text corpora needed to build or improve a synthetic voice.
 Implementation details
This activity would entail integrating the voice-building capabilities of eSpeak, or perhaps Festival, into Sugar on the XO, and working to facilitate synthetic-voice building in a classroom or community setting (for an overall view of how the voice-building process might proceed, go here). This effort would focus on a GUI that is easy to use, as well as on integrating the activity with all TTS and pertinent language-related activities, e.g., TalknType, Orca, and E-Book Reader. As such, the use of a high-level, device-independent platform such as Speech Dispatcher would be ideal. In fact, speechd supports Festival, eSpeak, and several other TTS engines, and is currently in use at OLPC.
The activity would consist of components to facilitate voice recordings, phonetic data manipulation and calibration, and dictionary file management. The phonetic and dictionary components would follow the overall scheme laid out by eSpeak for adding a language. In particular, the phonetic data manipulation component might be built around an interface that mimics the eSpeak editor interface. Finally, the voice recording component might exploit existing resources, such as those offered by the Record activity.
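To make the dictionary file management component more concrete, the sketch below creates starter files for a new language. It assumes the layout described in eSpeak's documentation on adding a language: a rules file and an exceptions list in dictsource, plus a voice file holding name and language attributes. That layout should be checked against the eSpeak version in use. The function name scaffold_language and the example language code are illustrative only.

```python
from pathlib import Path
import tempfile


def scaffold_language(root, code, name):
    """Create starter files for a new language; return the paths that were created.

    Hypothetical helper: the directory layout is an assumption based on
    eSpeak's "adding a language" documentation. Existing files are left alone.
    """
    root = Path(root)
    templates = {
        # Spelling-to-phoneme rules for the language.
        root / "dictsource" / f"{code}_rules": f"// Pronunciation rules for {name}\n",
        # Words whose pronunciation is an exception to the rules.
        root / "dictsource" / f"{code}_list": f"// Exception words for {name}\n",
        # Voice file that names the voice and ties it to the language code.
        root / "espeak-data" / "voices" / code: f"name {name}\nlanguage {code}\n",
    }
    created = []
    for path, text in templates.items():
        if path.exists():
            continue  # never overwrite work a community has already done
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")
        created.append(path)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        for path in scaffold_language(workdir, "ht", "haitian-creole"):
            print(path.relative_to(workdir))
```

An activity front end could call such a helper when the user picks “add a new language,” and then open the created files in the relevant editors.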
 Long-term impact
Ultimately, this project seeks not only to improve but also to localize the voice quality of existing languages via the efforts of the local community, to increase the phoneme data that helps improve speech synthesis quality, and to increase involvement in the community via OLPC. Localization efforts include the addition of previously unsupported languages and dialects to the XO's linguistic repertoire, the fine-tuning of languages already present, and the “naturalization” of existing TTS languages. This last item refers particularly to languages with a wide geographic spread and/or considerably diverse forms, such as Spanish, English, French, Arabic and Chinese. Despite their diverse “spoken” forms, such languages would benefit from the existence of a relatively uniform orthographic form, as is the case with the examples provided. On the other hand, TTS efforts should aim to bring in new languages and improve existing ones via the addition of new phonetic data provided by the linguistic community.
Finally, the involvement of the community would be fostered not only directly, but also indirectly via this new activity. One example that comes to mind concerns traditionally oral languages, such as Amerindian languages, but also dialectal forms of a given language. In these instances, textual corpora would come primarily from research efforts by trained linguists. Such scenarios would obviously necessitate involvement on the part of the OLPC program, or perhaps of local NGOs involved in projects with the affected community. Indeed, the benefits of an activity such as “Your voice on XO” would be numerous and far-reaching, not only given the direct impact that it would have on all of the activities that make use of TTS, but especially given the potential of such localization efforts to increase the involvement and education of the community.
One of the main concerns in this endeavor is the procurement of sufficient storage space for audio recordings. It may not be so difficult to get around the limitations imposed by the storage capacity of the XO. One alternative that comes to mind is some form of external storage, ideally a solid-state drive (SSD). Of course, more affordable and integrated solutions may be preferable, especially given the high cost per unit of storage of SSD technology. One solution might involve the recent efforts to introduce school servers with increased, community storage space via the OLPC program.
At the same time, given the efforts necessary to develop a synthetic voice (speech recordings, corpora building, and overall project management), it is easy to see how such an activity would require a considerable degree of planning based around community-driven resources. These include securing a suitable recording environment, the time and commitment required of a teacher, mature student, or group of individuals coordinating the activity, and the involvement of any other interested members of the community. With this in mind, it is conceivable that such efforts to localize the XO's TTS resources, even on a national scale, would be limited to, and based largely on, the interest of a given community. As such, the added cost of a simple, portable storage solution beyond what is offered by the XO would not be considerable when seen on a regional or national level. Finally, the introduction of external storage space can be seen as an added value not just for language-related localization efforts, but also in other fruitful realms, particularly as concerns educational media.
Indeed, video, sound and photographic media would benefit considerably from an expansion in storage space, and such media would aid immeasurably in efforts to foster a higher level of interaction from, and education for, end users and, most importantly, the communities of users impacted by OLPC's mission.
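To give a sense of scale for the storage concern raised above, the following back-of-the-envelope Python sketch estimates the size of uncompressed speech recordings. The recording format (16 kHz, 16-bit, mono) and the numbers of hours are assumptions chosen for illustration, not requirements of Festival or eSpeak. The 1 GiB figure corresponds to the flash storage of the XO-1.

```python
XO1_FLASH_BYTES = 1024 ** 3  # the XO-1 ships with 1 GiB of flash storage


def wav_bytes(seconds, sample_rate=16000, bits_per_sample=16, channels=1):
    """Size of uncompressed PCM audio in bytes, ignoring the small WAV header.

    The default format is an assumption made for this estimate.
    """
    return int(seconds * sample_rate * (bits_per_sample // 8) * channels)


if __name__ == "__main__":
    for hours in (1, 5, 10):  # illustrative amounts of recorded speech
        size = wav_bytes(hours * 3600)
        print(f"{hours:>2} h of speech: {size / 2**20:7.1f} MiB "
              f"({size / XO1_FLASH_BYTES:.0%} of 1 GiB)")
```

Under these assumptions one hour of speech occupies 115,200,000 bytes (about 110 MiB), so a little over nine hours of raw recordings would equal the XO-1's entire flash capacity. This is why a school server or external storage is worth considering.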
Please feel free to leave feedback regarding “Your voice on XO”.