Speech synthesis

==Scope==
This article is for collecting ideas and resources for using text-to-speech (TTS) [http://en.wikipedia.org/wiki/Text_to_speech speech synthesis] on the XO.


==Applications of Speech Synthesis wrt OLPC==

Speech synthesis will be useful not only for improving the accessibility of the laptop but also for providing learning aids to the student.

Some simple educational activities that would benefit from the speech synthesis project include:

*'''Pronounce''' - An activity teaching the child how to pronounce words correctly. It could later be extended with speech recognition and analysis of audio files to take spoken input from the student; based on analysis and comparison of that recording, the activity could suggest corrections to the way the child speaks.

*'''Story Book Reader''' - The Read activity could double as a reader that speaks aloud the stories a child downloads to his or her XO. Children can be encouraged to read more and learn as much as they can; learning through listening has its own advantages compared to learning through reading and ad-hoc experimentation.

*'''Listen and Spell''' - Students listen to the XO speak a word, then spell it and check whether they did so correctly. This could be scaled up to a multiplayer game in which students challenge other students in their area (a minimal sketch of the basic loop appears below). See [[Talkntype]] for beginning work in this area.

*'''Talking [[Chatbots]]''' - Kids would love to shoot questions at an AI [[chatbot]] and hear it answer.

*'''Accessibility''' - Speech synthesis tools are an integral component of software meant to improve accessibility. See [http://live.gnome.org/Orca Orca] for more information.

Also see [http://www.olpcnews.com/content/ebooks/effective_adult_literacy_program.html Effective Adult Literacy Program], which is a good read in this context.
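
The activity ideas above all reduce to the same basic pattern: turn a string into speech, then react to the learner. As a rough illustration of the '''Listen and Spell''' idea, here is a minimal Python sketch that drives the espeak command-line tool described below. The word list and the use of espeak -w plus aplay (the workaround discussed in the eSpeak section) are assumptions for the sketch, not part of any existing activity.

 #!/usr/bin/env python
 # Hypothetical Listen-and-Spell loop: speak a word with eSpeak, ask the
 # student to type it, and report whether the spelling was correct.
 import os
 import subprocess
 import tempfile
 
 WORDS = ["banana", "laptop", "school"]  # example word list (assumption)
 
 def say(text, voice="en"):
     """Speak text by writing a WAV file with espeak and playing it with aplay,
     mirroring the temp-file workaround described in the eSpeak section."""
     fd, path = tempfile.mkstemp(suffix=".wav")
     os.close(fd)
     try:
         subprocess.call(["espeak", "-v", voice, "-w", path, text])
         subprocess.call(["aplay", "-q", path])
     finally:
         os.remove(path)
 
 def listen_and_spell():
     score = 0
     for word in WORDS:
         say("Please spell the word: %s" % word)
         answer = raw_input("Your spelling: ").strip().lower()
         if answer == word:
             say("Correct!")
             score += 1
         else:
             say("Not quite. It is spelled " + ", ".join(word))
     say("You spelled %d of %d words correctly." % (score, len(WORDS)))
 
 if __name__ == "__main__":
     listen_and_spell()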

==Existing software==

=== [[Speak]] ===
Type text, and a funny face speaks what you typed.
Pitch, speed, glasses, and mouth are adjustable.

=== Others ===
There are [http://en.wikipedia.org/wiki/FOSS free and open source software (FOSS)] [http://en.wikipedia.org/wiki/Text_to_speech speech-synthesis] packages which run on devices comparable to the XO. We are much more concerned with [[localization]] than is typical, and dialects can be a political issue, but TTS would help with [[Accessibility]] and could be very cool.

Speech synthesis involves a set of complex trade-offs between synthesizer size, fidelity, and the effort needed to localize a new language. The [http://en.wikipedia.org/wiki/Speech_synthesis Wikipedia speech synthesis] article discusses available software, which includes [http://www.cstr.ed.ac.uk/projects/festival/ festival], [http://www.speech.cs.cmu.edu/flite/ flite], and [http://espeak.sourceforge.net/ espeak].

[http://sourceforge.net/projects/espeak/ Espeak] is small enough for us to bundle routinely and covers quite a few languages: roughly ten are currently supported and have been tuned by native speakers, and localization to ten more is underway.

Synthesis is essential for making content accessible to people with vision problems, and will need to be integrated with the [http://developer.gnome.org/projects/gap/ ATK library] used by the GUI, as well as with literacy training and other uses. Full localization therefore involves selecting a suitable synthesis system, integrating it into the ATK framework, and localizing that system for the particular language involved.

Speech synthesis is usually not a good guide for pronunciation – but it may be better than a poor teacher who has never had the opportunity to learn from a native speaker of that language.

=== eSpeak ===


eSpeak is currently included on the XO, but it does not talk to the sound card directly, because the XO uses ALSA rather than OSS as its main sound system and OSS emulation in ALSA is not enabled by default. Manually configuring your XO to emulate OSS in ALSA will provide the devices that espeak needs and allow full functionality. - Dking

If OSS emulation is not enabled in your XO's ALSA setup, text can still be played by piping espeak's standard output to another program:

 $ espeak --stdout "Ello world." | gst-launch fdsrc fd=0 ! wavparse ! alsasink
 $ espeak --stdout -vpt "Bem-vindo ao wiki da OLPC" | gst-launch fdsrc fd=0 ! wavparse ! alsasink
 $ espeak --stdout "Using aplay." | aplay -


However, for some initial sounds, {{Trac|4002|espeak fails to output valid audio to standard out}}. This includes letters c, h, k, p, q, t, v, z and possibly others. For example, this still won't work in build 703 (aka Update.1, espeak v 1.28):


 $ espeak --stdout "hello world." | aplay

A workaround is to first write the output to a file, then play back the file:

 $ espeak -w temp.wav "hello world."; aplay temp.wav


<!---See the bug ticket for more information: http://dev.laptop.org/ticket/4002--->


[[Screen Reader]] is a DBus interface that allows the XO to use eSpeak via Python.
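
The details of that D-Bus interface are documented on the [[Screen Reader]] page. Purely as an illustration of the call pattern from Python, a dbus-python client looks roughly like the sketch below; the bus name, object path, interface and method names are hypothetical placeholders, not the documented API.

 # Illustrative only: generic dbus-python call pattern for a speech service.
 # The names below are placeholders; see the Screen Reader page for the real
 # interface exposed on the XO.
 import dbus
 
 bus = dbus.SessionBus()
 remote_object = bus.get_object('org.laptop.ScreenReader',    # placeholder bus name
                                '/org/laptop/ScreenReader')   # placeholder object path
 speech = dbus.Interface(remote_object, 'org.laptop.ScreenReader')  # placeholder interface
 speech.Say('Hello from Python over D-Bus.')                  # placeholder method
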
*[http://sourceforge.net/forum/forum.php?thread_id=1679272&forum_id=538920 Improving the Brazilian Portuguese voice]


=== Festival ===


*http://festvox.org/festival/ multi-lingual speech synthesis

Flite is ''not'' currently included on the XO. Unless that changes, it would have to come out of your activity's space budget.

 First, run /sbin/init 3 so yum doesn't run out of memory.  After yum, reboot.
 $ yum install flite
 $ flite -t 'Hello, world!'
:Does it always sound this bad, or is it just the default voice that works poorly? [[User:MitchellNCharity|MitchellNCharity]] 16:42, 22 October 2007 (EDT)
:The default voice isn't great. The arctic-hts voices are much better but also quite large (2-3 MB each) and not lightweight on the CPU either. [[User:Mattdm|Mattdm]] 01:25, 4 October 2008 (UTC)

Festival is ''not'' currently included on the XO. Unless that changes, it would have to come out of your activity's space budget.


 First, run /sbin/init 3 so yum doesn't run out of memory.  After yum, reboot.
 $ yum install festival
 $ echo 'Hello, world!' | festival --tts
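
An activity that wants to drive Festival programmatically can set up the same pipe from Python. This is only a sketch of the shell command above, not an existing OLPC API.

 # Illustrative only: pipe text to `festival --tts` from Python, mirroring
 # the shell command shown above.
 import subprocess
 
 def festival_say(text):
     """Send text to Festival's text-to-speech mode via its standard input."""
     p = subprocess.Popen(["festival", "--tts"], stdin=subprocess.PIPE)
     p.communicate(text)
 
 festival_say("Hello, world!")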


=== [[FreeIconToSpeech]] ===


The goal of FreeIconToSpeech is to provide a low-cost assistive/augmentative communication tool for people with speech, motor, and/or developmental challenges. The immediate opportunity is to create open source software to allow a user to select concepts through a menu of icons, and synthesize speech from those selected concepts. See the [[FreeIconToSpeech]] page for more information.
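
To make the icon-to-speech idea concrete, here is a minimal Python sketch (illustrative only, not FreeIconToSpeech code) that maps a selection of icons to a phrase and speaks it with the espeak command-line tool; the icon vocabulary is an invented placeholder.

 # Illustrative icon-to-speech sketch: each icon the user taps contributes a
 # word or phrase, and the whole selection is spoken as one utterance.
 # (On the XO you may need the aplay workaround described in the eSpeak section.)
 import subprocess
 
 # Hypothetical icon vocabulary: icon name -> spoken phrase.
 ICONS = {
     "i":     "I",
     "want":  "want",
     "drink": "a drink of water",
     "help":  "some help",
 }
 
 def speak_selection(selected_icons):
     """Join the phrases for the selected icons and speak them with eSpeak."""
     phrase = " ".join(ICONS[name] for name in selected_icons if name in ICONS)
     if phrase:
         subprocess.call(["espeak", phrase])
 
 # Example: the user taps the "i", "want" and "drink" icons.
 speak_selection(["i", "want", "drink"])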


==The state of the art==
Commercial Text-To-Speech programs are getting very good now. The examples at the Digital Future Software Company site are very clear. They use AT&T technology and provide examples of male and female speech in English, French and Spanish. The XO needs open-source software that can approach this quality in a wide range of languages. --Ricardo 04:07, 17 August 2007 (EDT)

==Resources==


==See also==
*[[Screen_Reader|Screen Reader]]
*[[Speech recognition]]
*[[Shtooka Project]]
*[[Speak]] A simple but cute activity which animates a face as it reads the words typed by the child
*[[Words]] A translating dictionary with speech synthesis
*[[Talkntype]] Initial draft of an activity based on the Speak&Spell toy, using eSpeak speech synthesis.
*[http://code.google.com/soc/2008/clam/appinfo.html?csaid=AE2EEC2E19810C2 GSOC08 Educational Vowel Synthesiser]


[[Category:Software]]
[[Category:Software ideas]]
[[Category:Accessibility]]
[[Category:Speech Synthesis]]
[[Category:Chatbots]]
[[Category:Virtual Assistants]]
