Speech recognition: Difference between revisions

From OLPC
(+Category:Software_ideas)
 
(11 intermediate revisions by 4 users not shown)


This article is for collecting ideas and resources for using [http://en.wikipedia.org/wiki/Speech_recognition speech recognition] on the XO.

==Existing speech recognition software==
While limited, there ''are'' [http://en.wikipedia.org/wiki/Free_and_open-source_software FOSS (Free Open Source Software)] [http://en.wikipedia.org/wiki/Speech_recognition speech recognition] packages which run on devices comparable to the XO. They may be sub-realtime, non-continuous, have a limited vocabulary, and/or require training.

Using the embedded mic, rather than a higher-quality plug-in one, may be a challenge. And we are much more concerned with '[[Localization]]' than is typical. But still, there are possibilities we should explore. Talking to your XO could be very neat!

==Microphone==
The [http://en.wikipedia.org/wiki/Speech_recognition#Microphone Wikipedia Speech Recognition] article recommends an '[http://en.wikipedia.org/wiki/Array_microphone Array Microphone]' for speech recognition. The [http://en.wikipedia.org/wiki/Array_microphone Array Microphone] article says that a microphone array is any number of microphones operating in tandem, and that such arrays are good for extracting voice input from ambient noise (notably in telephones, speech recognition systems, and hearing aids). The downside of using external microphones is that it costs money to provide all the schools or every child with them. --[[User:Ricardo|Ricardo]] 03:22, 17 August 2007 (EDT)

I wonder if multiple XOs could be combined into a microphone array? One might be able to do sound localization, which could be an interesting Activity itself. [[User:MitchellNCharity|MitchellNCharity]] 05:59, 4 November 2008 (UTC)

==Digital Signal Processing (DSP)==
If speech recognition is to be successful, using an internal or external microphone, some good [http://en.wikipedia.org/wiki/Digital_signal_processing Digital Signal Processing] software may be required. This would filter out background noise, breath noise, pops and clicks, etc. There is plenty of open-source software for this. --[[User:Ricardo|Ricardo]] 03:22, 17 August 2007 (EDT)

==Overcoming the limitations of speech recognition software==

===A good training video===
To allow interactive sessions, where children dictate text word-by-word and issue editing voice-commands, it would be useful to have a good training video. It would explain how to speak in a way that fits in with the limitations of the software (slowly, with gaps between words), how to train the software, how to speak in a consistent way, etc.

===Pre-processing the speech===
For large passages of text, each sentence could be recorded as continuous speech, with no pauses between words. A sound-editing program would then be used to mark the boundaries between words and insert gaps. For example, "Passmeanorange" becomes "Pass me an orange". A general-purpose sound-editing program could be used, but it would be quicker to create a specialized program that needs just one click to introduce a gap. It could also filter out noise and normalize the volume.

===Integrating the speech recognition software into the sound-editor===
To minimize the time spent on sound pre-processing, the sound editor could have built-in speech recognition that checks, after each sound-processing action (insertion of a gap, etc.), whether the sentence is recognizable yet, so that people don't do more pre-processing work than is necessary.

===Re-recording each sentence before recognition===
If someone else's speech has to be processed, one option is to re-record each sentence in a clear voice. Someone who has already trained the software would listen to each sentence and re-record it in a clear voice, slowly, with gaps between words. The software should then make a better job of recognizing it.

For example, if a child records a member of the community telling a story and they want to turn this into text, then speech recognition software may have problems. The software hasn't been trained on that person's voice, and the recording may contain fast, continuous speech with no gaps between words, a heavy accent, an old and croaky voice, background chatter, etc. Re-recording each sentence may solve the problem.

So that every child doesn't have to spend ages training the software for their voice or learn about the limitations of speech recognition software, just one or two children in a class or some volunteers on the internet could act as a 'speech recognition bureau'.

--[[User:Ricardo|Ricardo]] 03:07, 17 August 2007 (EDT)

===Building speech recognition for new languages===
It may be possible for folks to create recognizers for arbitrary languages.

But this is ''hard'', requiring a great deal of work by
* someone who can do linguistics
* lots and lots of people recording themselves saying things.

Tools:
*[http://julius.sourceforge.jp/en_index.php Julius] and [http://julius.sourceforge.jp/en_index.php?q=en_grammar.html Julian] - language-independent continuous speech recognizers. Julius and freevox are now on Ubuntu.
*:It seems unlikely that XOs could run Julius. Perhaps Julian? Otherwise, they would need school or infrastructure servers.
*http://www.voxforge.org/ - database and how-to guides for creating acoustic models.
*[https://launchpad.net/v2x V2X Speech Recognition and Voice Synthesis] project is in progress. Please join the team if you are interested.


==Resources==
*[http://www.speech.cs.cmu.edu/pocketsphinx/ PocketSphinx]

==See also==
*[[Speech synthesis]]

[[Category:Software]]
[[Category:Software ideas]]
[[Category:Developers]]
[[Category:Software development]]

Latest revision as of 13:21, 30 July 2010

This article is a stub. You can help the OLPC project by expanding it.

This article is for collecting ideas and resources for using speech recognition on the XO.

Existing speech recognition software

While limited, there are FOSS (Free Open Source Software) speech recognition packages which run on devices comparable to the XO. They may be sub-realtime, non-continuous, have a limited vocabulary, and/or require training.

Using the embedded mic, rather than a higher-quality plug-in one, may be a challenge. And we are much more concerned with 'Localization' than is typical. But still, there are possibilities we should explore. Talking to your XO could be very neat!

Microphone

The Wikipedia Speech Recognition article recommends an 'Array Microphone' for speech recognition. The Array Microphone article says that a microphone array is any number of microphones operating in tandem, and that such arrays are good for extracting voice input from ambient noise (notably in telephones, speech recognition systems, and hearing aids). The downside of using external microphones is that it costs money to provide all the schools or every child with them. --Ricardo 03:22, 17 August 2007 (EDT)

I wonder if multiple XOs could be combined into a microphone array? One might be able to do sound localization, which could be an interesting Activity itself. MitchellNCharity 05:59, 4 November 2008 (UTC)
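
As a rough illustration of the sound localization idea, here is a minimal Python sketch that estimates the direction of a sound from two microphone recordings by finding the time delay between them. Everything in it is an assumption made for illustration: the function names best_lag and bearing_degrees are hypothetical, the two recordings are assumed to share a common sample clock (separate XOs would probably need some extra synchronization step first), and the source is assumed to be far away compared with the microphone spacing.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature


def best_lag(a, b, max_lag):
    """Return the lag (in samples) at which b best matches a.

    A positive result means b looks like a delayed copy of a.
    Uses a brute-force cross-correlation over lags in [-max_lag, max_lag].
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, x in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                score += x * b[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def bearing_degrees(lag, sample_rate, mic_spacing):
    """Convert a lag into an angle from broadside (0 = straight ahead)."""
    delay = lag / sample_rate  # seconds
    # Clamp to [-1, 1] so measurement noise cannot push asin() out of range.
    ratio = max(-1.0, min(1.0, delay * SPEED_OF_SOUND / mic_spacing))
    return math.degrees(math.asin(ratio))
```

For example, best_lag(left, right, max_lag) followed by bearing_degrees(lag, 16000, 0.5) would give an angle for two microphones half a metre apart sampling at 16 kHz. The brute-force search is slow for long recordings, so a real Activity would likely work on short windows.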

Digital Signal Processing (DSP)

If speech recognition is to be successful, using an internal or external microphone, some good Digital Signal Processing software may be required. This would filter out background noise, breath noise, pops and clicks, etc. There is plenty of open-source software for this. --Ricardo 03:22, 17 August 2007 (EDT)
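
To give a feel for what such filtering might involve, the following Python sketch shows two very simple building blocks: a first-order high-pass filter (which reduces low-frequency rumble) and a crude per-sample noise gate (which silences very quiet samples). The function names and parameters are hypothetical, samples are assumed to be floats in the range -1.0 to 1.0, and real speech front ends are considerably more sophisticated than this.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz):
    """First-order (RC-style) high-pass filter.

    Attenuates content below roughly cutoff_hz, such as rumble or DC offset.
    """
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = []
    prev_in = prev_out = 0.0
    for x in samples:
        # Standard discrete RC high-pass recurrence.
        y = alpha * (prev_out + x - prev_in)
        out.append(y)
        prev_in, prev_out = x, y
    return out


def noise_gate(samples, threshold):
    """Replace samples whose magnitude is below threshold with silence."""
    return [x if abs(x) >= threshold else 0.0 for x in samples]
```

A typical use would be noise_gate(high_pass(samples, 16000, 100), 0.02), where the cutoff and threshold values are guesses that would need tuning for the XO's microphone.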

Overcoming the limitations of speech recognition software

A good training video

To allow interactive sessions, where children dictate text word-by-word and issue editing voice-commands, it would be useful to have a good training video. It would explain how to speak in a way that fits in with the limitations of the software (slowly, with gaps between words), how to train the software, how to speak in a consistent way, etc.

Pre-processing the speech

For large passages of text, each sentence could be recorded as continuous speech, with no pauses between words. A sound-editing program would then be used to mark the boundaries between words and insert gaps. For example, "Passmeanorange" becomes "Pass me an orange". A general-purpose sound-editing program could be used, but it would be quicker to create a specialized program that needs just one click to introduce a gap. It could also filter out noise and normalize the volume.
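
The gap insertion and volume normalization described above could be sketched in Python as follows. This is only an illustration under stated assumptions: the recording is a list of float samples, the word boundaries have already been marked by a person as sample indices, and the names insert_gaps and normalize are hypothetical.

```python
def insert_gaps(samples, boundaries, gap_len):
    """Return a copy of samples with gap_len silent samples at each boundary.

    boundaries holds sample indices (into the original recording) where one
    word ends and the next begins.
    """
    out = []
    prev = 0
    for b in sorted(boundaries):
        out.extend(samples[prev:b])
        out.extend([0.0] * gap_len)  # the inserted silence
        prev = b
    out.extend(samples[prev:])
    return out


def normalize(samples, peak=1.0):
    """Scale samples so that the loudest one has magnitude peak."""
    loudest = max((abs(x) for x in samples), default=0.0)
    if loudest == 0.0:
        return list(samples)  # silent or empty input: nothing to scale
    scale = peak / loudest
    return [x * scale for x in samples]
```

With a 16 kHz recording, insert_gaps(samples, marks, 4000) would add a quarter of a second of silence at each marked boundary. The one-click editor described above would mainly be a way of producing the marks list.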

Integrating the speech recognition software into the sound-editor

To minimize the time spent on sound pre-processing, the sound editor could have built-in speech recognition that checks, after each sound-processing action (insertion of a gap, etc.), whether the sentence is recognizable yet, so that people don't do more pre-processing work than is necessary.
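
The check-after-each-action idea might look something like the Python sketch below. It is not tied to any real recognizer: recognize is a hypothetical callable that turns audio into text, actions is a list of hypothetical editing steps (each taking audio and returning modified audio), and expected is the sentence the person knows was spoken.

```python
def preprocess_until_recognized(audio, actions, recognize, expected):
    """Apply editing actions one at a time, stopping once recognition succeeds.

    Returns a tuple (audio, actions_applied, recognized).
    """
    # The recording may already be good enough before any editing.
    if recognize(audio) == expected:
        return audio, 0, True
    for count, action in enumerate(actions, start=1):
        audio = action(audio)
        if recognize(audio) == expected:
            return audio, count, True  # stop early: no further work needed
    return audio, len(actions), False
```

In an interactive editor the loop would be driven by the person's clicks rather than a prepared list, but the stopping rule would be the same: re-run recognition after every change and stop as soon as the result is right.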

Re-recording each sentence before recognition

If someone else's speech has to be processed, one option is to re-record each sentence in a clear voice. Someone who has already trained the software would listen to each sentence and re-record it in a clear voice, slowly, with gaps between words. The software should then make a better job of recognizing it.

For example, if a child records a member of the community telling a story and they want to turn this into text, then speech recognition software may have problems. The software hasn't been trained on that person's voice, and the recording may contain fast, continuous speech with no gaps between words, a heavy accent, an old and croaky voice, background chatter, etc. Re-recording each sentence may solve the problem.

So that every child doesn't have to spend ages training the software for their voice or learn about the limitations of speech recognition software, just one or two children in a class or some volunteers on the internet could act as a 'speech recognition bureau'.

--Ricardo 03:07, 17 August 2007 (EDT)

Building speech recognition for new languages

It may be possible for folks to create recognizers for arbitrary languages.

But this is hard, requiring a great deal of work by

  • someone who can do linguistics
  • lots and lots of people recording themselves saying things.

Tools:

  • Julius and Julian - language-independent continuous speech recognizers. Julius and freevox are now on Ubuntu.
    It seems unlikely that XOs could run Julius. Perhaps Julian? Otherwise, they would need school or infrastructure servers.
  • http://www.voxforge.org/ - database and how-to guides for creating acoustic models.
  • V2X Speech Recognition and Voice Synthesis project is in progress. Please join the team if you are interested.

Resources

See also