Speech recognition, also known as Automatic Speech Recognition (ASR), has been an area of great interest and activity in the signal processing and HLT communities over the past several decades. The evolution of this technology can be attributed mainly to advances in VLSI and DSP technologies, which have allowed complex algorithms such as hidden Markov models (HMMs) to run in real time. ASR has already made its way into consumer electronics.

Although this field has been researched for more than two decades now, the evolution of ASR and TTS (Text-to-Speech) systems can be easily tracked from the 1990s. The first workable microcomputer-based ASR systems for Windows appeared in 1993-94. Improved recognition algorithms, faster processing speeds and dedicated DSP cards made voice recognition reasonably accurate and workably fast.
Second-generation ASR systems were released in 1996-97. Although improved over the first generation, they were still not capable of continuous speech recognition or automatic error correction. Later, third-generation systems capable of continuous speech recognition were released.
Speech can be considered a sequence of basic sound units called ‘phonemes’. Phonemes, however, may not be directly observed in a speech signal. Different individuals may produce the same string of phonemes to convey the same information, but they generally sound different as a result of variations in dialect, accent and physiology. All ASR systems use similar technology and convert speech to phonemes using a three-stage process.
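To make the idea of words as phoneme strings concrete, here is a minimal sketch. The pronunciation dictionary below is a toy, illustrative lexicon using ARPAbet-style symbols; a real system would draw on a full pronouncing dictionary.

```python
# Toy pronunciation dictionary (ARPAbet-style symbols); illustrative only.
PRONUNCIATIONS = {
    "hello":  ["HH", "AH", "L", "OW"],
    "speech": ["S", "P", "IY", "CH"],
}

def to_phonemes(words):
    """Map a list of words to one flat phoneme sequence."""
    phonemes = []
    for word in words:
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes
```

For example, `to_phonemes(["hello", "speech"])` yields the eight-phoneme sequence the two words share between them, which is the kind of representation an ASR engine works with internally.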
[Flow chart: the three-stage ASR process]
Analogue to digital conversion
The analogue speech signal is captured by a microphone and converted into digital signal patterns, usually in PCM format.
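The quantisation step can be sketched as follows. This is a simplified illustration, assuming normalised analogue samples in [-1.0, 1.0] being mapped to signed 16-bit PCM values; real capture hardware does this in the ADC.

```python
import math

def to_pcm16(analogue_samples):
    """Quantise analogue samples in [-1.0, 1.0] to signed 16-bit PCM values."""
    pcm = []
    for s in analogue_samples:
        s = max(-1.0, min(1.0, s))       # clip to the valid range
        pcm.append(int(round(s * 32767)))  # scale to 16-bit signed range
    return pcm

# Example: quantise one cycle of a sine wave sampled at 8 points.
wave = [math.sin(2 * math.pi * n / 8) for n in range(8)]
samples = to_pcm16(wave)
```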


The PCM samples are captured and then segmented by the ASR system into phonemes — the basic sounds that make up words — for identification. The ASR engine compares these against stored templates to find a match and then deciphers the word or phrase uttered.
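The template-matching step can be sketched as below. The phoneme templates here are hypothetical, and a similarity ratio over phoneme strings stands in for the statistical scoring (e.g. HMM likelihoods) a real engine would use.

```python
import difflib

# Hypothetical phoneme templates; a real engine stores statistical models.
TEMPLATES = {
    "yes": ["Y", "EH", "S"],
    "no":  ["N", "OW"],
}

def match_word(observed_phonemes):
    """Return the template word whose phoneme sequence best matches the input."""
    def similarity(word):
        return difflib.SequenceMatcher(None, TEMPLATES[word],
                                       observed_phonemes).ratio()
    return max(TEMPLATES, key=similarity)
```

Even a noisy observation such as `["N", "AO"]` still scores closest to the “no” template, which is the essence of deciphering an utterance by comparison against templates.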

Trigger application

The ASR system then triggers the application with a cue associated with the match.
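The triggering step amounts to a lookup from recognised phrase to application action. A minimal sketch, assuming a hypothetical command table (the phrases and actions below are invented for illustration):

```python
# Hypothetical command table mapping recognised phrases to application actions.
COMMANDS = {
    "open mail": lambda: "mail client launched",
    "stop":      lambda: "playback stopped",
}

def trigger(recognised_phrase):
    """Run the action cued by the recognised phrase, if one is registered."""
    action = COMMANDS.get(recognised_phrase)
    return action() if action else None
```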
  • Hands-free computing: voice command recognition computer user interface
  • Home automation
  • Interactive voice response
  • Mobile telephony, including mobile email
  • Multimodal interaction
  • Pronunciation evaluation in computer-aided language learning applications
  • Robotics
  • Speech-to-text (transcription of speech into mobile text messages)
  • Automatic translation
  • Automotive speech recognition
  • Court reporting (Realtime Voice Writing)
  • Telematics (e.g., vehicle Navigation Systems)
  • Transcription (digital speech-to-text)
  • Video games, with Tom Clancy’s EndWar and Lifeline as working examples
  • Health care
  • Battle management
  • Telephony and other domains

The present picture is sure to change in the next few years as continuous speech recognition systems improve their accuracy rates, include larger dictionaries and allow easier and more efficient error correction.
  • Voice XML
  • TTS systems
  • ASR performance scale

Courtesy: article by HARSHA B.V. and SHREE JAISIMHA.

