Voice computing (speech recognition) explained


Language is an important means of communication. Speech is its main medium. Designing a machine that converses with humans, particularly responding properly to spoken language, has intrigued engineers and scientists for centuries.

Voice-based computing is useful to many kinds of users, including blind people, office workers, students, and homemakers. Conventional computer systems, built around a keyboard, mouse, and screen, make little provision for people who are blind or have physical disabilities.

Voice computing, also known as conversational computing or computer speech recognition, is especially valuable for blind or physically challenged people, for whom the disability is the only obstacle to using traditional systems. It is also seen as the next major step in user interfaces and is already changing the way people interact with computers.

The objective is to capture the human voice in a digital computer and decode it into the corresponding text. This is done with speech recognition: the process of converting a speech signal to a sequence of words by means of an algorithm implemented as a computer program.

Automatic speech recognition systems have so far been developed for only a fraction of the world's roughly 7,300 languages. Chinese, English, Russian, Portuguese, Vietnamese, Japanese, Spanish, Filipino, Arabic, Bengali, Tamil, Malayalam, Sinhala, and Hindi are prominent among them.

Speech recognition in movies

Speech recognition technology became a topic of great interest to the general population through blockbuster movies of the 1960s and 1970s. The anthropomorphism of “HAL,” the intelligent computer in Stanley Kubrick’s movie “2001: A Space Odyssey”, made the general public aware of the potential of intelligent machines. In the movie, HAL spoke in a natural-sounding voice and could recognize and understand fluently spoken speech and respond accordingly.

George Lucas, in the famous Star Wars saga, extended the abilities of intelligent machines by making them mobile as well. Droids like R2D2 and C3PO were able to speak naturally, recognize and understand fluent speech, move around, and interact with their environment, with other droids, and with the human population. In 1988, Apple Computer created a vision of speech technology and computers for the year 2011, titled “Knowledge Navigator,” which defined the concepts of a Speech User Interface (SUI) and a Multimodal User Interface (MUI) along with the theme of intelligent voice-enabled agents. This video dramatically affected the technical community and focused technology efforts, especially in the area of visual talking agents.

What is speech recognition?

A speech recognition system (SRS), also known as Automatic Speech Recognition (ASR), converts a speech signal to a sequence of words by employing an algorithm implemented as a computer program. It can potentially be an important mode of interaction between humans and computers.

Speech recognition software takes natural language, spoken words, or commands and translates them into a form the computer can process. The microphone picks up your voice and converts it into an analog electrical signal; the computer’s sound card then samples that signal and translates it into binary data that the computer can work with. The software turns the speech into text or uses it to carry out the user’s command.
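
The sampling and binary-encoding step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `digitize`, the 16 kHz sample rate, the 16-bit depth, and the synthetic 440 Hz tone standing in for a microphone signal are all hypothetical choices, not details of any particular sound card.

```python
import math
import struct

def digitize(analog, sample_rate=16000, duration_ms=10, bits=16):
    """Sample a continuous signal analog(t), with values in [-1, 1],
    and quantize each sample to a signed integer of the given bit depth."""
    max_level = 2 ** (bits - 1) - 1            # 32767 for 16-bit audio
    n_samples = sample_rate * duration_ms // 1000
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                     # time of the n-th sample in seconds
        value = max(-1.0, min(1.0, analog(t)))  # clip to the valid range
        samples.append(round(value * max_level))
    return samples

def tone(t):
    """Hypothetical stand-in for the microphone's analog output: a 440 Hz sine."""
    return math.sin(2 * math.pi * 440 * t)

pcm = digitize(tone)
# Pack the integers as little-endian 16-bit values, the binary form
# in which uncompressed audio is commonly stored.
raw_bytes = struct.pack("<%dh" % len(pcm), *pcm)
```

Ten milliseconds of audio at 16 kHz gives 160 samples, or 320 bytes at two bytes per sample. A higher sample rate or bit depth preserves more of the original signal at the cost of more data for the later stages to process.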

Speech recognition software can help many people, from busy teenagers to people with disabilities. Individuals who are unable to operate a computer with a mouse or keyboard can now control it with ease and confidence. The software supports completely hands-free control of everything from computer games to sending important business emails. The option to ask the computer how to perform a task can also help those who have trouble using it.

Components of speech recognition

Most speech recognition systems have the following five components:

(1) A speech capture device usually consists of a microphone and associated analog-to-digital converter, which digitally encodes the raw speech waveform.

(2) A digital signal processing module performs endpoint (word boundary) detection to separate speech from nonspeech, converts the raw waveform into a frequency domain representation, and performs further windowing, scaling, filtering, and data compression. The goal is to enhance and retain only those components of the spectral representation that are useful for recognition purposes, thereby reducing the amount of information that the pattern-matching algorithm must contend with. A set of these speech parameters for one interval (usually 10-30 milliseconds) is called a speech frame.

(3) Preprocessed signal storage – The preprocessed speech is buffered for the recognition algorithm.

(4) Reference speech patterns – Stored reference patterns can be matched against the user’s speech sample once the DSP module has preprocessed it. This information is stored as a set of speech templates or as generative speech models.

(5) A pattern matching algorithm – The algorithm must compute a goodness-of-fit measure between the preprocessed signal from the user’s speech and all the stored templates or speech models. A selection process chooses the template or model (possibly more than one) with the best match.
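
The last four components can be illustrated with a toy template matcher. The sketch below is an assumption-laden illustration, not how any specific recognizer works: it cuts the signal into 20 ms frames, reduces each frame to two simple time-domain features (log energy and zero-crossing rate) in place of the spectral parameters a real DSP module would compute, and uses dynamic time warping (DTW) as the goodness-of-fit measure against stored templates. All function names, the synthetic tones, and the "low"/"high" labels are hypothetical.

```python
import math

def frames(samples, frame_len, hop):
    """Split a list of samples into overlapping fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def frame_features(frame):
    """Two toy parameters per speech frame: log energy and zero-crossing rate."""
    energy = sum(s * s for s in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return (math.log(energy + 1e-10), crossings / len(frame))

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two feature sequences.
    A smaller value means a better fit."""
    inf = float("inf")
    cost = [[inf] * (len(seq_b) + 1) for _ in range(len(seq_a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(seq_a) + 1):
        for j in range(1, len(seq_b) + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[-1][-1]

def recognize(features, templates):
    """Select the template label with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))

def tone(freq, n=1600, rate=16000):
    """Synthetic stand-in for a digitized utterance."""
    return [math.sin(2 * math.pi * freq * t / rate) for t in range(n)]

def extract(samples):
    """Preprocess a signal: 20 ms frames (320 samples at 16 kHz), 10 ms hop."""
    return [frame_features(f) for f in frames(samples, frame_len=320, hop=160)]

# Stored reference patterns, one feature sequence per label.
templates = {"low": extract(tone(200)), "high": extract(tone(2000))}

# A longer, slightly different input is still matched to the closer template.
result = recognize(extract(tone(230, n=2000)), templates)
```

DTW is used here because it tolerates inputs that are longer or shorter than the template, which matters since no two utterances of a word have the same duration. Systems based on generative speech models replace the distance with a likelihood, but the selection step is the same: score every candidate and keep the best.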

Speech recognition draws on several disciplines and tries to solve the following problems:

  • Signal processing – Extracting relevant information from the speech signal efficiently. Characterizing time-varying properties of the speech signal as well as various types of signal preprocessing and post-processing.
  • Pattern recognition – Using algorithms to cluster data and create prototypical patterns, and to compare a pair of patterns based on feature measurements. To detect the presence of a particular speech pattern, a set of coding and decoding algorithms searches a large but finite grid for the best path, which corresponds to the “best” recognized sequence of words.
  • Linguistics – Discovering the relationship between sounds (phonology), words in a language (syntax), the meaning of spoken words (semantics), and sense derived from the meaning (pragmatics).
  • Physiology – Understanding the higher-order mechanisms within the human central nervous system that account for speech production and perception in human beings.
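
The best-path search mentioned under pattern recognition is commonly done with the Viterbi algorithm over a grid of states and time steps. The sketch below is a minimal illustration with made-up numbers: the two states, the "quiet"/"loud" observations, and every probability are hypothetical. To stay small it labels each frame as silence or speech, a form of endpoint detection; searching for a word sequence uses the same idea on a much larger grid.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observations.
    Works in the log domain to avoid numerical underflow."""
    # best[t][s] = (log-probability of the best path ending in state s
    #               at time t, predecessor state on that path)
    best = [{s: (math.log(start_p[s]) + math.log(emit_p[s][observations[0]]), None)
             for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prev = max(states,
                       key=lambda p: best[-1][p][0] + math.log(trans_p[p][s]))
            score = (best[-1][prev][0] + math.log(trans_p[prev][s])
                     + math.log(emit_p[s][obs]))
            column[s] = (score, prev)
        best.append(column)
    # Trace back from the best final state to recover the whole path.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]

# Hypothetical toy model: all values are illustrative only.
states = ["silence", "speech"]
start_p = {"silence": 0.8, "speech": 0.2}
trans_p = {"silence": {"silence": 0.7, "speech": 0.3},
           "speech": {"silence": 0.2, "speech": 0.8}}
emit_p = {"silence": {"quiet": 0.9, "loud": 0.1},
          "speech": {"quiet": 0.3, "loud": 0.7}}

path = viterbi(["quiet", "loud", "loud", "quiet", "quiet"],
               states, start_p, trans_p, emit_p)
```

Because each grid cell keeps only its best predecessor, the search cost grows linearly with the number of frames rather than exponentially with the number of possible paths.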

Applications of speech recognition systems

Speech technologies are widely used and have a broad range of applications. They enable machines to respond correctly and reliably to human voices and to provide useful services. In a human-to-machine interface, the speech signal is captured as an analog waveform and then converted into a digital one that the machine can process.

  • Education: Speech-to-text processing to correct the pronunciation of vocabulary in foreign languages; spoken text entry in place of a keyboard for students with disabilities.
  • Medical: Precision surgery, automatic wheelchairs, medical transcription (digital speech-to-text).
  • Military: Automatic aircraft and helicopter control, air traffic controller training, automatic ammunition control.
  • Communication: Voice dialing, telephone directory inquiry without operator assistance.
  • Domestic: Control of ovens, refrigerators, washing machines, and other home appliances.
  • General: Security at highly secure sites, commercial dictation systems, translation from one language to another, video gaming, and data entry at ATMs.


Successful voice computing is evidence of the closing gap between user intention and computer understanding, a gap that until now has been bridged by keyboards, computer screens, and extensive training. As technology growth accelerates, examples of this new type of computer interaction are becoming more frequent. The willingness of consumers and businesses to interact with simple voice tools suggests that most would eagerly adopt future improvements in this technology. Many important scientific and technological advances have brought us closer to the “Holy Grail” of machines that recognize and understand fluently spoken speech. Still, we are far from having a machine that mimics human behavior.