Automatic speech recognition (ASR) or voice recognition is a process that uses a computer program to convert a speech signal into a sequence of words.
Early speech recognition software applied grammatical and syntactical rules to speech: if the words fit a certain set of rules, the software could figure out what was being said. Because human language has so many variations, however, this approach never achieved a high level of accuracy.
Statistical modeling systems, which use mathematical functions and probability, power today’s speech recognition systems. Scientists have been studying speech for more than 200 years, and there have been at least five generations of approaches in that time.
First Generation: Mechanical Synthesis
The first attempts to create synthetic speech were made over two hundred years ago. In 1779, Professor Christian Kratzenstein of St. Petersburg explained the physiological differences between five long vowels and built an apparatus to produce them artificially: acoustic resonators that resembled the human vocal tract, activated by vibrating reeds like those in musical instruments.
Wolfgang von Kempelen introduced an “Acoustic-Mechanical Speech Machine” in Vienna in 1791, capable of producing single sounds and some combinations of sounds. A pressure chamber for the lungs, a vibrating reed that acted as vocal cords, and a leather tube for the vocal tract were all part of his machine; by changing the shape of the tube, he could produce different vowel sounds. Kempelen received bad press and was dismissed because other inventions of his turned out to be fakes. Even so, his machine spawned new theories about human speech production.
Charles Wheatstone built a version of von Kempelen’s speaking machine in the mid-1800s that could produce vowels, most consonant sounds, and even complete sentences. Alexander Graham Bell, inspired by Wheatstone’s machine, built his own. Mechanical vocal systems were studied and experimented with until the 1960s, but with little success; voice recognition did not begin in earnest until the introduction of electrical synthesizers.
Second Generation (the 1930s-40s): Homer Dudley and Vocoder
In 1928, AT&T Bell Labs engineer Homer Dudley conducted the first voice encoding experiments, and in 1935 he received a patent for the resulting voice encoder, the Vocoder. In 1936, with fellow engineers Riesz and Watkins, he created the first electronic speech synthesizer, the Voder, which was derived from the Vocoder. Experts demonstrated it at the 1939 World’s Fair, using a keyboard and foot pedals to play the machine and produce speech. The Vocoder itself was created in the 1930s as a speech coder for telecommunications applications, with the goal of coding speech for transmission. It was used for secure radio communication, in which the voice was digitized, encrypted, and then sent over a narrow voice-bandwidth channel.
Universities and the United States government funded most early voice encoding and speech recognition research, primarily through the military and the Defense Advanced Research Projects Agency. The SIGSALY system, built by Bell Labs engineers in 1943, used Dudley’s Vocoder; during World War II it encrypted high-level communications for the Allies. After the 1930s and early 1940s, however, the Vocoder and speech recognition did not improve much for some time.
Third Generation (the 1950s-60s): Synthesizers
At Haskins Laboratories in 1951, Franklin Cooper created the Pattern Playback synthesizer, which converted spectrogram patterns, optically recorded on a transparent belt, back into sound in their original or modified forms. A spectrogram shows the frequency spectrum of a compound signal: a three-dimensional representation of how the energy in the signal’s frequency content changes over time.
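The spectrogram idea behind Pattern Playback can be sketched in plain Python: slide a window across the signal, take a discrete Fourier transform of each frame, and keep the magnitudes. This is an illustrative modern sketch, not Haskins’ optical process; the frame size, hop, and test tone are arbitrary choices.

```python
import cmath
import math

def spectrogram(signal, frame_size=64, hop=32):
    """Magnitude spectrogram: one DFT per overlapping frame.

    Returns a list of frames; each frame is a list of magnitudes for
    the non-negative frequency bins (time along one axis, frequency
    along the other, energy as the value -- the "3-D" representation).
    """
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        # A Hann window reduces spectral leakage at the frame edges.
        windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size))
                    for n, x in enumerate(frame)]
        # Naive DFT of the windowed frame (O(N^2), fine for a sketch).
        mags = []
        for k in range(frame_size // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n, x in enumerate(windowed))
            mags.append(abs(acc))
        frames.append(mags)
    return frames

# A pure tone whose frequency falls exactly on DFT bin 8:
# its energy concentrates in that bin in every frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(tone)
```

A real system would use an FFT instead of the naive DFT, but the shape of the output (frames by frequency bins) is the same.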
In 1953, Walter Lawrence introduced PAT (Parametric Artificial Talker), the first formant synthesizer. It consisted of three electronic formant resonators excited by a buzz or noise source. A moving glass slide converted painted patterns into six time functions that controlled the three formant frequencies, the voicing amplitude, the fundamental frequency, and the noise amplitude. PAT was the first successful synthesizer to use vocal tract resonances to describe the reconstruction process.
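A formant resonator of the kind PAT used can be modeled today as a two-pole digital filter that boosts energy near one resonance frequency. The sketch below is a digital analogy, not Lawrence’s analog circuitry; the sample rate, pitch, and formant values are illustrative assumptions.

```python
import math

def resonator(signal, freq_hz, bandwidth_hz, sample_rate=8000):
    """Two-pole digital resonator: emphasizes energy near freq_hz.

    Implements y[n] = a*x[n] + b*y[n-1] + c*y[n-2], the standard
    discrete-time model of a single formant resonance.
    """
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2 * math.pi * freq_hz / sample_rate
    b = 2 * r * math.cos(theta)
    c = -r * r
    a = 1 - b - c  # normalizes the gain at 0 Hz to 1
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out

# An impulse-train "buzz" (the vocal-cord source) passed through three
# formants in turn; the formant values roughly suggest a vowel and are
# purely illustrative.
buzz = [1.0 if n % 80 == 0 else 0.0 for n in range(800)]
speech = buzz
for formant_hz, bandwidth_hz in [(700, 130), (1220, 70), (2600, 160)]:
    speech = resonator(speech, formant_hz, bandwidth_hz)
```

Swapping the buzz for white noise, as PAT could, yields unvoiced (whispered or fricative-like) sound through the same resonators.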
At about the same time, Gunnar Fant introduced the OVE I (Orator Verbis Electris), the first cascade formant synthesizer, with its resonators connected in a cascade. In 1962, Fant and his colleague Martony introduced the OVE II synthesizer, which included separate parts to model the vowel, nasal, and consonant transfer functions of the vocal tract. These synthesizers led to the later OVE III and GLOVE projects.
The first speech recognition systems could understand only digits: Bell Labs created the “Audrey” system in 1952, which recognized digits spoken by a single voice. Ten years later, at the 1962 World’s Fair, IBM demonstrated its “Shoebox” machine, which could understand 16 English words. Other hardware dedicated to recognizing spoken sounds was developed in labs in the United States, Japan, England, and the Soviet Union, extending speech recognition to four vowels and nine consonants. These early efforts may not sound like much, but they were an impressive start, especially considering how primitive computers were at the time.
Fourth Generation (the 1970s to 2001): The HMM Model and the Commercial Market
In the early 1970s, Lenny Baum of Princeton invented the hidden Markov modeling (HMM) approach to speech recognition. This statistical model generates a sequence of symbols or numbers and compares it to known patterns. The method was adopted by all major speech recognition companies and became the foundation of modern speech recognition. Baum’s invention was shared with several Advanced Research Projects Agency (ARPA) contractors, including IBM.
The Defense Advanced Research Projects Agency (DARPA) established the Speech Understanding Research program in 1971 to develop a computer system that could understand continuous speech. Under Lawrence Roberts, the program spent $3 million per year of government funds over five years. It spawned a slew of speech understanding research groups and was the world’s largest speech recognition project to date.
Thanks to this interest and funding from the US Department of Defense, speech recognition technology advanced significantly in the 1970s. Running from 1971 to 1976, DARPA’s Speech Understanding Research (SUR) program was one of the largest projects in speech recognition history. It was behind Carnegie Mellon’s “Harpy” speech-understanding system, among other things. Harpy had a vocabulary of 1,011 words, about the same as a three-year-old’s.
According to Alex Waibel and Kai-Fu Lee’s Readings in Speech Recognition, Harpy was significant because it introduced a more efficient search approach called beam search to “prove the finite-state network of possible sentences.” (As Google’s entry into speech recognition on mobile devices demonstrated just a few years ago, the story of speech recognition is linked to advances in search methodology and technology.)
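Beam search of the kind Harpy pioneered can be sketched simply: extend every partial sentence by each word the network allows, score the extensions, and keep only the best few at each step. The toy word network and scores below are invented purely for illustration.

```python
def beam_search(network, start, scores, beam_width=2, max_steps=10):
    """Keep only the top `beam_width` partial sentences at each step.

    `network` maps a word to its possible successor words; `scores`
    gives a log-probability-like score per word. Pruning to a beam
    makes the search cost roughly linear in sentence length instead
    of exponential in the number of paths.
    """
    beam = [([start], scores[start])]
    for _ in range(max_steps):
        candidates = []
        for words, score in beam:
            for nxt in network.get(words[-1], []):
                candidates.append((words + [nxt], score + scores[nxt]))
        if not candidates:  # no hypothesis can be extended further
            break
        candidates.sort(key=lambda hyp: hyp[1], reverse=True)
        beam = candidates[:beam_width]  # prune everything below the beam
    return beam

# A toy finite-state network of possible sentences (invented data).
network = {
    "show": ["me", "all"],
    "me": ["flights", "fares"],
    "all": ["flights"],
    "flights": [],
    "fares": [],
}
scores = {"show": -0.1, "me": -0.5, "all": -1.2,
          "flights": -0.3, "fares": -0.9}
best = beam_search(network, "show", scores)
```

With a beam width of 2, the path through “all” is pruned after the second step, leaving “show me flights” as the top hypothesis.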
Other significant events in speech recognition technology during the 1970s included the founding of Threshold Technology, the first commercial speech recognition company, and Bell Laboratories’ introduction of a system that could interpret multiple people’s voices. Texas Instruments released a popular toy called “Speak and Spell” in 1978. It used a speech processor, which aided in developing more human-like digital synthesis sound.
Over the next decade, thanks to new approaches to understanding what people say, speech recognition vocabulary grew from a few hundred words to several thousand, with the potential to recognize an unlimited number. One of the main reasons was a new statistical method, the hidden Markov model. Rather than simply looking for sound patterns and matching words against templates, HMMs considered the probability that unknown sounds were words. This foundation would remain in place for the next two decades.
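The shift from template matching to probability can be illustrated with Viterbi decoding, the standard way to find the most likely hidden state sequence in an HMM given some observations. The states, observations, and probabilities below are invented toy values, not any real acoustic model.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for an observation sequence.

    Each cell stores the probability of the best path ending in that
    state, so unknown sounds are scored by likelihood rather than
    matched against fixed templates.
    """
    best = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    paths = {s: [s] for s in states}
    for obs in observations[1:]:
        new_best, new_paths = {}, {}
        for s in states:
            # Best predecessor state for a path now ending in s.
            prev = max(states, key=lambda p: best[p] * trans_p[p][s])
            new_best[s] = best[prev] * trans_p[prev][s] * emit_p[s][obs]
            new_paths[s] = paths[prev] + [s]
        best, paths = new_best, new_paths
    winner = max(states, key=lambda s: best[s])
    return paths[winner], best[winner]

# Toy model: two hidden phones emitting coarse acoustic labels.
states = ["ih", "iy"]
start_p = {"ih": 0.6, "iy": 0.4}
trans_p = {"ih": {"ih": 0.7, "iy": 0.3},
           "iy": {"ih": 0.4, "iy": 0.6}}
emit_p = {"ih": {"low": 0.8, "high": 0.2},
          "iy": {"low": 0.3, "high": 0.7}}
path, prob = viterbi(["low", "low", "high"], states,
                     start_p, trans_p, emit_p)
```

Real recognizers work in log space to avoid underflow and use thousands of states, but the dynamic-programming recursion is the same.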
Drs. Jim and Janet Baker founded Dragon Systems in 1982; it went on to build a long record of patents and innovations in speech and language technology. SpeechWorks, the leading provider of automated speech recognition over the telephone, was founded in 1994.
With this expanded vocabulary, speech recognition began to make inroads into commercial applications for business and specialized industries (for instance, medical use). It even made its way into the home in the form of Worlds of Wonder’s Julie doll (1987), which children could train to respond to their voice.
However, whether the software could recognize 1,000 words, as the Kurzweil text-to-speech program did in 1985, or a 5,000-word vocabulary, as IBM’s system did, a significant challenge remained: these programs took discrete dictation, so you had to pause after each word.
In the 1990s, computers with faster processors arrived, and speech recognition software became affordable to the general public. Dragon Dictate, the first consumer speech recognition product, was released in 1990 for a whopping $9,000. The much-improved Dragon NaturallySpeaking arrived seven years later; because it recognized continuous speech, you could speak at around 100 words per minute. However, the program required 45 minutes of training, and it still cost $695.
In 1996, BellSouth introduced the first voice portal, VAL, a dial-in interactive voice recognition system that was supposed to provide information based on what you said into the phone. VAL paved the way for the inaccurate voice-activated menus that would plague callers for the next 15 years.
Dragon Systems released word-level dictation speech recognition software in 1995, making dictation speech recognition technology available to the general public for the first time. IBM and Kurzweil quickly followed suit.
In 1996, Charles Schwab became the first company to devote resources to a voice service, the Voice Broker program, which could handle up to 360 simultaneous customers calling in for stock and option quotes, about 50,000 requests per day. It was 95% accurate, and it paved the way for many other businesses to follow.
Dragon released NaturallySpeaking in 1997, the first “continuous speech” dictation software, which didn’t require the user to pause between words for the computer to understand what was being said.
Lernout & Hauspie purchased Kurzweil in 1998. Microsoft invested $45 million in Lernout & Hauspie to form a partnership that allowed Microsoft to use the company’s speech recognition technology in its own systems. In 1999, Microsoft bought Entropic, gaining access to what was billed as the “most accurate speech recognition system” available.
Lernout & Hauspie paid $460 million for Dragon Systems in 2000. The same year, TellMe launched the first global voice portal.
ScanSoft purchased Lernout & Hauspie’s speech and language assets in 2001. In 2003, ScanSoft also bought SpeechWorks and signed a deal to distribute and support IBM’s desktop speech recognition products.
By 2001, computer speech recognition had reached 80% accuracy, and by the end of the decade the technology appeared to be at a standstill. Recognition systems worked well when the language universe was small, but they were still “guessing” among similar-sounding words with the help of statistical models, and the known language universe kept growing as the Internet grew.
Windows Vista and Mac OS X had built-in speech recognition and voice commands, though many computer users were unaware those options existed. Intriguing as Windows Speech Recognition and OS X’s voice commands were, they were not as accurate or as easy to use as a standard keyboard and mouse.
Fifth Generation (Since 2001)
Speech recognition and related technology advanced dramatically in the fifth generation. Development resurged with one major event: the release of the Google Voice Search app for the iPhone. Google’s app was significant for two reasons. First, cell phones and other mobile devices are ideal vehicles for speech recognition, because the desire to replace their tiny on-screen keyboards motivates the development of better, alternative input methods. Second, Google could offload the app’s processing to its cloud data centers, applying all of that computing power to the large-scale data analysis required to match the user’s words against the massive amount of human-speech examples it had collected. In short, the availability of data and the ability to process it efficiently had always been the bottleneck in speech recognition. Google’s app also incorporated data from billions of search queries into its analysis to better predict what you were likely to say.
Google added “personalized recognition” to Voice Search on Android phones in 2010, allowing the software to record users’ voice searches and generate a more accurate speech model. In mid-2011, the company also added Voice Search to its Chrome browser. Google’s English Voice Search system now contains 230 billion words culled from real-world queries.
And then there is Siri, which relies on cloud-based processing like Google Voice Search. It generates a contextual response based on what it knows about you, and it responds to your voice input with personality. Speech recognition has evolved from a useful tool into a source of entertainment; the child, it seems, has grown into an adult.
The proliferation of voice recognition apps suggests that there will be plenty more in the future. These apps will allow you to control your PC with your voice or convert voice to text; they will also support multiple languages, offer a variety of speaker voices, and integrate into all aspects of your mobile devices. Speech recognition technology will likely spread to other devices as people become more comfortable speaking aloud to their mobile devices. It’s easy to imagine a time in the not-too-distant future when we’ll be able to control our coffee makers, communicate with our printers, and tell our lights to turn themselves off.