Speech recognition technology has changed how people with disabilities use computers. It is at the forefront of expanding opportunities for leisure, education, and work.
Speech recognition makes it simpler for people with disabilities to use computers. If you have motor limitations and cannot use a standard keyboard and mouse, you can navigate the computer and produce documents with your voice. The technology can also benefit people with learning difficulties who have trouble spelling and writing.
Some people with speech difficulties may benefit from using speech recognition as a therapy tool to enhance their vocal range. Hands-free computer control through speech recognition is particularly valuable for people with overuse or repetitive strain injuries. Speech recognition can greatly increase computer accessibility for individuals with disabilities and open up a world of possibilities.
Speech recognition software is “speaker dependent”: it must be trained to detect and understand the user’s voice. The program is trained by reading several pages of text to the computer, which analyzes the voice data to produce a unique voice file for that user.
Commercially available speech recognition software processes voice data using a continuous speech model. Continuous speech means speaking without pausing between words; the user talks to the computer in phrases and sentences. This allows the user to speak naturally without reducing the program’s accuracy.
This post will explore three popular methods for speech recognition.
- Hidden Markov Model: One approach to speech recognition is to construct a statistical model of each word in the vocabulary and to recognize an input word as the vocabulary word whose model assigns the highest probability to the observed input pattern.
- Deep Neural Network: A DNN is a feed-forward, artificial neural network with more than one layer of hidden units between inputs and outputs.
- Dynamic Time Warping: DTW provides the time registration between each reference pattern and test pattern.
1. Hidden Markov Model (HMM)
As the name implies, hidden Markov modeling is a modeling strategy, and it involves three elements: the model itself, a method for calculating the likelihood that the model would produce a specific output, and a procedure for estimating the model’s parameters from speech samples of the known target word. A hidden Markov model (HMM) is a doubly stochastic process used to generate a sequence of observed symbols. The symbols are generated by a collection of stochastic processes controlled by an underlying stochastic finite state machine (FSM). When a state is entered following a state transition, the FSM probabilistically selects an output symbol from that state’s set of symbols.
The term “hidden” is apt because the true state of the FSM can only be observed indirectly through the symbols it emits. For isolated word recognition, there is one HMM for every word in the vocabulary; these HMMs might be built from HMMs that model individual words or subword components such as phonemes. In continuous word recognition, a single HMM corresponds to the domain grammar, and this grammar model is built from the word-model HMMs. The observable symbols correspond to measurements of speech frames.
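To make the word-scoring idea concrete, here is a minimal sketch in Python with NumPy of the forward algorithm, which computes the likelihood that a word’s HMM produced an observed symbol sequence; the word whose model yields the highest likelihood wins. The discrete symbols, the two-word vocabulary, and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Forward algorithm: P(obs | HMM) for a discrete-output HMM.

    pi  : (N,)   initial state probabilities
    A   : (N, N) transition probabilities, A[i, j] = P(state j | state i)
    B   : (N, M) emission probabilities, B[i, k] = P(symbol k | state i)
    obs : sequence of observed symbol indices (e.g. quantized speech frames)
    """
    alpha = pi * B[:, obs[0]]                # initialize with the first symbol
    for symbol in obs[1:]:
        alpha = (alpha @ A) * B[:, symbol]   # propagate state beliefs, then emit
    return alpha.sum()                       # total probability over final states

# Hypothetical two-word vocabulary, each word modeled by a 3-state HMM
# over 4 quantized acoustic symbols (toy numbers, for illustration only).
models = {
    "yes": (np.array([1.0, 0.0, 0.0]),
            np.array([[0.6, 0.4, 0.0],
                      [0.0, 0.7, 0.3],
                      [0.0, 0.0, 1.0]]),
            np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.1, 0.7, 0.1, 0.1],
                      [0.1, 0.1, 0.7, 0.1]])),
    "no":  (np.array([1.0, 0.0, 0.0]),
            np.array([[0.5, 0.5, 0.0],
                      [0.0, 0.6, 0.4],
                      [0.0, 0.0, 1.0]]),
            np.array([[0.1, 0.1, 0.1, 0.7],
                      [0.1, 0.1, 0.7, 0.1],
                      [0.7, 0.1, 0.1, 0.1]])),
}

observed = [0, 1, 1, 2]   # a quantized test utterance
scores = {word: forward_likelihood(pi, A, B, observed)
          for word, (pi, A, B) in models.items()}
print(max(scores, key=scores.get))  # the word whose model best explains the input
```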
2. Neural Network Model (NNM)
The neural network models most frequently used to improve speech recognition include the single-layer perceptron, the multi-layer perceptron, the Kohonen self-organizing feature map, the radial basis function neural network, and the predictive neural network. In addition, time-delay neural networks, recurrent neural networks, and other architectures are used so that the network can reflect the time-varying dynamics of the speech signal. A DNN is a feed-forward artificial neural network with more than one layer of hidden units between its inputs and outputs.
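As an illustration of that feed-forward structure, here is a minimal sketch in Python with PyTorch of a DNN acoustic model that maps a window of acoustic feature frames to phoneme-class scores. The layer sizes, feature dimensions, and random training batch are hypothetical placeholders; real systems use far more layers, units, and data.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 11 stacked frames of 40 filterbank features in,
# 42 phoneme classes out (toy values chosen only for illustration).
INPUT_DIM, HIDDEN_DIM, NUM_PHONEMES = 11 * 40, 512, 42

# Feed-forward DNN: several hidden layers between input and output.
acoustic_model = nn.Sequential(
    nn.Linear(INPUT_DIM, HIDDEN_DIM), nn.ReLU(),
    nn.Linear(HIDDEN_DIM, HIDDEN_DIM), nn.ReLU(),
    nn.Linear(HIDDEN_DIM, HIDDEN_DIM), nn.ReLU(),
    nn.Linear(HIDDEN_DIM, NUM_PHONEMES),   # per-frame phoneme scores
)

# One training step on a random batch (placeholder for real features/labels).
features = torch.randn(32, INPUT_DIM)             # 32 frame windows
labels = torch.randint(0, NUM_PHONEMES, (32,))    # their phoneme labels
optimizer = torch.optim.Adam(acoustic_model.parameters(), lr=1e-3)
loss = nn.CrossEntropyLoss()(acoustic_model(features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```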
3. Dynamic Time Warping (DTW)
When a speaker pronounces the same word twice, the two utterances differ in time, frequency, and amplitude, so word recognition cannot be reduced to a simple linearly time-scaled template-matching problem. Because the corresponding feature vectors (patterns) will never be identical, corresponding events in the two patterns must be aligned, and the distance between the patterns is then computed using this alignment. Dynamic time warping (DTW) provides this time registration between each reference pattern and the test pattern. The correspondence is established by a user-specified distance metric between a frame of the test pattern and a frame of the reference pattern.
The best alignment function is the one that minimizes the total accumulated distance over the frames of the test pattern, so time alignment and distance computation are carried out simultaneously. After scores have been computed for all reference patterns, the input is recognized as the word of the reference class with the smallest accumulated distance.
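Below is a minimal DTW sketch in Python with NumPy, assuming each pattern is a sequence of feature vectors and using Euclidean distance as the (user-replaceable) frame metric. The `recognize` helper and the toy templates are hypothetical; it simply picks the reference template with the smallest warped distance, matching the decision rule described above.

```python
import numpy as np

def dtw_distance(test, reference):
    """Accumulated DTW distance between two feature-vector sequences.

    test, reference : arrays of shape (num_frames, num_features)
    Frame metric: Euclidean distance between individual frames.
    """
    n, m = len(test), len(reference)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(test[i - 1] - reference[j - 1])
            # Best of the three allowed predecessor alignments.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(test, templates):
    """Return the label of the reference template with the smallest DTW distance."""
    return min(templates, key=lambda word: dtw_distance(test, templates[word]))

# Toy example with made-up 2-D feature frames (illustration only).
templates = {
    "yes": np.array([[0.0, 1.0], [0.2, 0.9], [0.8, 0.1]]),
    "no":  np.array([[1.0, 0.0], [0.9, 0.2], [0.1, 0.8]]),
}
utterance = np.array([[0.1, 1.0], [0.1, 0.95], [0.3, 0.8], [0.7, 0.2]])
print(recognize(utterance, templates))
```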