Speech is a series of events that happen over time. Hidden Markov Models provide the mathematical framework for guessing the 'hidden' words from the 'visible' sounds.
2. The Viterbi Path
When we talk, we might hold a vowel for 100 ms one time and 200 ms the next. The Viterbi algorithm uses dynamic programming to find the 'Most Likely Path' through all possible hidden states. It efficiently finds the sequence of phonemes that maximizes the overall probability, allowing the system to correctly identify 'Hello' whether the user speaks slowly or quickly. Without Viterbi, the computer would have to score every possible state sequence: with N states and T audio frames there are N^T of them, which is computationally infeasible for even a short sentence. Viterbi needs only on the order of T × N² steps, because at each frame it keeps just the best path into each state.
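To get a feel for the gap between brute force and dynamic programming, the arithmetic can be sketched in a few lines. The sizes below (40 states, 100 frames) are assumptions picked only for illustration, not figures from any particular recognizer.

```python
# Hypothetical sizes, chosen only for illustration.
num_states = 40    # assumed: roughly one hidden state per phoneme
num_frames = 100   # assumed: about one second of audio at one frame per 10 ms

# Brute force would score every possible state sequence: N^T candidates.
brute_force_paths = num_states ** num_frames

# Viterbi keeps only the best path into each state at each frame,
# so it scores on the order of T * N^2 transitions.
viterbi_steps = num_frames * num_states ** 2

print(f"brute force: a {len(str(brute_force_paths))}-digit number of paths")
print(f"Viterbi:     about {viterbi_steps:,} transition scores")
```

Even at these modest sizes the brute-force count is far beyond anything a computer could enumerate, while the Viterbi count is trivial.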
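def viterbi(obs, states, start_p, trans_p, emit_p):
    # Dynamic programming to find best path. Assumes emit_p[s] is a function giving P(o | s).
    best = {s: (start_p[s] * emit_p[s](obs[0]), [s]) for s in states}  # (probability, best path ending in s)
    for o in obs[1:]:
        best = {s: max(((p * trans_p[r][s] * emit_p[s](o), path + [s])
                        for r, (p, path) in best.items()), key=lambda c: c[0]) for s in states}
    # Returns the most likely sequence of states and its probability
    path_probability, path = max(best.values(), key=lambda c: c[0])
    return path, path_probability
likely_phonemes, path_probability = viterbi(mfccs, hmm_states, ...)

3. Acoustic Modeling with GMMs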
To handle the fact that every person's voice sounds slightly different, HMMs were paired with Gaussian Mixture Models (GMMs). The GMM's job was to model the 'Acoustic Likelihood': given that the state is the phoneme '/a/', how likely is it that we would see these specific MFCC values? This GMM-HMM architecture was the state of the art for ASR until around 2012, when Deep Neural Networks began to outperform it by replacing the GMM with a much more powerful 'Deep' acoustic model.
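from sklearn.mixture import GaussianMixture

# Train a GMM for the phoneme '/a/'
# (mfccs_for_phoneme_a: array of shape (n_frames, n_mfcc) holding frames labeled '/a/')
gmm_a = GaussianMixture(n_components=8)
gmm_a.fit(mfccs_for_phoneme_a)

# Score a new frame; score() returns the average log-likelihood, not a raw probability
log_likelihood = gmm_a.score([new_frame])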
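
To make the 'Acoustic Likelihood' concrete, here is a minimal pure-Python sketch of the quantity a GMM assigns to a single frame: the log of a weighted sum of Gaussian densities. It assumes diagonal covariances for simplicity, and the helper name `gmm_log_likelihood`, the 2-D 'MFCC' vectors, and every weight, mean, and variance are made up for illustration rather than taken from a trained model.

```python
import math

def gmm_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        # Log density of a diagonal Gaussian: sum the per-dimension terms.
        log_gauss = sum(
            -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            for x, m, v in zip(frame, mu, var)
        )
        component_logs.append(math.log(w) + log_gauss)
    # log-sum-exp: combine the components without numerical underflow.
    top = max(component_logs)
    return top + math.log(sum(math.exp(c - top) for c in component_logs))

# Made-up two-component models for two phonemes (illustrative values only).
gmm_a = dict(weights=[0.6, 0.4],
             means=[[1.0, 2.0], [1.5, 2.5]],
             variances=[[0.5, 0.5], [0.3, 0.3]])
gmm_i = dict(weights=[0.5, 0.5],
             means=[[-2.0, 0.0], [-1.5, 0.5]],
             variances=[[0.5, 0.5], [0.4, 0.4]])

new_frame = [1.2, 2.1]
score_a = gmm_log_likelihood(new_frame, **gmm_a)
score_i = gmm_log_likelihood(new_frame, **gmm_i)
print("more likely /a/" if score_a > score_i else "more likely /i/")
```

In a GMM-HMM recognizer, per-state scores like these play the role of the emission term that the Viterbi search consumes at every frame. Working in log space, as here, also avoids the underflow that comes from multiplying many small probabilities.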