A raw spectrogram is often too high-dimensional and redundant for simple speech models. Mel-Frequency Cepstral Coefficients (MFCCs) provide a compact, biologically inspired representation of the human voice.
1. The Spectrum of a Spectrum
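The term 'cepstrum' is an anagram of 'spectrum' (the first four letters are reversed), reflecting that it is, loosely, a spectrum of a spectrum. To calculate MFCCs, we take the Log-Mel Spectrogram and apply the Discrete Cosine Transform (DCT) across the frequency axis of each frame. This process approximately 'decorrelates' the data. In a normal spectrogram, adjacent frequency bins are highly correlated; the DCT concentrates that information into a small number of largely uncorrelated coefficients. This makes MFCCs well suited to older Machine Learning models like GMMs or HMMs, which often assume uncorrelated features (for example, diagonal covariance matrices), and they remain relevant for lightweight Deep Learning on the edge.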
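The DCT step described above can be sketched with NumPy alone. This is a minimal illustration, not librosa's implementation: `dct_ii_matrix`, `mfcc_from_log_mel`, and the random `log_mel_demo` array are hypothetical names introduced here, and the random array merely stands in for a real log-mel spectrogram of shape (mel bands, frames).

```python
import numpy as np

def dct_ii_matrix(n_bands):
    """Orthonormal DCT-II basis: row k is the k-th cosine basis function."""
    n = np.arange(n_bands)
    k = n[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_bands))
    basis[0] *= np.sqrt(1.0 / n_bands)
    basis[1:] *= np.sqrt(2.0 / n_bands)
    return basis

def mfcc_from_log_mel(log_mel, n_mfcc=13):
    """Apply the DCT across the mel-band axis and keep the first n_mfcc rows."""
    basis = dct_ii_matrix(log_mel.shape[0])
    return basis[:n_mfcc] @ log_mel

# Assumption: a synthetic stand-in for a log-mel spectrogram,
# 40 mel bands x 200 frames, with neighbouring bands correlated.
rng = np.random.default_rng(0)
log_mel_demo = np.cumsum(rng.normal(size=(40, 200)), axis=0)

mfcc_demo = mfcc_from_log_mel(log_mel_demo)
print(mfcc_demo.shape)
```

Keeping only the first 13 rows is what makes the representation compact: the higher-order coefficients, which describe fine detail across frequency, are simply dropped.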
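import librosa
# Load a speech recording ('speech.wav' is a placeholder path)
y, sr = librosa.load('speech.wav')
# Calculate MFCCs directly from audio
# n_mfcc specifies how many coefficients to keep
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
2. Modeling the Human Voice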
Sound is created by air passing through the vocal folds (the Source) and then being shaped by the mouth, tongue, and throat (the Filter). The filter creates resonances called formants. MFCCs are designed to capture the broad spectral envelope that carries these formants while largely discarding the exact pitch of the vocal folds. This is a large part of why a speech model can recognize the word 'Hello' whether it's spoken by a deep-voiced man or a high-pitched child: it's looking at the filter shape, which the low-order MFCCs approximate well.
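To see why truncating the cepstrum discards pitch, consider a deliberately idealized toy: a log spectrum made of a slow cosine (the 'filter' envelope) plus a fast cosine ripple (the 'source' harmonics). Everything here is an assumption for illustration, including the names `dct_ii` and `toy_log_spectrum`; real speech is far messier, and the separation is only approximate in practice.

```python
import numpy as np

def dct_ii(x):
    """Orthonormal DCT-II of a 1-D array, written out with NumPy."""
    n_bins = x.shape[0]
    n = np.arange(n_bins)
    basis = np.cos(np.pi * n[:, None] * (2 * n + 1) / (2 * n_bins))
    scale = np.full(n_bins, np.sqrt(2.0 / n_bins))
    scale[0] = np.sqrt(1.0 / n_bins)
    return scale * (basis @ x)

def toy_log_spectrum(ripple_rate, n_bins=128):
    """Slow 'vocal tract' envelope plus a fast 'pitch' ripple (idealized)."""
    phase = np.pi * (2 * np.arange(n_bins) + 1) / (2 * n_bins)
    envelope = np.cos(2 * phase)                # slow variation across frequency
    ripple = 0.5 * np.cos(ripple_rate * phase)  # fast variation across frequency
    return envelope + ripple

# A lower pitch packs harmonics closer together, i.e. a faster ripple.
low_voice = dct_ii(toy_log_spectrum(ripple_rate=60))
high_voice = dct_ii(toy_log_spectrum(ripple_rate=40))

# The first 13 coefficients agree; the ripple only appears at index 60 or 40.
same_envelope = np.allclose(low_voice[:13], high_voice[:13])
print(same_envelope)
```

In this toy the two 'voices' differ only in coefficients well above index 13, so keeping the first 13 retains the shared envelope and drops the pitch-dependent part.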
import librosa.display
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, x_axis='time')
plt.colorbar()
plt.title('MFCC representation of speech')
plt.tight_layout()
3. Capturing Motion
Speech is not static; it's a sequence of movements. A single frame of MFCCs only shows a 'snapshot' of the vocal tract. To see how the sound is changing, we calculate Deltas (an estimate of the first time derivative of each coefficient, taken across neighboring frames) and Delta-Deltas (the second derivative). These tell the model how quickly the spectral shape is changing, for example how fast a vowel is transitioning into a consonant. A standard feature vector for speech often consists of 13 MFCCs, 13 Deltas, and 13 Delta-Deltas, for a total of 39 features per frame.
import numpy as np
# Calculate Deltas (order=1, the default) and Delta-Deltas (order=2) along the time axis
delta_mfcc = librosa.feature.delta(mfccs)
delta2_mfcc = librosa.feature.delta(mfccs, order=2)
# Stack along the feature axis to create a 39-dimensional feature per frame
# Resulting shape: (39, n_frames)
feature_vector = np.vstack([mfccs, delta_mfcc, delta2_mfcc])
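For intuition about what the deltas measure, here is a simplified sketch using plain finite differences via `np.gradient`. It is not what `librosa.feature.delta` computes (to my understanding, librosa fits a smoothed local polynomial over a window of frames by default, so its values will differ); `simple_delta` and the ramp-shaped `toy_mfccs` are hypothetical stand-ins introduced for illustration.

```python
import numpy as np

def simple_delta(features):
    """Finite-difference estimate of the frame-to-frame rate of change."""
    return np.gradient(features, axis=1)

# Assumption: stand-in for real MFCCs, 13 coefficients x 50 frames,
# each coefficient rising by exactly 1 per frame.
toy_mfccs = np.tile(np.arange(50, dtype=float), (13, 1))

toy_delta = simple_delta(toy_mfccs)    # constant slope of 1
toy_delta2 = simple_delta(toy_delta)   # slope is not changing, so 0

# Stack to (39, n_frames), then transpose to one 39-dimensional row per frame
toy_features = np.vstack([toy_mfccs, toy_delta, toy_delta2])
frames_first = toy_features.T
print(toy_features.shape, frames_first.shape)
```

Many sequence models expect one row per time step, which is why the transpose to (n_frames, 39) is shown; whether you need it depends on the model you feed.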