To understand speech, we don't need every frequency. We mostly need the shape of the vocal tract, and MFCCs provide a compact approximation of it.
1. Capturing the Envelope
When you speak, your vocal tract (throat, tongue, lips) acts as a filter on the sound produced by your vocal cords. This filter imposes a specific 'envelope', or shape, on the spectrum. MFCCs (Mel-Frequency Cepstral Coefficients) are designed to capture this envelope while largely discarding the pitch (the fine harmonic structure). This helps an AI model recognize the word 'Hello' whether it is spoken by a child, a man, or a woman.
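import librosa
# Load a speech recording ('speech.wav' is a placeholder path)
y, sr = librosa.load('speech.wav')
# Extract MFCCs from the raw waveform,
# keeping the first 13 coefficients
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
2. The Extraction Pipeline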
Extracting MFCCs takes three main steps: 1) Convert the signal to a Mel spectrogram, which spaces frequency bands the way human hearing does. 2) Take the logarithm of the band powers, because we perceive loudness roughly logarithmically. 3) Apply a Discrete Cosine Transform (DCT). The DCT is the key step: it compacts most of the information into the first few coefficients and, just as importantly, largely decorrelates the features, which makes them easier for many machine learning models to process.
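Steps 2 and 3 can be sketched in plain NumPy. This is a minimal illustration, not librosa's exact implementation: the helper names `dct_ii_matrix` and `mfcc_from_mel_power` are made up for this example, and a random array stands in for the Mel power spectrogram that step 1 would normally produce.

```python
import numpy as np

def dct_ii_matrix(n_out, n_in):
    """First n_out rows of the orthonormal DCT-II basis for length-n_in inputs."""
    n = np.arange(n_in)
    k = np.arange(n_out)[:, None]
    basis = np.sqrt(2.0 / n_in) * np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    basis[0] /= np.sqrt(2.0)  # rescale the constant row so every row has unit norm
    return basis

def mfcc_from_mel_power(mel_power, n_mfcc=13):
    """Steps 2 and 3: log-compress the band powers, then DCT along the band axis."""
    # Convert to decibels; the floor avoids taking log(0)
    log_mel = 10.0 * np.log10(np.maximum(mel_power, 1e-10))
    return dct_ii_matrix(n_mfcc, mel_power.shape[0]) @ log_mel

# Stand-in for step 1: a made-up Mel power spectrogram (40 bands, 100 frames)
rng = np.random.default_rng(0)
mel_power = rng.random((40, 100)) + 1e-3

mfccs_sketch = mfcc_from_mel_power(mel_power)
print(mfccs_sketch.shape)  # (13, 100)
```

Because the rows of the DCT basis are orthonormal, each coefficient measures how strongly one cosine pattern is present across the Mel bands. Keeping only the first rows is what performs the compression.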
import librosa.display
import matplotlib.pyplot as plt
# Visualization of the compressed features
plt.figure(figsize=(10, 4))
librosa.display.specshow(mfccs, sr=sr, x_axis='time')
plt.colorbar()
plt.title('MFCC representation of speech')
plt.tight_layout()
plt.show()
3. Why 13?
While a spectrogram might have 512 frequency bins, we typically keep only the first 13 to 20 MFCCs. The lower coefficients capture 'slow' changes along the frequency axis: the broad spectral shape imposed by the vocal tract, which distinguishes vowels and consonants. The higher coefficients capture 'fast' changes, such as the fine ripple of pitch harmonics or noise, which matter less for recognizing words. By keeping only the first few, we significantly reduce the amount of data our model needs to learn from.
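The effect of truncation can be seen on a synthetic example. The sketch below builds a made-up log-spectrum frame from two broad bumps (a stand-in for the vocal tract envelope) plus a fast ripple (a stand-in for harmonic fine structure), keeps 13 DCT coefficients, and maps them back. All values and names here are assumptions chosen for illustration, not measurements from real speech.

```python
import numpy as np

def dct_ii_matrix(n_out, n_in):
    """First n_out rows of the orthonormal DCT-II basis for length-n_in inputs."""
    n = np.arange(n_in)
    k = np.arange(n_out)[:, None]
    basis = np.sqrt(2.0 / n_in) * np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    basis[0] /= np.sqrt(2.0)  # rescale the constant row so every row has unit norm
    return basis

def rms(x):
    """Root-mean-square magnitude of an array."""
    return np.sqrt(np.mean(x ** 2))

n_bands = 128
n = np.arange(n_bands)

# Slowly varying envelope: two broad, formant-like bumps
envelope = (3.0 * np.exp(-0.5 * ((n - 30) / 15.0) ** 2)
            + 2.0 * np.exp(-0.5 * ((n - 80) / 15.0) ** 2))
# Fast ripple across frequency, standing in for pitch harmonics
ripple = 0.5 * np.cos(np.pi * 40 * (2 * n + 1) / (2 * n_bands))
frame = envelope + ripple

basis = dct_ii_matrix(13, n_bands)
coeffs = basis @ frame       # 128 values compressed to 13 coefficients
smooth = basis.T @ coeffs    # map the 13 coefficients back to 128 bands

err_full = rms(frame - envelope)        # how far the raw frame is from the envelope
err_truncated = rms(smooth - envelope)  # how far the 13-coefficient version is
print(coeffs.shape)  # (13,)
print(err_truncated < err_full)
```

In this constructed case the 13-coefficient reconstruction sits much closer to the envelope than the original frame does, because the ripple lives entirely in a higher coefficient that was dropped.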
# Add motion context with Deltas
import numpy as np
# Calculate speed (Δ) and acceleration (ΔΔ)
delta_mfcc = librosa.feature.delta(mfccs)
delta2_mfcc = librosa.feature.delta(mfccs, order=2)
feature_vector = np.vstack([mfccs, delta_mfcc, delta2_mfcc])  # shape: (39, n_frames)
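To see what the deltas measure, here is a simplified stand-in that uses plain finite differences along the time axis. Note that `librosa.feature.delta` uses a smoothed local fit over several frames rather than a plain difference, so its values will differ. The matrix below is made up so that the expected behaviour is easy to check by eye.

```python
import numpy as np

# Made-up MFCC matrix: 13 coefficients that each rise linearly over 50 frames
slopes = np.linspace(0.1, 1.3, 13)
frames = np.arange(50, dtype=float)
mfccs_demo = np.outer(slopes, frames)  # shape (13, 50)

# Finite differences along time (axis=1); a simplification, not librosa's delta filter
delta_demo = np.gradient(mfccs_demo, axis=1)   # "speed": change per frame
delta2_demo = np.gradient(delta_demo, axis=1)  # "acceleration": change of the change

features_demo = np.vstack([mfccs_demo, delta_demo, delta2_demo])
print(features_demo.shape)  # (39, 50)
```

For a coefficient that rises at a constant rate, the delta equals that rate in every frame and the delta-delta is zero, which is the "speed" and "acceleration" reading of these features.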