MFCCs Explained: Translating Sound for Machines

Pascual Vila
Audio Data Scientist // Code Syllabus
To teach a machine to understand speech, we must first teach it how to "hear" like a human. Mel-Frequency Cepstral Coefficients (MFCCs) have been the industry standard for Automatic Speech Recognition (ASR) for decades because they successfully mimic the biological processes of human auditory perception.
The Problem with Raw Audio
Raw audio waveforms are incredibly dense data streams (often 16,000 to 44,100 samples per second). Neural networks fed directly with raw waveforms often struggle to learn meaningful patterns due to the noise and high dimensionality. By computing the Short-Time Fourier Transform (STFT), we shift our perspective from the Time Domain to the Frequency Domain.
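As a minimal sketch of that shift, assuming librosa is available (the file name and the frame parameters `n_fft` and `hop_length` below are illustrative choices, not prescribed values):

```python
# Moving from the time domain to the frequency domain with an STFT.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)       # hypothetical input file
stft = librosa.stft(y, n_fft=512, hop_length=160)  # complex spectrogram
power_spec = np.abs(stft) ** 2                     # power spectrum per frame

print(power_spec.shape)  # (1 + n_fft/2, n_frames) = (257, n_frames)
```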
Biological Approximation: The Mel Scale
Humans don't perceive frequencies linearly. We can easily tell the difference between 100Hz and 200Hz, but we can hardly distinguish between 10,000Hz and 10,100Hz. The Mel Scale is a perceptual scale that models this non-linear hearing.
To convert a physical frequency $f$ (in Hertz) to the Mel scale $m$, we use the standard formula:

$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$
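Translated directly into Python (numpy only), the formula reproduces the perceptual gap from the example above:

```python
import numpy as np

def hz_to_mel(f_hz):
    """Convert frequency in Hz to the Mel scale (2595 * log10(1 + f/700))."""
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

# The same 100 Hz gap shrinks dramatically at high frequencies:
print(hz_to_mel(200) - hz_to_mel(100))      # ~133 mel apart
print(hz_to_mel(10100) - hz_to_mel(10000))  # ~10 mel apart
```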
We apply a Mel Filterbank (typically 40 to 128 triangular filters) to the power spectrum to group frequencies together, emphasizing the low-end where human speech primarily lives.
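A sketch of that step with librosa, assuming the `power_spec` array from the STFT sketch above (40 filters is one common choice):

```python
import librosa

# Triangular Mel filterbank: each row is one filter over the FFT bins.
mel_fb = librosa.filters.mel(sr=16000, n_fft=512, n_mels=40)  # (40, 257)

# Weighted sum of FFT bins per filter -> one energy per Mel band per frame.
mel_energies = mel_fb @ power_spec                            # (40, n_frames)
```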
The Final Steps: Log and DCT
After applying the Mel Filterbank, we take the Logarithm. Why? Because human perception of loudness is also logarithmic (which is why decibels are calculated using logs).
Finally, the filterbank energies are highly correlated. Machine learning algorithms, especially older ones like Gaussian Mixture Models (GMMs), assume data features are independent. To solve this, we apply a Discrete Cosine Transform (DCT), which decorrelates the energies. The outputs of this transform are the Mel-Frequency Cepstral Coefficients. We usually keep only the first 12-13 coefficients.
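A minimal sketch of these last two steps, assuming the `mel_energies` array from the filterbank sketch above:

```python
import numpy as np
from scipy.fft import dct

log_mel = np.log(mel_energies + 1e-10)  # small epsilon avoids log(0)

# Type-II DCT along the Mel axis with orthonormal scaling,
# then truncate to the first 13 coefficients.
mfcc = dct(log_mel, type=2, axis=0, norm="ortho")[:13]  # (13, n_frames)
```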
Modern Context: Do we still use MFCCs?
Yes and No. While classical ML models required the compact, decorrelated 13 MFCCs, modern Deep Learning architectures (like Convolutional Neural Networks and Transformers) are perfectly capable of handling correlated data. Thus, it is increasingly common to skip the DCT step and feed raw Log-Mel Spectrograms (e.g., 80 or 128 dimensions) directly into the neural network.
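A sketch of that modern path with librosa, skipping the DCT entirely (the file name is illustrative; 80 Mel bands per the example above):

```python
import librosa

y, sr = librosa.load("speech.wav", sr=16000)
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
log_mel_spec = librosa.power_to_db(mel_spec)  # (80, n_frames), fed to the network
```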
Audio Processing FAQ
What does "Cepstrum" mean in Audio Processing?
"Cepstrum" is "spectrum" with its first four letters reversed. It was coined in 1963 by Bogert et al. It represents the spectrum of a log spectrum. By taking the Inverse Fourier Transform (or DCT) of the logarithm of an estimated spectrum, we essentially decouple the sound source (the vocal cords) from the filter (the vocal tract shape).
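As a numpy-only sketch of that definition, the real cepstrum of a single windowed frame (`frame` is a hypothetical 1-D array) is just the inverse FFT of the log magnitude spectrum:

```python
import numpy as np

def real_cepstrum(frame):
    spectrum = np.fft.rfft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-10)  # epsilon avoids log(0)
    return np.fft.irfft(log_mag)

# Low "quefrency" bins capture the slowly varying vocal-tract envelope;
# a strong peak at higher quefrency reveals the pitch period of the source.
```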
Why do we extract exactly 13 MFCCs?
The lower-order coefficients (1 through 13) describe the general, broad shape of the spectral envelope, which correlates to the physical shape of the speaker's vocal tract (formants). Higher coefficients represent fast spectral changes (pitch harmonics) which are largely irrelevant for determining *what* phoneme was spoken. Thus, truncating at 13 drops the pitch info but keeps the phonetic info.
How does the Librosa library calculate MFCCs?
In Python, `librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)` executes the entire pipeline under the hood: it computes the STFT, maps it to the Mel scale, takes the power/logarithm, and finally applies the discrete cosine transform, returning a 2D numpy array of shape (13, number_of_frames).
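A minimal usage example (the file name is illustrative; `n_mfcc=13` matches the convention discussed above):

```python
import librosa

y, sr = librosa.load("speech.wav", sr=16000)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, number_of_frames)
```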