
MFCCs Explained

Unlock the core architecture of speech processing. Translate raw waveforms into biologically inspired acoustic features for Machine Learning.




MFCCs Explained: Translating Sound for Machines

Author

Pascual Vila

Audio Data Scientist // Code Syllabus

To teach a machine to understand speech, we must first teach it how to "hear" like a human. Mel-Frequency Cepstral Coefficients (MFCCs) have been the industry standard for Automatic Speech Recognition (ASR) for decades because they successfully mimic the biological processes of human auditory perception.

The Problem with Raw Audio

Raw audio waveforms are incredibly dense data streams (often 16,000 to 44,100 samples per second). Neural networks fed directly with raw waveforms often struggle to learn meaningful patterns due to the noise and high dimensionality. By computing the Short-Time Fourier Transform (STFT), we shift our perspective from the Time Domain to the Frequency Domain.
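To make the frame-by-frame idea concrete, here is a minimal, illustrative STFT in pure Python. The `stft` helper and its parameters are our own for this sketch; in practice you would call `librosa.stft` or `scipy.signal.stft`:

```python
import cmath
import math

def stft(signal, frame_len=8, hop=4):
    """Naive STFT: slide a window over the signal and take the
    DFT magnitude of each frame."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # Hann window reduces spectral leakage at the frame edges
        windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1)))
                    for n, s in enumerate(frame)]
        # DFT magnitudes for the non-redundant bins (0 .. frame_len//2)
        spectrum = []
        for k in range(frame_len // 2 + 1):
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n, x in enumerate(windowed))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames
```

Feeding it a pure sine wave whose period is 4 samples produces a spectrum whose peak sits in bin 2 of each 8-sample frame, exactly the time-to-frequency shift described above.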

Biological Approximation: The Mel Scale

Humans don't perceive frequencies linearly. We can easily tell the difference between 100Hz and 200Hz, but we can hardly distinguish between 10,000Hz and 10,100Hz. The Mel Scale is a perceptual scale that models this non-linear hearing.

To convert a physical frequency $f$ (in Hertz) to the Mel scale $m$, we use the standard formula:

$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$
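The conversion is easy to sketch directly from the formula (the function names `hz_to_mel` / `mel_to_hz` are our own for this example):

```python
import math

def hz_to_mel(f):
    """Convert a frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping: Mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

By construction, 1000 Hz lands at roughly 1000 Mel, and a 100 Hz gap at the low end spans far more Mels than the same gap near 10 kHz, which is exactly the perceptual claim above.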

We apply a Mel Filterbank (typically 40 to 128 triangular filters) to the power spectrum to group frequencies together, emphasizing the low-end where human speech primarily lives.
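Such a filterbank can be sketched in a few lines. Everything below (the function names, the choice of 10 filters over a 64-point FFT) is illustrative only; production code would use `librosa.filters.mel`:

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=10, n_fft=64, sr=16000):
    """Triangular filters whose centres are evenly spaced on the Mel scale."""
    low, high = hz_to_mel(0.0), hz_to_mel(sr / 2.0)
    # n_filters + 2 boundary points: each filter spans left/centre/right
    mels = [low + i * (high - low) / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sr) for m in mels]
    fbank = [[0.0] * (n_fft // 2 + 1) for _ in range(n_filters)]
    for i in range(n_filters):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, centre):       # rising edge
            fbank[i][k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):      # falling edge, peak of 1.0 at centre
            fbank[i][k] = (right - k) / max(right - centre, 1)
    return fbank
```

Because the centres are spaced evenly in Mels, the filters are narrow and tightly packed at low frequencies and grow progressively wider towards the top of the spectrum.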

The Final Steps: Log and DCT

After applying the Mel Filterbank, we take the Logarithm. Why? Because human perception of loudness is also logarithmic (which is why decibels are calculated using logs).

Finally, the filterbank energies are highly correlated with one another. Machine learning algorithms, especially older ones like Gaussian Mixture Models (GMMs) with diagonal covariance, assume the features are statistically independent. To solve this, we apply a Discrete Cosine Transform (DCT). The outputs of this transform are the Mel-Frequency Cepstral Coefficients. We usually keep only the first 12-13 coefficients.
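Assuming we already have one frame's worth of Mel filterbank energies, these last two steps (log compression, then a DCT-II) can be sketched as follows. The function name and the `1e-10` floor are our own choices for the example:

```python
import math

def mfcc_from_filterbank(energies, n_mfcc=13):
    """Log-compress filterbank energies, then decorrelate with a DCT-II."""
    log_e = [math.log(e + 1e-10) for e in energies]  # small floor avoids log(0)
    N = len(log_e)
    # DCT-II: coefficient k projects the log energies onto a cosine basis
    return [sum(le * math.cos(math.pi * k * (n + 0.5) / N)
                for n, le in enumerate(log_e))
            for k in range(n_mfcc)]
```

A quick sanity check on the decorrelation: a perfectly flat filterbank frame compresses into a single DC coefficient, with every higher coefficient collapsing to zero.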

Modern Context: Do we still use MFCCs?

Yes and No. While classical ML models required the highly uncorrelated 13 MFCCs, modern Deep Learning architectures (like Convolutional Neural Networks and Transformers) are perfectly capable of handling correlated data. Thus, it is increasingly common to skip the DCT step and feed raw Log-Mel Spectrograms (e.g., 80 or 128 dimensions) directly into the neural network.

❓ Audio Processing FAQ

What does "Cepstrum" mean in Audio Processing?

The word "cepstrum" was coined in 1963 by Bogert et al. by reversing the first four letters of "spectrum". It represents the spectrum of a log spectrum: by taking the Inverse Fourier Transform (or DCT) of the logarithm of an estimated spectrum, we essentially decouple the sound source (the vocal cords) from the filter (the vocal tract shape).

Why do we extract exactly 13 MFCCs?

The lower-order coefficients (1 through 13) describe the general, broad shape of the spectral envelopeβ€”which correlates to the physical shape of the speaker's vocal tract (formants). Higher coefficients represent fast spectral changes (pitch harmonics) which are largely irrelevant for determining *what* phoneme was spoken. Thus, truncating at 13 drops the pitch info but keeps the phonetic info.

How does the Librosa library calculate MFCCs?

In Python, `librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)` executes the entire pipeline under the hood: it computes the STFT, maps it to the Mel scale, takes the power/logarithm, and finally applies the discrete cosine transform, returning a 2D numpy array of shape (13, number_of_frames).

Audio Processing Glossary

STFT
Short-Time Fourier Transform. Breaks audio into short frames and determines the frequencies present in each.
Mel Scale
A perceptual pitch scale where equal distances in pitch sound equally distant to the human ear.
Spectrogram
A visual representation of the spectrum of frequencies of a signal as it varies with time.
DCT
Discrete Cosine Transform. Decorrelates features and compresses information into fewer coefficients.
Vocal Tract
The biological filter (throat, mouth, lips) that shapes the raw sound from vocal cords into distinct phonemes.
Framing
Cutting continuous audio into small overlapping segments (usually 20-30 ms), on the assumption that the signal is statistically stationary within each window.