
Spectrograms & Mel Scale

Extract the DNA of sound. Master STFTs and perceptually-weighted feature extraction to feed modern deep learning audio models.


Tutor: Audio is a 1D waveform in the time domain. But to train AI models, we often need to understand the frequencies present at any given moment.



Concept: STFT

The Short-Time Fourier Transform determines the sinusoidal frequency and phase content of local sections of a signal as it changes over time.

Signal Validation

What is the primary advantage of STFT over a standard Fourier Transform?



Audio Intelligence: Decoding Frequencies


Raw audio waveforms are difficult for neural networks to interpret. By transforming time-domain data into spectrograms and applying the perceptual Mel Scale, we feed our AI models exactly what they need to "hear" like humans.

The STFT: Short-Time Fourier Transform

A standard Fourier Transform tells us what frequencies are in a signal, but loses all information about when those frequencies occurred. For speech recognition, timing is everything.

The STFT solves this by slicing the audio into overlapping frames (windows) and applying the Fourier transform to each. The result is a 2D matrix representing Time (columns), Frequency (rows), and Amplitude (values). When visualized, this is called a Spectrogram.
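To make the slicing concrete, here is a minimal NumPy sketch of the windowed-FFT idea. In practice you would call `librosa.stft`, which handles padding and windowing options for you; the helper name and the 440 Hz test tone below are illustrative choices, not librosa API.

```python
import numpy as np

def stft_magnitude(signal, n_fft=1024, hop_length=512):
    """Naive STFT: slide a Hann window along the signal, FFT each frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, axis=1)   # (n_frames, n_fft // 2 + 1)
    return np.abs(spectrum).T                # rows = frequency, cols = time

sr = 16000
t = np.arange(sr) / sr                       # one second of 16 kHz audio
tone = np.sin(2 * np.pi * 440 * t)           # a pure 440 Hz sine (A4)
spec = stft_magnitude(tone)
print(spec.shape)                            # (513, 30): freq bins x time frames
```

The energy concentrates near bin 28 (440 Hz / 15.625 Hz-per-bin), and the transpose gives the conventional spectrogram orientation: frequency on the rows, time on the columns.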

Human Perception & The Mel Scale

Linear spectrograms have a problem: they spread resolution evenly across frequency, so most of their rows describe high frequencies that humans perceive only coarsely. We are incredibly sensitive to small pitch changes at low frequencies (e.g., distinguishing 100 Hz from 150 Hz), but we cannot easily differentiate between 10,000 Hz and 10,050 Hz.

The Mel Scale is a logarithmic transformation of the Y-axis (frequencies) that mimics the human ear's non-linear perception of pitch.
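One common form of that mapping is the HTK-style formula m = 2595 · log10(1 + f/700) (librosa's default "slaney" variant differs slightly). A quick sketch, using the exact intervals from the paragraph above, shows how it compresses high frequencies:

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style mel formula; other variants (e.g. Slaney) differ slightly."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, back from mels to Hertz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# The same 50 Hz step covers a huge perceptual distance at the bottom
# of the range and almost none at the top.
low_gap = hz_to_mel(150) - hz_to_mel(100)       # clearly audible pitch change
high_gap = hz_to_mel(10050) - hz_to_mel(10000)  # nearly imperceptible
print(low_gap > 10 * high_gap)                  # True
```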

Architecture Tips

Window Size & Hop Length: When generating STFTs, a standard configuration for speech is a window size (`n_fft`) of 2048 or 1024, with a `hop_length` of 512. For 16kHz audio, `n_mels=80` or `128` is industry standard for feeding into Transformers like Wav2Vec or Whisper.
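Those parameters come together in the Mel filterbank: a matrix of triangular filters that maps the 513 linear STFT bins down to `n_mels` bands. The from-scratch sketch below is a simplified illustration using the HTK-style mel formula; `librosa.filters.mel` is the production version.

```python
import numpy as np

def mel_filterbank(sr=16000, n_fft=1024, n_mels=80):
    """Triangular filters evenly spaced on the mel axis (simplified sketch)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_mels + 2 points: each filter rises to its centre, falls to the next
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        for b in range(left, centre):
            fbank[i, b] = (b - left) / max(centre - left, 1)
        for b in range(centre, right):
            fbank[i, b] = (right - b) / max(right - centre, 1)
    return fbank

fb = mel_filterbank()
print(fb.shape)  # (80, 513): apply with fb @ power_spectrogram
```

Multiplying this matrix onto a power spectrogram collapses each column from 513 values to 80, with low frequencies covered by many narrow filters and high frequencies by a few wide ones.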

Audio Processing FAQ

Why use a Mel Spectrogram instead of raw waveforms for AI?

While newer models (like Wav2Vec 2.0) can learn from raw audio, processing a 16kHz audio file means 16,000 data points per second. Extracting Mel Spectrograms acts as a dense, compressed feature representation that highlights the frequencies most crucial to human speech, dramatically reducing computational load and speeding up model convergence.
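Rough arithmetic on those data rates, using the typical parameters from the tips above (16 kHz audio, hop length 512, 80 mel bands), makes the reduction concrete:

```python
sr, hop_length, n_mels = 16000, 512, 80

raw_rate = sr                        # 16,000 raw samples per second
frames_per_sec = sr / hop_length     # 31.25 spectrogram columns per second
mel_rate = frames_per_sec * n_mels   # 2,500 mel values per second

print(raw_rate / mel_rate)           # 6.4: x-fold reduction in values/second
```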

What does "power_to_db" actually do?

The amplitude (power) of an audio signal varies wildly. A sound that is perceived as "twice as loud" actually requires exponentially more energy. By converting the power matrix to Decibels (dB), we apply a logarithmic scale to the amplitude, which again aligns the mathematical data with human perceptual reality.
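A sketch in the spirit of librosa's `power_to_db` (the `amin` floor prevents `log10(0)`, and `top_db` clamps the dynamic range, mirroring librosa's defaults):

```python
import numpy as np

def power_to_db(S, ref=1.0, amin=1e-10, top_db=80.0):
    """Convert power to decibels: 10 * log10(S / ref), floored and clamped."""
    log_spec = 10.0 * np.log10(np.maximum(amin, S) / ref)
    if top_db is not None:
        # Keep everything within top_db of the loudest value
        log_spec = np.maximum(log_spec, log_spec.max() - top_db)
    return log_spec

power = np.array([1.0, 10.0, 100.0])
print(power_to_db(power))  # [ 0. 10. 20.]: every 10x power step is +10 dB
```

Note how a hundredfold jump in power becomes a mere 20 dB step: the log scale squeezes the wild amplitude range into numbers a neural network can digest.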

Acoustic Dictionary

STFT
Short-Time Fourier Transform. Analyzes how the frequency content of a signal changes over time.
Spectrogram
A visual representation of the spectrum of frequencies of a signal as it varies with time.
Mel Scale
A perceptual scale of pitches judged by listeners to be equal in distance from one another.
Mel Filterbank
A set of triangular filters used to convert a linear power spectrum into a Mel spectrum.
Decibel (dB)
A logarithmic unit used to express the ratio of two values of a physical quantity, often power or intensity.
Hop Length
The number of audio samples between successive STFT columns. Controls the time resolution of the spectrogram.