Sound is a mix of frequencies. A spectrogram shows this mix as a 2D map of frequency content over time, revealing structure in the audio that is hard to see in the raw waveform.
1. Short-Time Fourier Transform
The Fourier Transform is a mathematical tool that converts a signal from the time domain to the frequency domain. Because the frequency content of audio changes over time, we use the Short-Time Fourier Transform (STFT): we break the audio into short, usually overlapping frames and apply a Fourier Transform to each one. The result is a 2D array indexed by frequency and time, where each entry carries a magnitude (and a phase). Plotting the magnitude with time on one axis and frequency on the other gives a spectrogram, a visual 'X-ray' of sound.
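To make the framing step concrete, here is a minimal NumPy-only sketch of the idea. The function name `naive_stft`, the frame size of 2048, the hop of 512, and the synthetic 440 Hz test tone are all illustrative assumptions. It is not a drop-in replacement for `librosa.stft`, which also pads the signal so that frames are centered.

```python
import numpy as np

def naive_stft(y, n_fft=2048, hop_length=512):
    """Illustrative STFT: slice into frames, window each frame, take its FFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack([
        y[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # rfft keeps the n_fft // 2 + 1 non-negative frequency bins.
    # Transpose so the result has shape (frequency bins, frames).
    return np.fft.rfft(frames, axis=1).T

# Synthetic test signal (an assumption for the demo): 1 second of a 440 Hz sine
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)

D_naive = naive_stft(tone)
magnitude = np.abs(D_naive)

# The strongest frequency bin, converted back to Hz
peak_bin = magnitude.mean(axis=1).argmax()
peak_hz = peak_bin * sr / 2048
print(D_naive.shape, peak_hz)
```

The peak lands on the bin nearest 440 Hz rather than exactly on it, because the frequency resolution is limited to `sr / n_fft` (about 10.8 Hz here).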
import librosa
import numpy as np
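# Load audio ('audio.wav' is a placeholder path)
y, sr = librosa.load('audio.wav')
# Compute STFT (complex array of shape (frequency bins, frames))
D = librosa.stft(y)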
# Convert amplitude to Decibels (dB)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)
2. The Mel Scale
Humans easily distinguish 100 Hz from 200 Hz, but struggle to tell 10,000 Hz from 10,100 Hz, even though both pairs are 100 Hz apart. Our perception of pitch is non-linear: roughly linear at low frequencies and closer to logarithmic at higher ones. The Mel scale is a perceptual pitch scale that approximates the human ear's response. A Mel spectrogram warps the frequency axis so that equal distances on the plot correspond to roughly equal distances in perceived pitch, which makes the data better suited to tasks like speech recognition.
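The 100 Hz comparison can be checked numerically. The sketch below uses one common Hz-to-Mel formula (the HTK-style variant, `2595 * log10(1 + f / 700)`); the helper name `hz_to_mel` is illustrative. Note that librosa uses a different (Slaney-style) formula by default, so its exact values will differ.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style Hz-to-Mel conversion (one of several common variants)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Same 100 Hz gap, measured at low and at high frequencies
low_gap = hz_to_mel(200) - hz_to_mel(100)
high_gap = hz_to_mel(10100) - hz_to_mel(10000)

print(f"100 -> 200 Hz:      {low_gap:.1f} mel")
print(f"10000 -> 10100 Hz:  {high_gap:.1f} mel")
```

The low-frequency gap comes out many times larger in Mel units than the high-frequency one, which matches the intuition that the first pair is much easier to tell apart.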
# Calculate a Mel-Spectrogram directly
mel_spec = librosa.feature.melspectrogram(
    y=y, sr=sr, n_mels=128
)
# Convert to decibels
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
3. Spectrograms in Deep Learning
A key insight in audio deep learning is that spectrograms can be treated as images. Instead of building 1D models that operate on raw waveforms, we can apply 2D Convolutional Neural Networks (CNNs), the same family of models used for face recognition, to spectrograms. The convolutional filters learn to pick out 'textures' and 'edges' in the sound, such as the characteristic frequency signature of a human voice or a car engine.
import torch
# Add a channel dimension for a PyTorch CNN
# Shape goes from (n_mels, frames), e.g. (128, 862), to (1, 128, 862)
# (Channels, Height, Width); the frame count depends on the clip length
cnn_input = torch.tensor(mel_spec_db).unsqueeze(0)
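As a rough illustration of what happens next, here is a minimal sketch of a small 2D CNN that accepts spectrogram-shaped input. The layer sizes, the class count `n_classes = 10`, and the random tensor standing in for a batch of Mel spectrograms are all assumptions made for the example, not a recommended architecture. Training code usually adds a batch dimension as well, giving the shape (Batch, Channels, Height, Width).

```python
import torch
from torch import nn

n_classes = 10  # hypothetical number of sound categories

# A deliberately small CNN: two conv layers, then global pooling and a classifier
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 1 input channel (the spectrogram)
    nn.ReLU(),
    nn.MaxPool2d(2),                              # halve frequency and time resolution
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                      # average over all remaining positions
    nn.Flatten(),
    nn.Linear(16, n_classes),
)

# Random stand-in for a batch of 4 Mel spectrograms of shape (1, 128, 862)
batch = torch.randn(4, 1, 128, 862)

with torch.no_grad():
    logits = model(batch)

print(logits.shape)  # one score per class for each item in the batch
```

The global average pooling layer is a design choice that lets the same model accept clips of different lengths, since the number of time frames no longer affects the size of the classifier input.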