A waveform is a silhouette; a spectrogram is a photograph. By decomposing sound into its component frequencies, we reveal the hidden patterns that AI can learn.
1. The Fourier Transform
The Fourier Transform is a mathematical tool that decomposes a signal into its constituent frequencies. Applied to a whole recording, it tells us which frequencies are present but not when they occur, so in audio we use the STFT (Short-Time Fourier Transform), which breaks the signal into short, overlapping frames (windows) and computes the frequency content of each frame. The result is a 3D view of the sound: Time on the X-axis, Frequency on the Y-axis, and Magnitude (Color Intensity) as the third dimension. It's essentially a 'Musical Score' for the computer.
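To make the framing step concrete, here is a minimal NumPy sketch of the idea: slice the signal into overlapping frames, taper each frame with a window, and take the FFT of each one. The function name `stft_magnitude`, the 440 Hz test tone, and the parameter values are assumptions chosen for illustration. The sketch also skips the padding that `librosa.stft` applies by default, so its numbers will not match librosa's output exactly.

```python
import numpy as np

def stft_magnitude(signal, n_fft=2048, hop_length=512):
    """Return an STFT magnitude matrix of shape (1 + n_fft // 2, n_frames)."""
    window = np.hanning(n_fft)  # taper each frame to reduce spectral leakage
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    # Each frame starts hop_length samples after the previous one, so frames overlap
    frames = np.stack([
        signal[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real-valued signal
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Hypothetical test signal: one second of a 440 Hz sine sampled at 22,050 Hz
sr = 22050
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440.0 * t)

S = stft_magnitude(signal)
freqs = np.fft.rfftfreq(2048, d=1.0 / sr)  # frequency in Hz for each row of S
peak_hz = freqs[S.mean(axis=1).argmax()]   # row with the most energy on average
print(S.shape, peak_hz)
```

Rows of `S` are frequency bins and columns are frames, which is the same layout as the Y and X axes described above. For the pure tone, the strongest row should sit within one bin (about 11 Hz at these settings) of 440 Hz.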
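import librosa
import numpy as np
import matplotlib.pyplot as plt
# Load the audio: y is the waveform, sr its sample rate ("audio.wav" is a placeholder path)
y, sr = librosa.load("audio.wav")
# Compute STFT: a complex matrix of shape (frequency bins, frames)
D = librosa.stft(y)
# Convert magnitudes to decibels for display
S_db = librosa.amplitude_to_db(np.abs(D))

2. Psychoacoustics & The Mel Scale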
Human hearing is not linear in frequency. We can easily hear the difference between 500 Hz and 1000 Hz, a full octave, but 10,000 Hz and 10,500 Hz, which are the same 500 Hz apart, sound almost identical to us. The Mel Scale is a non-linear transformation of the frequency axis that mimics this behavior: it expands the 'important' low-frequency ranges and compresses the high-frequency ones. By training models on Mel-scaled data, we encourage them to focus on the same features that humans find important for speech and music.
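The compression described above can be checked with a few lines of arithmetic. This sketch uses one common Hz-to-mel formula (the HTK-style variant); it is an assumption that this variant suits your pipeline, since librosa's default conversion uses a different (Slaney-style) formula. The helper name `hz_to_mel` here is defined locally and is not librosa's function.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels using the HTK-style formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

# Two gaps that are both 500 Hz wide on the linear frequency axis
low_gap = hz_to_mel(1000) - hz_to_mel(500)
high_gap = hz_to_mel(10500) - hz_to_mel(10000)

print(f"500 Hz -> 1000 Hz spans {low_gap:.0f} mels")
print(f"10000 Hz -> 10500 Hz spans {high_gap:.0f} mels")
```

Under this formula the low-frequency gap comes out several times wider in mels than the high-frequency gap, even though both are 500 Hz wide in linear terms.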
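# Calculate a Mel-Spectrogram directly from the waveform y (sample rate sr)
mel_spec = librosa.feature.melspectrogram(
    y=y, sr=sr, n_mels=128
)
# melspectrogram returns power by default, so convert to decibels with power_to_db
mel_spec_db = librosa.power_to_db(mel_spec)

3. Audio Meets Computer Vision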
One of the most influential ideas in modern Audio AI was the realization that a Mel Spectrogram can be treated as an image. This allowed researchers to apply state-of-the-art Convolutional Neural Networks (CNNs) and Transformers directly to audio data. Instead of inventing new architectures for sound, we can use 'ResNet' or 'Vision Transformers' to classify bird calls, detect glass breaking, or recognize spoken commands by 'looking' at the texture of the spectrogram.
# Add a channel dimension for a PyTorch CNN
import torch
# Shape goes from (n_mels, frames), e.g. (128, 862) for about 20 s of audio
# at librosa's default settings, to (1, 128, 862): (Channels, Height, Width)
cnn_input = torch.tensor(mel_spec_db).unsqueeze(0)
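As a rough illustration of treating the spectrogram as an image, the sketch below passes a batch of spectrogram-shaped tensors through a very small CNN. The class name `TinySpectrogramCNN`, the layer sizes, and the 10-class output are hypothetical choices for this example, not a recommended architecture, and the input is random data standing in for real Mel Spectrograms.

```python
import torch
import torch.nn as nn

class TinySpectrogramCNN(nn.Module):
    """Hypothetical toy classifier for single-channel spectrogram 'images'."""

    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 1 input channel, like a grayscale image
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse any frequency/time size to 1x1
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x):
        x = self.features(x)                   # (batch, 32, 1, 1)
        return self.classifier(x.flatten(1))   # (batch, n_classes)

# Random stand-in for 4 spectrograms: (batch, channels, mel bands, frames)
batch = torch.randn(4, 1, 128, 862)
model = TinySpectrogramCNN(n_classes=10)
logits = model(batch)
print(logits.shape)
```

Training code usually adds the leading batch dimension for you (for example through a `DataLoader`), so each individual example keeps the (Channels, Height, Width) shape shown above. The adaptive pooling layer is a design choice that lets the same model accept clips of different lengths.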