Why do we always convert Spectrograms to Decibels (dB)?

Because acoustic energy spans an enormous dynamic range. A loud sound might have a million times more energy than a quiet one. If you plot that on a linear scale, you'll only see the loudest sounds and everything else will look black. Decibels use a logarithmic scale, which lets us see quiet details alongside loud ones.
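
To make the "million times" figure concrete, here is a minimal sketch of the standard power-to-decibel formula, 10 * log10(ratio). The helper name `power_ratio_to_db` is made up for this illustration and is not part of librosa.

```python
import math

def power_ratio_to_db(ratio: float) -> float:
    """Convert a power (energy) ratio to decibels: 10 * log10(ratio)."""
    return 10.0 * math.log10(ratio)

# A sound with a million times the energy of another one
loud_vs_quiet_db = power_ratio_to_db(1_000_000)
print(f"{loud_vs_quiet_db:.1f} dB")
```

A ratio of a million collapses to a difference of about 60 dB, which is why both the loud and the quiet sound fit on the same color scale.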

Is the Mel Spectrogram the only input type used for Audio Deep Learning?

No, but it is the most popular. Some modern models (like Wav2Vec 2.0) are designed to ingest raw audio waveforms directly, learning their own internal frequency representations. However, Mel Spectrograms remain the industry standard for classification and general audio tasks.

What is an STFT 'window'?

Because frequencies change constantly, you can't just take the Fourier Transform of an entire 3-minute song at once. You have to chop the song into tiny 'windows' (e.g., 25 milliseconds long), calculate the frequencies for just that window, and then slide forward to the next one.
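
The chop-and-slide idea can be sketched by hand with NumPy. This is only an illustration under stated assumptions: the 16 kHz sample rate, the 10 ms step between windows, the Hann taper, and the 440 Hz test tone are all choices made for this example, not values from the lesson.

```python
import numpy as np

sr = 16_000                        # assumed sample rate in Hz
win_length = sr * 25 // 1000       # 25 ms window -> 400 samples
hop_length = sr * 10 // 1000       # assumed 10 ms slide -> 160 samples

# One second of a 440 Hz sine wave stands in for real audio
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440.0 * t)

# Taper each window so its edges fade to zero (reduces spectral leakage)
window = np.hanning(win_length)

# Number of full windows that fit when sliding by hop_length
n_frames = 1 + (len(y) - win_length) // hop_length

# Chop the signal into windows, one row per window
frames = np.stack([
    y[i * hop_length : i * hop_length + win_length] * window
    for i in range(n_frames)
])

# Fourier Transform of each window: shape (n_frames, win_length // 2 + 1)
spectra = np.abs(np.fft.rfft(frames, axis=1))

# The strongest frequency bin, averaged over all windows
peak_bin = int(spectra.mean(axis=0).argmax())
peak_hz = peak_bin * sr / win_length
print(spectra.shape, peak_hz)
```

Each row of `spectra` describes the frequencies present in one 25 ms slice, so stacking the rows over time is already a simple spectrogram.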

Spectrograms in AI

Master the transformation of audio into the frequency domain. Learn the mechanics of the STFT, understand why the Mel Scale better matches how humans hear pitch, and discover how to use Mel Spectrograms as input for powerful 2D Convolutional Neural Networks.

Sound is a mix of frequencies. A spectrogram allows us to see this mix as a beautiful 2D map, revealing the hidden structure of audio.

1. Short-Time Fourier Transform

The Fourier Transform is a mathematical tool that converts a signal from the time domain to the frequency domain. Because audio changes over time, we use the Short-Time Fourier Transform (STFT): we break the audio into short (usually overlapping) frames and apply a Fourier Transform to each one. The result is a 2D matrix indexed by frequency and time, where each cell holds the magnitude and phase of one frequency in one frame. When we plot the magnitudes, we get a Spectrogram, a visual 'X-ray' of sound.

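import librosa
import numpy as np

# Load audio as a mono waveform y and its sample rate sr
# ("audio.wav" is a placeholder path)
y, sr = librosa.load("audio.wav")

# Compute the STFT: a complex matrix of shape (freq bins, frames)
D = librosa.stft(y)

# Convert the magnitude (amplitude) to Decibels (dB), relative to the peak
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)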

STFT Output Matrix

Shape: (1025 freq bins, 862 frames)

Values: Decibel scale (-80 to 0 dB)

Status: Complex array mapped
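
The shape above follows from simple arithmetic on librosa's defaults (n_fft=2048, hop_length=512, center=True, and a 22,050 Hz sample rate from librosa.load). The sketch below reproduces it; the 20-second clip length is an assumption picked because it matches 862 frames, since the lesson does not say which file was used.

```python
n_fft = 2048         # librosa.stft default FFT size
hop_length = 512     # librosa.stft default hop (n_fft // 4)
sr = 22_050          # librosa.load default sample rate in Hz
duration_s = 20      # assumed clip length (not stated in the lesson)

n_samples = sr * duration_s

# A real-valued signal yields 1 + n_fft // 2 unique frequency bins
n_freq_bins = 1 + n_fft // 2

# With center=True the signal is padded, so frames = 1 + n_samples // hop_length
n_frames = 1 + n_samples // hop_length

print((n_freq_bins, n_frames))
```

A longer clip or a smaller hop_length adds frames (columns), while a larger n_fft adds frequency bins (rows).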

2. The Mel Scale

Humans are very good at distinguishing between 100 Hz and 200 Hz, but we struggle to tell the difference between 10,000 Hz and 10,100 Hz. Our perception of pitch is nonlinear. The Mel Scale is a perceptual scale of pitches that approximates the human ear's response. A 'Mel Spectrogram' warps the frequency axis so that equal distances on the plot represent roughly equal distances in human pitch perception, making the data much more relevant for tasks like speech recognition.
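
The 100 Hz vs. 10,000 Hz comparison can be checked with one common form of the mel formula, m = 2595 * log10(1 + f / 700), often called the HTK formula. This is a sketch for illustration only: the helper `hz_to_mel` below is written by hand, and librosa's own default conversion uses a slightly different (Slaney-style) variant, so its numbers will not match exactly.

```python
import math

def hz_to_mel(freq_hz: float) -> float:
    """HTK-style mel formula: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

# The same 100 Hz step, taken low and high in the spectrum
low_step = hz_to_mel(200) - hz_to_mel(100)
high_step = hz_to_mel(10_100) - hz_to_mel(10_000)

print(f"100 -> 200 Hz:       {low_step:.1f} mels")
print(f"10,000 -> 10,100 Hz: {high_step:.1f} mels")
```

The low-frequency step covers more than ten times as many mels as the high-frequency one, which is the warping a Mel Spectrogram applies to its frequency axis.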

# Calculate a Mel-Spectrogram directly
mel_spec = librosa.feature.melspectrogram(
  y=y, sr=sr, n_mels=128
)

# Convert to decibels
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)

Filter Bank Applied

Linear Bins: 1025

Mel Bands: 128

Scale: Perceptually Warped

3. Spectrograms in Deep Learning

One of the most influential ideas in Audio AI is that spectrograms can be treated as images. Instead of building complex 1D models for raw waveforms, we can use 2D Convolutional Neural Networks (CNNs), the same kind used for face recognition, to analyze spectrograms. This allows the model to find 'textures' and 'edges' in the sound, such as the unique frequency signature of a human voice or a car engine.

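# Add a channel dimension for a PyTorch CNN
import torch

# Shape goes from (128, 862) to (1, 128, 862)
# (Channels, Height, Width)
cnn_input = torch.tensor(mel_spec_db, dtype=torch.float32).unsqueeze(0)

# Most PyTorch layers also expect a leading batch dimension:
# cnn_input.unsqueeze(0) has shape (1, 1, 128, 862)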
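
To show what finding 'edges' in a spectrogram means, here is a minimal NumPy sketch of the sliding-filter operation at the heart of a CNN layer. It is not a full network: the toy spectrogram, the helper `conv2d_valid`, and the hand-picked `onset_kernel` are all assumptions made for this illustration, whereas a real CNN learns its kernels from data.

```python
import numpy as np

def conv2d_valid(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide `kernel` over `image` without padding (cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy spectrogram: 16 mel bands x 32 frames of silence at -80 dB,
# with a sound that switches on at frame 10 in bands 4 to 7
spec = np.full((16, 32), -80.0)
spec[4:8, 10:] = 0.0

# Kernel that responds to a jump in energy along the time axis
onset_kernel = np.array([[-1.0, 0.0, 1.0]] * 3)

response = conv2d_valid(spec, onset_kernel)
row, col = np.unravel_index(response.argmax(), response.shape)
print(response.shape, int(row), int(col))
```

The response is zero everywhere except around the moment the sound begins, so the filter acts as an onset detector. A trained 2D CNN stacks many such filters to pick up more complex time-frequency patterns.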

Frequently Asked Questions

Pascual Vila

Frontend Instructor // Code Syllabus

Lesson Glossary

[01] Spectrogram

A visual representation of the spectrum of frequencies of a signal as it varies with time.

[02] STFT

Short-Time Fourier Transform: A Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time.
