
Spectrograms in AI

Master the transformation of audio into the frequency domain. Learn the mechanics of the STFT, understand why the Mel Scale matches human hearing better than a linear frequency axis, and discover how to use Mel Spectrograms as input for 2D Convolutional Neural Networks.

Sound is a mix of frequencies. A spectrogram allows us to see this mix as a beautiful 2D map, revealing the hidden structure of audio.

1. Short-Time Fourier Transform

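The Fourier Transform is a mathematical tool that converts a signal from the time domain to the frequency domain. Because audio changes over time, a single transform over the whole clip would tell us which frequencies occur but not when. The Short-Time Fourier Transform (STFT) solves this: we break the audio into short, usually overlapping frames and apply a Fourier Transform to each one. The result is a 2D matrix indexed by frequency and time whose values carry the magnitude (and phase) at each point, so three quantities are in play: Time, Frequency, and Magnitude. When we plot the magnitudes, we get a Spectrogram, a visual 'X-ray' of sound.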

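import librosa
import numpy as np

# Load audio as a mono waveform y and its sample rate sr
# ("audio.wav" is a placeholder path)
y, sr = librosa.load("audio.wav")

# Compute the STFT (complex matrix: frequency bins x frames)
D = librosa.stft(y)

# Convert magnitude to decibels (dB), relative to the loudest value
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)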
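STFT Output Matrix
Shape: (1025 freq bins, 862 frames); 1025 = 1 + n_fft/2 with librosa's default n_fft of 2048, and the frame count depends on the clip length
Values: Decibel scale (-80 to 0 dB)
Status: Complex STFT converted to real-valued magnitudes in dB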
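
The framing-and-transform procedure described in this section can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not librosa's implementation: the helper name `naive_stft`, the Hann window, and the frame and hop sizes are choices made for this example, and no padding or centering is applied (librosa centers frames by default, so its frame counts will differ).

```python
import numpy as np

def naive_stft(signal, n_fft=2048, hop_length=512):
    """Hypothetical helper: STFT by explicit framing, without padding."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop_length
    # One row per non-negative frequency bin, one column per frame
    spec = np.empty((n_fft // 2 + 1, n_frames), dtype=np.complex128)
    for t in range(n_frames):
        start = t * hop_length
        frame = signal[start:start + n_fft] * window  # taper the frame edges
        spec[:, t] = np.fft.rfft(frame)               # Fourier Transform of one frame
    return spec

# Synthetic test signal: one second of a 440 Hz sine at 22050 Hz
sr = 22050
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)

D = naive_stft(tone)
magnitude = np.abs(D)

# The strongest bin should sit next to 440 Hz (bin spacing is sr / n_fft)
peak_bin = int(magnitude.mean(axis=1).argmax())
peak_hz = peak_bin * sr / 2048
print(D.shape, peak_hz)
```

Each column of the result is the spectrum of one frame, which is why a steady tone shows up as a horizontal line in the plotted spectrogram.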

2. The Mel Scale

Humans are very good at distinguishing between 100 Hz and 200 Hz, but we struggle to tell the difference between 10,000 Hz and 10,100 Hz, even though both pairs are 100 Hz apart. Our hearing is nonlinear. The Mel Scale is a perceptual scale of pitches that approximates the human ear's response. A 'Mel Spectrogram' warps the frequency axis so that equal distances on the plot represent roughly equal distances in human pitch perception, making the data much more relevant for tasks like speech recognition.

# Calculate a Mel-Spectrogram directly
mel_spec = librosa.feature.melspectrogram(
  y=y, sr=sr, n_mels=128
)

# Convert to decibels
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
Filter Bank Applied
Linear Bins: 1025
Mel Bands: 128
Scale: Perceptually Warped
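
To make the 100 Hz vs. 10,000 Hz comparison concrete, here is a small sketch of a Hz-to-mel conversion. It assumes the HTK-style formula, m = 2595 * log10(1 + f / 700), which is one common variant; librosa's default conversion uses a different (Slaney-style) formula, so its numbers will not match these exactly. The function name `hz_to_mel` is defined locally for this example.

```python
import math

def hz_to_mel(freq_hz):
    """HTK-style mel conversion (an assumed variant, not librosa's default)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

# Both pairs are 100 Hz apart on a linear axis
low_gap = hz_to_mel(200.0) - hz_to_mel(100.0)
high_gap = hz_to_mel(10100.0) - hz_to_mel(10000.0)

print(f"100 Hz -> 200 Hz gap:       {low_gap:.1f} mel")
print(f"10000 Hz -> 10100 Hz gap:   {high_gap:.1f} mel")
```

The low-frequency pair ends up many times farther apart in mels than the high-frequency pair, which is the warping a Mel Spectrogram applies to its frequency axis.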

3. Spectrograms in Deep Learning

One of the most influential ideas in audio AI is that spectrograms can be treated as images. Instead of building complex 1D models for raw waveforms, we can use 2D Convolutional Neural Networks (CNNs), the same family of models used for face recognition, to analyze spectrograms. This allows the model to find 'textures' and 'edges' in the sound, such as the characteristic frequency signature of a human voice or a car engine.

# Add a channel dimension for a PyTorch CNN
import torch

# Shape goes from (128, 862) to (1, 128, 862)
# (Channels, Height, Width)
cnn_input = torch.tensor(mel_spec_db).unsqueeze(0)
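
To illustrate what finding 'edges' in a spectrogram means, the sketch below slides one hand-written 2D filter over a tiny synthetic spectrogram. It is a NumPy-only illustration under stated assumptions: the helper `conv2d_valid`, the toy 16 x 20 input, and the ridge-detecting kernel are all invented for this example, and a real CNN learns its filter values from data instead of having them picked by hand.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Hypothetical helper: 2D cross-correlation with no padding,
    the sliding-window operation a convolutional layer computes."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy "spectrogram": 16 frequency bands x 20 frames, silent except
# for a sustained tone in band 8
spec = np.zeros((16, 20))
spec[8, :] = 1.0

# Hand-picked kernel that responds to horizontal ridges (steady tones)
kernel = np.array([[-1.0, -1.0, -1.0],
                   [ 2.0,  2.0,  2.0],
                   [-1.0, -1.0, -1.0]])

response = conv2d_valid(spec, kernel)

# Output row i is centered on input row i + 1
strongest_row = int(response.sum(axis=1).argmax())
print(response.shape, strongest_row + 1)
```

The response is largest where the kernel is centered on the tone's band, which is the kind of local pattern detection that makes 2D CNNs a natural fit for spectrogram input.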

Pascual Vila

Frontend Instructor // Code Syllabus

Lesson Glossary

[01] Spectrogram

A visual representation of the spectrum of frequencies of a signal as it varies with time.


[02] STFT

Short-Time Fourier Transform: A Fourier-related transform used to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time.


[03] Mel Scale

A perceptual scale of pitches judged by listeners to be equal in distance from one another.


[04] Magnitude

The strength or intensity of a specific frequency at a specific point in time.


[05] Decibel (dB) Conversion

Transforming linear amplitude to a logarithmic scale, which better matches how humans perceive volume changes.
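
As a rough sketch of the mapping, amplitude is converted with 20 * log10(amplitude / reference), while power (squared amplitude) uses a factor of 10 instead of 20. The helper name `amp_to_db` below is made up for this example and is not a librosa function.

```python
import math

def amp_to_db(amplitude, ref=1.0):
    """Hypothetical helper: amplitude ratio expressed in decibels."""
    return 20.0 * math.log10(amplitude / ref)

half = amp_to_db(0.5)    # halving the amplitude
tenth = amp_to_db(0.1)   # one tenth of the amplitude
print(f"{half:.2f} dB, {tenth:.2f} dB")
```

Equal ratios in amplitude become equal steps in dB, which is why quiet detail stays visible in a dB-scaled spectrogram.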

