Teaching Machines to Hear
Speech recognition transcribes words, but Environmental Sound Recognition (ESR) interprets the acoustic context of the real world, allowing autonomous systems to react to sirens, break-ins, or wildlife.
1. The Challenge of Sound Data
Audio is a one-dimensional array of amplitudes over time (the waveform). While we can train models directly on raw waveforms, doing so is slow and data-hungry. Instead, we use tools like Librosa to extract meaningful features that represent the frequency content.
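A minimal sketch of this first step, assuming a local clip named siren.wav (a placeholder path, not a file from any particular dataset) and Librosa's default 22,050 Hz sample rate:

```python
import librosa

# Load a clip as a mono waveform, resampling to a common rate
# (22,050 Hz is Librosa's default).
waveform, sample_rate = librosa.load("siren.wav", sr=22050, mono=True)

print(waveform.shape)  # e.g. (88200,) for a 4-second clip: a 1-D array of amplitudes
print(sample_rate)     # 22050
```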
2. The Mel-Spectrogram
To classify sounds effectively, we compute a Spectrogram using the Short-Time Fourier Transform (STFT). We then map its frequency axis onto the Mel Scale, which compresses frequencies to mimic the non-linear way human ears perceive pitch.
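A minimal sketch of that pipeline with Librosa, reusing the waveform loaded above; the parameter values (n_fft=2048, hop_length=512, n_mels=128) are common defaults, not requirements:

```python
import numpy as np
import librosa

# STFT -> power spectrogram -> Mel filterbank, all in one call.
mel_spec = librosa.feature.melspectrogram(
    y=waveform,
    sr=sample_rate,
    n_fft=2048,      # STFT window size
    hop_length=512,  # step between adjacent frames
    n_mels=128,      # number of Mel frequency bands
)

# Convert power to decibels; models usually train better on a log-compressed scale.
mel_spec_db = librosa.power_to_db(mel_spec, ref=np.max)
print(mel_spec_db.shape)  # (128, n_frames): a 2-D "image" of the sound
```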
3. Convolutional Neural Networks (CNNs)
Once we have a Mel-Spectrogram, we treat it as a single-channel image and pass this 2D representation into a 2D CNN architecture (a minimal model sketch follows the list below). The convolutional filters learn to recognize visual patterns in the audio data:
- Vertical Lines: Impulsive sounds like gunshots or dog barks.
- Horizontal Lines: Tonal sounds like sirens or engine idling.
- Broadband Noise: Textural sounds like rain or AC units.
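One way to sketch such a network, using PyTorch here purely as an illustration; the layer sizes are arbitrary rather than tuned for any benchmark:

```python
import torch
import torch.nn as nn

class AudioCNN(nn.Module):
    """Small 2D CNN over (1, n_mels, n_frames) Mel-Spectrogram inputs."""

    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse time and frequency to one vector per clip
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)  # (batch, 64, 1, 1)
        return self.classifier(x.flatten(1))

# Example: a batch of 8 spectrograms, each with 128 Mel bands and 173 time frames.
model = AudioCNN(n_classes=10)
logits = model(torch.randn(8, 1, 128, 173))
print(logits.shape)  # torch.Size([8, 10])
```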
❓ Audio ML FAQs
Why use the Mel Scale instead of standard Hertz?
Humans are better at differentiating low-frequency sounds (e.g., 100 Hz vs 200 Hz) than high-frequency sounds (e.g., 10,000 Hz vs 10,100 Hz), even though both pairs are 100 Hz apart. The Mel scale compresses the frequency axis to represent sound the way our ears hear it, allowing the neural network to prioritize human-relevant acoustic patterns.
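You can see this compression directly; the sketch below uses Librosa's converter with htk=True so it follows the textbook formula m = 2595 · log10(1 + f/700):

```python
import librosa

# Equal 100 Hz gaps shrink dramatically on the Mel axis as frequency rises.
for lo, hi in [(100, 200), (1000, 1100), (10000, 10100)]:
    mel_lo, mel_hi = librosa.hz_to_mel([lo, hi], htk=True)
    print(f"{lo:>6}-{hi:>6} Hz -> {mel_hi - mel_lo:5.1f} mel")
```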
What is the UrbanSound8K dataset?
UrbanSound8K is a widely used benchmark dataset for Environmental Sound Recognition. It contains 8732 labeled sound excerpts (each ≤ 4 seconds) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music.
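A hedged sketch of how the dataset is typically used; it assumes the archive has been extracted locally and relies on the documented metadata columns (slice_file_name, fold, classID, class):

```python
import pandas as pd

# UrbanSound8K ships with a metadata CSV and 10 pre-defined cross-validation folds.
# The path below is a local assumption; adjust it to wherever the archive lives.
meta = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")
print(meta["class"].value_counts())  # clip counts per class

# The dataset authors recommend 10-fold cross-validation on the provided folds
# (rather than a random split), so clips sliced from the same source recording
# never leak between the training and test sets.
test_fold = 10
train_meta = meta[meta["fold"] != test_fold]
test_meta = meta[meta["fold"] == test_fold]
print(len(train_meta), len(test_meta))
```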