
Music Genre Classification

Teach machines to hear. Extract MFCCs with Python's Librosa and train models to distinguish Jazz from Heavy Metal.


Music Genre Classification assigns labels like 'Jazz' or 'Rock' to audio. ML models can't 'listen' to music directly; they need numerical features extracted from the signal.



Feature Extraction

Transforming raw waveform data into usable ML features like MFCCs and Zero-Crossing Rate.




Music Genre Classification: From Sound to Labels

Author

Audio ML Staff

AI Instructors // Code Syllabus

Classifying music genres is a fundamental problem in Music Information Retrieval (MIR). By extracting the right numerical features from audio, we can teach machines to differentiate between a heavy metal guitar and a soft jazz piano.

Phase 1: Data Collection & Loading

Before training, we need audio. Datasets like GTZAN are standard. In Python, we use librosa to load audio files. The loading process converts raw sound waves into a 1D array of amplitudes, sampled at a specific rate (e.g., 22050 Hz).

Phase 2: Feature Extraction (The Magic)

Feeding raw amplitudes into a standard model is inefficient. Instead, we extract features that represent the "texture" of the sound:

  • MFCCs (Mel-Frequency Cepstral Coefficients): The most important feature set. They model the response of the human auditory system and effectively capture the 'timbre' of the track.
  • Zero-Crossing Rate: How often the signal crosses the zero-axis. High rates usually indicate percussive or noisy sounds (like heavy metal drums).
  • Spectral Centroid: Indicates where the "center of mass" of the spectrum is located. Brighter sounds have higher centroids.

Phase 3: Model Training

With our features extracted (and averaged out over time to create a single vector per song), we scale the data using StandardScaler. We can then apply traditional models like Support Vector Machines (SVM) or Random Forests.
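A sketch of that scale-then-classify step with scikit-learn. The feature matrix here is random placeholder data (real rows would be the averaged per-song vectors from Phase 2), and the two labels are arbitrary stand-ins for genres:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data: 100 "songs" x 15 features, two made-up genre labels.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 15))
X[:, -1] = X[:, -1] * 2000 + 3000  # e.g. spectral centroid on a much larger scale
labels = rng.integers(0, 2, size=100)  # 0 = jazz, 1 = metal (illustrative)

# Pipeline: scale every feature to mean 0 / variance 1, then fit an SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X, labels)
predictions = clf.predict(X)
```

Wrapping the scaler and model in a `Pipeline` ensures the same scaling fitted on training data is applied at prediction time.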

Alternatively, we can skip time-averaging and feed 2D Spectrograms directly into a Convolutional Neural Network (CNN) for state-of-the-art results.

Frequently Asked Questions

What are MFCCs and why are they used in audio ML?

MFCCs (Mel-Frequency Cepstral Coefficients) are a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. They are used because they closely approximate the human auditory system's response, making them excellent for capturing the 'timbre' of music and speech.

Why do we need to scale audio features?

Features like Spectral Centroid might have values in the thousands, while MFCCs might range from -200 to +200. Models that compute distance (like SVMs or K-Nearest Neighbors) will be unfairly dominated by features with larger scales. Using `StandardScaler` ensures all features have a mean of 0 and a variance of 1.
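A tiny demonstration of that effect, with two toy features on deliberately mismatched scales (the numbers are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Column 0: centroid-like values in the thousands; column 1: MFCC-like values.
X = np.array([[3000.0, -150.0],
              [1000.0,   50.0],
              [2000.0,  100.0]])

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # ~[0, 0]
print(X_scaled.std(axis=0))   # [1, 1]
```

After scaling, both columns contribute on equal footing to any distance computation.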

Can I use CNNs for Music Genre Classification?

Yes! Instead of averaging 1D features over time, you can generate a visual representation of the audio called a Mel-Spectrogram (which maps time vs frequency vs amplitude). You then treat this spectrogram as an image and train a Convolutional Neural Network (CNN) on it, which often yields superior accuracy compared to traditional 1D feature models.

Audio ML Glossary

librosa
A Python package for music and audio analysis. It provides the building blocks necessary to create MIR systems.
MFCC
Mel-Frequency Cepstral Coefficients. A small set of features (usually 10-20) which concisely describe the overall shape of a spectral envelope.
Zero-Crossing Rate
The rate at which the signal changes from positive to zero to negative or from negative to zero to positive.
Spectrogram
A visual representation of the spectrum of frequencies of a signal as it varies with time.