Music Genre Classification: From Sound to Labels

Audio ML Staff
AI Instructors // Code Syllabus
Classifying music genres is a fundamental problem in Music Information Retrieval (MIR). By extracting the right numerical features from audio, we can teach machines to differentiate between a heavy metal guitar and a soft jazz piano.
Phase 1: Data Collection & Loading
Before training, we need audio. Datasets like GTZAN (1,000 thirty-second clips spread across 10 genres) are the standard starting point. In Python, we use librosa to load audio files. Loading converts the raw sound wave into a 1D array of amplitudes, sampled at a specific rate (e.g., 22050 Hz).
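A minimal loading sketch with librosa (the file path below is only a placeholder for wherever your copy of the dataset lives):

```python
import librosa

# librosa resamples to mono at the requested rate by default.
signal, sr = librosa.load("gtzan/blues/blues.00000.wav", sr=22050)

print(signal.shape)  # 1D array of amplitudes, e.g. (661500,) for a 30 s clip
print(sr)            # 22050
```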
Phase 2: Feature Extraction (The Magic)
Feeding raw amplitudes into a standard model is inefficient. Instead, we extract compact features that describe the "texture" of the sound (a librosa sketch follows this list):
- MFCCs (Mel-Frequency Cepstral Coefficients): Usually the most informative feature set. They model the frequency response of the human auditory system and capture the 'timbre' of a track.
- Zero-Crossing Rate: How often the signal crosses the zero-axis. High rates usually indicate percussive or noisy sounds (like heavy metal drums).
- Spectral Centroid: Indicates where the "center of mass" of the spectrum is located. Brighter sounds have higher centroids.
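Here is a rough extraction sketch, assuming `signal` and `sr` come from the loading step above (the choice of 13 MFCCs and the simple time-averaging are illustrative, not mandatory):

```python
import numpy as np
import librosa

def extract_features(signal, sr):
    """Summarize one track as a single fixed-length vector by averaging each feature over time."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)         # shape (13, n_frames)
    zcr = librosa.feature.zero_crossing_rate(signal)                 # shape (1, n_frames)
    centroid = librosa.feature.spectral_centroid(y=signal, sr=sr)    # shape (1, n_frames)

    # Mean over the time axis -> 13 + 1 + 1 = 15 values per track
    return np.hstack([mfcc.mean(axis=1), zcr.mean(axis=1), centroid.mean(axis=1)])

features = extract_features(signal, sr)
print(features.shape)  # (15,)
```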
Phase 3: Model Training
With our features extracted (and averaged over time to create a single vector per track), we scale the data using `StandardScaler`. We can then apply traditional models like Support Vector Machines (SVMs) or Random Forests.
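A minimal training sketch with scikit-learn, assuming `X` is the matrix of averaged feature vectors and `y` holds the genre labels (the SVM hyperparameters are placeholder values):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# X: (n_tracks, n_features) averaged features, y: genre labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit scaling statistics on training data only
X_test = scaler.transform(X_test)         # reuse the same statistics on the test set

clf = SVC(kernel="rbf", C=10)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```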
Alternatively, we can skip time-averaging and feed 2D spectrograms directly into a Convolutional Neural Network (CNN), which typically outperforms the averaged-feature approach.
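Generating that 2D input is straightforward with librosa; a sketch assuming the `signal` and `sr` from Phase 1 (128 mel bands is a common but arbitrary choice):

```python
import numpy as np
import librosa

# A mel-spectrogram: mel-frequency vs. time, with power as intensity.
mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)   # convert to a log (dB) scale

print(mel_db.shape)  # (128, n_frames) -- a 2D "image" ready for a CNN
```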
❓ Frequently Asked Questions
What are MFCCs and why are they used in audio ML?
MFCCs (Mel-Frequency Cepstral Coefficients) are a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log power spectrum on a nonlinear mel scale of frequency. They are used because they closely approximate the human auditory system's response, making them excellent for capturing the 'timbre' of music and speech.
Why do we need to scale audio features?
Features like Spectral Centroid might have values in the thousands, while MFCCs might range from -200 to +200. Models that compute distance (like SVMs or K-Nearest Neighbors) will be unfairly dominated by features with larger scales. Using `StandardScaler` ensures all features have a mean of 0 and a variance of 1.
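A toy numeric illustration of what standardization does per feature column (the centroid values below are made up for the example):

```python
import numpy as np

# StandardScaler computes z = (x - mean) / std for each feature column.
x = np.array([1500.0, 2200.0, 3100.0])   # e.g. spectral centroid values in Hz
z = (x - x.mean()) / x.std()
print(z.round(2))  # approx [-1.17, -0.1, 1.27] -- mean 0, unit variance
```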
Can I use CNNs for Music Genre Classification?
Yes! Instead of averaging 1D features over time, you can generate a visual representation of the audio called a Mel-Spectrogram, which plots mel-frequency against time with intensity representing amplitude. You then treat this spectrogram as an image and train a Convolutional Neural Network (CNN) on it, which often yields higher accuracy than traditional 1D feature models.
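A minimal illustrative Keras sketch of such a CNN, assuming log-mel spectrograms shaped like the one produced in Phase 3 (input shape and all layer sizes are placeholder choices, not tuned values):

```python
import tensorflow as tf

# Treat each (mel_bins, time_frames, 1) spectrogram as a grayscale image.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 1292, 1)),            # 128 mel bins x ~30 s of frames
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),         # 10 GTZAN genres
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```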