
Audio Vocoders

From pixels of sound to audible waves. Master spectrogram inversion, the missing phase problem, and Neural Vocoder pipelines.




Concept: The Mel-Spectrogram

TTS models generate Mel-Spectrograms, not raw audio, because the Mel-Spectrogram is a far more compact, feature-rich representation of human speech.




Vocoders & Spectrogram Inversion

Author

Dr. Synth

Audio Machine Learning Lead // Code Syllabus

In modern Deep Learning for audio, we rarely generate raw waveforms directly. We generate Mel-Spectrograms. But a spectrogram is just an image of sound. How do we turn it back into audible waves? Enter the Vocoder.

The Phase Problem

When we convert an audio waveform to a spectrogram using the Short-Time Fourier Transform (STFT), we get complex numbers containing both Magnitude (how loud a frequency is) and Phase (the precise time alignment of the wave).
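This decomposition is easy to see in code. Below is a minimal NumPy sketch of a naive STFT (the frame size, hop, and test tone are illustrative choices, not values from the text): the FFT of each windowed frame is a complex number, whose absolute value is the Magnitude and whose angle is the Phase.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Naive STFT: slice the signal into windowed frames, take the real FFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i*hop : i*hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)   # complex spectrogram

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)          # 1 second of a 440 Hz tone

S = stft(x)                              # shape: (frames, n_fft // 2 + 1)
magnitude = np.abs(S)                    # what the acoustic model is trained on
phase = np.angle(S)                      # discarded during training
```

A real pipeline would use an optimized STFT (e.g. from librosa or torch) and map the magnitude onto the Mel scale, but the Magnitude/Phase split is exactly this `np.abs` / `np.angle` pair.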

Neural networks struggle to predict Phase: it wraps around every 2π and looks like random noise from frame to frame. So we throw it away and train acoustic models only on Magnitude (the Mel-Spectrogram). The catch: you cannot reconstruct perfect audio without Phase, so something downstream must recover it.

Spectrogram Inversion (Griffin-Lim)

Before Deep Learning, the standard way to recover audio was the Griffin-Lim algorithm. It takes the magnitude spectrogram and iteratively guesses the phase.

Starting from a random Phase, it applies the Inverse-STFT, re-analyzes the result with a forward STFT, keeps the new Phase estimate while resetting the Magnitude to the known truth, and repeats. While mathematically elegant, it often produces audio that sounds "metallic" or "hollow."
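That loop can be sketched end-to-end in NumPy. This is a toy implementation for intuition (frame size, hop, and iteration count are arbitrary choices), not a substitute for the tuned versions in librosa or torchaudio:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    n = 1 + (len(x) - n_fft) // hop
    return np.fft.rfft(np.stack([x[i*hop : i*hop + n_fft] * win for i in range(n)]), axis=1)

def istft(S, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    x = np.zeros(n_fft + hop * (len(S) - 1))
    norm = np.zeros_like(x)
    for i, frame in enumerate(np.fft.irfft(S, n=n_fft, axis=1)):
        x[i*hop : i*hop + n_fft] += frame * win      # overlap-add
        norm[i*hop : i*hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=32):
    """Estimate the unknown Phase for a known Magnitude spectrogram."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))   # random guess
    for _ in range(n_iter):
        x = istft(magnitude * phase)                 # synthesize a waveform
        S = stft(x)                                  # re-analyze it
        phase = S / np.maximum(np.abs(S), 1e-8)      # keep Phase, reset Magnitude
    return istft(magnitude * phase)

x = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)   # 1 s, 440 Hz tone
y = griffin_lim(np.abs(stft(x)))                       # audio from Magnitude alone
```

Each iteration nudges the phase estimate toward something consistent with the known magnitudes; the residual inconsistency is what you hear as the "metallic" artifact.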

Neural Vocoders

To achieve human-like fidelity, we use Neural Vocoders. These are separate neural networks trained specifically to translate Mel-Spectrograms into raw waveforms.

  • WaveNet: Autoregressive model by DeepMind. Extremely high quality, but notoriously slow to generate.
  • MelGAN & HiFi-GAN: Generative Adversarial Networks (GANs) that produce high-fidelity audio much faster than real-time.
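To make the GAN approach concrete, here is a heavily simplified, untrained generator in the HiFi-GAN style: a stack of transposed convolutions that upsamples 80-bin mel frames by 8·8·2·2 = 256, a typical hop size. All layer sizes here are illustrative; the real HiFi-GAN adds multi-receptive-field residual blocks and adversarial training.

```python
import torch
import torch.nn as nn

class TinyVocoder(nn.Module):
    """Toy HiFi-GAN-style generator: mel frames -> waveform (256 samples/frame)."""
    def __init__(self, n_mels=80):
        super().__init__()
        channels = 256
        self.pre = nn.Conv1d(n_mels, channels, kernel_size=7, padding=3)
        layers = []
        for factor in (8, 8, 2, 2):          # total upsampling: 8*8*2*2 = 256
            layers += [
                nn.LeakyReLU(0.1),
                nn.ConvTranspose1d(channels, channels // 2,
                                   kernel_size=factor * 2, stride=factor,
                                   padding=factor // 2),
            ]
            channels //= 2
        self.ups = nn.Sequential(*layers)
        self.post = nn.Conv1d(channels, 1, kernel_size=7, padding=3)

    def forward(self, mel):                  # mel: (batch, n_mels, frames)
        return torch.tanh(self.post(self.ups(self.pre(mel))))

mel = torch.randn(1, 80, 50)                 # 50 frames of fake acoustic features
wav = TinyVocoder()(mel)                     # (1, 1, 50 * 256) waveform samples
```

The key design point survives the simplification: the network is a deterministic, fully parallel map from spectrogram frames to audio samples, which is why GAN vocoders run much faster than real time while WaveNet must generate one sample at a time.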

Frequently Asked Questions

What is a Vocoder in Machine Learning?

In Machine Learning, a Vocoder is an algorithm or neural network that acts as the final step in a Text-To-Speech (TTS) pipeline. It takes the acoustic features (like a Mel-Spectrogram) generated by an Acoustic Model and synthesizes them into a listenable audio waveform.

Why don't Acoustic Models generate waveforms directly?

Raw audio waveforms have incredibly high dimensionality (e.g., 24,000 samples per second). It is computationally difficult for a single model to map text directly to that level of detail. Instead, the model maps text to a compressed, lower-dimensional representation (the Mel-Spectrogram), and the Vocoder handles the complex upsampling to audio.
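The arithmetic behind that split is worth spelling out. Assuming a 24 kHz sample rate with a hop of 256 samples and 80 mel bins (common defaults, not values mandated by any particular model):

```python
sr = 24000                    # raw waveform: samples per second
hop = 256                     # waveform samples per spectrogram frame
n_mels = 80                   # mel bins per frame

frames_per_sec = sr / hop     # 93.75 frames/s the acoustic model must predict
seq_reduction = sr / frames_per_sec                # 256x shorter output sequence
value_reduction = sr / (frames_per_sec * n_mels)   # 3.2x fewer numbers overall
```

So the acoustic model emits ~94 frames per second instead of 24,000 samples, and the Vocoder absorbs the 256× upsampling back to audio rate.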

Griffin-Lim vs Neural Vocoders: Which is better?

Neural Vocoders (like HiFi-GAN) are vastly superior in audio quality, producing clear, human-like speech. Griffin-Lim does not require training data or a GPU, making it lightweight and fast, but it produces distinct robotic artifacts and lower fidelity.

DSP Glossary

Spectrogram
A visual representation of the spectrum of frequencies of a signal as it varies with time. Deep learning uses the Mel-scale variant.
Phase
The exact position of a point in time (instant) on a waveform cycle. Discarded in TTS models and estimated by Vocoders.
Griffin-Lim
A mathematical algorithm used to reconstruct audio from a magnitude spectrogram by iteratively estimating the unknown phase.
Vocoder
Voice Coder. In ML, the component that synthesizes raw audio waveforms from intermediate representations like Mel-Spectrograms.