
Audio Vocoders

From pixels of sound to audible waves. Master spectrogram inversion, the missing phase problem, and Neural Vocoder pipelines.




Concept: The Mel-Spectrogram

TTS models generate Mel-Spectrograms, not raw audio, because the Mel-Spectrogram is a far more compact, feature-rich representation of human speech.




Vocoders & Spectrogram Inversion

Author

Dr. Synth

Audio Machine Learning Lead // Code Syllabus

In modern Deep Learning for audio, we rarely generate raw waveforms directly. We generate Mel-Spectrograms. But a spectrogram is just an image of sound. How do we turn it back into audible waves? Enter the Vocoder.

The Phase Problem

When we convert an audio waveform to a spectrogram using the Short-Time Fourier Transform (STFT), we get complex numbers containing both Magnitude (how loud a frequency is) and Phase (the precise time alignment of the wave).
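This decomposition is easy to see in code. Below is a minimal NumPy sketch of a naive STFT (the frame size, hop, and test tone are illustrative choices, not values from the text): the FFT of each windowed frame is a complex number, whose absolute value is the Magnitude and whose angle is the Phase.

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Naive STFT: slice the signal into windowed frames, take the real FFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i*hop : i*hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)   # complex spectrogram

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)          # 1 second of a 440 Hz tone

S = stft(x)                              # shape: (frames, n_fft // 2 + 1)
magnitude = np.abs(S)                    # what the acoustic model is trained on
phase = np.angle(S)                      # discarded during training
```

A real pipeline would use an optimized STFT (e.g. from librosa or torch) and map the magnitude onto the Mel scale, but the Magnitude/Phase split is exactly this `np.abs` / `np.angle` pair.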

Neural networks struggle to predict Phase: it wraps around every 2π and looks like random noise from frame to frame. So we throw it away and train acoustic models only on Magnitude (the Mel-Spectrogram). The catch: you cannot reconstruct perfect audio without Phase, so something downstream must recover it.

Spectrogram Inversion (Griffin-Lim)

Before Deep Learning, the standard way to recover audio was the Griffin-Lim algorithm. It takes the magnitude spectrogram and iteratively guesses the phase.

Starting from a random Phase, it applies the Inverse-STFT, re-analyzes the result with a forward STFT, keeps the new Phase estimate while resetting the Magnitude to the known truth, and repeats. While mathematically elegant, it often produces audio that sounds "metallic" or "hollow."
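That loop can be sketched end-to-end in NumPy. This is a toy implementation for intuition (frame size, hop, and iteration count are arbitrary choices), not a substitute for the tuned versions in librosa or torchaudio:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    n = 1 + (len(x) - n_fft) // hop
    return np.fft.rfft(np.stack([x[i*hop : i*hop + n_fft] * win for i in range(n)]), axis=1)

def istft(S, n_fft=512, hop=128):
    win = np.hanning(n_fft)
    x = np.zeros(n_fft + hop * (len(S) - 1))
    norm = np.zeros_like(x)
    for i, frame in enumerate(np.fft.irfft(S, n=n_fft, axis=1)):
        x[i*hop : i*hop + n_fft] += frame * win      # overlap-add
        norm[i*hop : i*hop + n_fft] += win ** 2
    return x / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=32):
    """Estimate the unknown Phase for a known Magnitude spectrogram."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))   # random guess
    for _ in range(n_iter):
        x = istft(magnitude * phase)                 # synthesize a waveform
        S = stft(x)                                  # re-analyze it
        phase = S / np.maximum(np.abs(S), 1e-8)      # keep Phase, reset Magnitude
    return istft(magnitude * phase)

x = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)   # 1 s, 440 Hz tone
y = griffin_lim(np.abs(stft(x)))                       # audio from Magnitude alone
```

Each iteration nudges the phase estimate toward something consistent with the known magnitudes; the residual inconsistency is what you hear as the "metallic" artifact.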

Neural Vocoders

To achieve human-like fidelity, we use Neural Vocoders. These are separate neural networks trained specifically to translate Mel-Spectrograms into raw waveforms.

  • WaveNet: Autoregressive model by DeepMind. Extremely high quality, but notoriously slow to generate.
  • MelGAN & HiFi-GAN: Generative Adversarial Networks (GANs) that produce high-fidelity audio much faster than real-time.
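To make the GAN approach concrete, here is a heavily simplified, untrained generator in the HiFi-GAN style: a stack of transposed convolutions that upsamples 80-bin mel frames by 8·8·2·2 = 256, a typical hop size. All layer sizes here are illustrative; the real HiFi-GAN adds multi-receptive-field residual blocks and adversarial training.

```python
import torch
import torch.nn as nn

class TinyVocoder(nn.Module):
    """Toy HiFi-GAN-style generator: mel frames -> waveform (256 samples/frame)."""
    def __init__(self, n_mels=80):
        super().__init__()
        channels = 256
        self.pre = nn.Conv1d(n_mels, channels, kernel_size=7, padding=3)
        layers = []
        for factor in (8, 8, 2, 2):          # total upsampling: 8*8*2*2 = 256
            layers += [
                nn.LeakyReLU(0.1),
                nn.ConvTranspose1d(channels, channels // 2,
                                   kernel_size=factor * 2, stride=factor,
                                   padding=factor // 2),
            ]
            channels //= 2
        self.ups = nn.Sequential(*layers)
        self.post = nn.Conv1d(channels, 1, kernel_size=7, padding=3)

    def forward(self, mel):                  # mel: (batch, n_mels, frames)
        return torch.tanh(self.post(self.ups(self.pre(mel))))

mel = torch.randn(1, 80, 50)                 # 50 frames of fake acoustic features
wav = TinyVocoder()(mel)                     # (1, 1, 50 * 256) waveform samples
```

The key design point survives the simplification: the network is a deterministic, fully parallel map from spectrogram frames to audio samples, which is why GAN vocoders run much faster than real time while WaveNet must generate one sample at a time.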

Frequently Asked Questions

What is a Vocoder in Machine Learning?

In Machine Learning, a Vocoder is an algorithm or neural network that acts as the final step in a Text-To-Speech (TTS) pipeline. It takes the acoustic features (like a Mel-Spectrogram) generated by an Acoustic Model and synthesizes them into a listenable audio waveform.

Why don't Acoustic Models generate waveforms directly?

Raw audio waveforms have incredibly high dimensionality (e.g., 24,000 samples per second). It is computationally difficult for a single model to map text directly to that level of detail. Instead, the model maps text to a compressed, lower-dimensional representation (the Mel-Spectrogram), and the Vocoder handles the complex upsampling to audio.
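The arithmetic behind that split is worth spelling out. Assuming a 24 kHz sample rate with a hop of 256 samples and 80 mel bins (common defaults, not values mandated by any particular model):

```python
sr = 24000                    # raw waveform: samples per second
hop = 256                     # waveform samples per spectrogram frame
n_mels = 80                   # mel bins per frame

frames_per_sec = sr / hop     # 93.75 frames/s the acoustic model must predict
seq_reduction = sr / frames_per_sec                # 256x shorter output sequence
value_reduction = sr / (frames_per_sec * n_mels)   # 3.2x fewer numbers overall
```

So the acoustic model emits ~94 frames per second instead of 24,000 samples, and the Vocoder absorbs the 256× upsampling back to audio rate.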

Griffin-Lim vs Neural Vocoders: Which is better?

Neural Vocoders (like HiFi-GAN) are vastly superior in audio quality, producing clear, human-like speech. Griffin-Lim does not require training data or a GPU, making it lightweight and fast, but it produces distinct robotic artifacts and lower fidelity.

DSP Glossary

Spectrogram
A visual representation of the spectrum of frequencies of a signal as it varies with time. Deep learning uses the Mel-scale variant.
Phase
The exact position of a point in time (instant) on a waveform cycle. Discarded in TTS models and estimated by Vocoders.
Griffin-Lim
A mathematical algorithm used to reconstruct audio from a magnitude spectrogram by iteratively estimating the unknown phase.
Vocoder
Voice Coder. In ML, the component that synthesizes raw audio waveforms from intermediate representations like Mel-Spectrograms.