Spectral analysis is for understanding; synthesis is for hearing. Vocoders are the bridge that transforms abstract frequencies back into physical vibrations.
1. The Missing Dimension
When we create a magnitude spectrogram, we keep the strength of each frequency over time but discard the phase. Phase is the information about *where* in its cycle each wave starts. Without it, we can't perfectly 'invert' the spectrogram back into audio. The Griffin-Lim algorithm is a traditional method that 'guesses' the phase: starting from an arbitrary phase, it repeatedly applies the inverse short-time Fourier transform and then the forward transform, restoring the known magnitudes each time, until the signal and its spectrogram become consistent with each other. While useful, it often produces a 'metallic' sound because the estimated phase is rarely exact.
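The iteration described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not a production implementation: the frame size, hop, sine window, naive DFT, and all function names are chosen here for readability, and real libraries use an FFT (for example `librosa.griffinlim`).

```python
import cmath
import math
import random

N_FFT, HOP = 16, 4  # toy sizes (assumed for illustration)
# Strictly positive sine window, so every sample gets a nonzero weight.
WINDOW = [math.sin(math.pi * (n + 0.5) / N_FFT) for n in range(N_FFT)]

def dft(frame, inverse=False):
    """Naive O(N^2) DFT; fine for a toy frame size."""
    n = len(frame)
    sign = 1 if inverse else -1
    out = [sum(frame[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
               for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out

def stft(signal):
    """Cut the signal into windowed frames and transform each one."""
    starts = range(0, len(signal) - N_FFT + 1, HOP)
    return [dft([signal[s + n] * WINDOW[n] for n in range(N_FFT)]) for s in starts]

def istft(spectra, length):
    """Least-squares overlap-add inverse, weighted by the squared window."""
    signal, norm = [0.0] * length, [0.0] * length
    for i, spec in enumerate(spectra):
        frame = dft(spec, inverse=True)
        for n in range(N_FFT):
            signal[i * HOP + n] += frame[n].real * WINDOW[n]
            norm[i * HOP + n] += WINDOW[n] ** 2
    return [s / w if w > 0 else 0.0 for s, w in zip(signal, norm)]

def griffin_lim(magnitudes, length, n_iter=30, seed=0):
    rng = random.Random(seed)
    # Start from random phase: we know how strong each bin is,
    # but not where in its cycle it starts.
    spectra = [[m * cmath.exp(2j * math.pi * rng.random()) for m in frame]
               for frame in magnitudes]
    errors = []
    for _ in range(n_iter):
        signal = istft(spectra, length)   # back to the time domain
        estimate = stft(signal)           # forward again: a consistent spectrogram
        # How far the estimate's magnitudes are from the known ones.
        errors.append(sum((abs(e) - m) ** 2
                          for ef, mf in zip(estimate, magnitudes)
                          for e, m in zip(ef, mf)))
        # Keep the estimated phase, restore the known magnitudes.
        spectra = [[m * (e / abs(e)) if abs(e) > 1e-12 else m
                    for e, m in zip(ef, mf)]
                   for ef, mf in zip(estimate, magnitudes)]
    return istft(spectra, length), errors

# Usage: discard the phase of a short sine wave, then estimate it back.
true_signal = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
target = [[abs(v) for v in frame] for frame in stft(true_signal)]
recovered, errors = griffin_lim(target, len(true_signal))
print(f"magnitude mismatch: first pass {errors[0]:.4f}, last pass {errors[-1]:.4f}")
```

The mismatch between the estimate's magnitudes and the known magnitudes does not increase from one pass to the next, but nothing forces it to reach zero, which is one way to see why the result can still sound artificial.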
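import librosa

# Invert a mel spectrogram using Griffin-Lim.
# Note: this is an estimate, not an exact reconstruction.
# mel_spectrogram is assumed to be a power mel spectrogram computed at the
# same sample rate (for example with librosa.feature.melspectrogram).
audio = librosa.feature.inverse.mel_to_audio(
    mel_spectrogram,
    sr=22050,
    n_iter=32,  # more iterations usually refine the phase estimate
)
2. Point-by-Point Synthesis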
WaveNet was a landmark neural vocoder. It treated audio as a sequence of discrete samples and predicted them one at a time, each conditioned on all the samples before it: $P(x_t \mid x_{t-1}, \ldots, x_1)$. Because it was autoregressive, generation was very slow: at a sample rate of 22,050 Hz, one second of audio takes 22,050 sequential predictions. In exchange, it produced some of the most natural-sounding synthetic speech of its time. This showed that neural networks could learn the fine-grained details of human speech, including subtle breaths and mouth sounds, that traditional algorithms missed.
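To make 'discrete samples' concrete: the WaveNet paper describes compressing each sample with 8-bit mu-law companding, so the network picks one of 256 classes at every step instead of regressing a real number. The sketch below illustrates that quantization only; the helper names `mu_law_encode` and `mu_law_decode` are chosen here and do not come from any library.

```python
import math

MU = 255  # 8-bit companding gives 256 classes

def mu_law_encode(x):
    """Map a sample in [-1, 1] to an integer class in [0, 255]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int((compressed + 1) / 2 * MU + 0.5)

def mu_law_decode(index):
    """Map an integer class back to an approximate sample in [-1, 1]."""
    y = 2 * index / MU - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

# Usage: quiet samples get finer resolution than loud ones.
for sample in (-1.0, -0.01, 0.0, 0.01, 0.5, 1.0):
    code = mu_law_encode(sample)
    print(sample, code, round(mu_law_decode(code), 4))
```

The logarithmic spacing spends more of the 256 classes on quiet amplitudes, where small differences are easier to hear.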
# WaveNet concept (autoregressive)
# wavenet, context, and total_samples are placeholders
audio_samples = []
for _ in range(total_samples):
    # Predict the next sample from the most recent `context` samples
    next_sample = wavenet.predict(audio_samples[-context:])
    audio_samples.append(next_sample)
3. GAN-based Vocoders
A leading current approach uses Generative Adversarial Networks (GANs), such as HiFi-GAN. These models use a Generator network to produce the whole waveform in parallel and one or more Discriminators to judge whether the audio sounds like real human speech. This adversarial training pushes the generator to produce realistic high-frequency detail and coherent phase. Because they are not autoregressive, GAN-based vocoders can synthesize audio orders of magnitude faster than WaveNet at comparable or better fidelity, which has made them a common choice for production TTS systems.
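# HiFi-GAN concept (parallel)
# hifigan_generator and hifigan_discriminator are placeholders
# Generates all samples in a single forward pass from the spectrogram
waveform = hifigan_generator(mel_spectrogram)
# The discriminator scores realism during training only; it is not needed at inference
score = hifigan_discriminator(waveform)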
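The adversarial objective can be sketched with a least-squares GAN loss, the form described in the HiFi-GAN paper. This is only the adversarial part, written on plain lists of discriminator scores; the function names are chosen for illustration, and the full HiFi-GAN objective also includes mel-spectrogram and feature-matching terms that are omitted here.

```python
def discriminator_loss(real_scores, fake_scores):
    """Push scores on real audio toward 1 and on generated audio toward 0."""
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return real_term + fake_term

def generator_adversarial_loss(fake_scores):
    """The generator is rewarded when its output is scored as real (1)."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)

# Usage with made-up scores: the discriminator is currently winning.
real_scores = [0.9, 1.0]   # scores given to recorded speech
fake_scores = [0.1, 0.0]   # scores given to generated speech
d_loss = discriminator_loss(real_scores, fake_scores)
g_loss = generator_adversarial_loss(fake_scores)
print(f"discriminator loss: {d_loss:.3f}, generator loss: {g_loss:.3f}")
```

The two losses pull in opposite directions on the same scores: as the generator improves, the scores on generated audio rise, lowering its loss and raising the discriminator's.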