Speech synthesis is more than just reading words aloud. It's about capturing the nuance, emotion, and rhythm of human communication.
1The Text Front-End
The first step in TTS is Text Normalization. The machine must convert symbols like '$100' to 'one hundred dollars' and 'St.' to 'street' or 'saint' based on context. It then performs Grapheme-to-Phoneme (G2P) conversion, mapping letters to their phonetic representations. This stage also handles Prosody Prediction, deciding which words to stress and how the pitch of the voice should rise and fall to sound natural instead of robotic.
def text_frontend(text):
normalized = normalize(text)
# 'I have $5' -> 'I have five dollars'
phonemes = g2p(normalized)
# -> /aɪ hˈæv fˈaɪv dˈɑlɚz/
return phonemes2The Acoustic Brain
Once the machine has a sequence of phonemes and prosody markers, the Acoustic Model takes over. In modern systems, this is a neural network (like an Encoder-Decoder with Attention). Its job is to predict the Acoustic Features (usually a Mel-Spectrogram) that correspond to that text. This stage is where the 'Style' of the voice is determined—a model trained on a specific speaker will generate spectrograms that carry that speaker's unique vocal characteristics.
# Neural Acoustic Model (e.g. Tacotron 2)
mel_spectrogram = acoustic_model.generate(phonemes)
# The spectrogram contains the 'style' of the speaker
# It's an image representation of the sound frequencies3Waveform Synthesis
A spectrogram is an image, not a sound. The final stage of TTS is the Vocoder. This component takes the predicted Mel-Spectrogram and synthesizes the raw Time-Domain Waveform. Traditional vocoders like Griffin-Lim were fast but sounded metallic. Modern Neural Vocoders (like WaveNet, HiFi-GAN, or WaveGlow) use deep learning to generate samples at 24,000+ Hz, producing speech that is virtually indistinguishable from a real human recording.
# Neural Vocoder (e.g. HiFi-GAN)
audio_waveform = vocoder.synthesize(mel_spectrogram)
# Save as playable audio file
save_wav("output.wav", audio_waveform, sample_rate=24000)