Deep Learning for ASR: Wav2Vec 2.0
"The shift from traditional Gaussian Mixture Models (GMMs) to Deep Neural Networks fundamentally altered Speech Recognition. With self-supervised models like Wav2Vec 2.0, we no longer need thousands of hours of manually transcribed audio."
The Paradigm Shift
Historically, Automatic Speech Recognition (ASR) pipelines were highly fragmented. Developers extracted Mel-Frequency Cepstral Coefficients (MFCCs), fed them into a GMM-HMM acoustic model, and then combined the acoustic scores with a separate pronunciation lexicon and language model during decoding. Today, End-to-End Deep Learning replaces this pipeline entirely, learning acoustic feature representations directly from the raw audio waveform.
How Wav2Vec 2.0 Learns
Developed by Facebook AI, Wav2Vec 2.0 is a Self-Supervised Learning framework. A multi-layer convolutional feature encoder takes raw audio as input and outputs latent speech representations. These latents are fed to a Transformer context network that builds contextualized representations, and are also discretized by a quantization module to provide targets for pretraining.
- Masking: Similar to how BERT masks words in NLP, Wav2Vec 2.0 masks spans of the encoded audio representations.
- Contrastive Task: The Transformer network tries to predict the correct quantized speech representation for the masked sections out of a set of distractors.
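To make the two stages concrete, here is a minimal sketch (assuming the Hugging Face `transformers` library and the `facebook/wav2vec2-base-960h` checkpoint discussed in the FAQ below; the random tensor stands in for real audio) that loads the model without its CTC head and inspects the contextualized representations coming out of the Transformer:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load the convolutional encoder + Transformer stack without the fine-tuned CTC head.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# One second of 16 kHz audio; a random tensor stands in for a real recording.
waveform = torch.randn(16_000)

inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(inputs.input_values)

# The convolutional encoder downsamples by roughly 320x, so ~16,000 samples
# become ~49 frames, each a 768-dimensional contextual representation.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 49, 768]) for the base model
```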
Architectural FAQs
Why is a 16kHz sampling rate required?
Most pretrained Wav2Vec 2.0 models (like `facebook/wav2vec2-base-960h`) were trained on LibriSpeech, which is sampled at 16,000 Hz. The model will still run on 44.1 kHz audio, but the waveform no longer matches the statistics it was trained on, so you get garbage transcriptions; resample to 16 kHz before inference.
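In practice the fix is to resample before inference. A minimal sketch using `torchaudio` is shown below; the file name `speech_44k.wav` is a placeholder for whatever audio you have on disk:

```python
import torchaudio

# Hypothetical input file recorded at 44.1 kHz.
waveform, sample_rate = torchaudio.load("speech_44k.wav")

if sample_rate != 16_000:
    # Resample to the 16 kHz rate the pretrained checkpoint expects.
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16_000)
    waveform = resampler(waveform)
    sample_rate = 16_000

# The models also expect mono audio: average the channels if the file is stereo.
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
```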
What does CTC Decoding actually do?
Because people speak at different speeds, audio frames rarely align one-to-one with output characters (e.g., a 2-second clip saying "Hi" might have 100 frames covering 'H' and 100 covering 'I'). CTC (Connectionist Temporal Classification) solves this by predicting a probability distribution over characters, plus a special "blank" token, at every frame; at decode time repeated characters are collapsed and blanks are removed (e.g., H-H-H-blank-I-I-I -> HI).
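The greedy ("best path") form of that collapse rule fits in a few lines. The sketch below assumes per-frame logits from a CTC head and a toy `id_to_char` vocabulary for illustration; a real pipeline would use the model's own tokenizer (e.g. `Wav2Vec2Processor.batch_decode`) instead:

```python
import torch

def greedy_ctc_decode(logits: torch.Tensor, id_to_char: dict, blank_id: int = 0) -> str:
    """Pick the best character per frame, collapse repeats, then drop blanks."""
    frame_ids = logits.argmax(dim=-1).tolist()  # best character id per frame
    decoded, prev = [], None
    for idx in frame_ids:
        # A character survives only if it differs from the previous frame
        # (so H-H-H collapses to H) and is not the blank token.
        if idx != prev and idx != blank_id:
            decoded.append(id_to_char[idx])
        prev = idx
    return "".join(decoded)

# Toy vocabulary and frame sequence: H-H-H-blank-I-I-I -> "HI"
vocab = {0: "<blank>", 1: "H", 2: "I"}
toy_logits = torch.nn.functional.one_hot(
    torch.tensor([1, 1, 1, 0, 2, 2, 2]), num_classes=3
).float()
print(greedy_ctc_decode(toy_logits, vocab))  # HI
```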