Deep Learning for ASR: Wav2Vec 2.0
"The shift from traditional Gaussian Mixture Models (GMMs) to Deep Neural Networks fundamentally altered Speech Recognition. With self-supervised models like Wav2Vec 2.0, we no longer need thousands of hours of manually transcribed audio."
The Paradigm Shift
Historically, Automatic Speech Recognition (ASR) pipelines were highly fragmented. Developers extracted Mel-Frequency Cepstral Coefficients (MFCCs), fed them into a GMM-HMM acoustic model, and then combined the acoustic scores with a separate pronunciation lexicon and language model during decoding. Today, End-to-End Deep Learning replaces this pipeline entirely, learning acoustic feature representations directly from the raw audio waveform.
How Wav2Vec 2.0 Learns
Developed by Facebook AI, Wav2Vec 2.0 is a Self-Supervised Learning framework. A multi-layer convolutional feature encoder takes raw audio as input and outputs latent speech representations. These latents are fed to a Transformer context network that builds contextualized representations, and are also discretized by a quantization module to provide targets for pretraining.
- Masking: Similar to how BERT masks words in NLP, Wav2Vec 2.0 masks spans of the encoded audio representations.
- Contrastive Task: The Transformer network tries to predict the correct quantized speech representation for the masked sections out of a set of distractors.
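To make the two stages concrete, here is a minimal sketch (assuming the Hugging Face `transformers` library and the `facebook/wav2vec2-base-960h` checkpoint discussed in the FAQ below; the random tensor stands in for real audio) that loads the model without its CTC head and inspects the contextualized representations coming out of the Transformer:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load the convolutional encoder + Transformer stack without the fine-tuned CTC head.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# One second of 16 kHz audio; a random tensor stands in for a real recording.
waveform = torch.randn(16_000)

inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    outputs = model(inputs.input_values)

# The convolutional encoder downsamples by roughly 320x, so ~16,000 samples
# become ~49 frames, each a 768-dimensional contextual representation.
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 49, 768]) for the base model
```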
Architectural FAQs
Why is a 16kHz sampling rate required?
Most pretrained Wav2Vec 2.0 models (like `facebook/wav2vec2-base-960h`) were trained on LibriSpeech, which is sampled at 16,000 Hz. The model will still run on 44.1 kHz audio, but the waveform no longer matches the statistics it was trained on, so you get garbage transcriptions; resample to 16 kHz before inference.
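In practice the fix is to resample before inference. A minimal sketch using `torchaudio` is shown below; the file name `speech_44k.wav` is a placeholder for whatever audio you have on disk:

```python
import torchaudio

# Hypothetical input file recorded at 44.1 kHz.
waveform, sample_rate = torchaudio.load("speech_44k.wav")

if sample_rate != 16_000:
    # Resample to the 16 kHz rate the pretrained checkpoint expects.
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16_000)
    waveform = resampler(waveform)
    sample_rate = 16_000

# The models also expect mono audio: average the channels if the file is stereo.
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
```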
What does CTC Decoding actually do?
Because people speak at different speeds, audio frames rarely align one-to-one with output characters (e.g., a 2-second clip saying "Hi" might have 100 frames covering 'H' and 100 covering 'I'). CTC (Connectionist Temporal Classification) solves this by predicting a probability distribution over characters, plus a special "blank" token, at every frame; at decode time repeated characters are collapsed and blanks are removed (e.g., H-H-H-blank-I-I-I -> HI).
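The greedy ("best path") form of that collapse rule fits in a few lines. The sketch below assumes per-frame logits from a CTC head and a toy `id_to_char` vocabulary for illustration; a real pipeline would use the model's own tokenizer (e.g. `Wav2Vec2Processor.batch_decode`) instead:

```python
import torch

def greedy_ctc_decode(logits: torch.Tensor, id_to_char: dict, blank_id: int = 0) -> str:
    """Pick the best character per frame, collapse repeats, then drop blanks."""
    frame_ids = logits.argmax(dim=-1).tolist()  # best character id per frame
    decoded, prev = [], None
    for idx in frame_ids:
        # A character survives only if it differs from the previous frame
        # (so H-H-H collapses to H) and is not the blank token.
        if idx != prev and idx != blank_id:
            decoded.append(id_to_char[idx])
        prev = idx
    return "".join(decoded)

# Toy vocabulary and frame sequence: H-H-H-blank-I-I-I -> "HI"
vocab = {0: "<blank>", 1: "H", 2: "I"}
toy_logits = torch.nn.functional.one_hot(
    torch.tensor([1, 1, 1, 0, 2, 2, 2]), num_classes=3
).float()
print(greedy_ctc_decode(toy_logits, vocab))  # HI
```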