Why is WER sometimes over 100%?

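WER divides the number of errors by the number of words in the ground truth transcript, and the number of Insertions is not capped by that length. If the model 'hallucinates' and outputs many more words than the ground truth contains, the Insertions alone can outnumber the reference words and push WER past 100%.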
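A small worked example makes this concrete. The sentences are made up for illustration: suppose the ground truth is "turn left" (N = 2 words) and the model outputs "please turn left at the light". Both reference words are correct, but four extra words were inserted, so S = 0, D = 0, I = 4:

```latex
\[
\mathrm{WER} = \frac{S + D + I}{N} = \frac{0 + 0 + 4}{2} = 2.0 = 200\%
\]
```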

What is Out-Of-Vocabulary (OOV)?

In traditional pipeline systems, if a word wasn't in the Pronunciation Lexicon, the system simply couldn't output it. Modern E2E models largely sidestep this by outputting sub-word units or characters, which lets them spell out novel words or names piece by piece.
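The idea of building any word from smaller pieces can be sketched as follows. This is a deliberately simplified greedy longest-match segmenter, not the algorithm any particular model uses (real systems typically learn their units with methods such as BPE), and `segment` and `TOY_VOCAB` are hypothetical names invented for this illustration.

```python
def segment(word, vocab):
    """Greedy longest-match split of `word` into sub-word units.

    Falls back to single characters, so every word can be represented
    even if none of its pieces are in the vocabulary.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining span first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            # Accept a vocabulary hit, or a single character as last resort.
            if piece in vocab or end == start + 1:
                pieces.append(piece)
                start = end
                break
    return pieces


# Hypothetical toy vocabulary; real models learn thousands of units from data.
TOY_VOCAB = {"trans", "form", "er", "whis", "per", "ing"}

print(segment("transformer", TOY_VOCAB))  # ['trans', 'form', 'er']
print(segment("zyx", TOY_VOCAB))          # ['z', 'y', 'x']
```

Because single characters are always allowed, no word is ever out of vocabulary; unfamiliar words just break into more, smaller pieces.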

Does ASR work well for all accents?

Historically, no. ASR models reflect the biases of their training data. If a model is trained only on American English, it will typically have a much higher WER for Scottish or Indian English speakers. Modern datasets strive to include diverse accents to reduce this gap.
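One way to see such a gap is to report WER separately per accent group instead of as a single average. The sketch below assumes each test utterance has already been scored and labelled; the function name `wer_by_group`, the group labels, and all of the counts are made up purely for illustration.

```python
from collections import defaultdict


def wer_by_group(results):
    """Aggregate WER per group from (group, errors, reference_words) tuples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, n_errors, n_ref_words in results:
        errors[group] += n_errors
        words[group] += n_ref_words
    # Pool the counts before dividing, so long utterances weigh more
    # than short ones. Assumes every group has at least one reference word.
    return {group: errors[group] / words[group] for group in errors}


# Made-up counts: (accent group, S + D + I, words in reference)
results = [
    ("accent_a", 2, 40),
    ("accent_a", 1, 20),
    ("accent_b", 9, 40),
    ("accent_b", 6, 20),
]

for group, wer in sorted(wer_by_group(results).items()):
    print(f"{group}: {wer:.1%}")
```

A single pooled WER over both groups would hide the difference that the per-group breakdown exposes.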


Intro to ASR in AI

This tutorial introduces the architecture of modern speech recognition. It covers the transition from traditional 'Pipeline' systems to 'End-to-End' deep learning, explains the role of phonemes and lexicons, and shows how to evaluate models using the Word Error Rate (WER) metric.



Quick Quiz //

Which component decides that 'I read a book' is more likely than 'I red a book'?

Speech is the most natural form of human communication. ASR (Automatic Speech Recognition) is the technology that allows machines to turn that communication into actionable text.

1. The Traditional Pipeline

For decades, ASR was built as a multi-stage pipeline. The Acoustic Model (often a GMM-HMM) predicted which Phonemes were present in the audio. The Lexicon (a dictionary) mapped those sounds to possible words. Finally, the Language Model used N-grams or RNNs to determine which sequence of words was most probable given the context. While complex, this modular approach allowed researchers to improve each part independently, and it remains a foundational concept in the field.

—

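def classic_asr_pipeline(audio, acoustic_model, lexicon, language_model):
    # Stage 1: the acoustic model predicts which phonemes are in the audio
    phonemes = acoustic_model.predict(audio)
    # Stage 2: the lexicon maps those phonemes to candidate words
    word_candidates = lexicon.lookup(phonemes)
    # Stage 3: the language model picks the most probable word sequence
    best_sentence = language_model.score(word_candidates)
    return best_sentence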
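To make the lexicon and language model stages concrete, here is a minimal runnable toy. It assumes the acoustic model has already produced the phonemes, and everything in it is hypothetical: the ARPAbet-style pronunciation strings, the tiny `LEXICON`, and the hand-picked `BIGRAM_SCORES` stand in for resources that real systems build from large dictionaries and text corpora.

```python
from itertools import product

# Hypothetical toy lexicon: pronunciation -> words that sound like it.
LEXICON = {
    "AY": ["I", "eye"],
    "R EH D": ["read", "red"],
    "AH": ["a"],
    "B UH K": ["book"],
}

# Hypothetical bigram scores (higher = more plausible); unseen pairs score 0.
BIGRAM_SCORES = {
    ("I", "read"): 3,
    ("read", "a"): 3,
    ("red", "a"): 1,
    ("a", "book"): 3,
}


def lookup(pronunciations):
    """Lexicon stage: expand each pronunciation into its candidate words."""
    return [LEXICON[p] for p in pronunciations]


def score(sentence):
    """Language model stage: sum bigram scores over adjacent word pairs."""
    return sum(BIGRAM_SCORES.get(pair, 0) for pair in zip(sentence, sentence[1:]))


def decode(pronunciations):
    """Return the highest-scoring sentence among all candidate combinations."""
    candidates = product(*lookup(pronunciations))
    return " ".join(max(candidates, key=score))


print(decode(["AY", "R EH D", "AH", "B UH K"]))  # I read a book
```

The acoustic evidence alone cannot separate "read" from "red" because both share the pronunciation "R EH D"; only the language model's preference for "I read" settles it.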



2. The End-to-End Revolution

Modern systems, like OpenAI's Whisper or Google's ASR, have moved toward End-to-End (E2E) architectures. These models use deep neural networks (such as Transformers or Conformers) to map audio features, typically a Mel-Spectrogram and in some models the raw waveform, directly to the final text. By training on hundreds of thousands of hours of data, these models learn to handle noise, accents, and multiple languages within a single neural network, which greatly reduces the complexity of the deployment pipeline.

—

import whisper

# Load an end-to-end model
model = whisper.load_model("base")

# Direct audio-to-text inference
result = model.transcribe("audio.wav")
print(result["text"])


3. Word Error Rate (WER)

How do we know if an ASR model is good? We use Word Error Rate (WER). It is calculated by taking the number of Substitutions (wrong words), Deletions (missing words), and Insertions (extra words) and dividing by the total number of words in the 'Ground Truth' transcript. A WER of 5% is roughly human-level performance for clear English speech, while a WER of 20% or higher usually indicates a system that is difficult for users to rely on.

—

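def calculate_wer(reference, hypothesis):
    # count_errors is assumed to return the substitution, deletion, and
    # insertion counts from a word-level alignment of the two strings
    S, D, I = count_errors(reference, hypothesis)
    N = len(reference.split())  # number of words in the ground truth
    if N == 0:
        raise ValueError("reference transcript is empty")
    wer = (S + D + I) / N
    return wer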
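The `count_errors` helper is not defined in the snippet above. One possible implementation, offered as a sketch and not necessarily what the tutorial's authors had in mind, is a word-level edit distance (Levenshtein) table that also tracks how many substitutions, deletions, and insertions the cheapest alignment uses. It assumes whitespace tokenization and exact, case-sensitive word matching.

```python
def count_errors(reference, hypothesis):
    """Return (S, D, I) for a minimum-edit word alignment of two strings."""
    ref, hyp = reference.split(), hypothesis.split()
    # table[i][j] = (total_edits, S, D, I) for aligning ref[:i] with hyp[:j]
    table = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # empty hypothesis: delete every word
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # empty reference: insert every word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            total, s, d, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (total, s, d, ins)          # words match, no cost
            else:
                diagonal = (total + 1, s + 1, d, ins)  # substitution
            total, s, d, ins = table[i - 1][j]
            deletion = (total + 1, s, d + 1, ins)
            total, s, d, ins = table[i][j - 1]
            insertion = (total + 1, s, d, ins + 1)
            # Tuples compare by total edits first, so this keeps the cheapest.
            table[i][j] = min(diagonal, deletion, insertion)
    _, S, D, I = table[-1][-1]
    return S, D, I


S, D, I = count_errors("the cat sat on the mat", "the cat sit on mat")
print(S, D, I)          # 1 1 0
print((S + D + I) / 6)  # WER over the 6 reference words
```

When several alignments have the same total cost, the split between S, D, and I can differ between tools, but the WER itself stays the same.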


Example: 1 Substitution, 0 Deletions, and 0 Insertions over 20 reference words gives WER = 1 / 20 = 5.0%.