Silence is golden, but for an AI, it's also expensive. Voice Activity Detection (VAD) ensures that we only spend compute resources when there is actually something worth hearing.
1Speech Triage
A VAD acts as a triage system for audio. Running a 1-billion parameter ASR model on a continuous stream of audio would melt a phone's battery in minutes. Instead, a lightweight VAD (using simple features like Energy, Spectral Flatness, and Pitch) runs constantly at very low power. Only when the VAD is 90% sure it hears a human voice does it 'wake up' the heavy ASR model to perform the transcription. This tiered architecture is the secret to the 24/7 responsiveness of devices like Alexa and Siri.
def process_audio(frame):
if vad.is_speech(frame):
# Wake up heavy ASR
transcribe(frame)
else:
# Sleep and save power
pass2The Physics of the Voice
VADs distinguish speech from noise by looking for the specific characteristics of the human vocal tract. Voiced sounds (like vowels) have a periodic structure and a clear Fundamental Frequency ($F_0$). Unvoiced sounds (like 's' or 'f') look like white noise but have specific spectral shapes. Background noise, like a humming air conditioner, is usually stationary (it doesn't change much), while speech is highly dynamic. By tracking these changes, a VAD can 'tune out' a noisy cafe and focus only on the speaker.
import librosa
# Detect fundamental frequency (F0)
f0, voiced_flag, _ = librosa.pyin(y, fmin=50, fmax=300)
# Check if frame is voiced
is_human = any(voiced_flag)3Tuning the Gatekeeper
Deploying a VAD in the real world requires careful tuning of two parameters. Sensitivity (or Threshold) determines how much energy is needed to trigger the 'Speech' state—too high and you miss quiet talkers; too low and you trigger on every passing car. Hangover Time is the duration the VAD stays active after speech seems to have stopped. Without a few hundred milliseconds of hangover, the VAD would cut off the natural pauses between words, resulting in fragmented and unusable transcripts.
class VADController:
def __init__(self):
self.sensitivity = 0.85
self.hangover_ms = 300
self.active = False