A silent room is never truly silent. Voice Activity Detection (VAD) is the technology that identifies when a human starts speaking, and it acts as the trigger for virtually every AI voice assistant.
1. The Power of Silence
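The simplest form of VAD is an energy threshold: if the RMS energy of a frame exceeds a set level, the frame is assumed to be speech. This fails in noisy environments (a windy day, a busy street), where background noise can be as loud as a voice. A common fix is to combine energy with the zero-crossing rate (ZCR), the fraction of adjacent samples at which the waveform changes sign. Voiced speech, especially vowels, has a low and fairly steady ZCR, while broadband noise such as hiss or static has a high, erratic one. The trade-off is that unvoiced consonants like "s" and "f" also have a high ZCR, so a strict ZCR limit can clip them.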
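The two features described above can each be computed in a few lines. The following is a minimal sketch in pure Python, assuming a frame is a list of float samples. The function names match the helpers called in the snippet that follows, but the implementations here are illustrative assumptions, not a reference implementation.

```python
import math

def calculate_rms(frame):
    """Root-mean-square energy of a frame of float samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def calculate_zcr(frame):
    """Fraction of adjacent sample pairs whose signs differ (0.0 to 1.0)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for prev, cur in zip(frame, frame[1:]) if (prev < 0) != (cur < 0)
    )
    return crossings / (len(frame) - 1)

# Illustrative input: 20 ms of a 200 Hz tone at 16 kHz, a rough stand-in for a vowel
SAMPLE_RATE = 16000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(320)]
tone_rms = calculate_rms(tone)
tone_zcr = calculate_zcr(tone)  # low, because the tone changes sign only twice per cycle
```

Normalizing the ZCR by the number of sample pairs keeps the threshold independent of frame length, so the same `max_zcr` value can be reused if the frame size changes.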
def basic_vad(audio_frame, energy_thresh, max_zcr):
    # calculate_rms / calculate_zcr: helpers returning the frame's RMS energy and zero-crossing rate
    energy = calculate_rms(audio_frame)
    zcr = calculate_zcr(audio_frame)
    # Speech has high energy but bounded ZCR
    if energy > energy_thresh and zcr < max_zcr:
        return True
    return False
2. Industry Standards
Many production systems use WebRTC VAD, a highly optimized detector developed by Google for the WebRTC project. It combines sub-band energy features with Gaussian mixture models to separate speech from noise at very low latency. It offers four aggressiveness modes, 0 to 3: low modes let some noise through so that less speech is missed, while high modes trigger only on clear speech.
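The detector works on short, fixed-length frames, not on a continuous stream, so audio has to be sliced before it is classified. As a sketch, assuming the audio is 16-bit mono PCM held in a bytes object (the format the `webrtcvad` Python bindings accept), a hypothetical helper `split_into_frames` could do the slicing like this:

```python
def split_into_frames(pcm_bytes, sample_rate=16000, frame_ms=30):
    """Yield fixed-length frames from 16-bit mono PCM, dropping a trailing partial frame."""
    if frame_ms not in (10, 20, 30):
        raise ValueError("frame_ms must be 10, 20, or 30")
    # 2 bytes per 16-bit sample
    bytes_per_frame = sample_rate * frame_ms // 1000 * 2
    last_start = len(pcm_bytes) - bytes_per_frame
    for start in range(0, last_start + 1, bytes_per_frame):
        yield pcm_bytes[start:start + bytes_per_frame]

# Illustrative input: 100 ms of silence at 16 kHz (1600 samples, 3200 bytes)
silence = b"\x00\x00" * 1600
frames = list(split_into_frames(silence))
```

Dropping the trailing partial frame is one possible design choice; a streaming application would more likely carry the leftover bytes over into the next chunk.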
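import webrtcvad
SAMPLE_RATE = 16000  # supported rates: 8000, 16000, 32000, 48000 Hz
# Placeholder input: one 30 ms frame of 16-bit mono PCM silence
frame = b"\x00\x00" * (SAMPLE_RATE * 30 // 1000)
buffer = []  # collects the frames classified as speech
# Mode 3 is the most aggressive (fewest false positives)
vad = webrtcvad.Vad(3)
# Process 10ms, 20ms, or 30ms frames
if vad.is_speech(frame, sample_rate=SAMPLE_RATE):
    buffer.append(frame)
3. Efficiency in the Cloud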
Processing speech with an ASR model (like Whisper or Wav2Vec) is computationally expensive. An app that streamed audio to the cloud around the clock would run up large compute bills and drain the user's battery. VAD acts as a gatekeeper: it runs locally on the device at minimal power, and only when it detects speech does it wake the main AI model for full transcription. Because an always-on stream is mostly silence, this can cut processing costs by 90% or more.
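def system_loop(audio_stream, sample_rate=16000):
    # whisper_model and execute_command stand in for the app's ASR model and command handler
    speech_frames = []
    for frame in audio_stream:
        if vad.is_speech(frame, sample_rate):
            # Speech detected: keep collecting the current utterance
            speech_frames.append(frame)
        elif speech_frames:
            # Utterance ended: WAKE UP EXPENSIVE MODEL once for the whole utterance
            text = whisper_model.transcribe(b"".join(speech_frames))
            execute_command(text)
            speech_frames = []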
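In practice, a brief pause between words or a single misclassified frame should not end an utterance. One common remedy is a hangover period: the gate closes only after several consecutive non-speech frames. The sketch below illustrates the idea under stated assumptions. `gate_utterances`, `hangover_frames`, and the `is_speech` callback are hypothetical names, and in a real system the callback would wrap a detector such as `vad.is_speech`.

```python
def gate_utterances(frames, is_speech, hangover_frames=10):
    """Group frames into utterances, tolerating pauses shorter than hangover_frames."""
    utterance = []  # frames confirmed as part of the current utterance
    pending = []    # silent frames that may turn out to be a mid-utterance pause
    for frame in frames:
        if is_speech(frame):
            # The pause was short: fold it back into the utterance
            utterance.extend(pending)
            pending = []
            utterance.append(frame)
        elif utterance:
            pending.append(frame)
            if len(pending) >= hangover_frames:
                # The pause was long enough: the utterance is complete
                yield utterance
                utterance, pending = [], []
    if utterance:
        # Flush whatever is left when the stream ends
        yield utterance

# Illustrative run with a toy detector: "S" marks speech, "." marks silence
stream = list("..SS.SS....SSS..")
utterances = [
    "".join(u)
    for u in gate_utterances(stream, lambda f: f == "S", hangover_frames=3)
]
```

Each yielded utterance can then be passed to the expensive model in a single call. A longer hangover merges more pauses into one utterance at the cost of a slower response after the speaker stops.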