The most common form of Edge AI is always-on listening. Learn the signal processing and neural network techniques that power modern voice assistants.
1. From Sound to Spectrogram
Microphones capture sound as a sequence of air pressure values over time. This raw 1D waveform is difficult for neural networks to process efficiently. Instead, we use Digital Signal Processing (DSP) to convert the audio into a spectrogram, a 2D map of how much energy each frequency band carries over time. Specifically, we use MFCCs (Mel-frequency cepstral coefficients), a compact set of features derived from a spectrogram whose frequency axis follows the mel scale, which approximates the non-linear way humans perceive pitch. This turns a 1-second audio clip into a small 2D 'image' that a Convolutional Neural Network (CNN) can classify.
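To make the "non-linear" mapping concrete, here is a minimal sketch of the mel scale using the common HTK-style formula. The function names and the choice of 10 bands up to 8 kHz are illustrative assumptions, and libraries differ in the exact mel variant they use by default, so treat this as an illustration of the idea rather than what any particular toolkit computes.

```python
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands: int, f_min: float, f_max: float) -> list:
    """Return n_bands + 2 band edges in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


# Illustrative values: 10 bands covering 0 Hz to 8 kHz (the Nyquist limit at 16 kHz).
edges = mel_band_edges(n_bands=10, f_min=0.0, f_max=8000.0)
widths = [b - a for a, b in zip(edges, edges[1:])]

for a, b in zip(edges, edges[1:]):
    print(f"{a:7.1f} Hz to {b:7.1f} Hz")
```

The printed bands are narrow at low frequencies and get steadily wider toward 8 kHz, which is how mel-based features spend more resolution where human hearing is most sensitive.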
2. The Sliding Window
Wake word detection is a continuous process. The device uses a Sliding Window: every 100-200 milliseconds it runs the model on the most recent 1 second of audio, so consecutive windows overlap heavily and inference runs roughly 5 to 10 times per second. To save battery, many devices use a two-stage system: a tiny, ultra-low-power 'Analog Trigger' or simple energy detector wakes the main MCU only when it detects significant sound, and the MCU then runs the full TFLite Micro model.
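The sliding window and the two-stage gate can be sketched in plain Python. This is a software simulation under stated assumptions: `run_model` is a hypothetical stand-in for the real wake word model, `ENERGY_THRESHOLD` is an arbitrary illustrative value, and on real hardware the first stage would typically live in a separate low-power circuit rather than in the same loop.

```python
from collections import deque

SAMPLE_RATE = 16000
WINDOW_SAMPLES = SAMPLE_RATE        # 1 second of audio
HOP_SAMPLES = SAMPLE_RATE // 10     # new window every 100 ms
ENERGY_THRESHOLD = 1e-3             # hypothetical gate level, tune per device


def mean_energy(samples):
    """Average squared amplitude of a block of samples."""
    return sum(s * s for s in samples) / len(samples)


def run_model(window):
    """Hypothetical placeholder for the wake word model; returns a score in [0, 1]."""
    return 0.0


def sliding_windows(stream, window=WINDOW_SAMPLES, hop=HOP_SAMPLES):
    """Yield the most recent `window` samples each time `hop` new samples arrive."""
    buf = deque(maxlen=window)
    since_last = 0
    for sample in stream:
        buf.append(sample)
        since_last += 1
        if len(buf) == window and since_last >= hop:
            since_last = 0
            yield list(buf)


def detect(stream, model=run_model, score_threshold=0.5):
    """Count windows seen, model inferences run, and wake word triggers."""
    stats = {"windows": 0, "inferences": 0, "triggers": 0}
    for win in sliding_windows(stream):
        stats["windows"] += 1
        # Stage 1: cheap energy check on the newest 100 ms only.
        if mean_energy(win[-HOP_SAMPLES:]) < ENERGY_THRESHOLD:
            continue
        # Stage 2: run the (expensive) model on the full 1 s window.
        stats["inferences"] += 1
        if model(win) >= score_threshold:
            stats["triggers"] += 1
    return stats


# One second of silence followed by one second of loud audio.
stream = [0.0] * SAMPLE_RATE + [0.5] * SAMPLE_RATE
print(detect(stream))
```

In this sketch the silent second never reaches the model, which is the point of the first stage: the expensive inference only runs while there is something to hear.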
import librosa
# Load the first 1 s of audio, resampled to 16 kHz
waveform, sr = librosa.load('audio.wav', sr=16000, duration=1.0)
# Extract MFCCs (Mel-frequency cepstral coefficients)
mfccs = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=10)
# Shape is (n_mfcc, frames): (10, 32) for a full 1 s clip with librosa's default hop_length=512
print(f'MFCC shape: {mfccs.shape}')
3. False Alarms & Rejections
The success of a wake word model is measured by two metrics: False Acceptance Rate (FAR), how often the device wakes up when it shouldn't, and False Rejection Rate (FRR), how often it fails to wake up when you say the wake word. The two trade off against each other, so balancing them is critical. A high FAR hurts privacy and battery life, while a high FRR frustrates users. This balance is often tuned on the device by adjusting the 'Threshold', the probability score the model must reach to trigger the assistant: raising it lowers FAR but raises FRR, and lowering it does the opposite.
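The effect of the threshold can be seen with a small sketch. The scores and labels below are made-up illustrative values, not measurements, and FAR and FRR are computed per clip here as an assumption about how the evaluation set is organised.

```python
def far_frr(scores, labels, threshold):
    """Return (FAR, FRR) for a given trigger threshold.

    scores: model probabilities, one per clip.
    labels: True if the clip really contains the wake word.
    """
    negatives = sum(1 for y in labels if not y)
    positives = sum(1 for y in labels if y)
    # False accept: no wake word, but the score reaches the threshold.
    false_accepts = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    # False reject: wake word present, but the score stays below the threshold.
    false_rejects = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    far = false_accepts / negatives if negatives else 0.0
    frr = false_rejects / positives if positives else 0.0
    return far, frr


# Illustrative data only: four wake word clips, four background clips.
scores = [0.95, 0.80, 0.60, 0.40, 0.70, 0.30, 0.20, 0.10]
labels = [True, True, True, True, False, False, False, False]

for threshold in (0.25, 0.50, 0.75):
    far, frr = far_frr(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  FAR={far:.2f}  FRR={frr:.2f}")
```

On this toy data, moving the threshold from 0.25 to 0.75 drives FAR from 0.50 down to 0.00 while FRR climbs from 0.00 to 0.50, which is the trade-off the threshold controls.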