How does a device 'listen' for years on a battery? The answer is a specialized, ultra-low-power neural network that knows only one thing: its name.
1. Spectrograms and MFCCs
Raw audio is a densely sampled waveform (tens of thousands of samples per second), which is difficult for standard neural networks to analyze directly. In Keyword Spotting, short snippets of audio are instead converted into a 2D, image-like feature map. A spectrogram shows how energy is distributed across frequencies over time; MFCCs (Mel-Frequency Cepstral Coefficients) are a compact version of it, computed by warping the frequency axis to the perceptual mel scale, taking the logarithm, and applying a discrete cosine transform. By treating sound as an image, we can use a small Convolutional Neural Network (CNN) to recognize the 'visual fingerprint' of a wake word like 'Hey Alexa' accurately and at very low computational cost.
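To make the "sound as an image" idea concrete, here is a minimal sketch of a CNN forward pass over a 32 x 32 feature map, written with NumPy only. The layer sizes, the names `conv2d` and `tiny_cnn`, and the random (untrained) weights are all assumptions made for illustration, so the printed probabilities carry no meaning. The point is only how few parameters such a classifier needs.

```python
import numpy as np

def conv2d(x, kernels):
    """Valid 2D cross-correlation. x: (H, W), kernels: (K, kh, kw)."""
    k, kh, kw = kernels.shape
    h, w = x.shape
    out = np.zeros((k, h - kh + 1, w - kw + 1))
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            patch = x[i:i + kh, j:j + kw]
            # Apply all K kernels to the same patch at once.
            out[:, i, j] = np.sum(kernels * patch, axis=(1, 2))
    return out

def tiny_cnn(spectrogram, kernels, dense_w, dense_b):
    """Conv -> ReLU -> global average pool -> dense -> softmax."""
    features = np.maximum(conv2d(spectrogram, kernels), 0.0)
    pooled = features.mean(axis=(1, 2))      # one value per kernel
    logits = dense_w @ pooled + dense_b      # [not_wake, wake]
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()

# Hypothetical, untrained weights: 8 kernels of 3x3, then a 2-class layer.
rng = np.random.default_rng(0)
kernels = rng.normal(size=(8, 3, 3)) * 0.1
dense_w = rng.normal(size=(2, 8))
dense_b = np.zeros(2)

spectrogram = rng.normal(size=(32, 32))      # stand-in for a real feature map
probs = tiny_cnn(spectrogram, kernels, dense_w, dense_b)
n_params = kernels.size + dense_w.size + dense_b.size
print("class probabilities:", probs, "parameters:", n_params)
```

A trained wake-word model would be larger than this, but the same structure (a few small convolutions followed by pooling) is what keeps the compute budget low.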
Audio_Stream: [44.1kHz_Mono]
Feature: Spectrogram_Slice
Classifier: CNN_Small
Output: [WAKE_DETECTED: 0.98]
Status: LISTENING_ACTIVE
2. The Cascaded Trigger Strategy
To save power, smart devices use cascaded architectures. A tiny, 'dumb' analog or low-bit digital circuit continuously monitors sound levels. If the sound energy crosses a set threshold, it wakes a small micro-model (running on an NPU or DSP) to check for the wake word. Only if this micro-model is confident does the device wake its main application processor to handle the full user request. This multi-stage approach keeps the battery-draining components asleep 99.9% of the time while maintaining the 'always-on' feel.
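The control flow of such a cascade can be sketched in a few lines of plain Python. Everything named here is hypothetical: the thresholds, the `CascadeStats` counters, and the `fake_micro_model` stand-in (a real device would run a trained network on the NPU or DSP at that step). The sketch only shows how each stage gates the next, more expensive one.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class CascadeStats:
    frames: int = 0
    micro_model_runs: int = 0
    main_cpu_wakes: int = 0

def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude, a cheap proxy for loudness."""
    return sum(s * s for s in frame) / len(frame)

def run_cascade(
    frames: Sequence[Sequence[float]],
    micro_model: Callable[[Sequence[float]], float],
    energy_threshold: float = 0.01,       # assumed value
    confidence_threshold: float = 0.9,    # assumed value
) -> CascadeStats:
    stats = CascadeStats()
    for frame in frames:
        stats.frames += 1
        # Stage 1: energy gate. Quiet frames never wake anything.
        if frame_energy(frame) < energy_threshold:
            continue
        # Stage 2: the small wake-word model runs only on loud frames.
        stats.micro_model_runs += 1
        if micro_model(frame) < confidence_threshold:
            continue
        # Stage 3: only a confident detection wakes the main processor.
        stats.main_cpu_wakes += 1
    return stats

def fake_micro_model(frame: Sequence[float]) -> float:
    """Stand-in classifier: treats very high peaks as the wake word."""
    return 0.98 if max(abs(s) for s in frame) > 0.8 else 0.1

silence = [[0.0] * 160] * 97
background_noise = [[0.3] * 160] * 2
wake_word = [[0.9] * 160]
stats = run_cascade(silence + background_noise + wake_word, fake_micro_model)
print(stats)
```

In this toy stream of 100 frames, the micro-model runs on only 3 of them and the main processor wakes once, which is the behavior the cascade is designed to produce.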
Raw_Audio -> FFT -> Mel_Scale -> MFCC
Input_Shape: (32, 32, 1) // Spectrogram snippet
Status: AUDIO_TO_IMAGE_SUCCESS
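The Raw_Audio -> FFT -> Mel_Scale -> MFCC chain above can be sketched with NumPy alone. The frame length, hop size, number of mel filters, and the choice to lay the result out as (coefficients, frames, channel) are assumptions picked so that the output matches the (32, 32, 1) shape shown; a production pipeline would more likely use a tested audio library and may use different parameters.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                     # rising edge
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                    # falling edge
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(signal, sample_rate, frame_len=1024, hop=1024, n_mels=40, n_mfcc=32):
    # 1. Slice the waveform into windowed frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    # 2. FFT: power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n=frame_len, axis=1)) ** 2
    # 3. Mel scale: pool FFT bins into mel bands, then take the log.
    mel_energy = power @ mel_filterbank(n_mels, frame_len, sample_rate).T
    log_mel = np.log(mel_energy + 1e-10)
    # 4. DCT-II over the mel axis gives the cepstral coefficients.
    n = np.arange(n_mels)
    basis = np.cos(np.pi / n_mels * (n[None, :] + 0.5) * np.arange(n_mfcc)[:, None])
    return log_mel @ basis.T                              # (n_frames, n_mfcc)

sample_rate = 44100
t = np.arange(32 * 1024) / sample_rate                    # about 0.74 s of audio
audio = np.sin(2 * np.pi * 440.0 * t)                     # synthetic test tone
features = mfcc(audio, sample_rate)
cnn_input = features.T[..., None]                         # add a channel axis
print(cnn_input.shape)
```

With 32 non-overlapping frames of 1024 samples and 32 coefficients per frame, the result has shape (32, 32, 1), ready to be fed to a small CNN.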