TINYML /// WAKE WORD DETECTION /// EDGE AI /// TFLITE MICRO /// MFCC PROCESSING ///

Wake Word
Detection

Give your microcontrollers the power of hearing. Learn to extract audio features and run neural networks entirely offline at the edge.

System: Wake word detection (like 'Hey Google' or 'Alexa') runs entirely on the edge. Streaming always-listening audio to the cloud would drain the battery and compromise privacy.

Architecture Map


1. Audio Preprocessing

Raw audio is difficult for tiny models to analyze. We convert the 1D waveform into 2D MFCC spectrograms.

System Diagnostic

Why do we avoid sending raw audio waveforms directly into the Neural Network?


TinyML Hacker Space

Deploying on Real Hardware?


Share your Arduino or ESP32 wake word setups, debug memory allocation issues, and compare model sizes with the community!

Wake Word Detection:
Voice AI at the Edge

Author

Pascual Vila

Edge AI Architect // Code Syllabus

Always-listening devices pose massive privacy and bandwidth challenges. By embedding tiny neural networks directly onto microcontrollers, we enable devices to recognize specific phrases entirely offline.

1. The Audio Preprocessing Hurdle

Deep learning models, especially Convolutional Neural Networks (CNNs), excel at finding spatial patterns in images. Raw audio waveforms (1D time-series data) are noisy and high-dimensional, making them difficult for tiny models to parse efficiently.

To solve this, we convert audio into a 2D visual representation called a Spectrogram. Specifically, we extract MFCCs (Mel-frequency cepstral coefficients), which mathematically compress the audio frequencies into a spectrum that mirrors human ear sensitivity.
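The "mirrors human ear sensitivity" part comes from the mel scale itself. A minimal sketch using the standard HTK mel formula (pure Python, no audio libraries) shows how bands spaced uniformly in mel are narrow at low frequencies and wide at high ones; the function names here are illustrative, not from any particular library:

```python
import math

def hz_to_mel(f_hz):
    # Standard HTK mel-scale formula: compresses high frequencies.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    # Band edges spaced uniformly in mel, hence nonuniformly in Hz.
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (n_bands + 1)
    return [mel_to_hz(m_lo + i * step) for i in range(n_bands + 2)]

edges = mel_band_edges(0.0, 8000.0, n_bands=40)
# The lowest band is tens of Hz wide, the highest hundreds of Hz wide,
# concentrating resolution where human hearing is most discriminating.
print(round(edges[1] - edges[0], 1), round(edges[-1] - edges[-2], 1))
```

The real MFCC pipeline then applies triangular filters over these bands, takes a log, and a DCT, but the perceptual compression illustrated here is what makes the representation compact enough for tiny models.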

2. Architecture: The DS-CNN

A standard CNN can have millions of parameters, translating to megabytes of RAM. Microcontrollers typically have between 16 KB and 512 KB of SRAM.

For Wake Word detection, the industry standard is the Depthwise Separable Convolutional Neural Network (DS-CNN). By splitting the standard convolution into a depthwise spatial convolution and a 1x1 pointwise convolution, we reduce the number of model parameters (and memory footprint) by up to 90% with minimal loss in accuracy.
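The parameter saving follows directly from the factorization. A quick back-of-the-envelope sketch (channel counts chosen for illustration) compares a full k×k convolution against its depthwise + pointwise split:

```python
def standard_conv_params(k, c_in, c_out):
    # Full convolution: every output channel mixes all input channels.
    return k * k * c_in * c_out

def ds_conv_params(k, c_in, c_out):
    # Depthwise (one k*k filter per input channel) + 1x1 pointwise mixing.
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 64
std = standard_conv_params(k, c_in, c_out)  # 36864
ds = ds_conv_params(k, c_in, c_out)         # 4672
print(f"reduction: {1 - ds / std:.1%}")     # prints "reduction: 87.3%"
```

With wider layers the ratio approaches 1/(c_out) + 1/(k*k), which is where figures like "up to 90% fewer parameters" come from.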

3. TensorFlow Lite for Microcontrollers

Running a full Python environment on a Cortex-M4 chip isn't feasible. Instead, we train the model in TensorFlow (Python), export it as a .tflite file, and use a tool like xxd to convert it into a C++ byte array.
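To make the `xxd` step concrete, here is a small Python sketch that emits the same kind of C byte array `xxd -i model.tflite` would produce (the array name `g_model` and the fake model bytes are illustrative; a real .tflite file would be read from disk):

```python
def to_c_array(data: bytes, name: str = "g_model") -> str:
    # Emit a C byte array plus a length constant, xxd -i style.
    rows = []
    for i in range(0, len(data), 12):
        chunk = data[i:i + 12]
        rows.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(rows)
    return (f"const unsigned char {name}[] = {{\n{body}\n}};\n"
            f"const unsigned int {name}_len = {len(data)};\n")

# TFLite flatbuffers carry the 'TFL3' file identifier at byte offset 4.
fake_model = b"\x1c\x00\x00\x00TFL3" + b"\x00" * 8
print(to_c_array(fake_model))
```

The resulting array is compiled into the firmware, so the model lives in flash rather than needing a filesystem.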

The C++ library TFLite Micro then reads this byte array, maps it to statically allocated memory (the Tensor Arena), and executes inferences frame-by-frame as new audio arrives from the microphone buffer.

Core Engineering FAQs

How does wake word detection work on edge devices?

An edge device continuously captures audio into a circular buffer. This audio is chopped into small frames (e.g., 20ms) and converted into MFCC spectrograms. A tiny, pre-quantized neural network evaluates these spectrograms multiple times per second. If the network's confidence score exceeds a threshold, the "wake" action is triggered locally.
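The "confidence score exceeds a threshold" step is usually smoothed over several frames so a single noisy spike doesn't fire the device. A minimal sketch of that gating logic (class name and parameters are illustrative; the per-frame scores would come from the neural network):

```python
from collections import deque

class WakeWordGate:
    """Smooths per-frame scores; fires when the windowed average clears a threshold."""

    def __init__(self, threshold=0.8, window=3):
        self.threshold = threshold
        self.scores = deque(maxlen=window)  # small circular buffer of recent scores

    def push(self, score: float) -> bool:
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        return avg >= self.threshold

gate = WakeWordGate(threshold=0.8, window=3)
# One noisy spike does not trigger; sustained confidence does.
stream = [0.1, 0.95, 0.2, 0.9, 0.92, 0.94]
fired = [gate.push(s) for s in stream]
print(fired)  # [False, False, False, False, False, True]
```

On-device, the same idea runs in C++ over the TFLite Micro output tensor, but the averaging-then-threshold structure is identical.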

What is the difference between Cloud voice recognition and TinyML wake words?

TinyML Wake Words: Run entirely on-device, require zero internet connection, use milliwatts of power, and protect user privacy by not transmitting background noise. They have a very limited vocabulary (1 to 3 words).

Cloud Recognition: Require internet, have high latency, and handle massive vocabularies (Natural Language Processing) using server-grade GPUs. Wake words act as the "gatekeeper" to activate the cloud connection.

What are False Acceptance Rate (FAR) and False Rejection Rate (FRR)?

These are the two key metrics for wake word engines. FAR is how often the device wakes up when you didn't say the word (a privacy/annoyance issue). FRR is how often the device ignores you when you *did* say the word (a usability issue). Engineers adjust the confidence threshold to balance these based on the product's needs.
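The threshold trade-off can be seen directly by sweeping it over a labelled test set. A small sketch with hypothetical confidence scores (the score values below are made up for illustration):

```python
def far_frr(pos_scores, neg_scores, threshold):
    # FAR: fraction of non-wake clips accepted; FRR: fraction of wake clips rejected.
    far = sum(s >= threshold for s in neg_scores) / len(neg_scores)
    frr = sum(s < threshold for s in pos_scores) / len(pos_scores)
    return far, frr

# Hypothetical model confidences on clips that did / did not contain the wake word.
wake = [0.92, 0.85, 0.78, 0.96, 0.60]
noise = [0.10, 0.30, 0.55, 0.82, 0.05]

for t in (0.5, 0.7, 0.9):
    far, frr = far_frr(wake, noise, t)
    print(f"threshold={t}: FAR={far:.0%} FRR={frr:.0%}")
```

Raising the threshold pushes FAR down and FRR up; products that must never false-wake (e.g. in a bedroom) sit at the high end, while hands-free safety features tolerate more false accepts to avoid missed commands.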

Edge Audio Glossary

MFCC
Mel-Frequency Cepstral Coefficients. A mathematical representation of the short-term power spectrum of a sound, designed to mimic human auditory perception.
Tensor Arena
A statically allocated block of memory in C++ used by TFLite Micro to store all input, output, and intermediate variables, avoiding heap fragmentation.
DS-CNN
Depthwise Separable Convolutional Neural Network. An optimized CNN architecture that drastically reduces parameters, perfect for edge deployments.
False Acceptance
When the wake word model mistakenly identifies background noise or a similar-sounding word as the target wake word.