Time-Domain Audio: Understanding Energy & ZCR

Pascual Vila
AI & DSP Instructor // Code Syllabus
Before diving into complex deep learning models like Wav2Vec, you must master the classic DSP features. Short-Time Energy and Zero Crossing Rate form the backbone of early Voice Activity Detection systems.
Short-Time Energy (STE)
Since audio signals are continuously changing, measuring the "volume" over the entire file isn't very useful. Instead, we use a sliding window approach to calculate the energy in short frames (usually 20-30ms).
The mathematical definition of STE for a discrete-time signal $x(m)$ windowed by $w(n-m)$ is:

$$E_n = \sum_{m=-\infty}^{\infty} \left[ x(m)\, w(n-m) \right]^2$$
Why square the signal? Audio waveforms swing between positive and negative values, so if we simply summed the samples, the positive and negative halves would largely cancel. Squaring makes every sample positive and emphasizes high-amplitude peaks.
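The sliding-window calculation can be sketched in a few lines of NumPy. The frame and hop sizes below are illustrative choices (400 samples ≈ 25 ms and 160 samples ≈ 10 ms at a 16 kHz sample rate), and the Hamming window stands in for the generic $w(n-m)$ above:

```python
import numpy as np

def short_time_energy(x, frame_len=400, hop_len=160):
    """Short-Time Energy: sum of squared, windowed samples per frame."""
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop_len
    energy = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop_len : i * hop_len + frame_len] * window
        energy[i] = np.sum(frame ** 2)
    return energy

# A steady 1 kHz tone at 16 kHz: every frame should carry the same energy.
sr = 16000
tone = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)
ste = short_time_energy(tone)
```

Because the tone is stationary, the energy curve is flat; on real speech you would see it rise on syllables and fall in pauses.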
Zero Crossing Rate (ZCR)
ZCR is exactly what it sounds like: the rate at which the signal changes from positive to negative or back. It gives a rough estimate of the dominant frequency in a frame. High-frequency signals cross zero much more frequently than low-frequency signals.
This is particularly useful in speech processing to differentiate between voiced and unvoiced speech.
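A minimal per-frame ZCR sketch, using the same illustrative frame/hop sizes as before. A crossing is counted whenever the sign of adjacent samples differs, and the count is normalized by the number of sample pairs in the frame:

```python
import numpy as np

def zero_crossing_rate(x, frame_len=400, hop_len=160):
    """Fraction of adjacent-sample pairs whose signs differ, per frame."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    zcr = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop_len : i * hop_len + frame_len]
        # np.diff(np.sign(...)) is nonzero exactly where the sign flips
        zcr[i] = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    return zcr

sr = 16000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 100 * t)    # 100 Hz tone
high = np.sin(2 * np.pi * 3000 * t)  # 3 kHz tone
zcr_low = zero_crossing_rate(low)
zcr_high = zero_crossing_rate(high)
```

For a pure tone the ZCR is roughly $2f/f_s$ crossings per sample, so the 3 kHz tone crosses zero about thirty times more often than the 100 Hz tone.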
❓ Audio Processing FAQ
What is Voice Activity Detection (VAD)?
VAD is a technique used in speech processing to detect the presence or absence of human speech in an audio signal. Simple VAD algorithms use a combination of Energy thresholds (speech is louder than background noise) and ZCR thresholds to separate speech from silence.
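A toy frame-level VAD along these lines can be sketched as follows. The thresholds (half the mean energy as the main gate, a small energy floor plus a ZCR test to keep quiet fricatives) are illustrative assumptions, not tuned values:

```python
import numpy as np

def frame_features(x, frame_len=400, hop_len=160):
    """Per-frame energy and zero-crossing rate."""
    n_frames = 1 + (len(x) - frame_len) // hop_len
    energy, zcr = np.empty(n_frames), np.empty(n_frames)
    for i in range(n_frames):
        f = x[i * hop_len : i * hop_len + frame_len]
        energy[i] = np.sum(f ** 2)
        zcr[i] = np.mean(np.abs(np.diff(np.sign(f))) > 0)
    return energy, zcr

def simple_vad(x):
    """Speech if a frame is loud, or moderately loud with a high ZCR
    (a crude way to keep low-energy unvoiced fricatives)."""
    energy, zcr = frame_features(x)
    m = energy.mean()
    return (energy > 0.5 * m) | ((energy > 0.01 * m) & (zcr > 0.3))

# Synthetic clip: 0.5 s of near-silence followed by 0.5 s of a 200 Hz tone.
sr = 16000
rng = np.random.default_rng(0)
silence = 0.001 * rng.standard_normal(sr // 2)
tone = np.sin(2 * np.pi * 200 * np.arange(sr // 2) / sr)
mask = simple_vad(np.concatenate([silence, tone]))
```

The small energy floor matters: background noise can have a high ZCR too, so ZCR alone would misfire on silence.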
How do I use these features to find Voiced vs Unvoiced phonemes?
Voiced Speech (e.g., vowels 'a', 'e'): Produced by vocal cord vibration. These sounds have high energy and low zero-crossing rates because their energy is concentrated at low frequencies.
Unvoiced Speech (e.g., 's', 'f', 'sh'): Produced by forcing air through a constriction. Their waveforms resemble white noise: low energy but very high zero-crossing rates.
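The two rules above can be turned into a per-frame heuristic. The thresholds below are illustrative, and the "vowel" and "fricative" frames are synthetic stand-ins (a strong low-frequency tone and weak noise), not real phonemes:

```python
import numpy as np

def classify_frame(frame, energy_thresh=1.0, zcr_thresh=0.2):
    """High energy + low ZCR -> voiced; high ZCR -> unvoiced; else silence."""
    energy = np.sum(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
    if energy > energy_thresh and zcr < zcr_thresh:
        return "voiced"
    if zcr > zcr_thresh:
        return "unvoiced"
    return "silence"

sr, n = 16000, 400  # 25 ms frame
t = np.arange(n) / sr
vowel_like = np.sin(2 * np.pi * 150 * t)                              # strong, low-frequency
fricative_like = 0.02 * np.random.default_rng(1).standard_normal(n)   # weak, noise-like
```

On real speech the thresholds would need calibration against the recording's noise floor, but the decision logic is the same.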
Why use Librosa instead of writing it from scratch in Numpy?
While writing it in Numpy (like `np.sum(np.abs(np.diff(np.sign(x)))) / 2` — the division by 2 because each sign flip produces a jump of magnitude 2) is great for learning, Librosa handles framing, windowing, and edge cases natively, making your pipeline much more robust for machine learning tasks.