The raw waveform contains a wealth of information. Time-domain features let us quantify properties such as noisiness and energy directly from the samples, without a frequency transform.
1. Zero-Crossing Rate (ZCR)
The Zero-Crossing Rate (ZCR) measures how often the signal changes sign, in either direction (positive to negative or negative to positive), within a given time frame. It is usually expressed as a fraction of the samples in that frame. In audio AI, ZCR is a cheap proxy for noisiness. Smooth, melodic sounds tend to have a low ZCR, while noisy percussive hits and fricative speech sounds (like 's' and 'f') tend to have a high ZCR. It is a low-computation feature commonly used in voice activity detection and music genre classification. Just by looking at where the wave crosses zero, you can often tell whether someone is speaking or just breathing into the mic.
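To make the definition concrete, here is a minimal NumPy sketch that computes the ZCR of a single frame by hand and compares a smooth tone with white noise. The helper name `zero_crossing_rate` and the synthetic test signals are assumptions made for illustration; a library implementation may handle framing, padding, and exact zeros differently.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (illustrative helper)."""
    frame = np.asarray(frame, dtype=float)
    negative = np.signbit(frame)                # True where the sample is negative
    crossings = negative[1:] != negative[:-1]   # sign flips in either direction
    return crossings.mean()

sr = 22050                                      # assumed sample rate
t = np.arange(sr) / sr                          # one second of sample times
tone = np.sin(2 * np.pi * 220 * t)              # smooth 220 Hz sine
noise = np.random.default_rng(0).standard_normal(sr)  # white noise

zcr_tone = zero_crossing_rate(tone)
zcr_noise = zero_crossing_rate(noise)
print(f"tone: {zcr_tone:.4f}  noise: {zcr_noise:.4f}")
```

A 220 Hz sine crosses zero about 440 times per second, so its rate is roughly 440 / 22050, or about 0.02, while white noise flips sign on about half of its samples.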
import librosa
import numpy as np
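# Load a mono audio array 'y' (the file path is a placeholder)
y, sr = librosa.load("audio.wav")
# Calculate the per-frame ZCR for 'y'
zcr = librosa.feature.zero_crossing_rate(y)
# zcr has shape (1, n_frames): one rate per frame
mean_zcr = np.mean(zcr)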
print(f"Average Noisiness (ZCR): {mean_zcr:.4f}")
2. RMS Energy
RMS (Root Mean Square) energy measures the average level of an audio signal: square the samples, take the mean over a window, then take the square root. Unlike peak amplitude (which only measures the single highest point), RMS reflects the amplitude across the whole window, so it tracks perceived loudness more closely, although it is still only a rough approximation. Calculating RMS is essential for tasks like silent-interval detection and for normalizing audio clips so they all have comparable volume for training. If you train a model on unnormalized audio, it may learn to treat loudness itself as a cue and mistake loud sounds for 'important' ones.
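The difference between peak and RMS, and the normalization step, can be sketched in plain NumPy. The helpers `rms` and `normalize_rms`, the `target_rms` value of 0.1, and the synthetic signals are all assumptions chosen for illustration, not a recommended recipe.

```python
import numpy as np

def rms(y):
    """Square root of the mean squared amplitude."""
    y = np.asarray(y, dtype=float)
    return np.sqrt(np.mean(y ** 2))

def normalize_rms(y, target_rms=0.1, eps=1e-12):
    """Scale y so its RMS equals target_rms (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    return y * (target_rms / max(rms(y), eps))  # eps guards against all-zero input

sr = 22050
t = np.arange(sr) / sr
sine = 0.5 * np.sin(2 * np.pi * 220 * t)   # sustained tone, peak about 0.5
click = np.zeros(sr)
click[1000] = 0.5                           # single-sample click, same peak

# Same peak amplitude, very different RMS
print(f"sine : peak={np.abs(sine).max():.3f}  rms={rms(sine):.4f}")
print(f"click: peak={np.abs(click).max():.3f}  rms={rms(click):.4f}")

# Two copies of the same sound at different volumes end up at the same level
quiet, loud = 0.05 * sine, 2.0 * sine
rms_quiet = rms(normalize_rms(quiet))
rms_loud = rms(normalize_rms(loud))
print(f"normalized rms: {rms_quiet:.4f} and {rms_loud:.4f}")
```

Both signals peak near 0.5, but the sustained sine has a far higher RMS than the isolated click, which is why RMS is the better stand-in for loudness. Scaling a clip up can push samples past the valid range, so a real pipeline would likely also check for clipping.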
# Calculate RMS Energy per frame
rms = librosa.feature.rms(y=y)
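# Simple silence detector: keep frames whose RMS exceeds a fixed threshold
threshold = 0.02
# rms has shape (1, n_frames), so index [1] holds the frame indices
active_frames = np.where(rms > threshold)[1]
print(f"{len(active_frames)} frames above the energy threshold.")
3. Framing & Overlap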
Audio is non-stationary; its properties change constantly. To analyze it, we use framing: we split the audio into small overlapping segments (frames), usually around 20-40 milliseconds long. The hop length determines how many samples the 'window' slides forward for each new frame, so consecutive frames overlap by the frame length minus the hop length. This allows us to track how features like ZCR and RMS change over the course of a sentence or a song, creating a 2D array of features over time (frames by features) that we can feed into an RNN or Transformer.
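As a rough illustration of framing, the sketch below slices a signal into overlapping frames with NumPy indexing and computes ZCR and RMS for each one. The helper `frame_signal`, the 30 ms / 10 ms sizes, and the random test signal are assumptions; unlike some library defaults, this version does not pad or center the frames.

```python
import numpy as np

def frame_signal(y, frame_length, hop_length):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    y = np.asarray(y, dtype=float)
    if len(y) < frame_length:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(y) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)            # first sample of each frame
    idx = starts[:, None] + np.arange(frame_length)[None, :]
    return y[idx]

sr = 22050
frame_length = int(0.030 * sr)   # about 30 ms per frame
hop_length = int(0.010 * sr)     # slide forward about 10 ms each time
y = np.random.default_rng(0).standard_normal(sr)  # one second of stand-in audio

frames = frame_signal(y, frame_length, hop_length)

# One value per frame for each feature
rms_per_frame = np.sqrt(np.mean(frames ** 2, axis=1))
sign = np.signbit(frames)
zcr_per_frame = np.mean(sign[:, 1:] != sign[:, :-1], axis=1)

# Stack into a (n_frames, n_features) array for a sequence model
features = np.stack([zcr_per_frame, rms_per_frame], axis=1)
print(frames.shape, features.shape)
```

Each row of `features` describes one frame, so the array can be read top to bottom as a time series.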
# Frame length: ~46ms at 22050 Hz
frame_length = 1024
# Hop length: ~11.6 ms step at 22050 Hz, so consecutive frames overlap by 75%
hop_length = 256
# Extracting framed features
rms_framed = librosa.feature.rms(
y=y, frame_length=frame_length, hop_length=hop_length
)
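The arithmetic behind the frame and hop settings above can be checked in a few lines of plain Python. The sample rate of 22050 Hz is taken from the comment in the snippet; the helper `frame_times` is a hypothetical name, and it assumes frame `i` is located at sample `i * hop_length`.

```python
sr = 22050
frame_length = 1024
hop_length = 256

frame_ms = 1000 * frame_length / sr        # duration of one frame in milliseconds
hop_ms = 1000 * hop_length / sr            # step between consecutive frames
overlap = 1 - hop_length / frame_length    # fraction shared with the previous frame

def frame_times(n_frames, hop_length, sr):
    """Time in seconds of each frame, assuming frame i sits at sample i * hop_length."""
    return [i * hop_length / sr for i in range(n_frames)]

print(f"frame: {frame_ms:.2f} ms, hop: {hop_ms:.2f} ms, overlap: {overlap:.0%}")
print([round(t, 4) for t in frame_times(4, hop_length, sr)])
```

Mapping frame indices back to seconds like this is what lets a per-frame feature such as `rms_framed` be lined up with events in the original recording.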