Audio AI begins with the physics of air. Before we can train a model to recognize speech, we must understand how sound travels and is measured.
1. Waves of Pressure
Sound is a mechanical wave produced by the back-and-forth vibration of particles in a medium such as air. These vibrations create alternating regions of high pressure (Compressions) and low pressure (Rarefactions). When these pressure changes reach our eardrums, our brain interprets them as sound. In the digital world, we record this pressure at regular intervals and plot it as a graph called a Waveform, where the X-axis is time and the Y-axis is the instantaneous amplitude of the pressure. Understanding this physical reality comes first, before we apply any algorithms to it.
// Basic Waveform Representation Concept
const sampleRate = 44100; // samples per second (Hz)
const duration = 1.0; // seconds
// Round so the sample count is always a whole number
const numSamples = Math.round(sampleRate * duration);
// A raw pressure array representing the wave (initialized to silence)
const audioBuffer = new Float32Array(numSamples);
// Each element holds the pressure amplitude
// measured at one discrete point in time.
2. The Dimensions of Audio
We describe sound along two primary dimensions. Frequency is the rate of vibration, measured in Hertz (Hz), or cycles per second. It determines the perceived Pitch: high frequencies sound like whistles, while low frequencies sound like thunder. Amplitude is the strength of the vibration, that is, the size of the pressure change. Its level is usually expressed in Decibels (dB), a logarithmic ratio against a reference value, and it determines the perceived Loudness. Understanding these two properties is critical for Digital Signal Processing (DSP), because they let us filter, amplify, and transform sound mathematically. For example, to remove background mains hum (AC power-line noise, at 50 Hz or 60 Hz depending on the region), you apply a notch filter targeting that specific frequency.
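To make the decibel scale concrete, here is a minimal sketch of converting between a linear sample amplitude and a level in dB. The helper names `amplitudeToDb` and `dbToAmplitude` are hypothetical, and the choice of 1.0 as the reference (full scale for floating-point audio) is an assumption made for illustration.

```javascript
// Hypothetical helpers, not part of any library.
// Assumption: amplitudes are linear values and 1.0 is the reference (full scale).

// Convert a linear amplitude to a level in decibels relative to `reference`.
function amplitudeToDb(amplitude, reference = 1.0) {
  const magnitude = Math.abs(amplitude);
  if (magnitude === 0) return -Infinity; // silence has no finite dB level
  return 20 * Math.log10(magnitude / reference);
}

// Convert a level in decibels back to a linear amplitude.
function dbToAmplitude(db, reference = 1.0) {
  return reference * Math.pow(10, db / 20);
}

const halfScaleDb = amplitudeToDb(0.5);   // about -6.02 dB
const quietAmplitude = dbToAmplitude(-20); // about 0.1
console.log(halfScaleDb.toFixed(2), quietAmplitude.toFixed(3));
```

Halving the amplitude lowers the level by roughly 6 dB, which is why the scale is convenient: large ratios in pressure become small, additive steps.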
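// Simple Sine Wave Generator
function generateSineWave(freqHz, duration, amplitude, sampleRate = 44100) {
  // Round once so floating-point error cannot add a stray extra sample
  const numSamples = Math.round(duration * sampleRate);
  const buffer = new Float32Array(numSamples);
  for (let i = 0; i < numSamples; i++) {
    const t = i / sampleRate; // time of sample i, in seconds
    // x(t) = A * sin(2 * PI * f * t)
    buffer[i] = amplitude * Math.sin(2 * Math.PI * freqHz * t);
  }
  return buffer;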
}
3. The Time Domain Interface
When we look at audio in its raw state, we are viewing it in the Time Domain. This is the classic wavy line you see in audio editors. The time-domain view is well suited to seeing the rhythm, the silent gaps, and the volume envelope of a signal, but it makes it hard for AI models to extract complex features such as 'What vowel is being spoken?' or 'Is this a guitar?'. To solve that, we eventually convert the time-domain wave into the frequency domain using the Fourier Transform. But everything starts here, with the raw, temporal wave.
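As a preview of that conversion, the sketch below computes a naive discrete Fourier transform (DFT) and finds the strongest frequency in a test tone. The function name `dftMagnitudes` and the test parameters are hypothetical, chosen for illustration. The loop is O(N²), so real systems would use a fast Fourier transform (FFT) instead.

```javascript
// Naive O(N^2) DFT, for illustration only (hypothetical helper).
// Returns the magnitude of each frequency bin from 0 Hz up to the Nyquist bin.
function dftMagnitudes(samples) {
  const n = samples.length;
  const numBins = Math.floor(n / 2) + 1;
  const magnitudes = new Float64Array(numBins);
  for (let k = 0; k < numBins; k++) {
    let re = 0;
    let im = 0;
    for (let i = 0; i < n; i++) {
      const angle = (2 * Math.PI * k * i) / n;
      re += samples[i] * Math.cos(angle);
      im -= samples[i] * Math.sin(angle);
    }
    magnitudes[k] = Math.hypot(re, im) / n; // normalize by the signal length
  }
  return magnitudes;
}

// Assumed test signal: a 50 Hz sine sampled at 1000 Hz for 0.2 s (200 samples).
const demoRate = 1000;
const demoLength = 200;
const tone = Float64Array.from({ length: demoLength }, (_, i) =>
  Math.sin((2 * Math.PI * 50 * i) / demoRate)
);

const mags = dftMagnitudes(tone);
let peakBin = 0;
for (let k = 1; k < mags.length; k++) {
  if (mags[k] > mags[peakBin]) peakBin = k;
}
// Each bin is demoRate / demoLength = 5 Hz wide, so bin 10 is 50 Hz.
const peakHz = (peakBin * demoRate) / demoLength;
console.log(peakHz); // 50
```

In the time domain the tone is 200 separate numbers. In the frequency domain it collapses to a single dominant bin, which is the kind of compact feature a model can use.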
// Time Domain Analysis Concept
function calculateEnergy(audioBuffer) {
  if (audioBuffer.length === 0) return 0; // avoid 0 / 0 = NaN on empty input
  let sum = 0;
  for (const sample of audioBuffer) {
    sum += sample * sample; // Square the amplitude
  }
  return sum / audioBuffer.length; // Mean square (average power)
}
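The same idea extends to the volume envelope and silent gaps mentioned above. As a rough sketch, we can split the signal into short frames, take the root-mean-square (RMS) of each frame, and flag frames that fall below a threshold. The helper names, the frame size, and the 0.01 threshold are all assumptions for illustration. Practical systems tune these values and often use overlapping frames.

```javascript
// Hypothetical helper: RMS of consecutive, non-overlapping frames.
// Trailing samples that do not fill a whole frame are ignored.
function rmsEnvelope(samples, frameSize) {
  if (!Number.isInteger(frameSize) || frameSize <= 0) {
    throw new RangeError("frameSize must be a positive integer");
  }
  const envelope = [];
  for (let start = 0; start + frameSize <= samples.length; start += frameSize) {
    let sum = 0;
    for (let i = start; i < start + frameSize; i++) {
      sum += samples[i] * samples[i];
    }
    envelope.push(Math.sqrt(sum / frameSize));
  }
  return envelope;
}

// Hypothetical helper: mark frames whose RMS falls below an assumed threshold.
function findSilentFrames(envelope, threshold = 0.01) {
  return envelope.map((rms) => rms < threshold);
}

// Assumed test clip: four 100-sample frames (tone, silence, silence, tone).
const frameLength = 100;
const clip = new Float32Array(4 * frameLength);
for (let i = 0; i < clip.length; i++) {
  const inTone = i < frameLength || i >= 3 * frameLength;
  clip[i] = inTone ? 0.5 * Math.sin((2 * Math.PI * 5 * i) / frameLength) : 0;
}

const clipEnvelope = rmsEnvelope(clip, frameLength);
const silentFrames = findSilentFrames(clipEnvelope);
console.log(silentFrames); // [ false, true, true, false ]
```

A sine of peak amplitude 0.5 has an RMS of about 0.354, so the tone frames sit well above the threshold while the silent frames are exactly zero.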