Physics • Perception • Signal Analysis
From the compression of air molecules to the firing of auditory neurons — a complete journey through the science of sound.
Sound is a mechanical, longitudinal wave — a rhythmic disturbance that propagates through an elastic medium by the successive compression and rarefaction of its constituent particles.
Unlike electromagnetic waves (light, radio), sound requires a material medium to travel. In air at sea level and 20 °C, it travels at approximately 343 m/s. In water, which is far less compressible, it moves at roughly 1,480 m/s, and through steel at an impressive 5,100 m/s.
When a vibrating object — a drum skin, a vocal cord, a loudspeaker cone — moves outward, it pushes neighbouring air molecules together, creating a region of high pressure (compression). As it springs back, it leaves a region of low pressure (rarefaction). This alternating pattern of compressions and rarefactions radiates outward spherically from the source, carrying energy but no net transport of matter.
In a longitudinal wave, particle displacement is parallel to the direction of propagation. Sound in gases and liquids is always longitudinal. In solids, both longitudinal (P-waves) and transverse (S-waves) can exist — a distinction critical in seismology.
For an ideal gas: v = √(γRT/M), where γ is the adiabatic index, R is the gas constant, T is absolute temperature and M is molar mass. Temperature is the dominant factor in air: speed increases about 0.6 m/s per °C.
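As a quick numerical check, here is a minimal sketch of the formula in Python (the constants for dry air are standard textbook values):

```python
import math

def speed_of_sound(gamma: float, molar_mass: float, temp_c: float) -> float:
    """Ideal-gas speed of sound: v = √(γ·R·T / M)."""
    R = 8.314            # universal gas constant, J/(mol·K)
    T = temp_c + 273.15  # absolute temperature, K
    return math.sqrt(gamma * R * T / molar_mass)

AIR_GAMMA, AIR_M = 1.4, 0.02897  # dry air: adiabatic index, molar mass (kg/mol)

print(round(speed_of_sound(AIR_GAMMA, AIR_M, 20), 1))  # 343.2 m/s at 20 °C
print(round(speed_of_sound(AIR_GAMMA, AIR_M, 21)
            - speed_of_sound(AIR_GAMMA, AIR_M, 20), 2))  # ≈ 0.59 m/s per extra °C
```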
Every sound can be described by a small set of physical parameters that together define what we hear.
Frequency (f) is the number of complete oscillation cycles per second, measured in hertz (Hz). It is the primary physical correlate of the perceptual quality of pitch. A concert-A above middle C is defined as exactly 440 Hz. Humans can generally perceive frequencies between about 20 Hz and 20,000 Hz, though this range narrows significantly with age.
| Sound | Approx. Frequency | Notes |
|---|---|---|
| Infrasound (earthquakes, elephants) | < 20 Hz | Inaudible to humans; felt as vibration |
| Bass (electric bass, bass drum) | 40 – 200 Hz | Felt and heard |
| Male voice fundamentals | 85 – 180 Hz | Varies with individual |
| Female voice fundamentals | 165 – 255 Hz | Varies with individual |
| Middle C (piano) | 261.6 Hz | C4 in scientific notation |
| Concert A | 440 Hz | ISO standard tuning reference |
| Telephone / speech bandwidth | 300 – 3,400 Hz | Sufficient for intelligibility |
| High-frequency hearing limit (young adult) | ~20,000 Hz | Declines with age (presbycusis) |
| Ultrasound (diagnostic imaging) | 2 – 20 MHz | Far above human range |
| Bat echolocation | 20 – 200 kHz | Adaptive for insect detection |
Amplitude is the maximum displacement of air molecules from their rest position. Intensity (I) is the power carried per unit area, measured in W/m². Because the human ear spans an enormous dynamic range — roughly 12 orders of magnitude from the threshold of hearing to the threshold of pain — acousticians use a logarithmic scale: the decibel (dB).
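Both standard decibel conversions fit in a few lines; the sketch below assumes the conventional reference values of 10⁻¹² W/m² for intensity and 20 µPa for pressure:

```python
import math

I0 = 1e-12   # reference intensity, W/m² (threshold of hearing)
P0 = 20e-6   # reference pressure, Pa (20 µPa)

def intensity_level_db(intensity: float) -> float:
    """Intensity level: L = 10·log10(I / I0)."""
    return 10 * math.log10(intensity / I0)

def spl_db(pressure: float) -> float:
    """Sound pressure level: L = 20·log10(p / p0)."""
    return 20 * math.log10(pressure / P0)

print(spl_db(20e-6))              # 0 dB: threshold of hearing
print(intensity_level_db(1.0))    # 120 dB: approaching the threshold of pain
print(intensity_level_db(2e-12))  # ≈ 3 dB: doubling the power adds ~3 dB
```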
Every increase of +3 dB doubles the acoustic power. Every increase of +10 dB is perceived as roughly twice as loud by the human ear (a psychoacoustic phenomenon quantified by the phon and sone scales).
| Source | Level (dB SPL) | Effect |
|---|---|---|
| Threshold of hearing | 0 dB | Just perceptible |
| Rustling leaves | 20 dB | Very quiet |
| Quiet library | 30 – 40 dB | Comfortable silence |
| Normal conversation | 60 – 65 dB | Comfortable |
| Heavy traffic | 75 – 85 dB | Prolonged exposure: mild risk |
| Chainsaw / nightclub | 100 – 110 dB | Damage after 15 min |
| Jet engine at 30 m | 140 dB | Threshold of pain |
| Krakatoa eruption (1883) | ~172 dB (at 160 km) | Heard 4,800 km away |
The phase of a wave describes where in its cycle it is at a given point in time. When two sound waves of identical frequency overlap, their relative phase determines whether they constructively interfere (adding together) or destructively interfere (cancelling). This principle underpins noise-cancelling headphones, standing waves in rooms, and many musical phenomena.
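A short numpy sketch makes the role of relative phase concrete: two identical 440 Hz tones are summed once in phase and once in anti-phase.

```python
import numpy as np

fs = 48_000                    # sample rate, Hz
t = np.arange(fs) / fs         # one second of time stamps
tone = np.sin(2 * np.pi * 440 * t)

in_phase   = tone + np.sin(2 * np.pi * 440 * t)           # 0 rad shift
anti_phase = tone + np.sin(2 * np.pi * 440 * t + np.pi)   # π rad (180°) shift

print(np.abs(in_phase).max())    # ≈ 2.0 (constructive: amplitudes add)
print(np.abs(anti_phase).max())  # ≈ 0.0 (destructive: complete cancellation)
```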
A pure tone is a single-frequency sinusoidal wave. Real-world sounds are almost always complex waveforms — superpositions of many sinusoids at different frequencies, amplitudes, and phases. The mathematical framework for decomposing these complex sounds is Fourier Analysis, explored in depth in Section 5.
As sound waves travel through the world, they interact with boundaries and media in rich and often complex ways.
When a sound wave strikes a surface, part of its energy is reflected. A single, discrete reflection heard distinctly from the original sound is an echo (requires the reflected path to be >17 m longer, i.e. >50 ms delay). In enclosed spaces, multiple reflections blend into reverberation — the persistence of sound after the source has stopped. Concert halls are designed to have reverb times (T60) of 1.5–2.5 s for orchestral music.
Sound refracts — bends — when it passes between media of different acoustic speeds, governed by Snell’s Law: sin(θ1)/v1 = sin(θ2)/v2. Temperature gradients in the atmosphere cause dramatic refractive effects: sound can bend upward (away from Earth) on warm days (explaining why you can’t hear a distant thunderstorm), or downward at night, making sounds carry farther.
Sound bends around obstacles and spreads through openings — a phenomenon called diffraction. It is most pronounced when the wavelength is comparable to or larger than the obstacle. At 100 Hz (λ ≈ 3.4 m), sound readily diffracts around a wall; at 10,000 Hz (λ ≈ 3.4 cm), the same wall creates a strong acoustic shadow. This is why bass frequencies are “omnidirectional” and high frequencies are directional.
All media convert some acoustic energy into heat. Absorption depends on frequency (higher frequencies are absorbed more quickly, hence thunder rumbles rather than cracks at a distance), humidity, temperature, and the material. Porous materials (foam, carpet, fabric) are highly absorptive; dense, hard surfaces (concrete, glass) are highly reflective. The absorption coefficient (α) ranges from 0 (perfect reflector) to 1 (perfect absorber).
When a source and observer are in relative motion, the received frequency differs from the emitted frequency. As a source approaches, compressed wavefronts raise the perceived pitch; as it recedes, expanded wavefronts lower it. This is the familiar shift in pitch of a passing ambulance siren.
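For a source moving directly toward or away from a stationary observer, the received frequency is f′ = f · v / (v − v_source). A minimal sketch, assuming 343 m/s for the speed of sound:

```python
def doppler_shift(f_emitted: float, v_source: float, v_sound: float = 343.0) -> float:
    """Received frequency for a source moving toward (positive v_source)
    or away from (negative v_source) a stationary observer:
    f' = f · v / (v − v_source)."""
    return f_emitted * v_sound / (v_sound - v_source)

siren = 700.0  # Hz (illustrative ambulance siren tone)
print(round(doppler_shift(siren, +15.0)))  # 732 Hz, approaching at 54 km/h
print(round(doppler_shift(siren, -15.0)))  # 671 Hz, receding
```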
When a wave reflects back on itself in a confined space, incident and reflected waves interfere to create a standing wave with fixed nodes (zero displacement) and antinodes (maximum displacement). Resonant modes occur at frequencies where the room or cavity dimensions are integer multiples of half-wavelengths. Room modes (eigenmodes) are a fundamental challenge in acoustic design.
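For the simplest case of two parallel walls a distance L apart, the axial mode frequencies are fn = n · c / (2L). A minimal sketch:

```python
def axial_modes(length_m: float, count: int = 5, c: float = 343.0) -> list[float]:
    """Axial room-mode frequencies fn = n·c / (2L): each mode fits an
    integer number of half-wavelengths between two parallel walls."""
    return [n * c / (2 * length_m) for n in range(1, count + 1)]

# A room 5 m across: the lowest modes cluster in the bass region,
# which is why small rooms have such uneven low-frequency response.
print([round(f, 1) for f in axial_modes(5.0)])  # [34.3, 68.6, 102.9, 137.2, 171.5]
```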
In a free field (no reflections), sound radiates spherically from a point source. Because the surface area of a sphere grows as 4πr², the intensity falls as the square of distance: I = P / (4πr²). Each doubling of distance therefore reduces the level by about 6 dB.
Acoustic impedance (Z = ρ · v) is the resistance a medium presents to the passage of a sound wave. Impedance mismatches at boundaries cause reflections. The fraction of power transmitted depends on how well impedances match — a concept critical in ultrasonic transducer design and in understanding why sound reflects so strongly at an air-water interface.
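At normal incidence the reflected fraction of power is R = ((Z2 − Z1) / (Z2 + Z1))². Plugging in representative values for air and water shows how little energy crosses that boundary (a sketch; material constants are approximate):

```python
def power_reflectance(z1: float, z2: float) -> float:
    """Fraction of incident acoustic power reflected at a boundary
    (normal incidence): R = ((Z2 − Z1) / (Z2 + Z1))²."""
    return ((z2 - z1) / (z2 + z1)) ** 2

z_air = 1.21 * 343     # Z = ρ·v ≈ 415 rayl (air at 20 °C)
z_water = 1000 * 1480  # ≈ 1.48 × 10⁶ rayl (water)

print(power_reflectance(z_air, z_water))  # ≈ 0.9989: ~99.9% of power reflected
```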
The human auditory system is a marvel of biological engineering, capable of detecting pressure fluctuations as small as 20 micropascals — a displacement of the eardrum smaller than the diameter of a hydrogen atom.
The ear is classically divided into three regions, each performing a distinct signal-processing role:
The Outer Ear consists of the pinna (the visible cartilage structure) and the ear canal (external auditory meatus, ~2.5 cm long). The pinna’s folds and ridges create subtle frequency-dependent reflections that provide cues for vertical sound localisation (up/down). The ear canal acts as a quarter-wave resonator, boosting sensitivity around 2,000–4,000 Hz — exactly the frequency range most important for speech intelligibility.
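The quarter-wave estimate is easy to check: a tube closed at one end resonates at f = c / (4L). With a 2.5 cm canal this lands inside the 2,000–4,000 Hz band described above (the real canal is neither uniform nor rigid, so the measured peak sits somewhat lower):

```python
def quarter_wave_resonance(length_m: float, c: float = 343.0) -> float:
    """Fundamental resonance of a tube closed at one end: f = c / (4L)."""
    return c / (4 * length_m)

print(round(quarter_wave_resonance(0.025)))  # 3430 Hz, inside the speech band
```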
The Middle Ear begins at the tympanic membrane (eardrum), which converts acoustic pressure fluctuations into mechanical vibrations. These are transmitted and amplified by three tiny ossicles — the malleus, incus, and stapes (hammer, anvil, stirrup). The ossicular chain achieves an impedance match between the low-impedance air of the outer ear and the high-impedance fluid of the inner ear, boosting transmission efficiency by roughly 25–30 dB.
The Inner Ear contains the cochlea, a fluid-filled spiral structure roughly 35 mm long when uncoiled. The stapes drives the oval window, setting up travelling waves on the basilar membrane. The cochlea performs a remarkable mechanical frequency analysis: high frequencies cause maximum displacement near the base; low frequencies near the apex. This spatial segregation of frequency — tonotopy — is preserved all the way to the auditory cortex and is, in essence, a biological implementation of Fourier decomposition.
Hair cells (3,500 inner and ~12,000 outer) sit on the basilar membrane. Their stereocilia deflect with membrane motion, opening ion channels and generating electrical signals — the conversion of mechanical vibration to neural impulse. Outer hair cells also act as mechanical amplifiers, achieving gains up to 40 dB through active electromotility (prestin-based).
[Chart: Frequency Range Comparison Across Species]
Psychoacoustics studies the relationship between physical acoustic stimuli and subjective auditory perception. Key phenomena include:
The ear’s sensitivity is highly frequency-dependent. We are most sensitive around 3–4 kHz and far less sensitive to very low or very high frequencies. The Fletcher–Munson curves (1933), refined as ISO 226 equal-loudness contours, map the SPL required at each frequency to produce the same perceived loudness. This informs the A-weighting filter used in sound level meters (dB(A)).
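For reference, the A-weighting curve has a closed analytic form. The sketch below uses the constants from IEC 61672; treat it as an illustration, not a certified meter:

```python
import math

def a_weighting_db(f: float) -> float:
    """A-weighting gain in dB (analytic form, IEC 61672); 0 dB at 1 kHz."""
    f2 = f * f
    ra = (12194**2 * f2**2) / (
        (f2 + 20.6**2)
        * math.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194**2)
    )
    return 20 * math.log10(ra) + 2.00

print(round(a_weighting_db(1000), 1))  # 0.0, the reference frequency
print(round(a_weighting_db(100), 1))   # -19.1, bass is strongly de-emphasised
print(round(a_weighting_db(3000), 1))  # 1.2, slight boost near peak sensitivity
```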
A loud sound can render softer nearby sounds inaudible — simultaneous masking. Critically, masking also occurs across time: forward masking (a loud sound masks quiet ones for up to 200 ms after) and backward masking. MP3 and AAC audio codecs exploit masking curves to discard psychoacoustically irrelevant information, achieving high compression ratios with minimal perceptual loss.
With two ears, the auditory system extracts spatial information using: Interaural Time Differences (ITD, ±0.7 ms, dominant below 1.5 kHz) and Interaural Level Differences (ILD, dominant above 1.5 kHz). The head-related transfer function (HRTF) models how the outer ear and head colour sound differently for each direction, enabling 3D audio rendering.
Pitch is not merely frequency. The missing fundamental phenomenon demonstrates that the brain can perceive a pitch even when the fundamental frequency is absent — reconstructed from the pattern of harmonics. This is why a small phone speaker, incapable of reproducing 100 Hz, can still convey a male voice with the correct perceived pitch.
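The effect is easy to demonstrate numerically: a signal built only from harmonics of 100 Hz, with the 100 Hz component itself absent, still repeats every 10 ms, and that periodicity is the cue the brain hears as pitch. A numpy sketch:

```python
import numpy as np

fs = 16_000
t = np.arange(fs) / fs

# Harmonics of 100 Hz; the 100 Hz fundamental itself is absent
signal = sum(np.sin(2 * np.pi * f * t) for f in (300, 400, 500, 600))

# The waveform still repeats every 10 ms, the period of the missing 100 Hz
period = fs // 100   # 160 samples
print(np.allclose(signal[period:], signal[:-period]))  # True
```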
The single most powerful mathematical tool in acoustics was developed by a French mathematician studying heat flow — and it fundamentally transformed our understanding of sound.
In 1807, Jean-Baptiste Joseph Fourier proposed a revolutionary idea: any periodic function, however complex, can be expressed as a sum of sinusoids (sines and cosines) at harmonically related frequencies. For a signal with period T (fundamental frequency f0 = 1/T), the Fourier Series is: x(t) = a0/2 + Σ [an cos(2πnf0t) + bn sin(2πnf0t)], with the sum running over n = 1, 2, 3, …
In acoustics, this means every periodic sound — a violin string, a vowel, a trumpet note — is built from a fundamental frequency plus harmonics (integer multiples of the fundamental). The relative amplitudes and phases of these harmonics determine the timbre (tonal colour) of the sound, explaining why a flute and an oboe playing the same note at the same loudness sound utterly different.
Real sounds are rarely perfectly periodic. By taking the limit as the period T → ∞, the Fourier Series becomes the Fourier Transform, valid for any aperiodic signal: X(f) = ∫ x(t) e^(−j2πft) dt, integrated over all time.
X(f) is a complex-valued function whose magnitude gives the amplitude spectrum and whose argument gives the phase spectrum. Together, they contain all the information of the original signal — a complete, invertible representation.
In the digital age, signals are sampled at discrete time intervals. The Discrete Fourier Transform (DFT) computes the frequency spectrum of a sequence of N samples. However, computing all N output bins directly requires O(N²) operations — prohibitively slow for large N.
In 1965, James Cooley and John Tukey published their landmark algorithm: the Fast Fourier Transform (FFT). By exploiting the symmetry of complex exponentials and recursively splitting the DFT, the FFT reduces computation to O(N log N) — for N = 1,048,576, that is a speedup factor of over 50,000×.
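To make the contrast concrete, here is a direct O(N²) DFT next to the library FFT; both produce the same spectrum at radically different cost (a numpy sketch):

```python
import numpy as np

def naive_dft(x: np.ndarray) -> np.ndarray:
    """Direct O(N²) DFT: X[k] = Σn x[n]·e^(−j2πkn/N)."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.exp(-2j * np.pi * k * n / N))
                     for k in range(N)])

x = np.random.default_rng(0).standard_normal(1024)
print(np.allclose(naive_dft(x), np.fft.fft(x)))  # True: identical spectra
# naive: ~N² = 1,048,576 complex multiplies; FFT: ~N·log2(N) = 10,240
```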
The FFT is ubiquitous in audio processing. Every spectrum analyser, every audio codec (MP3, AAC, Opus), every digital audio workstation, and every voice assistant uses the FFT as a core operation. In real-time applications, a Short-Time Fourier Transform (STFT) applies overlapping FFT windows to a continuous signal, producing a spectrogram — a two-dimensional time-frequency representation that makes features like formants, vibrato, and transients visually apparent.
Applying the FFT to a finite block of samples implicitly assumes the signal is periodic within that block — an assumption that introduces spectral leakage (smearing of energy across adjacent frequency bins) when the signal is not. Window functions — Hann, Hamming, Blackman, Kaiser — taper the signal to zero at the block boundaries, trading spectral resolution for reduced leakage. Choosing the right window for the task is a key skill in digital audio analysis.
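A numpy sketch of the effect: a 440 Hz tone that falls between FFT bins leaks badly under a rectangular window and far less under a Hann window. The leakage measure used here (energy outside the five bins around the peak) is an illustrative choice, not a standard metric:

```python
import numpy as np

fs, N = 48_000, 1024
t = np.arange(N) / fs
# 440 Hz sits between bins (bin spacing fs/N ≈ 46.9 Hz), so a plain
# (rectangular-windowed) FFT smears its energy across the spectrum.
x = np.sin(2 * np.pi * 440 * t)

rect = np.abs(np.fft.rfft(x))
hann = np.abs(np.fft.rfft(x * np.hanning(N)))

def leakage(spectrum: np.ndarray) -> float:
    """Fraction of total energy outside the 5 bins around the peak."""
    peak = spectrum.argmax()
    total = np.sum(spectrum**2)
    main = np.sum(spectrum[max(peak - 2, 0):peak + 3] ** 2)
    return 1 - main / total

print(f"rectangular: {leakage(rect):.4f}")  # noticeable leakage
print(f"hann:        {leakage(hann):.6f}")  # orders of magnitude less
```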
Where Fourier Analysis is the tool of steady-state frequency analysis, the Laplace Transform handles transients, initial conditions, and the stability of acoustic systems.
The Laplace Transform, developed by Pierre-Simon Laplace in the late 18th century, generalises the Fourier Transform by introducing a complex frequency variable s = σ + jω, where σ is a real damping factor and ω = 2πf is angular frequency: X(s) = ∫ x(t) e^(−st) dt, integrated from t = 0 to ∞.
Setting σ = 0 (i.e., s = jω) recovers the Fourier Transform. The real part σ allows the transform to handle signals that grow or decay exponentially — making it the natural tool for analysing transient acoustic phenomena like the onset of a struck piano string or the decay of a resonant cavity.
In linear acoustic systems (resonators, filters, rooms modelled as LTI systems), the ratio of output to input in the Laplace domain is the transfer function H(s) = Y(s)/X(s). Convolution in the time domain becomes multiplication in the s-domain — a dramatic simplification.
The poles of H(s) (values where the denominator is zero) determine the system’s resonant frequencies and decay rates. A pole at s = −σ0 ± jω0 represents a damped resonance at f0 = ω0/2π with exponential decay rate σ0.
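The corresponding impulse response is an exponentially decaying cosine, h(t) = e^(−σ0·t) · cos(ω0·t). A minimal sketch with illustrative values:

```python
import numpy as np

fs = 48_000
t = np.arange(fs) / fs

sigma0 = 50.0               # decay rate, 1/s (illustrative)
omega0 = 2 * np.pi * 440.0  # 440 Hz resonance, rad/s (illustrative)

# Impulse response implied by the conjugate pole pair s = −σ0 ± jω0
h = np.exp(-sigma0 * t) * np.cos(omega0 * t)

# After 1/σ0 = 20 ms the envelope has fallen to 1/e of its initial value
print(h[0], np.exp(-sigma0 * 0.02))  # 1.0, ≈ 0.368
```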
The acoustic wave equation, a second-order PDE in time, transforms cleanly under Laplace. Initial conditions (pressure distribution and particle velocity at t = 0) appear naturally as algebraic terms, making the Laplace approach essential for solving room acoustics problems with defined initial states.
The discrete-time analogue of the Laplace Transform is the Z-Transform, where z = e^(sT) and T is the sampling period. All digital audio filters (IIR lowpass, shelving EQ, reverberation algorithms) are designed and analysed using Z-domain techniques, mapped from their continuous-time Laplace prototypes via the bilinear transformation.
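As a small illustration, here is a first-order analog lowpass prototype mapped into the z-domain with SciPy's bilinear transformation. The cutoff and sample rate are arbitrary choices, and production designs usually pre-warp the cutoff to compensate for the transform's frequency warping:

```python
import numpy as np
from scipy.signal import bilinear, lfilter

fs = 48_000            # sample rate, Hz (illustrative)
wc = 2 * np.pi * 1000  # 1 kHz cutoff, rad/s (illustrative)

# Continuous-time prototype: first-order lowpass H(s) = ωc / (s + ωc)
b_s, a_s = [wc], [1.0, wc]

# Map the Laplace-domain prototype into the z-domain
b_z, a_z = bilinear(b_s, a_s, fs)

# Run the resulting digital IIR filter over one second of white noise
noise = np.random.default_rng(0).standard_normal(fs)
smoothed = lfilter(b_z, a_z, noise)
```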
The human vocal tract is remarkably well-modelled as a time-varying acoustic tube whose resonances — called formants (F1, F2, F3, ...) — shape vowel identity. In the Laplace framework, formants are poles of the vocal tract transfer function. Speech synthesis and analysis systems (LPC — Linear Predictive Coding) estimate these poles directly from the speech signal, encoding each short analysis frame (a few tens of milliseconds) of speech with as few as 10–12 coefficients. This is the foundation of telephony codecs and voice synthesis technology.
Summary of transforms in audio: The Fourier Transform reveals steady-state frequency content. The Laplace Transform handles system dynamics and transients. The FFT makes Fourier analysis computationally practical on digital hardware. Together, they form the mathematical backbone of all modern audio engineering.
The principles of sound waves and their mathematical treatment permeate technology, medicine, architecture, and art.
Designing the acoustic character of concert halls, opera houses, cathedrals, recording studios, and open-plan offices. Key metrics include reverberation time (T20, T30, T60), clarity (C80), definition (D50), and speech transmission index (STI). Simulation software uses the FFT and geometric (ray tracing) or wave-based (finite element) methods to predict room behaviour before construction.
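The oldest of these metrics has a famously simple model: Sabine's formula, RT60 = 0.161 · V / Σ(Si · αi) in metric units. A sketch with hypothetical hall dimensions and absorption coefficients:

```python
def sabine_rt60(volume_m3: float, surfaces: list[tuple[float, float]]) -> float:
    """Sabine reverberation time RT60 = 0.161·V / Σ(Si·αi), metric units.
    `surfaces` is a list of (area in m², absorption coefficient α) pairs."""
    total_absorption = sum(area * alpha for area, alpha in surfaces)
    return 0.161 * volume_m3 / total_absorption

# Hypothetical 12 m × 20 m × 8 m hall (1,920 m³) with illustrative α values
surfaces = [
    (512, 0.06),  # plaster walls
    (240, 0.45),  # audience seating over the floor
    (240, 0.04),  # wood ceiling
]
print(round(sabine_rt60(1920, surfaces), 1))  # ≈ 2.1 s, in the orchestral range
```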
Diagnostic ultrasound (2–20 MHz) uses pulse-echo techniques to image soft tissue. The Doppler effect enables blood flow measurement. Therapeutic ultrasound delivers focused acoustic energy for physiotherapy, kidney stone disintegration (lithotripsy), and emerging cancer treatments. FFT-based signal processing is fundamental to reconstructing images from echo data.
Controlling industrial noise, environmental noise, and vehicle noise through absorption, insulation, and active noise control (ANC). ANC systems sample the noise with a microphone, compute an anti-phase signal in real time using digital signal processing, and emit it through a loudspeaker — exploiting destructive interference. Modern noise-cancelling headphones achieve 30+ dB of attenuation using these principles.
Because sound travels so much better than electromagnetic waves underwater, sonar (Sound Navigation And Ranging) is the primary sensing modality in the ocean. Active sonar emits pulses and detects echoes; passive sonar listens for target signatures. The FFT is used in signal processing chains to detect weak signals in noise and to estimate bearing and range of underwater objects.
Automatic speech recognition (ASR), text-to-speech synthesis (TTS), audio compression (MP3, AAC, FLAC), music information retrieval (pitch detection, beat tracking, chord recognition), and hearing aids all rely on combinations of Fourier analysis, psychoacoustic models, and machine learning operating in the frequency domain. The STFT and Mel-frequency cepstral coefficients (MFCCs), derived via the FFT, are the standard feature representations for audio ML.
Earthquakes generate both P-waves (longitudinal, >6 km/s in crust) and S-waves (transverse, slower, cannot travel through liquids). Analysing these waves’ arrival times at seismograph networks — using Fourier and Laplace methods — allows scientists to locate earthquake epicentres, determine magnitudes, and infer the structure of the Earth’s interior.
All digital audio rests on a single profound theorem: the Nyquist–Shannon Sampling Theorem (1928/1949). It states that a bandlimited signal can be perfectly reconstructed from discrete samples if the sampling rate fs is greater than twice the highest frequency present: fs > 2·fmax.
Frequencies above the Nyquist limit (fs/2) fold back into the audible band as aliasing — audible artefacts that are prevented by anti-aliasing filters applied before the analogue-to-digital converter. The Fourier Transform provides the mathematical proof that this sampling and perfect reconstruction is possible.
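Aliasing is easy to exhibit numerically: sampled at 8 kHz, a 5 kHz tone produces exactly the same samples (up to a sign flip) as a 3 kHz tone, which is precisely the fold-back the theorem predicts. A numpy sketch:

```python
import numpy as np

fs = 8_000                 # deliberately low sample rate, Hz
t = np.arange(fs) / fs

alias   = np.sin(2 * np.pi * 5000 * t)  # 5 kHz: above Nyquist (fs/2 = 4 kHz)
genuine = np.sin(2 * np.pi * 3000 * t)  # 3 kHz: the folded-back frequency

# The 5 kHz tone folds to |fs − f| = 3 kHz; for a sine the fold also
# inverts the phase, so the two sample sequences differ only in sign.
print(np.allclose(alias, -genuine))  # True
```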