adsp 10 ac psycho acoustics ec623 adsp
Audio Coding: Psychoacoustics
S. R. M. Prasanna
Dept of ECE,
IIT Guwahati,
Audio Coding – p. 1/45
Motivation
Acoustics: Study of sounds
Psychoacoustics: Study of perception of sounds
Deals with characterizing human auditory perception
Particularly the time-frequency analysis capabilities of the inner ear
Audio coders achieve significant compression by exploiting the property that perceptually irrelevant information cannot be heard
Perceptually irrelevant information is identified by incorporating several psychoacoustic principles
Human Speech Perception
Figure 1: Cross Section of Human Ear
Functions of Human Ear
Mainly three regions - outer ear, middle ear & inner ear
Outer ear - directs speech pressure variations towards the middle ear
Middle ear - transforms pressure variations into mechanical motion
Inner ear - converts mechanical vibrations into electrical firings in the auditory neurons, which lead to the brain
Language decoding and message understanding take place at the higher centers of learning in the brain, a stage that is less well understood
Inner Ear
Figure 2: Figures Related to Inner Ear
Frequency to Place Transformation
Sound waves to mechanical vibrations by middle ear
Mechanical vibrations to traveling waves by inner ear along the length of the basilar membrane
Neural receptors are connected along the length of the basilar membrane
Traveling waves generate peak responses at frequency-specific membrane positions
Therefore different neural receptors are effectively tuned to different frequency bands according to their locations.
Freq. to Place Tfmn. (contd.)
For sinusoidal stimuli, the peak response occurs near the basilar membrane region whose resonant freq. equals the input sinusoid freq.
Location of the peak is the characteristic place for the stimulus
Freq. that best excites a particular place is its characteristic frequency
Thus a frequency to place transformation takes place
Signal Processing Perspective
Bank of highly overlapping band pass filters
Magnitude responses are asymmetric
Bandwidths increase with frequency
Sound Pressure Level (SPL)
A std. metric that quantifies the intensity of an acoustical stimulus
SPL gives the level (intensity) of sound pressure in dB relative to an internationally defined ref. level
$L_{SPL} = 20 \log_{10}(p/p_0)$ (dB), where $L_{SPL}$ is the SPL of a stimulus, $p$ is the sound pressure of the stimulus in pascals and $p_0$ is the std. ref. level of 20 µPa
The dynamic range of intensity for the human auditory system spans about 150 dB SPL
Min value is the limit of detection for low intensity (quiet) stimuli
Max value is the threshold of pain for high intensity (loud) stimuli
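As a small illustration of the SPL definition, the sketch below converts a pressure in pascals to dB SPL. The function name `spl_db` and the constant name `P0` are chosen here for illustration only.

```python
import math

P0 = 20e-6  # standard reference pressure p0 = 20 micropascals, in pascals


def spl_db(p_pascal: float) -> float:
    """Sound pressure level in dB SPL for a sound pressure given in pascals."""
    return 20.0 * math.log10(p_pascal / P0)


# A pressure equal to the reference level sits at 0 dB SPL;
# every factor of 10 in pressure adds 20 dB.
print(spl_db(20e-6), spl_db(200e-6))
```

With this definition, a pressure of 1 Pa comes out at roughly 94 dB SPL.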
Absolute Threshold for Hearing (ATH)
Amount of energy needed in a pure tone such that it can be detected by a listener in a noiseless environment
ATH is expressed in dB SPL
ATH is a frequency-dependent parameter and is given by
$T_q(f) = 3.64 (f/1000)^{-0.8} - 6.5 e^{-0.6 (f/1000 - 3.3)^2} + 10^{-3} (f/1000)^4$ (dB SPL), with $f$ in Hz
In the context of signal compression, $T_q(f)$ could be interpreted naively as a maximum allowable energy level for coding distortions introduced in the frequency domain (Fig 5.1 from Spanias book)
Use of ATH to shape the coding distortion spectrum represents the first step towards perceptual coding.
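The ATH expression can be evaluated directly. The following is a minimal sketch; the function name `ath_db_spl` is an assumption made for this example, and the frequency is taken in Hz as in the formula.

```python
import math


def ath_db_spl(f_hz: float) -> float:
    """Absolute threshold of hearing T_q(f) in dB SPL, for f in Hz (f > 0)."""
    k = f_hz / 1000.0  # frequency in kHz
    return (3.64 * k ** -0.8
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)
            + 1e-3 * k ** 4)


# The threshold is lowest (most sensitive hearing) around 3 to 4 kHz
# and rises steeply towards very low and very high frequencies.
for f in (100, 1000, 3300, 10000, 16000):
    print(f, round(ath_db_spl(f), 2))
```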
ATH Diagram
Figure 3: Absolute Threshold for Hearing
Critical Bands (CB)
Critical bandwidth is a function of frequency that quantifies the cochlear filter passbands
CB tends to remain constant (about 100 Hz) up to 500 Hz and increases to approximately 20% of the center frequency above 500 Hz
For an average listener the critical bandwidth is given by $BW_c(f) = 25 + 75 [1 + 1.4 (f/1000)^2]^{0.69}$ (Hz)
The function $Z_b(f) = 13 \tan^{-1}(0.00076 f) + 3.5 \tan^{-1}((f/7500)^2)$ (Bark) is often used to convert frequency in Hz to the Bark scale
Nonuniform Hz spacing of the filter bank is actually uniform on a Bark scale
One critical band (CB) comprises one Bark. (Table 5.1 and Fig. 5.4)
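The two critical-band expressions can be sketched as follows. The function names are chosen for this example, and the bandwidth formula is used in its usual form with $f/1000$ (frequency in kHz) inside the bracket.

```python
import math


def critical_bandwidth_hz(f_hz: float) -> float:
    """Critical bandwidth BW_c(f) in Hz for a center frequency f in Hz."""
    return 25.0 + 75.0 * (1.0 + 1.4 * (f_hz / 1000.0) ** 2) ** 0.69


def hz_to_bark(f_hz: float) -> float:
    """Critical-band rate Z_b(f) in Bark for a frequency f in Hz."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


# Bandwidth stays near 100 Hz at low frequencies and grows with frequency.
for f in (100, 500, 1000, 4000, 10000):
    print(f, round(critical_bandwidth_hz(f), 1), round(hz_to_bark(f), 2))
```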
Critical Bands
Figure 4: Table Showing Critical Bands
Mapping from Hz to Bark
Figure 5: Mapping from Hz to Bark Scale
Simultaneous Masking
Masking: One sound is rendered inaudible because of the presence of another sound
Simultaneous masking: When two or more stimuli are simultaneously presented to the auditory system
Freq. Domain: Relative shapes of the masker and maskee magnitude spectra determine to what extent the presence of certain spectral energy will mask the presence of other spectral energy
Time Domain: Phase relationships between stimuli can also affect masking outcomes
In simple words, the presence of a strong noise or tone masker creates an excitation of sufficient strength on the basilar membrane at the critical band location to effectively block detection of a weaker (maskee) signal.
Types of Simultaneous Masking
Noise-Masking-Tone (NMT), Tone-Masking-Noise (TMN) and Noise-Masking-Noise (NMN)
NMT:
A NB noise (1 Bark) masks a tone within the same CB, provided intensity of masked tone is below a predictable threshold
Signal-to-Mask Ratio (SMR) (dB) is the difference between the intensities of masker and maskee
Min. SMR at the threshold of detection occurs when maskee freq. is close to center freq. of masker and will be about 5 dB
TMN and NMN
TMN:
Pure tone at the center of a CB masks noise of any subcritical BW, provided noise spectrum is below a predictable threshold
Min SMR lies between 21 and 28 dB
NMN:
A NB noise masks another NB noise
Min SMR is about 26 dB
Masking Schemes
Figure 6: Masking schemes
Asymmetry of Masking
NMT and TMN show an asymmetry in masking power between a noise masker and a tone masker
Even with both maskers at the same dB SPL, the associated threshold SMRs differ by about 20 dB
Hence the interest in all types of masking
Knowledge of all three is critical to succeed in the task of shaping coding distortion
For each temporal analysis interval, a codec's perceptual model should identify, across the freq. spectrum, noise-like and tone-like components within both the audio signal and the coding distortion
Model should then apply appropriate masking relationships to obtain the global masking threshold
Spread of Masking
Simultaneous masking is not bandlimited to within the boundaries of a single CB
Interband masking also occurs, i.e., a masker centered within one critical band has some predictable effect on detection thresholds in other CBs.
This effect is known as spread of masking
It is often modeled by an approximately triangular spreading function that has slopes of +25 and -10 dB per Bark:
$SF_{dB}(x) = 15.81 + 7.5 (x + 0.474) - 17.5 \sqrt{1 + (x + 0.474)^2}$ (dB)
where $x$ is in Barks and $SF_{dB}(x)$ is expressed in dB.
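A quick numerical check of the spreading function shows the two asymptotic slopes. This is a sketch only; `spreading_db` is a name chosen for the example.

```python
import math


def spreading_db(x_bark: float) -> float:
    """Spreading function SF_dB(x) in dB, for a Bark separation x."""
    shifted = x_bark + 0.474
    return 15.81 + 7.5 * shifted - 17.5 * math.sqrt(1.0 + shifted ** 2)


# Close to 0 dB at x = 0; far from the peak the slope approaches
# +25 dB/Bark for negative x and -10 dB/Bark for positive x.
lower_slope = spreading_db(-5.0) - spreading_db(-6.0)
upper_slope = spreading_db(6.0) - spreading_db(5.0)
print(round(spreading_db(0.0), 3), round(lower_slope, 2), round(upper_slope, 2))
```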
Just Noticeable Distortion (JND)
Global masking threshold comprises an estimate of the level at which quantization noise becomes just noticeable
Hence the global masking threshold is sometimes referred to as the JND level
Nonsimultaneous Masking
Also termed temporal masking
Masking phenomenon extends beyond the window of simultaneous stimulus presentation
Masking occurs both prior to masker onset and also after masker removal
Forward (post) and backward (pre) masking are the two types
Figure 7: Temporal Masking
Perceptual Entropy
Entropy gives min. no. of bits/sample required to store or transmit a given message block
Johnston combined the notion of psychoacoustic masking with signal quantization principles to define Perceptual Entropy (PE).
Perceptual Entropy gives min. no. of bits/sample required to store or transmit the perceptually relevant information in a given audio message block.
While discussing PE, conventional entropy is termed statistical entropy.
Statistical entropy employs the statistical properties of the signal for computing entropy
Perceptual entropy employs both statistical and perceptual properties of the signal for computing entropy.
Basis for PE
Masking threshold indicates the amount of quantization noise that can be added in the freq. domain without perceptually corrupting the signal.
Assume that step size and no. of levels in the quantizer for each spectral line could be set independently.
Further, the step size is chosen such that the total noise injected at each frequency corresponds to the masking threshold, i.e., the min. no. of quantization levels is used.
Then the no. of bits required to encode the entire transform represents the min. no. of bits necessary to transmit that block of the signal.
The total number of bits divided by the no. of samples in the transform represents the per-sample bit rate.
This per-sample bit rate is Perceptual Entropy of signal.
PE v/s SE
Statistical entropy (SE) exploits signal statistics
Perceptual entropy (PE) exploits signal statistics and also psychoacoustic masking
PE uses just enough quantization levels to avoid perceptible quantization distortion, as determined by the masking thresholds.
Steps for PE Computation
DFT computation
Finding Masking thresholds
Calculating no. of bits to quantize DFT spectrum
DFT Computation
Windowing and frequency transformation
2048 sample DFT by FFT
Only 1024 of the 2048 DFT bins are considered for further analysis, since the spectrum of a real signal is symmetric
Calculation of Masking Threshold
Critical band analysis
Applying spreading function to critical band spectrum
Calculating Masking Thresholds
Accounting for absolute thresholds
Relating spread masking threshold to critical band masking threshold
Critical Band Analysis
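DFT spectrum is complex: $S(\omega) = Re(\omega) + j\,Im(\omega)$
Power spectrum: $P(\omega) = Re^2(\omega) + Im^2(\omega)$
$P(\omega)$ is partitioned into CBs
Energy in each CB: $B_i = \sum_{\omega = bl_i}^{bh_i} P(\omega)$, where $bl_i$ and $bh_i$ are the lower and upper boundaries of CB $i$
$B_i$ represents the CB spectrum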
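The partitioning step might look like the sketch below. It assumes, for illustration only, that each spectral line is assigned to the band given by the integer part of its Bark value; the slides do not specify how the band edges $bl_i$ and $bh_i$ are chosen, and the function names are hypothetical.

```python
import math


def hz_to_bark(f_hz: float) -> float:
    """Hz to Bark conversion Z_b(f)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


def critical_band_energies(power, fs_hz):
    """Sum P(w) over each critical band.

    power: list of power-spectrum values for bins covering 0 .. fs/2 (exclusive).
    Assumption: bin k belongs to band floor(Z_b(f_k)).
    """
    n_bins = len(power)
    n_bands = int(hz_to_bark(fs_hz / 2.0)) + 1
    energies = [0.0] * n_bands
    for k, p in enumerate(power):
        f_k = k * (fs_hz / 2.0) / n_bins  # center frequency of bin k in Hz
        energies[int(hz_to_bark(f_k))] += p
    return energies


# Flat test spectrum: 1024 bins, 44.1 kHz sampling rate (illustrative values).
band_energy = critical_band_energies([1.0] * 1024, 44100.0)
print(len(band_energy), sum(band_energy))
```

Because the bands widen with frequency, a flat spectrum puts far more lines (and energy) into the upper bands than into the lower ones.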
Spreading Function (SF)
CB spectrum threshold is also influenced by adjacent CBs, which is accounted for using the SF.
SF is used to estimate effects of masking across CBs
SF is calculated for $|j - i| \le 25$, where $i$ is the Bark freq. of the masked signal and $j$ is the Bark freq. of the masking signal, and placed into a matrix $S_{ij}$
Spread CB spectrum: $C_i = \sum_j S_{ij} B_j$, i.e., the matrix $S_{ij}$ multiplied by the CB spectrum
Effect of the spreading function is to spread peaks in $B_i$ and also raise threshold values, especially at higher frequencies.
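A minimal sketch of the spreading step follows. It assumes that the matrix entries are the spreading function converted from dB to a power ratio, $S_{ij} = 10^{SF_{dB}(i - j)/10}$, with the Bark separation taken as masked minus masking band; these conventions and the function names are assumptions for illustration.

```python
import math


def spreading_db(x_bark: float) -> float:
    """Spreading function SF_dB(x) in dB."""
    shifted = x_bark + 0.474
    return 15.81 + 7.5 * shifted - 17.5 * math.sqrt(1.0 + shifted ** 2)


def spread_cb_spectrum(band_energy):
    """Return C_i = sum_j S_ij * B_j for a list of band energies B."""
    n = len(band_energy)
    spread = []
    for i in range(n):        # i: masked band
        total = 0.0
        for j in range(n):    # j: masking band
            gain = 10.0 ** (spreading_db(i - j) / 10.0)  # dB to power ratio
            total += gain * band_energy[j]
        spread.append(total)
    return spread


# A single active band leaks more energy upwards than downwards.
B = [0.0] * 25
B[10] = 1.0
C = spread_cb_spectrum(B)
print(round(C[9], 4), round(C[10], 4), round(C[11], 4))
```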
Masking Thresholds
TMN threshold is estimated as $14.5 + i$ dB below $C_i$, where $i$ is the Bark freq.
NMT threshold is estimated as 5.5 dB below $C_i$, uniformly across the CB spectrum
Tone Like and Noise Like Components
Spectral Flatness Measure: $SFM = GM/AM$
GM is the geometric mean of $P(\omega)$ and AM is the arithmetic mean of $P(\omega)$
$SFM_{dB} = 10 \log_{10}(GM/AM)$
Coeff. of tonality: $\alpha = \min(SFM_{dB}/SFM_{dBmax}, 1)$
$SFM_{dBmax} = -60$ dB is used to estimate tonality
$SFM_{dB} = 0$ dB indicates a completely noise-like signal ($\alpha = 0$)
$SFM_{dB} = -30$ dB indicates $\alpha = 0.5$
$SFM_{dB} = -75$ dB indicates $\alpha = 1.0$
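The tonality estimate can be sketched as below. The function name is hypothetical, and the sketch assumes strictly positive power values so that the geometric mean is well defined.

```python
import math

SFM_DB_MAX = -60.0  # SFM value (dB) treated as fully tonal


def tonality_coefficient(power):
    """Coefficient of tonality alpha from a list of positive power values."""
    n = len(power)
    geometric_mean = math.exp(sum(math.log(p) for p in power) / n)
    arithmetic_mean = sum(power) / n
    sfm_db = 10.0 * math.log10(geometric_mean / arithmetic_mean)
    return min(sfm_db / SFM_DB_MAX, 1.0)


# A flat (noise-like) spectrum gives alpha near 0;
# a single dominant line (tone-like) saturates at alpha = 1.
alpha_noise = tonality_coefficient([2.0] * 64)
alpha_tone = tonality_coefficient([1e-12] * 63 + [1.0])
print(alpha_noise, alpha_tone)
```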
Offset for Masking Energy
$O_i = \alpha (14.5 + i) + (1 - \alpha) 5.5$ (dB), in each band $i$
Index $\alpha$ is used to geometrically weight the two thresholds
$O_i$ is then subtracted from $C_i$ (in the log domain) to yield the spread threshold estimate $T_i = 10^{\log_{10}(C_i) - O_i/10}$
Since the spreading function does not have normalized gain, $T_i$ is normalized by the DC gain of the spreading function for each CB
After normalization, Bark thresholds are compared to absolute thresholds.
Any CB whose Bark threshold is lower than the absolute threshold is set to the absolute threshold
This will be the threshold used for computing bit rate.
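The offset and threshold steps can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, the band index $i$ is passed in as given, and the gain normalization step is left out because the slides do not spell out its exact form.

```python
import math


def threshold_offset_db(alpha: float, band_index: int) -> float:
    """Offset O_i in dB, blending the TMN (14.5 + i) and NMT (5.5) offsets."""
    return alpha * (14.5 + band_index) + (1.0 - alpha) * 5.5


def spread_threshold(spread_energy: float, alpha: float, band_index: int) -> float:
    """Spread threshold T_i = 10^(log10(C_i) - O_i / 10), in the power domain."""
    offset_db = threshold_offset_db(alpha, band_index)
    return 10.0 ** (math.log10(spread_energy) - offset_db / 10.0)


def floor_to_absolute(threshold: float, absolute_threshold: float) -> float:
    """Replace a threshold that falls below the absolute threshold."""
    return max(threshold, absolute_threshold)


# Noise-like band (alpha = 0): threshold is 5.5 dB below C_i.
print(round(spread_threshold(100.0, 0.0, 3), 3))
# Tone-like band (alpha = 1) in band 7: offset is 21.5 dB.
print(threshold_offset_db(1.0, 7))
```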
Calculation of Bit Rate
No. of quantization levels needed to follow the signal in the freq. domain
$T_i$ is in the power domain
Quantization noise energy must be spread across the $k_i$ spectral lines in each CB
Assuming the noise spreads equally across the entire band, a quantizer with step size $\delta$ has noise energy $\delta^2/12$
Noise energy at each spectral line $= T_i/k_i$
Real and imaginary parts are quantized independently, so the energy per component $= T_i/(2 k_i)$
$\delta^2/12 = T_i/(2 k_i) \implies \delta = T'_i = \sqrt{6 T_i / k_i}$
$T'_i$ is the step size.
Computing PE
$N_{Re}(\omega) = |nint(Re(\omega)/T'_i)|$ and $N_{Im}(\omega) = |nint(Im(\omega)/T'_i)|$ for each $\omega$ within CB $i$, where nint denotes rounding to the nearest integer.
Let $N_*$ represent the actual (integer) quantized value of each line, with $*$ standing for Re or Im
If $N_*(\omega) = 0$, then $N'_*(\omega) = 0$
If $N_*(\omega) \neq 0$, then $N'_*(\omega) = \log_2(2 N_*(\omega) + 1)$
This operation assigns a bit rate of zero bits to any signal with an amplitude that does not need to be quantized and assigns a bit rate of $\log_2$(no. of levels) to those that must be quantized.
Total bit rate $= \sum_{\omega = 0}^{\pi} (N'_{Re}(\omega) + N'_{Im}(\omega))$
Rate per sample, PE = Total bit rate / 2048
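The bit-counting rule can be sketched as below. All function and parameter names are hypothetical; the band thresholds and the line-to-band assignment are taken as given inputs, and the step size is assumed to follow $T'_i = \sqrt{6 T_i / k_i}$.

```python
import math


def step_size(threshold: float, lines_in_band: int) -> float:
    """Quantizer step size T'_i = sqrt(6 * T_i / k_i)."""
    return math.sqrt(6.0 * threshold / lines_in_band)


def line_bits(value: float, step: float) -> float:
    """Bits for one real or imaginary component: 0 if it quantizes to zero."""
    levels = int(math.floor(abs(value) / step + 0.5))  # |nint(value / step)|
    return math.log2(2 * levels + 1) if levels != 0 else 0.0


def perceptual_entropy(spectrum, band_of_line, thresholds, lines_per_band, block_len):
    """Total bits over all spectral lines, divided by the block length."""
    total_bits = 0.0
    for s, band in zip(spectrum, band_of_line):
        step = step_size(thresholds[band], lines_per_band[band])
        total_bits += line_bits(s.real, step) + line_bits(s.imag, step)
    return total_bits / block_len


# Toy example: two complex lines in one band, T = 12 and k = 2, so step = 6.
pe = perceptual_entropy([complex(18, 0), complex(1, -6)], [0, 0], [12.0], [2], 4)
print(round(pe, 4))
```

In the toy example only two components survive quantization (3 levels and 1 level), so the total is $\log_2 7 + \log_2 3$ bits over a block of 4 samples.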
Example codec perceptual model
ISO/IEC 11172-3 (MPEG-1) Psychoacoustic Model-1
Determines max. allowable quantization noise energy in each CB such that it remains inaudible.
Blocking input audio into frames
High resolution spectral computation for each frame
Tonal and noise masker estimation for each frame
Decimation and reorganization of maskers
Calculation of individual masking thresholds for components in each CB
Calculation of global masking thresholds for each CB
Spectral Analysis
512 point DFT computation
Power Spectral Density (PSD) $P(k)$ estimation, where $k = 1, 2, \ldots, 512$
[Figure: PSD estimate, SPL (dB) versus Frequency (Hz)]
Identn. of Tonal and Noise Maskers
$P(k)$ where $k = 1, 2, \ldots, 256$ are considered
Local maxima in the PSD that exceed neighboring components within a certain Bark distance by at least 7 dB are classified as tonal
Tonal set $S_T$ is defined as
$S_T = \{ P(k) \mid P(k) > P(k \pm 1),\ P(k) > P(k \pm \Delta_k) + 7\ \text{dB} \}$
where
$\Delta_k \in 2$ for $2 < k < 63$ (0.17 - 5.5 kHz)
$\Delta_k \in [2, 3]$ for $63 \le k < 127$ (5.5 - 11 kHz)
$\Delta_k \in [2, 6]$ for $127 \le k \le 256$ (11 - 20 kHz)
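The tonal-set test can be sketched as below. The names are hypothetical, the PSD is assumed to be a list of dB values indexed by bin number $k$ (index 0 unused), and bins too close to the end of the list to have a full neighborhood are simply skipped, which is a simplification made for this example.

```python
def delta_range(k: int):
    """Neighborhood offsets Delta_k for bin k, following the three ranges."""
    if 2 < k < 63:
        return [2]
    if 63 <= k < 127:
        return [2, 3]
    if 127 <= k <= 256:
        return [2, 3, 4, 5, 6]
    return []


def find_tonal_bins(psd_db):
    """Return bins k whose PSD value qualifies for the tonal set S_T."""
    tonal = []
    last = len(psd_db) - 1
    for k in range(3, last + 1):
        deltas = delta_range(k)
        if not deltas or k + max(deltas) > last:
            continue  # no rule for this bin, or neighborhood runs off the end
        if not (psd_db[k] > psd_db[k - 1] and psd_db[k] > psd_db[k + 1]):
            continue  # not a local maximum
        if all(psd_db[k] > psd_db[k + d] + 7.0 and psd_db[k] > psd_db[k - d] + 7.0
               for d in deltas):
            tonal.append(k)
    return tonal


# A sharp peak at bin 40 is tonal; a broad 5 dB bump around bin 100 is not.
psd = [0.0] * 257
psd[39], psd[40], psd[41] = 20.0, 30.0, 20.0
for k in (98, 99, 101, 102):
    psd[k] = 5.0
psd[100] = 10.0
print(find_tonal_bins(psd))
```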
Tonal and Noise Maskers (contd.)
Tonal maskers $P_{TM}(k)$ are computed from spectral peaks listed in $S_T$:
$P_{TM}(k) = 10 \log_{10} \sum_{j=-1}^{1} 10^{0.1 P(k+j)}$ (dB)
For each neighborhood max., energy from three adjacent spectral components is combined to form a single tonal masker
For each CB, a single noise masker $P_{NM}(\bar{k})$ is then computed from the (remaining) spectral lines not within the $\pm\Delta_k$ neighborhood of a tonal masker, using the sum
$P_{NM}(\bar{k}) = 10 \log_{10} \sum_{j} 10^{0.1 P(j)}$ (dB), $\forall P(j) \notin \{ P_{TM}(k, k \pm 1, k \pm \Delta_k) \}$
where $\bar{k}$ is the geometric mean spectral line of the CB
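Both masker levels are power sums carried out in the dB domain. The sketch below uses hypothetical function names, takes the critical-band edges (`lo`, `hi`) and the set of excluded bins as inputs, and assumes `lo >= 1` so that the geometric mean is defined.

```python
import math


def tonal_masker_db(psd_db, k):
    """Tonal masker level: power sum of the peak and its two neighbors, in dB."""
    return 10.0 * math.log10(sum(10.0 ** (0.1 * psd_db[k + j]) for j in (-1, 0, 1)))


def noise_masker(psd_db, lo, hi, excluded):
    """Noise masker for a band spanning bins lo..hi.

    excluded: bins inside tonal-masker neighborhoods.
    Returns (k_bar, level_db), or None if no lines remain.
    """
    lines = [j for j in range(lo, hi + 1) if j not in excluded]
    if not lines:
        return None
    level_db = 10.0 * math.log10(sum(10.0 ** (0.1 * psd_db[j]) for j in lines))
    # Geometric mean of the bin indices in the band.
    k_bar = math.exp(sum(math.log(j) for j in range(lo, hi + 1)) / (hi - lo + 1))
    return k_bar, level_db


# Three equal 10 dB lines combine to 10 + 10*log10(3) dB.
print(round(tonal_masker_db([10.0, 10.0, 10.0], 1), 3))
k_bar, level = noise_masker([0.0] * 10, 2, 8, {4, 5, 6})
print(round(k_bar, 3), round(level, 3))
```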
Decimation of Maskers
No. of maskers is reduced using two criteria
First, any tonal or noise maskers below the abs. threshold are discarded, i.e., only maskers satisfying $P_{TM,NM}(k) \ge T_q(k)$ are retained.
Next, a sliding 0.5 Bark-wide window is used to replace any pair of maskers occurring within a distance of 0.5 Bark by the stronger of the two.
Masker freq. bins are then reorganized using the decimation scheme
$P_{TM,NM}(i) = P_{TM,NM}(k)$
$P_{TM,NM}(k) = 0$ for $k \neq i$
Decimation (contd.)
$i = k$ for $1 \le k \le 48$
$i = k + (k \bmod 2)$ for $49 \le k \le 96$
$i = k + 3 - ((k - 1) \bmod 4)$ for $97 \le k \le 232$
Net effect is 2:1 decimation of masker bins in CBs 18-22 and 4:1 decimation of masker bins in CBs 22-25, with no loss of masking components.
Decimation reduces total no. of tone and noise masker freq. bins under consideration from 256 to 106
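The index mapping can be checked numerically. The sketch below (with a hypothetical function name) implements the three ranges and counts the distinct target bins.

```python
def decimated_bin(k: int) -> int:
    """Map original masker bin k (1..232) to its decimated bin index i."""
    if 1 <= k <= 48:
        return k                      # no decimation
    if 49 <= k <= 96:
        return k + (k % 2)            # 2:1 decimation
    if 97 <= k <= 232:
        return k + 3 - ((k - 1) % 4)  # 4:1 decimation
    raise ValueError("bin index outside the decimation table")


# 48 + 24 + 34 distinct target bins remain after decimation.
targets = {decimated_bin(k) for k in range(1, 233)}
print(len(targets))
```

The count of distinct targets agrees with the reduction to 106 bins stated above.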
Individual Masking Thresholds
Using the decimated set of tonal and noise maskers, individual tone and noise masking thresholds are computed
Each individual threshold represents a masking contribution at freq. bin $i$ due to the tone or noise masker located at bin $j$
Tonal masking threshold $T_{TM}(i, j)$ is given by $T_{TM}(i, j) = P_{TM}(j) - 0.275 Z_b(j) + SF(i, j) - 6.025$ (dB SPL), where $P_{TM}(j)$ is the SPL of the tonal masker in freq. bin $j$, $Z_b(j)$ is the Bark freq. of bin $j$ and $SF(i, j)$ is the spread of masking from bin $j$ to bin $i$
Noise masking threshold $T_{NM}(i, j)$ is given by $T_{NM}(i, j) = P_{NM}(j) - 0.175 Z_b(j) + SF(i, j) - 2.025$ (dB SPL), where $P_{NM}(j)$ is the SPL of the noise masker in freq. bin $j$
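The two threshold expressions differ only in their constants, as the sketch below shows. The function names are hypothetical, and the spread term $SF(i, j)$ is passed in as a precomputed dB value because its exact form is not given on these slides. The tonal coefficient is taken as 0.275.

```python
def tonal_threshold_db(masker_spl_db: float, masker_bark: float, spread_db: float) -> float:
    """Individual threshold T_TM(i, j) in dB SPL for a tonal masker at bin j."""
    return masker_spl_db - 0.275 * masker_bark + spread_db - 6.025


def noise_threshold_db(masker_spl_db: float, masker_bark: float, spread_db: float) -> float:
    """Individual threshold T_NM(i, j) in dB SPL for a noise masker at bin j."""
    return masker_spl_db - 0.175 * masker_bark + spread_db - 2.025


# Same level, position and spread: the noise masker yields the higher
# threshold, i.e. it masks more effectively than the tone.
print(tonal_threshold_db(70.0, 10.0, -5.0), noise_threshold_db(70.0, 10.0, -5.0))
```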
Global Masking Thresholds
Individual masking thresholds are combined to estimate a global masking threshold for each freq. bin
$T_g(i) = 10 \log_{10} \left( 10^{0.1 T_q(i)} + \sum_{l=1}^{L} 10^{0.1 T_{TM}(i, l)} + \sum_{m=1}^{M} 10^{0.1 T_{NM}(i, m)} \right)$ (dB SPL), where $L$ and $M$ are the number of tonal and noise maskers, respectively.
Bits are allocated based on the global masking thresholds; this is termed perceptual bit allocation.
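The combination rule adds the contributions as powers and converts the sum back to dB. A minimal sketch, with a hypothetical function name and the per-bin threshold values taken as given inputs:

```python
import math


def global_threshold_db(quiet_db, tonal_db, noise_db):
    """Global masking threshold T_g(i) in dB SPL for one frequency bin.

    quiet_db: absolute threshold T_q(i) in dB SPL.
    tonal_db, noise_db: lists of individual thresholds T_TM(i, l) and T_NM(i, m).
    """
    power = 10.0 ** (0.1 * quiet_db)
    power += sum(10.0 ** (0.1 * t) for t in tonal_db)
    power += sum(10.0 ** (0.1 * t) for t in noise_db)
    return 10.0 * math.log10(power)


# Three equal 10 dB contributions add up to 10 + 10*log10(3) dB;
# with no maskers, the global threshold is just the threshold in quiet.
print(round(global_threshold_db(10.0, [10.0], [10.0]), 3))
print(round(global_threshold_db(5.0, [], []), 3))
```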
Expt. 5-AC- Audio Synthesis using MSE
Problem No. 2.25 (p. 49) of Spanias book on Audio Signal Processing
Expt. 6-AC- Audio Synthesis using Psychoacoustics
Problem No. 5.11 (p. 142) of Spanias book on Audio Signal Processing