TRANSCRIPT
Perceptual Audio Coding
Henrique Malvar
Managing Director, Redmond Lab
UW Lecture – December 6, 2007
Contents
• Motivation
• “Source coding”: good for speech
• “Sink coding”: Auditory Masking
• Block & Lapped Transforms
• Audio compression
• Examples
Many applications need digital audio
• Communication
– Digital TV, telephony (VoIP) & teleconferencing
– Voice mail, voice annotations on e-mail, voice recording
• Business
– Internet call centers
– Multimedia presentations
• Entertainment
– 150 songs on standard CD
– Thousands of songs on portable music players
– Internet / satellite radio, HD Radio
– Games, DVD movies
Linear Predictive Coding (LPC)
• Speech model: an excitation e(n) drives a synthesis filter to produce the synthesized speech x(n):
  x(n) = e(n) + Σ_{r=1}^{N} a_r x(n − r)
[Block diagram: a periodic excitation (with pitch period) and a noise excitation are combined, with gains, and fed through the synthesis filter (LPC coefficients a_r) to produce the synthesized speech x(n).]
LPC basics – analysis/synthesis
• Analysis: an analysis algorithm extracts the synthesis parameters from the original speech and computes the residual waveform
  e(n) = x(n) − Σ_{r=1}^{N} a_r x(n − r)
• Synthesis: the synthesis filter regenerates the synthesized speech from the residual and the synthesis parameters
[Block diagram: original speech → analysis algorithm → residual waveform + synthesis parameters → synthesis filter → synthesized speech.]
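The analysis/synthesis pair above can be sketched in a few lines of numpy. The predictor order and the autocorrelation method used to estimate the coefficients a_r are illustrative choices, not the specific algorithm of any standard coder:

```python
import numpy as np

def lpc_coeffs(x, N):
    """Estimate LPC coefficients a_1..a_N by the autocorrelation method
    (solve the normal equations R a = r)."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(N + 1)])
    R = np.array([[r[abs(i - j)] for j in range(N)] for i in range(N)])
    return np.linalg.solve(R, r[1:])

def lpc_analysis(x, a):
    """Residual e(n) = x(n) - sum_r a_r x(n - r)."""
    e = x.copy()
    for r_idx in range(1, len(a) + 1):
        e[r_idx:] -= a[r_idx - 1] * x[:-r_idx]
    return e

def lpc_synthesis(e, a):
    """Regenerate x(n) = e(n) + sum_r a_r x(n - r) sample by sample."""
    x = np.zeros_like(e)
    for n in range(len(e)):
        x[n] = e[n] + sum(a[r - 1] * x[n - r]
                          for r in range(1, len(a) + 1) if n >= r)
    return x

rng = np.random.default_rng(0)
# Colored test signal (white noise through a short FIR filter)
x = np.convolve(rng.standard_normal(400), [1.0, 0.9, 0.6], mode="full")[:400]
a = lpc_coeffs(x, 4)
e = lpc_analysis(x, a)        # residual has less energy than the signal
x_hat = lpc_synthesis(e, a)   # exact inverse of the analysis step
```

Note that synthesis is the exact inverse of analysis, so the residual plus the coefficients fully describe the signal; compression comes from the residual being much easier to encode.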
LPC variant - CELP
[Block diagram: the encoder passes each entry of an excitation codebook through the synthesis filter (gain, LPC coefficients) and compares the result against the original speech; the selection index of the best entry is transmitted. The decoder looks up that codebook entry and runs it through the same synthesis filter to produce the synthesized speech.]
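A toy analysis-by-synthesis search in the spirit of the CELP encoder above. The codebook size, filter, and target construction are all made up for illustration; real coders add pitch prediction and perceptual weighting:

```python
import numpy as np

def synth(e, a):
    """All-pole synthesis filter: x(n) = e(n) + sum_r a_r x(n - r)."""
    x = np.zeros_like(e)
    for n in range(len(e)):
        x[n] = e[n] + sum(a[r - 1] * x[n - r]
                          for r in range(1, len(a) + 1) if n >= r)
    return x

rng = np.random.default_rng(1)
a = np.array([0.8, -0.3])                  # LPC coefficients (illustrative)
codebook = rng.standard_normal((16, 40))   # 16 candidate excitation vectors

# Build a target that entry 5 with gain 0.7 can match exactly
target = synth(0.7 * codebook[5], a)

# Analysis-by-synthesis: filter each codeword, fit the optimal gain,
# and keep the (index, gain) pair minimizing the synthesis error
best_err, best_index, best_gain = np.inf, -1, 0.0
for i, c in enumerate(codebook):
    y = synth(c, a)
    g = np.dot(target, y) / np.dot(y, y)   # least-squares gain for this entry
    err = np.sum((target - g * y) ** 2)
    if err < best_err:
        best_err, best_index, best_gain = err, i, g
```

Only the selection index, gain, and LPC coefficients need to be transmitted; the decoder repeats the table lookup and filtering.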
LPC variant - multipulse
[Block diagram: the encoder computes pulse positions and amplitudes for the excitation so that, after the synthesis filter (LPC coefficients), the synthesized speech best matches the original speech.]
G.723.1 architecture
[Encoder block diagram with the following blocks: highpass filter on the audio input, framer, LPC analysis (→ synthesis filter coefficients), perceptual noise shaping (→ weighted residual), pitch estimation (→ encoded pitch values), simulated decoder with zero-input response (memory), past decoded residual, and MP-MLQ/ACELP excitation computation (→ excitation indices).]
Physiology of the ear
• Automatic gain control – muscles around transmission bones
• Directivity – pinna
• Boost of middle frequencies – auditory canal
• Nonlinear processing – auditory nerve
• Filter bank separation – cochlea
– Thousands of “microphones”: hair cells in cochlea
[Anatomy diagram labels: pinna, auditory canal, eardrum, transmission bones, cochlea, auditory nerve, Eustachian tube.]
Filter bank model
[Block diagram: the incoming sound x(n) feeds a bank of linear bandpass filters 1…N; each filter output goes to a (nonlinear) amplitude measurement, and the band amplitudes are grouped into blocks, yielding A1(m1), A2(m2), …, AN(mN).]
• Explains frequency-domain masking
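A crude numpy sketch of the bandpass-plus-amplitude-measurement idea. The FFT-based band split and the band edges are illustrative; they are not a model of the cochlea's actual filters:

```python
import numpy as np

fs = 8000
t = np.arange(2048) / fs
x = np.sin(2 * np.pi * 1000 * t)        # 1 kHz test tone

X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), 1 / fs)

# Split the spectrum into contiguous bands and measure amplitude per band
edges = [0, 500, 1500, 2500, 4000]      # Hz (illustrative band edges)
amps = [np.sqrt(np.mean(np.abs(X[(freqs >= lo) & (freqs < hi)]) ** 2))
        for lo, hi in zip(edges[:-1], edges[1:])]
loudest_band = int(np.argmax(amps))     # the band containing the 1 kHz tone
```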
Frequency-domain masking
[Plots: amplitude (dB, −60 to 60) vs. frequency (10² to 10⁴ Hz). Top: spectrum and masking threshold for tone 1 alone. Bottom: tones 1, 2, and 3, with the combined thresholds for tone 1+3 and tone 1+2+3.]
Absolute threshold of hearing
• Fletcher-Munson curves
• Basis for loudness correction in audio amplifiers
[Plot: threshold in dB (−10 to 80) vs. frequency (10² to 10⁴ Hz).]
Example of masking
• Typical spectrum & masking threshold [plot: amplitude, dB vs. frequency]
• Original sound: [audio clip]
• Sound after removing components below the threshold (1/3 to 1/2 of the data): [audio clip]
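The "remove components below the threshold" step can be mimicked with a flat threshold on FFT magnitudes. A real coder uses a frequency-dependent masking curve; the flat 5%-of-peak threshold here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
fs, Nsamp = 8000, 1024
t = np.arange(Nsamp) / fs
x = np.sin(2 * np.pi * 500 * t) + 0.01 * rng.standard_normal(Nsamp)  # tone + weak noise

X = np.fft.rfft(x)
thresh = 0.05 * np.abs(X).max()             # illustrative flat threshold
X_masked = np.where(np.abs(X) >= thresh, X, 0)

kept = np.count_nonzero(X_masked) / X.size  # fraction of components kept
x_hat = np.fft.irfft(X_masked, n=Nsamp)
rel_err = np.sqrt(np.mean((x - x_hat) ** 2) / np.mean(x ** 2))
```

Most spectral components are dropped, yet the reconstruction error stays small; perceptual coders push this further by shaping the threshold to follow the masking curve.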
Block signal processing
[Block diagram: input signal → extract block x → direct orthogonal transform X = P^T x → processing → X̃ → inverse orthogonal transform x̃ = P X̃ → append block → output signal.]
• Signal is reconstructed as a linear combination of basis functions
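A minimal round trip through the chain above, using an orthonormal DCT-II matrix as the block transform P and the identity as the "processing" step; with no processing, the output signal reproduces the input exactly:

```python
import numpy as np

M = 8                                   # block size
k = np.arange(M)[:, None]
n = np.arange(M)[None, :]
C = np.sqrt(2.0 / M) * np.cos(np.pi * (n + 0.5) * k / M)
C[0] /= np.sqrt(2.0)                    # orthonormal DCT-II matrix (rows = basis)

rng = np.random.default_rng(3)
x = rng.standard_normal(4 * M)          # input signal, a whole number of blocks

blocks = x.reshape(-1, M)               # extract blocks
X = blocks @ C.T                        # direct transform:  X = P^T x per block
x_hat = (X @ C).reshape(-1)             # inverse transform: x~ = P X~, append blocks
```

Orthogonality (C Cᵀ = I) gives both perfect reconstruction and energy preservation between x and X.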
Block processing: good and bad
• Pro: allows adaptability
• Con: blocking artifacts
Why transforms?
• More efficient signal representation
– Frequency domain
– Basis functions ~ “typical” signal components
• Faster processing
– Filtering, compression
• Orthogonality
– Energy preservation
– Robustness to quantization
Compactness of representation
• Maximum energy concentration in as few coefficients as possible
• For stationary random signals, the optimal basis is the Karhunen-Loève transform:
  R_xx p_i = λ_i p_i,  P^T P = I
• Basis functions are the columns of P
• Minimum geometric mean of transform coefficient variances
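As a concrete instance, the KLT basis for a stationary process with exponentially decaying correlation (an AR(1)-like model; ρ = 0.9 is an illustrative value) is obtained by eigendecomposition of the Toeplitz covariance matrix R_xx:

```python
import numpy as np

M, rho = 8, 0.9
idx = np.arange(M)
R = rho ** np.abs(np.subtract.outer(idx, idx))  # Toeplitz covariance R_xx

lam, P = np.linalg.eigh(R)     # solves R_xx p_i = lambda_i p_i; columns of P = basis
coeff_cov = P.T @ R @ P        # covariance of the transform coefficients
# P^T P = I (orthogonal), and coeff_cov is diagonal: the KLT decorrelates the
# process; the coefficient variances are the eigenvalues lambda_i.
```

This also shows the KLT's signal dependency discussed next: change ρ and the basis P changes with it.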
Sub-optimal transforms
• KLT problems:
– Signal dependency
– P not factorable into sparse components
• Sinusoidal transforms:
– Asymptotically optimal for large blocks
– Frequency component interpretation
– Sparse factors - e.g. FFT
Lapped transforms
• Basis functions have tails beyond block boundaries
– Linear combinations of such overlapping functions generate smooth signals, without blocking artifacts
Modulated lapped transforms
• Basis functions = cosines modulating the same low-pass (window) prototype h(n):
  p_k(n) = h(n) √(2/M) cos[(n + (M+1)/2)(k + 1/2) π/M]
• Can be computed from the DCT or FFT
• Projection X = P^T x can be computed in O(log2 M) operations per input point
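The basis above can be checked numerically. Assuming the common sine window h(n) = sin[(n + 1/2)π/(2M)] for the low-pass prototype (the slide does not fix a particular window), the 2M × M basis matrix P is orthonormal, i.e. P^T P = I:

```python
import numpy as np

M = 16
n = np.arange(2 * M)[:, None]   # time index, 0 .. 2M-1 (tails cross block edges)
k = np.arange(M)[None, :]       # frequency index, 0 .. M-1

h = np.sin((n + 0.5) * np.pi / (2 * M))   # sine window (assumed prototype)
P = h * np.sqrt(2.0 / M) * np.cos((n + (M + 1) / 2) * (k + 0.5) * np.pi / M)

gram = P.T @ P                  # should equal the M x M identity
```

Each block thus contributes M orthonormal functions of length 2M; perfect reconstruction additionally relies on the overlapping tails of adjacent blocks canceling.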
Fast MLT computation
[Block diagram: the input signal x(n) is grouped into M-sample blocks; analysis windowing by h_a(n) (with a one-block delay z^−M) feeds a DCT-IV transform, producing the MLT coefficients X(k). On the synthesis side, the processed MLT coefficients Y(k) pass through a DCT-IV transform and synthesis windowing by h_s(n) (again with a one-block delay and overlap) to produce the output signal y(n).]
Basis functions
[Plots comparing DCT basis functions with MLT basis functions.]
Basic architecture
[Encoder block diagram: the original audio signal goes through a time-to-frequency transform (MLT); the resulting spectrum is weighted by the masking threshold, then passed through a uniform quantizer and an entropy encoder. The side info (SI) and the entropy-coded coefficients are combined by a MUX into the encoded bitstream.]
Quantization of transform coefficients
• Quantization = rounding to nearest integer
• Small range of integer values = fewer bits needed to represent data
• Step size T controls the range of integer values:
  y = T · int(x / T)
  (all values of x within one step-size-wide range are mapped to the same output value y)
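The step-size trade-off can be seen directly (here with rounding to the nearest level, the usual midtread variant of the int(·) mapping on the slide):

```python
import numpy as np

def quantize(x, T):
    """Uniform quantizer: map x to the nearest multiple of the step size T."""
    return T * np.round(x / T)

rng = np.random.default_rng(4)
x = rng.standard_normal(1000)

y_fine = quantize(x, 0.1)              # small step: many levels, small error
y_coarse = quantize(x, 1.0)            # large step: few levels, large error

levels_fine = np.unique(y_fine).size
levels_coarse = np.unique(y_coarse).size
err_fine = np.max(np.abs(x - y_fine))  # bounded by T/2
err_coarse = np.max(np.abs(x - y_coarse))
```

Larger T means fewer distinct integer values (fewer bits) at the cost of larger, but still bounded, quantization error.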
Encoding of quantized coefficients
• Typical plot of quantized transform coefficients [plot: quantized value (−2 to +2) vs. frequency, Hz]
• Run-length + entropy coding
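Since most quantized coefficients are zero, a simple run-length stage pays off before entropy coding. A sketch (the (run, value) pair format is one common choice, not necessarily any particular coder's exact syntax):

```python
def rle_encode(coeffs):
    """Encode a coefficient sequence as (zero_run_length, value) pairs."""
    pairs, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    return pairs, run            # trailing zeros kept as a final run count

def rle_decode(pairs, tail_zeros):
    out = []
    for run, value in pairs:
        out.extend([0] * run)
        out.append(value)
    out.extend([0] * tail_zeros)
    return out

coeffs = [0, 0, 3, 0, 0, 0, -1, 2, 0, 0, 0, 0, 1, 0, 0]
pairs, tail = rle_encode(coeffs)
decoded = rle_decode(pairs, tail)
```

The pairs (and the tail count) are then entropy coded, e.g. with the Huffman table on the next slide.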
Basic entropy coding
• Huffman coding: less frequent values have longer codewords
• More efficient if groups of values are assembled in a vector before coding

Value  Codeword
 −7    '1010101010001'
 −6    '10101010101'
 −5    '101010100'
 −4    '10101011'
 −3    '101011'
 −2    '1011'
 −1    '01'
  0    '11'
 +1    '00'
 +2    '100'
 +3    '10100'
 +4    '1010100'
 +5    '1010101011'
 +6    '101010101001'
 +7    '1010101010000'
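The table above is a prefix code, so a decoder can emit a value as soon as the accumulated bits match a codeword. A sketch using that table directly:

```python
# Codebook from the table above (value -> codeword)
CODE = {
    -7: '1010101010001', -6: '10101010101', -5: '101010100', -4: '10101011',
    -3: '101011', -2: '1011', -1: '01', 0: '11', 1: '00', 2: '100',
    3: '10100', 4: '1010100', 5: '1010101011', 6: '101010101001',
    7: '1010101010000',
}

def encode(values):
    return ''.join(CODE[v] for v in values)

def decode(bits):
    """Prefix property: scan left to right, emitting a value as soon as
    the accumulated bits match a codeword."""
    inv = {c: v for v, c in CODE.items()}
    out, cur = [], ''
    for b in bits:
        cur += b
        if cur in inv:
            out.append(inv[cur])
            cur = ''
    return out

values = [0, 0, -1, 2, 0, -3, 1, 0, 7]   # mostly small values, as after quantization
bits = encode(values)
decoded = decode(bits)
```

With a distribution peaked around zero, the frequent values cost only 2 bits each, well under the 4 bits a fixed-length code for 15 symbols would need.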
Side information & more about entropy coding
• Side info: model of frequency spectrum
– e.g. averages over subbands
• Quantized spectral model determines weighting
– masking level used to scale coefficients
• Backward adaptation reduces need for SI
• Run-length + vector Huffman works
– Context-based arithmetic coding can be better
– Room for better context models via machine learning?
Improved architecture
[Encoder block diagram: as in the basic architecture (time-to-frequency transform (MLT), masking-threshold spectrum weighting, uniform quantizer, entropy encoder, and MUX producing the encoded bitstream), with a context modeling block added on the side info (SI) path feeding the entropy encoder.]
Examples of context modeling
• For strongly voiced segments, spectral energies may be well predicted by a “linear prediction” model, similar to those used in VoIP coders.
• For strongly periodic components, spectral energies may be predicted by a pitch model.
• For noisy segments, a noise-only model may allow for very coarse quantization and a lower data rate.
Other aspects & directions
• Stereo coding
– (L+R)/2 & L−R coding, expandable to multichannel
– Intensity + balance coding
– Mode switching – extra work for encoder only
• Lossless coding
– Easily achievable via integer transforms – exactly reversible via integer arithmetic
– Example: lifting-based MLT (see Refs)
• Using complex subband decompositions (MCLT)
– Potential for more sophisticated auditory models
– Efficient encoding is an open problem
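The (L+R)/2 & L−R idea in one short sketch: for correlated channels the mid signal carries most of the energy and the side signal is small (hence cheap to code), and the reconstruction is exact. The synthetic channels below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
left = rng.standard_normal(256)
right = 0.8 * left + 0.1 * rng.standard_normal(256)  # correlated stereo pair

mid = (left + right) / 2     # sum channel: holds most of the energy
side = left - right          # difference channel: small when L and R are similar

left_hat = mid + side / 2    # exact reconstruction
right_hat = mid - side / 2
```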
Audio coding standards
• ISO/IEC: MPEG-1 Layer III (MP3) · MPEG-1 Layer II · MPEG-1 Layer I · AAC · HE-AAC · HE-AAC v2
• ITU-T: G.711 · G.722 · G.722.1 · G.722.2 · G.723 · G.723.1 · G.726 · G.728 · G.729 · G.729.1 · G.729a
• Others: AC3 · AMR · Apple Lossless · ATRAC · FLAC · iLBC · Monkey's Audio · μ-law · Musepack · Nellymoser · OptimFROG · RealAudio · RTAudio · SHN · Speex · Vorbis · WavPack · WMA · TAK
From http://en.wikipedia.org/wiki/Advanced_Audio_Coding
WMA examples
• Original clip (~1,400 kbps) vs. 64 kbps (MP3) vs. 64 kbps (WMA) [audio clips]
• Original clip vs. WMA @ 32 kbps (Internet radio) [audio clips]
• More examples at http://www.microsoft.com/windows/windowsmedia/demos/audio_quality_demos.aspx
References
• S. Shlien, “The modulated lapped transform, its time-varying forms, and its applications to audio coding standards,” IEEE Trans. Speech and Audio Processing, vol. 5, pp. 359–366, July 1997.
• H. S. Malvar, “Fast algorithms for orthogonal and biorthogonal modulated lapped transforms,” IEEE Symposium on Advances in Digital Filtering and Signal Processing, Victoria, Canada, pp. 159–163, June 1998.
• H. S. Malvar, “Enhancing the performance of subband audio coders for speech signals,” IEEE International Symposium on Circuits and Systems, Monterey, CA, vol. 5, pp. 98–101, June 1998.
• H. S. Malvar, “A modulated complex lapped transform and its applications to audio processing,” IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, AZ, pp. 1421–1424, March 1999.
• T. Painter and A. Spanias, “Perceptual coding of digital audio,” Proc. IEEE, vol. 88, pp. 451–513, Apr. 2000. Available at http://www.eas.asu.edu/~spanias/papers.html
• H. S. Malvar, “Auditory masking in audio compression,” chapter in Audio Anecdotes, K. Greenebaum, Ed., A. K. Peters Ltd., 2004.
• J. Li, “Reversible FFT and MDCT via matrix lifting,” IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, pp. IV-173–176, May 2004.
• H. S. Malvar, “Adaptive run-length/Golomb-Rice encoding of quantized generalized Gaussian sources with unknown statistics,” IEEE Data Compression Conference, Snowbird, UT, March 2006.
• http://en.wikipedia.org/wiki/Advanced_Audio_Coding