TRANSCRIPT
Perceptual Audio Coding
Henrique Malvar
Managing Director, Redmond Lab
UW Lecture – December 6, 2007
Contents
• Motivation
• “Source coding”: good for speech
• “Sink coding”: Auditory Masking
• Block & Lapped Transforms
• Audio compression
• Examples
Many applications need digital audio
• Communication
– Digital TV, telephony (VoIP) & teleconferencing
– Voice mail, voice annotations on e-mail, voice recording
• Business
– Internet call centers
– Multimedia presentations
• Entertainment
– 150 songs on standard CD
– Thousands of songs on portable music players
– Internet / satellite radio, HD Radio
– Games, DVD movies
Linear Predictive Coding (LPC)
• Speech model: an excitation e(n) drives a synthesis filter to produce the synthesized speech x(n):
  x(n) = e(n) + Σ_{r=1}^{N} a_r x(n − r)
[Block diagram: a periodic excitation (with pitch period) and a noise excitation are combined, with gains, and fed through the synthesis filter (LPC coefficients a_r) to produce the synthesized speech x(n).]
LPC basics – analysis/synthesis
• Analysis: an analysis algorithm extracts the synthesis parameters from the original speech and computes the residual waveform
  e(n) = x(n) − Σ_{r=1}^{N} a_r x(n − r)
• Synthesis: the synthesis filter regenerates the synthesized speech from the residual and the synthesis parameters
[Block diagram: original speech → analysis algorithm → residual waveform + synthesis parameters → synthesis filter → synthesized speech.]
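The analysis/synthesis pair above can be sketched in a few lines of numpy. The predictor order and the autocorrelation method used to estimate the coefficients a_r are illustrative choices, not the specific algorithm of any standard coder:

```python
import numpy as np

def lpc_coeffs(x, N):
    """Estimate LPC coefficients a_1..a_N by the autocorrelation method
    (solve the normal equations R a = r)."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(N + 1)])
    R = np.array([[r[abs(i - j)] for j in range(N)] for i in range(N)])
    return np.linalg.solve(R, r[1:])

def lpc_analysis(x, a):
    """Residual e(n) = x(n) - sum_r a_r x(n - r)."""
    e = x.copy()
    for r_idx in range(1, len(a) + 1):
        e[r_idx:] -= a[r_idx - 1] * x[:-r_idx]
    return e

def lpc_synthesis(e, a):
    """Regenerate x(n) = e(n) + sum_r a_r x(n - r) sample by sample."""
    x = np.zeros_like(e)
    for n in range(len(e)):
        x[n] = e[n] + sum(a[r - 1] * x[n - r]
                          for r in range(1, len(a) + 1) if n >= r)
    return x

rng = np.random.default_rng(0)
# Colored test signal (white noise through a short FIR filter)
x = np.convolve(rng.standard_normal(400), [1.0, 0.9, 0.6], mode="full")[:400]
a = lpc_coeffs(x, 4)
e = lpc_analysis(x, a)        # residual has less energy than the signal
x_hat = lpc_synthesis(e, a)   # exact inverse of the analysis step
```

Note that synthesis is the exact inverse of analysis, so the residual plus the coefficients fully describe the signal; compression comes from the residual being much easier to encode.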
LPC variant - CELP
[Block diagram: the encoder passes each entry of an excitation codebook through the synthesis filter (gain, LPC coefficients) and compares the result against the original speech; the selection index of the best entry is transmitted. The decoder looks up that codebook entry and runs it through the same synthesis filter to produce the synthesized speech.]
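A toy analysis-by-synthesis search in the spirit of the CELP encoder above. The codebook size, filter, and target construction are all made up for illustration; real coders add pitch prediction and perceptual weighting:

```python
import numpy as np

def synth(e, a):
    """All-pole synthesis filter: x(n) = e(n) + sum_r a_r x(n - r)."""
    x = np.zeros_like(e)
    for n in range(len(e)):
        x[n] = e[n] + sum(a[r - 1] * x[n - r]
                          for r in range(1, len(a) + 1) if n >= r)
    return x

rng = np.random.default_rng(1)
a = np.array([0.8, -0.3])                  # LPC coefficients (illustrative)
codebook = rng.standard_normal((16, 40))   # 16 candidate excitation vectors

# Build a target that entry 5 with gain 0.7 can match exactly
target = synth(0.7 * codebook[5], a)

# Analysis-by-synthesis: filter each codeword, fit the optimal gain,
# and keep the (index, gain) pair minimizing the synthesis error
best_err, best_index, best_gain = np.inf, -1, 0.0
for i, c in enumerate(codebook):
    y = synth(c, a)
    g = np.dot(target, y) / np.dot(y, y)   # least-squares gain for this entry
    err = np.sum((target - g * y) ** 2)
    if err < best_err:
        best_err, best_index, best_gain = err, i, g
```

Only the selection index, gain, and LPC coefficients need to be transmitted; the decoder repeats the table lookup and filtering.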
LPC variant - multipulse
[Block diagram: the encoder computes pulse positions and amplitudes for the excitation so that, after the synthesis filter (LPC coefficients), the synthesized speech best matches the original speech.]
G.723.1 architecture
[Encoder block diagram with the following blocks: highpass filter on the audio input, framer, LPC analysis (→ synthesis filter coefficients), perceptual noise shaping (→ weighted residual), pitch estimation (→ encoded pitch values), simulated decoder with zero-input response (memory), past decoded residual, and MP-MLQ/ACELP excitation computation (→ excitation indices).]
Physiology of the ear
• Automatic gain control – muscles around transmission bones
• Directivity – pinna
• Boost of middle frequencies – auditory canal
• Nonlinear processing – auditory nerve
• Filter bank separation – cochlea
– Thousands of “microphones”: hair cells in cochlea
[Anatomy diagram labels: pinna, auditory canal, eardrum, transmission bones, cochlea, auditory nerve, Eustachian tube.]
Filter bank model
[Block diagram: the incoming sound x(n) feeds a bank of linear bandpass filters 1…N; each filter output goes to a (nonlinear) amplitude measurement, and the band amplitudes are grouped into blocks, yielding A1(m1), A2(m2), …, AN(mN).]
• Explains frequency-domain masking
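A crude numpy sketch of the bandpass-plus-amplitude-measurement idea. The FFT-based band split and the band edges are illustrative; they are not a model of the cochlea's actual filters:

```python
import numpy as np

fs = 8000
t = np.arange(2048) / fs
x = np.sin(2 * np.pi * 1000 * t)        # 1 kHz test tone

X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), 1 / fs)

# Split the spectrum into contiguous bands and measure amplitude per band
edges = [0, 500, 1500, 2500, 4000]      # Hz (illustrative band edges)
amps = [np.sqrt(np.mean(np.abs(X[(freqs >= lo) & (freqs < hi)]) ** 2))
        for lo, hi in zip(edges[:-1], edges[1:])]
loudest_band = int(np.argmax(amps))     # the band containing the 1 kHz tone
```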
Frequency-domain masking
[Plots: amplitude (dB, −60 to 60) vs. frequency (10² to 10⁴ Hz). Top: spectrum and masking threshold for tone 1 alone. Bottom: tones 1, 2, and 3, with the combined thresholds for tone 1+3 and tone 1+2+3.]
Absolute threshold of hearing
• Fletcher-Munson curves
• Basis for loudness correction in audio amplifiers
[Plot: threshold in dB (−10 to 80) vs. frequency (10² to 10⁴ Hz).]
Example of masking
• Typical spectrum & masking threshold [plot: amplitude, dB vs. frequency]
• Original sound: [audio clip]
• Sound after removing components below the threshold (1/3 to 1/2 of the data): [audio clip]
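The "remove components below the threshold" step can be mimicked with a flat threshold on FFT magnitudes. A real coder uses a frequency-dependent masking curve; the flat 5%-of-peak threshold here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
fs, Nsamp = 8000, 1024
t = np.arange(Nsamp) / fs
x = np.sin(2 * np.pi * 500 * t) + 0.01 * rng.standard_normal(Nsamp)  # tone + weak noise

X = np.fft.rfft(x)
thresh = 0.05 * np.abs(X).max()             # illustrative flat threshold
X_masked = np.where(np.abs(X) >= thresh, X, 0)

kept = np.count_nonzero(X_masked) / X.size  # fraction of components kept
x_hat = np.fft.irfft(X_masked, n=Nsamp)
rel_err = np.sqrt(np.mean((x - x_hat) ** 2) / np.mean(x ** 2))
```

Most spectral components are dropped, yet the reconstruction error stays small; perceptual coders push this further by shaping the threshold to follow the masking curve.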
Block signal processing
[Block diagram: input signal → extract block x → direct orthogonal transform X = P^T x → processing → X̃ → inverse orthogonal transform x̃ = P X̃ → append block → output signal.]
• Signal is reconstructed as a linear combination of basis functions
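A minimal round trip through the chain above, using an orthonormal DCT-II matrix as the block transform P and the identity as the "processing" step; with no processing, the output signal reproduces the input exactly:

```python
import numpy as np

M = 8                                   # block size
k = np.arange(M)[:, None]
n = np.arange(M)[None, :]
C = np.sqrt(2.0 / M) * np.cos(np.pi * (n + 0.5) * k / M)
C[0] /= np.sqrt(2.0)                    # orthonormal DCT-II matrix (rows = basis)

rng = np.random.default_rng(3)
x = rng.standard_normal(4 * M)          # input signal, a whole number of blocks

blocks = x.reshape(-1, M)               # extract blocks
X = blocks @ C.T                        # direct transform:  X = P^T x per block
x_hat = (X @ C).reshape(-1)             # inverse transform: x~ = P X~, append blocks
```

Orthogonality (C Cᵀ = I) gives both perfect reconstruction and energy preservation between x and X.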
Block processing: good and bad
• Pro: allows adaptability
• Con: blocking artifacts
Why transforms?
• More efficient signal representation
– Frequency domain
– Basis functions ~ “typical” signal components
• Faster processing
– Filtering, compression
• Orthogonality
– Energy preservation
– Robustness to quantization
Compactness of representation
• Maximum energy concentration in as few coefficients as possible
• For stationary random signals, the optimal basis is the Karhunen-Loève transform:
  R_xx p_i = λ_i p_i,  P^T P = I
• Basis functions are the columns of P
• Minimum geometric mean of transform coefficient variances
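As a concrete instance, the KLT basis for a stationary process with exponentially decaying correlation (an AR(1)-like model; ρ = 0.9 is an illustrative value) is obtained by eigendecomposition of the Toeplitz covariance matrix R_xx:

```python
import numpy as np

M, rho = 8, 0.9
idx = np.arange(M)
R = rho ** np.abs(np.subtract.outer(idx, idx))  # Toeplitz covariance R_xx

lam, P = np.linalg.eigh(R)     # solves R_xx p_i = lambda_i p_i; columns of P = basis
coeff_cov = P.T @ R @ P        # covariance of the transform coefficients
# P^T P = I (orthogonal), and coeff_cov is diagonal: the KLT decorrelates the
# process; the coefficient variances are the eigenvalues lambda_i.
```

This also shows the KLT's signal dependency discussed next: change ρ and the basis P changes with it.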
Sub-optimal transforms
• KLT problems:
– Signal dependency
– P not factorable into sparse components
• Sinusoidal transforms:
– Asymptotically optimal for large blocks
– Frequency component interpretation
– Sparse factors - e.g. FFT
Lapped transforms
• Basis functions have tails beyond block boundaries
– Linear combinations of such overlapping functions generate smooth signals, without blocking artifacts
Modulated lapped transforms
• Basis functions = cosines modulating the same low-pass (window) prototype h(n):
  p_k(n) = h(n) √(2/M) cos[(n + (M+1)/2)(k + 1/2) π/M]
• Can be computed from the DCT or FFT
• Projection X = P^T x can be computed in O(log2 M) operations per input point
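The basis above can be checked numerically. Assuming the common sine window h(n) = sin[(n + 1/2)π/(2M)] for the low-pass prototype (the slide does not fix a particular window), the 2M × M basis matrix P is orthonormal, i.e. P^T P = I:

```python
import numpy as np

M = 16
n = np.arange(2 * M)[:, None]   # time index, 0 .. 2M-1 (tails cross block edges)
k = np.arange(M)[None, :]       # frequency index, 0 .. M-1

h = np.sin((n + 0.5) * np.pi / (2 * M))   # sine window (assumed prototype)
P = h * np.sqrt(2.0 / M) * np.cos((n + (M + 1) / 2) * (k + 0.5) * np.pi / M)

gram = P.T @ P                  # should equal the M x M identity
```

Each block thus contributes M orthonormal functions of length 2M; perfect reconstruction additionally relies on the overlapping tails of adjacent blocks canceling.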
Fast MLT computation
[Block diagram: the input signal x(n) is grouped into M-sample blocks; analysis windowing by h_a(n) (with a one-block delay z^−M) feeds a DCT-IV transform, producing the MLT coefficients X(k). On the synthesis side, the processed MLT coefficients Y(k) pass through a DCT-IV transform and synthesis windowing by h_s(n) (again with a one-block delay and overlap) to produce the output signal y(n).]
Basis functions
[Plots comparing DCT basis functions with MLT basis functions.]
Basic architecture
[Encoder block diagram: the original audio signal goes through a time-to-frequency transform (MLT); the resulting spectrum is weighted by the masking threshold, then passed through a uniform quantizer and an entropy encoder. The side info (SI) and the entropy-coded coefficients are combined by a MUX into the encoded bitstream.]
Quantization of transform coefficients
• Quantization = rounding to nearest integer
• Small range of integer values = fewer bits needed to represent data
• Step size T controls the range of integer values:
  y = T · int(x / T)
  (all values of x within one step-size-wide range are mapped to the same output value y)
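The step-size trade-off can be seen directly (here with rounding to the nearest level, the usual midtread variant of the int(·) mapping on the slide):

```python
import numpy as np

def quantize(x, T):
    """Uniform quantizer: map x to the nearest multiple of the step size T."""
    return T * np.round(x / T)

rng = np.random.default_rng(4)
x = rng.standard_normal(1000)

y_fine = quantize(x, 0.1)              # small step: many levels, small error
y_coarse = quantize(x, 1.0)            # large step: few levels, large error

levels_fine = np.unique(y_fine).size
levels_coarse = np.unique(y_coarse).size
err_fine = np.max(np.abs(x - y_fine))  # bounded by T/2
err_coarse = np.max(np.abs(x - y_coarse))
```

Larger T means fewer distinct integer values (fewer bits) at the cost of larger, but still bounded, quantization error.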
Encoding of quantized coefficients
• Typical plot of quantized transform coefficients [plot: quantized value (−2 to +2) vs. frequency, Hz]
• Run-length + entropy coding
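Since most quantized coefficients are zero, a simple run-length stage pays off before entropy coding. A sketch (the (run, value) pair format is one common choice, not necessarily any particular coder's exact syntax):

```python
def rle_encode(coeffs):
    """Encode a coefficient sequence as (zero_run_length, value) pairs."""
    pairs, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    return pairs, run            # trailing zeros kept as a final run count

def rle_decode(pairs, tail_zeros):
    out = []
    for run, value in pairs:
        out.extend([0] * run)
        out.append(value)
    out.extend([0] * tail_zeros)
    return out

coeffs = [0, 0, 3, 0, 0, 0, -1, 2, 0, 0, 0, 0, 1, 0, 0]
pairs, tail = rle_encode(coeffs)
decoded = rle_decode(pairs, tail)
```

The pairs (and the tail count) are then entropy coded, e.g. with the Huffman table on the next slide.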
Basic entropy coding
• Huffman coding: less frequent values have longer codewords
• More efficient if groups of values are assembled in a vector before coding

Value  Codeword
 −7    '1010101010001'
 −6    '10101010101'
 −5    '101010100'
 −4    '10101011'
 −3    '101011'
 −2    '1011'
 −1    '01'
  0    '11'
 +1    '00'
 +2    '100'
 +3    '10100'
 +4    '1010100'
 +5    '1010101011'
 +6    '101010101001'
 +7    '1010101010000'
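The table above is a prefix code, so a decoder can emit a value as soon as the accumulated bits match a codeword. A sketch using that table directly:

```python
# Codebook from the table above (value -> codeword)
CODE = {
    -7: '1010101010001', -6: '10101010101', -5: '101010100', -4: '10101011',
    -3: '101011', -2: '1011', -1: '01', 0: '11', 1: '00', 2: '100',
    3: '10100', 4: '1010100', 5: '1010101011', 6: '101010101001',
    7: '1010101010000',
}

def encode(values):
    return ''.join(CODE[v] for v in values)

def decode(bits):
    """Prefix property: scan left to right, emitting a value as soon as
    the accumulated bits match a codeword."""
    inv = {c: v for v, c in CODE.items()}
    out, cur = [], ''
    for b in bits:
        cur += b
        if cur in inv:
            out.append(inv[cur])
            cur = ''
    return out

values = [0, 0, -1, 2, 0, -3, 1, 0, 7]   # mostly small values, as after quantization
bits = encode(values)
decoded = decode(bits)
```

With a distribution peaked around zero, the frequent values cost only 2 bits each, well under the 4 bits a fixed-length code for 15 symbols would need.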
Side information & more about entropy coding
• Side info: model of frequency spectrum
– e.g. averages over subbands
• Quantized spectral model determines weighting
– masking level used to scale coefficients
• Backward adaptation reduces need for SI
• Run-length + vector Huffman works
– Context-based arithmetic coding can be better
– Room for better context models via machine learning?
Improved architecture
[Encoder block diagram: as in the basic architecture (time-to-frequency transform (MLT), masking-threshold spectrum weighting, uniform quantizer, entropy encoder, and MUX producing the encoded bitstream), with a context modeling block added on the side info (SI) path feeding the entropy encoder.]
Examples of context modeling
• For strongly voiced segments, spectral energies may be well predicted by a “linear prediction” model, similar to those used in VoIP coders.
• For strongly periodic components, spectral energies may be predicted by a pitch model.
• For noisy segments, a noise-only model may allow for very coarse quantization and a lower data rate.
Other aspects & directions
• Stereo coding
– (L+R)/2 & L−R coding, expandable to multichannel
– Intensity + balance coding
– Mode switching – extra work for encoder only
• Lossless coding
– Easily achievable via integer transforms – exactly reversible via integer arithmetic
– Example: lifting-based MLT (see Refs)
• Using complex subband decompositions (MCLT)
– Potential for more sophisticated auditory models
– Efficient encoding is an open problem
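The (L+R)/2 & L−R idea in one short sketch: for correlated channels the mid signal carries most of the energy and the side signal is small (hence cheap to code), and the reconstruction is exact. The synthetic channels below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
left = rng.standard_normal(256)
right = 0.8 * left + 0.1 * rng.standard_normal(256)  # correlated stereo pair

mid = (left + right) / 2     # sum channel: holds most of the energy
side = left - right          # difference channel: small when L and R are similar

left_hat = mid + side / 2    # exact reconstruction
right_hat = mid - side / 2
```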
Audio coding standards
• ISO/IEC: MPEG-1 Layer III (MP3) · MPEG-1 Layer II · MPEG-1 Layer I · AAC · HE-AAC · HE-AAC v2
• ITU-T: G.711 · G.722 · G.722.1 · G.722.2 · G.723 · G.723.1 · G.726 · G.728 · G.729 · G.729.1 · G.729a
• Others: AC3 · AMR · Apple Lossless · ATRAC · FLAC · iLBC · Monkey's Audio · μ-law · Musepack · Nellymoser · OptimFROG · RealAudio · RTAudio · SHN · Speex · Vorbis · WavPack · WMA · TAK
From http://en.wikipedia.org/wiki/Advanced_Audio_Coding
WMA examples
• Original clip (~1,400 kbps) vs. 64 kbps (MP3) vs. 64 kbps (WMA) [audio clips]
• Original clip vs. WMA @ 32 kbps (Internet radio) [audio clips]
• More examples at http://www.microsoft.com/windows/windowsmedia/demos/audio_quality_demos.aspx
References
• S. Shlien, “The modulated lapped transform, its time-varying forms, and its applications to audio coding standards,” IEEE Trans. Speech and Audio Processing, vol. 5, pp. 359–366, July 1997.
• H. S. Malvar, “Fast algorithms for orthogonal and biorthogonal modulated lapped transforms,” IEEE Symposium on Advances in Digital Filtering and Signal Processing, Victoria, Canada, pp. 159–163, June 1998.
• H. S. Malvar, “Enhancing the performance of subband audio coders for speech signals,” IEEE International Symposium on Circuits and Systems, Monterey, CA, vol. 5, pp. 98–101, June 1998.
• H. S. Malvar, “A modulated complex lapped transform and its applications to audio processing,” IEEE International Conference on Acoustics, Speech, and Signal Processing, Phoenix, AZ, pp. 1421–1424, March 1999.
• T. Painter and A. Spanias, “Perceptual coding of digital audio,” Proc. IEEE, vol. 88, pp. 451–513, Apr. 2000. Available at http://www.eas.asu.edu/~spanias/papers.html
• H. S. Malvar, “Auditory masking in audio compression,” chapter in Audio Anecdotes, K. Greenebaum, Ed., A. K. Peters Ltd., 2004.
• J. Li, “Reversible FFT and MDCT via matrix lifting,” IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, Canada, pp. IV-173–176, May 2004.
• H. S. Malvar, “Adaptive run-length/Golomb-Rice encoding of quantized generalized Gaussian sources with unknown statistics,” IEEE Data Compression Conference, Snowbird, UT, March 2006.
• http://en.wikipedia.org/wiki/Advanced_Audio_Coding